This skill audits your LLM evaluation pipeline and reports what is broken. It walks through six diagnostic areas: whether you did error analysis on actual traces, whether your judges give binary verdicts or use noisy Likert scales, whether judges are validated against human labels using true positive rate (TPR) and true negative rate (TNR), whether similarity metrics like ROUGE serve as your primary evals, how your human review process works, and whether you have enough labeled data. The output is a prioritized findings report with concrete next steps linked to other skills. It is most useful when you inherit an eval system you don't trust, or when your evals are running but you suspect they miss real failures. The checks are opinionated but grounded in what actually breaks in production.
npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit
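
To make the judge-validation check above concrete, here is a minimal sketch of computing TPR and TNR for a binary judge against human labels. It is not taken from the skill itself: the function name `judge_agreement`, the convention that `True` means "pass", and the sample labels are all assumptions made for illustration.

```python
from typing import Sequence


def judge_agreement(human: Sequence[bool], judge: Sequence[bool]) -> dict[str, float]:
    """Compare binary judge verdicts to human labels (hypothetical helper).

    Assumed convention: True means "pass", False means "fail".
    TPR = share of human-labeled passes the judge also passed.
    TNR = share of human-labeled fails the judge also failed.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # both say pass
    fn = sum(1 for h, j in pairs if h and not j)      # judge failed a real pass
    tn = sum(1 for h, j in pairs if not h and not j)  # both say fail
    fp = sum(1 for h, j in pairs if not h and j)      # judge passed a real fail

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one human 'pass' and one human 'fail' label")

    return {"tpr": tp / (tp + fn), "tnr": tn / (tn + fp)}


# Made-up labels for five traces, purely illustrative.
human_labels = [True, True, True, False, False]
judge_labels = [True, True, False, False, True]
result = judge_agreement(human_labels, judge_labels)
print(result)
```

Reporting the two rates separately, rather than a single accuracy number, shows whether the judge tends to miss real failures (low TNR) or to reject good outputs (low TPR).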