This skill walks you through building LLM-as-judge evaluators for qualities code can't check: tone, faithfulness, relevance. The core rules are binary pass/fail verdicts only, one failure mode per judge, and a detailed critique written before the verdict. You need 20+ labeled examples per outcome, and the guide is firm about exhausting regex and keyword checks before reaching for semantic evaluation. The structured approach (task definition, pass/fail criteria, few-shot examples, forced JSON output) is practical, and the anti-patterns section saves time by calling out common mistakes such as using Likert scales or skipping validation. It assumes you've already done error analysis and have labeled data ready.
npx skills add https://github.com/hamelsmu/evals-skills --skill write-judge-prompt
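
To make the structure described above concrete, here is a minimal Python sketch of a single-failure-mode judge: it assembles a prompt from a task definition, pass/fail criteria, and few-shot examples, then validates the forced JSON reply. This is not taken from the skill itself. The names `Example`, `build_judge_prompt`, and `parse_verdict`, the `critique`/`result` JSON keys, and the prompt wording are all assumptions made for illustration, and the call to the model is deliberately left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Example:
    """One human-labeled example used as a few-shot demonstration (hypothetical shape)."""
    output: str
    critique: str
    result: str  # "pass" or "fail"


def build_judge_prompt(task: str, pass_rule: str, fail_rule: str,
                       examples: list[Example], candidate: str) -> str:
    """Assemble a judge prompt that targets exactly one failure mode."""
    # Each demonstration shows the critique first, then the binary result.
    shots = "\n\n".join(
        "Output: " + ex.output + "\n"
        + json.dumps({"critique": ex.critique, "result": ex.result})
        for ex in examples
    )
    return (
        f"Task: {task}\n"
        f"PASS if: {pass_rule}\n"
        f"FAIL if: {fail_rule}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Output to evaluate: {candidate}\n"
        'Respond with JSON only: {"critique": "<reasoning>", '
        '"result": "pass" | "fail"}. Write the critique before the result.'
    )


def parse_verdict(raw: str) -> tuple[bool, str]:
    """Validate the judge's JSON reply and return (passed, critique)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    critique = data.get("critique")
    result = data.get("result")
    if not isinstance(critique, str) or not critique.strip():
        raise ValueError("judge reply must include a non-empty critique")
    # Reject anything that is not strictly binary, e.g. a 1-5 score.
    if result not in ("pass", "fail"):
        raise ValueError(f"expected 'pass' or 'fail', got {result!r}")
    return result == "pass", critique
```

Rejecting any `result` other than `"pass"` or `"fail"` is one way to keep a Likert-style score from slipping through. Before trusting a judge like this, you would still compare its verdicts against your human labels.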