If you're building LLM-as-judge systems, this gives you the production patterns that actually matter: position bias mitigation through double-pass evaluation, pairwise comparison versus direct scoring trade-offs, and rubric design that reduces variance. The taxonomy is clean (direct scoring for objective criteria, pairwise for subjective preference), and the bias landscape section names the gotchas you'll hit (length bias, self-enhancement, verbosity rewards). What I appreciate is the metric selection framework that matches statistical measures to task structure, plus the honest guidance on confidence calibration. It assumes you're past "can I use an LLM to evaluate outputs" and into "how do I make this reliable enough to ship."
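To make the double-pass idea concrete, here is a minimal sketch of position-bias mitigation for pairwise comparison. It is not taken from the skill itself: the `Judge` signature, `double_pass_compare`, and the two toy judges are hypothetical names for illustration, and a real judge would wrap a model call. It assumes that a verdict which flips when the order is swapped should be treated as a tie.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]
# Hypothetical judge signature: (prompt, first_response, second_response) -> verdict,
# where "A" means the response shown first won and "B" the response shown second.
Judge = Callable[[str, str, str], Verdict]


def double_pass_compare(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> Verdict:
    """Judge twice with the order swapped; keep the verdict only if both passes agree."""
    first = judge(prompt, resp_a, resp_b)
    swapped = judge(prompt, resp_b, resp_a)
    # Map the swapped-order verdict back onto the original A/B labels.
    second = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    # Disagreement between passes suggests position bias, so fall back to a tie.
    return first if first == second else "tie"


def first_slot_judge(prompt: str, first: str, second: str) -> Verdict:
    """Toy judge with pure position bias: always prefers whatever is shown first."""
    return "A"


def longer_judge(prompt: str, first: str, second: str) -> Verdict:
    """Toy judge that is order-independent: prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


if __name__ == "__main__":
    print(double_pass_compare(first_slot_judge, "q", "x", "y"))
    print(double_pass_compare(longer_judge, "q", "long answer", "short"))
```

The position-biased toy judge collapses to a tie because its two passes contradict each other, while the order-independent one keeps its verdict. Note that the second toy judge is itself length-biased, which the swap does not catch; double-pass evaluation addresses position bias only.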
npx -y skills add muratcankoylan/agent-skills-for-context-engineering --skill advanced-evaluation --agent claude-code
Installs into .claude/skills of the current project.
metabase/metabase
github/awesome-copilot
UKGovernmentBEIS/inspect_evals
addyosmani/agent-skills