This skill walks you through calibrating an LLM judge against human labels, using proper train/dev/test splits and two metrics: true positive rate (TPR) and true negative rate (TNR). You'd use it after writing a judge prompt, when you need to verify that the judge actually agrees with human judgment before trusting it in production. The workflow is methodical: split your labeled data, iterate on the dev set until both TPR and TNR reach 90%, then measure once on the held-out test set. It includes the Rogan-Gladen bias correction formula for estimating true success rates from biased judge scores, plus bootstrap confidence intervals. The anti-pattern section is worth reading, since most people skip validation entirely and just assume their judge works.
npx skills add https://github.com/hamelsmu/evals-skills --skill validate-evaluator
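
To make the correction step concrete, here is a minimal Python sketch of the standard Rogan-Gladen estimator, (p_obs + TNR - 1) / (TPR + TNR - 1), with a percentile bootstrap around it. It is not taken from the skill itself: the function names (`tpr_tnr`, `rogan_gladen`, `bootstrap_ci`), the boolean label encoding (True = pass), and the resampling scheme are all assumptions for illustration, and the skill's own procedure may differ.

```python
import random


def tpr_tnr(human, judge):
    """Return (TPR, TNR) of judge labels against human labels (True = pass)."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one human pass and one human fail")
    return tp / pos, tn / neg


def rogan_gladen(p_obs, tpr, tnr):
    """Correct an observed judge pass rate for the judge's known error rates."""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    # Clip to [0, 1] because sampling noise can push the raw estimate outside it.
    return min(1.0, max(0.0, (p_obs + tnr - 1) / denom))


def bootstrap_ci(human, judge, unlabeled_judge, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the corrected pass rate.

    Assumption: both the labeled test set and the unlabeled judge verdicts
    are resampled, so the interval reflects uncertainty in TPR/TNR as well
    as in the observed pass rate.
    """
    rng = random.Random(seed)
    pairs = list(zip(human, judge))
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        h, j = zip(*sample)
        verdicts = [rng.choice(unlabeled_judge) for _ in unlabeled_judge]
        p_obs = sum(verdicts) / len(verdicts)
        try:
            tpr, tnr = tpr_tnr(h, j)
            estimates.append(rogan_gladen(p_obs, tpr, tnr))
        except ValueError:
            continue  # skip degenerate resamples
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lo, hi
```

In use, you would call `tpr_tnr` once on the held-out test set, feed the result and the judge's pass rate on unlabeled production traces into `rogan_gladen`, and report the `bootstrap_ci` interval alongside the point estimate. The error raised when TPR + TNR is at or below 1 is deliberate: at that point the judge carries no usable signal and the correction is undefined.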