Gives you the patterns for using LLMs to evaluate other LLM outputs, which is trickier than it sounds. Covers direct scoring versus pairwise comparison (pairwise is more reliable for subjective judgments), plus the biases that'll mess you up: position bias, length bias, and self-enhancement bias when models grade their own outputs. The decision framework is simple: if there's ground truth, use direct scoring; if it's subjective, use pairwise comparison with position swapping. Most useful when you're building eval pipelines or trying to figure out why your automated scoring keeps giving weird results. The bias mitigation table alone will save you debugging time.
npx skills add https://github.com/shipshitdev/library --skill advanced-evaluation
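
To make "pairwise with position swapping" concrete, here is a minimal Python sketch of the idea, not the skill's actual implementation. The `Judge` signature, `pairwise_with_swap`, and the two toy judges are hypothetical names made up for illustration; in a real pipeline the judge would wrap an LLM call. The judge is asked twice with the two responses in opposite order, and a winner only counts if both runs agree.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

# Hypothetical judge signature (an assumption, not the skill's API):
# (prompt, first_response, second_response) -> "A" if the first-position
# response wins, "B" if the second-position response wins, or "tie".
Judge = Callable[[str, str, str], Verdict]


def pairwise_with_swap(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> Verdict:
    """Compare resp_a and resp_b twice with positions swapped.

    Returns "A" or "B" only when both orderings agree on the same
    response; any disagreement is reported as "tie", since it suggests
    the verdict depended on position rather than content.
    """
    first = judge(prompt, resp_a, resp_b)    # resp_a shown first
    swapped = judge(prompt, resp_b, resp_a)  # resp_b shown first

    # Translate the swapped verdict back into the original A/B labels.
    relabeled: Verdict = {"A": "B", "B": "A", "tie": "tie"}[swapped]

    return first if first == relabeled else "tie"


# Toy judges for demonstration only; a real judge would call an LLM.
def always_first(prompt: str, first: str, second: str) -> Verdict:
    """A maximally position-biased judge: always prefers position one."""
    return "A"


def longer_wins(prompt: str, first: str, second: str) -> Verdict:
    """A length-biased but position-independent judge."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


if __name__ == "__main__":
    q = "Explain recursion."
    print(pairwise_with_swap(always_first, q, "short", "a longer answer"))
    print(pairwise_with_swap(longer_wins, q, "short", "a longer answer"))
```

The position-biased judge gets collapsed to a tie because its two verdicts contradict each other, while the length-biased judge passes straight through. Swapping catches position bias only, so length bias still needs its own mitigation.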