You know those moments when you deploy an LLM change and wonder whether you just made things better or worse? This skill helps you actually measure that with real numbers. It sets up automated scoring with metrics like BLEU, ROUGE, and BERTScore, plus LLM-as-judge patterns where Claude evaluates outputs for quality, accuracy, and safety. You get comparison frameworks for A/B testing different prompts, groundedness checks against source material, and toxicity detection. It's really useful when you're iterating on prompts, comparing models, or trying to catch regressions before they hit production. Beats guessing based on a few cherry-picked examples.
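To give a feel for the automated-scoring side, here's a minimal sketch using the Hugging Face `evaluate` library for ROUGE and BERTScore. The skill may wire up different tooling under the hood, and the prediction/reference strings are just placeholders:

```python
# Sketch: automated scoring with ROUGE and BERTScore (not the skill's exact code).
# Assumes `pip install evaluate rouge_score bert_score` has been run.
import evaluate

# Hypothetical model output plus a reference answer.
predictions = ["The cache is invalidated after every write to the user table."]
references = ["Writes to the user table invalidate the cache."]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print(f"ROUGE-L: {rouge_scores['rougeL']:.3f}")
print(f"BERTScore F1: {bert_scores['f1'][0]:.3f}")
```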
npx skills add https://github.com/wshobson/agents --skill llm-evaluation
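And a rough idea of the LLM-as-judge pattern, here using the Anthropic Python SDK to have Claude grade an answer for groundedness against source material. The model name, rubric, and 1-5 scale are illustrative assumptions, not what the skill ships:

```python
# Sketch: LLM-as-judge with Claude (illustrative rubric and model choice).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(question: str, answer: str, source: str) -> str:
    """Ask Claude to rate an answer's groundedness against the source material."""
    prompt = (
        "You are grading an LLM answer.\n"
        f"Question: {question}\n"
        f"Source material: {source}\n"
        f"Answer: {answer}\n"
        "Rate groundedness from 1 to 5 (5 = fully supported by the source), "
        "then give a one-sentence justification."
    )
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: use whichever model you evaluate with
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

print(judge(
    "When is the cache invalidated?",
    "After every write to the user table.",
    "The cache is cleared whenever the user table is written.",
))
```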