This is for when you need to actually test agent reliability before production, not just run an agent through benchmarks once and hope. It covers statistical test evaluation (running each test many times and analyzing the pass-rate distribution rather than a single result), behavioral contract testing for invariants that must hold on every run, and adversarial testing that actively tries to break the agent. The sharp edges table is the most useful part: it calls out real problems like agents that score well on benchmarks but fail in production, flaky tests that only pass sometimes, and accidental data leakage. The core insight is right: evaluating LLM agents isn't like testing traditional software, because the same input can produce different outputs on every run. You'll want this if you've ever watched an agent that passed all your tests fall apart completely with real users. A minimal sketch of the statistical approach follows the install command.
npx skills add https://github.com/davila7/claude-code-templates --skill agent-evaluation
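To make the statistical idea concrete, here's a minimal sketch, assuming a hypothetical `run_agent` callable; the names (`evaluate_statistically`, `check`, `no_api_keys`) are illustrative, not the skill's actual API. It runs the same prompt N times, enforces behavioral invariants on every run, and asserts a pass-rate threshold instead of a single pass/fail bit:

```python
from typing import Callable

def evaluate_statistically(
    run_agent: Callable[[str], str],          # hypothetical: one agent invocation
    prompt: str,
    check: Callable[[str], bool],             # task-specific pass/fail predicate
    invariants: list[Callable[[str], bool]],  # behavioral contracts: must hold on every run
    n_trials: int = 20,
    min_pass_rate: float = 0.9,
) -> float:
    passes = 0
    for _ in range(n_trials):
        output = run_agent(prompt)
        # Contract violations are hard failures: one bad run fails the whole suite.
        for inv in invariants:
            assert inv(output), f"invariant {inv.__name__} violated on: {output!r}"
        passes += check(output)
    pass_rate = passes / n_trials
    # Judge the distribution, not a single lucky (or unlucky) run.
    assert pass_rate >= min_pass_rate, f"pass rate {pass_rate:.0%} < {min_pass_rate:.0%}"
    return pass_rate

# Example invariant: the agent must never echo anything that looks like an API key.
def no_api_keys(output: str) -> bool:
    return "sk-" not in output
```

Running 20 trials instead of one is exactly what surfaces the "passes sometimes" flakiness the sharp edges table warns about: a threshold over the distribution turns intermittent failures into a visible, reproducible number.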