This is for when you need to actually test agent reliability before production, not just run an agent through benchmarks once and hope. It covers statistical test evaluation (running each test many times and analyzing the pass-rate distribution rather than a single result), behavioral contract testing for invariants that must hold on every run, and adversarial testing that actively tries to break the agent. The sharp edges table is the most useful part: it calls out real problems like agents that score well on benchmarks but fail in production, flaky tests that only pass sometimes, and accidental data leakage. The core insight is right: evaluating LLM agents isn't like testing traditional software, because the same input can produce different outputs on every run. You'll want this if you've ever watched an agent that passed all your tests fall apart completely with real users. A minimal sketch of the statistical approach follows the install command.
npx skills add https://github.com/davila7/claude-code-templates --skill agent-evaluation
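To make the statistical idea concrete, here's a minimal sketch, assuming a hypothetical `run_agent` callable; the names (`evaluate_statistically`, `check`, `no_api_keys`) are illustrative, not the skill's actual API. It runs the same prompt N times, enforces behavioral invariants on every run, and asserts a pass-rate threshold instead of a single pass/fail bit:

```python
from typing import Callable

def evaluate_statistically(
    run_agent: Callable[[str], str],          # hypothetical: one agent invocation
    prompt: str,
    check: Callable[[str], bool],             # task-specific pass/fail predicate
    invariants: list[Callable[[str], bool]],  # behavioral contracts: must hold on every run
    n_trials: int = 20,
    min_pass_rate: float = 0.9,
) -> float:
    passes = 0
    for _ in range(n_trials):
        output = run_agent(prompt)
        # Contract violations are hard failures: one bad run fails the whole suite.
        for inv in invariants:
            assert inv(output), f"invariant {inv.__name__} violated on: {output!r}"
        passes += check(output)
    pass_rate = passes / n_trials
    # Judge the distribution, not a single lucky (or unlucky) run.
    assert pass_rate >= min_pass_rate, f"pass rate {pass_rate:.0%} < {min_pass_rate:.0%}"
    return pass_rate

# Example invariant: the agent must never echo anything that looks like an API key.
def no_api_keys(output: str) -> bool:
    return "sk-" not in output
```

Running 20 trials instead of one is exactly what surfaces the "passes sometimes" flakiness the sharp edges table warns about: a threshold over the distribution turns intermittent failures into a visible, reproducible number.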