This is a framework for running evals on Claude Code sessions before, during, and after development. You define pass/fail criteria upfront (capability evals for new features, regression evals to catch breakage), then measure reliability with pass@k metrics like "success within 3 attempts." It supports code-based graders for deterministic checks, model graders for open-ended evaluation, and human review flags for security-critical changes. The workflow is straightforward: define evals in `.claude/evals/`, run checks during implementation, generate reports. If you're shipping anything non-trivial with AI assistance and want reproducible quality gates instead of vibes-based shipping, this gives you the scaffolding. The pass@3 threshold of 90% is reasonable for most features.
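As a rough illustration of the pass@k gate described above, the sketch below counts an eval as passed if any of its first k attempts succeeded, then compares the overall rate to the 90% threshold. The function name, the eval names, and the in-memory results dictionary are hypothetical. They are not the skill's actual API or report format.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of evals with at least one success in their first k attempts."""
    if not attempts:
        raise ValueError("no evals recorded")
    passed = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return passed / len(attempts)


# Hypothetical attempt logs: one list of pass/fail outcomes per eval.
results = {
    "login-flow": [False, True, True],
    "export-csv": [True, True, True],
    "rate-limit": [False, False, False],
    "search-filter": [False, False, True],
}

rate = pass_at_k(results, k=3)

# Assumed gate: the 90% pass@3 threshold mentioned in the description.
meets_gate = rate >= 0.90
print(f"pass@3 = {rate:.2f}, gate met: {meets_gate}")
```

With these made-up outcomes, three of the four evals succeed within three attempts, so the gate is not met. In practice, the outcomes would come from a code-based or model grader rather than hard-coded lists.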
npx -y skills add affaan-m/everything-claude-code --skill eval-harness --agent claude-code
Installs into `.claude/skills` of the current project.