This is a benchmarking harness for comparing coding agents on your actual codebase instead of relying on vibes. You define tasks as YAML files with prompts and pass/fail criteria (pytest, grep patterns, or LLM judges), then run multiple agents in isolated git worktrees and collect pass rates, cost, and timing. The real value is reproducibility: pinned commits, multiple trials to capture variance, and structured task definitions you can keep under version control. Best for teams evaluating which agent to adopt or checking for regressions after model updates. The metrics table is straightforward to read. Run each task at least three times, since agents are non-deterministic.
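To make the metrics concrete, here is a minimal sketch of how per-trial results could be rolled up into a pass rate, mean cost, and mean time per agent. The `Trial` record, its field names, the `summarize` helper, and the sample numbers are all hypothetical and chosen for illustration. They are not the skill's actual schema or output format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One run of one agent on one task (hypothetical record shape)."""
    agent: str
    passed: bool
    cost_usd: float
    seconds: float


def summarize(trials):
    """Group trials by agent and compute pass rate, mean cost, and mean time."""
    by_agent = {}
    for trial in trials:
        by_agent.setdefault(trial.agent, []).append(trial)

    summary = {}
    for agent, runs in by_agent.items():
        summary[agent] = {
            "trials": len(runs),
            # Fraction of runs that met the pass/fail criteria.
            "pass_rate": sum(1 for r in runs if r.passed) / len(runs),
            "mean_cost_usd": mean([r.cost_usd for r in runs]),
            "mean_seconds": mean([r.seconds for r in runs]),
        }
    return summary


# Made-up sample data: three trials per agent on the same task.
trials = [
    Trial("agent-a", True, 0.50, 60.0),
    Trial("agent-a", True, 0.50, 80.0),
    Trial("agent-a", False, 0.50, 100.0),
    Trial("agent-b", True, 1.00, 30.0),
    Trial("agent-b", True, 1.00, 30.0),
    Trial("agent-b", True, 1.00, 30.0),
]

summary = summarize(trials)
for agent, stats in summary.items():
    print(agent, stats)
```

With only three trials, a single failure moves the pass rate by a third, which is why three is a floor rather than a target.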
npx -y skills add affaan-m/everything-claude-code --skill agent-eval --agent claude-code
Installs into .claude/skills of the current project.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot