A structured approach to evaluating AI agents that covers the full lifecycle from initial design to production monitoring. You get an 8-step roadmap that starts with task creation and moves through environment isolation, grading strategies, and saturation detection. It breaks down three grader types (code-based, model-based, human) with practical trade-offs for coding, conversational, research, and computer-use agents. The skill references real benchmarks such as SWE-bench and WebArena, and it includes CI/CD patterns and A/B testing templates. Most useful when you're moving past ad-hoc testing and need a systematic way to measure whether your agent is actually getting better or just different.
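To make the code-based grader and saturation ideas concrete, here is a minimal Python sketch. It is not taken from the skill itself: the names `Task`, `run_eval`, `is_saturated`, and `toy_agent`, as well as the 0.95 threshold and 3-run window, are assumptions chosen for illustration. It treats an eval as saturated when recent pass rates are all near the ceiling, at which point the eval can no longer tell a better agent from a merely different one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """A hypothetical eval task: a prompt plus a code-based check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_eval(agent: Callable[[str], str], tasks: List[Task]) -> float:
    """Run every task through the agent and return the pass rate in [0, 1]."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for t in tasks if t.check(agent(t.prompt)))
    return passed / len(tasks)


def is_saturated(pass_rates: List[float], threshold: float = 0.95, window: int = 3) -> bool:
    """Flag saturation when the last `window` runs all meet `threshold`.

    Both defaults are illustrative assumptions, not values from the skill.
    """
    recent = pass_rates[-window:]
    return len(recent) == window and all(r >= threshold for r in recent)


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent: it only knows one answer."""
    return "4" if prompt == "2 + 2 = ?" else "unknown"


tasks = [
    Task("addition", "2 + 2 = ?", lambda out: out.strip() == "4"),
    Task("capital", "Capital of France?", lambda out: "paris" in out.lower()),
]

rate = run_eval(toy_agent, tasks)
print(f"pass rate: {rate:.2f}")
print("saturated:", is_saturated([0.96, 0.97, 0.99]))
```

A code-based grader like `check` is cheap and deterministic, which suits coding-style tasks with verifiable outputs; open-ended conversational or research outputs would more plausibly need a model-based or human grader instead.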
npx -y skills add supercent-io/skills-template --skill agent-evaluation --agent claude-code
Installs into .claude/skills of the current project.
sickn33/antigravity-awesome-skills
kubesphere/kubesphere
supercent-io/skills-template