This skill covers the full lifecycle of testing agent systems: deterministic checks before you burn tokens on LLM judges, multi-dimensional rubrics that expose specific failure modes instead of hiding them behind a single score, and regression suites that account for non-deterministic paths. Its guidance drawn from browsing-agent research is practical: token budget and tool call count explain most performance variance, so evaluate with production-realistic limits rather than unlimited resources. It is opinionated about evaluating outcomes over execution paths, which makes sense given that agents take different valid routes to the same goal. Use it when building quality gates or measuring whether context engineering changes actually work. Switch to advanced-evaluation if you're designing the judge itself or dealing with pairwise comparison calibration.
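As a rough illustration of the ordering described above, the sketch below runs cheap deterministic checks (including production-style token and tool-call limits) first, and only then scores the outcome per rubric dimension. It is not taken from the skill itself: `AgentRun`, `EvalResult`, `evaluate`, the dimension names, the 0.7 threshold, and the default limits are all hypothetical, and `judge` stands in for whatever LLM-judge call you actually use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentRun:
    """Hypothetical record of one agent run; field names are illustrative."""
    output: str
    tokens_used: int
    tool_calls: int


@dataclass
class EvalResult:
    passed: bool
    failures: List[str] = field(default_factory=list)
    scores: Dict[str, float] = field(default_factory=dict)


def deterministic_checks(run: AgentRun, max_tokens: int, max_tool_calls: int) -> List[str]:
    """Cheap checks that need no LLM call; returns the names of failed checks."""
    failures = []
    if not run.output.strip():
        failures.append("empty_output")
    if run.tokens_used > max_tokens:
        failures.append("token_budget_exceeded")
    if run.tool_calls > max_tool_calls:
        failures.append("tool_call_limit_exceeded")
    return failures


def evaluate(
    run: AgentRun,
    judge: Callable[[str, str], float],  # assumed signature: (output, dimension) -> score in [0, 1]
    dimensions: List[str],
    threshold: float = 0.7,       # assumed pass mark per dimension
    max_tokens: int = 50_000,     # assumed production-realistic limits
    max_tool_calls: int = 20,
) -> EvalResult:
    # Stage 1: deterministic gate. If it fails, the judge is never called.
    failures = deterministic_checks(run, max_tokens, max_tool_calls)
    if failures:
        return EvalResult(passed=False, failures=failures)

    # Stage 2: score the outcome (not the execution path) on each dimension
    # separately, so a weak dimension is reported instead of averaged away.
    scores = {dim: judge(run.output, dim) for dim in dimensions}
    low = [f"low_{dim}" for dim, score in scores.items() if score < threshold]
    return EvalResult(passed=not low, failures=low, scores=scores)
```

Keeping per-dimension scores in the result, rather than collapsing them into one number, is what lets a quality gate report which failure mode tripped it.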
npx -y skills add muratcankoylan/agent-skills-for-context-engineering --skill evaluation --agent claude-code
Installs into .claude/skills of the current project.
sickn33/antigravity-awesome-skills
kubesphere/kubesphere
supercent-io/skills-template