This is a meta-skill for evaluating your Claude Code agents and commands. Use it when you need to test whether a prompt change actually works or to measure whether your context engineering is effective. The framework covers LLM-as-judge evaluation, multi-dimensional rubrics that weigh instruction following against tool efficiency, and test set design across complexity levels. One detail worth noting: research shows token usage explains 80% of agent performance variance, so evaluations should run under realistic token budgets rather than unlimited ones. The rubric templates and evaluation prompts are ready to adapt, which beats starting from scratch when you're trying to catch regressions or validate improvements.
npx skills add https://github.com/neolabhq/context-engineering-kit --skill agent-evaluation
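
To make the multi-dimensional rubric idea concrete, here is a minimal sketch of how per-dimension judge scores might be combined into one weighted number and compared against a baseline. It is an illustration under stated assumptions, not the skill's actual templates: the dimension names, weights, 1-5 scale, `output_quality` dimension, and regression tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float  # relative importance; normalized when scoring


# Hypothetical rubric. The first two dimensions echo the description above;
# "output_quality" and all weights are assumptions made for illustration.
RUBRIC = [
    Dimension("instruction_following", 0.5),
    Dimension("tool_efficiency", 0.3),
    Dimension("output_quality", 0.2),
]


def weighted_score(scores: dict, rubric: list = RUBRIC) -> float:
    """Combine per-dimension judge scores (assumed 1-5) into one weighted score."""
    total_weight = sum(d.weight for d in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    for d in rubric:
        if d.name not in scores:
            raise KeyError(f"missing score for dimension: {d.name}")
        if not 1 <= scores[d.name] <= 5:
            raise ValueError(f"score for {d.name} must be between 1 and 5")
    return sum(d.weight * scores[d.name] for d in rubric) / total_weight


def is_regression(baseline: float, candidate: float, tolerance: float = 0.25) -> bool:
    """Flag a candidate whose score falls below the baseline by more than tolerance."""
    return candidate < baseline - tolerance


if __name__ == "__main__":
    # Scores as an LLM judge might return them for a single test case.
    judged = {"instruction_following": 4, "tool_efficiency": 2, "output_quality": 5}
    score = weighted_score(judged)
    print(f"weighted score: {score:.2f}")
    print("regression vs. baseline 4.0:", is_regression(4.0, score))
```

Keeping the weights in data rather than in the judge prompt lets you rebalance the rubric without re-running the judge, and the tolerance stops small score noise from being reported as a regression.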