This skill takes a codebase with active skills and measures whether they're actually working. It picks a file with legacy anti-patterns (hardcoded secrets, logic in UI components, missing design tokens), refactors it using your active skills, then scores the file before and after against the eval patterns in evals.json. The output is a compliance delta showing how much the skills improved the file, plus a breakdown of the issues the skills failed to catch.

Honest take: it's meta tooling for tuning your skill library. If you're running multiple Claude Code skills and want data on which ones are pulling their weight and which are creating noise, this gives you that signal. The iteration table helps you decide whether to refine triggers, tighten rules, or exclude skills that don't fit your stack.
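To make the before/after scoring concrete, here is a minimal sketch of how a compliance delta could be computed. It is not the skill's actual implementation: the schema of evals.json is not shown here, so the `EVALS` list, its `id` and `pattern` fields, and the helper names `violations` and `compliance` are all assumptions made for illustration.

```python
import re

# Hypothetical eval patterns. The real evals.json schema is not documented
# in this listing, so this structure is an assumption.
EVALS = [
    {"id": "no-hardcoded-secret", "pattern": r"(api_key|password)\s*=\s*['\"]\w+['\"]"},
    {"id": "no-raw-hex-color", "pattern": r"#[0-9a-fA-F]{6}\b"},
]

def violations(source, evals):
    """Return the ids of eval patterns that match, i.e. anti-patterns found."""
    return [e["id"] for e in evals if re.search(e["pattern"], source)]

def compliance(source, evals):
    """Fraction of eval patterns the source passes (no match means pass)."""
    return 1 - len(violations(source, evals)) / len(evals)

# Toy file contents before and after a skill-driven refactor.
before = 'api_key = "abc123"\ncolor = "#ff0000"\n'
after = 'api_key = os.environ["API_KEY"]\ncolor = "#ff0000"\n'

# Compliance delta: how much the refactor improved the score.
delta = compliance(after, EVALS) - compliance(before, EVALS)

# Issues still present after the refactor, i.e. what the skills missed.
missed = violations(after, EVALS)

print(f"delta={delta:+.2f}, missed={missed}")
```

In this toy case the refactor removes the hardcoded secret but leaves the raw hex color, so the delta is positive while `missed` still names the design-token check. That remaining entry is the kind of signal you would use to tighten or retarget a skill.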
npx -y skills add hoangnguyen0403/agent-skills-standard --skill skill-benchmark --agent claude-code
Installs into .claude/skills of the current project.
Select a file.
metabase/metabase
github/awesome-copilot
UKGovernmentBEIS/inspect_evals
addyosmani/agent-skills