This measures whether your agents actually follow the skills and rules you've defined by auto-generating test scenarios at three strictness levels, running the agent, and checking if the tool call sequence matches what you'd expect. It's honest about prompt independence: does your agent still follow testing.md when the user prompt doesn't mention testing? You get self-contained reports with full timelines and LLM-based classification of each tool call against your spec. Most useful when you've just added new rules or skills and want to know if they're actually changing behavior, or when you suspect agents are ignoring something. The dry run mode lets you preview scenarios without burning tokens.
npx -y skills add affaan-m/everything-claude-code --skill skill-comply --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
juliusbrussee/caveman
mattpocock/skills
shadcn/improve
obra/superpowers
forrestchang/andrej-karpathy-skills
vercel-labs/skills