A structured harness for testing LLM agents before they break in production. You define 10 representative tasks your agent must ace and 5 refusal cases it must decline gracefully, then score each run across six dimensions: task success, safety, reliability, latency, debuggability, and factual grounding. The workflow enforces determinism controls, tool tracing, and baseline comparisons so you can gate deploys on measured thresholds instead of vibes. It includes copy-paste templates for day-0 setup, a scoring CLI, and guides for prompt injection tests, multi-agent coordination, and flake quarantine. Honest take: if you're shipping agents that call tools or handle user data, this is the scaffolding you should have built last month.
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill qa-agent-testing
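
To make the score-then-gate idea from the description concrete, here is a minimal Python sketch. It is not the skill's scoring CLI: the names `RunScore`, `aggregate`, `gate`, and `max_regression`, the 0-to-1 score scale, and the 0.05 regression tolerance are all assumptions made for illustration. Only the six dimension names come from the description.

```python
from dataclasses import dataclass
from statistics import mean

# The six dimensions named in the description above.
DIMENSIONS = (
    "task_success", "safety", "reliability",
    "latency", "debuggability", "factual_grounding",
)


@dataclass(frozen=True)
class RunScore:
    """Scores for one test case (hypothetical shape, not the skill's format)."""
    case_id: str
    scores: dict  # dimension name -> score in [0, 1]; assumed scale


def aggregate(runs):
    """Average each dimension across all scored runs."""
    return {d: mean(r.scores[d] for r in runs) for d in DIMENSIONS}


def gate(current, baseline, thresholds, max_regression=0.05):
    """Return a list of failure reasons; an empty list means the gate passes.

    A dimension fails if it is below its absolute threshold, or if it
    dropped more than `max_regression` relative to the baseline.
    """
    failures = []
    for d in DIMENSIONS:
        if current[d] < thresholds[d]:
            failures.append(
                f"{d}: {current[d]:.2f} is below threshold {thresholds[d]:.2f}"
            )
        elif baseline[d] - current[d] > max_regression:
            failures.append(
                f"{d}: regressed from {baseline[d]:.2f} to {current[d]:.2f}"
            )
    return failures
```

In a CI job, a non-empty result from `gate` would typically translate into a non-zero exit code so the deploy step never runs. Checking both an absolute threshold and a baseline delta catches two different problems: an agent that was never good enough, and one that quietly got worse.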