This is the framework you need when your agent works great in demos but falls apart in production. It provides statistical evaluation patterns that run each test multiple times to catch stochastic behavior, plus behavioral contract testing to enforce hard boundaries on what an agent can and cannot do. The statistical evaluator calculates pass rates with confidence intervals, tracks behavioral consistency across runs, and flags concerns such as high variance or unstable outputs. The behavioral contract pattern is especially useful for production agents where you need guarantees about tone, scope, or safety. Remember that even top agents score under 50% on real-world benchmarks, so this framework focuses on the metrics that actually matter for reliability.
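The statistical evaluation idea can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names (`wilson_interval`, `evaluate`), the 20-run default, and the 0.3 variance threshold are all assumptions chosen for the example.

```python
import math
from typing import Callable, Tuple

def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """95% Wilson score confidence interval for a pass rate."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

def evaluate(agent: Callable[[str], str], prompt: str,
             passed: Callable[[str], bool], runs: int = 20) -> dict:
    """Run a stochastic agent repeatedly and summarize reliability."""
    results = [passed(agent(prompt)) for _ in range(runs)]
    n_pass = sum(results)
    low, high = wilson_interval(n_pass, runs)
    return {
        "pass_rate": n_pass / runs,
        "ci_95": (low, high),
        # Wide interval means too few runs or unstable behavior (threshold is arbitrary)
        "high_variance": high - low > 0.3,
    }

# Usage with a stand-in deterministic "agent":
report = evaluate(lambda p: "ok", "ping", passed=lambda out: out == "ok")
```

The point of the interval is that a single pass/fail run tells you almost nothing about a stochastic system; 20 runs with 20 passes still only bounds the true pass rate to roughly 84%–100%.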
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill agent-evaluation
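The behavioral contract pattern described above can be sketched like this. Again a hypothetical illustration, not the skill's API: the `Contract` dataclass and the example rules (refund promises, scope, length) are assumptions standing in for whatever tone, scope, or safety boundaries your agent needs.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Contract:
    """A hard boundary an agent response must never cross."""
    name: str
    violated: Callable[[str], bool]

def check_contracts(response: str, contracts: List[Contract]) -> List[str]:
    """Return the names of all violated contracts; an empty list means pass."""
    return [c.name for c in contracts if c.violated(response)]

# Example contracts (hypothetical rules for a support agent):
contracts = [
    Contract("no_refund_promises",
             lambda r: re.search(r"\bguarantee(d)?\s+(a\s+)?refund\b", r, re.I) is not None),
    Contract("stays_in_scope",
             lambda r: "legal advice" in r.lower()),
    Contract("length_limit",
             lambda r: len(r) > 2000),
]

violations = check_contracts("Sure, I guarantee a refund today.", contracts)
```

Unlike the statistical metrics, contracts are binary gates: one violation in any run is a failure, which is why they pair well with the repeated-run evaluator rather than replacing it.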