This skill covers evaluating AI agents by outcome rather than by execution path. Agents are non-deterministic and can take different valid routes to the same goal, so a test that asserts one exact sequence of steps will reject runs that are actually correct. Reach for it when building quality gates for agent pipelines, catching regressions, or validating that your context engineering works as intended. It is forked from shipshitdev/library and has passed security audits from Gen Agent Trust Hub, Socket, and Snyk. Because agents, unlike traditional software, often have no single correct answer, the framework provides evaluation methods that account for dynamic decision making and give you actionable feedback on whether your agent reached the right outcome through reasonable means.
npx skills add https://github.com/flora131/atomic --skill evaluation
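
To make the outcome-over-path idea concrete, here is a minimal Python sketch. It is not the skill's own API: `AgentRun`, `OutcomeCheck`, `evaluate`, and the `max_steps` budget are hypothetical names chosen for illustration. The sketch scores a run on its final state and uses a loose step budget as a stand-in for "reasonable means", so two runs that take different routes can both pass.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentRun:
    """Hypothetical record of one agent run: steps taken plus the final state."""
    steps: list[str]
    final_state: dict


@dataclass
class OutcomeCheck:
    """A named predicate over the final state (illustrative, not a real API)."""
    name: str
    predicate: Callable[[dict], bool]


def evaluate(run: AgentRun, checks: list[OutcomeCheck], max_steps: int = 10) -> dict:
    """Judge a run by what it achieved, not by which steps it took.

    The step budget is an assumed, deliberately loose proxy for
    "reasonable means"; it never compares against an exact step sequence.
    """
    failed = [c.name for c in checks if not c.predicate(run.final_state)]
    within_budget = len(run.steps) <= max_steps
    return {
        "passed": not failed and within_budget,
        "failed_checks": failed,
        "within_step_budget": within_budget,
    }


checks = [
    OutcomeCheck("file_created", lambda s: s.get("file_exists") is True),
    OutcomeCheck("tests_green", lambda s: s.get("tests_passed") is True),
]

# Two different routes to the same goal, and one run that misses the goal.
run_a = AgentRun(["read_file", "edit_file", "run_tests"],
                 {"file_exists": True, "tests_passed": True})
run_b = AgentRun(["search", "read_file", "write_file", "run_tests", "run_tests"],
                 {"file_exists": True, "tests_passed": True})
run_c = AgentRun(["edit_file"],
                 {"file_exists": True, "tests_passed": False})

result_a = evaluate(run_a, checks)
result_b = evaluate(run_b, checks)
result_c = evaluate(run_c, checks)

for label, result in [("a", result_a), ("b", result_b), ("c", result_c)]:
    print(label, result)
```

Runs `a` and `b` follow different step sequences but both pass, while run `c` fails with `tests_green` listed in `failed_checks`, which is the kind of actionable feedback a quality gate or regression check can act on.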