This skill covers building evaluation frameworks for agent systems, where conventional testing assumptions break down. Agents are non-deterministic and can take completely different valid paths to the same goal, so evaluation should rely on outcome-focused rubrics rather than checks for specific steps. A notable finding from BrowseComp research is that token usage explains 80% of performance variance, so evaluations should run under realistic token budgets rather than unlimited resources. The framework combines LLM-as-judge for scale, human evaluation for edge cases, and multi-dimensional scoring across accuracy, completeness, and tool efficiency. Use it when you need systematic testing before shipping changes or want to catch regressions in production agent systems.
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill evaluation
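
As a rough illustration of the multi-dimensional, outcome-focused scoring described above, here is a minimal Python sketch. It is not the skill's actual implementation: the `RunResult` type, the `evaluate` function, the dimension weights, and the pass threshold are all hypothetical and chosen only for illustration. The sketch assumes per-dimension scores in the range 0 to 1 have already been produced by an LLM judge or a human reviewer, and it treats the token budget as a hard constraint.

```python
from dataclasses import dataclass

# Hypothetical weights: the description names these dimensions
# but does not say how they should be weighted.
WEIGHTS = {"accuracy": 0.5, "completeness": 0.3, "tool_efficiency": 0.2}


@dataclass
class RunResult:
    """Outcome of one agent run (illustrative type, not part of the skill)."""
    scores: dict[str, float]  # per-dimension scores in [0, 1], e.g. from an LLM judge
    tokens_used: int          # total tokens the agent consumed on this task


def evaluate(run: RunResult, token_budget: int, pass_threshold: float = 0.7) -> dict:
    """Score the final outcome only; the path the agent took is not inspected."""
    missing = set(WEIGHTS) - set(run.scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")

    # Weighted average across the rubric dimensions.
    overall = sum(WEIGHTS[dim] * run.scores[dim] for dim in WEIGHTS)

    # A run that exceeds the budget fails regardless of its score, so results
    # reflect realistic resource limits rather than unlimited token spend.
    within_budget = run.tokens_used <= token_budget

    return {
        "overall": round(overall, 3),
        "within_budget": within_budget,
        "passed": within_budget and overall >= pass_threshold,
    }
```

Calling `evaluate(RunResult(scores, tokens_used), token_budget=50_000)` returns the weighted score together with a pass/fail flag. Tracking that flag across a fixed task set before and after a change is one simple way to surface regressions.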