This gives you the scaffolding to build repeatable LLM evaluations with golden datasets, scoring rubrics, and regression tracking. You define test cases with expected outputs, run your model against them, score the results with exact match, semantic similarity, or LLM-as-judge, then check whether the scores meet your thresholds. The regression report compares a baseline run to the current run, so you catch it when a prompt tweak breaks something that used to work. It's structured enough to drop into CI, but the scoring functions stay flexible. Best for teams that have moved past vibes-based testing and need systematic quality gates before shipping model changes.
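The workflow described above (golden cases, a scorer, a threshold gate, and a baseline-versus-current comparison) can be sketched in a few lines of Python. This is a minimal illustration and not the skill's actual code or API: every name here (`GoldenCase`, `run_eval`, `passes_gate`, `regressions`) is hypothetical, the "models" are stub dictionaries, and `token_overlap` is a crude word-level stand-in for real semantic similarity, which would normally use embeddings. LLM-as-judge scoring is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenCase:
    """One entry in a golden dataset (hypothetical shape)."""
    id: str
    prompt: str
    expected: str


def exact_match(expected: str, actual: str) -> float:
    """Score 1.0 when the strings match, ignoring case and outer whitespace."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def token_overlap(expected: str, actual: str) -> float:
    """Jaccard overlap of word sets; a rough stand-in for semantic similarity."""
    a, b = set(expected.lower().split()), set(actual.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_eval(
    cases: List[GoldenCase],
    model: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> Dict[str, float]:
    """Run the model on every case and return a score per case id."""
    return {c.id: scorer(c.expected, model(c.prompt)) for c in cases}


def passes_gate(scores: Dict[str, float], threshold: float) -> bool:
    """Quality gate: the mean score must reach the threshold."""
    if not scores:
        return False
    return sum(scores.values()) / len(scores) >= threshold


def regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.0
) -> List[str]:
    """Ids of cases whose score dropped by more than `tolerance`."""
    return sorted(
        cid for cid, old in baseline.items()
        if current.get(cid, 0.0) < old - tolerance
    )


# Stub "models": lookup tables standing in for real LLM calls.
golden = [
    GoldenCase("capital", "Capital of France?", "Paris"),
    GoldenCase("sum", "2 + 2?", "4"),
]
baseline_answers = {"Capital of France?": "Paris", "2 + 2?": "4"}
current_answers = {"Capital of France?": "paris", "2 + 2?": "5"}

baseline_scores = run_eval(golden, baseline_answers.__getitem__, exact_match)
current_scores = run_eval(golden, current_answers.__getitem__, exact_match)

gate_ok = passes_gate(current_scores, threshold=0.9)
broken = regressions(baseline_scores, current_scores)
```

In CI, a non-empty `broken` list or a false `gate_ok` would typically fail the build. Keeping the scorer as a plain function argument is what lets you swap exact match for a similarity or judge-based scorer without touching the rest of the run.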
npx skills add https://github.com/patricio0312rev/skills --skill evaluation-harness