Evaluation Harness

137 installs, 43 stars

Summary

This skill provides scaffolding for building repeatable LLM evaluations with golden datasets, scoring rubrics, and regression tracking. You define test cases with expected outputs, run your model against them, score the outputs with exact match, semantic similarity, or LLM-as-judge, then check whether the scores meet your thresholds. The regression report compares a baseline run to the current run, so you catch a prompt tweak that breaks something that used to work. It is structured enough to drop into CI but leaves the choice of scoring function open. Best for teams that have moved past vibes-based testing and need systematic quality gates before shipping model changes.
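The workflow described above (golden cases, a scoring function, a threshold gate, and a baseline-versus-current comparison) can be sketched roughly as follows. This is a minimal illustration, not the skill's actual API: `EvalCase`, `exact_match`, `run_eval`, `passes_threshold`, and `regressions` are hypothetical names, the "models" are stand-in lookup functions, and only exact-match scoring is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One golden test case (hypothetical structure, for illustration only)."""
    case_id: str
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    cases: List[EvalCase],
    model: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run the model on every case and return a score per case id."""
    return {c.case_id: scorer(model(c.prompt), c.expected) for c in cases}


def passes_threshold(scores: Dict[str, float], threshold: float) -> bool:
    """Quality gate: the mean score must reach the threshold."""
    if not scores:
        return False
    return sum(scores.values()) / len(scores) >= threshold


def regressions(baseline: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """List case ids whose current score is lower than the baseline score."""
    return sorted(cid for cid, s in baseline.items() if current.get(cid, 0.0) < s)


# Stand-in "models": plain lookup tables instead of real LLM calls.
golden = [
    EvalCase("capital", "Capital of France?", "Paris"),
    EvalCase("sum", "2 + 2 = ?", "4"),
]
baseline_answers = {"Capital of France?": "Paris", "2 + 2 = ?": "4"}
current_answers = {"Capital of France?": "paris", "2 + 2 = ?": "five"}

baseline = run_eval(golden, lambda p: baseline_answers[p])
current = run_eval(golden, lambda p: current_answers[p])

print(passes_threshold(current, 0.9))  # False: mean score is 0.5
print(regressions(baseline, current))  # ['sum']
```

Because the scorer is passed in as a plain function, a semantic-similarity or LLM-as-judge scorer could be swapped in without changing the gate or the regression check. In CI, a failing gate or a non-empty regression list would typically translate into a non-zero exit code.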

Install

npx skills add https://github.com/patricio0312rev/skills --skill evaluation-harness

First Seen: Jun 3, 2026