You know those moments when you deploy an LLM change and wonder whether you just made things better or worse? This skill helps you actually measure that with real numbers. It sets up automated scoring with metrics like BLEU, ROUGE, and BERTScore, plus LLM-as-judge patterns where Claude evaluates outputs for quality, accuracy, and safety. You get comparison frameworks for A/B testing different prompts, groundedness checks against source material, and toxicity detection. It's really useful when you're iterating on prompts, comparing models, or trying to catch regressions before they hit production. Beats guessing based on a few cherry-picked examples.
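To give a feel for the automated-scoring side, here's a minimal sketch using the Hugging Face `evaluate` library for ROUGE and BERTScore. The skill may wire up different tooling under the hood, and the prediction/reference strings are just placeholders:

```python
# Sketch: automated scoring with ROUGE and BERTScore (not the skill's exact code).
# Assumes `pip install evaluate rouge_score bert_score` has been run.
import evaluate

# Hypothetical model output plus a reference answer.
predictions = ["The cache is invalidated after every write to the user table."]
references = ["Writes to the user table invalidate the cache."]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

print(f"ROUGE-L: {rouge_scores['rougeL']:.3f}")
print(f"BERTScore F1: {bert_scores['f1'][0]:.3f}")
```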
npx skills add https://github.com/wshobson/agents --skill llm-evaluation
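And a rough idea of the LLM-as-judge pattern, here using the Anthropic Python SDK to have Claude grade an answer for groundedness against source material. The model name, rubric, and 1-5 scale are illustrative assumptions, not what the skill ships:

```python
# Sketch: LLM-as-judge with Claude (illustrative rubric and model choice).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(question: str, answer: str, source: str) -> str:
    """Ask Claude to rate an answer's groundedness against the source material."""
    prompt = (
        "You are grading an LLM answer.\n"
        f"Question: {question}\n"
        f"Source material: {source}\n"
        f"Answer: {answer}\n"
        "Rate groundedness from 1 to 5 (5 = fully supported by the source), "
        "then give a one-sentence justification."
    )
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: use whichever model you evaluate with
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

print(judge(
    "When is the cache invalidated?",
    "After every write to the user table.",
    "The cache is cleared whenever the user table is written.",
))
```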