A reference for measuring LLM output quality with automated metrics (BLEU, ROUGE, BERTScore), human evaluation rubrics, and LLM-as-judge patterns. You get working Python examples for scoring translations and summaries, comparing model outputs pairwise, and building custom metrics like groundedness checks. Automated metrics are fast but often miss nuance, so the guide walks through when to layer in human ratings or use a stronger model as a judge. It is most useful when you want to catch regressions before shipping prompt changes, or need to justify which of two models performs better on your specific use case.
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill llm-evaluation
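
To give a feel for the automated-metric and pairwise-comparison ideas described above, here is a minimal sketch in plain Python. It is not taken from the skill itself and is not the official ROUGE implementation: it assumes whitespace tokenization with no stemming, and the names `unigram_f1` and `pick_better` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercased, whitespace-split tokens (a simplification)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def pick_better(output_a: str, output_b: str, reference: str) -> str:
    """Pairwise comparison: return "A", "B", or "tie" by unigram F1 against a reference."""
    score_a = unigram_f1(output_a, reference)
    score_b = unigram_f1(output_b, reference)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(pick_better("the cat sat on the mat", "a dog ran", reference))
```

A score like this is cheap enough to run on every prompt change, but it only measures word overlap, so a paraphrase that is correct can still score low. That is the gap the human-rating and LLM-as-judge approaches are meant to cover.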