This skill walks you through evaluating RAG systems by measuring retrieval and generation separately, which is the right approach but easy to skip when you're just trying to ship. It pushes you to do error analysis on traces first, then build an eval dataset from synthetic QA pairs or manual curation. The chunking grid search is genuinely useful, since chunk size and overlap can swing Recall@k by 10+ points on the same retriever. The generation eval is split cleanly into faithfulness and relevance, and the anti-pattern list (don't use ROUGE, don't skip traces) will save you from common mistakes. Good for when your RAG pipeline works some of the time and you need systematic measurement to find the actual bottleneck.
npx skills add https://github.com/hamelsmu/evals-skills --skill evaluate-rag
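
To make the chunking grid search concrete, here is a minimal sketch of the idea, not the skill's own implementation. Everything in it is an assumption for illustration: the word-window chunker, the toy keyword-overlap retriever, and the eval set format of (query, gold evidence string) pairs. With one gold evidence string per query, Recall@k here reduces to the share of queries whose evidence shows up in the top k chunks.

```python
from itertools import product


def chunk_words(text, size, overlap):
    """Split text into windows of `size` words, stepping by size - overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def retrieve(query, chunks, k):
    """Toy retriever (assumption): rank chunks by shared lowercase words."""
    q = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(eval_set, chunks, k):
    """Share of queries whose gold evidence string appears in a top-k chunk."""
    if not eval_set:
        return 0.0
    hits = sum(
        any(gold in chunk for chunk in retrieve(query, chunks, k))
        for query, gold in eval_set
    )
    return hits / len(eval_set)


def grid_search(text, eval_set, sizes, overlaps, k):
    """Score every valid (chunk size, overlap) pair with the same retriever."""
    results = {}
    for size, overlap in product(sizes, overlaps):
        if overlap >= size:
            continue  # skip combinations that would never advance
        chunks = chunk_words(text, size, overlap)
        results[(size, overlap)] = recall_at_k(eval_set, chunks, k)
    return results
```

A hypothetical run over a two-sentence corpus looks like this. In practice you would swap in your real retriever and a traced or curated eval set, and keep the retriever fixed so that only the chunking changes between runs.

```python
text = (
    "The invoice service retries failed payments three times before alerting. "
    "Refund requests are processed within five business days by the billing team."
)
eval_set = [
    ("how many times are failed payments retried", "three times"),
    ("how long do refund requests take", "five business days"),
]
for config, score in grid_search(text, eval_set, [4, 8, 22], [0, 2], k=1).items():
    print(config, round(score, 2))
```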