This skill walks you through the unglamorous work of figuring out how your LLM system actually fails. You review about 100 traces, note what went wrong in each one, then group similar failures into 5-10 categories you can measure and fix. The process is deliberately manual at first because a predefined list of failures causes confirmation bias. It pushes you to distinguish root causes (a filter missing from the SQL) from symptoms (wrong results downstream) and to build evaluators only for failures that warrant the effort. Use it when starting evals, after big pipeline changes, or when production metrics tank and you need to know why.
npx skills add https://github.com/hamelsmu/evals-skills --skill error-analysis
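
Once the traces are annotated by hand, the "measure" step can be as small as a tally. The sketch below is illustrative only and is not part of the skill: the `annotations` records, the `first_failure` field, and the category names are all hypothetical, and it assumes you record one root-cause category per trace (the earliest thing that went wrong) rather than every downstream symptom.

```python
from collections import Counter

# Hypothetical hand-written annotations, one per reviewed trace.
# "first_failure" is the earliest thing that went wrong (the root cause),
# not a downstream symptom. None means the trace looked fine.
annotations = [
    {"trace_id": "t1", "first_failure": "missing SQL filter"},
    {"trace_id": "t2", "first_failure": None},
    {"trace_id": "t3", "first_failure": "missing SQL filter"},
    {"trace_id": "t4", "first_failure": "ignored user constraint"},
    {"trace_id": "t5", "first_failure": "wrong tool called"},
]


def failure_rates(annotations):
    """Return (category, count, share of all reviewed traces), most frequent first."""
    total = len(annotations)
    counts = Counter(
        a["first_failure"] for a in annotations if a["first_failure"]
    )
    return [(category, n, n / total) for category, n in counts.most_common()]


rates = failure_rates(annotations)
for category, count, share in rates:
    print(f"{category}: {count} traces ({share:.0%})")
```

Sorting by frequency makes it easier to see which categories are common enough to justify an evaluator and which are rare enough to fix once and move on.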