This is a structured approach to creating test data for LLM pipelines when you don't have enough real examples. You define dimensions where your system might fail (like query ambiguity or user persona), generate tuples of those dimensions, then convert them to natural language queries. The two-step separation between tuples and queries is the key move here. It avoids the repetitive, happy-path garbage you get from just prompting "give me test queries." Target is around 100 diverse traces to find failure patterns. Skip this if you already have 100+ real queries, use stratified sampling instead. Also skip it for specialized domains where LLMs can't generate realistic content, like legal filings or low-resource languages.
npx skills add https://github.com/hamelsmu/evals-skills --skill generate-synthetic-data