This skill handles the full lifecycle of building and maintaining evaluation datasets for AI systems. You get patterns for curating content with multi-agent annotation pipelines, versioning datasets as JSON (not SQL dumps), and validating quality with duplicate detection and drift analysis. The standout piece is the 9-phase workflow for adding new documents, which includes parallel quality scoring and bias detection. It's opinionated about thresholds: a minimum quality score of 0.70, at least 2 domain tags per entry, and specific difficulty distribution requirements. If you're serious about eval datasets rather than just throwing examples in a folder, the structure here will save you from the usual mess of placeholder URLs and broken referential integrity.
npx skills add https://github.com/yonatangross/orchestkit --skill golden-dataset
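
To make the thresholds in the description concrete, here is a minimal Python sketch of what such a validation pass might look like. It is not the skill's actual implementation: the field names (`id`, `quality_score`, `domain_tags`, `source_url`, `content`), the placeholder-host list, and both function names are assumptions made for illustration. Only the 0.70 quality minimum and the 2-tag minimum come from the description. Duplicate detection is reduced here to exact matching on normalized text, and the difficulty distribution check is left out because the description does not state the required proportions.

```python
import hashlib
import json

MIN_QUALITY_SCORE = 0.70  # threshold stated in the description
MIN_DOMAIN_TAGS = 2       # threshold stated in the description

# Hypothetical markers of a placeholder URL; adjust to your own dataset.
PLACEHOLDER_HOSTS = ("example.com", "placeholder", "localhost")


def content_fingerprint(text: str) -> str:
    """Hash of case- and whitespace-normalized text, for exact-duplicate detection."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def validate_entries(entries: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means every entry passed."""
    problems = []
    seen = {}  # fingerprint -> id of the first entry with that content
    for entry in entries:
        entry_id = entry.get("id", "<missing id>")
        if entry.get("quality_score", 0.0) < MIN_QUALITY_SCORE:
            problems.append(f"{entry_id}: quality score below {MIN_QUALITY_SCORE:.2f}")
        if len(entry.get("domain_tags", [])) < MIN_DOMAIN_TAGS:
            problems.append(f"{entry_id}: fewer than {MIN_DOMAIN_TAGS} domain tags")
        url = entry.get("source_url", "")
        if any(host in url for host in PLACEHOLDER_HOSTS):
            problems.append(f"{entry_id}: placeholder URL {url}")
        fingerprint = content_fingerprint(entry.get("content", ""))
        if fingerprint in seen:
            problems.append(f"{entry_id}: duplicate of {seen[fingerprint]}")
        else:
            seen[fingerprint] = entry_id
    return problems


def dump_dataset(entries: list[dict], version: str) -> str:
    """Serialize with sorted keys and indentation so versions diff cleanly in git."""
    return json.dumps({"version": version, "entries": entries}, indent=2, sort_keys=True)
```

Returning a list of problems instead of raising on the first failure lets a single pass report everything wrong with a batch of new entries. Sorted keys in `dump_dataset` keep JSON diffs between dataset versions limited to real changes.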