NVIDIA's GPU-accelerated toolkit for cleaning and preparing LLM training data at scale. Its fuzzy deduplication runs 16× faster than CPU alternatives on massive datasets like Common Crawl, and it handles the full curation pipeline: quality filtering with 30+ heuristics, exact and semantic deduplication, PII redaction, and NSFW detection. It's not limited to text: there's solid support for images, video, and audio, backed by models like CLIP and by ASR inference. If you're building training datasets from web scrapes or need to deduplicate terabytes of data, this is the right tool. For smaller datasets or CPU-only environments, you're probably better off with something like datatrove.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill nemo-curator
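
To make "fuzzy deduplication" concrete, here is a minimal CPU-only sketch of the general idea using MinHash signatures over word shingles. It does not use NeMo Curator's API; every name in it (`shingles`, `minhash_signature`, `estimated_jaccard`, `fuzzy_dedup`) and every parameter value (64 hashes, 3-word shingles, a 0.7 threshold) is an assumption chosen for illustration.

```python
import hashlib
import re

NUM_HASHES = 64    # signature length (illustrative value)
SHINGLE_SIZE = 3   # words per shingle (illustrative value)


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of k-word shingles of a text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """64-bit hash of a shingle; the seed selects one of NUM_HASHES hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    return [
        min(_seeded_hash(seed, s) for s in doc_shingles)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def fuzzy_dedup(docs, threshold=0.7):
    """Keep each document unless it looks like a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog",
        "the quick brown fox, jumps over the lazy dog!",
        "GPU clusters process terabytes of scraped web text",
    ]
    for doc in fuzzy_dedup(corpus):
        print(doc)
```

This sketch compares every new document against every kept one, which is quadratic in corpus size. Large-scale pipelines typically avoid that by bucketing signatures with locality-sensitive hashing so only likely matches are compared, and that is the kind of workload where GPU acceleration pays off.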