NVIDIA's enterprise benchmarking platform runs your LLMs through 100+ evaluation tasks, including MMLU, HumanEval, and GSM8K, drawn from 18+ harnesses. It works with any OpenAI-compatible endpoint and handles execution across local Docker, Slurm HPC clusters, or cloud platforms. Evaluations run in containers, which makes results reproducible, and built-in exporters send them to MLflow and Weights & Biases. If you're running evals on a single machine with simpler needs, lm-evaluation-harness is the lighter-weight choice. But if you're benchmarking at scale across infrastructure, or need that breadth of harness coverage in one tool, this is the better fit.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill nemo-evaluator-sdk