This skill wraps the LM Evaluation Harness, the standard tool for benchmarking language models across 60+ academic datasets like MMLU, GSM8K, and HellaSwag. You point it at any HuggingFace model and get apples-to-apples comparisons using consistent prompts and metrics. It's what most researchers use when they publish benchmark numbers, so if you're fine-tuning models or need to justify which one to deploy, your results are measured the same way as the numbers everyone else is citing. The skill makes the harness available in Claude Code workflows, though you're mostly just getting a thin wrapper around lm_eval's CLI. Useful if you're already doing AI research work and want benchmarks without leaving your agent environment.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill evaluating-llms-harness
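Since the skill is described as a thin wrapper around lm_eval's CLI, it may help to see roughly what an underlying invocation looks like. The sketch below only assembles the command line and does not run it. The helper name `build_lm_eval_command`, the example model ID, and the chosen defaults are illustrative assumptions, not part of the skill; the flags themselves (`--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--batch_size`, `--output_path`) are lm_eval's own.

```python
import shlex


def build_lm_eval_command(model_id, tasks, num_fewshot=None,
                          batch_size="auto", output_path=None):
    """Assemble an lm_eval argv list for a HuggingFace model.

    Hypothetical helper for illustration: it builds the command only,
    so it works without lm_eval installed.
    """
    if not tasks:
        raise ValueError("at least one task is required")

    argv = [
        "lm_eval",
        "--model", "hf",                          # HuggingFace backend
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),               # comma-separated task names
        "--batch_size", str(batch_size),
    ]
    # Optional flags are added only when the caller sets them.
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        argv += ["--output_path", output_path]
    return argv


if __name__ == "__main__":
    # Example model ID is an assumption; substitute your own checkpoint.
    cmd = build_lm_eval_command(
        "EleutherAI/pythia-160m",
        ["hellaswag", "gsm8k"],
        num_fewshot=5,
    )
    # Print a shell-safe command string to copy into a terminal.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string means it can be passed straight to `subprocess.run(cmd, check=True)` once lm_eval is installed, with no shell quoting issues around model paths.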