A wrapper around the lm-evaluation-harness library that lets you benchmark language models against 60+ academic benchmarks right from your terminal. You point it at any HuggingFace model and specify which benchmarks to run (MMLU, GSM8K, HellaSwag, etc.), and it handles the standardized prompts and scoring. Useful when you're comparing models or want standardized, comparable scores for how a fine-tuned model stacks up. The skill itself is a template that simplifies the CLI invocation; the real heavy lifting happens in the underlying EleutherAI library. If you evaluate models regularly, this saves you from remembering all the command flags and argument patterns.
npx skills add https://github.com/davila7/claude-code-templates --skill evaluating-llms-harness
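To give a sense of the kind of invocation the description refers to, here is a minimal Python sketch that assembles an `lm_eval` command line for a HuggingFace model. The flags shown (`--model hf`, `--model_args pretrained=...`, `--tasks`, `--device`, `--batch_size`, `--output_path`) are the ones the harness CLI is generally documented to accept; the helper name `build_lm_eval_command`, the placeholder model ID `your-org/your-model`, and the default values are assumptions made for illustration, and the skill's own template may invoke the harness differently.

```python
import shlex


def build_lm_eval_command(model_id, tasks, batch_size=8, device="cuda:0", output_path=None):
    """Return an argv list for an lm-evaluation-harness run (hypothetical helper)."""
    if not tasks:
        raise ValueError("at least one task name is required")
    cmd = [
        "lm_eval",
        "--model", "hf",                          # HuggingFace transformers backend
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),               # comma-separated benchmark names
        "--device", device,
        "--batch_size", str(batch_size),
    ]
    if output_path is not None:
        cmd += ["--output_path", output_path]     # where the harness writes results
    return cmd


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder, not a real model ID.
    cmd = build_lm_eval_command("your-org/your-model", ["mmlu", "gsm8k", "hellaswag"])
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(cmd))
```

The sketch only builds and prints the command. Actually running it requires the harness to be installed and a real model ID; at that point the list could be passed to `subprocess.run(cmd, check=True)`.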