Need to compare multiple LLMs or agents but don't have test data ready? This skill generates test queries from your task description, collects responses from all your endpoints, creates evaluation rubrics on the fly, then runs pairwise comparisons with position-bias swapping (each pair is judged in both orders) to produce win-rate rankings. It's built on OpenJudge's AutoArenaPipeline and handles the full workflow, from query generation through final reports with charts. It runs entirely from a YAML config or the Python API, supports checkpointing so you can resume interrupted runs, and lets you swap judge models mid-evaluation without rerunning everything. A solid choice when you want arena-style benchmarking without first curating datasets or evaluation criteria by hand.
npx -y skills add agentscope-ai/openjudge --skill auto-arena --agent claude-code
Installs into .claude/skills of the current project.
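To make the pairwise step concrete, here is a minimal sketch of position-bias swapping and win-rate ranking in plain Python. It does not use OpenJudge's API: `toy_judge`, `compare_with_swap`, `win_rates`, and the sample responses are all hypothetical, and the toy judge simply prefers the longer answer (favoring the first position on ties) as a stand-in for an LLM judge. Counting an order-dependent verdict as a tie is also an assumption, one common way to handle disagreement between the two orderings.

```python
from collections import defaultdict
from itertools import combinations


def toy_judge(query, first, second):
    """Hypothetical judge: prefers the longer answer, and favors the
    first position on ties, which imitates position bias."""
    return "first" if len(first) >= len(second) else "second"


def compare_with_swap(judge, query, name_a, resp_a, name_b, resp_b):
    """Judge the pair in both orders; return the winner's name, or None
    when the two verdicts disagree (treated here as a tie)."""
    winner_ab = name_a if judge(query, resp_a, resp_b) == "first" else name_b
    winner_ba = name_b if judge(query, resp_b, resp_a) == "first" else name_a
    return winner_ab if winner_ab == winner_ba else None


def win_rates(judge, responses):
    """responses maps query -> {model_name: response_text}."""
    score = defaultdict(float)
    games = defaultdict(int)
    for query, by_model in responses.items():
        for a, b in combinations(sorted(by_model), 2):
            winner = compare_with_swap(judge, query, a, by_model[a], b, by_model[b])
            games[a] += 1
            games[b] += 1
            if winner is None:
                # Order-dependent verdict: split the point.
                score[a] += 0.5
                score[b] += 0.5
            else:
                score[winner] += 1.0
    return {model: score[model] / games[model] for model in games}


# Made-up sample data for illustration only.
responses = {
    "q1": {"model_a": "short", "model_b": "a much longer answer", "model_c": "mid answer"},
    "q2": {"model_a": "same", "model_b": "size", "model_c": "longer one"},
}

rates = win_rates(toy_judge, responses)
for model, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {rate:.3f}")
```

In the second query, `model_a` and `model_b` give equal-length answers, so the toy judge picks whichever comes first; swapping exposes that the verdict depends on order, and the pair is scored as a tie instead of a win.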