This benchmark verifies whether LLMs are citing real academic papers or making them up. It hits Crossref, PubMed, arXiv, and DBLP to check every reference your model returns, then scores hallucination rate, per-field accuracy (title, author, year, DOI), and discipline breakdown. You can run it with or without tool augmentation (ReAct plus web search). The pipeline saves checkpoints, generates markdown reports with charts, and supports year constraints in queries. Use it when you need hard numbers on citation reliability instead of vibes. Honestly, the fact that this needs to exist tells you something about current LLM behavior with references, but at least now you can measure the damage.
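As a rough illustration of the scoring step described above, the sketch below shows one way the hallucination rate, per-field accuracy, and discipline breakdown could be computed once each reference has been looked up. It is not the skill's actual implementation: the `VerifiedRef` record, its field names, and the `score` function are hypothetical, and the choice to count a field as correct only when the reference was found is an assumption.

```python
from dataclasses import dataclass

# Fields named in the description; the schema itself is an assumption.
FIELDS = ("title", "author", "year", "doi")


@dataclass
class VerifiedRef:
    """One model-produced reference after database lookup (hypothetical schema)."""
    discipline: str
    found: bool          # True if any bibliographic database matched it
    field_match: dict    # field name -> bool; only meaningful when found


def score(refs):
    """Aggregate hallucination rate, per-field accuracy, and per-discipline rates."""
    total = len(refs)
    if total == 0:
        return {"hallucination_rate": 0.0, "field_accuracy": {}, "by_discipline": {}}

    # A reference that no database could match counts as hallucinated.
    hallucinated = sum(1 for r in refs if not r.found)

    # Assumption: an unmatched reference scores zero on every field.
    field_accuracy = {
        f: sum(1 for r in refs if r.found and r.field_match.get(f, False)) / total
        for f in FIELDS
    }

    # Track (references seen, references hallucinated) per discipline.
    counts = {}
    for r in refs:
        seen, bad = counts.get(r.discipline, (0, 0))
        counts[r.discipline] = (seen + 1, bad + (0 if r.found else 1))

    return {
        "hallucination_rate": hallucinated / total,
        "field_accuracy": field_accuracy,
        "by_discipline": {d: bad / seen for d, (seen, bad) in counts.items()},
    }


# Made-up sample data, for illustration only.
refs = [
    VerifiedRef("cs", True, {"title": True, "author": True, "year": True, "doi": True}),
    VerifiedRef("cs", False, {}),
    VerifiedRef("biology", True, {"title": True, "author": False, "year": True, "doi": False}),
    VerifiedRef("biology", True, {"title": True, "author": True, "year": False, "doi": True}),
]
report = score(refs)
print(report)
```

Dividing field accuracy by the total number of references, rather than only the matched ones, makes a fabricated citation hurt every field score; a real pipeline may well normalize differently.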
npx -y skills add agentscope-ai/openjudge --skill ref-hallucination-arena --agent claude-code
Installs into .claude/skills of the current project.