This server brings RAGScore's evaluation toolkit into Claude, letting you generate synthetic QA datasets from documents and benchmark RAG systems without switching contexts. You can create question-answer pairs tailored to specific audiences (developers, customers, auditors), run multi-metric evaluations across correctness, completeness, and faithfulness, and diagnose failure modes (retriever miss versus generator hallucination). It works with any LLM provider, including local Ollama models for fully private workflows. The detailed evaluation mode returns five diagnostic dimensions per answer in a single call, making it practical for iterating on retrieval strategies or prompt engineering. Use it when you're building or debugging a RAG pipeline and need systematic test coverage rather than manual spot-checking.
claude mcp add --transport stdio hzyai-ragscore uvx ragscore
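
The retriever-miss versus generator-hallucination distinction in the description can be illustrated with a small triage sketch. This is not RAGScore's API or its actual diagnostic logic. The `AnswerScores` class, the `triage` function, the 0-to-1 score scale, and the 0.5 threshold are all hypothetical, chosen only to show the reasoning. An incorrect answer that stays faithful to its retrieved context suggests the context lacked the needed information, while an incorrect and unfaithful answer suggests the generator invented content.

```python
from dataclasses import dataclass


@dataclass
class AnswerScores:
    """Hypothetical per-answer scores, assumed to lie in [0, 1]."""
    correctness: float   # agreement with the reference answer
    faithfulness: float  # how well the answer is grounded in retrieved context


def triage(scores: AnswerScores, threshold: float = 0.5) -> str:
    """Label a likely failure mode. The threshold is an illustrative assumption."""
    if scores.correctness >= threshold:
        return "ok"
    # Wrong but grounded in the context: retrieval probably missed the right passage.
    if scores.faithfulness >= threshold:
        return "retriever_miss"
    # Wrong and not grounded in the context: the generator likely made it up.
    return "generator_hallucination"


if __name__ == "__main__":
    samples = [
        AnswerScores(correctness=0.9, faithfulness=0.8),
        AnswerScores(correctness=0.2, faithfulness=0.9),
        AnswerScores(correctness=0.1, faithfulness=0.2),
    ]
    for sample in samples:
        print(sample, "->", triage(sample))
```

A heuristic like this is only a first pass. Tallying the labels across a whole test set shows whether effort is better spent on the retrieval strategy or on the prompt.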