Regression testing infrastructure for AI agents that works like Playwright for tool-calling systems. This server connects golden baseline snapshots with CI/CD workflows, tracking behavioral drift across LangGraph, CrewAI, OpenAI, and Claude implementations. It exposes operations for creating test snapshots, running regression checks, and generating verdicts that classify changes by confidence level. The approach separates provider model drift from actual system regressions, replays tool calls deterministically using cassettes, and surfaces multi-level verdicts from safe-to-ship through block-release. You'd reach for this when unit tests pass but your agent's behavior has quietly degraded, or when you need to know whether a model provider update changed your system's outputs without touching your code.
claude mcp add --transport stdio io.github.hidai25-evalview-mcp uvx evalview-mcp