EvalView MCP

114 · auth · STDIO · registry active

Summary

Regression testing infrastructure for AI agents that works like Playwright for tool-calling systems. This server connects golden baseline snapshots with CI/CD workflows, tracking behavioral drift across LangGraph, CrewAI, OpenAI, and Claude implementations. It exposes operations for creating test snapshots, running regression checks, and generating verdicts that classify changes by confidence level. The approach separates provider model drift from actual system regressions, replays tool calls deterministically using cassettes, and surfaces multi-level verdicts from safe-to-ship through block-release. You'd reach for this when unit tests pass but your agent's behavior has quietly degraded, or when you need to know whether a model provider update changed your system's outputs without touching your code.

Snapshot testing for AI agents.
Record what your agent does today. Get told when it silently changes.

Your agent returns 200 and looks fine. But a model update, a provider change, or a one-line prompt edit just made it skip a clarification, call the wrong tool, or quietly drop output quality. Your tests still pass. Your users notice before you do.

EvalView snapshots your agent's behavior — the tools it calls, in what order, with what output — and tells you the moment that behavior changes. Like Jest snapshots, but for tool-calling, multi-turn agents.

Quick Start

pip install evalview

evalview snapshot    # Record your agent's current behavior as the baseline
evalview check       # After any change, diff against the baseline

That's the whole loop. For each test, evalview check reports one of these statuses:

  ✓ login-flow        PASSED          behavior matches baseline
  ⚠ refund-request    TOOLS_CHANGED   called a different tool, or in a different order
  ✗ billing-dispute   REGRESSION      score dropped — output quality fell

It diffs the whole trajectory — tool names, parameters, and order — not just the final string. The deterministic tool + sequence diff runs offline, with no API key. Add an LLM judge only when you want output-quality scoring.
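To make the idea of a trajectory diff concrete, here is a minimal sketch in plain Python. It is not EvalView's implementation: the `ToolCall` type, the `diff_trajectories` and `classify` helpers, the `tolerance` value, and the rule that a tool change is reported before a score drop are all assumptions made for illustration. Only the three status names come from the output above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical shape, not EvalView's)."""
    name: str
    params: dict[str, Any] = field(default_factory=dict)


def diff_trajectories(baseline: list[ToolCall], current: list[ToolCall]) -> list[str]:
    """Compare two runs step by step; an empty list means they match."""
    diffs = []
    for i, (old, new) in enumerate(zip(baseline, current)):
        if old.name != new.name:
            diffs.append(f"step {i}: tool {old.name!r} -> {new.name!r}")
        elif old.params != new.params:
            diffs.append(f"step {i}: {old.name} params {old.params} -> {new.params}")
    if len(baseline) != len(current):
        diffs.append(f"length: {len(baseline)} -> {len(current)} calls")
    return diffs


def classify(
    baseline: list[ToolCall],
    current: list[ToolCall],
    baseline_score: Optional[float] = None,
    current_score: Optional[float] = None,
    tolerance: float = 0.05,  # assumed threshold, purely illustrative
) -> str:
    """Map a comparison onto the three statuses shown above."""
    if diff_trajectories(baseline, current):
        return "TOOLS_CHANGED"
    # Scores are optional: without a judge, only the deterministic diff applies.
    if baseline_score is not None and current_score is not None:
        if current_score < baseline_score - tolerance:
            return "REGRESSION"
    return "PASSED"


baseline = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1"}),
]
reordered = list(reversed(baseline))

print(classify(baseline, baseline))        # same calls, same order
print(classify(baseline, reordered))       # same calls, different order
print(classify(baseline, baseline, 0.9, 0.6))  # same calls, lower score
```

The point of the sketch is that the first two comparisons need no model call at all, which is why the deterministic part can run offline.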

No agent yet? See it work in 30 seconds:

evalview demo

Why snapshot testing (and not assertions)?

Most eval tools ask you to write down what "good" looks like — assertions, metrics, rubrics. That's a lot of upfront work, and you can only catch the failures you thought to assert.

EvalView inverts it: it records what your agent actually does now, and flags any drift from that. You catch regressions you never anticipated, with zero assertions written. When the new behavior is correct, evalview snapshot accepts it as the new baseline — same as updating a snapshot in Jest.

| | EvalView | Assertion-based eval tools |
|---|---|---|
| Setup | Record current behavior | Write assertions/metrics first |
| Catches | Any drift from baseline | Only what you asserted |
| Non-determinism | Multi-variant baselines (up to 5 valid paths) | You handle it |
| Unit of comparison | Full tool-call trajectory | Usually final output |
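The "multi-variant baselines" row can be read as: a run passes if it matches any one of several accepted paths. The sketch below shows that idea under stated assumptions. The helper names `matches_any` and `accept_variant` are hypothetical, trajectories are reduced to lists of tool names, and only the cap of 5 comes from the table; how EvalView stores or matches variants is not described here.

```python
MAX_VARIANTS = 5  # the cap mentioned in the table above


def matches_any(variants: list[list[str]], observed: list[str]) -> bool:
    """A run is acceptable if it equals any accepted tool sequence."""
    return any(observed == variant for variant in variants)


def accept_variant(variants: list[list[str]], observed: list[str]) -> list[list[str]]:
    """Return a new variant list that also accepts `observed`."""
    if observed in variants:
        return variants
    if len(variants) >= MAX_VARIANTS:
        raise ValueError(f"already at {MAX_VARIANTS} variants; prune one first")
    return variants + [observed]


# Two valid paths: the agent may or may not ask a clarifying question first.
variants = [
    ["lookup_order", "issue_refund"],
    ["ask_clarification", "lookup_order", "issue_refund"],
]

print(matches_any(variants, ["lookup_order", "issue_refund"]))
print(matches_any(variants, ["issue_refund"]))
```

Capping the number of variants keeps the baseline meaningful: if every observed path were accepted, nothing would ever count as drift.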

This makes EvalView a merge-time regression gate, which is a different job from observability (Langfuse, LangSmith) or metric scoring (promptfoo, DeepEval, Braintrust). Many teams run one of those for visibility and EvalView as the gate. Honest comparisons →

EvalView tests itself in public, every day

The badge at the top is live. Every day at 09:00 UTC, a GitHub Action runs EvalView against EvalView — including a regression check where the tool snapshots a live agent and diffs it with the same snapshot / check loop this README asks you to trust. It also runs the full test suite, type checks, evalview demo, the end-to-end flows, an evalview monitor smoke test, and chat-mode self-tests.

When something breaks, the run opens a single rolling 🐕 dogfood issue and keeps updating it until the tool is green again — so failures are public, not quietly patched.

Live dogfood runs → · How it works →

CI: block regressions in every PR

# .github/workflows/evalview.yml
name: EvalView
on: [pull_request]
jobs:
  agent-check:
    runs-on: ubuntu-latest
    permissions: { pull-requests: write }
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.8.0
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}

You get a PR comment with the diff, cost/latency deltas, and a pass/fail gate. CI/CD guide →

Works with your stack

LangGraph · CrewAI · OpenAI · Claude · Mistral · Ollama · MCP · any HTTP API.

evalview check --agent http://localhost:8000/invoke

Framework details →

Use it as a library

from evalview import gate

result = gate(test_dir="tests/")
result.passed   # bool
result.diffs    # per-test scores and tool diffs

Python API →
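One way to use the library call in a script is to turn the result into a process exit code, so any CI system can act on it. The sketch below relies only on what the snippet above shows: `gate(test_dir=...)`, `result.passed`, and `result.diffs`. The `exit_code` and `main` helpers are hypothetical, and because the exact shape of `result.diffs` is not documented here, it is printed as-is.

```python
import sys
from types import SimpleNamespace


def exit_code(result) -> int:
    """Map a gate result to an exit code: 0 when passed, 1 otherwise."""
    return 0 if result.passed else 1


def main(test_dir: str = "tests/") -> int:
    # Imported lazily so exit_code() can be tried without evalview installed.
    from evalview import gate

    result = gate(test_dir=test_dir)
    if not result.passed:
        # The structure of result.diffs is not assumed; print it unformatted.
        print(result.diffs, file=sys.stderr)
    return exit_code(result)


# Stand-in objects with the same two attributes, for illustration only.
passing = SimpleNamespace(passed=True, diffs=[])
failing = SimpleNamespace(passed=False, diffs=["refund-request: tools changed"])
print(exit_code(passing), exit_code(failing))
```

In a real script you would end with `sys.exit(main())` so a failed gate fails the job.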

EvalView also does multi-turn testing, statistical/pass@k runs, record/replay cassettes, model-drift canaries, production monitoring with Slack alerts, and auto-generated regression tests from incidents. These are power-user features — start with snapshot and check, reach for the rest when you need them.
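For readers unfamiliar with record/replay cassettes, the general technique looks roughly like this: the first call to a tool is executed and stored, and later calls with the same arguments are answered from the recording. This is a generic sketch, not EvalView's cassette format or API. The `Cassette` class, its `call` method, and the `get_weather` tool are all hypothetical.

```python
import json
from typing import Any, Callable, Optional


class Cassette:
    """Stores tool results keyed by tool name and parameters (illustrative only)."""

    def __init__(self, recordings: Optional[dict[str, Any]] = None):
        self.recordings: dict[str, Any] = dict(recordings or {})

    @staticmethod
    def _key(tool: str, params: dict[str, Any]) -> str:
        # sort_keys makes the key independent of parameter order.
        return json.dumps([tool, params], sort_keys=True)

    def call(
        self,
        tool: str,
        params: dict[str, Any],
        live: Optional[Callable[..., Any]] = None,
    ) -> Any:
        """Replay a recorded result, or record one by calling `live`."""
        key = self._key(tool, params)
        if key in self.recordings:
            return self.recordings[key]
        if live is None:
            raise KeyError(f"no recording for {tool} with {params}")
        result = live(**params)
        self.recordings[key] = result
        return result


live_calls = []


def get_weather(city: str) -> dict[str, Any]:
    """Hypothetical tool; counts how often it is really invoked."""
    live_calls.append(city)
    return {"city": city, "temp_c": 21}


tape = Cassette()
first = tape.call("get_weather", {"city": "Oslo"}, live=get_weather)  # recorded
second = tape.call("get_weather", {"city": "Oslo"})                   # replayed
print(first == second, len(live_calls))
```

Replaying from a recording removes the tool's own variability from the comparison, so any remaining difference comes from the agent.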

→ Full feature reference · Getting Started · FAQ

Contributing

This is a young project built mostly by one developer. Issues, PRs, and "I tried it and X was confusing" feedback are all genuinely valuable.

Open an issue · Discussions · CONTRIBUTING.md

License: Apache 2.0

Configuration

OPENAI_API_KEY (secret)

OpenAI API key for LLM-as-judge output quality scoring. Optional — deterministic tool/sequence evaluation works without it.
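If you wrap EvalView in your own tooling, you may want to know up front whether output-quality scoring can run. A minimal sketch, assuming only that the key is read from the `OPENAI_API_KEY` environment variable; the `judge_enabled` helper is hypothetical and says nothing about how EvalView itself reads configuration.

```python
import os
from typing import Mapping, Optional


def judge_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when a non-empty OPENAI_API_KEY is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


print(judge_enabled({}))                             # key absent
print(judge_enabled({"OPENAI_API_KEY": "sk-test"}))  # key present
```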
