Gives Claude programmatic access to ForgeJudge's evaluation infrastructure for autonomous coding agents. You get operations to run the solver against tasks from the golden set, grade patches with the deterministic harness, fetch leaderboard results, and query per-run traces from the Langfuse observability backend. The golden set contains 18 contamination-resistant make-CI-green tasks, each mutation-hardened to catch wrong fixes. Useful when you're building or tuning agentic coding workflows and need reproducible benchmarking with execution-as-judge grading, or when you want Claude to help analyze regression patterns across model swaps or seed sweeps. The harness runs patches in sandboxed GitHub Actions VMs and returns strict RESOLVED verdicts based on FAIL_TO_PASS and PASS_TO_PASS test outcomes.
claude mcp add --transport stdio ahmedeid1-forgejudge -- uvx forgejudge
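
For readers unfamiliar with FAIL_TO_PASS / PASS_TO_PASS grading, here is a minimal sketch of how a strict RESOLVED verdict could be derived from test outcomes. It assumes the common convention that a patch counts as resolved only when every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes. The function name `is_resolved`, the result-dictionary shape, and the test names are hypothetical illustrations, not ForgeJudge's actual API or internals.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True only if the patch fixes the target tests and breaks nothing.

    `results` maps a test id to True when that test passed after the patch.
    A test missing from `results` is treated as failed (the "strict" part
    of this sketch, which is an assumption).
    """
    # Every test that was failing before the patch must pass now.
    fixed = all(results.get(test, False) for test in fail_to_pass)
    # Every test that was passing before the patch must still pass.
    preserved = all(results.get(test, False) for test in pass_to_pass)
    return fixed and preserved


# Hypothetical post-patch outcomes for one task.
outcomes = {"test_parse": True, "test_cli": True, "test_io": False}

print(is_resolved(outcomes, ["test_parse"], ["test_cli"]))             # True
print(is_resolved(outcomes, ["test_parse"], ["test_cli", "test_io"]))  # False: regression in test_io
print(is_resolved(outcomes, ["test_missing"], ["test_cli"]))           # False: target test never ran
```

Treating missing tests as failures keeps the verdict conservative: a patch that prevents the suite from running cannot be scored as a fix.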