This is Anthropic's March 2026 research on adversarial agent loops turned into a practical harness. You get three agents: a Planner that expands your one-liner into a full product spec, a Generator that builds it, and an Evaluator that runs Playwright tests against the live app and scores it on design, originality, craft, and functionality. The evaluator is engineered to be ruthlessly strict because agents are terrible at critiquing their own work. It's expensive (budget $50-200) and overkill for simple tasks, but if you're building a complete frontend or full-stack app where "AI slop" aesthetics aren't acceptable, this feedback loop can push quality way beyond what a single agent produces.
npx -y skills add affaan-m/everything-claude-code --skill gan-style-harness --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
metabase/metabase
metabase/metabase
telagod/code-abyss
github/awesome-copilot
DietrichGebert/ponytail
UKGovernmentBEIS/inspect_evals