This is a meta-skill for stress-testing the Vercel plugin itself, not for building apps. It walks you through spawning multiple Claude Code sessions in WezTerm panes, each running a different project prompt, and then checking debug logs and claim directories to verify which skills were injected. The goal is to validate that PreToolUse hooks fire correctly, that PostToolUse hooks catch validation errors, and that the plugin's skill injection logic holds up across complex scenarios involving Workflow DevKit, AI Gateway, MCP, and multi-agent orchestration. You run the evals manually from your conversation using bash tool calls rather than scripts, because only interactive sessions trigger the plugin hooks. It is useful if you're iterating on the plugin's skill catalog or debugging why certain patterns aren't being picked up.
npx skills add https://github.com/vercel-labs/vercel-plugin --skill benchmark-agents