Agent Evaluation

823 installs1.2k stars

Summary

This is a meta-skill for evaluating your Claude Code agents and commands. Use it when you're testing whether your prompt changes actually work or trying to measure if your context engineering is effective. The framework covers LLM-as-judge evaluation, multi-dimensional rubrics that weigh instruction following against tool efficiency, and test set design across complexity levels. One interesting detail: research shows token usage explains 80% of agent performance variance, so your evaluation needs realistic token constraints. The rubric templates and evaluation prompts are ready to adapt, which beats starting from scratch when you're trying to catch regressions or validate improvements.

Install to Claude Code

npx -y skills add neolabhq/context-engineering-kit --skill agent-evaluation --agent claude-code

Installs into .claude/skills of the current project.

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

AI notepad for back-to-back meetings

Notes, actions and memory. Without a meeting bot. First month 100% off.

Download for free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Email for Agents: Free tier available

Give your AI agent a complete email layer—sending, inbound inboxes, and sandbox testing.

Get 4K emails/month free →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

CodeScene MCP Server

Your agent targets a perfect 10 Code Health score. Deterministic. Every commit.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

AI notepad for back-to-back meetings

Notes, actions and memory. Without a meeting bot. First month 100% off.

Download for free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Email for Agents: Free tier available

Give your AI agent a complete email layer—sending, inbound inboxes, and sandbox testing.

Get 4K emails/month free →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

CodeScene MCP Server

Your agent targets a perfect 10 Code Health score. Deterministic. Every commit.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Files

SKILL.md

Select a file.

Featured

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

AI notepad for back-to-back meetings

Notes, actions and memory. Without a meeting bot. First month 100% off.

Download for free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Email for Agents: Free tier available

Give your AI agent a complete email layer—sending, inbound inboxes, and sandbox testing.

Get 4K emails/month free →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

CodeScene MCP Server

Your agent targets a perfect 10 Code Health score. Deterministic. Every commit.

Try For Free →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Agent Evaluation

Install to Claude Code

Agent Evaluation

Install to Claude Code

Recommended

Recommended