
Agent Evaluation

Editor's Note

This is for when you need to actually test agent reliability before production, not just run an agent through benchmarks once and hope. It covers statistical test evaluation (running the same tests many times and analyzing the distribution of outcomes), behavioral contract testing for invariants that must hold on every run, and adversarial testing that actively tries to break things. The sharp edges table is the most useful part: it calls out real problems like agents that score well on benchmarks but fail in production, flaky tests that only pass sometimes, and accidental data leakage. The core insight is right: evaluating LLM agents isn't like testing traditional software, because the same input can produce different outputs. You'll want this if you've ever watched an agent that passed all your tests completely fall apart with real users.
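
A minimal sketch of the statistical idea, in Python: re-run the same task and treat the pass rate as a distribution instead of a single verdict. Everything below is illustrative; run_agent and the exact-match grader are hypothetical stand-ins for your own harness, not part of the skill itself.

import random

def run_agent(task: str) -> str:
    # Hypothetical stand-in for your agent harness; simulated
    # nondeterminism so the sketch runs on its own.
    return "42" if random.random() < 0.8 else "forty-two"

def passed(output: str) -> bool:
    # Task-specific grader; exact match keeps the sketch simple.
    return output.strip() == "42"

def pass_rate(task: str, n: int = 30) -> tuple[float, float]:
    # Run the same task n times and summarize the outcome
    # distribution instead of trusting a single pass/fail.
    results = [passed(run_agent(task)) for _ in range(n)]
    p = sum(results) / n
    # Standard error of the pass rate: a wide interval means a
    # "passing" test is really a coin flip.
    se = (p * (1 - p) / n) ** 0.5
    return p, se

if __name__ == "__main__":
    p, se = pass_rate("What is 6 * 7? Answer with digits only.")
    print(f"pass rate {p:.0%} ± {se:.0%} (1 SE over 30 runs)")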
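
The behavioral-contract idea, sketched under the same assumptions (reusing the hypothetical run_agent above): instead of grading one expected answer, assert invariants that must hold on every run and count how often each one breaks. The specific rules are made up for illustration.

def contract_violations(output: str) -> list[str]:
    # Invariants every response must satisfy regardless of wording;
    # these particular rules are illustrative only.
    violations = []
    if not output.strip():
        violations.append("empty response")
    if len(output) > 500:
        violations.append("exceeds length budget")
    if "guarantee" in output.lower():
        violations.append("makes a forbidden guarantee")
    return violations

def check_contract(task: str, n: int = 30) -> dict[str, int]:
    # Re-run the task and count violations per invariant; a contract
    # that only holds "usually" is a flaky test in disguise.
    counts: dict[str, int] = {}
    for _ in range(n):
        for v in contract_violations(run_agent(task)):
            counts[v] = counts.get(v, 0) + 1
    return counts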

Install

npx skills add https://github.com/davila7/claude-code-templates --skill agent-evaluation
Votes: 0
Installs: 520
GitHub Stars: 27.3k
Categories: Backend & APIs · Testing & QA · Git & Pull Requests · AI & Agent Building · Design & UI/UX
First Seen: May 16, 2026
View on GitHub


Related Backend & APIs Skills

vercel-react-best-practices (vercel-labs/agent-skills)
Votes: 5 · Installs: 402.7k · Stars: 26.6k
React and Next.js performance optimization guide with 64 prioritized rules across 8 categories.

azure-storage (microsoft/azure-skills)
Votes: 0 · Installs: 320.2k · Stars: 964
Unified access to Azure blob storage, file shares, queues, tables, and data lake services.

entra-app-registration (microsoft/azure-skills)
Votes: 0 · Installs: 320k · Stars: 964
Microsoft Entra ID app registration, OAuth 2.0 configuration, and MSAL integration for secure application authentication.

azure-resource-visualizer (microsoft/azure-skills)
Votes: 0 · Installs: 319.7k · Stars: 964
Transform Azure resource groups into detailed architecture diagrams showing resource relationships and configurations.

azure-aigateway (microsoft/azure-skills)
Votes: 0 · Installs: 319.7k · Stars: 964
Configure Azure API Management as an AI Gateway for models, MCP tools, and agents with built-in governance policies.

remotion-best-practices (remotion-dev/skills)
Votes: 0 · Installs: 312.3k · Stars: 3.2k
Domain-specific knowledge base for building videos with Remotion and React.