Built for the Claude Code community with Claude Code by @mertduzgun

Independent project, not affiliated with Anthropic
Evaluation Harness

patricio0312rev/skills
137 installs, 43 stars
Summary

This skill gives you the scaffolding to build repeatable LLM evaluations with golden datasets, scoring rubrics, and regression tracking. You define test cases with expected outputs, run your model against them, score the outputs with exact match, semantic similarity, or LLM-as-judge, then check whether the results meet your thresholds. The regression report compares a baseline run to the current run so you catch when a prompt tweak breaks something that used to work. It's structured enough to drop into CI but leaves the scoring functions flexible. Best for teams that have moved past vibes-based testing and need systematic quality gates before shipping model changes.
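To make the workflow above concrete, here is a minimal Python sketch of the loop it describes: golden cases, a scoring function, a threshold gate, and a baseline-versus-current regression check. It is not the skill's actual code. Every name in it (EvalCase, exact_match, run_eval, passes_threshold, regressions) and the two stand-in "models" are hypothetical, and only exact-match scoring is shown; a semantic-similarity or LLM-as-judge scorer would slot in as another function with the same signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One golden-dataset entry (hypothetical shape, not the skill's schema)."""
    case_id: str
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    cases: List[EvalCase],
    model: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run the model on every case and return a score per case id."""
    return {c.case_id: scorer(model(c.prompt), c.expected) for c in cases}


def passes_threshold(scores: Dict[str, float], min_mean: float) -> bool:
    """Quality gate: the mean score must reach min_mean (empty runs fail)."""
    if not scores:
        return False
    return sum(scores.values()) / len(scores) >= min_mean


def regressions(baseline: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Case ids whose score dropped relative to the baseline run."""
    return sorted(cid for cid, s in baseline.items() if current.get(cid, 0.0) < s)


# Golden dataset and two stand-in "models" (plain lookups, assumed for the demo).
golden = [
    EvalCase("capital-fr", "Capital of France?", "Paris"),
    EvalCase("two-plus-two", "2 + 2 = ?", "4"),
]


def baseline_model(prompt: str) -> str:
    return {"Capital of France?": "Paris", "2 + 2 = ?": "4"}[prompt]


def tweaked_model(prompt: str) -> str:
    # Simulates a prompt tweak that breaks one previously passing case.
    return {"Capital of France?": "paris ", "2 + 2 = ?": "four"}[prompt]


baseline_scores = run_eval(golden, baseline_model)
current_scores = run_eval(golden, tweaked_model)

gate_ok = passes_threshold(current_scores, min_mean=0.9)
broken = regressions(baseline_scores, current_scores)
```

In a CI job, a failed gate or a non-empty regression list would typically be turned into a non-zero exit code so the change is blocked before it ships.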

Install

npx skills add https://github.com/patricio0312rev/skills --skill evaluation-harness
First Seen: Jun 3, 2026
