CLAUDE CODE MARKETPLACES

Built for the Claude Code community with Claude Code by @mertduzgun

Independent project, not affiliated with Anthropic

Evaluating Code Models

Editor's Note

This skill wraps the BigCode Evaluation Harness, which benchmarks code generation models on 15+ standardized tasks, including HumanEval, MBPP, and MultiPL-E (which covers 18 programming languages). Reach for it when you need to compare models systematically rather than on vibes, or when you're fine-tuning a code model and want real numbers on whether your changes actually help. The harness is widely used in research (9.2K GitHub stars), so your results can be compared against published numbers as long as you match the published evaluation settings. Setup requires cloning the repo and running accelerate config, so it's not as plug-and-play as some skills; in exchange, you get the full evaluation suite that serious ML teams use.
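
For context on what those "real numbers" usually look like: HumanEval and MBPP results are conventionally reported as pass@k, the probability that at least one of k sampled completions passes the unit tests. The sketch below is a minimal standalone version of the unbiased estimator introduced with HumanEval. It is not the harness's own code, and the function name pass_at_k and its argument names are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of completions sampled for the problem
    c: number of those completions that passed the tests
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark score is then the mean over all problems, for example mean_pass_at_k([(10, 3), (10, 0)], k=1) for a hypothetical two-problem run with 10 samples each.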

Install

npx skills add https://github.com/orchestra-research/ai-research-skills --skill evaluating-code-models

Votes: 0
Installs: 245
GitHub Stars: 9.2k
Categories: AI & Agent Building, Data Science & ML
First Seen: Jun 3, 2026

Related AI & Agent Building Skills

  • langchain-rag (langchain-ai/langchain-skills): Complete RAG pipeline for document ingestion, embedding, retrieval, and LLM-powered response generation. Votes: 0, Installs: 6.9k, GitHub Stars: 689.
  • agentic-eval (github/awesome-copilot): Iterative evaluation and refinement patterns for improving AI agent outputs through self-critique loops. Votes: 0, Installs: 9.3k, GitHub Stars: 33.1k.
  • ai-prompt-engineering-safety-review (github/awesome-copilot): Comprehensive safety analysis and improvement framework for AI prompts with detailed assessment methodologies. Votes: 0, Installs: 9.3k, GitHub Stars: 33.1k.
  • finalize-agent-prompt (github/awesome-copilot): Polish and refine agent prompt files against proven best practices. Votes: 0, Installs: 8.6k, GitHub Stars: 33.1k.
  • mcp-deploy-manage-agents (github/awesome-copilot): Deploy and manage MCP-based declarative agents across Microsoft 365 with admin center governance, role-based access, and organizational distribution. Votes: 0, Installs: 8.5k, GitHub Stars: 33.1k.
  • mcp-create-declarative-agent (github/awesome-copilot): Scaffold a declarative agent for Microsoft 365 Copilot integrated with an MCP server. Votes: 0, Installs: 8.4k, GitHub Stars: 33.1k.