Release It

wondelai/skills
1.9k installs · 1.2k stars
Summary

This is a comprehensive framework for building production-ready systems based on Michael Nygard's "Release It!" principles. It covers stability anti-patterns like cascading failures and blocked threads, then counters them with patterns like circuit breakers, bulkheads, and timeouts with specific implementation guidance. Use it when designing resilient microservices, investigating outages, or planning zero-downtime deployments. The scoring system (0-10) helps you systematically evaluate production readiness. What stands out is the depth: it doesn't just say "use a circuit breaker," it explains threshold tuning, state machines, and why slow responses are worse than failures. If you're moving something to production or debugging why your system falls over under load, this gives you the playbook.

Install to Claude Code

npx -y skills add wondelai/skills --skill release-it --agent claude-code

Installs into .claude/skills of the current project.

Files
SKILL.md

Release It! Framework

Framework for designing, deploying, and operating production-ready software. The software that passes QA is not the software that survives production — production is hostile, and systems must expect and handle failure at every level.

Core Principle

Every system will eventually be pushed beyond its design limits. The question is not whether failures happen, but whether your system degrades gracefully or collapses catastrophically. Production-ready software is not just correct — it is resilient, observable, and operates through partial failures without human intervention.

Scoring

Goal: 10/10. When reviewing or creating production systems, rate them 0-10 against the principles below — 10/10 means full alignment, lower scores indicate gaps. Always give the current score and the specific improvements needed to reach 10/10.

The Release It! Framework

Six areas that determine whether software survives contact with production:

1. Stability Anti-Patterns

Core concept: Failures propagate through integration points and cascade across system boundaries. The most dangerous patterns are not bugs in your code — they are emergent behaviors when systems interact under stress.

Why it works: Every production outage traces back to one or more of these predictable, recurring patterns; recognizing them lets you eliminate the cracks before production traffic finds them.

Key insights:

  • Integration points are the number-one killer — every socket, HTTP call, or queue is a risk
  • Slow responses are worse than no response: they tie up threads, exhaust pools, and propagate delay up the call chain
  • Unbounded result sets turn a harmless query into an out-of-memory crash once data outgrows test assumptions
  • Users generate load no test predicts — bots, retry storms, flash crowds; self-denial attacks happen when your own marketing overwhelms your infrastructure
  • Blocked threads are the silent killer — deadlocks and contention show no errors until everything stops

Code applications:

| Context | Pattern | Example |
|---|---|---|
| HTTP calls | Assume every remote call can fail, hang, or return garbage | Wrap all external calls with timeout + circuit breaker |
| Database queries | Enforce result set limits | Add LIMIT; paginate all list endpoints |
| Thread pools | Isolate pools per dependency | Separate pool for payment gateway vs. search |
| Marketing events | Coordinate launches with capacity planning | Pre-scale before Black Friday; queue coupon redemptions |

See: references/anti-patterns.md for each anti-pattern with failure scenarios and detection strategies.
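
As an illustration of the "enforce result set limits" row, here is a minimal sketch of keyset pagination with a hard page-size cap. It uses Python's standard `sqlite3` module; the `items` table, `MAX_PAGE_SIZE`, `fetch_page`, and `make_demo_db` are hypothetical names chosen for the example, not part of the skill.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # hypothetical hard cap; tune per endpoint


def fetch_page(conn, after_id=0, page_size=50):
    """Return at most MAX_PAGE_SIZE rows with id > after_id, plus the next cursor."""
    limit = max(1, min(page_size, MAX_PAGE_SIZE))  # clamp the caller-supplied size
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    # A full page suggests more rows may exist; a short page means we are done.
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor


def make_demo_db(n_rows):
    """Build an in-memory table with n_rows rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items (id, name) VALUES (?, ?)",
        [(i, f"item-{i}") for i in range(1, n_rows + 1)],
    )
    return conn
```

Even if a caller asks for 10,000 rows, the query never returns more than the cap, so memory use stays bounded as the table grows past test-data sizes.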

2. Stability Patterns

Core concept: Counter each anti-pattern with a stability pattern: circuit breakers stop cascades, bulkheads isolate blast radius, timeouts reclaim stuck resources. Together they make a system bend under load instead of breaking.

Why it works: These patterns accept failure as inevitable and design the response to it — a circuit breaker that trips is the system working correctly, protecting itself from a downstream failure.

Key insights:

  • Circuit Breaker: three states (closed, open, half-open) — trips after threshold failures, periodically tests recovery
  • Timeouts: every outbound call needs connect AND read timeouts, with the remaining deadline propagated to downstream calls
  • Retry with exponential backoff + jitter prevents thundering herd on recovery
  • Fail Fast: reject requests you know will fail instead of wasting resources; Handshaking lets the server decline work before it's sent
  • Steady State: systems accumulate cruft (logs, sessions, temp files) — design automatic cleanup
  • Let It Crash: a clean restart often beats limping along in an unknown state
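
The retry insight above can be sketched as follows. This is one possible shape, assuming "full jitter" (a uniform random delay up to an exponentially growing ceiling); `backoff_delays` and `call_with_retries` are hypothetical helper names, and the fleet-wide retry budget mentioned elsewhere in this document is not modeled here.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, max_retries=3, rng=random.random):
    """Yield one sleep duration per retry: exponential backoff with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (factor ** attempt))
        yield rng() * ceiling  # uniform in [0, ceiling)


def call_with_retries(fn, max_retries=3, sleep=time.sleep, **backoff_kwargs):
    """Call fn(); on any exception retry up to max_retries times, then re-raise."""
    delays = backoff_delays(max_retries=max_retries, **backoff_kwargs)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

Randomizing each delay spreads clients out so they do not all retry at the same instant when a dependency recovers. In real code you would likely retry only on errors known to be transient and only for idempotent operations.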

Code applications:

| Context | Pattern | Example |
|---|---|---|
| Service calls | Circuit Breaker | Open after 5 failures in 60s; half-open after 30s |
| Resource isolation | Bulkhead | Dedicated connection pools for critical vs. non-critical |
| Network calls | Timeout with propagation | Connect 1s, read 5s; propagate deadline downstream |
| Retries | Backoff + jitter + budget | Base 100ms, max 3 retries, 20% fleet retry budget |
| Data cleanup | Steady State | Purge sessions >24h; rotate logs at 500MB |

See: references/stability-patterns.md for state machines, threshold tuning, and pattern combinations.
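
To make the three-state circuit breaker concrete, here is a minimal single-threaded sketch using the example thresholds from the table (5 failures in 60s, half-open after 30s). The class and method names are hypothetical, the referenced file may describe a different design, and a production version would need locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window=60.0, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window                # seconds over which failures are counted
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self.clock = clock
        self.state = "closed"
        self.failures = []                  # timestamps of recent failures
        self.opened_at = None

    def call(self, fn):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; dependency presumed down")
            self.state = "half-open"        # let one trial call through
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            raise
        if self.state == "half-open":       # trial succeeded: close the circuit
            self.state = "closed"
            self.failures = []
        return result

    def _on_failure(self, now):
        if self.state == "half-open":
            self._trip(now)                 # trial failed: reopen immediately
            return
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures = []
```

While open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is likely to time out, which is the fail-fast behavior the pattern is meant to provide.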

3. Capacity and Availability

Core concept: Capacity is not one number — it is a multi-dimensional function of CPU, memory, network, disk I/O, connection pools, and threads. Capacity planning means knowing which resource bottlenecks first, and at what load.

Why it works: Untested systems fail at peak load — the worst possible moment. Knowing actual (not theoretical) limits lets you set realistic SLAs and scale before users hit the wall.

Key insights:

  • Test taxonomy: load test (expected traffic), stress test (beyond limits), soak test (sustained, catches leaks), spike test (sudden bursts)
  • Universal Scalability Law: throughput rarely scales linearly; contention and coherency costs cause diminishing returns and, past a peak, can make throughput fall as you add capacity
  • Pool exhaustion looks identical to a database outage from the application's perspective; size pools from measured concurrency, not defaults
  • "The cloud is infinitely scalable" is a myth — auto-scaling has lag, cold starts, and hard limits

Code applications:

| Context | Pattern | Example |
|---|---|---|
| Load testing | Ramp to peak, then 2x, observe degradation | Increase RPS until latency exceeds SLO |
| Connection pools | Size from measured concurrency | Set pool to P99 active connections + 20% headroom |
| Soak testing | 80% capacity for 24-72 hours | Catch memory/connection/file-handle leaks |
| Capacity model | Document bottleneck per service | "Service X is memory-bound at 2000 RPS; 4GB per instance" |
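
The "P99 active connections + 20% headroom" rule is simple arithmetic. A small sketch, assuming you already collect periodic samples of in-use connections (the function names and the nearest-rank percentile method are choices made for this example):

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommended_pool_size(active_connection_samples, headroom=0.20):
    """P99 of observed concurrent connections, plus fractional headroom, rounded up."""
    p99 = percentile(active_connection_samples, 99)
    return math.ceil(p99 * (1 + headroom))
```

For samples of 1 through 100 in-use connections, the P99 is 99 and the suggested pool size is 119. Treat the result as a starting point to validate under load, since the database's own connection limit also bounds the pool.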

See: references/capacity-planning.md for testing methodologies, pool management, and scalability modeling.
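
The Universal Scalability Law mentioned in the key insights is commonly written as C(N) = N / (1 + σ(N − 1) + κN(N − 1)), where σ models contention and κ models coherency cost. A short sketch of the formula follows; the coefficient values in the note below are made up for illustration and would normally be fitted from load-test measurements.

```python
import math


def usl_relative_capacity(n, sigma, kappa):
    """Relative capacity C(N) at N workers or nodes under the USL."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_nodes(sigma, kappa):
    """N at which C(N) peaks; beyond it, adding capacity lowers throughput."""
    if kappa <= 0:
        return math.inf  # no coherency cost: capacity keeps growing, just more slowly
    return math.sqrt((1 - sigma) / kappa)
```

With hypothetical values σ = 0.05 and κ = 0.001 the peak is near 31 nodes, and going from 31 to 64 nodes reduces relative capacity rather than increasing it.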

4. Deployment and Release

Core concept: Deployment (putting code on servers) and release (exposing it to users) are separate operations that should be decoupled — deploy without risk, release with confidence.

Why it works: Most outages are caused by changes. Decoupling lets you deploy to production, verify, and only then route traffic; if something breaks, you roll back the release, not the deployment.

Key insights:

  • Zero-downtime deployment is non-negotiable: rolling, blue-green, or canary
  • Feature flags dark-launch code and enable it independently of deployment
  • Database migrations must be backward-compatible — old and new code run simultaneously during deploys (expand-contract)
  • Immutable infrastructure: never patch a running server — build a new image, deploy, destroy the old
  • Rollback must be faster than roll-forward; if rollback takes 30 minutes, you will avoid deploying
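
One common way to release independently of deployment is a percentage rollout flag with a kill switch. The sketch below assumes a stable hash of the flag name and user ID decides membership; `rollout_bucket` and `is_enabled` are hypothetical names, and a real system would read the percentage and kill switch from a flag service or config store.

```python
import hashlib


def rollout_bucket(flag_name, user_id):
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name, user_id, rollout_percent, kill_switch=False):
    """True if this user falls inside the rollout and the flag is not killed."""
    if kill_switch:
        return False  # emergency off: overrides any rollout percentage
    return rollout_bucket(flag_name, user_id) < rollout_percent
```

Because the bucket is deterministic, a user who sees the feature at 10% keeps seeing it as the rollout ramps to 50% and 100%, and flipping the kill switch turns it off without a redeploy.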

Code applications:

| Context | Pattern | Example |
|---|---|---|
| Deploys | Blue-green with health check gate | Deploy to green; smoke test; swap router |
| Progressive rollout | Canary with automated rollback | 5% traffic to canary; auto-rollback if error rate >1% |
| Feature launch | Flags with emergency off switch | Ship behind flag; enable for 10%; monitor; ramp |
| Schema changes | Expand-contract migration | Add column; write both; backfill; drop old |

See: references/deployment-strategies.md for deployment patterns, migration strategies, and infrastructure-as-code.
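
The canary row ("auto-rollback if error rate >1%") boils down to a small decision function evaluated per observation window. A sketch under stated assumptions: `WindowStats`, `canary_decision`, and the `min_requests` guard are hypothetical, and a real gate would usually also compare the canary against the baseline fleet.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(canary, max_error_rate=0.01, min_requests=500):
    """Return 'wait', 'rollback', or 'promote' for one observation window."""
    if canary.requests < min_requests:
        return "wait"      # too little traffic to judge either way
    if canary.error_rate > max_error_rate:
        return "rollback"  # error rate above the threshold: pull the release
    return "promote"
```

The minimum-traffic guard keeps a handful of early errors from triggering a rollback, or a handful of early successes from triggering a promotion, before the sample is meaningful.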

5. Health Checks and Observability

Core concept: You cannot operate what you cannot observe. Health checks, metrics, logs, and traces are the sensory organs of your system in production — a first-class design concern, not an afterthought.

Why it works: Production systems fail invisibly without instrumentation. Done right, observability answers questions about your system that you did not anticipate at design time.

Key insights:

  • Health checks come in two flavors: shallow (process alive) and deep (dependencies reachable, resources available)
  • Three pillars: structured logs (what happened), metrics (how much), distributed traces (where and how long)
  • RED method for services: Rate, Errors, Duration; USE method for resources: Utilization, Saturation, Errors
  • Define SLIs (measure user experience) → SLOs (targets) → SLAs (contracts), in that order
  • Alert on symptoms users feel (error rate, latency), not causes (CPU); dashboards should answer "is the system healthy?" within 5 seconds
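
Alerting on symptoms is often expressed as an error-budget burn rate: the observed error rate divided by the budget the SLO allows. A minimal sketch, assuming request-based SLIs; the two-window check and the default threshold of 10 are illustrative choices, not values prescribed by the skill.

```python
def burn_rate(errors, requests, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(short_window_burn, long_window_burn, threshold=10.0):
    """Page only when both windows burn fast, which filters out brief blips."""
    return short_window_burn >= threshold and long_window_burn >= threshold
```

With a 99.9% SLO, 100 errors in 10,000 requests is a 1% error rate, roughly ten times the 0.1% budget, which is the kind of "burning 10x" signal worth paging on regardless of what CPU is doing.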

Code applications:

| Context | Pattern | Example |
|---|---|---|
| Health endpoints | Deep health check | /health reports DB, cache, queue, disk status |
| Service metrics | RED instrumentation | Rate, error rate, p50/p95/p99 latency per endpoint |
| Distributed tracing | Propagate trace context | Trace ID in headers; correlate logs across services |
| Alerting | SLO burn rate, not raw thresholds | "Error budget burning 10x" vs. "CPU > 80%" |

See: references/observability.md for health check design, SLO frameworks, and alerting strategies.
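
A deep health check can be sketched as a function that runs one probe per dependency and aggregates the results into a status code and report. The probe callables here are placeholders you would supply (for example, a `SELECT 1` against the database); `run_health_checks` is a hypothetical name, and each probe is assumed to enforce its own timeout.

```python
import time


def run_health_checks(checks):
    """Run each named probe; return (http_status, report) for a /health endpoint."""
    report = {}
    healthy = True
    for name, probe in checks.items():
        started = time.monotonic()
        try:
            probe()  # a probe signals failure by raising
            status = "ok"
        except Exception as exc:
            status = f"fail: {exc}"
            healthy = False
        report[name] = {
            "status": status,
            "latency_ms": round((time.monotonic() - started) * 1000, 1),
        }
    return (200 if healthy else 503), report
```

Returning 503 when any critical dependency fails lets the load balancer stop routing traffic to an instance whose process is alive but cannot do useful work.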

6. Adaptation and Chaos Engineering

Safety note: Chaos engineering experiments are design-time planning activities. The patterns below describe what to test and what to verify, not actions for an AI agent to execute autonomously. All failure injection must be performed by authorized engineers using dedicated tooling (e.g., Gremlin, Litmus, AWS FIS) with proper approvals, rollback plans, and blast radius controls in place.

Core concept: Confidence in resilience comes from testing under realistic failure conditions. Chaos engineering is the practice of experimenting on a system in a controlled way to build confidence that it can withstand turbulent conditions.

Why it works: You cannot know how a system handles failure until it actually fails; controlled injection turns unknown-unknowns into known-knowns before they cause real outages.

Key insights:

  • Define steady state first — you need a measurable baseline to detect deviation
  • Every experiment has a hypothesis: "We believe that when X fails, the system will Y"
  • Start small in non-production (kill one process, add latency to one call), then escalate gradually with approvals
  • Minimize blast radius: canary populations, feature flags, emergency stop; production experiments require explicit authorization and instant rollback
  • Automate recurring experiments; GameDay exercises test both the system and the team
  • Build a culture where finding weaknesses is celebrated, not punished

Code applications:

| Context | Pattern | Example |
|---|---|---|
| Process failure | Controlled termination via chaos tooling | Kill one pod with Gremlin/Litmus; verify recovery within SLO |
| Network failure | Inject latency/partition via chaos tooling | +500ms on DB calls; verify circuit breaker trips |
| Dependency failure | Simulate downstream outage via chaos tooling | Return 503 from payment API; verify graceful degradation |
| GameDay | Scheduled team exercise | "Primary DB goes read-only at 2pm"; practice the response |

See: references/chaos-engineering.md for experiment design, blast radius management, and building the practice.
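
Before any real experiment, the dependency-failure hypothesis can be rehearsed at unit-test level with a test double, which touches no running system. The sketch below is such a design-time check, not failure injection: `checkout`, `PaymentUnavailable`, and `failing_charge` are hypothetical names standing in for a service that should degrade gracefully when a payment API returns 503.

```python
class PaymentUnavailable(Exception):
    """Stands in for an HTTP 503 from a hypothetical payment API."""


def checkout(cart_total, charge, enqueue_for_later):
    """Charge now; if the payment dependency is down, queue the order instead."""
    try:
        receipt = charge(cart_total)
        return {"status": "paid", "receipt": receipt}
    except PaymentUnavailable:
        enqueue_for_later(cart_total)  # degrade gracefully rather than fail the order
        return {"status": "pending", "receipt": None}


def failing_charge(_amount):
    """Test double that simulates the downstream outage."""
    raise PaymentUnavailable("503 Service Unavailable")
```

The hypothesis "when the payment API fails, orders are queued as pending" becomes an assertion in a test. Only once that holds in isolation does it make sense for authorized engineers to verify it with chaos tooling in a shared environment.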

Common Mistakes

| Mistake | Why It Fails | Fix |
|---|---|---|
| No timeouts on outbound calls | One slow dependency freezes the system | Connect and read timeouts on every external call |
| Unbounded retries | Retry storms amplify failures | Exponential backoff, jitter, fleet-wide retry budgets |
| Shared thread/connection pools | One failing dependency drains everything | Bulkhead: isolate pools per dependency |
| Shallow health checks only | Traffic routed to instances with broken dependencies | Deep health checks that verify downstream connectivity |
| Testing only the happy path | Works perfectly until the first real failure | Load, soak, and chaos test before major releases |
| Coupling deploy and release | Every deployment is all-or-nothing high risk | Feature flags, canary, blue-green |
| Alerting on causes, not symptoms | CPU alerts fire while users suffer silently | Alert on user-facing SLIs: errors, latency, availability |
| No capacity model | System falls over at 2x load | Model bottlenecks; load test to 3x expected peak |

Quick Diagnostic

Audit any production system:

| Question | If No | Action |
|---|---|---|
| Does every outbound call have a timeout? | Calls hang, blocking threads | Add connect and read timeouts everywhere |
| Are circuit breakers on critical dependencies? | One failure takes down the system | Add breakers with tuned thresholds |
| Are pools isolated per dependency? | Failures cross-contaminate | Implement bulkheads with dedicated pools |
| Can you deploy without downtime? | Deployments cause outages | Rolling, blue-green, or canary deployment |
| Do health checks verify dependencies? | Dead instances receive traffic | Deep health checks testing DB, cache, queue |
| Are logs, metrics, and traces correlated? | Debugging means manual log searches | Distributed tracing with correlated IDs |
| Have you load-tested beyond expected peak? | Unknown failure mode under real load | Test to 2-3x peak; document the breaking point |
| Do you practice failure injection? | Resilience is theoretical | Start chaos engineering with low-risk experiments |

Reference Files

  • anti-patterns.md: Integration point failures, cascading failures, blocked threads, unbounded result sets, self-denial attacks, slow responses
  • stability-patterns.md: Circuit Breaker, Bulkhead, Timeout, Retry, Fail Fast, Steady State, Let It Crash, Handshaking
  • capacity-planning.md: Load/stress/soak testing, connection pool sizing, thread pool tuning, Universal Scalability Law
  • deployment-strategies.md: Blue-green, canary, rolling deploys, feature flags, database migrations, immutable infrastructure
  • observability.md: Health checks, RED/USE methods, SLIs/SLOs/SLAs, distributed tracing, alerting strategy
  • chaos-engineering.md: Steady state hypothesis, failure injection, GameDay exercises, blast radius management

Further Reading

For the complete methodology, war stories, and implementation details:

  • "Release It! Design and Deploy Production-Ready Software" (2nd Edition) by Michael T. Nygard

About the Author

Michael T. Nygard is a software architect with 30+ years building and operating large-scale production systems handling millions of transactions per day. Release It! (2007; 2nd edition 2018) became a foundational text of the DevOps and site reliability engineering movements, arguing that architects must stay responsible for systems long after the code is written.

Categories: Release Management
First Seen: Apr 16, 2026
