This is a comprehensive framework for building production-ready systems based on Michael Nygard's "Release It!" principles. It covers stability anti-patterns like cascading failures and blocked threads, then counters them with patterns like circuit breakers, bulkheads, and timeouts with specific implementation guidance. Use it when designing resilient microservices, investigating outages, or planning zero-downtime deployments. The scoring system (0-10) helps you systematically evaluate production readiness. What stands out is the depth: it doesn't just say "use a circuit breaker," it explains threshold tuning, state machines, and why slow responses are worse than failures. If you're moving something to production or debugging why your system falls over under load, this gives you the playbook.
npx -y skills add wondelai/skills --skill release-it --agent claude-code
Installs into .claude/skills of the current project.
Framework for designing, deploying, and operating production-ready software. The software that passes QA is not the software that survives production — production is hostile, and systems must expect and handle failure at every level.
Every system will eventually be pushed beyond its design limits. The question is not whether failures happen, but whether your system degrades gracefully or collapses catastrophically. Production-ready software is not just correct — it is resilient, observable, and operates through partial failures without human intervention.
Goal: 10/10. When reviewing or creating production systems, rate them 0-10 against the principles below — 10/10 means full alignment, lower scores indicate gaps. Always give the current score and the specific improvements needed to reach 10/10.
Six areas that determine whether software survives contact with production:
Core concept: Failures propagate through integration points and cascade across system boundaries. The most dangerous patterns are not bugs in your code — they are emergent behaviors when systems interact under stress.
Why it works: Every production outage traces back to one or more of these predictable, recurring patterns; recognizing them lets you eliminate the cracks before production traffic finds them.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| HTTP calls | Assume every remote call can fail, hang, or return garbage | Wrap all external calls with timeout + circuit breaker |
| Database queries | Enforce result set limits | Add LIMIT; paginate all list endpoints |
| Thread pools | Isolate pools per dependency | Separate pool for payment gateway vs. search |
| Marketing events | Coordinate launches with capacity planning | Pre-scale before Black Friday; queue coupon redemptions |
See: references/anti-patterns.md for each anti-pattern with failure scenarios and detection strategies.
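As a rough illustration of the "timeout on every call" and "pool per dependency" rows above, the sketch below gives each dependency its own thread pool and bounds how long the caller waits. The names `POOLS` and `call_dependency`, the dependency names, and the pool sizes are all assumptions made for this example, not part of the skill.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical dependencies and pool sizes, chosen only for illustration.
# A slow payment gateway can exhaust its own 4 workers but never the
# workers reserved for search.
POOLS = {
    "payment": ThreadPoolExecutor(max_workers=4, thread_name_prefix="payment"),
    "search": ThreadPoolExecutor(max_workers=2, thread_name_prefix="search"),
}


def call_dependency(name, fn, *args, timeout_s=1.0, fallback=None):
    """Run fn on the pool reserved for `name`; stop waiting after timeout_s."""
    future = POOLS[name].submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task is still queued. A task that is
        # already running keeps its worker until it returns, so fn should
        # also set its own connect/read timeouts.
        future.cancel()
        return fallback
```

The timeout here frees the caller, not the worker thread. That is why the pools are kept separate: a hung dependency can occupy only its own workers.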
Core concept: Counter each anti-pattern with a stability pattern: circuit breakers stop cascades, bulkheads isolate blast radius, timeouts reclaim stuck resources. Together they make a system bend under load instead of breaking.
Why it works: These patterns accept failure as inevitable and design the response to it — a circuit breaker that trips is the system working correctly, protecting itself from a downstream failure.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Service calls | Circuit Breaker | Open after 5 failures in 60s; half-open after 30s |
| Resource isolation | Bulkhead | Dedicated connection pools for critical vs. non-critical |
| Network calls | Timeout with propagation | Connect 1s, read 5s; propagate deadline downstream |
| Retries | Backoff + jitter + budget | Base 100ms, max 3 retries, 20% fleet retry budget |
| Data cleanup | Steady State | Purge sessions >24h; rotate logs at 500MB |
See: references/stability-patterns.md for state machines, threshold tuning, and pattern combinations.
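The circuit breaker row above ("open after 5 failures in 60s; half-open after 30s") can be sketched as a small state machine. This is a minimal, single-threaded illustration: the class name, parameter names, and the injectable `clock` are assumptions for the example. A production breaker would also need locking and a limit on concurrent half-open probes.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=60.0,
                 reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None    # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure(state)
            raise
        if state == "half-open":
            # The probe succeeded: close the breaker and forget old failures.
            self.failures.clear()
            self.opened_at = None
        return result

    def _record_failure(self, state):
        now = self.clock()
        if state == "half-open":
            self.opened_at = now  # probe failed: re-open and restart the timer
            return
        self.failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now
```

While the breaker is open, callers get `CircuitOpenError` immediately instead of waiting on a dependency that is known to be failing, which is the "fail fast" behavior the pattern is meant to provide.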
Core concept: Capacity is not one number — it is a multi-dimensional function of CPU, memory, network, disk I/O, connection pools, and threads. Capacity planning means knowing which resource bottlenecks first, and at what load.
Why it works: Untested systems fail at peak load — the worst possible moment. Knowing actual (not theoretical) limits lets you set realistic SLAs and scale before users hit the wall.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Load testing | Ramp to peak, then 2x, observe degradation | Increase RPS until latency exceeds SLO |
| Connection pools | Size from measured concurrency | Set pool to P99 active connections + 20% headroom |
| Soak testing | 80% capacity for 24-72 hours | Catch memory/connection/file-handle leaks |
| Capacity model | Document bottleneck per service | "Service X is memory-bound at 2000 RPS; 4GB per instance" |
See: references/capacity-planning.md for testing methodologies, pool management, and scalability modeling.
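The pool-sizing row above ("P99 active connections + 20% headroom") reduces to a small calculation over measured samples. The sketch below assumes you already collect periodic samples of active connection counts; the function names and the nearest-rank percentile method are choices made for this example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recommended_pool_size(active_samples, pct=99, headroom_pct=20):
    """Pool size = chosen percentile of observed concurrency plus headroom."""
    observed = percentile(active_samples, pct)
    return math.ceil(observed * (100 + headroom_pct) / 100)
```

For example, if the sampled active-connection counts are the integers 1 through 100, the P99 is 99 and the recommended size is 119. Samples should come from peak periods, since off-peak data would understate the pool that is actually needed.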
Core concept: Deployment (putting code on servers) and release (exposing it to users) are separate operations that should be decoupled — deploy without risk, release with confidence.
Why it works: Most outages are caused by changes. Decoupling lets you deploy to production, verify, and only then route traffic; if something breaks, you roll back the release, not the deployment.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Deploys | Blue-green with health check gate | Deploy to green; smoke test; swap router |
| Progressive rollout | Canary with automated rollback | 5% traffic to canary; auto-rollback if error rate >1% |
| Feature launch | Flags with emergency off switch | Ship behind flag; enable for 10%; monitor; ramp |
| Schema changes | Expand-contract migration | Add column; write both; backfill; switch reads; drop old |
See: references/deployment-strategies.md for deployment patterns, migration strategies, and infrastructure-as-code.
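The canary row above ("auto-rollback if error rate >1%") implies a gate that a rollout controller evaluates repeatedly. A minimal sketch of that decision follows. `CanaryStats`, `canary_decision`, and the `min_requests` floor are assumptions added for illustration; the floor only keeps the gate from deciding on too little traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryStats:
    requests: int  # requests served by the canary so far
    errors: int    # of those, how many failed


def canary_decision(stats, max_error_rate=0.01, min_requests=500):
    """Return 'wait', 'rollback', or 'promote' for the current canary stats."""
    if stats.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    if stats.errors / stats.requests > max_error_rate:
        return "rollback"
    return "promote"
```

In practice the comparison is often made against the baseline fleet's error rate rather than a fixed threshold. The fixed 1% is used here only because it is the figure given in the table.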
Core concept: You cannot operate what you cannot observe. Health checks, metrics, logs, and traces are the sensory organs of your system in production — a first-class design concern, not an afterthought.
Why it works: Production systems fail invisibly without instrumentation. Done right, observability answers questions about your system that you did not anticipate at design time.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Health endpoints | Deep health check | /health reports DB, cache, queue, disk status |
| Service metrics | RED instrumentation | Rate, error rate, p50/p95/p99 latency per endpoint |
| Distributed tracing | Propagate trace context | Trace ID in headers; correlate logs across services |
| Alerting | SLO burn rate, not raw thresholds | "Error budget burning 10x" vs. "CPU > 80%" |
See: references/observability.md for health check design, SLO frameworks, and alerting strategies.
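The "error budget burning 10x" phrasing in the alerting row refers to burn rate: the observed error rate divided by the error rate the SLO allows. A small sketch, with the function name and the default 99.9% target chosen as assumptions for the example:

```python
def burn_rate(errors, requests, slo_target=0.999):
    """How many times faster than allowed the error budget is being spent.

    1.0 means the budget would last exactly the SLO period; 10.0 means it
    would be exhausted in a tenth of that time.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # allowed fraction of failed requests
    return (errors / requests) / error_budget
```

With a 99.9% target, 10 errors in 1,000 requests is a 1% error rate against a 0.1% budget, a burn rate of about 10. An alert on that ratio tracks user impact directly, which a CPU threshold does not.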
Safety note: Chaos engineering experiments are design-time planning activities. The patterns below describe what to test and what to verify, not actions for an AI agent to execute autonomously. All failure injection must be performed by authorized engineers using dedicated tooling (e.g., Gremlin, Litmus, AWS FIS) with proper approvals, rollback plans, and blast radius controls in place.
Core concept: Confidence in resilience comes from testing under realistic failure conditions. Chaos engineering is the practice of experimenting on a system in a controlled way to build confidence that it withstands turbulence.
Why it works: You cannot know how a system handles failure until it actually fails; controlled injection turns unknown-unknowns into known-knowns before they cause real outages.
Key insights:
Code applications:
| Context | Pattern | Example |
|---|---|---|
| Process failure | Controlled termination via chaos tooling | Kill one pod with Gremlin/Litmus; verify recovery within SLO |
| Network failure | Inject latency/partition via chaos tooling | +500ms on DB calls; verify circuit breaker trips |
| Dependency failure | Simulate downstream outage via chaos tooling | Return 503 from payment API; verify graceful degradation |
| GameDay | Scheduled team exercise | "Primary DB goes read-only at 2pm" — practice response |
See: references/chaos-engineering.md for experiment design, blast radius management, and building the practice.
| Mistake | Why It Fails | Fix |
|---|---|---|
| No timeouts on outbound calls | One slow dependency freezes the system | Connect and read timeouts on every external call |
| Unbounded retries | Retry storms amplify failures | Exponential backoff, jitter, fleet-wide retry budgets |
| Shared thread/connection pools | One failing dependency drains everything | Bulkhead: isolate pools per dependency |
| Shallow health checks only | Traffic routed to instances with broken dependencies | Deep health checks that verify downstream connectivity |
| Testing only the happy path | Works perfectly until the first real failure | Load, soak, and chaos test before major releases |
| Coupling deploy and release | Every deployment is all-or-nothing high risk | Feature flags, canary, blue-green |
| Alerting on causes, not symptoms | CPU alerts fire while users suffer silently | Alert on user-facing SLIs: errors, latency, availability |
| No capacity model | System falls over at 2x load | Model bottlenecks; load test to 2-3x expected peak |
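The fix for unbounded retries in the table above (exponential backoff with jitter) can be sketched as follows, using the "base 100ms, max 3 retries" figures given earlier. The function name, the 2-second cap, and the injectable `sleep` and `rng` parameters are assumptions for the example. A fleet-wide retry budget needs shared state and is not modeled here.

```python
import random
import time


def retry_with_backoff(fn, max_retries=3, base_s=0.1, cap_s=2.0,
                       retry_on=(Exception,), sleep=time.sleep,
                       rng=random.random):
    """Call fn, retrying up to max_retries times with full-jitter backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except retry_on:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)).
            delay = rng() * min(cap_s, base_s * (2 ** attempt))
            sleep(delay)
            attempt += 1
```

Randomizing the delay spreads retries from many clients over time so they do not arrive in synchronized waves. Only idempotent operations should be retried this way, and `retry_on` should normally be narrowed to errors known to be transient.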
Audit any production system:
| Question | If No | Action |
|---|---|---|
| Does every outbound call have a timeout? | Calls hang, blocking threads | Add connect and read timeouts everywhere |
| Are circuit breakers on critical dependencies? | One failure takes down the system | Add breakers with tuned thresholds |
| Are pools isolated per dependency? | Failures cross-contaminate | Implement bulkheads with dedicated pools |
| Can you deploy without downtime? | Deployments cause outages | Rolling, blue-green, or canary deployment |
| Do health checks verify dependencies? | Dead instances receive traffic | Deep health checks testing DB, cache, queue |
| Are logs, metrics, and traces correlated? | Debugging means manual log searches | Distributed tracing with correlated IDs |
| Have you load-tested beyond expected peak? | Unknown failure mode under real load | Test to 2-3x peak; document the breaking point |
| Do you practice failure injection? | Resilience is theoretical | Start chaos engineering with low-risk experiments |
For the complete methodology, war stories, and implementation details:
Michael T. Nygard is a software architect with 30+ years building and operating large-scale production systems handling millions of transactions per day. Release It! (2007; 2nd edition 2018) became a foundational text of the DevOps and site reliability engineering movements, arguing that architects must stay responsible for systems long after the code is written.