Handles the mechanical work of SRE: calculating error budgets from SLO targets, writing multiwindow burn-rate alerts in Prometheus, and automating toil away with remediation scripts. Ships with concrete examples such as golden-signal PromQL queries and a Python script that restarts pods when error rates breach thresholds. The workflow enforces verification checkpoints before each step proceeds, such as confirming that SLO targets reflect actual user expectations and validating end-to-end recovery before marking a chaos experiment complete. Use it when you need structured output for monitoring configs and runbooks rather than abstract reliability advice. The constraint list is opinionated about what counts as bad SRE practice, which keeps the output focused on actionable patterns.
npx skills add https://github.com/jeffallan/claude-skills --skill sre-engineer
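
The error-budget and burn-rate arithmetic mentioned in the description can be sketched roughly as follows. This is an illustrative sketch, not the skill's own code: the function names are hypothetical, and the 14.4 paging threshold is the commonly cited value from the Google SRE Workbook (2% of a 30-day budget consumed in one hour), which the skill may or may not use.

```python
# Illustrative sketch only: function names and the default threshold are
# assumptions, not taken from the sre-engineer skill itself.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multiwindow check: fire only if BOTH windows exceed the burn-rate threshold.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"30-day budget at 99.9%: {budget:.1f} minutes")
    # 2% errors in both windows against a 99.9% SLO is a burn rate of about 20.
    print("page:", should_page(0.02, 0.02, 0.999))
    # The short window has recovered, so the alert should not fire.
    print("page:", should_page(0.02, 0.0005, 0.999))
```

For a 99.9% target over 30 days the budget works out to about 43.2 minutes. Requiring both windows to exceed the threshold is what lets a multiwindow alert reset quickly once the error rate recovers, instead of continuing to page on stale long-window data.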