If you're running SageMaker HyperPod clusters on EKS or Slurm and hit creation failures, node replacement issues, or the dreaded "EFA health checks did not run successfully" error, this walks you through the entire diagnostic tree. It maps CloudFormation nested stack failures, capacity problems, lifecycle script timeouts, kubectl auth gaps, and autoscaler conflicts to specific remediation steps. The diagnostic script is read-only and surfaces findings with priority tags and direct pointers to runbook sections. The workflow is sensible: it asks whether your infrastructure is IaC-managed before suggesting any state-changing commands, which saves you from drifting your Terraform stack. Includes a pre-flight validation mode that catches subnet and security group misconfigurations before you create the cluster.
npx -y skills add awslabs/agent-plugins --skill hyperpod-cluster-debugger --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
sickn33/antigravity-awesome-skills
kubesphere/kubesphere
supercent-io/skills-template