When a single node in your HyperPod cluster goes down or starts acting weird, this is your triage tool. It walks you through diagnosing GPU hardware failures (XIDs, ECC errors, row-remap issues), EFA and security group misconfigurations, Slurm node states, disk pressure, SSM agent problems, and pod networking. The workflow is read-only: it runs diagnostics, points you to the relevant section in its reference docs, then shows you the exact commands to fix it. You approve each destructive action yourself. It explicitly won't touch cluster-wide problems, NCCL tuning, or model flop utilization, which have their own dedicated skills. If you're managing p5 or Trn instances at scale and one node is misbehaving, this gives you a structured checklist instead of guessing.
npx -y skills add awslabs/agent-plugins --skill hyperpod-node-debugger --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot