When your distributed training job hangs at AllReduce or you see EFA TCP fallback warnings, this is the skill to reach for. It runs a read-only diagnostic script that probes your HyperPod cluster (EKS or Slurm) via AWS APIs, kubectl, and SSM, then prints every issue with a direct pointer to the fix: security group missing self-references, NCCL version mismatches across pods, memlock limits, NetworkPolicy blocks, GPU OOM versus container OOM. The workflow is strict about not running destructive commands without your approval, which matters when training state is on the line. It won't help with single-node hardware faults or cluster creation failures, but for multi-node NCCL and rendezvous problems it cuts straight to root cause.
npx -y skills add awslabs/agent-plugins --skill hyperpod-nccl --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
sickn33/antigravity-awesome-skills
moizibnyousaf/ai-agent-skills
github/awesome-copilot