A read-only diagnostic that walks you through Slurm scheduler and daemon problems on HyperPod clusters. It covers the usual suspects: nodes stuck in down or drain, jobs pending forever despite idle capacity, GPU counts wrong, slurmctld unresponsive, and the "Node unexpectedly rebooted" message that shows up after auto-repair. The procedure runs a bash script that checks slurmd status, disk usage, RPC layer health, and controller state, then maps findings to the official AWS troubleshooting guide without printing any remediation commands. You still have to run fixes yourself. Requires the SSM plugin and unbuffer, which is non-negotiable because start-session will silently fail otherwise. Good for teams managing Slurm on HyperPod who need structured diagnosis before escalating or running state-changing commands.
npx -y skills add awslabs/agent-plugins --skill hyperpod-slurm-debugger --agent claude-codeInstalls into .claude/skills of the current project.
Select a file.
supercent-io/skills-template
supercent-io/skills-template
huangjia2019/claude-code-engineering
remotion-dev/remotion
reactjs/react.dev