This skill covers DeepSpeed's distributed training stack for scaling models beyond the memory limits of a single GPU. You get guidance on ZeRO optimization stages for memory efficiency, pipeline parallelism for splitting a model across devices, and mixed precision training with FP16/BF16/FP8. It also covers DeepNVMe for fast checkpoint I/O to NVMe SSDs using asynchronous operations and GPUDirect Storage, plus optimizers such as 1-bit Adam that reduce communication volume. Useful when you're implementing multi-GPU training and need to navigate the configuration options or debug memory issues. Generated from Microsoft's official docs, so it follows their architecture decisions around libaio handles and parallelism patterns.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill deepspeed
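
To give a feel for the configuration options mentioned above, here is a minimal sketch of a DeepSpeed config built as a plain Python dict. The helper name `build_ds_config` and its defaults are hypothetical and not part of the skill or of DeepSpeed; the dictionary keys (`zero_optimization.stage`, `offload_optimizer`, `fp16`/`bf16`, `train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`) follow DeepSpeed's documented config JSON. The sketch does not import DeepSpeed, so it runs anywhere.

```python
import json


def build_ds_config(zero_stage=2, precision="bf16", micro_batch=4,
                    grad_accum=8, offload_optimizer=False):
    """Hypothetical helper: assemble a DeepSpeed-style config dict."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    if precision not in ("fp32", "fp16", "bf16"):
        raise ValueError("precision must be 'fp32', 'fp16', or 'bf16'")

    config = {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        # Stage 1 partitions optimizer state, stage 2 also partitions
        # gradients, stage 3 also partitions the parameters themselves.
        "zero_optimization": {"stage": zero_stage},
    }

    if offload_optimizer:
        if zero_stage == 0:
            raise ValueError("optimizer offload requires ZeRO to be enabled")
        # Move optimizer state to host memory to free GPU memory.
        config["zero_optimization"]["offload_optimizer"] = {"device": "cpu"}

    # Enable at most one reduced-precision mode; fp32 adds no section.
    if precision in ("fp16", "bf16"):
        config[precision] = {"enabled": True}

    return config


# Example: ZeRO stage 3 with BF16 and CPU optimizer offload.
print(json.dumps(build_ds_config(zero_stage=3, offload_optimizer=True), indent=2))
```

A dict like this can be saved as JSON or passed through the `config` argument of `deepspeed.initialize`. Treat the values here as placeholders to tune for your model and hardware.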