If you're training large language models at scale, this skill gives you access to TorchTitan, PyTorch's official distributed pretraining framework. It handles the complexity of 4D parallelism (FSDP2 data parallelism plus tensor (TP), pipeline (PP), and context (CP) parallelism) without forcing you to cobble together your own infrastructure. The team reports speedups of more than 65% over its baselines on H100s, which matters when you're burning thousands of dollars per training run. You can install from PyPI for stability or build from source for bleeding-edge features, though building from source requires a PyTorch nightly build. This is production tooling from the PyTorch team, not a research prototype, so it's worth evaluating if you're moving beyond single-GPU experiments.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill distributed-llm-pretraining-torchtitan
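
The four parallelism dimensions mentioned above compose multiplicatively: the product of the data-parallel, TP, PP, and CP degrees has to equal the number of GPUs in the job. The Python sketch below illustrates only that arithmetic. The function name `infer_dp_degree` and its parameters are hypothetical, chosen for illustration, and are not part of TorchTitan's API or configuration format.

```python
def infer_dp_degree(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel (FSDP) degree left over after TP, PP, and CP.

    Hypothetical helper for illustration only; not a TorchTitan function.
    """
    for name, degree in (("tp", tp), ("pp", pp), ("cp", cp)):
        if degree < 1:
            raise ValueError(f"{name} degree must be >= 1, got {degree}")

    # GPUs consumed by the non-data-parallel dimensions.
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 64 GPUs split as TP=4, PP=2, CP=2 leaves 4-way data parallelism.
    print(infer_dp_degree(64, tp=4, pp=2, cp=2))
```

A check like this is useful before launching a job, since a layout whose degrees do not multiply to the GPU count cannot be mapped onto the cluster.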