This teaches Claude to correctly implement PyTorch's FSDP2 (the DTensor-based sharding API via fully_shard) in training scripts. It covers the full setup: launching with torchrun, applying bottom-up sharding to submodules before the root, creating optimizers after sharding so they see DTensor parameters, and checkpointing with Distributed Checkpoint or the state dict helpers. The skill is opinionated about the right patterns (use model() not model.forward(), never naively torch.save DTensor state dicts) and includes concrete migration steps from meta device initialization through mixed precision config. Use this when your model won't fit on one GPU or you want cleaner per-parameter sharding than FSDP1.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill pytorch-fsdp2