This skill provides practical implementations of Wanda and SparseGPT for compressing LLMs without retraining. You get one-shot pruning that can reach 50% sparsity with under 1% accuracy loss, plus N:M structured patterns that map to NVIDIA's sparse tensor cores for up to 2x matrix-multiply throughput on A100s. The Wanda code is straightforward: multiply each weight's magnitude by the L2 norm of its input activation, rank the scores within each output row, and mask the lowest ones. What's useful here is seeing the tradeoff between unstructured pruning (better quality, but no speedup without sparse kernels) and semi-structured 2:4 patterns (hardware friendly, at a modest accuracy cost). If you're deploying large models on constrained hardware or trying to cut serving costs, this walks through techniques that hold up in production without the usual fine-tuning overhead.
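To make the Wanda scoring and the unstructured versus 2:4 tradeoff concrete, here is a minimal sketch in NumPy. It is not the skill's own code: the function names (`wanda_scores`, `unstructured_mask`, `n_m_mask`) are hypothetical, and it assumes a single linear layer with a weight matrix of shape `(out_features, in_features)` and a batch of calibration activations of shape `(n_tokens, in_features)`. It only builds masks; SparseGPT's weight update step is not covered, and zeroing weights this way gives no speedup by itself unless the result is run through sparse kernels.

```python
import numpy as np


def wanda_scores(weight, activations):
    """Wanda-style importance: |W_ij| times the L2 norm of input feature j.

    weight:      (out_features, in_features)
    activations: (n_tokens, in_features) calibration inputs to this layer
    """
    feature_norms = np.linalg.norm(activations, axis=0)  # (in_features,)
    return np.abs(weight) * feature_norms  # broadcasts across output rows


def unstructured_mask(scores, sparsity=0.5):
    """Per output row, mark the lowest-scoring fraction of weights as pruned."""
    n_prune = int(scores.shape[1] * sparsity)
    order = np.argsort(scores, axis=1)  # ascending: lowest scores first
    mask = np.ones_like(scores, dtype=bool)
    np.put_along_axis(mask, order[:, :n_prune], False, axis=1)
    return mask  # True = keep


def n_m_mask(scores, n=2, m=4):
    """Keep the n highest-scoring weights in every group of m consecutive inputs."""
    out_features, in_features = scores.shape
    if in_features % m != 0:
        raise ValueError("in_features must be divisible by m")
    groups = scores.reshape(out_features, in_features // m, m)
    order = np.argsort(groups, axis=2)  # ascending within each group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[:, :, : m - n], False, axis=2)
    return mask.reshape(out_features, in_features)


# Illustrative usage on random data (stand-ins for a real layer and calibration set).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(128, 16))

scores = wanda_scores(W, X)
W_unstructured = W * unstructured_mask(scores, sparsity=0.5)
W_2_4 = W * n_m_mask(scores, n=2, m=4)
```

Both masks remove half of the weights, but the 2:4 mask is constrained to keep exactly two weights in each group of four, which is the layout sparse tensor cores expect. That constraint is why it can prune a weight that the unstructured mask would have kept.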
npx skills add https://github.com/orchestra-research/ai-research-skills --skill model-pruning