If you're working with transformer models and attention is your bottleneck, this skill covers Flash Attention, which can speed attention up 2-4x while cutting its memory usage by 10-20x. Under the hood it relies on IO-aware tiling (processing attention in blocks small enough to stay in fast on-chip memory) and on recomputing intermediates instead of storing the full attention matrix. The skill shows you how to use PyTorch's native scaled_dot_product_attention, which dispatches to a Flash Attention kernel when the hardware and inputs support it, and apparently also the standalone flash-attn library for more control. The skill has reportedly passed multiple security audits and comes from a repo with solid GitHub traction. The gains are measurable and grow with sequence length, since standard attention's memory cost scales quadratically with the number of tokens.
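To make the tiling idea concrete, here is a minimal pure-Python sketch of block-wise attention with an online softmax. It is an illustration only, not the fused GPU kernel and not code from the skill: the function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, it tiles over keys and values but not over queries, and it leaves out the backward-pass recomputation.

```python
import math

def naive_attention(q, k, v):
    """Reference version: builds the full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block_size=2):
    """Illustrative sketch: visits keys/values one block at a time and keeps
    only a running max, a running sum, and an output accumulator per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new max (0.0 on the first block).
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[d] for w, vj in zip(weights, v_blk))
                   for d, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions compute the same softmax-weighted output, but the tiled one never holds more than `block_size` scores at a time. In practice you would not write this loop yourself; calling scaled_dot_product_attention or flash-attn gives you the fused GPU version.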
npx skills add https://github.com/orchestra-research/ai-research-skills --skill optimizing-attention-flash