If you're working with transformer models and hitting memory walls or slow attention computations, this skill gives you the Flash Attention implementation that delivers 2-4x speedup and 10-20x memory reduction through IO-aware tiling. The example code shows PyTorch 2.2+ integration using the native scaled_dot_product_attention function, which automatically uses Flash Attention when available. It's straightforward: drop in your query, key, and value tensors on CUDA with float16, and you're done. The skill has passed security audits from Gen Agent Trust Hub, Socket, and Snyk, and with 27.7K GitHub stars on the parent repo, it's clearly battle-tested. Worth having if you're training or running large language models.
npx skills add https://github.com/davila7/claude-code-templates --skill optimizing-attention-flash