This is a CUDA kernel implementation for Kimi's Delta Attention architecture, built on CUTLASS and targeting H100-class GPUs. It's designed as a drop-in backend for flash-linear-attention's chunk_kda operation, handling the recurrent state updates that make delta attention work. You'll need SM90+ hardware and CUDA 12.9 to run it. The implementation is opinionated: it only supports 128-dim keys and values, and works exclusively with bfloat16 precision. The varlen batching support is genuinely useful for production inference where you're packing multiple sequences together. If you're already using flash-linear-attention for KDA models and have the right hardware, this should just plug in and speed things up.
npx skills add https://github.com/aradotso/trending-skills --skill flashkda-delta-attention