This skill walks you through writing optimized CUDA kernels for HuggingFace diffusers and transformers on H100, A100, and T4 GPUs. It provides kernel templates for RMSNorm, attention, RoPE, and activation functions, plus benchmarking scripts that compare them against PyTorch baselines. The included RMSNorm kernel reaches a 2.67x speedup in microbenchmarks but only a 6% end-to-end improvement in LTX-Video generation, because normalization is a small fraction of total compute. It is worth using if you are already profiling and know which kernels are your bottleneck. The skill also includes working examples for both diffusers and transformers integration, architecture-specific optimization guides, and support for loading pre-compiled kernels from the HuggingFace Hub, which skips local compilation entirely.
npx skills add https://github.com/huggingface/kernels --skill cuda-kernels
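
The gap between the 2.67x microbenchmark and the 6% end-to-end figure quoted above is what Amdahl's law predicts when the accelerated kernel covers only a small share of runtime. The sketch below is a back-of-the-envelope check, not part of the skill. It assumes "6% improvement" means an overall speedup of 1.06x and that RMSNorm is the only thing that changed. The function names are hypothetical.

```python
def overall_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the original
    runtime is accelerated by a factor of `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(overall: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: the share of original runtime the kernel must
    have occupied to yield `overall` speedup from `kernel_speedup`."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


KERNEL_SPEEDUP = 2.67  # microbenchmark figure from the description
END_TO_END = 1.06      # ASSUMPTION: "6% improvement" read as a 1.06x speedup

# Share of LTX-Video runtime that RMSNorm would need to account for.
norm_share = implied_fraction(END_TO_END, KERNEL_SPEEDUP)

# Upper bound on end-to-end gain even if RMSNorm cost dropped to zero.
ceiling = 1.0 / (1.0 - norm_share)

print(f"implied RMSNorm share of runtime: {norm_share:.1%}")
print(f"best possible end-to-end speedup from RMSNorm alone: {ceiling:.2f}x")
```

Under these assumptions, normalization accounts for about 9% of runtime, so even an infinitely fast RMSNorm kernel would cap the end-to-end gain at about 1.10x. That is why profiling first to find the dominant kernels matters more than any single microbenchmark number.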