This gives Claude fluency in TileKernels, a library of hand-optimized GPU kernels for LLM operations such as MoE routing, FP8 quantization, and transposes, all written in TileLang for the Hopper and Blackwell architectures. You'd reach for it when implementing custom LLM components that need to squeeze performance out of H100s or B200s, such as fused SwiGLU quantization or Sinkhorn-normalized Manifold HyperConnections. The library includes both low-level kernels and PyTorch autograd wrappers, plus reference implementations for validation. It's fairly specialized: it runs only on SM90+ hardware and assumes you're comfortable writing or modifying GPU kernels. But if you're building production LLM infrastructure and need to beat standard PyTorch ops, this is the tooling for that job.
npx skills add https://github.com/aradotso/trending-skills --skill tilekernels-gpu-kernels
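
The description mentions Sinkhorn normalization and reference implementations used for validation. As a rough illustration of what a slow, readable reference for that operation could look like, here is a minimal pure-Python sketch of the textbook Sinkhorn iteration (alternating row and column normalization). The function name `sinkhorn_normalize` and its parameters are hypothetical and are not taken from the TileKernels API; the library's actual kernels and references may differ.

```python
import math


def sinkhorn_normalize(logits, num_iters=20):
    """Return an approximately doubly stochastic matrix from square `logits`.

    Hypothetical reference sketch, not a TileKernels function.
    """
    n = len(logits)
    # Exponentiate so every entry is strictly positive; subtracting the
    # global maximum first keeps exp() from overflowing.
    peak = max(max(row) for row in logits)
    mat = [[math.exp(v - peak) for v in row] for row in logits]

    for _ in range(num_iters):
        # Row step: scale each row so it sums to 1.
        mat = [[v / sum(row) for v in row] for row in mat]
        # Column step: scale each column so it sums to 1.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

    return mat


if __name__ == "__main__":
    result = sinkhorn_normalize([[0.0, 1.0, 2.0],
                                 [1.0, 0.5, 0.0],
                                 [2.0, 0.0, 1.0]], num_iters=100)
    for row in result:
        print([round(v, 4) for v in row])
```

A reference like this is typically compared against the optimized kernel's output within a numerical tolerance, since a fused GPU kernel in reduced precision will not match it bit for bit.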