This is a from-scratch PyTorch implementation of Google's TurboQuant algorithm for compressing LLM key-value caches down to 2-4 bits per coordinate. It uses two-stage vector quantization: a random rotation followed by Lloyd-Max scalar quantization in the first pass, then a QJL residual correction that makes inner-product estimates unbiased. The key property is that it preserves attention scores rather than individual vector fidelity. At 3 bits you get 5x compression with 99.5% attention accuracy on Qwen2.5-3B, which is the practical sweet spot. Use it when you're running long-context inference and need to fit more tokens in VRAM. The repo includes both synthetic tests and real-model validation, plus production compressors that compute attention scores directly from compressed keys without decompressing. A rough sketch of the compression stage is shown below.
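
A minimal sketch of what the two-stage encoder might look like, to make the pipeline concrete. The function names, the `bits`/`sketch_dim` parameters, and the plain 1-D Lloyd iteration are illustrative assumptions, not the repo's actual API:

```python
import torch

def random_rotation(d: int, seed: int = 0) -> torch.Tensor:
    # Haar-random orthogonal matrix via QR of a Gaussian matrix.
    g = torch.Generator().manual_seed(seed)
    q, r = torch.linalg.qr(torch.randn(d, d, generator=g))
    return q * torch.sign(torch.diagonal(r))

def lloyd_max(x: torch.Tensor, bits: int, iters: int = 25) -> torch.Tensor:
    # 1-D Lloyd-Max: alternate nearest-centroid assignment and centroid
    # re-estimation over the flattened rotated coordinates.
    levels = 2 ** bits
    c = torch.linspace(x.min().item(), x.max().item(), levels)
    for _ in range(iters):
        idx = torch.bucketize(x, (c[:-1] + c[1:]) / 2)
        for j in range(levels):
            sel = x[idx == j]
            if sel.numel():
                c[j] = sel.mean()
    return c

def compress_keys(keys: torch.Tensor, bits: int = 3, sketch_dim: int = 64, seed: int = 0):
    # keys: (n, d). Stage 1: rotate, then Lloyd-Max scalar-quantize each coordinate.
    n, d = keys.shape
    R = random_rotation(d, seed)
    k_rot = keys @ R
    codebook = lloyd_max(k_rot.flatten(), bits)
    idx = torch.bucketize(k_rot, (codebook[:-1] + codebook[1:]) / 2)
    # Stage 2: QJL-style encoding of the residual -- keep only the sign of a
    # Gaussian projection plus the residual norm, which is enough for an
    # unbiased inner-product estimate at score time.
    resid = k_rot - codebook[idx]
    g = torch.Generator().manual_seed(seed + 1)
    S = torch.randn(sketch_dim, d, generator=g)
    res_sign = torch.where(resid @ S.T >= 0, 1.0, -1.0)   # (n, sketch_dim)
    res_norm = resid.norm(dim=-1)                          # (n,)
    return idx, codebook, res_sign, res_norm, R, S
```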
npx skills add https://github.com/aradotso/trending-skills --skill turboquant-pytorch
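
A companion sketch of how attention logits could be recovered directly from the compressed representation, without materializing full-precision keys. The names and the sqrt(pi/2)/m scaling of the sign-sketch term follow the standard QJL inner-product estimator and are assumptions about how the repo's production compressors work, not a copy of their code:

```python
import math
import torch

def scores_from_compressed(query, idx, codebook, res_sign, res_norm, R, S):
    # query: (d,). Rotate the query with the same matrix used on the keys;
    # the rotation is orthogonal, so inner products are preserved.
    q_rot = query @ R
    # Coarse term: inner product against the Lloyd-Max reconstruction,
    # read straight from the code indices.
    coarse = codebook[idx] @ q_rot                                   # (n,)
    # Residual term: unbiased QJL-style inner-product estimate built from
    # the stored sign sketch and residual norm.
    m = S.shape[0]
    correction = math.sqrt(math.pi / 2) / m * res_norm * (res_sign * (S @ q_rot)).sum(-1)
    # Return scaled attention logits.
    return (coarse + correction) / math.sqrt(query.shape[-1])

# Example usage (assumes compress_keys from the sketch above):
# keys, query = torch.randn(1024, 128), torch.randn(128)
# packed = compress_keys(keys, bits=3)
# approx = scores_from_compressed(query, *packed)
# exact = keys @ query / math.sqrt(128)
```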