AWQ (Activation-aware Weight Quantization) won the MLSys 2024 Best Paper award for 4-bit weight quantization, and it holds up in practice. It protects the roughly 1% of weights that matter most, picked by activation patterns rather than by the weight values themselves, which is how it keeps accuracy loss under 5% on instruction-tuned models. The roughly 3x speedup comes from the smaller 4-bit weights. It's faster than GPTQ and generalizes better to chat models, though GPTQ still has wider tool support. The vLLM integration is native and smooth. Note that AutoAWQ itself is officially deprecated, so new projects should use vLLM's llm-compressor instead. If you're deploying 7B to 70B models on consumer GPUs, or need production inference on a tight memory budget, this is still the technique to beat.
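To make the "protect the salient weights" idea concrete, here is a minimal pure-Python sketch. It is a toy illustration, not the AutoAWQ or llm-compressor API. All function names are hypothetical, the fixed scale of 2.0 is an assumption (real AWQ searches for per-channel scales), and symmetric per-row rounding stands in for the grouped quantization real implementations use.

```python
def channel_saliency(activations):
    """Mean absolute activation per input channel (rows are samples)."""
    n = len(activations)
    return [sum(abs(row[j]) for row in activations) / n
            for j in range(len(activations[0]))]


def salient_channels(saliency, fraction=0.01):
    """Indices of the top `fraction` of channels by saliency (at least one)."""
    k = max(1, round(len(saliency) * fraction))
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return sorted(ranked[:k])


def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization; returns dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in row) / qmax or 1.0  # avoid a zero step
    return [max(-qmax - 1, min(qmax, round(w / step))) * step for w in row]


def awq_quantize_row(row, salient, scale=2.0, bits=4):
    """Scale salient columns up before rounding, then undo the scale.

    Assumption: a single hand-picked scale. In a real deployment the inverse
    scale is folded into the activations instead of the dequantized weights.
    """
    scaled = [w * scale if j in salient else w for j, w in enumerate(row)]
    dequantized = quantize_row(scaled, bits)
    return [w / scale if j in salient else w for j, w in enumerate(dequantized)]


# Toy data: channel 0 carries far larger activations than the others.
activations = [[8.0, 0.1, -0.2, 0.1], [-12.0, 0.2, 0.1, -0.1]]
salient = salient_channels(channel_saliency(activations))

row = [0.35, 1.0, -0.7, 0.26]
plain = quantize_row(row)
protected = awq_quantize_row(row, salient)

print("salient channels:", salient)
print("plain error on channel 0:    ", abs(plain[0] - row[0]))
print("protected error on channel 0:", abs(protected[0] - row[0]))
```

In this toy row the scaled weight lands on a finer effective grid, so its rounding error shrinks while the other weights are untouched. That only holds because the scaled value does not become the row maximum. Once it does, the step size grows and the remaining weights lose precision, which is the trade-off the real scale search balances.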
npx skills add https://github.com/orchestra-research/ai-research-skills --skill awq-quantization