GPTQ is a post-training quantization method that compresses large language models to 4-bit precision with minimal accuracy loss, typically under 2% perplexity degradation. It combines group-wise quantization with Hessian-based error minimization to cut weight memory roughly 4× relative to FP16, so a 70B model shrinks from about 140 GB of weights to about 35 GB. That is small enough for a single 48 GB card or a pair of 24 GB consumer GPUs such as the RTX 4090. The real win is deployment: with optimized kernels, inference can run 3-4× faster than FP16 while keeping quality largely intact. GPTQ checkpoints load directly through transformers and can be fine-tuned with QLoRA-style adapters, which makes it feasible to fine-tune a 70B model on a single A100. Choose AWQ if you need slightly better accuracy or have newer GPUs with Marlin kernel support; otherwise GPTQ remains the proven choice for aggressive compression without breaking your models.
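To make the group-wise idea and the memory figures concrete, here is a minimal pure-Python sketch. It is not GPTQ itself: it uses plain round-to-nearest within each group and skips the Hessian-based error compensation that GPTQ layers on top. The function names, the group size of 128, and the storage layout (one FP16 scale and one 4-bit zero point per group) are illustrative assumptions, not the API or on-disk format of any particular library.

```python
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, minimum value)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = (1 << bits) - 1  # 15 distinct steps above the minimum for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]


def quantize_groupwise(weights: List[float], group_size: int = 128,
                       bits: int = 4) -> List[Group]:
    """Quantize a flat weight list, giving each group its own scale and minimum."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def effective_bits_per_weight(bits: int = 4, group_size: int = 128,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Bits per weight including per-group metadata (assumed storage layout)."""
    return bits + (scale_bits + zero_bits) / group_size


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes); ignores activations and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    fp16 = weight_memory_gb(70e9, 16)
    int4 = weight_memory_gb(70e9, effective_bits_per_weight())
    print(f"70B weights: FP16 {fp16:.1f} GB, 4-bit (g=128) {int4:.1f} GB")
```

Under these assumptions the per-group metadata adds about 0.16 bits per weight, so the reduction relative to FP16 comes out slightly below 4×. Smaller groups track the local weight range more closely at the cost of more metadata.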
npx skills add https://github.com/orchestra-research/ai-research-skills --skill gptq