GPTQ: 4-bit post-training quantization making GPT-scale inference practical

In one sentence: Frantar et al. (IST Austria and ETH Zurich) publish GPTQ, an accurate 3- to 4-bit post-training quantization method that needs no retraining and is the first to make generative inference with 175B-parameter models practical on a single GPU.

Large language models like GPT-3 weigh hundreds of gigabytes and require data-center GPUs to run. Quantization compresses the numbers that represent model weights: instead of 16 bits per value, it uses only 4, cutting the size roughly fourfold.
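To make the size argument concrete, here is a minimal back-of-the-envelope sketch. The helper name `weight_gigabytes` is hypothetical, and the figures cover weight storage only; they ignore quantization metadata (scales and zero points), activations, and the attention cache.

```python
def weight_gigabytes(n_params: int, bits_per_weight: int) -> float:
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


GPT3_SCALE = 175_000_000_000  # 175 billion parameters

fp16_gb = weight_gigabytes(GPT3_SCALE, 16)  # 350.0 GB at 16 bits per weight
int4_gb = weight_gigabytes(GPT3_SCALE, 4)   # 87.5 GB at 4 bits per weight
```

Going from 16 to 4 bits divides the weight storage by exactly four; real deployments land slightly above that figure because of the per-group metadata that quantized formats carry.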

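The problem is that aggressively compressing weights degrades model quality. GPTQ, published by researchers at IST Austria and ETH Zurich, compensates for quantization errors using approximate second-order information: it processes the model layer by layer, quantizes the weights a few at a time, and adjusts the weights not yet quantized so that the layer's output on a small calibration dataset stays as close as possible to the original. All of this happens without retraining the model.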
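The error-compensation idea can be illustrated with a small NumPy sketch. This is a simplified, assumption-laden illustration of the column-by-column update that GPTQ builds on, not the authors' implementation: the published method adds a Cholesky-based reformulation and blocked updates for speed. All function names, the per-row 4-bit grid, the damping value, and the synthetic data are hypothetical choices made for this example.

```python
import numpy as np


def make_grid(W, bits=4):
    """Per-row uniform grid: returns (lowest value, step size) for each row."""
    lo = W.min(axis=1)
    scale = (W.max(axis=1) - lo) / (2 ** bits - 1)
    return lo, scale


def quantize(w, lo, scale, bits=4):
    """Snap values to the nearest grid point, clipping to the grid's range."""
    codes = np.clip(np.round((w - lo) / scale), 0, 2 ** bits - 1)
    return lo + codes * scale


def round_to_nearest(W, bits=4):
    """Baseline: quantize every weight independently."""
    lo, scale = make_grid(W, bits)
    return quantize(W, lo[:, None], scale[:, None], bits)


def compensated_quantize(W, X, bits=4, damp=0.01):
    """Quantize column by column, pushing each column's error onto later columns."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    lo, scale = make_grid(W, bits)
    H = X @ X.T                                      # second-order statistics of the layer inputs
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)   # damping for numerical stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(d_in):
        Q[:, i] = quantize(W[:, i], lo, scale, bits)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        # Adjust the not-yet-quantized columns to offset this column's error.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Remove column i from the inverse Hessian (one Gaussian elimination step).
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return Q


def output_error(W, Q, X):
    """Frobenius norm of the change in the layer's output on the calibration inputs."""
    return float(np.linalg.norm(W @ X - Q @ X))


def demo(seed=0):
    """Synthetic layer with correlated inputs (low-rank signal plus noise)."""
    rng = np.random.default_rng(seed)
    d_out, d_in, n = 16, 32, 512
    W = rng.normal(size=(d_out, d_in))
    X = rng.normal(size=(d_in, 8)) @ rng.normal(size=(8, n)) + 0.3 * rng.normal(size=(d_in, n))
    return W, X, round_to_nearest(W), compensated_quantize(W, X)
```

Calling `demo()` returns the original weights, the calibration inputs, and both quantized versions, so `output_error` can be compared for the two. The compensation pays off when the inputs are correlated, because an error made on one weight can then be partly cancelled by nudging the others; with uncorrelated inputs the two methods behave almost identically.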

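The result: models with 175 billion parameters can be quantized to 3 or 4 bits per weight with minimal quality loss, which the paper reports is small enough to run generative inference on a single 80 GB data-center GPU at 3 bits. A 24 GB consumer GPU is a realistic target only for much smaller models, since at 4 bits the weights of a 175-billion-parameter model alone take about 87.5 GB. GPTQ paves the way for local inference of large LLMs, preceding by months the explosion of open-source models.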