GPTQ: 4-bit post-training quantization making GPT-scale inference practical
In one sentence: Frantar et al. (IST Austria and ETH Zurich) publish GPTQ, a post-training method that quantizes GPT-scale model weights to 3 or 4 bits with no retraining and little accuracy loss, the first technique to make generative inference with a 175B-parameter model possible on a single GPU.
Large language models such as GPT-3 take up hundreds of gigabytes (175 billion weights at 16 bits each is about 350 GB) and need several data-center GPUs to run. Quantization compresses the numbers that represent the model weights: storing each value in 4 bits instead of 16 cuts the size fourfold.
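The size arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the helper name `weight_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and activations, the KV cache and quantization metadata (scales, zero points) are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 175e9  # a GPT-3-sized model

for bits in (16, 4, 3):
    size = weight_memory_gb(N_PARAMS, bits)
    print(f"{bits:>2} bits per weight: {size:.1f} GB")
```

At 16 bits the weights alone come to 350 GB; at 4 bits they shrink to 87.5 GB, and at 3 bits to roughly 66 GB.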
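The problem is that aggressively compressing weights degrades model quality. GPTQ, developed by researchers at IST Austria and ETH Zurich, compensates for quantization errors as it goes: working layer by layer, it quantizes the weights one column at a time and uses approximate second-order (Hessian) information, estimated from a small calibration dataset, to adjust the not-yet-quantized weights so that they offset the error just introduced. All of this happens without retraining the model.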
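The error-compensation idea can be sketched in NumPy for a single linear layer. This is a simplified illustration, not the authors' implementation: it omits the blocked, lazily batched updates and weight grouping that a production GPTQ uses, the function names (`make_grid`, `quantize`, `rtn`, `gptq_like`, `layer_error`) are hypothetical, and the weights and calibration inputs are random stand-ins. It compares plain round-to-nearest with column-by-column quantization that pushes each column's rounding error onto the remaining columns.

```python
import numpy as np


def make_grid(W, bits=4):
    """Per-row asymmetric min-max grid: (scale, zero point, top level)."""
    levels = 2 ** bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    zero = np.round(-w_min / scale)
    return scale, zero, levels


def quantize(w, scale, zero, levels):
    """Snap to the nearest grid point and return the dequantized value."""
    q = np.clip(np.round(w / scale) + zero, 0, levels)
    return scale * (q - zero)


def rtn(W, bits=4):
    """Baseline: round every weight to nearest, with no compensation."""
    scale, zero, levels = make_grid(W, bits)
    return quantize(W, scale, zero, levels)


def gptq_like(W, X, bits=4, damp=0.01):
    """Quantize W (rows x cols) column by column, given inputs X (cols x samples)."""
    W = W.astype(np.float64).copy()
    scale, zero, levels = make_grid(W, bits)
    H = 2.0 * X @ X.T                                    # layer-wise Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # damping for stability
    # Upper Cholesky factor U of the inverse Hessian, so that H^-1 = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        w = W[:, i:i + 1]
        q = quantize(w, scale, zero, levels)
        Q[:, i:i + 1] = q
        err = (w - q) / U[i, i]
        # Spread this column's rounding error over the columns still to come.
        W[:, i + 1:] -= err @ U[i:i + 1, i + 1:]
    return Q


def layer_error(W, Q, X):
    """Squared error between the original and quantized layer outputs."""
    return float(np.linalg.norm(W @ X - Q @ X) ** 2)


rng = np.random.default_rng(0)
rows, cols, n_samples = 16, 32, 256
W = rng.normal(size=(rows, cols))
mix = rng.normal(size=(cols, cols))           # makes input features correlated
X = mix @ rng.normal(size=(cols, n_samples))  # stand-in calibration inputs

err_rtn = layer_error(W, rtn(W), X)
err_gptq = layer_error(W, gptq_like(W, X), X)
print(f"round-to-nearest error:   {err_rtn:.2f}")
print(f"with error compensation:  {err_gptq:.2f}")
```

Both variants store every weight on the same 16-level grid; the only difference is that the second one lets later columns absorb the error of earlier ones, which is why correlated inputs (the usual case in real layers) make the compensation pay off.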
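The result: a model with 175 billion parameters, which normally needs several data-center GPUs, fits on a single 80 GB GPU at 3 bits per weight with minimal quality loss, and models in the tens of billions of parameters come within reach of a 24 GB consumer GPU at 4 bits. GPTQ paves the way for local inference of large LLMs, preceding by months the explosion of open-source models.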
Companies
ETH Zurich
Tools
GPTQ, PyTorch, CUDA