AImpact

AWQ: activation-aware 4-bit quantization for edge deployment with accuracy above GPTQ

In one sentence: MIT Han Lab's AWQ is a 4-bit quantization method that uses activation statistics to identify salient weights and protect them, giving a better accuracy-throughput trade-off than GPTQ for edge deployment.

Quantizing a model means storing its weights at lower numerical precision, for example 4 bits instead of 16, to save memory. The problem is that not all weights are equally important: compressing them all the same way loses more quality than necessary.
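As a rough illustration of what uniform 4-bit compression looks like, here is a minimal NumPy sketch of round-to-nearest quantization over groups of weights. The function names, the group size of 128, and the random test weights are assumptions made for this example; it is not the code of any particular library.

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Round-to-nearest 4-bit quantization, one (step, offset) pair per group."""
    groups = np.asarray(w, dtype=np.float32).reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    # 4 bits give 16 levels (codes 0..15) spread over each group's value range.
    step = np.maximum(groups.max(axis=1, keepdims=True) - lo, 1e-8) / 15.0
    codes = np.clip(np.round((groups - lo) / step), 0, 15).astype(np.uint8)
    # Codes sit in uint8 here for readability; real kernels pack two per byte.
    return codes, step, lo

def dequantize_4bit(codes, step, lo, shape):
    """Map 4-bit codes back to approximate float weights."""
    return (codes.astype(np.float32) * step + lo).reshape(shape)

# Hypothetical weight matrix, only for demonstration.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 256)).astype(np.float32)

codes, step, lo = quantize_4bit(w)
w_hat = dequantize_4bit(codes, step, lo, w.shape)
max_err = float(np.abs(w - w_hat).max())
print("largest rounding error:", max_err)
```

Every weight in a group is rounded onto the same 16-level grid, so the rounding error is at most half a step regardless of how much a given weight matters to the output.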

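AWQ (Activation-aware Weight Quantization), developed at MIT Han Lab, the group led by Song Han, observes that a small fraction of weights has a disproportionate impact on model output, and that these weights are found by looking at input activations, not at the weights themselves. Weight channels that receive large activations are more "important." AWQ protects them by scaling those channels up before quantization and scaling the matching activations down by the same factor, which reduces the quantization error on the critical weights while every weight stays in 4-bit.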
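The scaling idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not MIT Han Lab's implementation: the helper names (`fake_quantize`, `search_scales`), the synthetic data with four artificially amplified input channels, the group size of 32, and the 21-point grid are all choices made for this example. Per-channel scales are taken as the mean activation magnitude raised to a power `alpha`, and `alpha` is picked by a grid search over [0, 1] that minimizes the error of the layer output.

```python
import numpy as np

def fake_quantize(w, n_bits=4, group_size=32):
    """Simulated round-to-nearest quantization per group along the input axis."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    levels = 2 ** n_bits - 1
    step = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((g - lo) / step), 0, levels)
    return (q * step + lo).reshape(out_f, in_f)

def output_mse(w_hat, w, x):
    """Mean squared error of the layer output x @ W^T after quantization."""
    return float(np.mean((x @ w_hat.T - x @ w.T) ** 2))

def search_scales(w, x, n_grid=20):
    """Grid-search alpha in [0, 1] for per-input-channel scales s = mean|x| ** alpha."""
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-8)
    best_err, best_alpha, best_s = np.inf, 0.0, np.ones_like(act_mag)
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha
        # Quantize the scaled weights, then fold 1/s back in. This is
        # equivalent to feeding x / s into the quantized, scaled weights.
        w_hat = fake_quantize(w * s) / s
        err = output_mse(w_hat, w, x)
        if err < best_err:
            best_err, best_alpha, best_s = err, float(alpha), s
    return best_err, best_alpha, best_s

# Synthetic calibration data: a few input channels carry much larger activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :4] *= 20.0
w = rng.normal(size=(32, 64))

err_rtn = output_mse(fake_quantize(w), w, x)   # plain round-to-nearest
err_awq, alpha, scales = search_scales(w, x)   # activation-aware scaling
print("RTN output MSE:   ", err_rtn)
print("scaled output MSE:", err_awq, "at alpha =", alpha)
```

Because `alpha = 0` (all scales equal to 1) is part of the grid, the search can never do worse than plain round-to-nearest on the calibration data; any gain comes from giving the high-activation channels a larger share of each group's quantization range.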

The resulting models are more accurate than GPTQ-quantized ones at the same compression level, especially small models meant for edge devices like smartphones and laptops. TinyChat, MIT Han Lab's inference engine, uses AWQ to run LLaMA at 60+ tokens per second on an M2 MacBook.

Companies

MIT Han Lab

Tools

AWQ, PyTorch, TinyChat, llama.cpp

Tags

AWQ, Quantization, 4-bit, Activation-aware, MIT Han Lab, Edge, Deployment, LLM
