Inference, Intermediate. Also known as: Quantizzazione

Quantization

A technique that reduces the numeric precision of a model's weights (for example, from 16 bits to 4 bits) so the model takes less memory and runs faster.
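To make the idea concrete, here is a minimal sketch of one simple scheme, symmetric "absmax" quantization, applied to a handful of weights. It is illustrative only: the function names `quantize_absmax` and `dequantize` and the sample weights are assumptions made for this example, and real methods such as GPTQ or AWQ are considerably more sophisticated.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integer codes using a single shared scale.

    Assumed scheme for illustration: symmetric absmax rounding,
    with codes in [-(2**(bits-1) - 1), 2**(bits-1) - 1].
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights chosen for the example.
weights = [0.70, -0.42, 0.13, 0.02]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)          # small integers that fit in 4 bits each
print("restored:", restored)
print("max error:", max_error)
```

Each weight is stored as a small integer plus one shared scale, which is where the memory saving comes from; the gap between `weights` and `restored` is the quality cost.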


In practice

It is what lets you run a 70B Llama model on a single GPU or a 7B model on a Mac. The cost is some loss of quality, though often not much. Common formats and methods: GGUF (a file format), AWQ and GPTQ (4-bit quantization methods). Useful for on-prem or edge deployment.
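A rough back-of-the-envelope calculation shows why bit width matters so much for these model sizes. The sketch below counts weight storage only; it deliberately ignores the KV cache, activations, and the small overhead of per-group scales that real quantized formats carry, so treat the figures as lower bounds. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter uses exactly `bits_per_weight` bits,
    with no extra metadata such as scales or zero points.
    """
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("70B", 70e9), ("7B", 7e9)]:
    for bits in (16, 8, 4):
        size = weight_memory_gb(n_params, bits)
        print(f"{name} model at {bits:>2}-bit: about {size:.1f} GB of weights")
```

Under these assumptions, a 70B model goes from about 140 GB at 16 bits to about 35 GB at 4 bits, and a 7B model from about 14 GB to about 3.5 GB.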

Related terms

Inference compute LoRA

Seen in the wild

11 entries mentioning it

April 30, 2026

Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM

Medium
March 1, 2025

torchao: PyTorch-Native Quantization and Sparsity Without Custom CUDA

Medium
September 10, 2024

KV Cache Quantization FP8/INT8: Double User Density per GPU

High
August 20, 2024

bitsandbytes 0.43: QLoRA and NF4/FP4 quantization for 4-bit fine-tuning

Medium
August 5, 2024

LLM Compressor: unified toolkit for quantization and sparsity with native vLLM integration

Medium
March 25, 2024

GGUF specification: the standard format for local quantized LLM models

Medium
September 15, 2023

ExLlamaV2: high-speed quantized LLM inference on consumer GPUs

Medium
July 5, 2023

llama.cpp K-quants: the intelligent quantization that transformed local models

High
June 22, 2023

AWQ: activation-aware 4-bit quantization for edge deployment with accuracy above GPTQ

High
March 10, 2023

llama.cpp: LLaMA 7B runs 4-bit on MacBook CPU

Landmark
October 12, 2022

GPTQ: 4-bit post-training quantization making GPT-scale inference practical

High
