In practice
It is what lets you run a Llama 70B on a single GPU or a 7B model on a Mac. You lose a bit of quality but often not much. Typical tools: GGUF, AWQ, GPTQ. Useful for on-prem or edge deployment.
Related terms
Seen in the wild
11 entries mentioning it- MediumUsable 2-bit quantization: frontier reasoning models drop below 32GB RAM
- Mediumtorchao: PyTorch-Native Quantization and Sparsity Without Custom CUDA
- HighKV Cache Quantization FP8/INT8: Double User Density per GPU
- Mediumbitsandbytes 0.43: QLoRA and NF4/FP4 quantization for 4-bit fine-tuning
- MediumLLM Compressor: unified toolkit for quantization and sparsity with native vLLM integration
- MediumGGUF specification: the standard format for local quantized LLM models
- MediumExLlamaV2: high-speed quantized LLM inference on consumer GPUs
- Highllama.cpp K-quants: the intelligent quantization that transformed local models
- HighAWQ: activation-aware 4-bit quantization for edge deployment with accuracy above GPTQ
- Landmarkllama.cpp: LLaMA 7B runs 4-bit on MacBook CPU
- HighGPTQ: 4-bit post-training quantization making GPT-scale inference practical