Inference Intermediate Also known as: K-Quantization · llama.cpp K-Quants · GGUF K-Quants

K-Quants

K-Quants are a family of quantization formats implemented in llama.cpp (Q2_K through Q6_K for model weights, plus Q8_K, which llama.cpp uses only for intermediate results). The named variants built on them (suffixes such as _S, _M and _L) apply different bit-widths to different tensors based on their sensitivity to precision loss. Tensors that are more sensitive, typically in the attention and embedding layers, receive more bits; less critical ones, typically in the intermediate feed-forward layers, receive fewer. This non-uniform quantization produces higher quality than the older flat formats (Q4_0, Q5_1) at the same file size. Q4_K_M has become the reference format for local inference, achieving better quality than the old Q5_1 while being more compact. K-Quants are the standard format for modern GGUF models downloadable from Hugging Face.
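The effect of mixing bit-widths on file size can be sketched with a little arithmetic. The snippet below is a rough illustration, not llama.cpp code: the bytes-per-super-block figures (110, 144, 176 and 210 bytes per 256 weights) are assumptions based on the ggml block layouts and should be checked against the llama.cpp source, and the 85%/15% split is a hypothetical mix chosen only for the example.

```python
# Bits per weight for some K-quant block types, computed as
# (assumed bytes per 256-weight super-block) * 8 / 256.
BITS_PER_WEIGHT = {
    "Q3_K": 110 * 8 / 256,  # 3.4375
    "Q4_K": 144 * 8 / 256,  # 4.5
    "Q5_K": 176 * 8 / 256,  # 5.5
    "Q6_K": 210 * 8 / 256,  # 6.5625
}


def effective_bpw(mix):
    """Average bits per weight for a mix {block type: fraction of weights}."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(BITS_PER_WEIGHT[t] * frac for t, frac in mix.items())


def estimate_size_gb(n_params, mix):
    """Rough size in GB (10^9 bytes); ignores metadata and unquantized tensors."""
    return n_params * effective_bpw(mix) / 8 / 1e9


# Hypothetical split: most weights in Q4_K, a sensitive minority promoted to Q6_K.
HYPOTHETICAL_MIX = {"Q4_K": 0.85, "Q6_K": 0.15}

if __name__ == "__main__":
    print(f"{effective_bpw(HYPOTHETICAL_MIX):.3f} bits per weight")
    print(f"{estimate_size_gb(70e9, HYPOTHETICAL_MIX):.1f} GB for 70B parameters")
```

Promoting a small fraction of tensors to a wider type raises the average only slightly, which is why a mixed variant can sit between two flat formats in size.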


In practice

A user who wants to run Llama 3 70B on a PC with 48 GB of RAM downloads the Q4_K_M variant from a GGUF repository on Hugging Face (typically uploaded by community members such as TheBloke or bartowski) and runs it with `llama.cpp` or an interface like LM Studio or Ollama. The choice of quantization level follows a practical rule: Q4_K_M for the best quality/size balance, Q5_K_M if there is enough RAM and higher fidelity is desired, and Q2_K if space is very limited and degraded quality is acceptable. K-Quants are transparent to the end user: the interface loads the GGUF file and handles the format internally.
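The rule of thumb above can be written as a small selection helper. This is a minimal sketch under stated assumptions: `pick_quant`, the `headroom_gb` allowance and the file sizes in the usage example are all hypothetical, and real sizes should be read from the model repository.

```python
# Variants in order of preference, from highest fidelity to smallest file,
# following the rule of thumb in the text.
PREFERENCE = ["Q5_K_M", "Q4_K_M", "Q2_K"]


def pick_quant(file_sizes_gb, ram_gb, headroom_gb=4.0):
    """Return the first preferred variant whose file fits in RAM, or None.

    file_sizes_gb: {variant name: file size in GB}, as listed by the uploader.
    headroom_gb: assumed allowance for context cache, OS and other processes.
    """
    budget = ram_gb - headroom_gb
    for variant in PREFERENCE:
        size = file_sizes_gb.get(variant)
        if size is not None and size <= budget:
            return variant
    return None


if __name__ == "__main__":
    # Illustrative sizes only, not taken from a real repository.
    sizes = {"Q5_K_M": 50.0, "Q4_K_M": 42.5, "Q2_K": 26.4}
    print(pick_quant(sizes, ram_gb=48))
    print(pick_quant(sizes, ram_gb=32))
```

The helper only compares file size with available memory; longer contexts need more headroom, so the default allowance is a starting point rather than a guarantee.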

Related terms

Quantization · QLoRA

Seen in the wild

1 entry mentioning it

July 5, 2023

llama.cpp K-quants: the intelligent quantization that transformed local models

High
