Inference · Advanced · Also known as: KV Quantization · KV Compression

KV Cache Quantization

KV cache quantization compresses the key-value tensors generated dynamically during inference, typically from FP16 to FP8 or INT8. Unlike weight quantization, which operates on the model's static parameters, it acts on the cache built at runtime for each request. Going from 16-bit to 8-bit values cuts the cache's VRAM footprint by roughly 50%, enabling longer context windows or more concurrent requests per GPU. It is supported by vLLM, Text Generation Inference (TGI), and TensorRT-LLM.
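To make the compression step concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization applied to a single key vector. The function names and the sample vector are illustrative assumptions, not any engine's API; production engines do this inside GPU kernels, the granularity of the scale factor varies by implementation, and FP8 is a floating-point format rather than this integer scheme.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization of one vector: returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    # One scale per vector (an assumption; engines differ in granularity).
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical key vector standing in for one attention head's cached keys.
key_vector = [0.5, -1.27, 0.003, 0.9]

codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)

# The rounding error per element is bounded by half the scale.
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(codes, scale, max_error)
```

Each element is stored in one byte instead of two, at the cost of a rounding error of at most half the scale. Small values next to a large outlier (such as 0.003 here) lose the most relative precision.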


In practice

A sysadmin serving a 70B model on two A100 80GB GPUs who wants to raise the concurrent batch size from 8 to 16 requests can enable FP8 KV cache quantization in vLLM by adding `--kv-cache-dtype fp8` to the launch command. This is distinct from weight quantization: the two approaches are orthogonal and can be combined. Before deploying to production, measure quality degradation on long-range tasks (needle-in-a-haystack retrieval, multi-turn conversations), since precision loss in the cache is more visible over long contexts.
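The capacity gain in this scenario can be estimated with back-of-the-envelope arithmetic. The sketch below uses hypothetical figures: 80 layers, 8 KV heads, and a head dimension of 128 for the model, a 20 GiB cache budget left after weights, and 8192-token contexts. None of these numbers come from the scenario above, and the estimate ignores scale-factor storage and allocator overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache stored per token across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_concurrent_sequences(cache_budget_bytes, context_tokens, bytes_per_token):
    """How many full-length sequences fit in the cache budget."""
    return cache_budget_bytes // (context_tokens * bytes_per_token)


# Hypothetical model shape and serving budget (assumptions, see lead-in).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 80, 8, 128
CACHE_BUDGET_BYTES = 20 * 1024**3  # 20 GiB left for the KV cache
CONTEXT_TOKENS = 8192

fp16_per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)

fp16_sequences = max_concurrent_sequences(CACHE_BUDGET_BYTES, CONTEXT_TOKENS, fp16_per_token)
fp8_sequences = max_concurrent_sequences(CACHE_BUDGET_BYTES, CONTEXT_TOKENS, fp8_per_token)

print(f"FP16: {fp16_per_token} B/token, {fp16_sequences} sequences")
print(f"FP8:  {fp8_per_token} B/token, {fp8_sequences} sequences")
```

Under these assumptions, halving the bytes per element doubles the number of full-length sequences that fit in the same budget. Real gains depend on the actual model shape, context lengths, and how the serving engine allocates cache blocks.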

Seen in the wild

1 entry mentioning it

September 10, 2024

KV Cache Quantization FP8/INT8: Double User Density per GPU

High
