KV Cache Quantization FP8/INT8: Double User Density per GPU
In one sentence: Quantizing the KV cache from FP16 to FP8 or INT8 cuts KV cache memory roughly in half, allowing up to about 2x longer contexts or twice the concurrent users per GPU, and is supported by vLLM, TGI, and TensorRT-LLM.
During text generation, an LLM must keep the attention keys and values computed for every token processed so far in the conversation, so it does not have to recompute them at each step. This data is called the KV cache (keys and values) and grows linearly with the total sequence length, prompt plus generated text. The problem: it occupies large amounts of GPU memory, often more than the model weights themselves when handling many simultaneous conversations.
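To make the memory growth concrete, here is a minimal sizing sketch. The model dimensions (32 layers, 32 KV heads, head dimension 128, roughly a 7B-class model without grouped-query attention) are assumptions chosen for illustration, not figures from any specific deployment.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `batch_size` sequences."""
    # Factor 2: every layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed example dimensions (hypothetical 7B-class model, FP16 cache).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1)
one_seq = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096)
batch_32 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096,
                          batch_size=32)

print(f"per token:              {per_token / 2**20:.2f} MiB")
print(f"one 4096-token sequence: {one_seq / 2**30:.2f} GiB")
print(f"32 such sequences:       {batch_32 / 2**30:.2f} GiB")
```

Under these assumptions, 32 concurrent 4096-token conversations need 64 GiB of cache, several times the roughly 14 GB that a 7B-parameter model occupies in FP16.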
The solution is to compress this temporary data. Traditionally it is stored in FP16, at 16 bits per number. KV cache quantization converts it to FP8 or INT8, at 8 bits per number, which halves the space used apart from a small overhead for scaling factors. Unlike model weights, where precision is crucial for quality, the KV cache has different statistical characteristics that typically allow this compression with minimal quality impact.
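The idea can be sketched with symmetric INT8 quantization of a single cached vector. This is an illustrative pure-Python sketch, not the kernel any serving engine uses: the per-vector scale, the random test data, and the head dimension of 128 are all assumptions, and real engines differ in scaling granularity. FP8 works differently in detail (values are cast to an 8-bit floating-point format), but the storage saving is the same.

```python
import random


def quantize_int8(values):
    """Symmetric INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from the codes."""
    return [c * scale for c in codes]


random.seed(0)
HEAD_DIM = 128  # assumed head dimension
key_vector = [random.gauss(0.0, 1.0) for _ in range(HEAD_DIM)]

codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(key_vector, restored))

fp16_bytes = HEAD_DIM * 2       # 2 bytes per FP16 element
int8_bytes = HEAD_DIM * 1 + 2   # 1 byte per code plus one FP16 scale

print(f"max round-trip error: {max_err:.4f} (bound: {scale / 2:.4f})")
print(f"storage: {fp16_bytes} bytes in FP16, {int8_bytes} bytes in INT8")
```

The round-trip error is bounded by half a quantization step, and the stored scale is why the saving lands slightly under an exact 50%.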
The concrete result for infrastructure operators: the memory left over after loading the model weights holds about twice as many cached tokens. On the same GPU you can therefore run the same model with roughly twice the context length (for example, moving from 4096 to 8192 context tokens without adding hardware), or serve roughly twice as many users in parallel. Both translate directly to lower cost per generated response. vLLM, TGI, and TensorRT-LLM have all adopted this technique as a standard option.
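A rough capacity estimate shows where the factor of two comes from. Every number here is an assumption for illustration: an 80 GiB GPU, 14 GiB of weights, the same hypothetical 7B-class dimensions (32 layers, 32 KV heads, head dimension 128), and no allowance for activations, fragmentation, or scaling-factor overhead.

```python
def max_concurrent_sequences(kv_budget_bytes, bytes_per_token, context_len):
    """How many full-length sequences fit in the KV cache budget."""
    return kv_budget_bytes // (bytes_per_token * context_len)


GPU_MEMORY = 80 * 2**30    # assumed 80 GiB GPU
WEIGHTS = 14 * 2**30       # assumed weight footprint
kv_budget = GPU_MEMORY - WEIGHTS

# Elements cached per token: keys and values, 32 layers, 32 heads, dim 128.
ELEMS_PER_TOKEN = 2 * 32 * 32 * 128

fp16_users = max_concurrent_sequences(kv_budget, ELEMS_PER_TOKEN * 2, 4096)
int8_users = max_concurrent_sequences(kv_budget, ELEMS_PER_TOKEN * 1, 4096)
int8_long = max_concurrent_sequences(kv_budget, ELEMS_PER_TOKEN * 1, 8192)

print(f"FP16 cache,  4096 tokens: {fp16_users} sequences")
print(f"8-bit cache, 4096 tokens: {int8_users} sequences")
print(f"8-bit cache, 8192 tokens: {int8_long} sequences")
```

With these assumptions the 8-bit cache either doubles the number of concurrent sequences at the same context length or doubles the context length at the same concurrency. In vLLM the corresponding setting is the KV cache dtype option (for example `--kv-cache-dtype fp8`); supported values depend on the version and hardware, so check the documentation for your release.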
Companies
vLLM Team, NVIDIA, Hugging Face
Tools
—
Tags
Sources