AImpact
High AI Infrastructure · 1 min read

KV Cache Quantization FP8/INT8: Double User Density per GPU

In one sentence: Quantizing the KV cache from FP16 to FP8 or INT8 roughly halves KV cache memory, enabling up to 2x longer contexts or up to twice the concurrent users per GPU when the cache is the memory bottleneck. vLLM, TGI, and TensorRT-LLM all support it.

Needs review Official source

During text generation, an LLM keeps the attention keys and values for every token processed so far in the conversation, so it does not have to recompute them at each step. This data is called the KV cache (keys and values), and it grows linearly with the total sequence length (prompt plus generated text). The problem: it occupies enormous amounts of GPU memory, often more than the model weights themselves when handling many simultaneous conversations.
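To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of KV cache size. The model shape (32 layers, 32 KV heads, head dimension 128) and the function name `kv_cache_bytes` are assumptions chosen for illustration, not figures from any specific model or serving framework.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Approximate KV cache size in bytes for a standard transformer decoder."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
GIB = 1024 ** 3

# FP16 stores 2 bytes per element.
one_seq = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096, batch_size=1, bytes_per_elem=2)
many_seq = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096, batch_size=32, bytes_per_elem=2)

print(f"FP16, 1 sequence of 4096 tokens:   {one_seq / GIB:.1f} GiB")
print(f"FP16, 32 sequences of 4096 tokens: {many_seq / GIB:.1f} GiB")
```

Under these assumed dimensions, a single 4096-token sequence takes 2 GiB and 32 concurrent sequences take 64 GiB, several times the roughly 13 GiB that 7 billion FP16 weights would occupy.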

The solution is to compress this temporary data. Traditionally it is stored in FP16 (16 bits per number). KV cache quantization converts it to FP8 or INT8 (8 bits per number), halving the space used apart from a small overhead for any scaling factors. Unlike model weights, where precision is crucial for quality, the KV cache has different statistical characteristics that allow this compression with minimal quality impact.
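As an illustration of what "converting to 8 bits" can mean, the following is a minimal sketch of symmetric INT8 quantization with one scale per vector. It is one simple scheme among several and is not taken from vLLM, TGI, or TensorRT-LLM; the function names `quantize_int8` and `dequantize_int8` and the sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization of one vector: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude onto 127; guard against an all-zero vector.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from INT8 codes."""
    return [c * scale for c in codes]


# Hypothetical slice of a key vector, for illustration only.
key_slice = [0.5, -1.27, 0.0, 1.0, 0.031]
codes, scale = quantize_int8(key_slice)
restored = dequantize_int8(codes, scale)

worst_error = max(abs(a - b) for a, b in zip(key_slice, restored))
print("codes:", codes)
print("worst absolute error:", worst_error)
```

Each value now needs one byte instead of two, plus one shared scale per vector, and the round-trip error for any element is at most half a quantization step (`scale / 2`).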

The concrete result for infrastructure operators: on the same GPU you can run the same model with roughly twice the context length (for example, moving from 4096 to 8192 context tokens without adding hardware) or serve roughly twice as many users in parallel, provided the KV cache rather than the model weights is what limits memory. Both translate directly to lower cost per generated response. vLLM, TGI, and TensorRT-LLM all offer KV cache quantization as a built-in option.
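The trade-off between context length and concurrency can be sketched with simple arithmetic. The 40 GiB cache budget, the per-token sizes (derived from an assumed model with 32 layers, 32 KV heads, and head dimension 128), and the helper name `max_concurrent_sequences` are all illustrative assumptions, and the calculation ignores real-world effects such as paging granularity and scale-factor overhead.

```python
def max_concurrent_sequences(kv_budget_bytes, bytes_per_token, context_len):
    """How many full-length sequences fit in a given KV cache memory budget."""
    return kv_budget_bytes // (bytes_per_token * context_len)


GIB = 1024 ** 3
KV_BUDGET = 40 * GIB  # hypothetical GPU memory left over for the KV cache

# Hypothetical per-token cache size: 2 (K and V) * 32 layers * 32 heads * 128 dims.
ELEMS_PER_TOKEN = 2 * 32 * 32 * 128
FP16_PER_TOKEN = ELEMS_PER_TOKEN * 2  # 2 bytes per element
FP8_PER_TOKEN = ELEMS_PER_TOKEN * 1   # 1 byte per element

fp16_users_4k = max_concurrent_sequences(KV_BUDGET, FP16_PER_TOKEN, 4096)
fp8_users_4k = max_concurrent_sequences(KV_BUDGET, FP8_PER_TOKEN, 4096)
fp8_users_8k = max_concurrent_sequences(KV_BUDGET, FP8_PER_TOKEN, 8192)

print("FP16, 4096-token context:", fp16_users_4k, "sequences")
print("FP8,  4096-token context:", fp8_users_4k, "sequences")
print("FP8,  8192-token context:", fp8_users_8k, "sequences")
```

With these assumed numbers, the 8-bit cache fits either twice as many 4096-token sequences as FP16, or the same number of sequences at 8192 tokens each.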

Companies

vLLM Team, NVIDIA, HuggingFace

Tags

KV cache quantization, FP8, INT8, inference, memory optimization, serving

Sources