In practice
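It is why generating the tenth token costs less than the first: the first step has to process the whole prompt, while each later step reads the keys and values of earlier tokens from the cache instead of recomputing them. The cache consumes a lot of VRAM and grows linearly with context length and with the number of concurrent requests, so it is often the real bottleneck when serving many users in parallel. Optimizing it (paging, quantization) is central to cutting inference cost.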
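To make the VRAM claim concrete, here is a back-of-the-envelope sketch of KV cache size for standard multi-head attention. The helper name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, not figures from any entry listed here; plug in the numbers for your own model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing one key tensor and
    one value tensor per layer. bytes_per_elem is 2 for FP16/BF16
    and 1 for FP8/INT8.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed model shape, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)

fp16 = kv_cache_bytes(**cfg, seq_len=4096, bytes_per_elem=2)
fp8 = kv_cache_bytes(**cfg, seq_len=4096, bytes_per_elem=1)

print(f"FP16, 4096 tokens: {fp16 / 2**30:.1f} GiB per request")
print(f"FP8,  4096 tokens: {fp8 / 2**30:.1f} GiB per request")
```

Under these assumptions a single 4096-token request holds 2 GiB of cache in FP16 and half that in FP8, which is why quantizing the cache roughly doubles how many requests fit on one GPU.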
Seen in the wild
4 entries mentioning it
- (Medium) FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion
- (High) KV Cache Quantization FP8/INT8: Double User Density per GPU
- (High) Automatic Prefix Caching in vLLM: Shared KV Cache Across Requests for Near-Zero TTFT
- (High) vLLM: 24x LLM throughput with PagedAttention from UC Berkeley