Inference · Intermediate · Also known as: Key-Value Cache, Cache chiavi-valori (Italian)

KV Cache

/kay-vee cache/

A region of GPU memory that stores the attention key and value tensors computed for tokens already processed, so the model does not recompute them for every new token it generates.
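The idea can be sketched with a toy single-head attention layer in plain Python. Everything here is hypothetical and chosen for illustration (the `ToyKVCache` class, the 2x2 weight matrices, the four input vectors); real models use many layers and heads and run on the GPU. The sketch only shows that caching keys and values gives the same output as recomputing them, with far fewer projections.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over all given positions.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class ToyKVCache:
    """Hypothetical single-head cache: one key and one value vector per seen token."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens whose K/V were actually computed

    def step(self, x):
        # Only the newest token is projected; earlier K/V come from the cache.
        self.keys.append(matvec(self.Wk, x))
        self.values.append(matvec(self.Wv, x))
        self.projections += 1
        return attend(matvec(self.Wq, x), self.keys, self.values)

def step_without_cache(Wq, Wk, Wv, tokens):
    # Baseline: recompute K and V for every token seen so far.
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    return attend(matvec(Wq, tokens[-1]), keys, values), len(tokens)

# Illustrative weights and inputs (assumptions, not taken from any real model).
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, 2.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cache = ToyKVCache(Wq, Wk, Wv)
recomputed = 0
for i in range(len(tokens)):
    cached_out = cache.step(tokens[i])
    plain_out, n = step_without_cache(Wq, Wk, Wv, tokens[: i + 1])
    recomputed += n
    # Both paths must agree on the attention output.
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, plain_out))

print(cache.projections, recomputed)
```

With four tokens the cached path projects keys and values 4 times, while the uncached baseline does it 1 + 2 + 3 + 4 = 10 times. The gap grows quadratically with sequence length, which is the work the cache saves.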


In practice

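It is why each token after the first is cheap to generate: the prompt's keys and values are computed once, and every later step processes only the newest token instead of redoing the whole sequence. The cache consumes a lot of VRAM and grows linearly with context length and with the number of concurrent requests, so it is often the real bottleneck when serving many users in parallel. Optimizing it (paging, quantization) is central to cutting inference cost.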
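A back-of-the-envelope sizing shows why the cache dominates VRAM. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. The function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128) are hypothetical values picked for illustration, not figures from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem):
    # The factor 2 covers one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model shape, for illustration only.
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

# One request with a 4096-token context, FP16 (2 bytes) vs. FP8 (1 byte).
fp16 = kv_cache_bytes(**cfg, seq_len=4096, batch_size=1, bytes_per_elem=2)
fp8 = kv_cache_bytes(**cfg, seq_len=4096, batch_size=1, bytes_per_elem=1)

print(fp16 / 1024**3, "GiB in FP16")  # 2.0 GiB
print(fp8 / 1024**3, "GiB in FP8")    # 1.0 GiB
```

Under these assumptions a single 4096-token request needs 2 GiB of cache in FP16, and the total scales linearly with both context length and batch size. Storing the cache in 8 bits halves that footprint, so roughly twice as many requests fit in the same memory.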

Seen in the wild

4 entries mentioning it

January 22, 2025
FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion
Medium

September 10, 2024
KV Cache Quantization FP8/INT8: Double User Density per GPU
High

March 20, 2024
Automatic Prefix Caching in vLLM: Shared KV Cache Across Requests for Near-Zero TTFT
High

February 9, 2023
vLLM: 24x LLM throughput with PagedAttention from UC Berkeley
High

