Inference Intermediate Also known as: Automatic Prefix Caching · APC · Prompt Caching

Prefix Caching

Prefix caching is an inference technique that reuses the already-computed KV cache for common prompt prefixes across multiple requests. Rather than recomputing attention keys and values for the same sequences (e.g., an identical system prompt), the system stores these activations in memory and retrieves them directly. This dramatically reduces latency for the shared prefix, bringing it close to zero. It is implemented in vLLM as 'Automatic Prefix Caching' and in Anthropic and OpenAI cloud services as a reduced-cost billed feature.

ShareLinkedIn X

In practice

A developer serving a chatbot with a fixed 2,000-token system prompt benefits immediately from prefix caching: only the first request computes that prefix, and all subsequent ones read it from cache. In vLLM it is enabled with `--enable-prefix-caching`; in the Anthropic API, prefix caching must be explicitly declared with `cache_control`. For RAG applications with shared documents, you structure the prompt by placing the document before the questions to maximize cache reuse.

Seen in the wild

2 entries mentioning it

← All terms

In practice

Related terms

Seen in the wild