Automatic Prefix Caching in vLLM: Shared KV Cache Across Requests for Near-Zero TTFT
In one sentence: vLLM v0.3.3 introduces Automatic Prefix Caching, which reuses the KV cache for prefixes shared across different requests, nearly eliminating the initial response delay for system prompts and previously processed RAG documents.
When an LLM responds to a question, it must process all the text given as context (the system prompt, conversation history, retrieved documents) before generating a single word. This step is called prefill. It computes the attention keys and values (the KV cache) for every context token, and for long contexts it can require significant time and memory.
Prefix caching works like a photographic memory for work already done. Imagine answering a thousand different questions about the same long document, where the document is always the same but the questions change. Without prefix caching, the model re-reads and reprocesses the entire document from scratch each time. With prefix caching, the result of processing the document is saved and reused for all subsequent questions.
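The idea can be sketched with a toy cache. This is a minimal illustration, not vLLM's implementation: the class `ToyPrefixCache`, the constant `BLOCK_SIZE`, and the token values are all hypothetical, and a string stands in for the real KV tensors. It assumes the cache works on fixed-size blocks of tokens and keys each block by the entire prefix up to and including it, so a block is reused only when everything before it matches too.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cache block (hypothetical; kept small for the demo)


class ToyPrefixCache:
    """Toy model of prefix caching: full token blocks keyed by their whole prefix."""

    def __init__(self) -> None:
        # Key: all tokens up to the end of a block. Value: stand-in for KV data.
        self.blocks: Dict[Tuple[int, ...], str] = {}

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for one request."""
        reused = 0
        computed = 0
        n_full = len(tokens) // BLOCK_SIZE * BLOCK_SIZE
        for end in range(BLOCK_SIZE, n_full + 1, BLOCK_SIZE):
            # The key covers the full prefix, so a hit here implies
            # that every earlier block was a hit as well.
            key = tuple(tokens[:end])
            if key in self.blocks:
                reused += BLOCK_SIZE
            else:
                self.blocks[key] = f"kv[{end - BLOCK_SIZE}:{end}]"
                computed += BLOCK_SIZE
        # A trailing partial block is always recomputed in this toy model.
        computed += len(tokens) - n_full
        return reused, computed


document = list(range(100, 112))   # 12 tokens shared by both requests
question_1 = [1, 2, 3]
question_2 = [7, 8, 9, 10, 11]

cache = ToyPrefixCache()
first = cache.prefill(document + question_1)
second = cache.prefill(document + question_2)
print("first request  (reused, computed):", first)
print("second request (reused, computed):", second)
```

The first request computes every token. The second reuses the three document blocks and computes only the tokens of its own question. In vLLM itself the feature is exposed through the `enable_prefix_caching` engine argument. Eviction and memory limits are left out of this sketch.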
In practice, the first time a system prompt, document, or other long prefix is processed, it takes the normal amount of time. From the second request onward, even when it comes from a different user, the prefill for that portion is skipped and only the new tokens after it are processed. The prefix must match exactly from the first token for the cached result to apply. For chatbots with long system prompts, RAG systems that repeatedly retrieve the same documents, or applications where many users ask questions about a common context, TTFT (time to first token) can drop by 90% or more.
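A back-of-the-envelope estimate shows when a reduction of that size is plausible. The sketch below assumes, as a simplification, that prefill time grows in proportion to the number of tokens that must be processed, and it ignores queueing, scheduling, and the cost of attending to cached tokens. The function name and the token counts are hypothetical and are not measurements.

```python
def prefill_fraction_saved(cached_prefix_tokens: int, total_prompt_tokens: int) -> float:
    """Fraction of prefill work skipped when the prefix is already cached.

    Assumes prefill cost is proportional to the number of tokens processed.
    """
    if total_prompt_tokens <= 0:
        raise ValueError("total_prompt_tokens must be positive")
    if not 0 <= cached_prefix_tokens <= total_prompt_tokens:
        raise ValueError("cached prefix must fit inside the prompt")
    return cached_prefix_tokens / total_prompt_tokens


# Hypothetical request: a 2,000-token shared document plus a 100-token question.
saved = prefill_fraction_saved(2000, 2100)
print(f"prefill work skipped: {saved:.1%}")
```

Under these assumptions the saving passes 90% only when the shared prefix makes up more than nine tenths of the prompt, which is the case for long system prompts and long retrieved documents followed by short questions.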
Companies
vLLM Team
Tags
Sources