Inference | Advanced | Also known as: PagedAttention

Paged Attention

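A technique that splits the KV cache into small fixed-size blocks managed like virtual memory pages, so a request's cache no longer needs one contiguous allocation. This cuts the VRAM lost to fragmentation and over-reservation across concurrent requests.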
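The page-table analogy can be made concrete with a small sketch. It is an illustration only, not vLLM's actual implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, and the block size of 16 tokens is an assumed value. Each request keeps a block table that maps logical token positions to physical blocks, and a new block is taken from the shared pool only when the previous one is full.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


@dataclass
class BlockAllocator:
    """Hypothetical pool of fixed-size physical KV blocks."""
    num_blocks: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        # Initially every physical block is free.
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """One request: logical token positions mapped to physical blocks."""
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, pos: int) -> tuple:
        """Translate a logical token position to (physical block, offset)."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free_all(self, allocator: BlockAllocator) -> None:
        # Return every block to the shared pool when the request finishes.
        for block_id in self.block_table:
            allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0
```

Because the blocks of one sequence need not be adjacent in memory, a finished request returns its blocks to the pool and any other request can reuse them immediately.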


In practice

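It is the core idea of vLLM and is now standard in modern inference servers. It lets the same GPU serve many more concurrent users because the server no longer reserves a large, mostly empty contiguous region for each request's maximum possible length; blocks are allocated only as tokens are generated. When picking a self-hosted runtime, treat paged attention support as a baseline requirement.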
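A rough back-of-the-envelope comparison shows where the savings come from. The sketch below counts reserved KV slots in tokens under two policies: one contiguous region per request sized for the maximum length, versus fixed-size blocks allocated on demand. The request lengths, the maximum length of 2048, and the block size of 16 are made-up illustrative numbers, and the function names are hypothetical.

```python
import math


def reserved_tokens_contiguous(lengths, max_len):
    """Slots reserved when every request gets a max_len-sized region."""
    return len(lengths) * max_len


def reserved_tokens_paged(lengths, block_size):
    """Slots reserved when each request holds only the blocks it has filled."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)


# Illustrative current lengths (in tokens) of four in-flight requests.
lengths = [100, 250, 37, 900]
used = sum(lengths)

contiguous = reserved_tokens_contiguous(lengths, max_len=2048)
paged = reserved_tokens_paged(lengths, block_size=16)

print(f"contiguous: {contiguous} slots reserved, {used / contiguous:.0%} used")
print(f"paged:      {paged} slots reserved, {used / paged:.0%} used")
```

With paging, the only waste is the unfilled tail of each request's last block, which is bounded by one block per request regardless of the maximum sequence length.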

Seen in the wild

4 entries mentioning it

July 2, 2025
vLLM v0.7: chunked prefill by default and a redesigned V1 engine
Medium

April 8, 2025
Continuous Batching for LLM Serving: survey and state of the art of Orca, vLLM, SGLang, TGI
Medium

January 22, 2025
FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion
Medium

February 9, 2023
vLLM: 24x LLM throughput with PagedAttention from UC Berkeley
High
