AImpact
Inference · Advanced · Also known as: PagedAttention

Paged Attention

A technique that splits each request's KV cache into small fixed-size blocks, managed like virtual memory pages, which cuts the VRAM wasted on fragmentation and over-reservation across requests.
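The paging idea can be sketched as a small data structure. This is a minimal illustration, not the implementation of any particular runtime: the names `BlockAllocator`, `Sequence`, and `append_token`, and the block size of 16 tokens, are assumptions chosen for the example. Each request keeps a block table that maps logical block positions to physical block ids drawn from a shared pool, much as a page table maps virtual pages to physical frames.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per KV block (assumed value, for illustration only)


@dataclass
class BlockAllocator:
    """Hypothetical pool of fixed-size physical KV blocks shared by all requests."""
    num_blocks: int
    free: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Every physical block starts out free.
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks: List[int]) -> None:
        # Return a finished request's blocks to the pool for reuse.
        self.free.extend(blocks)


@dataclass
class Sequence:
    """One request: block_table[i] is the physical block holding logical block i."""
    block_table: List[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> Tuple[int, int]:
        # A new physical block is claimed only when the previous one is full,
        # so at most one partially filled block exists per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        slot = (self.block_table[-1], self.num_tokens % BLOCK_SIZE)
        self.num_tokens += 1
        return slot
```

The physical blocks of one request do not need to be adjacent in memory; the block table is what lets the attention kernel find them. When a request finishes, `release` hands its blocks back so another request can use them immediately.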


In practice

It is the core idea behind vLLM and is now standard in modern inference servers. Because memory is allocated block by block as a sequence grows, the server no longer has to reserve a large, mostly empty region for each request up front, so the same GPU can serve many more concurrent users. When picking a self-hosted runtime, treat paged attention support as a baseline requirement.
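A rough way to see the saving is to count reserved token slots under both schemes. The sketch below assumes a baseline that reserves space for the maximum sequence length per request, and compares it with block-granular allocation. The function names, the request lengths, the 2048-token limit, and the block size of 16 are all made-up values for illustration, not measurements from a real server.

```python
from typing import List


def contiguous_slots(lengths: List[int], max_len: int) -> int:
    """Token slots reserved when every request gets a max_len-sized region."""
    return len(lengths) * max_len


def paged_slots(lengths: List[int], block_size: int) -> int:
    """Token slots reserved when each request gets only the blocks it needs."""
    # Round each length up to a whole number of blocks.
    return sum(((n + block_size - 1) // block_size) * block_size for n in lengths)


# Hypothetical workload: four requests of very different lengths.
lengths = [100, 350, 40, 900]
used = sum(lengths)
reserved_contiguous = contiguous_slots(lengths, max_len=2048)
reserved_paged = paged_slots(lengths, block_size=16)

print(f"tokens actually stored: {used}")
print(f"reserved (contiguous):  {reserved_contiguous}")
print(f"reserved (paged):       {reserved_paged}")
```

With paging, the waste per request is bounded by one partially filled block, whereas the contiguous baseline wastes everything between a request's actual length and the maximum.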
