In practice
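Paged attention is the core idea behind vLLM and is now standard in modern inference servers. It lets the same GPU serve many more concurrent users because KV-cache memory is allocated in small blocks on demand, rather than reserved up front as large, mostly empty regions sized for the longest possible request. When picking a self-hosted runtime, treat support for paged attention as a baseline requirement.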
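To make the memory argument concrete, here is a minimal sketch of on-demand block allocation. It is a toy model, not vLLM's actual API: `PagedKVAllocator`, `contiguous_capacity`, `paged_capacity`, and the block size of 16 tokens are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


@dataclass
class PagedKVAllocator:
    """Toy allocator (hypothetical): hands out fixed-size blocks on demand."""

    num_blocks: int
    free_blocks: list[int] = field(init=False)
    block_tables: dict[str, list[int]] = field(default_factory=dict)
    seq_lens: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Account for one more token; grab a new block only when the last one is full."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * BLOCK_SIZE:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


def contiguous_capacity(total_tokens: int, max_seq_len: int) -> int:
    """Sequences that fit if each one reserves max_seq_len token slots up front."""
    return total_tokens // max_seq_len


def paged_capacity(total_tokens: int, actual_len: int) -> int:
    """Sequences that fit if each one only holds the blocks it actually uses."""
    blocks_per_seq = -(-actual_len // BLOCK_SIZE)  # ceiling division
    return (total_tokens // BLOCK_SIZE) // blocks_per_seq
```

Under these assumptions, a cache with room for 32,768 tokens holds 16 sequences when each reserves a 2,048-token region, versus 157 sequences of 200 tokens each when blocks are handed out as needed. Real servers add details this sketch omits, such as block sharing between sequences and preemption when the pool runs dry.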
Seen in the wild
4 entries mentioning it
- [Medium] vLLM v0.7: chunked prefill by default and a redesigned V1 engine
- [Medium] Continuous Batching for LLM Serving: survey and state of the art of Orca, vLLM, SGLang, TGI
- [Medium] FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion
- [High] vLLM: 24x LLM throughput with PagedAttention from UC Berkeley