AImpact
Medium AI Infrastructure · 1 min read

FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion

In one sentence: UW and MIT release FlashInfer 0.2, a CUDA library for attention in LLM serving with native paged KV cache, variable-length sequence support, RoPE fusion, and a 1.5x speedup over vLLM on long prefill on an A100.

Verified Official source

Serving language models in production has different needs from training: requests arrive continuously with varying lengths, the KV cache must be managed efficiently, and RoPE (rotary position embedding, the position-encoding scheme used by LLaMA and other models) must be applied to queries and keys at every step.
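To make the RoPE step concrete, here is a minimal pure-Python sketch of the rotation. It is an illustration only, not FlashInfer code: the function names `rope` and `dot` are hypothetical, the base of 10000 is the commonly used default, and the pairing of adjacent elements is one of several conventions in use (some implementations pair the first half of the vector with the second half instead).

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `pos`.

    Hypothetical helper for illustration; assumes len(vec) is even and
    uses the adjacent-pair convention.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency: base^(-i/d) for element index i.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.1, -0.3, 0.7, 0.2]
k = [0.5, 0.4, -0.2, 0.9]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, which is the property that makes RoPE a relative position scheme: the attention score depends only on the distance between the query and key positions. Because the rotation is a cheap elementwise operation on queries and keys, it is a natural candidate for folding into the attention kernel rather than running as a separate pass.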

FlashInfer is a CUDA library specialized for these serving-specific access patterns, which differ from the training workloads where FlashAttention excels. Version 0.2 introduces native support for paged KV cache, where memory is divided into fixed-size blocks that need not be contiguous (much like virtual memory pages in an operating system), and fuses the RoPE computation directly into the attention kernel.
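The paged layout can be sketched in a few lines of Python. This is a toy model under stated assumptions, not FlashInfer's API or data layout: the class `PagedKVCache`, its methods, and the tuple placeholders standing in for key/value tensors are all hypothetical, chosen only to show how a per-sequence page table maps logical token positions to physical pages.

```python
class PagedKVCache:
    """Toy paged KV cache: a shared pool of fixed-size pages plus a
    per-sequence page table (hypothetical illustration)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.pages = [[None] * page_size for _ in range(num_pages)]
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # seq_id -> list of physical page ids
        self.seq_len = {}     # seq_id -> number of cached tokens

    def append(self, seq_id, kv):
        table = self.page_table.setdefault(seq_id, [])
        n = self.seq_len.get(seq_id, 0)
        if n % self.page_size == 0:
            # Current pages are full (or none exist): grab any free page.
            if not self.free_pages:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_pages.pop())
        page = table[n // self.page_size]
        self.pages[page][n % self.page_size] = kv
        self.seq_len[seq_id] = n + 1

    def gather(self, seq_id):
        # Walk the page table to read entries back in logical order.
        table = self.page_table.get(seq_id, [])
        n = self.seq_len.get(seq_id, 0)
        return [self.pages[table[i // self.page_size]][i % self.page_size]
                for i in range(n)]

    def release(self, seq_id):
        # Return all pages of a finished request to the pool.
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)

cache = PagedKVCache(num_pages=4, page_size=2)
# Two requests grow in an interleaved fashion, as in continuous batching.
for t in range(3):
    cache.append("a", ("a", t))
    cache.append("b", ("b", t))
```

After the loop, each sequence owns two pages that are not adjacent in the pool, yet `gather` still returns its entries in order. In a real kernel the lookup through the page table happens inside the attention computation itself, which is what "native" paged support refers to; the benefit of the layout is that memory is allocated one page at a time as a request grows and is returned to the pool as soon as the request finishes.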

The result is a 1.5x speedup over the standard vLLM implementation on long prefill (requests whose prompts run to thousands of tokens) on an A100 GPU. FlashInfer has been adopted as the primary backend in SGLang and is available as an optional backend in vLLM.

Companies

University of Washington, MIT

Tools

FlashInfer, vLLM, SGLang, PyTorch, CUDA

Tags

FlashInfer, Attention, KV Cache, Paged Attention, RoPE, vLLM, SGLang, University of Washington, MIT, LLM Serving

Sources