FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion
In one sentence: the University of Washington and MIT release FlashInfer 0.2, a CUDA library for attention in LLM serving with native paged KV cache support, variable-length sequence handling, RoPE fusion, and a 1.5x speedup over vLLM on long prefill on an A100.
Serving language models in production has different needs from training: requests arrive continuously with varying lengths, the KV cache must be managed efficiently, and RoPE (rotary position embedding, the relative-position mechanism used by LLaMA and other models) must be applied at every step.
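To make the "relative position" property of RoPE concrete, here is a minimal pure-Python sketch. It is not FlashInfer code: the function names, the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration. It rotates each pair of dimensions by an angle proportional to the token position, then checks that the query-key score depends only on the distance between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d).

    Illustrative only: pairing adjacent dimensions is one common convention;
    some implementations pair dimension i with i + d/2 instead.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]   # hypothetical query vector
k = [1.5, 0.5, -0.75, 1.0]   # hypothetical key vector

# The same offset (3 positions apart) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, which is why the rotation has to be applied to queries and keys at their actual positions on every decoding step.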
FlashInfer is a CUDA library specialized for these serving-specific access patterns, as opposed to the training workloads where FlashAttention excels. Version 0.2 introduces native support for paged KV cache, in which memory is divided into non-contiguous blocks much like pages in an operating system's virtual memory, and fuses the RoPE computation directly into the attention kernel.
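The paging idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not FlashInfer's data layout or API: the class `PagedKVCache`, its method names, and the tuple placeholders standing in for key/value tensors are all hypothetical. Each sequence owns a page table that maps its logical token positions to whatever physical pages happened to be free.

```python
class PagedKVCache:
    """Toy paged KV cache: fixed-size pages plus a per-sequence page table."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.pages = [[None] * page_size for _ in range(num_pages)]
        self.page_table = {}  # seq_id -> list of physical page ids
        self.seq_len = {}     # seq_id -> number of tokens stored

    def append(self, seq_id, kv):
        n = self.seq_len.get(seq_id, 0)
        table = self.page_table.setdefault(seq_id, [])
        if n % self.page_size == 0:  # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free pages")
            table.append(self.free_pages.pop())
        page = table[n // self.page_size]
        self.pages[page][n % self.page_size] = kv
        self.seq_len[seq_id] = n + 1

    def gather(self, seq_id):
        """Return the sequence's entries in logical order via the page table."""
        table = self.page_table.get(seq_id, [])
        n = self.seq_len.get(seq_id, 0)
        ps = self.page_size
        return [self.pages[table[t // ps]][t % ps] for t in range(n)]

    def free(self, seq_id):
        """Return a finished sequence's pages to the free list."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)

# Two requests growing in lockstep end up with interleaved physical pages.
cache = PagedKVCache(num_pages=4, page_size=2)
for t in range(3):
    cache.append("a", ("a", t))
    cache.append("b", ("b", t))
```

Because the two requests allocate pages alternately, neither sequence occupies a contiguous range of pages, yet `gather` still returns each one in token order. An attention kernel with native paging support performs the equivalent lookup internally rather than first copying the cache into a contiguous buffer.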
The result is a 1.5x speedup over the standard vLLM implementation on long prefill (requests whose prompts run to thousands of tokens) on an A100 GPU. FlashInfer has been adopted as the primary backend by SGLang and is available as an optional backend in vLLM.
Companies
University of Washington, MIT
Tools
FlashInfer, vLLM, SGLang, PyTorch, CUDA