SGLang: up to 6.4x LLM throughput with RadixAttention and shared prefix caching
In one sentence
Stanford and LMSYS release SGLang, an LLM runtime that introduces RadixAttention to share cached prefix computations across requests, achieving up to 6.4x higher throughput than vLLM on tasks with common prefixes.
Many AI applications send a model requests that begin with the same content: system instructions, document context, conversation history. Without prefix caching, the runtime recomputes that shared part from scratch on every request, even though it has already processed it.
SGLang solves this with a simple but powerful idea: it stores the computations already done for common prefixes (the attention key-value cache) and reuses them for subsequent requests. It works like a memory of completed work, shared across all users of the system.
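To make the reuse idea concrete, here is a minimal sketch of a prefix cache keyed by token sequences. It is not SGLang's implementation: the names `Node` and `PrefixCache` are hypothetical, the structure is a plain per-token trie rather than a compressed radix tree, there is no eviction, and a string stands in for the cached attention state.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # One node per token; a real radix tree would compress token runs into edges.
    children: dict = field(default_factory=dict)
    kv: object = None  # placeholder for the cached state of this token


class PrefixCache:
    """Hypothetical toy cache: shares work across requests with a common prefix."""

    def __init__(self):
        self.root = Node()

    def match(self, tokens):
        """Return how many leading tokens already have cached state."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Cache state for all tokens; return how many had to be computed."""
        node, computed = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                # Stand-in for the real (expensive) model computation.
                node.children[tok] = Node(kv=f"kv({tok})")
                computed += 1
            node = node.children[tok]
        return computed


cache = PrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
first = cache.insert(system_prompt + ["What", "is", "SGLang", "?"])
second = cache.insert(system_prompt + ["Summarize", "this", "document", "."])
print(first, second)  # the second request only computes its unique suffix
```

In this toy run the first request computes all ten tokens, while the second reuses the six-token system prompt and computes only its four new tokens.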
On tasks where many requests share a long prefix (such as RAG, agents with fixed system prompts, and few-shot prompting), the result is up to 6.4x higher throughput than vLLM. The same amount of work gets done with fewer GPUs.
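A back-of-the-envelope count shows why long shared prefixes matter so much. The sketch below only counts prompt tokens that must be processed before generation starts. The workload numbers are hypothetical, the function name `prefill_tokens` is made up for illustration, and the model assumes the cached prefix is never evicted. It does not attempt to reproduce the reported 6.4x figure, which is an end-to-end throughput measurement.

```python
def prefill_tokens(num_requests, prefix_len, suffix_len, shared_cache):
    """Count prompt tokens processed before generation starts (toy model)."""
    if shared_cache:
        # The shared prefix is computed once; each request adds only its suffix.
        return prefix_len + num_requests * suffix_len
    # Every request recomputes the prefix plus its own suffix.
    return num_requests * (prefix_len + suffix_len)


# Hypothetical workload: 100 requests, 2,000-token shared prefix, 100-token suffixes.
without = prefill_tokens(100, 2000, 100, shared_cache=False)
with_cache = prefill_tokens(100, 2000, 100, shared_cache=True)
print(without, with_cache)
```

Under these assumed numbers, prompt processing drops from 210,000 tokens to 12,000. Real gains are smaller because decoding time, scheduling, and cache eviction are not modeled here.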
Companies
Stanford University, LMSYS
Tools
SGLang, RadixAttention