AImpact
Medium AI Infrastructure · 1 min read

SGLang: 6.4x LLM throughput with RadixAttention and shared prefix caching

In one sentence: Stanford and LMSYS release SGLang, an LLM runtime that introduces RadixAttention to share cached prefixes across different requests, achieving up to 6.4x higher throughput than vLLM on tasks with common prefixes.

Verified Official source

Many AI applications send requests that always start the same way: system instructions, document context, conversation history. Without prefix caching, the model recomputes that shared opening from scratch on every request, even if it has already processed exactly the same text before.

SGLang solves this with a simple but powerful idea: it stores the computations already done for common prefixes (the model's key-value cache) and reuses them for subsequent requests that begin the same way. It's like having a memory of work already completed, shared among all users of the system.
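To make the idea concrete, here is a minimal sketch of prefix reuse. It is a toy, not SGLang's implementation or API: it uses a plain trie with one node per token (a real radix tree compresses chains of nodes), it has no eviction, and an integer stands in for the cached model state. The names `Node`, `PrefixCache`, and `process` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    # Hypothetical stand-in for cached per-token state (e.g. a cache slot id).
    state: Optional[int] = None
    children: Dict[int, "Node"] = field(default_factory=dict)


class PrefixCache:
    """Toy prefix cache shared by all requests (assumption: one trie node per token)."""

    def __init__(self) -> None:
        self.root = Node()
        self.computed = 0  # tokens whose state had to be computed
        self.reused = 0    # tokens whose state was found in the cache

    def process(self, tokens: List[int]) -> int:
        """Walk the trie along `tokens` and return the length of the cached prefix."""
        node = self.root
        hits = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Cache miss: pretend to run the model for this token and store the result.
                child = Node(state=self.computed)
                node.children[tok] = child
                self.computed += 1
            else:
                # Cache hit: this token's state was produced by an earlier request.
                hits += 1
                self.reused += 1
            node = child
        return hits


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]  # made-up token ids for a shared system prompt

print(cache.process(system_prompt + [10, 11]))  # 0: first request, nothing cached yet
print(cache.process(system_prompt + [20]))      # 4: the shared prefix is reused
```

In this sketch the second request only has to compute one new token, because the four prompt tokens are already in the tree. A production system would additionally need to bound memory, for example by evicting entries that have not been used recently.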

The result on tasks where many requests share a long prefix (such as RAG, agents with fixed system prompts, and few-shot prompting) is up to 6.4x higher throughput than vLLM. In practice, the same amount of work can be completed with fewer GPUs.

Companies

Stanford University, LMSYS

Tools

SGLang, RadixAttention

Tags

SGLang, Stanford, RadixAttention, Prefix Caching, LLM Serving, Throughput

Sources