AImpact
High AI Infrastructure · 1 min read

Prefill/decode disaggregation: separate GPUs for low TTFT and high throughput

In one sentence: Prefill/decode disaggregation runs prompt processing and token generation on separate, dedicated GPUs, which lowers time to first token (TTFT) while keeping throughput high; major cloud providers have adopted it.

Verified Official source

When you use an AI chatbot, the model works in two distinct phases: first it reads and "digests" your whole question in one pass (prefill), then it generates the response one token at a time (decode). The two phases stress the hardware differently: prefill is compute-intensive but brief, while decode runs for many small steps, each needing far less computation.

The problem is that placing them on the same GPU creates tradeoffs: optimizing for generation speed (decode) slows down initial processing (prefill) and vice versa. It's like using the same vehicle for both heavy freight and fast deliveries.
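To make the tradeoff concrete, here is a toy timing model of one ongoing response on a shared GPU. It assumes the scheduler runs a newly arrived prompt's prefill before the next decode step; the millisecond costs and the function name are hypothetical, chosen only for illustration.

```python
# Toy model: all timings are hypothetical, not measurements.
PREFILL_MS = 200.0     # assumed cost of processing one incoming prompt
DECODE_STEP_MS = 20.0  # assumed cost of generating one token


def token_gaps_shared(num_tokens, prefill_arrivals):
    """Return the wait before each output token of one ongoing response.

    prefill_arrivals holds the token indices at which another user's
    prompt arrives and is prefilled first on the same GPU.
    """
    gaps = []
    for i in range(num_tokens):
        gap = DECODE_STEP_MS
        if i in prefill_arrivals:
            gap += PREFILL_MS  # the decode step waits behind the prefill
        gaps.append(gap)
    return gaps


quiet = token_gaps_shared(5, set())  # no other prompts arrive
busy = token_gaps_shared(5, {2})     # one prompt arrives before token 2
print("worst gap, quiet GPU:", max(quiet), "ms")
print("worst gap, busy GPU:", max(busy), "ms")
```

In this sketch a single incoming prompt makes one token arrive after 220 ms instead of 20 ms. Giving decode priority instead would move the same delay onto the new user's first token.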

Disaggregation separates these phases onto different GPUs: "prefill" GPUs only process incoming prompts, then hand the intermediate state (the key-value cache) to "decode" GPUs that generate the responses. The time the user waits before seeing the first token (TTFT) can drop substantially, while overall throughput stays high.
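The handoff between the two GPU pools can be sketched in a few lines. This is a minimal illustration, not how vLLM, SGLang, or TensorRT-LLM implement it: the class names, the queue standing in for a GPU-to-GPU transfer, and the toy_next_token function replacing a real model are all assumptions.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass
class PrefillResult:
    request_id: int
    kv_cache: list    # stand-in for the per-token key/value tensors
    first_token: int


def toy_next_token(kv_cache):
    # Hypothetical stand-in for a model forward pass.
    return (sum(kv_cache) + len(kv_cache)) % 100


class PrefillWorker:
    """Runs on the prefill pool: processes the whole prompt in one pass."""

    def run(self, request_id, prompt_tokens):
        kv_cache = list(prompt_tokens)
        return PrefillResult(request_id, kv_cache, toy_next_token(kv_cache))


class DecodeWorker:
    """Runs on the decode pool: extends the received cache token by token."""

    def run(self, result, max_new_tokens):
        kv_cache = list(result.kv_cache)
        output = [result.first_token]
        while len(output) < max_new_tokens:
            kv_cache.append(output[-1])  # cache grows by one entry per token
            output.append(toy_next_token(kv_cache))
        return output


handoff = Queue()  # stands in for the cache transfer between GPUs
prefill, decode = PrefillWorker(), DecodeWorker()

handoff.put(prefill.run(request_id=1, prompt_tokens=[3, 1, 4]))
generated = decode.run(handoff.get(), max_new_tokens=4)
print(generated)
```

The first token is produced by the prefill worker, which is why TTFT depends only on the prefill pool. The price is the cache transfer, so real deployments need a fast link between the two pools.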

Companies

Microsoft Research, Google, ByteDance

Tools

vLLM, SGLang, TensorRT-LLM

Tags

Prefill, Decode, Disaggregation, TTFT, Cloud Scale, LLM Serving, Architecture

Sources