Infrastructure · Advanced
Also known as: Prefill-Decode Disaggregation · PD Disaggregation

Disaggregated Inference

Disaggregated inference is a serving architecture that physically separates the prefill phase (compute-bound: processes the entire prompt in parallel) from the decode phase (memory-bandwidth-bound: generates one token at a time) and assigns them to distinct GPU pools. The KV cache computed during prefill is transferred to the decode pool, which then generates the output. This separation eliminates 'prefill-decode interference', the resource contention that arises when both phases share the same GPUs and that reduces overall throughput. Publicly described in Mooncake, the serving architecture behind Moonshot AI's Kimi, it has been reported to yield throughput improvements of 5x or more in production. It is considered one of the most significant advances in LLM serving infrastructure in 2024-2025.
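Because the two pools are linked only by the KV cache transfer, the size of that cache largely determines the hand-off cost. The sketch below estimates it with the usual per-token accounting (one key and one value tensor per layer). The function names, model dimensions, and link bandwidth are illustrative assumptions, not figures from Mooncake or any specific deployment.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate KV cache size for one request.

    The leading factor 2 accounts for storing both keys and values in every layer.
    bytes_per_value=2 assumes FP16/BF16 storage (an assumption; quantized caches are smaller).
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def transfer_seconds(num_bytes, link_gbytes_per_s):
    """Idealized transfer time over a link with the given bandwidth in GB/s.

    Ignores latency, protocol overhead, and overlap with compute.
    """
    return num_bytes / (link_gbytes_per_s * 1e9)


# Hypothetical model: 80 layers, 8 KV heads, head dimension 128, 8192-token prompt.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=8192)
print(f"KV cache: {size / 2**30:.2f} GiB")
print(f"Transfer at an assumed 50 GB/s: {transfer_seconds(size, 50) * 1e3:.1f} ms")
```

With these assumed numbers the cache for a single long prompt is a few GiB, which is why the interconnect between the pools matters as much as the GPUs themselves.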


In practice

In a large-scale deployment, the infrastructure engineer configures a pool of 'prefill-only' GPUs (typically chosen for compute throughput, such as H100 SXM) and a separate 'decode-only' pool (typically chosen for memory bandwidth). An incoming request is routed to the prefill pool, which computes the KV cache for the prompt and transfers it over NVLink or InfiniBand to the decode pool, where the output tokens are generated. Open-source frameworks such as LMDeploy and some advanced vLLM configurations support this mode. Operational cost is higher because hardware is duplicated across the two pools, but TTFT (time-to-first-token) and throughput improve significantly.
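The request flow described above can be sketched as a small in-process simulation. This is a minimal illustration under stated assumptions, not the API of LMDeploy, vLLM, or Mooncake: every class and method name here (Router, PrefillWorker, DecodeWorker, and so on) is hypothetical, the scheduling policies are arbitrary choices, and the KV transfer is modeled as a plain object hand-off.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List, Tuple


@dataclass
class Request:
    request_id: str
    prompt_tokens: List[int]
    max_new_tokens: int


@dataclass
class KVCache:
    request_id: str
    num_tokens: int  # number of token positions currently cached


class PrefillWorker:
    """Hypothetical prefill-only worker: processes the whole prompt in one pass."""

    def __init__(self, name: str):
        self.name = name

    def prefill(self, req: Request) -> KVCache:
        # Placeholder for the compute-bound forward pass over the full prompt.
        return KVCache(req.request_id, len(req.prompt_tokens))


class DecodeWorker:
    """Hypothetical decode-only worker: emits one token per active request per step."""

    def __init__(self, name: str):
        self.name = name
        self.active: Dict[str, Tuple[KVCache, int]] = {}

    def admit(self, kv: KVCache, max_new_tokens: int) -> None:
        # Stands in for receiving the KV cache over NVLink or InfiniBand.
        self.active[kv.request_id] = (kv, max_new_tokens)

    def step(self) -> List[str]:
        """Run one decode iteration and return the ids of requests that finished."""
        finished = []
        for rid, (kv, remaining) in list(self.active.items()):
            kv.num_tokens += 1  # each generated token extends the cache by one position
            remaining -= 1
            if remaining == 0:
                finished.append(rid)
                del self.active[rid]
            else:
                self.active[rid] = (kv, remaining)
        return finished


class Router:
    """Round-robin over the prefill pool, least-loaded choice in the decode pool."""

    def __init__(self, prefill_pool: List[PrefillWorker], decode_pool: List[DecodeWorker]):
        self._prefill = cycle(prefill_pool)
        self.decode_pool = decode_pool

    def submit(self, req: Request) -> Tuple[str, str]:
        prefill_worker = next(self._prefill)
        kv = prefill_worker.prefill(req)
        decode_worker = min(self.decode_pool, key=lambda w: len(w.active))
        decode_worker.admit(kv, req.max_new_tokens)
        return prefill_worker.name, decode_worker.name
```

A real system would overlap the transfer with computation and schedule against SLOs such as TTFT; the sketch only shows that the two phases never share a worker.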

Seen in the wild

1 entry mentioning it

November 5, 2024

Mooncake: Disaggregated Prefill-Decode Inference for 525% More Throughput

High
