Mooncake: Disaggregated Prefill-Decode Inference for 525% More Throughput
In one sentence: Moonshot AI (Kimi) runs the prefill phase (compute-bound) and the decode phase (memory-bandwidth-bound) on separate, dedicated GPU pools and transfers the KV cache between them, achieving a 525% throughput improvement in production deployments.
When an LLM responds to a message, it does two very different things in sequence. First it reads the entire question at once, processing all prompt tokens in parallel. This phase is called prefill and is very compute-intensive, like a processor running at full capacity. Then it generates the response one token at a time. This phase is called decode and is limited by how fast memory can be read, not by computational power.
The problem with running both phases on the same GPU is that expensive hardware is used suboptimally: during decode the GPU's computational power is mostly wasted, and during prefill memory bandwidth is underutilized.
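The compute-bound versus memory-bound distinction can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte moved) for a single weight matrix multiplication. The following is a minimal sketch, not a measurement: the layer size, the 16-bit weights, and the hardware ridge point of about 333 FLOPs per byte (a hypothetical accelerator with 1000 TFLOP/s of compute and 3 TB/s of memory bandwidth) are all illustrative assumptions.

```python
def matmul_arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                                bytes_per_value: int = 2) -> float:
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) matmul."""
    # Each output element needs d_in multiply-adds, i.e. 2 * d_in FLOPs.
    flops = 2 * tokens * d_in * d_out
    # Bytes moved: the weight matrix, the input activations, the outputs.
    bytes_moved = bytes_per_value * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


def classify(intensity: float, ridge_flops_per_byte: float) -> str:
    """Roofline-style label: above the ridge point, compute is the limit."""
    return "compute-bound" if intensity >= ridge_flops_per_byte else "memory-bound"


# Assumed hardware: 1000 TFLOP/s and 3 TB/s (hypothetical figures).
RIDGE = 1000e12 / 3e12

# Prefill: a 4096-token prompt goes through the layer in one pass.
prefill_ai = matmul_arithmetic_intensity(tokens=4096, d_in=8192, d_out=8192)
# Decode: one new token per step, so the weights are re-read for little work.
decode_ai = matmul_arithmetic_intensity(tokens=1, d_in=8192, d_out=8192)

print(f"prefill: {prefill_ai:.1f} FLOPs/byte -> {classify(prefill_ai, RIDGE)}")
print(f"decode:  {decode_ai:.3f} FLOPs/byte -> {classify(decode_ai, RIDGE)}")
```

Under these assumptions, prefill lands far above the ridge point and a single-sequence decode step lands far below it. Batching many decode requests raises the decode intensity, but the asymmetry between the two phases remains.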
Moonshot AI solved this by physically separating the two types of work. One GPU pool (the prefill pool) only does the question-understanding work, which is compute-intensive. Another pool (the decode pool) only does response generation, which is memory-intensive. Between the two, the intermediate result of the prefill computation, the KV cache, is transferred over a fast network. This allows each pool to be sized for its own type of work, without compromise. The result in Kimi's production deployment was a 525% increase in total system throughput.
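To get a feel for what the handoff between the pools costs and how pool sizing might be reasoned about, here is a back-of-the-envelope sketch. It is not Mooncake's actual implementation: the model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values), the 50 GB/s link, the per-request GPU times, and the helper names `kv_cache_bytes`, `transfer_seconds`, and `split_pool` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer shape, used only for this estimate."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes 16-bit KV cache entries


def kv_cache_bytes(shape: ModelShape, prompt_tokens: int) -> int:
    """Size of the KV cache the prefill pool must ship to the decode pool."""
    # Factor 2: one key and one value vector per token, layer, and KV head.
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_value)
    return per_token * prompt_tokens


def transfer_seconds(num_bytes: int, link_gbytes_per_s: float) -> float:
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return num_bytes / (link_gbytes_per_s * 1e9)


def split_pool(total_gpus: int, prefill_gpu_s: float, decode_gpu_s: float) -> tuple[int, int]:
    """Split GPUs in proportion to the GPU-seconds each phase needs per request."""
    share = prefill_gpu_s / (prefill_gpu_s + decode_gpu_s)
    # Keep at least one GPU in each pool.
    prefill = max(1, min(total_gpus - 1, round(total_gpus * share)))
    return prefill, total_gpus - prefill


shape = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128)
size = kv_cache_bytes(shape, prompt_tokens=8192)
print(f"KV cache for an 8192-token prompt: {size / 2**30:.2f} GiB")
print(f"Transfer at 50 GB/s: {transfer_seconds(size, 50) * 1000:.1f} ms")
print("Pools for 16 GPUs (prefill, decode):", split_pool(16, 1.0, 3.0))
```

The proportional split is the simplest possible sizing rule. A real scheduler would also have to account for latency targets, batching, and load variation, none of which this sketch models.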
Companies
Moonshot AI
Tools
—
Tags
Sources