DeepSpeed-FastGen: Dynamic SplitFuse scheduling for 2.3x throughput over vLLM in production
In one sentence: Microsoft's DeepSpeed team releases FastGen via MII; its Dynamic SplitFuse scheduling for LLM serving achieves 2.3x the throughput of vLLM on production chat workloads and is optimized for Azure H100.
When many users hit an AI service at the same time, the server must handle requests of very different lengths: some prompts are short, while some answers are very long. Earlier systems such as vLLM use continuous batching, which is much better than static batching but still leaves room for improvement.
DeepSpeed-FastGen introduces "Dynamic SplitFuse": instead of handling the prefill phase (prompt processing) and the decoding phase (token generation) as separate blocks, it splits and mixes them dynamically so that each forward pass carries a near-constant amount of work and the GPU stays well utilized. Long prompts are "split" into chunks processed over several forward passes so they do not block decoding requests, and short prompts are "fused" with other work to fill each pass.
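The idea can be illustrated with a toy scheduler. This is a minimal sketch, not the DeepSpeed-FastGen implementation: the `Request` class, the `schedule` function, and the `token_budget` parameter are hypothetical names chosen for illustration, and the model ignores real concerns such as KV-cache memory and the fact that the last prefill chunk also yields the first output token. It only shows how decode tokens and prompt chunks can share one fixed per-pass budget.

```python
from dataclasses import dataclass, replace


@dataclass
class Request:
    # Hypothetical request model, for illustration only.
    name: str
    prompt_tokens: int   # prompt tokens still waiting for prefill
    decode_tokens: int   # output tokens still to be generated


def schedule(requests, token_budget):
    """Return a list of forward passes; each pass is a list of
    (request name, phase, token count) tuples whose token counts
    sum to at most token_budget."""
    # Work on copies so the caller's objects are left untouched.
    pending = [replace(r) for r in requests
               if r.prompt_tokens > 0 or r.decode_tokens > 0]
    passes = []
    while pending:
        budget = token_budget
        batch = []
        # Requests that have finished prefill decode one token per pass.
        for req in pending:
            if budget > 0 and req.prompt_tokens == 0 and req.decode_tokens > 0:
                batch.append((req.name, "decode", 1))
                req.decode_tokens -= 1
                budget -= 1
        # The rest of the budget is filled with prompt chunks: a long
        # prompt is split across passes, short ones are fused together.
        for req in pending:
            if budget == 0:
                break
            if req.prompt_tokens > 0:
                chunk = min(req.prompt_tokens, budget)
                batch.append((req.name, "prefill", chunk))
                req.prompt_tokens -= chunk
                budget -= chunk
        passes.append(batch)
        pending = [r for r in pending
                   if r.prompt_tokens > 0 or r.decode_tokens > 0]
    return passes


passes = schedule(
    [Request("short", prompt_tokens=3, decode_tokens=3),
     Request("long", prompt_tokens=20, decode_tokens=2)],
    token_budget=8,
)
for i, batch in enumerate(passes, start=1):
    print(i, batch)
```

In this toy run the short request starts decoding in the second pass while the long prompt is still being prefilled in chunks, which is the behavior the paragraph above describes: the long prompt never monopolizes a forward pass.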
The result is 2.3x higher throughput than vLLM on production chat benchmarks, with lower latency, especially for short requests. DeepSpeed-FastGen is integrated into Microsoft's MII (Model Implementations for Inference) library and specifically optimized for Azure's H100 clusters.
Companies
Microsoft, DeepSpeed Team
Tools
DeepSpeed, DeepSpeed-MII, FastGen, PyTorch
Tags
Sources