AImpact
Medium AI Infrastructure · 1 min read

DeepSpeed-FastGen: Dynamic SplitFuse scheduling for 2.3x throughput over vLLM in production

In one sentence: Microsoft's DeepSpeed team has released FastGen through DeepSpeed-MII, whose Dynamic SplitFuse scheduling for LLM serving reaches up to 2.3x the throughput of vLLM on production chat workloads and is optimized for Azure H100.

Verified Official source

When many users query an AI service at the same time, the server must handle requests of very different lengths: some prompts are short, while some answers are very long. Earlier systems such as vLLM use continuous batching, which is much better than static batching but still leaves room for improvement.
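To make the gap between static and continuous batching concrete, here is a minimal toy simulation. It is only a sketch under simplifying assumptions: every request needs a known number of decoding steps, a batch holds a fixed number of slots, and prefill cost is ignored. The function names are hypothetical and do not come from vLLM or DeepSpeed.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Continuous batching: a freed slot is refilled at the next decoding step."""
    waiting = deque(output_lengths)
    running = []  # remaining decoding steps of the requests currently in the batch
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each running request produces one token; finished ones leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# One long answer (10 tokens) mixed with short ones (2 tokens), two slots.
lengths = [2, 10, 2, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

In this toy workload the static scheduler spends most of the first batch with one slot idle while the long answer finishes, whereas the continuous scheduler keeps refilling that slot with short requests.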

DeepSpeed-FastGen introduces "Dynamic SplitFuse": instead of running the prefill phase (prompt processing) and the decoding phase (token generation) as separate blocks, the scheduler splits long prompts into smaller chunks and fuses them with ongoing decoding steps in the same forward pass, which keeps GPU utilization consistently high. Because long prefills are split across several passes, they no longer stall requests that are already generating tokens.
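The split-and-fuse idea can be illustrated with a small scheduling sketch. This is not DeepSpeed's implementation: the `Request` class, the `splitfuse_schedule` function and the fixed `token_budget` per forward pass are assumptions made for illustration, and the sketch treats the first output token as a separate decode step. Each pass first gives one token to every request that is already decoding, then fills the remaining budget with chunks of pending prompts.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record (not a DeepSpeed type)."""
    name: str
    prompt_tokens: int  # prompt tokens still waiting for prefill
    decode_tokens: int  # output tokens still to be generated


def splitfuse_schedule(requests, token_budget):
    """Return forward passes as lists of (request name, phase, token count)."""
    if token_budget < 1:
        raise ValueError("token_budget must be at least 1")
    passes = []
    while any(r.prompt_tokens > 0 or r.decode_tokens > 0 for r in requests):
        budget = token_budget
        batch = []
        # 1) Requests whose prompt is fully processed decode one token each.
        for r in requests:
            if r.prompt_tokens == 0 and r.decode_tokens > 0 and budget > 0:
                batch.append((r.name, "decode", 1))
                r.decode_tokens -= 1
                budget -= 1
        # 2) Remaining budget is filled with prompt chunks (the "split" part).
        for r in requests:
            if r.prompt_tokens > 0 and budget > 0:
                chunk = min(r.prompt_tokens, budget)
                batch.append((r.name, "prefill", chunk))
                r.prompt_tokens -= chunk
                budget -= chunk
        passes.append(batch)
    return passes


# A short chat request and a long prompt share a budget of 8 tokens per pass.
demo_passes = splitfuse_schedule(
    [Request("short", prompt_tokens=4, decode_tokens=4),
     Request("long", prompt_tokens=20, decode_tokens=1)],
    token_budget=8,
)
for i, batch in enumerate(demo_passes, start=1):
    print(i, batch)
```

In this run the long prompt is processed in chunks over four passes while the short request keeps producing one token per pass, so the long prefill never blocks the decoding.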

The result is up to 2.3x higher throughput than vLLM on production chat benchmarks, with lower latency, especially for short requests. DeepSpeed-FastGen is integrated into Microsoft's DeepSpeed-MII (Model Implementations for Inference) library and is specifically optimized for Azure's H100 clusters.
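For readers who want to try FastGen through MII, the following is a minimal sketch based on the non-persistent `mii.pipeline` interface. The wrapper function `generate_with_fastgen` is hypothetical, and the default model name is only an example of a Hugging Face model. Running it assumes the `deepspeed-mii` package is installed and a supported NVIDIA GPU is available.

```python
def generate_with_fastgen(prompts, model_name="mistralai/Mistral-7B-v0.1",
                          max_new_tokens=128):
    """Generate completions for a list of prompts through a DeepSpeed-MII pipeline.

    Hypothetical helper for illustration; the model name is an example only.
    """
    # Imported lazily so this module can be loaded on machines without MII or a GPU.
    import mii

    # Builds a non-persistent inference pipeline for the given model.
    pipe = mii.pipeline(model_name)
    # Prompts are submitted together as one batch of generation requests.
    return pipe(prompts, max_new_tokens=max_new_tokens)
```

On a suitable machine, a call such as `generate_with_fastgen(["DeepSpeed is", "Seattle is"])` would return one response per prompt.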

Companies

Microsoft, DeepSpeed Team

Tools

DeepSpeed, DeepSpeed-MII, FastGen, PyTorch

Tags

DeepSpeed, FastGen, MII, LLM Serving, Dynamic SplitFuse, Microsoft, Azure, H100, Throughput

Sources