Skip to content
AImpact
IT EN
High AI Infrastructure · 1 min read

Medusa: multi-head speculative decoding without a separate draft model, 2.2x speedup

In one sentence Cornell/UIUC introduce Medusa: N additional decoding heads on the main model predict N tokens ahead simultaneously, 2.2x speedup without needing a second draft model.

Verified Official source
ShareLinkedInX
Reading level

Google's original Speculative Decoding required two models: one large and one small. Maintaining and deploying two models is complicated: the small model must be compatible with the large one, takes up additional memory, and can become stale when the main model is updated.

Medusa solves the problem elegantly: instead of a second model, it adds N extra "heads" directly on top of the main model. Each head learns to predict the token k steps ahead in the sequence. During inference, all heads run in parallel in the single model forward pass, then a tree of verifications is used to accept the maximum number of consistent tokens.

The result: 2.2x average speedup on Vicuna-7B and 13B, with no second model to maintain. The heads train in a few hours of fine-tuning. Medusa has been adopted by vLLM and LMDeploy as a native acceleration strategy.

Companies

Cornell University, UIUC, Princeton University

Tools

Medusa, PyTorch, vLLM

Tags

MedusaSpeculative DecodingMulti-HeadInferenceCornellUIUCThroughputLLM Serving

Sources