Inference Advanced Also known as: Decoding speculativo

Speculative Decoding

A technique where a small, fast model proposes several tokens ahead and the large model verifies them in a single forward pass, keeping the proposals that agree with its own predictions and discarding everything from the first mismatch onward.
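The propose-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration of the greedy variant only, not any runtime's actual implementation: the "models" are hypothetical toy functions (`toy_target`, `toy_draft`) that return one next token for a sequence, and the verification loop calls the target once per position for readability, whereas a real runtime would score all proposed positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to the single
# next token it would pick greedily.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each proposal in order. (Toy version:
    #    one call per position instead of a single batched pass.)
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    """Repeat speculative rounds until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


def plain_greedy(prefix: List[int], target: NextToken, max_new: int) -> List[int]:
    """Reference decoder: the target model alone, one token per call."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Toy models (assumptions for illustration): the target counts upward
# modulo 10; the draft agrees except right after token 2, where it errs.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10
```

Calling `generate([0], toy_draft, toy_target, 4, 12)` yields the same sequence as `plain_greedy([0], toy_target, 12)`: every emitted token is either confirmed or supplied by the target, which is why the output does not change in this greedy setting.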


In practice

It can produce answers 2-3x faster with no change in final quality, because the large model still decides which tokens are kept. It is used in production by OpenAI and Anthropic, and in self-hosted runtimes. It needs a "draft" model aligned with the main one, typically one that shares its tokenizer and tends to predict the same tokens, so it is not free to set up.
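To see where a figure like 2-3x can come from, here is a rough back-of-the-envelope estimate. It is a simplified model, not a benchmark: it assumes each draft token is accepted independently with the same probability `alpha`, that verifying `k` proposals costs about one target forward pass, and that one draft pass costs a fraction `draft_cost` of a target pass. All three parameter names and the example values are illustrative assumptions.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass, including the target's own token.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha (a simplification).
    """
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric sum: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by round cost, in units of one target pass.

    draft_cost is the assumed cost of one draft pass relative to one
    target pass; a round costs k draft passes plus one target pass.
    """
    return expected_tokens_per_round(alpha, k) / (k * draft_cost + 1.0)


# Illustrative values only: 80% acceptance, 4 drafted tokens,
# draft model 20x cheaper than the target.
print(round(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05), 2))  # 2.8
```

Under this model a poorly aligned draft (low `alpha`) can push the estimate below 1, meaning slower than plain decoding, which is the practical cost of choosing the wrong draft model.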

Seen in the wild

3 entries mentioning it

December 18, 2024

llama.cpp: speculative decoding with draft models for 2-3x speedup

High

September 14, 2023

Medusa: multi-head speculative decoding without a separate draft model, 2.2x speedup

High

January 20, 2023

Speculative Decoding: 2-3x LLM inference speedup without changing output

High
