AImpact
Inference · Advanced · Also known as: Decoding speculativo (Italian)

Speculative Decoding

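A technique where a small, fast "draft" model proposes several tokens ahead and the large model verifies them all in a single forward pass. The large model keeps the proposed tokens that match what it would have generated itself, discards everything from the first mismatch onward, and supplies its own token at that position.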
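The propose-then-verify loop can be sketched as follows. This is a minimal illustration of the greedy variant only, not any particular runtime's implementation: `toy_target`, `toy_draft`, `greedy_decode`, and `speculative_decode` are hypothetical names, the "models" are plain functions over integer tokens, and the verification step is simulated with a loop, whereas a real system would score all proposed positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # context -> next token (greedy)


def greedy_decode(target_next: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one large-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens, one after another (cheap).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2. The large model checks every proposed position.
        #    Simulated here with a loop; counted as ONE pass because a real
        #    model would evaluate all k+1 positions in a single forward pass.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            accepted.append(expected)  # always the large model's own choice
            if proposal[i] != expected:
                break  # first mismatch: drop the rest of the proposal
        else:
            # All k proposals matched, so the same pass yields one extra token.
            accepted.append(target_next(tokens + proposal))

        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


# Toy stand-ins (assumptions for illustration): the target counts upward mod 10,
# and the draft agrees with it except right after the token 6.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 6 else (ctx[-1] + 1) % 10
```

Calling `speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12, k=4)` returns the same sequence as `greedy_decode(toy_target, [0], 12)`, but with fewer large-model passes than generated tokens. The output is identical because every emitted token is the large model's own choice; the draft only decides how many tokens each pass yields.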


In practice

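It can produce answers 2-3x faster with no change in final quality, because the big model stays the judge: every token in the output is one the big model has checked and would have produced itself. The actual speedup depends on how often the draft's proposals are accepted and on how cheap the draft model is to run. It is used in production by OpenAI, Anthropic, and in self-hosted runtimes such as vLLM and llama.cpp. It needs a "draft" model aligned with the main one, typically sharing its tokenizer and predicting similar tokens, so it is not free to set up.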
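A rough way to see where a figure like 2-3x comes from is the standard back-of-the-envelope estimate below. It is a sketch under simplifying assumptions: each drafted token is accepted independently with the same probability `alpha`, and one draft step costs a fixed fraction `draft_cost` of one large-model pass. The function names and the sample numbers are illustrative, not measurements of any real system.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per large-model pass.

    alpha: probability that a drafted token is accepted (assumed i.i.d.)
    k:     number of tokens the draft model proposes per round
    """
    if alpha >= 1.0:
        return float(k + 1)  # every proposal accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost: cost of one draft step relative to one large-model pass
                (an assumed ratio, e.g. 0.05 for a draft 20x cheaper).
    """
    cost_per_round = k * draft_cost + 1.0  # k draft steps + 1 verification pass
    return expected_tokens_per_pass(alpha, k) / cost_per_round


if __name__ == "__main__":
    # Illustrative inputs only: 80% acceptance, 4 drafted tokens, 5% draft cost.
    print(round(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05), 2))
```

With these sample inputs the estimate comes out at roughly 2.8x. Lowering `alpha` (a poorly aligned draft) or raising `draft_cost` (a draft that is too large) quickly erodes the gain, which is why choosing the draft model is the main setup cost.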
