AImpact
Inference · Advanced · Also known as: Batching continuo (Italian), In-flight batching

Continuous Batching

A serving strategy in which the batch is rescheduled at every generation step: requests that have finished leave immediately and waiting requests take their place, instead of waiting for the whole previous batch to finish.


In practice

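It sharply raises the throughput of a GPU serving an API, because batch slots freed by short requests are refilled right away rather than sitting idle until the longest request in the batch completes. It is implemented in vLLM, TensorRT-LLM (which calls it in-flight batching), and Hugging Face TGI (Text Generation Inference). For anyone pricing per token, it is one of the key ingredients for staying competitive on cost.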
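The throughput gain can be illustrated with a toy simulation. This is only a sketch under strong assumptions: every request is reduced to the number of tokens it must generate, each step produces one token per active request, and prefill cost, KV-cache memory limits, and request arrival times are ignored. The function names are hypothetical and do not correspond to the API of vLLM, TensorRT-LLM, or TGI.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its longest member is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when freed slots are refilled at every generation step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token.
        running = [remaining - 1 for remaining in running]
        # Finished requests leave the batch immediately.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths (in tokens) of eight requests.
lengths = [2, 8, 3, 8, 2, 2, 3, 2]
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With this workload and two batch slots, static batching needs 21 steps while continuous batching needs 15, which is also the minimum possible for 30 tokens on two slots. The gap grows as output lengths within a batch become more uneven, which is typical of real traffic.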
