Inference · Advanced · Also known as: Batching continuo · In-flight batching

Continuous Batching

A serving strategy in which new requests can join the running batch at any generation step and finished sequences leave it immediately, instead of new requests waiting for the whole batch to finish.


In practice

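It sharply raises the throughput of a GPU serving LLM APIs, because batch slots freed by finished sequences are refilled right away instead of sitting idle until the longest request in the batch completes. It is implemented in vLLM, TensorRT-LLM, and TGI. For anyone pricing per token, it is one of the key ingredients for staying competitive on cost.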
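The effect on throughput can be illustrated with a toy simulation. This is a minimal sketch, not how vLLM, TensorRT-LLM, or TGI actually schedule work. It assumes all requests are queued at time zero, every generation step costs the same regardless of how many slots are occupied, and prefill cost and KV-cache memory limits are ignored. The function names and the example output lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its slowest member is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when free slots are refilled before every generation step."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running = []              # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One generation step: every active request emits one token,
        # and requests that have finished drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight queued requests.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
batch_size = 4

static = static_batching_steps(lengths, batch_size)          # 16
continuous = continuous_batching_steps(lengths, batch_size)  # 10

total_tokens = sum(lengths)
print(f"static:     {static} steps, "
      f"slot utilization {total_tokens / (static * batch_size):.0%}")
print(f"continuous: {continuous} steps, "
      f"slot utilization {total_tokens / (continuous * batch_size):.0%}")
```

In this toy workload the same 28 tokens take 16 steps with static batching and 10 with continuous batching, because the short requests no longer hold slots hostage to the two long ones. Real gains depend on the length distribution, the arrival pattern, and memory constraints.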

Seen in the wild

2 entries mentioning it

April 8, 2025

Continuous Batching for LLM Serving: survey and state of the art of Orca, vLLM, SGLang, TGI

Medium
June 6, 2023

HuggingFace TGI: production-ready Docker container for LLM serving with continuous batching

High
