AImpact
Medium AI Infrastructure · 1 min read

Continuous Batching for LLM Serving: survey and state of the art of Orca, vLLM, SGLang, TGI

In one sentence: a systematic review of continuous batching strategies for LLM serving that compares Orca, vLLM, SGLang, and TGI on scheduling, GPU utilization, and TTFT/TPOT metrics, covering the state of the art in 2024-2025.


Handling thousands of concurrent requests to an AI model is non-trivial: each request has a different length, and response time must stay low. Classic static batching groups requests into one batch and waits for all of them to finish before starting the next batch. This is simple but inefficient, because short requests wait for the longest one and the slots they free sit idle.

Continuous batching, introduced by Orca in 2022 as iteration-level scheduling, changes the paradigm: the scheduler revisits the batch at every generation step, so a waiting request can take a slot as soon as a running one finishes. This keeps the GPU busier and shortens the time requests spend waiting. Today all major serving frameworks implement it, but in different ways and with different trade-offs.

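The difference between the two scheduling styles can be illustrated with a toy simulation. This is only a sketch under strong assumptions: every decode step costs one time unit regardless of batch content, prefill is ignored, and each request is described only by its number of output tokens. The function names and the example workload are hypothetical and do not reflect how Orca, vLLM, SGLang, or TGI are implemented.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Run fixed batches; a batch ends only when its longest request ends."""
    clock = 0
    finish = []
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        clock += max(batch)                 # short requests wait for the longest
        finish.extend([clock] * len(batch))  # results released at batch end
    return clock, finish


def continuous_batching(lengths, batch_size):
    """Refill free slots before every decode step (iteration-level scheduling)."""
    waiting = deque(enumerate(lengths))
    running = {}                            # request id -> tokens still to generate
    finish = [0] * len(lengths)
    clock = 0
    while waiting or running:
        # Admit waiting requests into any free slot before this step.
        while waiting and len(running) < batch_size:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        clock += 1                          # one decode step for the whole batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                finish[rid] = clock         # request leaves the batch immediately
                del running[rid]
    return clock, finish


# Hypothetical workload: output lengths in tokens, two slots on the GPU.
workload = [2, 8, 3, 1]
static_total, static_finish = static_batching(workload, batch_size=2)
cont_total, cont_finish = continuous_batching(workload, batch_size=2)
print(static_total, static_finish)
print(cont_total, cont_finish)
```

In this toy workload the continuous scheduler needs fewer total steps because the slot freed by the two-token request is reused while the eight-token request is still running. Real systems also have to account for prefill cost and KV cache memory, which this sketch leaves out.
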
This systematic review compares the four main systems (Orca, vLLM, SGLang, and TGI) on scheduling architecture, KV cache management, latency metrics (TTFT, time to first token, and TPOT, time per output token), and GPU utilization. It is a consolidated reference for anyone choosing serving infrastructure.
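To make the two latency metrics concrete, here is a minimal sketch of how they could be computed from per-token timestamps. It assumes one common convention: TTFT is the delay from request arrival to the first output token, and TPOT is the average gap between the remaining output tokens. The function name and the sample timestamps are hypothetical, and individual benchmarks may define TPOT slightly differently.

```python
def latency_metrics(arrival_time, token_times):
    """Return (ttft, tpot) in the same time unit as the inputs.

    arrival_time: when the request reached the server.
    token_times:  emission time of each output token, in order.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft = token_times[0] - arrival_time
    n = len(token_times)
    # Average inter-token gap after the first token; undefined for one token,
    # so this sketch reports 0.0 in that case.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


# Hypothetical request: arrives at t=0.0 s, first token at 0.5 s,
# then one token every 0.1 s.
ttft, tpot = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

Separating the two numbers matters when comparing schedulers: admitting new requests aggressively tends to help TTFT, while it can slow the per-token pace of requests already running.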

Companies

vLLM Project, Hugging Face, University of California Berkeley, MIT

Tools

vLLM, SGLang, TGI, Orca, PagedAttention

Tags

Continuous Batching, LLM Serving, Orca, vLLM, SGLang, TGI, TTFT, TPOT, GPU Utilization, Survey, Scheduling

Sources