AImpact
Medium AI Infrastructure · 1 min read

vLLM v0.7: chunked prefill by default and a redesigned V1 engine

In one sentence: vLLM v0.7 ships with chunked prefill on by default, a rewritten "V1" engine scheduler, and stronger support for MoE models (DeepSeek V3/R1) and multimodal models, for 1.5-2× throughput on many workloads.

Needs review Official source

vLLM is one of the most widely used open-source engines for serving LLMs in production: it is used at Pinterest, IBM, and Snowflake, and across a large part of academia. Version 0.7 ships two big under-the-hood changes.

First, "chunked prefill" is now on by default. In plain terms, it splits prefill (the initial phase, in which the model reads the prompt) into chunks and interleaves those chunks with the generation phase of other requests. A long prompt then no longer stalls requests that are already generating, which reduces latency and increases throughput without changing the model.
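To make the interleaving concrete, here is a toy sketch of a chunked-prefill scheduler. It is not vLLM's actual scheduler: the names `Request`, `schedule`, and `token_budget` are hypothetical, and the model is simplified (each step has a fixed token budget, requests that are already generating get one token each first, and whatever budget is left goes to the next chunk of a waiting prompt).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record, for illustration only.
    name: str
    prompt_tokens: int   # tokens that must be prefilled
    output_tokens: int   # tokens to generate after prefill
    prefilled: int = 0
    generated: int = 0


def schedule(requests, token_budget):
    """Return a list of steps; each step is a list of (name, phase, tokens)."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    waiting = deque(requests)   # requests whose prompt is not fully read yet
    running = []                # requests that are generating tokens
    steps = []
    while waiting or running:
        budget = token_budget
        step = []
        # 1. Decode first: one token per running request, so generation
        #    is never blocked behind a long prompt.
        for req in list(running):
            if budget == 0:
                break
            req.generated += 1
            budget -= 1
            step.append((req.name, "decode", 1))
            if req.generated >= req.output_tokens:
                running.remove(req)
        # 2. Spend the leftover budget on prompt chunks.
        while waiting and budget > 0:
            req = waiting[0]
            chunk = min(budget, req.prompt_tokens - req.prefilled)
            req.prefilled += chunk
            budget -= chunk
            step.append((req.name, "prefill", chunk))
            if req.prefilled >= req.prompt_tokens:
                waiting.popleft()
                running.append(req)  # starts decoding on the next step
        steps.append(step)
    return steps


demo = schedule(
    [Request("A", prompt_tokens=4, output_tokens=3),
     Request("B", prompt_tokens=10, output_tokens=2)],
    token_budget=6,
)
for i, step in enumerate(demo, start=1):
    print(i, step)
```

In this run, request B's 10-token prompt is read in three chunks while request A keeps generating one token per step. In vLLM itself, the comparable per-step limit is the `max_num_batched_tokens` engine argument; the real scheduler also handles KV-cache memory, preemption, and batching details that this sketch leaves out.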

Second, a new "V1" engine, rewritten from scratch to be simpler and faster. On real workloads, throughput improves by 50-100% over v0.6, and support for large MoE models (DeepSeek V3/R1) and multimodal models is much more solid.

Companies

vLLM Project, UC Berkeley

Tools

vLLM v0.7, vLLM

Tags

vLLM, Inference, Chunked Prefill, PagedAttention, Open Source
