vLLM v0.7: chunked prefill by default and a redesigned V1 engine
In one sentence: vLLM v0.7 ships with chunked prefill on by default, a rewritten "V1" engine scheduler, and more solid support for MoE (DeepSeek V3/R1) and multimodal models, for 1.5-2× throughput on many workloads.
vLLM is among the most widely used open-source engines for serving LLMs in production: it is used at Pinterest, IBM, and Snowflake, and across much of academia. Version 0.7 ships two big under-the-hood changes.
First: "chunked prefill" is on by default. In plain terms, it splits the prefill phase (when the model reads the prompt) into chunks and interleaves those chunks with the generation (decode) steps of other requests, so a long prompt no longer stalls requests that are already generating. This reduces latency and increases throughput without changing the model.
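To make the interleaving concrete, here is a minimal toy sketch of a chunked-prefill scheduler in Python. It is not vLLM's code or API: the `Request` class, `schedule_step`, `apply_step`, `run`, and the `token_budget` parameter are all hypothetical names chosen for illustration, and the model assumes each step can process a fixed number of tokens, serving decodes first and filling the remainder with prompt chunks.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int   # prompt tokens still to be prefilled
    decode_tokens: int   # output tokens still to be generated


def schedule_step(requests, token_budget):
    """Pick the tokens to process in one step (toy model, not vLLM's scheduler)."""
    batch = {}
    budget = token_budget
    # Decode first: every request that has finished prefill gets one token.
    for r in requests:
        if r.prompt_tokens == 0 and r.decode_tokens > 0 and budget > 0:
            batch[r.name] = 1
            budget -= 1
    # Fill the remaining budget with chunks of pending prompts.
    for r in requests:
        if r.prompt_tokens > 0 and budget > 0:
            chunk = min(r.prompt_tokens, budget)
            batch[r.name] = chunk
            budget -= chunk
    return batch


def apply_step(requests, batch):
    """Update the remaining work of each request after a step has run."""
    for r in requests:
        n = batch.get(r.name, 0)
        if r.prompt_tokens > 0:
            r.prompt_tokens -= n
        else:
            r.decode_tokens -= n


def run(requests, token_budget):
    """Run steps until all requests are done; return the per-step batches."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    steps = []
    while any(r.prompt_tokens > 0 or r.decode_tokens > 0 for r in requests):
        batch = schedule_step(requests, token_budget)
        apply_step(requests, batch)
        steps.append(batch)
    return steps


if __name__ == "__main__":
    # A is already generating; B arrives with a 10-token prompt.
    demo = [Request("A", 0, 3), Request("B", 10, 2)]
    for i, batch in enumerate(run(demo, token_budget=4), start=1):
        print(f"step {i}: {batch}")
```

With a budget of 4 tokens per step, request A keeps receiving one token per step while B's 10-token prompt is prefilled in chunks of 3, 3, 3, and 1. Without chunking, B's whole prompt would have to run as a single step, and A would wait for it.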
Second: a new "V1 engine", rewritten from scratch to be simpler and faster. On real workloads, throughput improves by 50-100% over v0.6 (that is, 1.5-2×), and support for large MoE models (DeepSeek V3/R1) and multimodal models is much more solid.
Companies
vLLM Project, UC Berkeley
Tools
vLLM v0.7, vLLM
Sources