DevOps / LLMOps: inference, stack and production optimization

vLLM, Ollama, quantization and latency: the operational path for teams that run LLMs in production.

You are a DevOps, MLOps or LLMOps engineer who needs to run language models in production with real requirements for latency, throughput and cost. This path follows the evolution of inference infrastructure: from the chips that define its physical limits, to the open-source stacks that push hardware toward those limits, to the quantization techniques that cut hardware requirements with little loss of quality.

01

Why it matters to you

The Microsoft framework that made distributed training practical across hundreds of GPUs with ZeRO stage 3: understanding how it partitions optimizer state, gradients and parameters is the foundation for any MLOps pipeline on a cluster.

September 9, 2020 | High | AI Infrastructure

DeepSpeed ZeRO-3: training models beyond 100 billion parameters

Microsoft announces ZeRO Stage 3 in DeepSpeed: by sharding parameters across GPUs in addition to gradients and optimizer states, it enables training of 100B+ parameter models on reasonable-size clusters.
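
The effect of each ZeRO stage on per-GPU memory can be sketched with simple arithmetic. The snippet below is an illustration, not DeepSpeed code: it assumes mixed-precision Adam, where model states cost about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus the two Adam moments), and it ignores activations and buffers. The function name `zero_bytes_per_gpu` is hypothetical.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU under a given ZeRO stage.

    Assumption: mixed-precision Adam, 16 bytes of model state per parameter.
    Activations, temporary buffers and fragmentation are not counted.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2.0 * num_params   # fp16 parameters
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 master weights + Adam momentum + variance

    if stage >= 1:              # stage 1 shards optimizer states
        optim /= num_gpus
    if stage >= 2:              # stage 2 also shards gradients
        grads /= num_gpus
    if stage >= 3:              # stage 3 also shards the parameters themselves
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical cluster: a 100B-parameter model on 64 GPUs.
    for stage in range(4):
        gb = zero_bytes_per_gpu(100e9, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: about {gb:.0f} GB of model state per GPU")
```

Under these assumptions, only stage 3 brings the per-GPU model state of a 100B model down to a size that fits in a single accelerator's memory, which is why parameter sharding is the step that matters for the 100B+ range.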

02

Why it matters to you

A widely seen public demonstration that an inference-only chip can outperform typical GPU serving by roughly an order of magnitude in tokens/s per request: it changes the reference benchmarks for latency SLAs and forces a reassessment of hardware choices in every LLMOps stack.

February 22, 2024 | High | AI Infrastructure

Groq LPU: 500-tokens-per-second inference goes viral

Groq's public demo on Llama 2 70B generates ~500 tokens/sec, roughly an order of magnitude faster than typical GPU serving of the same model at the time. LLM latency stops being a given.
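
To see what a figure like ~500 tokens/s means for a latency SLA, a back-of-the-envelope model is enough: response time is roughly time-to-first-token plus output length divided by decode speed. The sketch below is only that arithmetic; the function name, the 0.2 s time-to-first-token and the 50 tokens/s comparison point are illustrative assumptions, not measurements from the demo.

```python
def response_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Rough end-to-end latency for one streamed response.

    Assumption: decode speed is constant and queueing time is folded into ttft_s.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    # Hypothetical 500-token answer with an assumed 0.2 s time-to-first-token.
    for label, speed in [("~500 tok/s", 500.0), ("assumed 50 tok/s baseline", 50.0)]:
        print(f"{label}: about {response_latency_s(0.2, 500, speed):.1f} s")
```

With long outputs the decode term dominates, so a tenfold gain in tokens/s translates almost directly into a tenfold shorter response time.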

03

Why it matters to you

The wafer-scale engine that brings 70B+ model inference to speeds unthinkable on standard GPUs: it reveals that memory bandwidth is the real bottleneck and guides decisions on batch size and caching strategy.

August 27, 2024 | Medium | AI Infrastructure

Cerebras Inference: record-breaking LLM inference throughput on the wafer-scale WSE-3

Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B, which it puts at 10-20× faster than H100 GPUs.
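
The memory-bandwidth argument can be made concrete with a simple upper bound: during decode, every generated token requires streaming all model weights from memory once, so tokens/s for a batch cannot exceed batch size times bandwidth divided by model size. The sketch below assumes exactly that and ignores KV-cache reads, compute time and kernel overheads; the function name is hypothetical and the 3.35 TB/s figure is used as an assumed HBM bandwidth for a single GPU.

```python
def decode_tokens_per_s_upper_bound(
    num_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
    batch_size: int = 1,
) -> float:
    """Bandwidth-bound ceiling on decode throughput.

    Assumption: each decode step reads every weight once and is shared by
    all sequences in the batch; KV-cache traffic and compute are ignored.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    model_bytes = num_params * bytes_per_param
    steps_per_s = mem_bandwidth_bytes_per_s / model_bytes
    return batch_size * steps_per_s


if __name__ == "__main__":
    # Hypothetical setup: 8B parameters in fp16 on one GPU with 3.35 TB/s of HBM bandwidth.
    single = decode_tokens_per_s_upper_bound(8e9, 2, 3.35e12)
    batched = decode_tokens_per_s_upper_bound(8e9, 2, 3.35e12, batch_size=16)
    print(f"single stream ceiling: about {single:.0f} tok/s")
    print(f"batch of 16 ceiling:   about {batched:.0f} tok/s in aggregate")
```

The bound explains both levers named above: larger batches amortize each weight read over more sequences, and hardware that keeps weights in faster memory raises the ceiling for a single stream.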

04

Why it matters to you

Meta standardizes APIs for inference, RAG, safety and agents in a single deployable stack: the reference point for anyone who wants a reproducible LLMOps architecture without depending on a single provider.

September 25, 2024 | Medium | AI Infrastructure

Llama Stack: Meta proposes a unified API spec for the LLM lifecycle

Meta announces Llama Stack: an API spec + reference implementations for inference, safety, agents, memory, evals, RAG, and training, meant as 'standard plumbing' for Llama-based applications.

05

Why it matters to you

The throughput record that redefines what can be promised in a production SLA: the published numbers become the new benchmark against which every vLLM or Triton deployment configuration is measured.

June 26, 2025 | Medium | AI Infrastructure

Cerebras hits 2,500+ tok/s on Llama: inference record of the year

Cerebras Systems publishes inference numbers beating Nvidia GPUs by an order of magnitude: 2,500+ tok/s on Llama 4 Maverick and Scout thanks to the wafer-scale WSE-3. Custom ASICs are back in the race.

06

Why it matters to you

vLLM 0.7 makes chunked prefill the default, splitting long prompts into chunks that are batched alongside decode steps so new requests no longer stall ongoing generations: the release that makes mixed prefill/decode scheduling a standard operational pattern.

July 2, 2025 | Medium | AI Infrastructure

vLLM v0.7: chunked prefill by default and a redesigned V1 engine

vLLM ships v0.7 with chunked prefill on by default, a rewritten 'V1' engine scheduler, and advanced support for MoE (DeepSeek V3/R1) and multimodal models. +1.5-2× throughput on many workloads.
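
The idea behind chunked prefill can be sketched as a toy scheduler. This is not vLLM's implementation: it is a minimal model that assumes a fixed per-step token budget, gives every decoding request one token first, and spends whatever budget is left on chunks of pending prompts. The `Request` class and `schedule_step` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    name: str
    prompt_len: int    # prompt tokens still waiting to be prefilled
    decode_left: int   # output tokens still to generate once prefill is done


def schedule_step(requests: List[Request], token_budget: int) -> Dict[str, int]:
    """Plan one engine step and return {request name: tokens scheduled}.

    Toy policy: decodes are served first (one token each), then the
    remaining budget is filled with prefill chunks. Mutates `requests`.
    """
    plan: Dict[str, int] = {}
    budget = token_budget

    # 1) Requests already past prefill each decode one token.
    for r in requests:
        if r.prompt_len == 0 and r.decode_left > 0 and budget > 0:
            plan[r.name] = 1
            r.decode_left -= 1
            budget -= 1

    # 2) Leftover budget goes to prompts, cut into chunks that fit.
    for r in requests:
        if r.prompt_len > 0 and budget > 0:
            chunk = min(r.prompt_len, budget)
            plan[r.name] = chunk
            r.prompt_len -= chunk
            budget -= chunk

    return plan


if __name__ == "__main__":
    # "A" is mid-generation; "B" arrives with a 1000-token prompt.
    reqs = [Request("A", prompt_len=0, decode_left=3),
            Request("B", prompt_len=1000, decode_left=2)]
    for step in range(3):
        print(step, schedule_step(reqs, token_budget=512))
```

In this toy run the long prompt of "B" is spread over two steps while "A" keeps receiving a token on every step, which is the behavior that keeps inter-token latency steady when large prompts arrive.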

07

Why it matters to you

The breakthrough that compresses 70B models to sub-4-bit precision with negligible quality loss: it redefines the minimum hardware needed for on-premise deployment and opens edge scenarios that were unthinkable just months before.

April 30, 2026 | Medium | Local AI

Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM

New quantization techniques (high-quality 2-bit / 3-bit extensions) let frontier-sized reasoning models run on workstations with 32-64GB unified RAM.
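
The RAM figures follow from simple arithmetic on bits per weight. The sketch below estimates the weight footprint of a quantized model; it assumes group-wise quantization with one 16-bit scale per group of 128 weights, which is a common layout but not necessarily the one used by the techniques described here, and it leaves out the KV cache, activations and runtime overhead. The function name is hypothetical.

```python
from typing import Optional


def quantized_weight_gib(
    num_params: float,
    bits: float,
    group_size: Optional[int] = 128,
    scale_bits: int = 16,
) -> float:
    """Approximate weight memory in GiB for a quantized model.

    Assumption: each group of `group_size` weights stores one extra scale of
    `scale_bits` bits. Pass group_size=None for unquantized weights.
    """
    effective_bits = bits
    if group_size is not None:
        effective_bits += scale_bits / group_size  # per-weight share of the scale
    return num_params * effective_bits / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical 70B-parameter model at different precisions.
    print(f"fp16 : {quantized_weight_gib(70e9, 16, group_size=None):.1f} GiB")
    for bits in (4, 3, 2):
        print(f"{bits}-bit: {quantized_weight_gib(70e9, bits):.1f} GiB")
```

Under these assumptions a 70B model needs well over 100 GiB in fp16 but under 20 GiB at 2 bits, which leaves headroom for the KV cache on a 32GB machine.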
