Reading path
DevOps / LLMOps: inference, stack and production optimization
vLLM, Ollama, quantization and latency: the operational path for those who run LLMs for real.
You are a DevOps, MLOps or LLMOps engineer who needs to run language models in production with real requirements for latency, throughput and cost. This path follows the evolution of inference infrastructure: from the chips that define its physical limits, to the open-source stacks that push that hardware to its limits, to the quantization techniques that cut hardware requirements with little loss in quality.
- 01
Why it matters to you
The Microsoft framework that made distributed training practical across hundreds of GPUs with ZeRO stage 3: understanding how it partitions optimizer state, gradients and parameters is the foundation for any MLOps pipeline on a cluster.
High · AI Infrastructure · DeepSpeed ZeRO-3: training models beyond 100 billion parameters
Microsoft announces ZeRO Stage 3 in DeepSpeed: by sharding parameters across GPUs in addition to gradients and optimizer states, it enables training of 100B+ parameter models on reasonable-size clusters.
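The partitioning described above can be turned into a rough memory estimate. The sketch below assumes mixed-precision Adam with the commonly cited accounting of 2 bytes of fp16 weights, 2 bytes of fp16 gradients and 12 bytes of optimizer state per parameter; the function name and the 100B / 64-GPU example are illustrative, and activations, buffers and fragmentation are ignored.

```python
def zero_memory_per_gpu_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU, in GB (1e9 bytes).

    Assumes mixed-precision Adam: 2 bytes of fp16 weights, 2 bytes of fp16
    gradients and 12 bytes of optimizer state per parameter.
    Stage 1 shards optimizer state, stage 2 also shards gradients,
    stage 3 also shards the parameters themselves.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    n_params = params_billion * 1e9
    weights, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params

    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus

    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    # Illustrative only: a 100B-parameter model on 64 GPUs.
    for stage in range(4):
        print(f"stage {stage}: {zero_memory_per_gpu_gb(100, 64, stage):.1f} GB per GPU")
```

Under these assumptions only stage 3 brings the model states of a 100B-parameter model within the memory of a single accelerator, which is why sharding the parameters is the decisive step.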
- 02
Why it matters to you
The first public demonstration that an inference-only chip can outperform GPUs by roughly an order of magnitude in tokens/s: it changes the reference benchmarks for latency SLAs and forces a reassessment of hardware choices in every LLMOps stack.
High · AI Infrastructure · Groq LPU: 500-tokens-per-second inference goes viral
Groq's public demo on Llama 2 70B generates ~500 tokens/sec, roughly an order of magnitude faster than typical GPU serving at the time. LLM latency stops being a given.
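To see what a decode rate like this means for a latency SLA, a back-of-the-envelope helper is enough. This is a simplified sketch: it treats every output token as produced at a steady rate after the time to first token (TTFT), and the 0.2 s TTFT and the 50 tokens/s comparison point are hypothetical values chosen only for illustration.

```python
def response_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Approximate end-to-end latency of one streamed response.

    Simplification: all output tokens are assumed to be decoded at a
    steady rate once the time to first token (TTFT) has elapsed.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    if ttft_s < 0 or output_tokens < 0:
        raise ValueError("ttft_s and output_tokens must be non-negative")
    return ttft_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    # Hypothetical comparison for a 500-token answer with a 0.2 s TTFT.
    for rate in (50, 500):
        print(f"{rate} tok/s -> {response_latency_s(0.2, 500, rate):.1f} s")
```

At the assumed 50 tokens/s a 500-token answer takes about ten seconds; at 500 tokens/s it takes just over one, which is the difference between a batch job and an interactive response.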
- 03
Why it matters to you
The wafer-scale engine that brings 70B+ model inference to speeds unthinkable on standard GPUs: it reveals that memory bandwidth is the real bottleneck and guides decisions on batch size and caching strategy.
Medium · AI Infrastructure · Cerebras Inference: record-breaking LLM inference throughput on the wafer-scale WSE-3
Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1,800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B, or 10-20× faster than H100 GPUs.
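The memory-bandwidth argument can be made concrete with a simple ceiling estimate: if every decode step has to read all the weights once, the step rate cannot exceed bandwidth divided by weight size, and batching lets several sequences share each read. The sketch below ignores KV-cache traffic and compute limits; the 3,350 GB/s figure is an assumed H100-class HBM bandwidth used only as an example, and a 70B fp16 model would in practice be split across several such devices.

```python
def decode_ceiling_tokens_per_s(
    params_billion: float,
    bytes_per_param: float,
    mem_bandwidth_gb_s: float,
    batch_size: int = 1,
) -> float:
    """Bandwidth-bound upper limit on aggregate decode throughput.

    Assumes each decode step reads all weights from memory exactly once
    and yields one token per sequence in the batch. KV-cache reads and
    compute limits are ignored, so real throughput will be lower.
    """
    if min(params_billion, bytes_per_param, mem_bandwidth_gb_s) <= 0:
        raise ValueError("sizes and bandwidth must be positive")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    weight_gb = params_billion * bytes_per_param  # 1e9 params x bytes each = GB
    steps_per_s = mem_bandwidth_gb_s / weight_gb
    return steps_per_s * batch_size


if __name__ == "__main__":
    # Assumed numbers: 70B parameters in fp16, 3,350 GB/s of memory bandwidth.
    for batch in (1, 8, 32):
        ceiling = decode_ceiling_tokens_per_s(70, 2, 3350, batch)
        print(f"batch {batch}: at most {ceiling:.0f} tok/s aggregate")
```

The single-stream ceiling stays in the low tens of tokens per second under these assumptions, so higher per-user speed needs more bandwidth or fewer bytes per weight, while higher aggregate throughput mostly comes from batching.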
- 04
Why it matters to you
Meta standardizes APIs for inference, RAG, safety and agents in a single deployable stack: the reference point for anyone who wants a reproducible LLMOps architecture without depending on a single provider.
Medium · AI Infrastructure · Llama Stack: Meta proposes a unified API spec for the LLM lifecycle
Meta announces Llama Stack: an API spec plus reference implementations for inference, safety, agents, memory, evals, RAG and training, meant as 'standard plumbing' for Llama-based applications.
- 05
Why it matters to you
The throughput record that redefines what can be promised in a production SLA: the published numbers become the new benchmark against which every vLLM or Triton deployment configuration is measured.
Medium · AI Infrastructure · Cerebras hits 2,500+ tok/s on Llama: inference record of the year
Cerebras Systems publishes inference numbers beating Nvidia GPUs by an order of magnitude: 2,500+ tok/s on Llama 4 Maverick and Scout thanks to the wafer-scale WSE-3. Custom ASICs are back in the race.
- 06
Why it matters to you
vLLM 0.7 enables chunked prefill by default and ships a redesigned V1 engine: long prompts are split into chunks and batched with ongoing decode steps, which keeps latency predictable without penalizing throughput.
Medium · AI Infrastructure · vLLM v0.7: chunked prefill by default and a redesigned V1 engine
vLLM ships v0.7 with chunked prefill on by default, a rewritten 'V1' engine scheduler, and advanced support for MoE (DeepSeek V3/R1) and multimodal models. +1.5-2× throughput on many workloads.
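The idea behind chunked prefill can be illustrated with a toy scheduler. This is not vLLM's code: the `Request` class, `schedule_step` and the token budget are hypothetical names for a minimal sketch in which each step serves decoding requests first (one token each) and spends whatever budget remains on prompt chunks.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int   # total prompt length
    prefilled: int = 0   # prompt tokens already processed

    @property
    def decoding(self) -> bool:
        # A request starts decoding once its whole prompt has been prefilled.
        return self.prefilled >= self.prompt_tokens


def schedule_step(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Plan one engine step as {request name: tokens processed}.

    Decoding requests go first and consume one token each. The remaining
    budget is given to pending prefills; a prompt that does not fit is
    split, and only a chunk of it is processed in this step.
    """
    plan: dict[str, int] = {}
    budget = token_budget

    for req in requests:
        if req.decoding and budget > 0:
            plan[req.name] = 1
            budget -= 1

    for req in requests:
        if budget == 0:
            break
        if not req.decoding:
            chunk = min(req.prompt_tokens - req.prefilled, budget)
            req.prefilled += chunk
            plan[req.name] = chunk
            budget -= chunk

    return plan


if __name__ == "__main__":
    reqs = [Request("a", prompt_tokens=4, prefilled=4), Request("b", prompt_tokens=10)]
    for step in range(3):
        print(step, schedule_step(reqs, token_budget=6))
```

In this toy run request "a" keeps receiving one token per step while the ten-token prompt of "b" is absorbed over two steps, which is the behavior that stops a long prompt from stalling generations already in flight.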
- 07
Why it matters to you
The breakthrough that compresses 70B models to sub-4-bit precision with negligible quality loss: it redefines the minimum hardware needed for on-premise deployment and opens edge scenarios that were unthinkable just months before.
Medium · Local AI · Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM
New quantization techniques (high-quality 2-bit / 3-bit extensions) let frontier-sized reasoning models run on workstations with 32-64GB unified RAM.
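The RAM figures follow from simple arithmetic on bits per weight. The sketch below is an estimate of weight storage only: it excludes the KV cache, activations and runtime overhead, and the group size of 128 with a 16-bit scale per group is an assumed, illustrative quantization layout rather than the scheme of any specific method.

```python
def effective_bits(bits_per_weight: float, group_size: int = 128, scale_bits: int = 16) -> float:
    """Bits per weight including one shared scale per group of weights.

    The group size and scale width are illustrative assumptions.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return bits_per_weight + scale_bits / group_size


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1e9 bytes); KV cache and activations excluded."""
    if params_billion <= 0 or bits_per_weight <= 0:
        raise ValueError("inputs must be positive")
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Illustrative only: a 70B-parameter model at several precisions.
    for bits in (16, 4, 3, 2):
        raw = weight_footprint_gb(70, bits)
        grouped = weight_footprint_gb(70, effective_bits(bits))
        print(f"{bits}-bit: {raw:.1f} GB raw, {grouped:.1f} GB with per-group scales")
```

Under these assumptions a 70B model needs 140 GB at 16-bit and 35 GB at 4-bit, and it only drops under 32 GB of weights at 3-bit (about 26 GB) or 2-bit (about 18 GB), before leaving room for the KV cache.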