AImpact

Reading path

DevOps / LLMOps: inference, stack and production optimization

vLLM, Ollama, quantization and latency: the operational path for those who run LLMs for real.

You are a DevOps, MLOps or LLMOps engineer who needs to run language models in production with real requirements for latency, throughput and cost. This path follows the evolution of inference infrastructure: from the chips that define its physical limits, to the open-source stacks that push those limits as far as they go, to the quantization techniques that cut hardware requirements with little loss of quality.

  1. 01

    Why it matters to you

    The Microsoft framework that made distributed training practical across hundreds of GPUs with ZeRO stage 3: understanding how it partitions optimizer state, gradients and parameters is the foundation for any MLOps pipeline on a cluster.

    High AI Infrastructure

    DeepSpeed ZeRO-3: training models beyond 100 billion parameters

    Microsoft announces ZeRO Stage 3 in DeepSpeed: by sharding parameters across GPUs in addition to gradients and optimizer states, it enables training of 100B+ parameter models on reasonable-size clusters.
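
To make the partitioning concrete, here is a back-of-the-envelope sketch of model-state memory per GPU at each ZeRO stage. It assumes mixed-precision training with Adam, where each parameter costs roughly 2 bytes of fp16 weights, 2 bytes of fp16 gradients and 12 bytes of fp32 optimizer state. The function name and the 100B / 64-GPU example are illustrative choices, and activations, buffers and fragmentation are deliberately left out.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU (mixed-precision Adam, assumed)."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 weight copy + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus       # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus       # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus      # stage 3 also shards the parameters
    return params + grads + optim


if __name__ == "__main__":
    # Illustrative numbers only: 100B parameters on 64 GPUs.
    for stage in range(4):
        gb = zero_bytes_per_gpu(100e9, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:,.1f} GB of model state per GPU")
```

Under these assumptions only stage 3 brings the per-GPU model state of a 100B model down to a size that fits on a single accelerator, which is why sharding the parameters themselves is the step that matters at this scale.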

  2. 02

    Why it matters to you

    The first public demonstration that an inference-only chip can outperform GPUs by roughly an order of magnitude in tokens/s: it resets the reference benchmarks for latency SLAs and forces a reassessment of hardware choices in every LLMOps stack.

    High AI Infrastructure

    Groq LPU: 500-tokens-per-second inference goes viral

    Groq's public demo on Llama 2 70B generates ~500 tokens/sec per user, roughly an order of magnitude faster than typical GPU serving at the time. High LLM latency stops being a given.

  3. 03

    Why it matters to you

    The wafer-scale engine that brings 70B+ model inference to speeds unthinkable on standard GPUs: it reveals that memory bandwidth is the real bottleneck and guides decisions on batch size and caching strategy.

    Medium AI Infrastructure

    Cerebras Inference: record-breaking LLM inference throughput on the wafer-scale WSE-3

    Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B, which by the company's own comparison is 10-20× faster than H100-based GPU serving.
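
The memory-bandwidth argument can be sketched with simple arithmetic: during decode every generated token requires streaming the model weights from memory once, so bandwidth divided by weight size gives a rough ceiling on tokens/s per sequence. The snippet below is an estimate under stated assumptions, not a benchmark: the 3.35 TB/s figure is the published HBM bandwidth of an H100 SXM, the function name is hypothetical, and KV-cache reads, compute limits and multi-GPU communication are ignored.

```python
def decode_tokens_per_second_bound(
    num_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
    batch_size: int = 1,
) -> float:
    """Rough upper bound on aggregate decode tokens/s when memory-bound.

    Each decode step reads all weights once and yields one token per
    sequence in the batch, so batching amortizes the weight traffic.
    KV-cache traffic is ignored (assumption).
    """
    weight_bytes = num_params * bytes_per_param
    steps_per_second = bandwidth_bytes_per_s / weight_bytes
    return steps_per_second * batch_size


if __name__ == "__main__":
    h100_bandwidth = 3.35e12  # bytes/s, assumed from the public H100 SXM spec
    for name, params in [("8B", 8e9), ("70B", 70e9)]:
        single = decode_tokens_per_second_bound(params, 2, h100_bandwidth)
        batched = decode_tokens_per_second_bound(params, 2, h100_bandwidth, 32)
        print(f"{name} fp16: ~{single:.0f} tok/s at batch 1, "
              f"~{batched:.0f} tok/s aggregate at batch 32")
```

The single-sequence ceiling is what on-chip SRAM designs attack, while on GPUs the practical lever is batch size: larger batches raise aggregate throughput without raising per-user speed.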

  4. 04

    Why it matters to you

    Meta standardizes APIs for inference, RAG, safety and agents in a single deployable stack: the reference point for anyone who wants a reproducible LLMOps architecture without depending on a single provider.

    Medium AI Infrastructure

    Llama Stack: Meta proposes a unified API spec for the LLM lifecycle

    Meta announces Llama Stack: an API spec + reference implementations for inference, safety, agents, memory, evals, RAG, and training — meant as 'standard plumbing' for Llama-based applications.

  5. 05

    Why it matters to you

    The throughput record that redefines what can be promised in a production SLA: the published numbers become the new benchmark against which every vLLM or Triton deployment configuration is measured.

    Medium AI Infrastructure

    Cerebras hits 2,500+ tok/s on Llama: inference record of the year

    Cerebras Systems publishes inference numbers that beat Nvidia GPUs by an order of magnitude: 2,500+ tok/s on Llama 4 Maverick and Scout, thanks to the wafer-scale WSE-3. Custom silicon is back in the race.

  6. 06

    Why it matters to you

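    vLLM 0.7 turns chunked prefill on by default, splitting long prompts into chunks that share each batch with ongoing decode steps, so a single long prefill no longer stalls token generation for other requests: the release that changes the default latency/throughput trade-off of every vLLM deployment.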

    Medium AI Infrastructure

    vLLM v0.7: chunked prefill by default and a redesigned V1 engine

    vLLM ships v0.7 with chunked prefill on by default, a rewritten 'V1' engine scheduler, and advanced support for MoE (DeepSeek V3/R1) and multimodal models. +1.5-2× throughput on many workloads.
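
The idea behind chunked prefill can be illustrated with a toy scheduler. This is a simplified sketch, not vLLM's actual scheduler: the `Request` class, `schedule_step` function and the 512-token budget are hypothetical names and values chosen for the example. Each step has a fixed token budget, decode-phase requests get their single token first, and whatever budget remains is filled with a chunk of a pending prompt.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int   # prompt tokens still waiting to be prefilled
    decode_tokens: int   # output tokens still to be generated


def schedule_step(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Plan one engine step: decodes first, then prefill chunks."""
    plan: dict[str, int] = {}
    budget = token_budget
    # 1) Requests already in the decode phase need exactly one token each.
    for r in requests:
        if r.prompt_tokens == 0 and r.decode_tokens > 0 and budget > 0:
            plan[r.name] = 1
            r.decode_tokens -= 1
            budget -= 1
    # 2) Spend the remaining budget on chunks of pending prompts.
    for r in requests:
        if r.prompt_tokens > 0 and budget > 0:
            chunk = min(r.prompt_tokens, budget)
            plan[r.name] = chunk
            r.prompt_tokens -= chunk
            budget -= chunk
    return plan


if __name__ == "__main__":
    # One request mid-generation, one arriving with a long prompt.
    requests = [Request("chat", 0, 3), Request("long-doc", 1000, 2)]
    step = 0
    while any(r.prompt_tokens or r.decode_tokens for r in requests):
        step += 1
        print(step, schedule_step(requests, token_budget=512))
```

In this sketch the `chat` request keeps producing one token per step while the 1000-token prompt is absorbed over two steps. Without chunking, the whole prompt would occupy a step on its own and the ongoing generation would pause for it.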

  7. 07

    Why it matters to you

    The breakthrough that compresses 70B models to sub-4-bit precision with negligible quality loss: it redefines the minimum hardware needed for on-premise deployment and opens edge scenarios that were unthinkable just months before.

    Medium Local AI

    Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM

    New quantization techniques (high-quality 2-bit / 3-bit extensions) let frontier-sized reasoning models run on workstations with 32-64GB unified RAM.
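
The hardware impact follows from simple arithmetic: weight memory is parameter count times bits per weight, divided by eight. The sketch below is a rough estimate under stated assumptions; the function name is hypothetical, and real quantized formats also store scales and zero-points, so effective bits per weight run somewhat higher than the nominal figure. KV cache and activations come on top.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), metadata excluded."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Illustrative: a 70B-parameter model at common nominal precisions.
    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2}-bit: ~{weight_memory_gb(70e9, bits):.1f} GB of weights")
```

For a 70B model the weights alone drop from about 140 GB at 16-bit to about 17.5 GB at a nominal 2 bits, which is what moves such models from multi-GPU servers into the 32-64GB workstation range.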