AImpact

Reading path

ML engineer: training, optimization and infrastructure

GPUs, scaling laws, fast inference and quantization: the technical through-line.

You are an ML engineer who wants to understand the architectural and infrastructure decisions that drove the frontier model race. This path connects the foundational scaling papers with the hardware architectures that made them possible, and ends with the ultra-fast inference systems that shape the production cost of an LLM today.

  1. 01

    Why it matters to you

    The Chinchilla scaling laws rewrite the optimal ratio between parameters and training tokens: understanding them is the prerequisite for any sensible decision about how large a model to train and on how many tokens.

    Landmark Foundation Models

    Chinchilla: the big models were undertrained

    DeepMind publishes the Chinchilla paper and shows that, given equal compute, smaller models trained on far more tokens beat oversized undertrained ones.

  2. 02

    Why it matters to you

    The architecture that made training modern frontier models practical: the FP8 Transformer Engine, NVLink 4 and HBM3 change what a given compute budget can deliver from this point forward.

    Landmark AI Infrastructure

    NVIDIA H100 and Hopper architecture: the foundation-model GPU

    At GTC 2022 NVIDIA unveils the Hopper architecture and the H100 GPU, with the FP8 Transformer Engine and NVLink 4. It will become the hardware substrate for most frontier LLMs of the following years.

  3. 03

    Why it matters to you

    The paper documenting the training of a 540B-parameter model across 6,144 TPU v4 chips in parallel: a practical reference for anyone wanting to understand distributed training at scale.

    Medium Foundation Models

    PaLM 540B: Google's GPT-3 answer brings chain-of-thought

    Google publishes PaLM, a 540B-parameter model trained on the new Pathways system. It demonstrates emergent reasoning capabilities when prompted with chain-of-thought.

  4. 04

    Why it matters to you

    The language for writing custom GPU kernels without writing CUDA C++ directly: essential for optimizing attention (including FlashAttention-style kernels) and any other latency-critical operation.

    Medium AI Infrastructure

    OpenAI Triton: writing GPU kernels in Python becomes practical

    OpenAI releases Triton, a Python-like language and compiler for writing custom GPU kernels at performance close to hand-written CUDA, dramatically lowering the barrier for model optimization.

  5. 05

    Why it matters to you

    Demonstrates that inference-dedicated hardware can beat GPU-served endpoints by roughly an order of magnitude in per-user token throughput: it reframes the make-vs-buy calculus on inference.

    High AI Infrastructure

    Groq LPU: 500-tokens-per-second inference goes viral

    Groq's public demo on Llama 2 70B generates ~500 tokens/sec per request, roughly an order of magnitude faster than typical GPU-served endpoints at the time. LLM latency stops being a given.

  6. 06

    Why it matters to you

    The wafer-scale chip that keeps model weights in on-chip memory to sidestep the off-chip memory bottleneck: it shows that memory bandwidth, not FLOPs, is the main limit on LLM token generation.

    Medium AI Infrastructure

    Cerebras Inference: record-breaking LLM inference throughput on the wafer-scale WSE-3

    Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B, which it puts at 10-20× faster than H100-based GPU services.

  7. 07

    Why it matters to you

    The architecture introducing FP4 and NVLink 5 for training and inference: the hardware reference for 2025 ML clusters and low-precision quantization choices.

    Landmark AI Infrastructure

    NVIDIA Blackwell: B200 and GB200 NVL72, the rack-scale AI era

    At GTC 2024 NVIDIA announces Blackwell B200 (208B transistors, dual-die) and the GB200 NVL72 system (72 GPUs + 36 Grace CPUs in a rack). NVIDIA claims up to 30x faster inference on trillion-parameter LLMs compared with the same number of H100 GPUs.
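
The Chinchilla result in stop 01 is often condensed into two rules of thumb: training compute of about C = 6·N·D FLOPs for N parameters and D tokens, and roughly 20 training tokens per parameter at the compute-optimal point. The sketch below applies them to a compute budget. Both constants are widely quoted approximations rather than exact values from the paper, and the function name is hypothetical.

```python
import math

# Assumed rule-of-thumb constants, not exact fits from the paper.
TOKENS_PER_PARAM = 20.0        # compute-optimal ratio D / N
FLOPS_PER_PARAM_TOKEN = 6.0    # training cost model: C ~ 6 * N * D


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (N params, D tokens) that spend `compute_flops` with D = 20 * N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.0e} FLOPs -> N~{n / 1e9:.1f}B params, "
              f"D~{d / 1e9:.0f}B tokens")
```

Under these assumptions a budget of about 5.9e23 FLOPs gives roughly 70B parameters and 1.4T tokens, which matches Chinchilla's own configuration.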
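
Stops 05 and 06 both turn on the idea that single-stream token generation is bound by memory bandwidth: each new token has to stream roughly all model weights once. Below is a minimal sketch of that ceiling, assuming batch size 1, no KV-cache traffic and perfect bandwidth utilization. The bandwidth figure in the usage example is an illustrative placeholder, not a vendor specification, and the function name is hypothetical.

```python
def decode_tokens_per_second(n_params: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed when weight streaming dominates."""
    weight_bytes = n_params * bytes_per_param
    # One full pass over the weights per generated token.
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    # Hypothetical device with 3 TB/s of memory bandwidth (placeholder value).
    bandwidth = 3e12
    for name, bytes_per_param in (("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)):
        tps = decode_tokens_per_second(70e9, bytes_per_param, bandwidth)
        print(f"70B @ {name}: <= {tps:.1f} tokens/s")
```

The bound rises linearly with bandwidth and falls linearly with bytes per parameter, which is why on-chip memory designs and low-precision formats both attack the same limit.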
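
The low-precision choices mentioned in stop 07 come down to simple arithmetic on weight storage. The sketch below estimates the weight footprint of a model at FP16, FP8 and FP4 and the minimum number of devices needed to hold it. It assumes an 80 GB device and ignores KV cache, activations and quantization scale overhead, so real deployments need more headroom. All names are hypothetical.

```python
import math

# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_footprint_gb(n_params: float, fmt: str) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) for the given format."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9


def min_devices(n_params: float, fmt: str, device_mem_gb: float) -> int:
    """Fewest devices whose combined memory holds the weights alone."""
    return math.ceil(weight_footprint_gb(n_params, fmt) / device_mem_gb)


if __name__ == "__main__":
    device_mem_gb = 80.0  # assumed per-device memory
    for fmt in ("fp16", "fp8", "fp4"):
        gb = weight_footprint_gb(70e9, fmt)
        n = min_devices(70e9, fmt, device_mem_gb)
        print(f"70B @ {fmt}: {gb:.0f} GB of weights, >= {n} device(s)")
```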