Reading path
ML engineer: training, optimization and infrastructure
GPUs, scaling laws, fast inference and quantization: the technical through-line.
You are an ML engineer who wants to understand the architectural and infrastructure decisions that drove the frontier model race. This path connects the foundational scaling papers with the hardware architectures that made them possible, and ends with the fast-inference systems that shape the production cost of an LLM today.
- 01
Why it matters to you
Chinchilla scaling laws rewrite the optimal ratio between parameters and training tokens: understanding them is the prerequisite for any sensible decision about how long to train a model.
Landmark | Foundation Models | Chinchilla: the big models were undertrained
DeepMind publishes the Chinchilla paper and shows that, given equal compute, smaller models trained on far more tokens beat oversized undertrained ones.
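As a rough illustration of the parameter-to-token trade-off described above, the sketch below splits a fixed compute budget between model size and training tokens. It relies on two common rules of thumb that are assumptions here, not figures taken from this page: training compute C ≈ 6·N·D FLOPs, and a compute-optimal ratio of roughly 20 tokens per parameter. The function name `chinchilla_optimal` is hypothetical.

```python
import math


def chinchilla_optimal(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget into (params, tokens).

    Assumes C ~= k * N * D with k = 6 and D ~= r * N with r = 20.
    Both constants are rules of thumb, not exact fitted values.
    """
    # Substituting D = r * N into C = k * N * D gives C = k * r * N^2.
    params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    # Illustrative budget: the compute a 70B model would use on 1.4T tokens.
    budget = 6 * 70e9 * 1.4e12
    n, d = chinchilla_optimal(budget)
    print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")
```

With the same budget, doubling the parameter count under these assumptions would halve the tokens the model can see, which is the "oversized and undertrained" regime the paper argues against.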
- 02
Why it matters to you
The architecture that made training modern frontier models possible: Transformer Engine, NVLink 4 and HBM3 change compute budgets from this point forward.
Landmark | AI Infrastructure | NVIDIA H100 and Hopper architecture: the foundation-model GPU
At GTC 2022 NVIDIA unveils the Hopper architecture and the H100 GPU, with FP8 Transformer Engine and NVLink 4. It will become the hardware substrate for nearly every large LLM of the following years.
- 03
Why it matters to you
The paper documents the training of a 540B-parameter model across 6,144 TPU v4 chips in parallel: a practical reference for anyone who wants to understand distributed training at scale.
Medium | Foundation Models | PaLM 540B: Google's answer to GPT-3 brings chain-of-thought
Google publishes PaLM, a 540B-parameter model trained with the new Pathways system. It demonstrates emergent reasoning capabilities when prompted with chain-of-thought.
- 04
Why it matters to you
The language for writing custom GPU kernels without hand-written CUDA: useful for optimizing attention (including FlashAttention-style kernels) and any other latency-critical operation.
Medium | AI Infrastructure | OpenAI Triton: writing GPU kernels in Python becomes practical
OpenAI releases Triton, a Python-like language and compiler for writing custom GPU kernels at performance close to hand-written CUDA, dramatically lowering the barrier to model optimization.
- 05
Why it matters to you
Demonstrates that inference-dedicated hardware can outpace typical GPU serving by roughly an order of magnitude in per-user token throughput: it reframes the make-vs-buy calculus on inference.
High | AI Infrastructure | Groq LPU: 500-tokens-per-second inference goes viral
Groq's public demo on Llama 2 70B generates ~500 tokens/sec, roughly an order of magnitude faster than typical GPU-served endpoints at the time. LLM latency stops being a given.
- 06
Why it matters to you
The wafer-scale chip that sidesteps the memory bottleneck for large models: it shows that memory bandwidth, not FLOPs, is the real constraint in LLM inference.
Medium | AI Infrastructure | Cerebras Inference: record-breaking LLM inference throughput on the wafer-scale WSE-3
Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B, which the company puts at 10-20× faster than H100-based GPU services.
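The claim that memory bandwidth rather than FLOPs limits inference can be checked with back-of-the-envelope arithmetic. The sketch below assumes single-stream decoding in which every weight is read from memory once per generated token, and it ignores KV-cache traffic, batching and multi-device parallelism. The function name and the 3 TB/s bandwidth figure are hypothetical and chosen only for illustration.

```python
def decode_tokens_per_sec_upper_bound(n_params, bytes_per_param, mem_bandwidth_bytes_per_sec):
    """Bandwidth-bound ceiling on single-stream decode speed.

    Assumes each generated token requires streaming all weights once,
    so time per token >= model_bytes / memory_bandwidth.
    """
    model_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / model_bytes


if __name__ == "__main__":
    # Hypothetical accelerator with 3 TB/s of memory bandwidth (illustrative).
    bandwidth = 3.0e12
    for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        tps = decode_tokens_per_sec_upper_bound(70e9, bytes_per_param, bandwidth)
        print(f"70B weights at {label}: at most ~{tps:.0f} tokens/s per stream")
```

Under these assumptions the ceiling scales linearly with bandwidth and inversely with bytes per weight, which is why both faster on-chip memory and lower-precision weights raise per-user token rates.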
- 07
Why it matters to you
The architecture introducing FP4 and NVLink 5 for training and inference: the hardware reference for 2025 ML clusters and low-precision quantization choices.
Landmark | AI Infrastructure | NVIDIA Blackwell: B200 and GB200 NVL72, the rack-scale AI era
At GTC 2024 NVIDIA announces the Blackwell B200 (208B transistors, dual-die) and the GB200 NVL72 system (72 GPUs + 36 Grace CPUs in a rack), claiming up to 30x faster inference for frontier LLMs than H100-based systems.
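To make the low-precision trade-off concrete, the sketch below fake-quantizes a block of values onto a 4-bit floating-point grid with one shared scale per block. It assumes an E2M1-style grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a simple max-abs scale. The format and scaling scheme that any particular hardware uses are not specified on this page, and `quantize_block_fp4` is a hypothetical helper, not a library API.

```python
import math

# Magnitudes representable by an E2M1-style 4-bit float (assumed grid).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_fp4(values):
    """Fake-quantize one block and return (dequantized_values, scale).

    The scale maps the largest magnitude in the block to 6.0, the top of
    the grid. Each value is then snapped to the nearest grid point; ties
    go to the smaller magnitude in this simplified sketch.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values], 1.0
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]
    out = []
    for v in values:
        target = abs(v) / scale
        mag = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(target - g))
        out.append(math.copysign(mag * scale, v))
    return out, scale


if __name__ == "__main__":
    block = [6.0, -3.0, 0.4, 2.4]
    dequantized, scale = quantize_block_fp4(block)
    print(dequantized, scale)
```

The grid is coarse near the top of the range, so a single outlier in a block inflates the scale and pushes small values toward zero. That is one reason low-precision schemes tend to use small blocks with their own scales.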