Reading path

ML engineer: training, optimization and infrastructure

GPUs, scaling laws, fast inference and quantization: the technical through-line.

You are an ML engineer who wants to understand the architectural and infrastructure decisions that drove the frontier model race. This path connects the foundational scaling papers with the hardware architectures that made them possible, and ends with the ultra-fast inference systems that shape the production cost of an LLM today.

01

Why it matters to you

Chinchilla scaling laws rewrite the optimal ratio between parameters and training tokens: understanding them is the prerequisite for any sensible decision about how long to train a model.

March 29, 2022 Landmark Foundation Models

Chinchilla: the big models were undertrained

DeepMind publishes the Chinchilla paper and shows that, given equal compute, smaller models trained on far more tokens beat oversized undertrained ones.
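
The trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below relies on two common approximations that this entry does not state and that should be read as assumptions: training compute of roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of about 20 tokens per parameter. The function names are illustrative, and the 70B / 1.4T figures are Chinchilla's reported configuration.

```python
import math

# Assumption: a dense transformer costs about 6 FLOPs per parameter per
# training token (forward plus backward pass).
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: roughly 20 training tokens per parameter is compute-optimal,
# a rule of thumb commonly derived from the Chinchilla results.
TOKENS_PER_PARAM = 20.0


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute C = 6 * N * D."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a compute budget into (parameters, tokens) under D = 20 * N."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


if __name__ == "__main__":
    budget = training_flops(70e9, 1.4e12)  # a 70B model trained on 1.4T tokens
    n, d = compute_optimal(budget)
    print(f"budget: {budget:.2e} FLOPs")
    print(f"optimal size: {n / 1e9:.1f}B params, {d / 1e12:.1f}T tokens")
```

Under these assumptions, spending the same budget on a model twice as large leaves only half as many training tokens, which is the undertrained regime the paper argues against.
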
02

Why it matters to you

The architecture that made training modern frontier models possible: Transformer Engine, NVLink4 and HBM3 change compute budgets from this point forward.

March 22, 2022 Landmark AI Infrastructure

NVIDIA H100 and Hopper architecture: the foundation-model GPU

At GTC 2022 NVIDIA unveils the Hopper architecture and the H100 GPU, with FP8 Transformer Engine and NVLink 4. It will become the hardware substrate for nearly every large LLM of the following years.
03

Why it matters to you

The paper documenting the training of a 540B model across 6,144 TPU v4 chips in parallel: the practical reference for anyone wanting to understand distributed training at scale.

April 5, 2022 Medium Foundation Models

PaLM 540B: Google's GPT-3 answer brings chain-of-thought

Google publishes PaLM, a 540B-parameter model trained on the new Pathways system. It demonstrates emergent reasoning capabilities when prompted with chain-of-thought.
04

Why it matters to you

The language for writing custom GPU kernels without hand-written CUDA: essential for optimizing attention (including FlashAttention-style kernels) and any other latency-critical operation.

July 28, 2021 Medium AI Infrastructure

OpenAI Triton: writing GPU kernels in Python becomes practical

OpenAI releases Triton, a Python-like language and compiler for writing custom GPU kernels at performance close to hand-written CUDA — dramatically lowering the barrier for model optimization.
05

Why it matters to you

Demonstrates that inference-dedicated hardware can outperform GPU serving by roughly an order of magnitude in per-user token throughput: it reframes the make-vs-buy calculus on inference.

February 22, 2024 High AI Infrastructure

Groq LPU: 500-tokens-per-second inference goes viral

Groq's public demo on Llama 2 70B generates ~500 tokens/sec, roughly an order of magnitude faster than typical GPU-based serving at the time. LLM latency stops being a given.
06

Why it matters to you

The wafer-scale chip that keeps model weights in on-chip SRAM to sidestep the off-chip memory bottleneck: it shows that memory bandwidth, not FLOPs, is the real limit on LLM inference speed.

August 27, 2024 Medium AI Infrastructure

Cerebras Inference: record-breaking LLM inference throughput on the wafer-scale WSE-3

Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B — 10-20× faster than H100 GPUs.
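
A rough way to see why memory bandwidth dominates single-stream decoding: if every generated token has to stream all model weights from memory once, the decode rate is capped by bandwidth divided by model size. The sketch below assumes 16-bit weights, a single request (no batching), and ignores KV-cache traffic; the function names and the 3 TB/s accelerator are hypothetical values chosen for illustration.

```python
def required_bandwidth(n_params: float, bytes_per_param: float, tokens_per_s: float) -> float:
    """Bytes/s of weight traffic needed to sustain a single-stream decode rate."""
    return n_params * bytes_per_param * tokens_per_s


def decode_rate_bound(n_params: float, bytes_per_param: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s for one stream when weights are re-read per token."""
    return bandwidth_bytes_per_s / (n_params * bytes_per_param)


if __name__ == "__main__":
    # 70B parameters with 16-bit weights (assumed), at the ~450 tokens/s claimed above.
    need = required_bandwidth(70e9, 2, 450)
    print(f"weight traffic needed: {need / 1e12:.0f} TB/s")

    # Hypothetical accelerator with 3 TB/s of memory bandwidth (illustrative value).
    bound = decode_rate_bound(70e9, 2, 3e12)
    print(f"single-stream bound at 3 TB/s: {bound:.1f} tokens/s")
```

Batching amortizes each weight read over many requests, so aggregate GPU throughput can be far higher than this bound suggests; the estimate applies to per-user latency, which is the metric these demos highlight.
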
07

Why it matters to you

The architecture introducing FP4 and NVLink5 for training and inference: the hardware reference for 2025 ML clusters and low-precision quantization choices.

March 18, 2024 Landmark AI Infrastructure

NVIDIA Blackwell: B200 and GB200 NVL72, the rack-scale AI era

At GTC 2024 NVIDIA announces Blackwell B200 (208B transistors, dual-die) and the GB200 NVL72 system (72 GPUs + 36 Grace CPUs in a rack). NVIDIA claims up to 30x faster inference for frontier LLMs compared with the same number of H100 GPUs.
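
To see what lower-precision formats buy in practice, the sketch below estimates the storage needed for model weights alone at 16, 8 and 4 bits per weight. It is a simplification under stated assumptions: it ignores the per-block scale factors that real low-precision formats carry, as well as activations and the KV cache. The 70B model size and the helper name are illustrative.

```python
# Bits per stored weight for each format (scale-factor overhead ignored).
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp4": 4}


def weight_footprint_gb(n_params: float, fmt: str) -> float:
    """Weight storage in GB (10^9 bytes) for a given number format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"70B weights in {fmt}: {weight_footprint_gb(70e9, fmt):.0f} GB")
```

Each halving of precision halves both the memory needed to hold the model and the bytes that must move per generated token, which is why FP8 and FP4 matter for inference cost as much as for raw FLOPs.
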