Infrastructure Advanced Also known as: PP · Inter-layer Parallelism

Pipeline Parallelism

Pipeline parallelism is a distributed training strategy in which a neural network's layers are split into contiguous blocks, called stages, each assigned to a separate GPU. Each GPU runs its own block of layers and passes the resulting activations to the next GPU, forming a pipeline. It differs from tensor parallelism, which splits the individual weight matrices within a single layer across GPUs. Combined with tensor parallelism and data parallelism, it forms '3D parallelism', used by Megatron-LM to train models with hundreds of billions of parameters.
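As a rough illustration of the "contiguous blocks" idea, the sketch below assigns layer indices to stages as evenly as possible. The function name `partition_layers` is hypothetical and not taken from DeepSpeed or Megatron-LM; real frameworks may also balance stages by parameter count or compute cost rather than by layer count.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages.

    Illustrative only: balances by layer count, assuming every layer costs
    about the same to run.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")

    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    # Each stage would live on its own GPU; activations flow from one
    # stage to the next in order.
    for gpu, layers in enumerate(partition_layers(10, 4)):
        print(f"stage {gpu}: layers {list(layers)}")
```

With 10 layers and 4 stages this gives stage sizes of 3, 3, 2 and 2, and every layer belongs to exactly one stage.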


In practice

An engineer training a model too large for a single GPU, or even a single multi-GPU node, uses pipeline parallelism to distribute layers across multiple nodes. With DeepSpeed or Megatron-LM you configure the pipeline degree (the number of stages) and the number of micro-batches each batch is split into. More micro-batches keep the stages busy and shrink the pipeline bubble: the time GPUs sit idle while the pipeline fills at the start of a batch and drains at the end. For inference, the same approach allows serving very large LLMs by distributing layers across multiple servers.
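To see why the micro-batch count matters, the sketch below estimates the bubble for a simple fill-then-drain (GPipe-style) schedule. It assumes every stage takes the same time per micro-batch and that communication is free, in which case the idle share of each step is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function name `bubble_fraction` is hypothetical; interleaved schedules in real frameworks can do better than this estimate.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of a training step under a fill-then-drain schedule.

    Assumes equal per-stage compute time and zero communication cost,
    so this is an idealized estimate rather than a measured value.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # Same 4-stage pipeline, increasing number of micro-batches.
    for m in (1, 4, 16, 64):
        print(f"stages=4 microbatches={m:>2} bubble={bubble_fraction(4, m):.1%}")
```

Under these assumptions a 4-stage pipeline is idle 3/7 of the time with 4 micro-batches and 3/35 of the time with 32, which is why the micro-batch count is usually set well above the number of stages, memory permitting.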

Related terms

Quantization · Inference compute

