Pipeline Parallelism
Pipeline parallelism is a distributed training strategy in which a neural network's layers are split into contiguous blocks (stages), each assigned to a separate GPU. Each GPU runs its block of layers and passes the resulting activations to the next GPU, forming a pipeline. It differs from tensor parallelism, which splits the individual weight matrices within a single layer across devices. Combined with tensor parallelism and data parallelism, it forms '3D parallelism', the approach Megatron-LM uses to train models with hundreds of billions of parameters.
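The layer split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `partition_layers` and `pipeline_forward` are hypothetical names (not part of Megatron-LM or DeepSpeed), the "layers" are ordinary functions, and the hand-off between stages is a local variable rather than a real GPU-to-GPU transfer.

```python
from typing import Callable, List, Sequence


def partition_layers(num_layers: int, num_stages: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous blocks, one per stage.

    Hypothetical helper: any remainder is spread over the earliest stages.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def pipeline_forward(layers: Sequence[Callable], stages: List[range], x):
    """Run one input through every stage in order.

    Each outer iteration stands in for one GPU; in a real system the value of
    `x` at the end of an iteration would be sent to the next device.
    """
    for block in stages:
        for layer_index in block:
            x = layers[layer_index](x)
    return x


# Toy model (assumption): layer k simply adds k to its input.
toy_layers = [lambda x, k=k: x + k for k in range(10)]
toy_stages = partition_layers(num_layers=10, num_stages=4)
result = pipeline_forward(toy_layers, toy_stages, 0)
```

With 10 layers and 4 stages, the helper yields blocks of 3, 3, 2, and 2 layers, and the staged forward pass gives the same result as running all layers in sequence.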
In practice
An engineer training a model too large for a single GPU, or even for a single multi-GPU node, uses pipeline parallelism to distribute layers across multiple nodes. With DeepSpeed or Megatron-LM you configure the pipeline degree (the number of stages) and the number of micro-batches each batch is split into. More micro-batches keep the pipeline full and reduce bubble overhead, the idle GPU time that occurs while the pipeline fills and drains. In inference, the same approach allows serving very large LLMs by distributing layers across multiple servers.
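To see why the number of micro-batches matters, the following sketch lays out a simplified forward-only schedule and counts idle slots. It assumes every stage takes one equal time step per micro-batch and ignores the backward pass and communication cost, so it is a rough model rather than the scheduler of any particular framework; `forward_schedule` and `bubble_fraction` are hypothetical names.

```python
from typing import List, Optional


def forward_schedule(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Return a stage-by-time grid; each cell holds a micro-batch id or None (idle).

    Assumption: stage s can start micro-batch mb only at time step s + mb,
    once the previous stage has finished it.
    """
    total_steps = num_microbatches + num_stages - 1
    grid: List[List[Optional[int]]] = [[None] * total_steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_microbatches):
            grid[stage][stage + mb] = mb
    return grid


def bubble_fraction(grid: List[List[Optional[int]]]) -> float:
    """Fraction of all (stage, time step) slots in which a stage sits idle."""
    slots = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return idle / slots


few = bubble_fraction(forward_schedule(num_stages=4, num_microbatches=4))
many = bubble_fraction(forward_schedule(num_stages=4, num_microbatches=32))
print(f"4 micro-batches: {few:.1%} idle, 32 micro-batches: {many:.1%} idle")
```

Under these assumptions the idle fraction works out to (p - 1) / (m + p - 1) for p stages and m micro-batches, so with 4 stages it falls from 3/7 to 3/35 as the micro-batch count rises from 4 to 32.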
Seen in the wild
29 entries mentioning it
- [Medium] Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM
- [Medium] OpenAI shuts down the Sora app: consumer AI video can't sustain the math
- [High] Gemini Robotics: DeepMind brings foundation models into the physical world
- [Medium] Local AI 2025: Ollama, MLX LM, Apple Foundation Models triple the speed
- [Medium] KoboldCpp v1.84: native RAG with embedded ChromaDB, no separate servers
- [High] AI supply chain attacks: poisoned models, malicious LoRA adapters, and backdoored GGUF files
- [High] Microsoft 365 Copilot Autonomous Agents: Sales, IT, and HR work without constant oversight
- [High] llama.cpp: speculative decoding with draft models for 2-3x speedup
- [Medium] GitHub Spark: from natural language description to deployed web micro-app
- [Medium] llama.cpp Vulkan backend: GPU acceleration for AMD, Intel Arc, and beyond CUDA
- [Medium] Pinokio: the App Store for local AI tools
- [Medium] Zendesk AI Suite: autonomous agents for end-to-end customer support
- [High] Gemma 2: Google's second-gen open model with Gemini distillation
- [High] Apple Intelligence: Apple's AI plan, on-device + Private Cloud Compute
- [Medium] KoboldCpp adds integrated RAG: offline all-in-one LLM with documents and character AI
- [High] FlashAttention-3: 2.6x speedup over FA2 optimized for H100 Hopper with wgmma, TMA, and FP8
- [Medium] GGUF specification: the standard format for local quantized LLM models
- [Medium] Mozilla llamafile: LLM in a single portable executable on any OS
- [Medium] Apptronik Apollo: general purpose humanoid with open ROS2 API
- [Medium] Jan.ai: open source desktop app for local LLMs with threads and local server