AImpact
Infrastructure · Advanced · Also known as: PP, Inter-layer Parallelism

Pipeline Parallelism

Pipeline parallelism is a distributed training strategy in which a neural network's layers are split into contiguous blocks (stages), each assigned to a separate GPU. Each GPU runs the forward pass for its block and sends the resulting activations to the next GPU; in the backward pass, gradients flow through the stages in reverse order. It differs from tensor parallelism, which splits individual weight matrices within a single layer across GPUs. Combined with tensor parallelism and data parallelism, it forms "3D parallelism", the approach used by Megatron-LM to train models with hundreds of billions of parameters.
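
As a rough illustration of the "contiguous blocks" idea, the sketch below assigns layer indices to stages as evenly as possible. The function name `partition_layers` is hypothetical and is not taken from DeepSpeed or Megatron-LM; real frameworks may also balance stages by parameter count or compute cost rather than by layer count alone.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages.

    Hypothetical helper for illustration only: it balances by layer count,
    which assumes every layer costs roughly the same to compute.
    """
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")

    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Example: 10 layers over 4 GPUs gives stage sizes 3, 3, 2, 2.
for gpu, layers in enumerate(partition_layers(10, 4)):
    print(f"GPU {gpu}: layers {list(layers)}")
```

Each stage owns a run of consecutive layers, so only the activations at a stage boundary have to travel between GPUs.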

In practice

An engineer training a model too large for a single GPU, or even for a single multi-GPU node, uses pipeline parallelism to distribute layers across multiple nodes. With DeepSpeed or Megatron-LM, you configure the pipeline degree (the number of stages) and the number of micro-batches each batch is split into. More micro-batches keep the stages busy and shrink the pipeline bubble: the idle GPU time while the pipeline fills at the start of a batch and drains at the end. In inference, the same approach allows serving very large LLMs by distributing layers across multiple servers.
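
To make the trade-off between stages and micro-batches concrete, here is a minimal sketch under simplifying assumptions: a GPipe-style schedule, every stage taking the same time per micro-batch, and zero communication cost. Under those assumptions a pass with p stages and m micro-batches takes m + p - 1 time slots, of which each stage is idle for p - 1, so the idle fraction is (p - 1) / (m + p - 1). The function names are hypothetical and are not part of DeepSpeed or Megatron-LM.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction per stage in an idealized GPipe-style schedule.

    Assumes equal time per stage per micro-batch and no communication cost.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def forward_timeline(num_stages: int, num_microbatches: int) -> list[str]:
    """One string per stage; each character is a time slot.

    A digit is the micro-batch being processed (mod 10), '.' is idle time.
    Micro-batch i reaches stage s at time slot i + s.
    """
    total_slots = num_microbatches + num_stages - 1
    rows = []
    for stage in range(num_stages):
        row = []
        for slot in range(total_slots):
            microbatch = slot - stage
            if 0 <= microbatch < num_microbatches:
                row.append(str(microbatch % 10))
            else:
                row.append(".")
        rows.append("".join(row))
    return rows


# Visualize the forward pass for 4 stages and 8 micro-batches.
for stage, row in enumerate(forward_timeline(4, 8)):
    print(f"stage {stage}: {row}")

# Compare how the bubble shrinks as micro-batches increase.
for m in (4, 32):
    print(f"p=4, m={m}: bubble = {bubble_fraction(4, m):.1%}")
```

With 4 stages, going from 4 to 32 micro-batches cuts the idealized bubble from about 43% to about 9%. In practice the gain is limited by the activation memory that in-flight micro-batches consume and by the schedule the framework actually uses.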
