Megatron-LM v2: 3D Parallelism for 530-Billion-Parameter Models
In one sentence: NVIDIA adds interleaved pipeline scheduling and sequence parallelism to Megatron-LM, which lets NVIDIA and Microsoft train the 530B-parameter MT-NLG on 2240 A100 GPUs.
Imagine building a massive cathedral with many small workers, none of whom could build it alone. The solution is to divide the work three ways at once: each team takes a different section of the building (pipeline), the workers within a team split each piece of their section among themselves (tensor), and several identical crews repeat the same pattern in parallel (data). This is what Megatron-LM v2 does with giant AI models.
Before this version, training models with hundreds of billions of parameters required ad-hoc and often unstable solutions. NVIDIA formalized an approach called 3D parallelism: tensor parallelism splits individual model layers across GPUs, pipeline parallelism splits blocks of layers across GPU groups, and data parallelism replicates the whole sharded model so that each copy processes a different slice of the batch. The key innovation is interleaved scheduling for the pipeline: each GPU holds several smaller chunks of layers instead of one large block, which reduces the time GPUs sit idle waiting on other pipeline stages (the pipeline "bubble").
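The three-way split and the effect of interleaving can be sketched in a few lines. This is a minimal illustration, not Megatron-LM's code: the names `Coord`, `rank_to_coord`, and `bubble_fraction` are hypothetical, and the rank ordering (tensor index varying fastest) is an assumption chosen for readability. The bubble estimate uses the commonly cited approximation (p - 1) / (m * v) for p pipeline stages, m microbatches, and v layer chunks per GPU, where v = 1 means no interleaving.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:
    data: int      # which model replica this GPU belongs to
    pipeline: int  # which pipeline stage (block of layers) it holds
    tensor: int    # which shard of each layer it holds


def rank_to_coord(rank: int, tensor_size: int, pipeline_size: int, data_size: int) -> Coord:
    """Map a flat GPU rank to a position in the 3D grid.

    Assumed layout (hypothetical): the tensor index varies fastest,
    then the pipeline index, then the data index.
    """
    world_size = tensor_size * pipeline_size * data_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside a world of {world_size} GPUs")
    tensor = rank % tensor_size
    pipeline = (rank // tensor_size) % pipeline_size
    data = rank // (tensor_size * pipeline_size)
    return Coord(data=data, pipeline=pipeline, tensor=tensor)


def bubble_fraction(pipeline_size: int, microbatches: int, chunks_per_gpu: int = 1) -> float:
    """Approximate idle (bubble) time relative to ideal compute time.

    chunks_per_gpu = 1 is the plain schedule; larger values model
    interleaving, which shrinks the bubble by that factor.
    """
    if pipeline_size < 1 or microbatches < 1 or chunks_per_gpu < 1:
        raise ValueError("all arguments must be positive integers")
    return (pipeline_size - 1) / (microbatches * chunks_per_gpu)


if __name__ == "__main__":
    # A toy 2 x 2 x 2 grid of 8 GPUs.
    for rank in range(8):
        print(rank, rank_to_coord(rank, tensor_size=2, pipeline_size=2, data_size=2))
    # Same pipeline depth and microbatch count, with and without interleaving.
    print(bubble_fraction(pipeline_size=4, microbatches=8))
    print(bubble_fraction(pipeline_size=4, microbatches=8, chunks_per_gpu=2))
```

In this sketch, doubling the chunks per GPU halves the estimated bubble. The model ignores the extra communication that interleaving introduces, so it overstates the benefit in practice.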
Together with Microsoft, NVIDIA used this technique to train MT-NLG, a 530-billion-parameter model, on 2240 A100 GPUs, demonstrating that training at this scale could be done systematically rather than ad hoc. This blueprint became a reference point for the large-model training frameworks that followed.
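To make the GPU count concrete, the sketch below factors 2240 GPUs into the three parallelism degrees. The 8-way tensor and 35-way pipeline degrees are assumptions for illustration and are not stated in this text. Under them, one model replica spans 280 GPUs, which leaves room for 8 data-parallel replicas. The helper name `data_parallel_replicas` is hypothetical.

```python
def data_parallel_replicas(total_gpus: int, tensor_size: int, pipeline_size: int) -> int:
    """Return how many full model replicas fit on the cluster.

    One replica needs tensor_size * pipeline_size GPUs; the GPU count
    must divide evenly or some GPUs would be left without a role.
    """
    gpus_per_replica = tensor_size * pipeline_size
    if total_gpus % gpus_per_replica != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be split into replicas of {gpus_per_replica}"
        )
    return total_gpus // gpus_per_replica


if __name__ == "__main__":
    # Assumed degrees (not from this text): 8-way tensor, 35-way pipeline.
    replicas = data_parallel_replicas(total_gpus=2240, tensor_size=8, pipeline_size=35)
    print(f"GPUs per replica: {8 * 35}, data-parallel replicas: {replicas}")
```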
Companies
NVIDIA, Microsoft
Tools
—
Tags
Sources