Megatron-Turing NLG 530B: Microsoft and NVIDIA scale dense past GPT-3

In one sentence: Microsoft and NVIDIA announce MT-NLG, a 530B-parameter dense model trained with DeepSpeed and Megatron-LM and, at the time, the largest dense LM ever produced.

Microsoft and NVIDIA flex their joint muscle: they unveil Megatron-Turing NLG 530B, a dense (not sparse) language model with 530 billion parameters. That is roughly three times the size of GPT-3, which has 175 billion.

They trained it on NVIDIA's Selene supercomputer with thousands of A100 GPUs. Microsoft brings DeepSpeed, its training-optimization library; NVIDIA brings the Megatron-LM framework and the hardware.
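To get a feel for why a model this size needs a cluster rather than a single machine, here is a back-of-the-envelope sketch. It assumes weights stored in 16-bit precision (2 bytes per parameter) and the 80 GB variant of the A100; both are assumptions made for illustration, not details from the announcement, and the function names are hypothetical. It counts weights only, ignoring optimizer state, gradients, and activations, which make the real training footprint far larger.

```python
import math

MT_NLG_PARAMS = 530e9      # parameter count from the announcement
GPT3_PARAMS = 175e9        # GPT-3's published parameter count
BYTES_PER_PARAM = 2        # assumption: 16-bit weights
A100_MEMORY_BYTES = 80e9   # assumption: 80 GB A100 variant


def size_ratio(params_a: float, params_b: float) -> float:
    """How many times larger model A is than model B, by parameter count."""
    return params_a / params_b


def weight_bytes(params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed to hold the weights alone, in bytes."""
    return params * bytes_per_param


def min_gpus_for_weights(params: float, gpu_bytes: float = A100_MEMORY_BYTES) -> int:
    """Lower bound on GPUs needed just to store the weights (no training state)."""
    return math.ceil(weight_bytes(params) / gpu_bytes)


if __name__ == "__main__":
    print(f"MT-NLG vs GPT-3: {size_ratio(MT_NLG_PARAMS, GPT3_PARAMS):.2f}x")
    print(f"Weights alone: {weight_bytes(MT_NLG_PARAMS) / 1e12:.2f} TB")
    print(f"Minimum 80 GB GPUs to hold them: {min_gpus_for_weights(MT_NLG_PARAMS)}")
```

Even under these generous assumptions the weights do not fit on one GPU, which is why the model has to be split across many devices before training can start.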

It's not released to the public; it's a demonstration that you can still scale dense models past OpenAI's numbers. It's also the start of the Microsoft-NVIDIA axis on frontier-model training, which we'll see grow with Azure ND H100 v5 clusters and beyond.