Megatron-Turing NLG 530B: Microsoft and NVIDIA scale dense past GPT-3
In one sentence
Microsoft and NVIDIA announce MT-NLG, a 530B-parameter dense model trained with DeepSpeed and Megatron-LM and, at the time, the largest dense LM ever trained.
Microsoft and NVIDIA flex their joint muscle: they unveil Megatron-Turing NLG 530B, a dense (not sparse) language model with 530 billion parameters, roughly three times the size of GPT-3 (175 billion).
They train it on NVIDIA's Selene supercomputer with thousands of A100 GPUs. Microsoft brings DeepSpeed, its training-optimization library; NVIDIA brings Megatron-LM, its model-parallel training framework, plus the hardware.
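To put the parameter counts above in perspective, here is a rough back-of-the-envelope sketch. It only compares model sizes and estimates how much memory the weights alone would take. The 16-bit weight storage and the 80 GB per-GPU memory are assumptions made for illustration, and the function names are hypothetical; none of this is taken from the actual training setup.

```python
import math

# Parameter counts: 530B is from the announcement, 175B is GPT-3's published size.
MT_NLG_PARAMS = 530 * 10**9
GPT3_PARAMS = 175 * 10**9

# Assumptions for illustration only (not confirmed training details):
BYTES_PER_PARAM_FP16 = 2          # weights held in 16-bit precision
GPU_MEMORY_BYTES = 80 * 10**9     # an A100 with 80 GB of memory


def size_ratio(params_a: int, params_b: int) -> float:
    """How many times larger model A is than model B, by parameter count."""
    return params_a / params_b


def weight_bytes(params: int, bytes_per_param: int) -> int:
    """Memory needed just to store the weights, ignoring everything else."""
    return params * bytes_per_param


def min_gpus_for_weights(params: int, bytes_per_param: int, gpu_bytes: int) -> int:
    """Lower bound on GPUs needed to hold one copy of the weights."""
    return math.ceil(weight_bytes(params, bytes_per_param) / gpu_bytes)


if __name__ == "__main__":
    ratio = size_ratio(MT_NLG_PARAMS, GPT3_PARAMS)
    total = weight_bytes(MT_NLG_PARAMS, BYTES_PER_PARAM_FP16)
    gpus = min_gpus_for_weights(MT_NLG_PARAMS, BYTES_PER_PARAM_FP16, GPU_MEMORY_BYTES)
    print(f"MT-NLG vs GPT-3: {ratio:.2f}x")
    print(f"Weights in 16-bit: {total / 10**9:.0f} GB")
    print(f"GPUs needed for the weights alone: at least {gpus}")
```

Under these assumptions the weights alone do not fit on a single GPU, and training also needs room for gradients, optimizer state, and activations. That is why the model has to be split across many GPUs, which is the job DeepSpeed and Megatron-LM do here.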
It's not released to the public; it's a demonstration that dense models can still be scaled well past GPT-3's parameter count. It's also the start of the Microsoft-NVIDIA axis on frontier-model training, which we'll see grow with Azure ND-H100 clusters and beyond.
Companies
Microsoft, NVIDIA
Tools
Megatron-Turing NLG 530B
Tags
Sources
- https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/
- https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/