DeepSpeed ZeRO-3: training models beyond 100 billion parameters

In one sentence: Microsoft announces ZeRO Stage 3 in DeepSpeed, which shards model parameters across GPUs in addition to gradients and optimizer states, making it possible to train models with more than 100 billion parameters on clusters of a few hundred GPUs.

Training a large AI model takes many GPUs. In standard data-parallel training, each GPU has to hold a full copy of the model, and modern models no longer fit: the weights of a 100-billion-parameter model alone exceed the memory of a single card.
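To put a rough number on "doesn't fit", here is a back-of-the-envelope sketch. It assumes the common accounting for mixed-precision Adam training of 16 bytes per parameter, which is not stated in the announcement, and it ignores activations and temporary buffers. The 80 GB card size is a hypothetical figure for comparison only.

```python
# Rough memory estimate for mixed-precision Adam training WITHOUT sharding.
# Assumption (not from the announcement): 16 bytes of model state per parameter.
BYTES_FP16_WEIGHTS = 2      # half-precision weights used in forward/backward
BYTES_FP16_GRADS = 2        # half-precision gradients
BYTES_FP32_OPTIMIZER = 12   # fp32 master weights + Adam momentum + Adam variance


def model_state_gb(num_params: int) -> dict:
    """Return memory in GB (10**9 bytes) for each kind of model state."""
    gb = 1e9
    return {
        "weights": num_params * BYTES_FP16_WEIGHTS / gb,
        "gradients": num_params * BYTES_FP16_GRADS / gb,
        "optimizer": num_params * BYTES_FP32_OPTIMIZER / gb,
    }


ASSUMED_GPU_MEMORY_GB = 80  # hypothetical card size, for comparison only

states = model_state_gb(100_000_000_000)
total_gb = sum(states.values())
print(states)
print(f"total: {total_gb:.0f} GB vs {ASSUMED_GPU_MEMORY_GB} GB on one card")
```

Under these assumptions the weights alone take 200 GB and the full training state about 1,600 GB, far more than any single card offers.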

Earlier ZeRO stages already split the optimizer states and gradients across GPUs; Stage 3 takes the final step and splits the model weights themselves. Each GPU keeps only a "slice" of the weights and fetches the other slices on the fly when a computation needs them.
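The "slice and fetch on the fly" idea can be illustrated with a toy simulation. This is only a sketch: plain Python lists stand in for GPU memory, the names `shard`, `all_gather`, and `forward` are hypothetical helpers rather than DeepSpeed functions, and the layer computation is a placeholder.

```python
# Toy simulation of parameter sharding. Lists stand in for GPU memory.
# All names are hypothetical; this is not DeepSpeed code.

def shard(params, world_size):
    """Split a flat parameter list into world_size contiguous slices."""
    size = -(-len(params) // world_size)  # ceiling division
    return [params[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards):
    """Rebuild the full parameter list from every rank's slice."""
    return [p for s in shards for p in s]


def forward(layer_shards, x):
    """Toy forward pass that gathers each layer's weights just in time."""
    for shards in layer_shards:
        full = all_gather(shards)   # fetch the slices held by the other ranks
        x = x * sum(full)           # placeholder for the layer's real computation
        del full                    # drop the gathered copy; only slices stay resident
    return x


WORLD_SIZE = 4
layers = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]  # two toy layers, four weights each
layer_shards = [shard(w, WORLD_SIZE) for w in layers]

# Number of weights that rank 0 keeps permanently: one per layer instead of four.
resident_per_rank = sum(len(s[0]) for s in layer_shards)
result = forward(layer_shards, 1.0)
print(resident_per_rank, result)
```

In this sketch each rank permanently stores a quarter of every layer, and a full copy of a layer exists only for the moment that layer is being computed. The price is extra communication each time a layer's weights are gathered.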

The practical result: a cluster of a few hundred GPUs can train a model that previously required a dedicated supercomputer. DeepSpeed becomes the open-source reference library for anyone doing large-scale training.