Switch Transformer: Google scales to 1.6T parameters with Mixture of Experts

In one sentence: Google Brain publishes Switch Transformer, a sparse model with 1.6 trillion parameters that activates only one expert per token, showing that sparse routing can scale parameter counts well beyond those of dense models.

Google releases a paper on a new kind of neural network called Switch Transformer, pushing language models to 1.6 trillion parameters, roughly nine times the 175 billion parameters of GPT-3.

The trick is called Mixture of Experts: instead of lighting up the whole network for every input token, a small router sends each token to a single specialized "expert" sub-network. It works like a big consulting firm where each question gets routed to the right consultant, rather than putting everyone to work at once.
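
The routing idea can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: the names `switch_layer`, `router_weights`, and `make_expert` are hypothetical, the experts are single random linear maps rather than real feed-forward blocks, and the sizes are made up. The router scores every expert, only the top-scoring one runs, and its output is scaled by the router's probability for that expert.

```python
import math
import random


def softmax(logits):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_expert(rng, dim):
    """Toy expert: one random linear map (assumption, for illustration only)."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]


def switch_layer(token, router_weights, experts):
    """Route one token vector to its single highest-scoring expert."""
    probs = softmax(matvec(router_weights, token))
    chosen = max(range(len(experts)), key=lambda i: probs[i])
    # Only the chosen expert does any work; the others are skipped entirely.
    expert_out = matvec(experts[chosen], token)
    # Scale the output by the router's confidence in the chosen expert.
    return chosen, [probs[chosen] * y for y in expert_out]


rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 3  # illustrative sizes, far smaller than any real model
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(rng, DIM) for _ in range(NUM_EXPERTS)]
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]

for token in tokens:
    chosen, out = switch_layer(token, router_weights, experts)
    print(f"expert {chosen} handled the token -> {[round(v, 3) for v in out]}")
```

Each token triggers exactly one expert's computation, no matter how many experts exist, which is why adding experts grows the parameter count without growing the work per token.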

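The payoff: huge models that need far less computation, and therefore less energy, per request than a dense model of the same size, because only a small slice of the network is active for each token. All parameters still have to be stored, but the work per token stays close to that of a much smaller model. It is evidence that you can grow beyond "dense" without compute costs exploding.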
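
A back-of-the-envelope calculation shows the gap between parameters stored and parameters used per token. The layer sizes below are assumptions chosen for illustration, not the configuration from the paper, and `moe_layer_counts` is a hypothetical helper that counts weights in a single expert layer while ignoring biases.

```python
def ffn_params(d_model, d_ff):
    """Weights in one feed-forward expert: two matrices, biases ignored."""
    return 2 * d_model * d_ff


def moe_layer_counts(d_model, d_ff, num_experts):
    """Return (stored, active-per-token) weight counts for one expert layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # one score per expert for each token
    stored = num_experts * per_expert + router
    active = per_expert + router  # top-1 routing: a single expert runs
    return stored, active


# Illustrative sizes only (assumption, not taken from the paper).
stored, active = moe_layer_counts(d_model=1024, d_ff=4096, num_experts=64)
print(f"stored: {stored:,}")
print(f"active per token: {active:,}")
print(f"ratio: {stored / active:.1f}x")
```

With one expert active per token, the stored weights in such a layer grow almost linearly with the number of experts while the active weights barely change.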