Switch Transformer: Google scales to 1.6T parameters with Mixture of Experts
In one sentence: Google Brain publishes Switch Transformer, a sparse model with 1.6 trillion parameters that routes each token to a single expert in each expert layer, showing that sparse routing can push parameter counts well past those of dense models.
Google releases a paper on a new kind of neural network called Switch Transformer, pushing language models to 1.6 trillion parameters, roughly nine times the 175 billion of GPT-3.
The trick is called Mixture of Experts: instead of lighting up the whole network for every input word, a small router picks one specialized "expert" sub-network to process it. It works like a big consulting firm where each question gets routed to the right consultant, rather than putting everyone to work at once.
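The routing idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the names `route_top1` and `switch_layer`, the sizes, and the "experts" (simple scaling functions standing in for feed-forward networks) are all made up, and real details such as expert capacity limits and the load-balancing loss are left out.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(token, router_weights):
    """Return (expert_index, gate_probability) for one token vector."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def switch_layer(tokens, router_weights, experts):
    """Send each token to exactly one expert and scale the result by its gate."""
    outputs = []
    for token in tokens:
        idx, gate = route_top1(token, router_weights)
        expert_out = experts[idx](token)  # only the chosen expert runs
        outputs.append([gate * y for y in expert_out])
    return outputs

# Toy setup (hypothetical sizes): 3 experts, 4-dimensional tokens.
random.seed(0)
DIM, NUM_EXPERTS = 4, 3
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]
# Stand-in experts: each one just scales its input by a different constant.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0)]
tokens = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]

outputs = switch_layer(tokens, router_weights, experts)
```

Each token triggers exactly one expert call, so adding more experts grows the parameter count without adding work per token beyond the slightly larger router.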
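The payoff: huge models that need far less compute per token than a dense model of the same size, because only a small slice of the network is active for each token. All the parameters still have to be held in memory, but the result shows that you can grow beyond "dense" without compute costs exploding.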
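A rough back-of-the-envelope calculation shows why the active slice stays small. The figures below are invented for illustration and are not the paper's configuration; the helper `active_params` is hypothetical and models a toy network with one expert layer plus some shared (always-on) parameters.

```python
def active_params(shared_params, params_per_expert, num_experts, experts_per_token=1):
    """Return (total, active) parameter counts for a toy one-layer sparse model."""
    total = shared_params + params_per_expert * num_experts
    active = shared_params + params_per_expert * experts_per_token
    return total, active

# Made-up sizes: 1M shared parameters, 64 experts of 500k parameters each.
total, active = active_params(
    shared_params=1_000_000,
    params_per_expert=500_000,
    num_experts=64,
)
fraction = active / total  # share of the model that a single token touches
print(f"total={total:,} active={active:,} fraction={fraction:.1%}")
```

With these numbers a token touches about 4.5% of the parameters. Doubling the number of experts would nearly double the total while leaving the active count unchanged.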
Tools
Switch Transformer