Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
The modern MoE ignition point: conditional computation at scale.
A Mixture-of-Experts model has many specialist sub-networks, but only uses a few for each input.
It offers a way to increase total model capacity without paying full compute on every token.
MoE scales capacity by activating specialists selectively instead of running the whole model every time.
The quick digest
The basic idea is intuitive: instead of one giant generalist doing all the work, train many experts and a gate that decides which experts should handle each input. Only the selected experts run, so the model can be huge in total but sparse in use.
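As a rough illustration of "huge in total but sparse in use", here is a minimal sketch of top-k gating in plain Python. It assumes the gate has already produced one logit per expert and leaves out the tunable noise the paper adds before selecting the top k. The names `top_k_gate` and `moe_output`, the four toy experts, and the logits are all made up for this example.

```python
import math

def top_k_gate(logits, k):
    """Sparse gate weights: softmax over the k largest logits, 0 elsewhere."""
    # Indices of the k largest logits (ties broken by lower index).
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    # Softmax restricted to the selected experts; subtract the max for stability.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    gates = [0.0] * len(logits)
    for i in top:
        gates[i] = exps[i] / total
    return gates

def moe_output(x, experts, gate_logits, k):
    """Weighted sum of the selected experts' outputs; experts with zero gate weight never run."""
    gates = top_k_gate(gate_logits, k)
    out = 0.0
    for i, g in enumerate(gates):
        if g > 0.0:
            out += g * experts[i](x)
    return out, gates

# Four hypothetical "experts" (plain functions) and hand-picked gate logits.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
y, gates = moe_output(3.0, experts, gate_logits=[0.1, 2.0, 2.0, -1.0], k=2)
print(gates, y)
```

With k=2, only two of the four experts are evaluated for this input, while all four still count toward total capacity.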
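This paper makes that idea work in deep neural networks at scale, with up to thousands of experts in a single layer. The hard part is making sure the gate does not send everything to the same expert. If routing collapses, you lose the benefit of specialization and overload part of the system, so the paper adds auxiliary losses that encourage balanced expert usage.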
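One way to penalize routing collapse, in the spirit of the paper's importance loss, is to sum each expert's gate weights over a batch and take the squared coefficient of variation of those sums. The sketch below assumes population variance, a small epsilon for numerical safety, and an arbitrary loss weight of 0.01. The helper names and both example gate matrices are illustrative.

```python
def cv_squared(values):
    """Squared coefficient of variation: variance / mean**2."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance (assumption)
    return var / (mean ** 2 + 1e-10)  # epsilon avoids division by zero

def importance_loss(gate_matrix, weight=0.01):
    """gate_matrix[b][e] is the gate weight of expert e for example b."""
    num_experts = len(gate_matrix[0])
    # Importance of an expert = sum of its gate weights over the batch.
    importance = [sum(row[e] for row in gate_matrix) for e in range(num_experts)]
    return weight * cv_squared(importance)

# Balanced routing: each of two experts gets half of the batch.
balanced = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
# Collapsed routing: every example goes to expert 0.
collapsed = [[1.0, 0.0]] * 4

print(importance_loss(balanced), importance_loss(collapsed))
```

The balanced batch gives a loss of zero, while the collapsed batch gives a positive penalty, which is the signal that pushes the gate back toward spreading load.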
Modern MoE models like Switch, GLaM, Mixtral, and Qwen MoE all inherit this tradeoff: more capacity through specialization, paid for with routing complexity.
What to remember
Read it like this
- First pass: Understand conditional computation.
- Second pass: Read the load-balancing losses.
- Then build taste: Connect it to Switch, GLaM, Mixtral, and Qwen MoE variants.
Create a toy MoE layer with two experts and inspect which examples each expert receives.
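A possible starting point for this exercise, as a sketch only: two linear experts, a linear gate, and top-1 routing on 2-D inputs. All weights are hand-picked rather than trained, and top-1 routing is a simplification chosen to make the assignments easy to read. The helper names and the example batch are hypothetical.

```python
def linear(weights, bias):
    """Build a tiny linear map from a 2-D input to a scalar."""
    return lambda x: weights[0] * x[0] + weights[1] * x[1] + bias

# Two experts and one gate score per expert (hand-picked, untrained weights).
experts = [linear([1.0, 0.0], 0.0), linear([0.0, 1.0], 0.0)]
gate_scores = [linear([1.0, -1.0], 0.0), linear([-1.0, 1.0], 0.0)]

def route(x):
    """Top-1 routing: index of the expert with the highest gate score."""
    scores = [g(x) for g in gate_scores]
    return max(range(len(scores)), key=lambda i: scores[i])

def moe_layer(batch):
    """Run each example through its chosen expert only and record the assignment."""
    outputs = []
    assignments = {i: [] for i in range(len(experts))}
    for x in batch:
        e = route(x)
        assignments[e].append(x)
        outputs.append(experts[e](x))
    return outputs, assignments

batch = [(2.0, 0.5), (0.1, 3.0), (1.5, 1.0), (0.0, 2.0), (4.0, 1.0)]
outputs, assignments = moe_layer(batch)
for e, xs in assignments.items():
    print(f"expert {e} received {len(xs)} examples: {xs}")
```

Here the gate sends inputs with a larger first coordinate to expert 0 and the rest to expert 1. A natural next step is to replace the fixed weights with trained ones and watch whether the split stays balanced.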