PAPER 17 · MoE origin

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Shazeer et al. 2017 Paper

The modern MoE ignition point: conditional computation at scale.

Core concept

A Mixture-of-Experts model has many specialist sub-networks, but only uses a few for each input.

Why it mattered

It offers a way to increase total model capacity without paying full compute on every token.

Visual shortcut · Specialists with a router

MoE scales capacity by activating specialists selectively instead of running the whole model every time.

How it works

Train many expert networks.

Train a gate to route inputs.

Activate only a small number of experts.

Balance traffic so experts do not collapse.
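The four steps above can be sketched as a forward pass. This is a minimal illustration, not the paper's implementation: the experts are hypothetical scalar functions standing in for sub-networks, the gate logits are passed in directly rather than computed by a learned layer, and the noise term the paper adds to its gating is omitted. The names `top_k_gate` and `moe_forward` are made up for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest logits and renormalize them with a softmax.

    Returns {expert_index: weight}; the weights sum to 1 and every
    expert that is not selected gets no weight at all.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def moe_forward(x, gate_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())

# Hypothetical experts: simple functions standing in for sub-networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]

# Assumed gate logits for one input; a real gate would compute these from x.
gate_logits = [0.1, 2.0, -1.0, 2.0]
gates = top_k_gate(gate_logits, k=2)
output = moe_forward(3.0, gate_logits, experts, k=2)
```

With four experts and k=2, only two expert functions are ever called for this input. That is the sparsity the digest describes: the cost per input depends on k, not on the number of experts.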

The quick digest

The basic idea is intuitive: instead of one giant generalist doing all the work, train many experts and a gate that decides which experts should handle each input. Only the selected experts run, so the model can be huge in total but sparse in use.

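This paper brings that idea to large neural networks, using a sparse gate that selects only the top few experts for each input. The hard part is making sure the gate does not send everything to the same expert. If routing collapses, a few experts receive nearly all the training signal while the rest sit idle, so you lose the added capacity and overload part of the system.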
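One way to put a number on routing collapse is the squared coefficient of variation of per-expert "importance" (the gate weight each expert receives summed over a batch), which the paper uses as an auxiliary loss term. The sketch below is a hedged illustration: the function names and the toy gate matrices are invented here, population variance is an assumption, and the loss scaling weight is left out.

```python
def importance(gate_rows):
    """Sum each expert's gate weight over a batch.

    gate_rows[n][e] is the gate weight example n assigns to expert e.
    """
    n_experts = len(gate_rows[0])
    return [sum(row[e] for row in gate_rows) for e in range(n_experts)]

def cv_squared(values):
    """Squared coefficient of variation: variance / mean^2.

    Uses population variance (an assumption for this sketch).
    Returns 0.0 for an all-zero input to avoid dividing by zero.
    """
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    var = sum((v - mean) ** 2 for v in values) / n
    return var / (mean * mean)

# Toy batches of two examples over four experts (made-up numbers).
balanced = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5]]
collapsed = [[1.0, 0.0, 0.0, 0.0],
             [1.0, 0.0, 0.0, 0.0]]

balanced_penalty = cv_squared(importance(balanced))
collapsed_penalty = cv_squared(importance(collapsed))
```

The balanced batch spreads weight evenly and scores zero. The collapsed batch sends everything to expert 0 and scores much higher, so adding this quantity to the training loss pushes the gate toward using all experts.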

Modern MoE models like Switch Transformer, GLaM, Mixtral, and Qwen MoE all inherit this tradeoff: more capacity through specialization, paid for with routing complexity.

What to remember

One-liner

Experts add capacity; gates decide who works.

Why it matters

Total parameters and active parameters are different.
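The gap between the two counts is simple arithmetic. The sketch below uses made-up sizes (8 experts of one million parameters each, 2 active per token, half a million shared parameters) and ignores the gate's own parameters. None of these figures come from the paper.

```python
def moe_param_counts(n_experts, params_per_expert, k, shared_params=0):
    """Return (total, active) parameter counts for one MoE model.

    total  counts every expert, whether or not it runs.
    active counts only the k experts selected for a single token.
    shared_params covers layers that always run (hypothetical figure).
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active

# Hypothetical sizes chosen only to make the ratio easy to see.
total, active = moe_param_counts(
    n_experts=8, params_per_expert=1_000_000, k=2, shared_params=500_000
)
print(f"total={total:,} active={active:,}")
```

Under these assumptions the model stores 8.5 million parameters but touches only 2.5 million per token. Doubling the expert count would raise the first number and leave the second unchanged.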

Builder instinct

Routing balance is the whole game.

Read it like this

First pass: Understand conditional computation.
Second pass: Read the load-balancing losses.
Then build taste: Connect it to Switch, GLaM, Mixtral, and Qwen MoE variants.

Build instinct

Create a toy MoE layer with two experts and inspect which examples each expert receives.
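A possible starting point for that exercise, before any training is involved: fix a tiny linear gate by hand, route each example to its top-scoring expert, and group the examples by where they landed. The gate weights, the inputs, and the helper names `route_top1` and `inspect_routing` are all assumptions made for this sketch.

```python
from collections import defaultdict

def route_top1(x, gate_weights, gate_biases):
    """Return the index of the expert with the highest gate logit for x."""
    logits = [w * x + b for w, b in zip(gate_weights, gate_biases)]
    return max(range(len(logits)), key=lambda i: logits[i])

def inspect_routing(examples, gate_weights, gate_biases):
    """Group examples by the expert the gate sends them to."""
    buckets = defaultdict(list)
    for x in examples:
        buckets[route_top1(x, gate_weights, gate_biases)].append(x)
    return dict(buckets)

# Hand-picked gate: expert 0 scores +x, expert 1 scores -x, so the
# sign of the input decides the route. A trained gate would learn
# its own weights instead.
examples = [-2.0, -0.5, 0.3, 1.0, 4.0]
routing = inspect_routing(examples, gate_weights=[1.0, -1.0], gate_biases=[0.0, 0.0])

for expert, xs in sorted(routing.items()):
    print(f"expert {expert}: {xs}")
```

Here expert 0 receives the positive inputs and expert 1 the negative ones. Once the gate is trained rather than hand-set, the same inspection shows whether the experts have split the data in a meaningful way or whether one expert is taking everything.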
