PAPER 18 · MoE routing

Switch Transformers

Fedus et al. 2021 Paper

Simplified MoE routing using single-expert activation, helping stabilize giant sparse models.

Core concept

Switch Transformers make MoE simpler by sending each token to one expert instead of several.

Why it mattered

That simplification made sparse models easier to train and scale.

Visual shortcut · One expert is enough
[Diagram: a router sends the token to one active expert while the other experts stay idle.]

Switch makes MoE more scalable by choosing one expert per token and managing the traffic carefully.

How it works
Score experts for each token.
Pick the top expert only.
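Drop (or reroute) tokens once an expert's capacity is exceeded; dropped tokens skip the expert and continue through the residual connection.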
Use simplicity to scale sparse models.
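
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the names switch_route and expert_capacity are hypothetical, tokens are processed in order, and the capacity is rounded up, which is an assumption. The capacity rule itself (tokens per batch divided by number of experts, times a capacity factor) follows the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expert_capacity(num_tokens, num_experts, capacity_factor=1.0):
    # capacity = (tokens per batch / number of experts) * capacity factor
    # Rounding up is an assumption made for this sketch.
    return math.ceil(num_tokens / num_experts * capacity_factor)

def switch_route(router_logits, capacity_factor=1.0):
    """Top-1 routing with a per-expert capacity limit.

    Returns (assignments, gates). assignments[t] is the chosen expert
    index, or None if the token was dropped because the expert was full.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    capacity = expert_capacity(num_tokens, num_experts, capacity_factor)
    load = [0] * num_experts
    assignments, gates = [], []
    for logits in router_logits:
        probs = softmax(logits)                                   # score experts
        expert = max(range(num_experts), key=probs.__getitem__)   # top-1 only
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append(expert)
            gates.append(probs[expert])   # gate value scales the expert output
        else:
            assignments.append(None)      # dropped: token skips the expert
            gates.append(0.0)
    return assignments, gates

# Four tokens, two experts; three tokens prefer expert 0.
logits = [[2.0, 0.0], [1.5, 0.5], [3.0, 0.0], [0.0, 1.0]]
print(switch_route(logits, capacity_factor=1.0)[0])  # [0, 0, None, 1]
print(switch_route(logits, capacity_factor=2.0)[0])  # [0, 0, 0, 1]
```

With a capacity factor of 1.0 each expert holds two tokens, so the third token that picks expert 0 is dropped. Doubling the factor leaves room for it.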

The quick digest

Earlier MoE systems typically routed each token to its top-k experts with k > 1. Switch asks: what if each token goes to just one? That sounds like a small change, but it reduces communication between devices, cuts the router's computation, and makes the system easier to scale.

The model still needs a router, expert capacity limits, and balancing losses so experts do not get overloaded. But the top-1 choice removes a lot of complexity from the sparse layer.
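
The balancing loss mentioned above can be sketched as follows. The paper's auxiliary loss is alpha * N * sum_i(f_i * P_i), where N is the number of experts, f_i is the fraction of tokens whose top choice is expert i, and P_i is the mean router probability assigned to expert i. The function name load_balancing_loss and the list-of-lists input are assumptions for illustration; alpha = 0.01 is the value the paper reports using.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary loss that is smallest when routing is uniform.

    router_probs[t][i] is the router probability of expert i for token t.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f[i]: fraction of tokens whose top-1 choice is expert i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=probs.__getitem__)] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: average router probability given to expert i.
    P = [sum(p[i] for p in router_probs) / num_tokens
         for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

balanced = load_balancing_loss([[0.6, 0.4], [0.4, 0.6]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced < collapsed)  # True: sending everything to one expert costs more
```

In a real model only P carries gradient, since f comes from an argmax. This sketch computes the value only.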

The lesson is engineering taste: sometimes the scalable version of an idea is the less clever version. Switch helped make MoE feel like a practical scaling path rather than a delicate research trick.

What to remember

One-liner
One expert per token can be enough.
Why it matters
Simplification made MoE more scalable.
Builder instinct
Sparse models tend to fail at routing and capacity: overloaded experts drop tokens while idle experts waste parameters.

Read it like this

Build instinct

Simulate token routing with capacity limits and show what happens when too many tokens choose one expert.
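
One possible starting point for that exercise, as a hedged sketch: the router is abstracted away and each token simply arrives with a preferred expert. The simulate function, the 70/10/10/10 preference split, and the rounding-up of capacity are all assumptions chosen for illustration.

```python
import math
import random

def simulate(choices, num_experts, capacity_factor):
    """Return (load per expert, number of dropped tokens).

    choices[t] is the expert that token t prefers (its top-1 pick).
    """
    capacity = math.ceil(len(choices) / num_experts * capacity_factor)
    load = [0] * num_experts
    dropped = 0
    for expert in choices:
        if load[expert] < capacity:
            load[expert] += 1
        else:
            dropped += 1  # expert is full, token overflows
    return load, dropped

# Hypothetical skewed router: about 70% of tokens prefer expert 0.
rng = random.Random(0)
choices = rng.choices(range(4), weights=[0.7, 0.1, 0.1, 0.1], k=1000)

for cf in (1.0, 1.5, 2.0, 3.0):
    load, dropped = simulate(choices, num_experts=4, capacity_factor=cf)
    print(f"capacity_factor={cf}: load={load}, "
          f"dropped={dropped / len(choices):.1%}")
```

Raising the capacity factor reduces drops, but every expert then reserves more slots, most of which sit empty on the unpopular experts. That trade-off is why balanced routing matters more than a generous buffer.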
