PAPER 18 · MoE routing

Switch Transformers

Fedus et al., 2021 · Paper

Simplifies MoE routing by activating a single expert per token, which helps stabilize training of very large sparse models.

Core concept

Switch Transformers make MoE simpler by sending each token to one expert instead of several.

Why it mattered

That simplification made sparse models easier to train and scale.

Visual shortcut · One expert is enough

Switch makes MoE more scalable by choosing one expert per token and managing the traffic carefully.

How it works

Score experts for each token.

Pick the top expert only.

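Enforce a per-expert capacity: tokens that overflow are dropped from that expert (they pass through on the residual connection) or rerouted.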

Scale by adding experts: the parameter count grows while compute per token stays roughly constant.
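The steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: the function name `switch_route`, the list-of-lists input, and the first-come-first-served overflow rule are assumptions made for the example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token_logits, capacity):
    """Top-1 routing with a per-expert capacity (hypothetical helper).

    token_logits: one list of router logits per token (one logit per expert).
    Returns (assignments, gates). assignments[i] is the chosen expert index,
    or None if that expert was already full. gates[i] is the router
    probability of the chosen expert, used to scale the expert output.
    """
    num_experts = len(token_logits[0])
    load = [0] * num_experts
    assignments, gates = [], []
    for logits in token_logits:
        probs = softmax(logits)                       # score experts
        expert = max(range(num_experts), key=probs.__getitem__)  # top-1 only
        gates.append(probs[expert])
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append(expert)
        else:
            assignments.append(None)                  # overflow: token is dropped
    return assignments, gates

logits = [
    [2.0, 0.1, 0.1],
    [1.5, 0.2, 0.0],
    [0.1, 0.3, 2.2],
    [3.0, 0.0, 0.5],
]
assignments, gates = switch_route(logits, capacity=2)
print(assignments)  # [0, 0, 2, None]
```

Three of the four tokens prefer expert 0, but its capacity is 2, so the last one overflows. In a real layer that token would skip the expert and continue through the residual path.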

The quick digest

Earlier MoE systems often routed each token to multiple experts (top-k with k ≥ 2). Switch asks: what if each token goes to just one? It sounds like a small change, but it cuts cross-device communication, reduces the router's work, and makes the system easier to scale.

The model still needs a router, per-expert capacity limits, and an auxiliary load-balancing loss so that no expert gets overloaded. But the top-1 choice removes a lot of complexity from the sparse layer.
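To make the balancing loss concrete, here is a minimal sketch of the auxiliary loss in the form the paper describes: alpha × N × Σ f_i · P_i, where f_i is the fraction of tokens whose top-1 choice is expert i and P_i is the mean router probability assigned to expert i. The function name and the toy inputs are assumptions; alpha = 0.01 is used as a small illustrative coefficient.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary balancing loss: alpha * N * sum_i(f_i * P_i).

    router_probs: one probability vector per token (each sums to 1).
    f_i: fraction of tokens whose argmax is expert i.
    P_i: average router probability given to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts
    P = [0.0] * num_experts
    for probs in router_probs:
        top = max(range(num_experts), key=probs.__getitem__)
        f[top] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            P[i] += p / num_tokens
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Toy router outputs for 4 tokens and 2 experts (made-up numbers).
balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1]] * 4  # every token prefers expert 0

print(f"{load_balancing_loss(balanced):.3f}")   # 0.010
print(f"{load_balancing_loss(collapsed):.3f}")  # 0.018
```

The loss is smallest when tokens spread evenly across experts and grows as routing collapses onto one expert, which is the pressure that keeps experts from being overloaded.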

The lesson is engineering taste: sometimes the scalable version of an idea is the less clever version. Switch helped make MoE feel like a practical scaling path rather than a delicate research trick.

What to remember

One-liner

One expert per token can be enough.

Why it matters

Simplification made MoE more scalable.

Builder instinct

Sparse models tend to fail through routing imbalance and capacity overflow, so check those first.

Read it like this

First pass: Compare top-1 routing with earlier MoE designs.
Second pass: Look at capacity factor and dropped tokens.
Then build taste: Ask how this maps to serving sparse models.
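For the second pass, it helps to have the capacity formula in hand. The paper defines expert capacity as the even share of tokens per expert, scaled by the capacity factor. The numbers in the worked line (1024 tokens, 8 experts, factor 1.25) are illustrative assumptions, not values from the paper.

```latex
\text{expert capacity}
  = \frac{\text{tokens per batch}}{\text{number of experts}}
    \times \text{capacity factor}

% Illustrative numbers only:
\frac{1024}{8} \times 1.25 = 160 \ \text{tokens per expert}
```

A factor of 1.0 leaves no slack, so any imbalance drops tokens. A larger factor drops fewer tokens but spends more memory and compute on padding.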

Build instinct

Simulate token routing with capacity limits and show what happens when too many tokens choose one expert.
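One way to start on this exercise is a small stdlib-only simulation. Everything here is an assumption made for illustration: the function name `simulate`, the `hot_expert_bias` knob that makes expert 0 more popular, rounding capacity up with `math.ceil`, and the first-come-first-served drop rule. No learned router is involved; expert choices are just sampled.

```python
import math
import random

def simulate(num_tokens, num_experts, capacity_factor, hot_expert_bias, seed=0):
    """Sample top-1 expert choices and count tokens dropped by the capacity limit.

    hot_expert_bias: relative sampling weight of expert 0 (others have weight 1).
    Returns (capacity, load, dropped).
    """
    rng = random.Random(seed)
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    weights = [hot_expert_bias] + [1.0] * (num_experts - 1)
    choices = rng.choices(range(num_experts), weights=weights, k=num_tokens)

    load = [0] * num_experts
    dropped = 0
    for expert in choices:
        if load[expert] < capacity:
            load[expert] += 1      # token fits in this expert's buffer
        else:
            dropped += 1           # expert is full: token overflows
    return capacity, load, dropped

for cf in (1.0, 1.25, 2.0):
    capacity, load, dropped = simulate(1024, 8, cf, hot_expert_bias=4.0)
    print(f"capacity_factor={cf}: capacity={capacity}, "
          f"hot expert load={load[0]}, dropped={dropped} "
          f"({dropped / 1024:.1%})")
```

With the hot expert attracting far more than its even share, raising the capacity factor reduces drops but cannot remove them. That is the case the balancing loss is meant to prevent. Try setting `hot_expert_bias=1.0` to see how few tokens are dropped when routing is even.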
