PAPER 29 · Bonus · distributed MoE

GShard

Lepikhin et al., 2020 · Paper

Scaled giant multilingual models with conditional computation and automatic sharding.

Core concept

GShard trains enormous sparse models by combining expert routing with automatic sharding across many accelerators.

Why it mattered

It shows that MoE at scale is inseparable from distributed systems.

Visual shortcut · Sparse model across many devices
one model spread across many machines

GShard is MoE plus distributed systems: the routing only works if the hardware layout works.

How it works
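Replace every other feed-forward layer with a sparse expert layer.
Shard the experts across accelerators and replicate the rest.
Route each token to its top-2 experts, local or remote.
Exchange tokens between devices with all-to-all communication.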
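The routing step can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes plain top-2 selection with a fixed per-expert capacity and omits the auxiliary load-balancing loss and random second-expert dispatch. The names `route_top2`, `expert_to_device`, and `capacity` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top2(logits, num_experts, capacity):
    """Send each token to at most two experts, respecting expert capacity.

    logits: one list of router scores (length num_experts) per token.
    Returns (assignments, dropped), where assignments holds
    (token_index, expert_index, gate_weight) tuples and dropped lists the
    tokens that found no expert with free capacity.
    """
    load = [0] * num_experts
    assignments, dropped = [], []
    for t, row in enumerate(logits):
        probs = softmax(row)
        top2 = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:2]
        norm = sum(probs[e] for e in top2)  # renormalise over the chosen pair
        placed = False
        for e in top2:
            if load[e] < capacity:  # overflowing tokens skip a full expert
                load[e] += 1
                assignments.append((t, e, probs[e] / norm))
                placed = True
        if not placed:
            dropped.append(t)
    return assignments, dropped


def expert_to_device(expert, num_experts, num_devices):
    """Assumed layout: contiguous blocks of experts per device."""
    return expert * num_devices // num_experts


# Three tokens that all prefer experts 0 and 1, with room for two tokens each.
logits = [[2.0, 1.0, 0.0, -1.0]] * 3
assignments, dropped = route_top2(logits, num_experts=4, capacity=2)
for token, expert, weight in assignments:
    device = expert_to_device(expert, num_experts=4, num_devices=2)
    print(f"token {token} -> expert {expert} on device {device}, gate {weight:.3f}")
print("dropped tokens:", dropped)
```

The third token is dropped because both of its preferred experts are already full, which is why balanced routing matters as much as the expert count.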

The quick digest

GShard is what happens when MoE meets serious infrastructure. The model uses conditional computation so only some experts activate, but those experts and tokens must be spread across many devices efficiently.

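The paper’s key idea is not just sparse experts. It is automatic sharding: the developer annotates how a few key tensors should be partitioned, and a compiler extension to XLA splits the rest of the computation and data across accelerator devices, so giant multilingual models become trainable.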
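To make the sharding idea concrete, here is a toy sketch of the two basic placements a partitioner chooses between: splitting a tensor along one dimension, or replicating it on every device. It uses nested Python lists instead of real tensors, and the helper names `split_rows` and `replicate` are hypothetical, not GShard's API.

```python
def replicate(tensor, num_devices):
    """Every device holds a full copy of the tensor."""
    return [tensor for _ in range(num_devices)]


def split_rows(tensor, num_devices):
    """Split the leading dimension evenly; device d holds one contiguous slice."""
    n = len(tensor)
    if n % num_devices != 0:
        raise ValueError("leading dimension must divide evenly across devices")
    per_device = n // num_devices
    return [tensor[d * per_device:(d + 1) * per_device] for d in range(num_devices)]


# Toy model: 8 experts with 3 parameters each, plus a small router matrix.
expert_weights = [[float(e)] * 3 for e in range(8)]
router_weights = [[0.1] * 8 for _ in range(3)]

# Assumed placement: experts are split along the expert dimension,
# while the small router matrix is replicated everywhere.
expert_shards = split_rows(expert_weights, num_devices=4)
router_copies = replicate(router_weights, num_devices=4)

for d, shard in enumerate(expert_shards):
    print(f"device {d}: {len(shard)} experts, router rows {len(router_copies[d])}")
```

Splitting along the expert dimension keeps per-device memory roughly constant as experts are added, provided the device count grows with them.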

Read it as a reminder that model architecture and hardware topology have to be designed together. Sparse models only win if routing, placement, communication, and compilation all cooperate.

What to remember

One-liner
Huge MoE requires distributed systems.
Why it matters
Sharding makes enormous sparse models executable.
Builder instinct
Design architecture and hardware topology together.

Read it like this

Build instinct

Diagram how tokens, experts, and devices communicate in a sparse layer; estimate where bottlenecks appear.
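A back-of-envelope estimate is a reasonable place to start. The sketch below compares all-to-all traffic against expert compute for one sparse layer on one device. It assumes uniform routing, balanced load, a two-matmul feed-forward expert, and made-up hardware numbers; `sparse_layer_estimate` and all of its default values are hypothetical.

```python
def sparse_layer_estimate(tokens_per_device, d_model, d_ff, num_devices,
                          top_k=2, bytes_per_elem=2,
                          link_bytes_per_s=100e9, flops_per_s=100e12):
    """Rough per-device cost of one sparse expert layer.

    Assumes tokens are routed uniformly, so (num_devices - 1) / num_devices of
    token copies leave the device, once to dispatch and once to combine.
    """
    remote_fraction = (num_devices - 1) / num_devices
    # Factor 2: tokens travel to the expert and results travel back.
    comm_bytes = 2 * tokens_per_device * top_k * d_model * bytes_per_elem * remote_fraction
    # Two matmuls per expert (d_model x d_ff and d_ff x d_model),
    # counted at 2 FLOPs per multiply-add.
    flops = tokens_per_device * top_k * 4 * d_model * d_ff
    comm_s = comm_bytes / link_bytes_per_s
    compute_s = flops / flops_per_s
    return {
        "comm_bytes": comm_bytes,
        "flops": flops,
        "comm_s": comm_s,
        "compute_s": compute_s,
        "bottleneck": "communication" if comm_s > compute_s else "compute",
    }


estimate = sparse_layer_estimate(tokens_per_device=1024, d_model=1024,
                                 d_ff=4096, num_devices=4)
for key, value in estimate.items():
    print(key, value)
```

Shrinking `d_ff` or `link_bytes_per_s` in this model shifts the bottleneck toward communication, which is the trade-off the diagram exercise is meant to surface.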
