GShard
Scaled giant multilingual models with conditional computation and automatic sharding.
GShard trains enormous sparse models by combining expert routing with automatic sharding across many accelerators.
It shows that MoE at scale is inseparable from distributed systems.
GShard is MoE plus distributed systems: the routing only works if the hardware layout works.
The quick digest
GShard is what happens when MoE meets large-scale infrastructure. The model uses conditional computation, so each token activates only a small subset of experts (two, under the paper's top-2 gating), but those experts and the tokens routed to them must be spread efficiently across many devices.
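To make conditional computation concrete, here is a minimal pure-Python sketch of top-2 routing with a per-expert capacity limit. The names `softmax`, `top2_route`, and `capacity` are illustrative assumptions, not the paper's code, and the sketch leaves out the auxiliary load-balancing loss and the randomized dispatch to the second expert that the paper's gating also uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(token_logits, capacity):
    """Route each token to at most two experts (hypothetical helper).

    token_logits: one list of gate logits per token, one logit per expert.
    capacity: maximum number of token slots each expert accepts.
    Returns (assignments, load): assignments[t] is a list of
    (expert_index, gate_weight) pairs, load[e] is the slots used by expert e.
    """
    num_experts = len(token_logits[0])
    load = [0] * num_experts
    assignments = []
    for logits in token_logits:
        probs = softmax(logits)
        # Pick the two experts with the highest gate probability.
        top2 = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:2]
        # Renormalize so the two gate weights sum to 1.
        norm = sum(probs[e] for e in top2)
        chosen = []
        for e in top2:
            # An expert that is already full rejects the token (overflow).
            if load[e] < capacity:
                load[e] += 1
                chosen.append((e, probs[e] / norm))
        assignments.append(chosen)
    return assignments, load


if __name__ == "__main__":
    logits = [
        [2.0, 1.0, 0.0, -1.0],
        [2.0, 0.5, 1.0, -1.0],
        [3.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 2.0, 1.0],
    ]
    assignments, load = top2_route(logits, capacity=2)
    print(assignments)
    print(load)
```

In this toy run the third token prefers an expert that is already full, so it keeps only its second choice. Capacity limits bound the work per expert, and the price is that some tokens skip an expert they would otherwise have used.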
The paper’s key idea is not just sparse experts. It is automatic sharding: the developer annotates how a few key tensors should be partitioned, and a compiler extension splits the rest of the computation and data across accelerator pods so giant multilingual models become trainable.
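The layout that sharding has to produce can be pictured with a small hand-written sketch. This is not GShard's annotation API or compiler pass; it only illustrates one possible result, assuming experts are placed in contiguous blocks on devices and that the number of experts divides evenly by the number of devices. `expert_device` and `dispatch_plan` are hypothetical helpers.

```python
def expert_device(expert, num_experts, num_devices):
    """Hypothetical placement: contiguous blocks of experts per device."""
    experts_per_device = num_experts // num_devices
    return expert // experts_per_device


def dispatch_plan(token_device, token_expert, num_experts, num_devices):
    """Count how many tokens each device sends to each other device.

    token_device[t]: device that holds token t before the sparse layer.
    token_expert[t]: expert chosen for token t by the router.
    Returns sends, where sends[src][dst] is a token count.
    """
    sends = [[0] * num_devices for _ in range(num_devices)]
    for src, expert in zip(token_device, token_expert):
        dst = expert_device(expert, num_experts, num_devices)
        sends[src][dst] += 1
    return sends


def cross_device_tokens(sends):
    """Tokens that must travel over the interconnect (off-diagonal entries)."""
    return sum(
        count
        for src, row in enumerate(sends)
        for dst, count in enumerate(row)
        if src != dst
    )


if __name__ == "__main__":
    sends = dispatch_plan(
        token_device=[0, 0, 0, 1, 1, 1],
        token_expert=[0, 2, 3, 1, 1, 2],
        num_experts=4,
        num_devices=2,
    )
    print(sends)
    print(cross_device_tokens(sends))
```

The off-diagonal entries are the tokens that every device exchanges with every other device, which is why the dispatch step behaves like an all-to-all exchange rather than a local operation.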
Read it as a reminder that model architecture and hardware topology co-design each other. Sparse models only win if routing, placement, communication, and compilation all cooperate.
What to remember
Read it like this
- First pass: Read the sharding motivation.
- Second pass: Connect MoE routing to device placement.
- Then build taste: Compare with later serving-focused MoE systems.
Diagram how tokens, experts, and devices communicate in a sparse layer; estimate where bottlenecks appear.
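One hedged way to start the bottleneck estimate is a back-of-envelope comparison of communication time and expert compute time per device. The sketch below assumes perfectly balanced routing, a two-matmul feed-forward expert, and the worst case where every routed token copy crosses the interconnect. The function name and all hardware numbers are hypothetical placeholders, not figures from the paper.

```python
def sparse_layer_estimate(tokens_per_device, model_dim, hidden_dim,
                          bytes_per_value, link_bytes_per_s, flops_per_s,
                          experts_per_token=2):
    """Rough per-device time for one sparse layer (hypothetical model).

    Returns (comm_seconds, compute_seconds).
    """
    copies = tokens_per_device * experts_per_token
    # Each token copy is sent to its expert and the result is sent back.
    comm_bytes = 2 * copies * model_dim * bytes_per_value
    # Two matmuls per expert, each about 2 * model_dim * hidden_dim FLOPs.
    flops = 4 * copies * model_dim * hidden_dim
    return comm_bytes / link_bytes_per_s, flops / flops_per_s


if __name__ == "__main__":
    comm_s, compute_s = sparse_layer_estimate(
        tokens_per_device=1024,
        model_dim=1024,
        hidden_dim=4096,
        bytes_per_value=2,        # e.g. 16-bit activations
        link_bytes_per_s=1e11,    # placeholder interconnect bandwidth
        flops_per_s=1e14,         # placeholder accelerator throughput
    )
    print(f"communication: {comm_s:.2e} s, compute: {compute_s:.2e} s")
```

Under these assumptions the ratio of compute to communication time reduces to 2 * hidden_dim * link_bytes_per_s / (bytes_per_value * flops_per_s), so a smaller expert hidden size or a slower interconnect pushes the layer toward being communication-bound.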