PAPER 29 · Bonus · distributed MoE

GShard

Lepikhin et al., 2020 · Paper

Scaled giant multilingual models with conditional computation and automatic sharding.

Core concept

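GShard trains enormous sparse models by combining expert routing with automatic sharding across many accelerators. The largest model in the paper is a 600-billion-parameter multilingual translation Transformer trained on 2048 TPU v3 cores.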

Why it mattered

It shows that MoE at scale is inseparable from distributed systems.

Visual shortcut · Sparse model across many devices

GShard is MoE plus distributed systems: the routing only works if the hardware layout works.

How it works

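Replace some dense feed-forward layers with sparse Mixture-of-Experts layers, where a gate sends each token to at most two experts.

Shard the expert parameters across accelerators and replicate the remaining layers.

Route each token to its chosen experts, which often live on other devices.

Exchange tokens with all-to-all communication, and cap each expert's load so no single device becomes a straggler.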
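The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: it assumes top-2 gating with a fixed per-expert capacity, processes tokens in order, and drops any slot that overflows. The function names, the logits, and the two-device layout are all hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top2(gate_logits, capacity):
    """Send each token to its two highest-scoring experts, up to `capacity` per expert.

    Returns (assignments, dropped): assignments[e] is a list of
    (token_index, gate_weight) pairs, dropped counts slots lost to overflow.
    """
    num_experts = len(gate_logits[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = 0
    for t, logits in enumerate(gate_logits):
        probs = softmax(logits)
        top2 = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:2]
        norm = probs[top2[0]] + probs[top2[1]]  # renormalise over the two picks
        for e in top2:
            if len(assignments[e]) < capacity:
                assignments[e].append((t, probs[e] / norm))
            else:
                dropped += 1  # expert is full, so this slot is skipped
    return assignments, dropped

def count_cross_device(assignments, experts_per_device, tokens_per_device):
    """Count dispatched slots whose token and expert sit on different devices."""
    cross = 0
    for e, slots in enumerate(assignments):
        expert_device = e // experts_per_device
        for t, _ in slots:
            if t // tokens_per_device != expert_device:
                cross += 1
    return cross

# Hypothetical setup: 4 tokens, 4 experts, 2 devices (2 tokens and 2 experts each).
gate_logits = [
    [2.0, 0.0, 1.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
]
assignments, dropped = route_top2(gate_logits, capacity=2)
dispatched = sum(len(slots) for slots in assignments)
cross = count_cross_device(assignments, experts_per_device=2, tokens_per_device=2)
print(dispatched, dropped, cross)
```

In this toy layout two of the seven dispatched slots cross a device boundary and one slot is dropped because expert 0 is already full. At real scale those two quantities are what the communication and load-balancing machinery has to keep under control.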

The quick digest

GShard is what happens when MoE meets serious infrastructure. The model uses conditional computation so only some experts activate, but those experts and tokens must be spread across many devices efficiently.
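A back-of-the-envelope calculation shows why conditional computation is attractive. The sketch below assumes each expert is a two-matrix feed-forward block and that each token visits at most two experts; the layer sizes are hypothetical, and biases and the gating network are ignored.

```python
def moe_layer_params(num_experts, model_dim, hidden_dim, experts_per_token=2):
    """Return (total, active) weight counts for one sparse expert layer."""
    per_expert = 2 * model_dim * hidden_dim  # input and output projections
    total = num_experts * per_expert
    active = min(experts_per_token, num_experts) * per_expert
    return total, active

# Hypothetical sizes chosen for round numbers, not taken from the paper.
total, active = moe_layer_params(num_experts=128, model_dim=1024, hidden_dim=4096)
print(f"total weights:  {total:,}")
print(f"active weights: {active:,}")
print(f"ratio:          {total // active}x")
```

Under these assumptions the layer stores 64 times more weights than any single token touches. That gap is the appeal of sparse experts, and it is also the reason the weights have to be spread over many devices.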

The paper’s key idea is not just sparse experts. It is automatic sharding: the user annotates a few key tensors with how they should be split, and the compiler partitions the rest of the computation across accelerator pods so giant multilingual models become trainable.
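To make the sharding idea concrete, here is a minimal sketch of what an annotation implies for per-device memory. It is not GShard's API: `local_shape` is a hypothetical helper, and the tensor shapes are made up. It only models two choices, splitting a tensor evenly along one axis or replicating it on every device.

```python
def local_shape(global_shape, split_dim=None, num_partitions=1):
    """Per-device shape of a tensor that is replicated or split along one axis."""
    if split_dim is None:
        return tuple(global_shape)  # replicated: every device holds a full copy
    if global_shape[split_dim] % num_partitions != 0:
        raise ValueError("dimension must divide evenly across partitions")
    shape = list(global_shape)
    shape[split_dim] //= num_partitions
    return tuple(shape)

def num_elements(shape):
    n = 1
    for d in shape:
        n *= d
    return n

num_devices = 8
expert_weights = (8, 1024, 4096)  # hypothetical [experts, model_dim, hidden_dim]
attention_weights = (1024, 1024)  # hypothetical dense layer, kept replicated

expert_local = local_shape(expert_weights, split_dim=0, num_partitions=num_devices)
attention_local = local_shape(attention_weights)

print("expert shard per device:   ", expert_local, num_elements(expert_local))
print("attention copy per device: ", attention_local, num_elements(attention_local))
```

Splitting along the expert axis means each device stores one expert here, so adding devices adds capacity without growing per-device memory. The replicated layer costs the same on every device regardless of how many there are.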

Read it as a reminder that model architecture and hardware topology have to be designed together. Sparse models only win if routing, placement, communication, and compilation all cooperate.

What to remember

One-liner

Huge MoE requires distributed systems.

Why it matters

Sharding makes enormous sparse models executable.

Builder instinct

Design the architecture and the hardware topology together.

Read it like this

First pass: Read the sharding motivation.
Second pass: Connect MoE routing to device placement.
Then build taste: Compare with later serving-focused MoE systems.

Build instinct

Diagram how tokens, experts, and devices communicate in a sparse layer; estimate where bottlenecks appear.
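One way to start the estimate is to compare the bytes a device must exchange with the arithmetic its experts perform. The sketch below is a rough model under stated assumptions: routing is uniform, experts are spread evenly over devices, each token visits two experts, tokens travel once to the expert and once back, and each expert is a two-matrix feed-forward block. The bandwidth and throughput figures are placeholders, not measurements of any real hardware.

```python
def all_to_all_bytes(tokens_per_device, model_dim, num_devices,
                     experts_per_token=2, bytes_per_element=2):
    """Bytes one device sends per sparse layer (dispatch plus combine)."""
    slots = tokens_per_device * experts_per_token
    # Under uniform routing, (D - 1) / D of the slots target another device.
    return 2 * slots * model_dim * bytes_per_element * (num_devices - 1) / num_devices

def expert_flops(tokens_per_device, model_dim, hidden_dim, experts_per_token=2):
    """FLOPs one device spends in its experts, assuming balanced load."""
    slots = tokens_per_device * experts_per_token
    # Two matrix multiplies per slot, 2 FLOPs per multiply-accumulate.
    return slots * 2 * 2 * model_dim * hidden_dim

# Hypothetical workload and hardware numbers.
comm = all_to_all_bytes(tokens_per_device=1024, model_dim=1024, num_devices=8)
flops = expert_flops(tokens_per_device=1024, model_dim=1024, hidden_dim=4096)
link_bytes_per_s = 100e9   # placeholder interconnect bandwidth
peak_flops_per_s = 100e12  # placeholder accelerator throughput

comm_time = comm / link_bytes_per_s
compute_time = flops / peak_flops_per_s
print(f"communication: {comm:.0f} bytes, {comm_time:.2e} s")
print(f"expert compute: {flops} FLOPs, {compute_time:.2e} s")
print("bottleneck:", "communication" if comm_time > compute_time else "compute")
```

Varying `hidden_dim`, `num_devices`, or the placeholder bandwidth shows how quickly the balance can tip: communication grows with the model dimension, while expert compute grows with the model dimension times the hidden dimension.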
