PAPER 20 · MoE from dense

Sparse Upcycling

Komatsuzaki et al. 2022 / ICLR 2023 Paper

A practical technique for converting dense checkpoints into MoE models and reusing compute.

Core concept

Sparse upcycling turns a trained dense model into an MoE model instead of training sparse from scratch.

Why it mattered

It treats expensive dense checkpoints as reusable assets for building larger sparse models.

Visual shortcut · Turn dense into sparse

Sparse upcycling treats an already-trained dense model as raw material for a larger MoE model.

How it works

Start with a dense model that already knows useful things.

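Copy the dense feed-forward (MLP) weights into each expert, so every expert starts as a replica of the original layer.

Add a newly initialized sparse router that sends each token to only a few experts.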

Continue training so experts specialize.
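The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the names `dense_mlp`, `upcycle`, and `moe_forward`, the ReLU activation, the top-2 routing, and the 0.02 router init scale are all assumptions made for the example.

```python
import copy
import math
import random


def dense_mlp(x, mlp):
    """Two-layer feed-forward block: W2 * relu(W1 * x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in mlp["w1"]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in mlp["w2"]]


def upcycle(mlp, num_experts, d_model, seed=0):
    """Copy the dense MLP into every expert and add a freshly initialized router."""
    rng = random.Random(seed)
    # Deep copies, so each expert can later drift away from the others.
    experts = [copy.deepcopy(mlp) for _ in range(num_experts)]
    # The router has no dense counterpart, so it starts from small random weights
    # (the 0.02 scale is an arbitrary choice for this sketch).
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(num_experts)]
    return experts, router


def moe_forward(x, experts, router, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the chosen experts only, so the gate weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exp = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exp.values())
    gates = [exp[i] / total for i in chosen]
    outputs = [dense_mlp(x, experts[i]) for i in chosen]
    return [sum(g * y[j] for g, y in zip(gates, outputs)) for j in range(len(outputs[0]))]


# Toy dense checkpoint: random weights stand in for a trained model.
rng = random.Random(1)
d_model, d_hidden = 4, 8
dense = {
    "w1": [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)],
    "w2": [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)],
}

experts, router = upcycle(dense, num_experts=4, d_model=d_model)
token = [0.5, -1.0, 0.25, 2.0]
dense_out = dense_mlp(token, dense)
moe_out = moe_forward(token, experts, router, top_k=2)
```

The experts are deep copies rather than shared references because continued training has to be able to move each one independently. Only the router starts from scratch, since nothing in the dense checkpoint corresponds to it.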

The quick digest

If you already spent a fortune training a dense model, starting over to build an MoE model is painful. Sparse upcycling asks whether you can copy pieces of the dense model into multiple experts, add routing, and continue training from there.

The intuition is like turning one general department into several specialist teams without hiring from zero. The experts inherit useful knowledge, then specialize during continued training.
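One way to make "inherit" concrete is to look at the layer's output right after conversion. The short derivation below is a sketch under assumptions that are not stated in the text above: every expert is an exact copy of the dense block, and the gate weights over the selected experts are renormalized to sum to 1. The symbols f, E_i, g_i, and S(x) are introduced here for illustration.

```latex
% f      = the dense feed-forward block
% E_i    = expert i, initialized as a copy of f
% S(x)   = the set of experts the router selects for token x
% g_i(x) = gate weight for expert i, renormalized over S(x)
\begin{align*}
y_{\text{MoE}}(x) &= \sum_{i \in S(x)} g_i(x)\, E_i(x) \\
                  &= \sum_{i \in S(x)} g_i(x)\, f(x)
                     && \text{since } E_i = f \text{ at initialization} \\
                  &= f(x)
                     && \text{since } \sum_{i \in S(x)} g_i(x) = 1
\end{align*}
```

Under those assumptions the converted layer reproduces the dense layer exactly at step zero, whatever the router picks. Differences between experts, and therefore any specialization, can only come from the continued training that follows.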

The paper matters because it is a practical path from dense to sparse. It says MoE is not only an architecture you choose before training; it can also be a continuation strategy for models you already have.

What to remember

One-liner

Dense checkpoints can become sparse models.

Why it matters

Experts inherit first, specialize later.

Builder instinct

MoE can be a migration path, not only a starting choice.

Read it like this

First pass: Understand what gets copied from the dense model.
Second pass: Study how experts specialize after continued training.
Then build taste: Compare the result with training a sparse model from scratch.

Build instinct

Sketch how you would split one dense MLP into multiple experts and what data you would use to specialize them.
