PAPER 20 · MoE from dense

Sparse Upcycling

Komatsuzaki et al. 2022 / ICLR 2023 Paper

A practical technique for converting dense checkpoints into Mixture-of-Experts (MoE) models, reusing the compute already spent on dense training.

Core concept

Sparse upcycling turns a trained dense model into an MoE model instead of training sparse from scratch.

Why it mattered

It treats expensive dense checkpoints as reusable assets for building larger sparse models.

Visual shortcut · Turn dense into sparse
dense model → copy into experts → continue training → sparse model

Sparse upcycling treats an already-trained dense model as raw material for a larger MoE model.

How it works
Start with a dense model that already knows useful things.
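Copy the dense feed-forward (MLP) weights into each expert; the paper initializes every expert as an identical copy of the dense layer.
Add a sparse router, initialized from scratch, that sends each token to only a few experts.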
Continue training so experts specialize.
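
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the names `upcycle` and `moe_forward`, the tiny layer sizes, the top-2 routing, and the choice to renormalize gate weights over the selected experts are all assumptions made for the example.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def dense_mlp(params, x):
    """Dense feed-forward block: w_out @ relu(w_in @ x)."""
    return matvec(params["w_out"], relu(matvec(params["w_in"], x)))

def upcycle(dense_params, num_experts, d_model, rng):
    """Copy the dense MLP into every expert and add a fresh router (hypothetical helper)."""
    experts = [
        {name: [row[:] for row in W] for name, W in dense_params.items()}
        for _ in range(num_experts)
    ]
    # The router has no dense counterpart, so it starts from small random values.
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
              for _ in range(num_experts)]
    return {"experts": experts, "router": router}

def moe_forward(moe, x, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    logits = matvec(moe["router"], x)
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Assumption: softmax over the chosen experts only, so gate weights sum to 1.
    m = max(logits[e] for e in chosen)
    exps = {e: math.exp(logits[e] - m) for e in chosen}
    total = sum(exps.values())
    out = [0.0] * len(moe["experts"][0]["w_out"])
    for e in chosen:
        y = dense_mlp(moe["experts"][e], x)
        out = [o + (exps[e] / total) * yi for o, yi in zip(out, y)]
    return out, chosen

rng = random.Random(0)
d_model, d_hidden = 4, 8
dense = {
    "w_in": [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)],
    "w_out": [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)],
}
moe = upcycle(dense, num_experts=4, d_model=d_model, rng=rng)
token = [rng.gauss(0, 1) for _ in range(d_model)]
dense_out = dense_mlp(dense, token)
moe_out, chosen = moe_forward(moe, token)
```

Because every expert is an identical copy and the gate weights in this sketch sum to 1, `moe_out` matches `dense_out` right after upcycling. Any difference between experts has to come from the continued-training step.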

The quick digest

If you have already spent a fortune training a dense model, starting over to build an MoE model is painful. Sparse upcycling asks whether you can instead copy pieces of the dense model into multiple experts, add routing, and continue training from there.

The intuition is like turning one general department into several specialist teams without hiring from zero. The experts inherit useful knowledge, then specialize during continued training.
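
A toy example can make "inherit, then specialize" concrete. Everything here is hypothetical and deliberately tiny: each expert is a single weight, the router is a fixed rule standing in for a learned one, and the target data is invented so that one shared weight cannot fit it.

```python
# Toy setup (all values hypothetical): each "expert" is one weight w in y = w * x.
dense_w = 1.0                  # stands in for the pretrained dense MLP
experts = [dense_w, dense_w]   # upcycling: every expert starts as a copy

def route(x):
    """Fixed stand-in for a learned router: positive inputs go to expert 0, the rest to expert 1."""
    return 0 if x > 0 else 1

def target(x):
    """Continued-training data that a single shared weight cannot fit exactly."""
    return 1.5 * x if x > 0 else 0.5 * x

inputs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
lr = 0.05
for _ in range(200):
    for x in inputs:
        e = route(x)
        error = experts[e] * x - target(x)
        # Squared-error gradient step on the routed expert only; the other expert is untouched.
        experts[e] -= lr * 2.0 * error * x

print(experts)
```

Both experts begin at the dense weight, but each one only receives gradients from the tokens routed to it, so they drift toward different values during continued training.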

The paper matters because it is a practical path from dense to sparse. It says MoE is not only an architecture you choose before training; it can also be a continuation strategy for models you already have.

What to remember

One-liner
Dense checkpoints can become sparse models.
Why it matters
Experts inherit first, specialize later.
Builder instinct
MoE can be a migration path, not only a starting choice.

Read it like this

Build instinct

Sketch how you would split one dense MLP into multiple experts and what data you would use to specialize them.
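
One possible starting point for the splitting half of this exercise, as a sketch only: the paper itself copies the full MLP into each expert, so slicing the hidden dimension is an alternative reading of "split", and `split_mlp` is a hypothetical helper. Which data to use for specialization is left open.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def mlp(w_in, w_out, x):
    """Feed-forward block: w_out @ relu(w_in @ x)."""
    return matvec(w_out, relu(matvec(w_in, x)))

def split_mlp(w_in, w_out, num_experts):
    """Give each expert a contiguous slice of the hidden units (hypothetical helper)."""
    d_hidden = len(w_in)
    assert d_hidden % num_experts == 0, "hidden size must divide evenly"
    size = d_hidden // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * size, (e + 1) * size
        experts.append({
            "w_in": [row[:] for row in w_in[lo:hi]],   # rows of w_in are hidden units
            "w_out": [row[lo:hi] for row in w_out],    # matching columns of w_out
        })
    return experts

# Small hand-written weights: 2 model dimensions, 4 hidden units.
w_in = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.5], [2.0, 1.0]]
w_out = [[1.0, 0.0, -1.0, 0.5], [0.5, 1.0, 2.0, -1.0]]
x = [1.0, 2.0]

dense_out = mlp(w_in, w_out, x)
experts = split_mlp(w_in, w_out, num_experts=2)
expert_outs = [mlp(e["w_in"], e["w_out"], x) for e in experts]
summed = [sum(vals) for vals in zip(*expert_outs)]
```

With an elementwise activation, the slices' outputs add up to the dense output, so `summed` equals `dense_out` when every expert runs. Once routing activates only some experts, that equality no longer holds at initialization, which is one trade-off to weigh against plain copying.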
