Sparse Upcycling
A practical technique for converting dense checkpoints into Mixture-of-Experts (MoE) models, reusing the compute already spent on them.
Sparse upcycling turns a trained dense model into an MoE model instead of training sparse from scratch.
It treats an expensive, already-trained dense checkpoint as a reusable asset: raw material for building a larger sparse model.
The quick digest
If you already spent a fortune training a dense model, starting over to build an MoE model is painful. Sparse upcycling asks whether you can instead copy the dense model's MLP weights into several initially identical experts, add a freshly initialized router, and continue training from there.
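The copy-and-route step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the helper names (`make_dense_mlp`, `upcycle`, `moe_forward`), the layer sizes, the ReLU activation, and the choice to renormalize the top-k gate weights so they sum to 1 are all hypothetical. With identical expert copies and gates that sum to 1, the upcycled layer reproduces the dense layer's output at initialization.

```python
import copy
import math
import random

def make_dense_mlp(d_model, d_hidden, rng):
    """Toy dense MLP block: two weight matrices stored as nested lists."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return {"w_in": w_in, "w_out": w_out}

def matvec(x, w):
    """Multiply a length-n vector by an n x m matrix, giving a length-m vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def mlp_forward(mlp, x):
    hidden = [max(0.0, v) for v in matvec(x, mlp["w_in"])]  # ReLU (assumed)
    return matvec(hidden, mlp["w_out"])

def upcycle(dense_mlp, num_experts, rng):
    """Copy the dense MLP into every expert and add a new random router."""
    d_model = len(dense_mlp["w_in"])
    experts = [copy.deepcopy(dense_mlp) for _ in range(num_experts)]
    router = [[rng.gauss(0, 0.02) for _ in range(num_experts)] for _ in range(d_model)]
    return {"experts": experts, "router": router}

def moe_forward(moe, x, top_k=2):
    """Route x to its top_k experts and mix their outputs with softmax gates."""
    logits = matvec(x, moe["router"])
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    gates = [v / total for v in exps]  # renormalized over the chosen experts
    out = [0.0] * len(x)
    for gate, e in zip(gates, top):
        y = mlp_forward(moe["experts"][e], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

rng = random.Random(0)
dense = make_dense_mlp(d_model=4, d_hidden=8, rng=rng)
moe = upcycle(dense, num_experts=4, rng=rng)

x = [rng.gauss(0, 1) for _ in range(4)]
dense_out = mlp_forward(dense, x)
moe_out = moe_forward(moe, x, top_k=2)
max_diff = max(abs(a - b) for a, b in zip(dense_out, moe_out))
print("max |dense - moe| at initialization:", max_diff)
```

The experts are deep copies rather than shared references, so continued training can move them apart. If the gates were not renormalized, the initial output would be a scaled version of the dense output rather than an exact match.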
The intuition is like turning one general department into several specialist teams without hiring from zero. The experts inherit useful knowledge, then specialize during continued training.
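To make the "inherit, then specialize" idea concrete, here is a deliberately tiny sketch. Everything in it is an assumption chosen for illustration: each expert is a single scalar weight, the router is a fixed rule (the sign of the input) standing in for a learned one, and the task is fitting |x|. Both experts start from the same hypothetical inherited weight and drift apart because each one only receives gradients from the tokens routed to it.

```python
import random

rng = random.Random(0)

w_dense = 0.2                 # hypothetical weight inherited from the dense model
experts = [w_dense, w_dense]  # both experts start as identical copies

def route(x):
    """Fixed stand-in for a learned router: the input's sign picks the expert."""
    return 0 if x < 0 else 1

learning_rate = 0.1
for _ in range(500):
    x = rng.uniform(-1.0, 1.0)
    target = abs(x)                   # slope -1 for x < 0, slope +1 for x >= 0
    e = route(x)
    prediction = experts[e] * x
    grad = 2.0 * (prediction - target) * x  # d/dw of squared error
    experts[e] -= learning_rate * grad      # only the routed expert is updated

print("expert weights after training:", experts)
```

After training, expert 0 ends up near -1 and expert 1 near +1: identical starting points, different specialties. In real upcycling the router is trained jointly with the experts, so which tokens each expert sees is itself learned.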
The paper matters because it is a practical path from dense to sparse. It says MoE is not only an architecture you choose before training; it can also be a continuation strategy for models you already have.
What to remember
Read it like this
- First pass: Understand what gets copied from the dense model.
- Second pass: Study how experts specialize after continued training.
- Then build taste: Compare with training sparse from scratch.
Sketch how you would split one dense MLP into multiple experts and what data you would use to specialize them.