PAPER 25 · MoE economics

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Du et al. 2022 Paper

Validated the scaling economics of mixture-of-experts (MoE) models: very large total parameter counts with much smaller active parameter counts per token.

Core concept

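GLaM uses MoE to get huge total capacity while activating only a small part of the model for each token. Its largest model has about 1.2 trillion parameters but activates roughly 97 billion of them (about 8%) per token.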

Why it mattered

It made the economic argument for sparse language models much clearer.

Visual shortcut · Capacity without dense cost
Diagram labels: router · many experts · few active · huge capacity · low active cost

GLaM shows the economic promise of sparse models: large total capacity with lower active compute.

How it works
Build a very large expert pool.
Route tokens to selected experts.
Run only a small active slice.
Compare quality per unit of compute.
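
The routing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the expert functions, the router weights, and the names `route_top_k` and `moe_layer` are all hypothetical, and renormalizing the gate weights over the selected experts is one common choice rather than a confirmed detail of GLaM.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts (assumption)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, experts, router_weights, k=2):
    """Run one token through only its k selected experts and mix their outputs."""
    # Router: one linear score per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    out = [0.0] * len(token)
    for idx, gate in chosen:
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]

def make_expert(scale):
    # Hypothetical stand-in for an expert feed-forward network.
    return lambda token: [scale * x for x in token]

random.seed(0)
NUM_EXPERTS, DIM = 8, 4
experts = [make_expert(s) for s in range(1, NUM_EXPERTS + 1)]
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, used = moe_layer(token, experts, router_weights, k=2)
print(f"experts used: {used} of {NUM_EXPERTS}")
```

The point of the sketch is the loop in `moe_layer`: the pool holds eight experts, but each token pays for only two of them.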

The quick digest

GLaM is about capacity economics. A dense model pays for every parameter on every token. A sparse MoE model can hold many experts but activate only the relevant ones, so total capacity and active compute separate.
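
To see how total capacity and active compute separate, it helps to count feed-forward parameters for a dense stack and a sparse one. The sketch below uses toy, hypothetical sizes (not GLaM's configuration), ignores attention and embedding parameters, and assumes that MoE layers are interleaved with dense layers via a made-up `moe_every` setting.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices (up- and down-projection); biases ignored.
    return 2 * d_model * d_ff

def count_ffn_params(num_layers, d_model, d_ff, num_experts=1, top_k=1, moe_every=1):
    """Return (total, active_per_token) feed-forward parameter counts.

    Layers whose 1-based index is a multiple of moe_every hold num_experts
    experts and activate top_k of them; all other layers are dense.
    """
    per_ffn = ffn_params(d_model, d_ff)
    total = active = 0
    for layer in range(1, num_layers + 1):
        is_moe = num_experts > 1 and layer % moe_every == 0
        total += per_ffn * (num_experts if is_moe else 1)
        active += per_ffn * (top_k if is_moe else 1)
    return total, active

# Toy configuration, chosen only for illustration.
dense = count_ffn_params(num_layers=4, d_model=1024, d_ff=4096)
sparse = count_ffn_params(num_layers=4, d_model=1024, d_ff=4096,
                          num_experts=64, top_k=2, moe_every=2)

print("dense  total/active:", dense)
print("sparse total/active:", sparse)
print("sparse total is", round(sparse[0] / sparse[1], 1), "x its active count")
```

In the dense stack the two numbers are identical. In the sparse stack the total grows with the number of experts while the active count grows only with `top_k`.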

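The paper reports that its largest model outperforms GPT-3 on average across its benchmark suite while using roughly one third of the training energy and about half the inference FLOPs per token. But it also makes clear that sparse models are systems problems: routing, load balancing, cross-device communication, and serving all become part of the architecture.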
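
Balancing is the easiest of these systems concerns to make concrete. The sketch below computes an auxiliary load-balancing score in the style popularized by Switch Transformer (number of experts times the sum over experts of routed fraction times mean router probability). It is offered as an illustration of the idea, not as GLaM's exact formulation, and the function name `load_balance_loss` is hypothetical.

```python
def load_balance_loss(router_probs):
    """router_probs: one probability list over experts per token.

    Returns num_experts * sum_i(fraction_routed_i * mean_prob_i), using
    top-1 assignment to measure the routed fraction. The value is 1.0 when
    tokens and probability mass are spread evenly, and larger when routing
    collapses onto a few experts.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        counts[top1] += 1
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    fractions = [c / num_tokens for c in counts]
    return num_experts * sum(f * p for f, p in zip(fractions, mean_prob))

# Four tokens, four experts: each token prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

print("balanced :", load_balance_loss(balanced))
print("collapsed:", load_balance_loss(collapsed))
```

A router that collapses onto one expert wastes the rest of the pool and overloads one device, which is why a term like this is typically added to the training loss.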

The practical lesson is that MoE is not just a “bigger model.” It is a different cost structure. You buy optional capacity, and then you need a good router and the infrastructure to use it.

What to remember

One-liner: MoE is an economic architecture.
Why it matters: Sparse capacity can be cheaper than dense capacity.
Builder instinct: Routing and systems complexity are the price.

Read it like this

Build instinct

Estimate serving cost for a dense model and an MoE model using active parameters, memory footprint, and expected throughput.
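
A back-of-the-envelope version of that exercise might look like the sketch below. It rests on several loudly stated assumptions: 16-bit weights, the rule of thumb of about 2 FLOPs per active parameter per token, a compute-bound throughput ceiling, and a hypothetical accelerator budget and utilization. It ignores memory bandwidth, KV cache, batching, and expert-to-expert communication, all of which matter in practice. The parameter counts are the approximate published figures for a 175B dense model and for GLaM's largest model.

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    total_params: float      # parameters that must be held in memory
    active_params: float     # parameters used per token
    bytes_per_param: int = 2  # assumption: 16-bit weights

def serving_estimate(spec, accelerator_flops, utilization=0.3):
    """Rough serving numbers; utilization and accelerator_flops are hypothetical inputs."""
    weight_memory_gb = spec.total_params * spec.bytes_per_param / 1e9
    flops_per_token = 2 * spec.active_params  # rule-of-thumb forward-pass cost
    tokens_per_second = accelerator_flops * utilization / flops_per_token
    return {
        "weight_memory_gb": weight_memory_gb,
        "flops_per_token": flops_per_token,
        "tokens_per_second": tokens_per_second,
    }

DENSE = ModelSpec("dense-175B", total_params=175e9, active_params=175e9)
MOE = ModelSpec("moe-1.2T", total_params=1.2e12, active_params=97e9)

for spec in (DENSE, MOE):
    est = serving_estimate(spec, accelerator_flops=1e15)  # hypothetical aggregate budget
    print(f"{spec.name}: {est['weight_memory_gb']:.0f} GB weights, "
          f"{est['flops_per_token']:.2e} FLOPs/token, "
          f"{est['tokens_per_second']:.0f} tokens/s upper bound")
```

Under these assumptions the MoE model needs far more memory to hold its weights but fewer FLOPs per token, which is the trade the digest describes: you pay for capacity in memory and infrastructure rather than in per-token compute.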
