PAPER 25 · MoE economics

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Du et al. 2022 Paper

Validated the economics of MoE scaling: a very large total parameter count with a much smaller active parameter count per token.

Core concept

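GLaM uses MoE to get huge total capacity while activating only a small part of the model for each token. In the largest configuration the paper reports, roughly 97B of about 1.2T parameters (around 8%) are active per token.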

Why it mattered

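It made the economic argument for sparse language models concrete: the paper reports that its largest model outperformed GPT-3 on average across its benchmark suite while using about one third of the training energy and about half the inference FLOPs.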

Visual shortcut · Capacity without dense cost

GLaM shows the economic promise of sparse models: large total capacity with lower active compute.

How it works

Build a very large expert pool.

Route each token to a few selected experts (the top two in GLaM).

Run only a small active slice.

Compare quality per unit of compute.
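The first three steps above can be sketched as a toy routing layer. This is a minimal illustration in plain Python, not GLaM's implementation: the expert functions, the router weights, and the choice to renormalize gate weights over the selected experts are all assumptions made for the example.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_index, gate_weight), ...]. Gate weights are
    renormalized over the chosen experts (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, experts, router_weights, k=2):
    """Run one token through only its k selected experts."""
    # One router score per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    out = [0.0] * len(token)
    for idx, gate in chosen:
        y = experts[idx](token)  # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, [idx for idx, _ in chosen]

# Hypothetical toy setup: 8 "experts" that just scale the input.
random.seed(0)
num_experts, dim = 8, 4
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(num_experts)]
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]

token = [0.5, -1.0, 0.25, 2.0]
out, used = moe_layer(token, experts, router_weights)
print(f"experts used: {used} of {num_experts}")
```

The pool holds eight experts, but each token pays for two. A real system also has to balance load across experts and move tokens between devices, which this sketch leaves out.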

The quick digest

GLaM is about capacity economics. A dense model pays for every parameter on every token. A sparse MoE model can hold many experts but activate only the few its router selects, so total capacity and active compute are decoupled.
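A rough back-of-the-envelope calculation makes the separation visible. The parameter counts below are the approximate figures reported for GLaM's largest configuration and a GPT-3-sized dense baseline; the "2 FLOPs per active parameter per token" rule is a common rule-of-thumb assumption, not a number taken from the paper.

```python
# Approximate figures for GLaM's largest configuration, as reported in the paper.
GLAM_TOTAL_PARAMS = 1.2e12
GLAM_ACTIVE_PARAMS = 96.6e9
# A GPT-3-sized dense baseline: every parameter is active on every token.
DENSE_PARAMS = 175e9

def forward_flops_per_token(active_params):
    # Rule-of-thumb assumption: about 2 FLOPs per active parameter per token.
    return 2.0 * active_params

active_share = GLAM_ACTIVE_PARAMS / GLAM_TOTAL_PARAMS
capacity_ratio = GLAM_TOTAL_PARAMS / DENSE_PARAMS
flops_ratio = (forward_flops_per_token(GLAM_ACTIVE_PARAMS)
               / forward_flops_per_token(DENSE_PARAMS))

print(f"share of parameters active per token: {active_share:.3f}")
print(f"total capacity vs dense baseline:     {capacity_ratio:.2f}x")
print(f"per-token compute vs dense baseline:  {flops_ratio:.2f}x")
```

Under these assumptions the sparse model holds several times the parameters of the dense baseline while spending roughly half the per-token compute, which is in line with the inference saving the paper describes.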

The paper shows that this can produce strong quality with better efficiency. But it also makes clear that sparse models are systems problems: routing, balancing, communication, and serving all become part of the architecture.

The practical lesson is that MoE is not just "a bigger model." It is a different cost structure. You buy optional capacity, and then you need a good router and the infrastructure to use it.

What to remember

One-liner

MoE is an economic architecture.

Why it matters

Sparse capacity can be cheaper than dense capacity.

Builder instinct

Routing and systems complexity are the price.

Read it like this

First pass: Track active compute versus total capacity.
Second pass: Look at quality/cost comparisons.
Then build taste: Connect GLaM to Switch and Mixtral.

Build instinct

Estimate serving cost for a dense model and an MoE model using active parameters, memory footprint, and expected throughput.
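One way to start the exercise is a small estimator like the one below. It is a sketch under loud assumptions: weights are stored in 16 bits, serving is compute-bound at a fixed sustained throughput, and the accelerator's memory, speed, and price are hypothetical placeholders. It ignores KV cache, communication, memory bandwidth, and expert load imbalance, all of which matter in practice.

```python
import math
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    total_params: float        # drives memory footprint
    active_params: float       # drives per-token compute
    bytes_per_param: float = 2.0  # assumption: 16-bit weights

@dataclass
class Accelerator:
    memory_bytes: float        # hypothetical device memory
    flops_per_sec: float       # hypothetical sustained (not peak) throughput
    dollars_per_hour: float    # hypothetical rental price

def estimate(model: ModelSpec, hw: Accelerator) -> dict:
    weight_bytes = model.total_params * model.bytes_per_param
    # Minimum devices needed just to hold the weights.
    devices = math.ceil(weight_bytes / hw.memory_bytes)
    # Rule-of-thumb assumption: about 2 FLOPs per active parameter per token.
    flops_per_token = 2.0 * model.active_params
    tokens_per_sec = devices * hw.flops_per_sec / flops_per_token
    dollars_per_sec = devices * hw.dollars_per_hour / 3600.0
    return {
        "name": model.name,
        "weight_gb": weight_bytes / 1e9,
        "min_devices": devices,
        "tokens_per_sec": tokens_per_sec,
        "dollars_per_million_tokens": dollars_per_sec / tokens_per_sec * 1e6,
    }

# Placeholder hardware; replace with real measurements.
hw = Accelerator(memory_bytes=80e9, flops_per_sec=100e12, dollars_per_hour=2.0)
dense = ModelSpec("dense-175B", total_params=175e9, active_params=175e9)
moe = ModelSpec("moe-1.2T", total_params=1.2e12, active_params=96.6e9)

dense_est = estimate(dense, hw)
moe_est = estimate(moe, hw)
for est in (dense_est, moe_est):
    print(est)
```

With these placeholder numbers the MoE model costs less per token at full utilization but needs many more devices just to be loaded. That is the trade the digest describes: cheaper active compute, paid for with a larger memory footprint and more systems work.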
