PAPER 30 · Bonus · MoE roots

Adaptive Mixtures of Local Experts

Jacobs et al. 1991 Paper

The early neural-network root of the mixture-of-experts idea.

Core concept

Instead of one model handling everything, use several specialist networks and a learned gate that assigns each input a weighting over them.

Why it mattered

It showed that a gating network trained alongside the experts can discover a useful division of labor, which is the expert-routing idea behind modern MoE models.

Visual shortcut · Classic expert routing

This is the original specialist-and-dispatcher pattern behind later MoE systems.

How it works

Train several local experts.

Train a gating network alongside them to weight the experts for each input.

Let experts specialize by input region.

Combine specialization with learned routing.
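
The four steps above can be sketched as a single forward pass. This is a minimal NumPy sketch under stated assumptions, not the paper's architecture: the experts and the gate are plain linear maps, and the names mixture_forward, expert_W, and gate_W are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_forward(x, expert_W, gate_W):
    """x: (n, d) inputs; expert_W: (k, d, m), one linear expert per slice; gate_W: (d, k)."""
    expert_out = np.einsum("nd,kdm->nkm", x, expert_W)   # every expert answers: (n, k, m)
    gate = softmax(x @ gate_W)                            # per-input weights over experts: (n, k)
    blended = np.einsum("nk,nkm->nm", gate, expert_out)   # gate-weighted combination: (n, m)
    return blended, gate, expert_out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # 4 inputs, 3 features (illustrative sizes)
expert_W = rng.normal(size=(2, 3, 1))  # 2 experts, scalar output each
gate_W = rng.normal(size=(3, 2))

blended, gate, expert_out = mixture_forward(x, expert_W, gate_W)

# Hard routing variant: keep only the expert the gate trusts most for each input.
hard = expert_out[np.arange(len(x)), gate.argmax(axis=1)]
print(gate.round(3))
```

The soft blend and the hard pick are two readings of "the gate chooses": the first mixes all experts by weight, the second dispatches each input to exactly one.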

The quick digest

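The idea is almost organizational: not every expert should answer every question. Train several local experts together with a gating network that decides how much each expert should be trusted for a given input. In the paper the two are trained jointly, so the split between experts is learned rather than assigned by hand.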

In its early form, this is not about giant Transformers. It is about dividing a problem space into regions where specialists can do better than one general model trying to smooth over everything.
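
Why specialists emerge instead of one smoothed-over blend can be seen in the training objective. The block below is a sketch in roughly the paper's notation: for training case c, d is the target, o_i is expert i's output, and p_i is the gate's probability for expert i. The gate logit symbol s_i is an assumption introduced here.

```latex
\[
p_i^c = \frac{\exp(s_i^c)}{\sum_j \exp(s_j^c)}, \qquad
E^c = -\log \sum_i p_i^c \exp\!\left(-\tfrac{1}{2}\,\lVert d^c - o_i^c \rVert^2\right)
\]
\[
\frac{\partial E^c}{\partial o_i^c} = -\,h_i^c\,\bigl(d^c - o_i^c\bigr), \qquad
h_i^c = \frac{p_i^c \exp\!\left(-\tfrac{1}{2}\lVert d^c - o_i^c \rVert^2\right)}
             {\sum_j p_j^c \exp\!\left(-\tfrac{1}{2}\lVert d^c - o_j^c \rVert^2\right)}
\]
```

Each expert's update is scaled by h_i, its share of the credit for that case. An expert that already fits a case well gets most of the gradient for it, while the others are left mostly alone, which is what pushes experts toward separate regions.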

The modern relevance is obvious once you see it. MoE layers, model routers, agent delegation, and workflow routing all reuse this pattern: specialize the workers, learn or design the dispatcher.

What to remember

One-liner

Specialists need a dispatcher.

Why it matters

The gate learns the problem split.

Builder instinct

This old idea is the seed of modern routing.

Read it like this

First pass: Read for the conceptual split between experts and gate.
Second pass: Ignore the age of the examples.
Then build taste: Jump forward to sparsely-gated MoE and the Switch Transformer.

Build instinct

Train two tiny classifiers on different data regions and a gate that decides which classifier handles each example.
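
One way to try this exercise, as a hedged sketch rather than the paper's method: it uses a two-stage simplification (fit each expert on its own region, then fit the gate) instead of joint training. The toy dataset, the helper names fit_logistic and predict_proba, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): the labeling rule flips depending on the sign of x0,
# so no single linear classifier can fit both halves.
X = rng.uniform(-1, 1, size=(2000, 2))
region = (X[:, 0] >= 0).astype(int)  # 0 = left half, 1 = right half
y = np.where(region == 0, X[:, 1] > 0, X[:, 1] < 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, steps=2000, lr=1.0):
    """Plain gradient descent on the logistic loss; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        w -= lr * X.T @ err / len(y)
        b -= lr * err.mean()
    return w, b

def predict_proba(model, X):
    w, b = model
    return sigmoid(X @ w + b)

# Baseline: one general classifier for the whole input space.
single = fit_logistic(X, y)
acc_single = ((predict_proba(single, X) > 0.5) == y).mean()

# Stage 1: each tiny expert sees only its own region.
experts = [fit_logistic(X[region == r], y[region == r]) for r in (0, 1)]

# Stage 2: the gate learns, from inputs alone, which expert explains each label better.
probs = np.stack([predict_proba(e, X) for e in experts], axis=1)  # (n, 2)
likelihood = np.where(y[:, None] == 1, probs, 1 - probs)          # P(true label | expert)
best_expert = likelihood.argmax(axis=1).astype(float)
gate = fit_logistic(X, best_expert)

# Inference: the gate picks an expert, and that expert alone answers.
route = (predict_proba(gate, X) > 0.5).astype(int)
routed = probs[np.arange(len(X)), route]
acc_moe = ((routed > 0.5) == y).mean()

print(f"single model accuracy: {acc_single:.3f}")
print(f"gated experts accuracy: {acc_moe:.3f}")
```

The gate is never told the region boundary directly. It only sees which expert did better on each example, and recovers the split from that.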
