PAPER 30 · Bonus · MoE roots

Adaptive Mixtures of Local Experts

Jacobs et al., 1991 · Paper

The early neural-network root of the mixture-of-experts idea.

Core concept

Instead of one model handling everything, use several specialists and a learned gate that decides which of them handles each input.

Why it mattered

Expert routing in modern MoE models traces back to this paper: a learned gate decides which experts handle each input.

Visual shortcut · Classic expert routing: one router dispatching to four local experts.

This is the original specialist-and-dispatcher pattern behind later MoE systems.

How it works
Train several local experts.
Train a gate to choose among them.
Let experts specialize by input region.
Combine specialization with learned routing.
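
The steps above can be sketched as a forward pass. This is a minimal illustration, not the paper's code: the scalar linear experts, the linear gate scores, and the names `softmax` and `mixture_predict` are all assumptions made for the example. The parameters are set by hand here; in practice they would be learned.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_predict(x, experts, gate_params):
    """Blend expert outputs using gate weights.

    experts:     list of (w, b) pairs, each a hypothetical expert y = w*x + b
    gate_params: list of (v, c) pairs, one gate score v*x + c per expert
    """
    weights = softmax([v * x + c for v, c in gate_params])
    outputs = [w * x + b for w, b in experts]
    y = sum(p * o for p, o in zip(weights, outputs))
    return y, weights


# Two hand-set experts: one fits the region x < 0, the other x > 0.
experts = [(-1.0, 0.0), (1.0, 0.0)]
# The gate favors expert 0 for negative x and expert 1 for positive x.
gate_params = [(-5.0, 0.0), (5.0, 0.0)]

y_pos, w_pos = mixture_predict(2.0, experts, gate_params)
y_neg, w_neg = mixture_predict(-2.0, experts, gate_params)
print(y_pos, w_pos)
print(y_neg, w_neg)
```

Neither linear expert can represent |x| alone, but the gated pair approximates it, because each expert only has to be right in its own region.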

The quick digest

The idea is almost organizational: not every expert should answer every question. Train several local experts, then train a gating model that decides which expert is appropriate for a given input.
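
In symbols, the gate and the combined output are usually written as below. The notation is an assumption introduced here, not taken from this digest: s_i(x) is the gate's score for expert i, p_i(x) the resulting gate weight, o_i(x) that expert's output, and d the target. The second line is the training objective as the paper is commonly presented. It scores each expert against the target individually, so experts compete for inputs instead of jointly patching one shared error.

```latex
p_i(x) = \frac{\exp s_i(x)}{\sum_{j} \exp s_j(x)}, \qquad
y(x) = \sum_{i} p_i(x)\, o_i(x)

E = -\log \sum_{i} p_i(x)\, \exp\!\left(-\tfrac{1}{2}\,\lVert d - o_i(x) \rVert^2\right)
```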

In early form, this is not about giant Transformers. It is about dividing a problem space into regions where specialists can do better than one general model trying to smooth over everything.

The modern relevance is obvious once you see it. MoE layers, model routers, agent delegation, and workflow routing all reuse this pattern: specialize the workers, learn or design the dispatcher.

What to remember

One-liner
Specialists need a dispatcher.
Why it matters
The gate learns the problem split.
Builder instinct
This old idea is the seed of modern routing.

Read it like this

Build instinct

Train two tiny classifiers on different data regions and a gate that decides which classifier handles each example.
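
One way to try this exercise is sketched below in plain Python. Everything in it is an assumption chosen for illustration: the 1-D toy task (label 1 when |x| < 0.5), the split at x = 0, the logistic-regression experts, and the helper names `train_logistic` and `predict`. The gate is trained on explicit region labels, which is a simplification; the paper learns the gate jointly with the experts.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=1.0, steps=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy data: 101 evenly spaced points in [-1, 1], positive when |x| < 0.5.
xs = [i / 50 - 1 for i in range(101)]
ys = [1 if abs(x) < 0.5 else 0 for x in xs]

# Each expert only sees its own region, where the task is linearly separable.
left = [(x, y) for x, y in zip(xs, ys) if x < 0]
right = [(x, y) for x, y in zip(xs, ys) if x >= 0]
expert_left = train_logistic([x for x, _ in left], [y for _, y in left])
expert_right = train_logistic([x for x, _ in right], [y for _, y in right])

# The gate learns which region an input belongs to (1 = right region).
gate = train_logistic(xs, [1 if x >= 0 else 0 for x in xs])

# Baseline: a single logistic model trained on everything.
single = train_logistic(xs, ys)


def predict(x):
    """Gate-weighted blend of the two experts' probabilities."""
    g = sigmoid(gate[0] * x + gate[1])
    p_left = sigmoid(expert_left[0] * x + expert_left[1])
    p_right = sigmoid(expert_right[0] * x + expert_right[1])
    return (1.0 - g) * p_left + g * p_right


def accuracy_of(prob_fn):
    hits = sum((prob_fn(x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


accuracy = accuracy_of(predict)
single_accuracy = accuracy_of(lambda x: sigmoid(single[0] * x + single[1]))
print("mixture:", accuracy, "single model:", single_accuracy)
```

The single model has only one decision boundary, so it cannot carve out the middle band; the gated pair can, because each expert needs just one boundary inside its own region.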
