Adaptive Mixtures of Local Experts
The early neural-network root of the mixture-of-experts idea.
Instead of one model handling everything, use multiple specialists and a learned gate that chooses between them.
This specialist-and-dispatcher pattern is the early root of the expert routing used in modern MoE models.
The quick digest
The idea is almost organizational: not every expert should answer every question. Train several local experts, then train a gating model that decides which expert is appropriate for a given input.
In its early form, this is not about giant Transformers. It is about dividing a problem space into regions where a specialist can do better than one general model trying to smooth over everything.
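To make the expert-and-gate split concrete, here is a minimal NumPy sketch of a single forward pass. It assumes linear experts and a softmax gate whose weights blend the expert outputs; the name `mixture_forward`, the shapes, and the random weights are illustrative choices for this sketch, not details taken from the paper.

```python
import numpy as np

def softmax(z):
    # Subtract the row max before exponentiating for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_forward(x, expert_weights, gate_weights):
    """Hypothetical mixture forward pass.

    x:              (n, d) inputs
    expert_weights: (k, d) one linear scalar predictor per expert (assumption)
    gate_weights:   (d, k) linear gate followed by softmax (assumption)
    """
    gate = softmax(x @ gate_weights)        # (n, k): how much each expert is trusted per input
    expert_out = x @ expert_weights.T       # (n, k): every expert's prediction per input
    y = (gate * expert_out).sum(axis=1)     # (n,): gate-weighted combination
    return y, gate

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
experts = rng.normal(size=(2, 3))
gate_w = rng.normal(size=(3, 2))
y, gate = mixture_forward(x, experts, gate_w)
```

Each row of `gate` sums to 1, so it can be read as a per-input distribution over experts. Taking the argmax of each row instead of the weighted sum would give hard routing, where exactly one expert answers.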
The modern relevance is direct. MoE layers, model routers, agent delegation, and workflow routing all reuse this pattern: specialize the workers, then learn or design the dispatcher.
What to remember
Read it like this
- First pass: Read for the conceptual split between experts and gate.
- Second pass: Ignore the age of the examples.
- Then build taste: Jump forward to sparsely-gated MoE and the Switch Transformer.
Train two tiny classifiers on different data regions and a gate that decides which classifier handles each example.
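One way to try this exercise is sketched below in plain NumPy. It is a deliberately simplified version: the two regions are fixed by hand (left and right halves of a square), each expert is a logistic regression trained only on its own region, and the gate is a third logistic regression trained to predict the region. The paper trains experts and gate jointly, so treat this as a toy under those assumptions. The helper names `fit_logistic` and `predict_proba` and the synthetic data are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y       # derivative of the loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_proba(model, X):
    w, b = model
    return sigmoid(X @ w + b)

# Synthetic data (assumption): points in a square, with a rule that flips by region.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
region = (X[:, 0] >= 0).astype(float)       # 0 = left half, 1 = right half
# Left half: label is 1 when x1 > 0. Right half: label is 1 when x1 < 0.
y = np.where(region == 0, X[:, 1] > 0, X[:, 1] < 0).astype(float)

# Two local experts, each trained only on its own region.
expert_left = fit_logistic(X[region == 0], y[region == 0])
expert_right = fit_logistic(X[region == 1], y[region == 1])
# The gate learns which region an input belongs to.
gate = fit_logistic(X, region)
# Baseline: one classifier trying to cover the whole space.
single = fit_logistic(X, y)

# Hard routing: the gate picks one expert per example.
use_right = predict_proba(gate, X) >= 0.5
p_routed = np.where(use_right,
                    predict_proba(expert_right, X),
                    predict_proba(expert_left, X))

routed_acc = ((p_routed >= 0.5) == (y == 1)).mean()
single_acc = ((predict_proba(single, X) >= 0.5) == (y == 1)).mean()
print(f"routed experts: {routed_acc:.2f}, single model: {single_acc:.2f}")
```

Because the labeling rule flips between the two halves, a single linear classifier cannot fit both regions at once, while the routed pair can. Comparing the two printed accuracies is the point of the exercise.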