PAPER 31 · Bonus · MoE roots

Hierarchical Mixtures of Experts

Jordan and Jacobs 1994 Paper

Extends mixture-of-experts from a single flat gate to a tree of nested gating networks.

Core concept

Hierarchical MoE routes inputs through a tree: broad decision first, specialist decision second.

Why it mattered

It shows that expert routing can be structured, not just a flat pick among specialists.

Visual shortcut · Router tree
[Diagram: one top-level gate feeding two local gates, which route to experts E1, E2, E3, and E4.]

Hierarchical MoE routes like an org chart: category first, specialist second.

How it works
Make a broad routing decision.
Route inside the selected branch.
Use a narrower expert.
Let hierarchy organize specialization.
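
The four steps above can be sketched as a two-level, top-down router. This is a minimal illustration, not the paper's training procedure: the gate weights, the two-experts-per-branch layout, and the helper names (`softmax`, `linear`, `route`) are all assumptions chosen for the example.

```python
import math

def softmax(scores):
    """Turn raw gate scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """One score per weight row: the dot product of that row with x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical toy parameters: 2 input features, 2 branches, 2 experts per branch.
TOP_GATE = [[1.0, 0.0], [-1.0, 0.0]]      # prefers branch 0 when x[0] > 0
LOCAL_GATES = [
    [[0.0, 1.0], [0.0, -1.0]],            # inside branch 0: split on x[1]
    [[0.0, 1.0], [0.0, -1.0]],            # inside branch 1: split on x[1]
]
EXPERTS = [["E1", "E2"], ["E3", "E4"]]    # assumed grouping of the four experts

def route(x):
    """Broad decision first, then a decision inside the selected branch."""
    branch_probs = softmax(linear(TOP_GATE, x))
    branch = argmax(branch_probs)
    expert_probs = softmax(linear(LOCAL_GATES[branch], x))
    leaf = argmax(expert_probs)
    # Path confidence is the product of the gate probabilities along the path.
    return EXPERTS[branch][leaf], branch_probs[branch] * expert_probs[leaf]

if __name__ == "__main__":
    print(route([2.0, 1.0]))
    print(route([-1.0, -3.0]))
```

Only the local gate of the chosen branch is evaluated here, which is where a hard-routed tree saves work compared with scoring every expert from one flat gate.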

The quick digest

Flat MoE asks one gate to choose among all experts. Hierarchical MoE organizes experts into a tree. An input first goes through a broad gate, then a narrower one, until it reaches the relevant specialist.
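
For reference, the two-level model in Jordan and Jacobs uses soft gates rather than a hard pick: every expert contributes, weighted by the product of the gate probabilities along its path. The notation below loosely follows the paper and is meant as a sketch, with linear gate scores assumed.

```latex
\mu(x) = \sum_{i} g_i(x) \sum_{j} g_{j|i}(x)\, \mu_{ij}(x),
\qquad
g_i(x) = \frac{e^{v_i^{\top} x}}{\sum_{k} e^{v_k^{\top} x}},
\qquad
g_{j|i}(x) = \frac{e^{v_{ij}^{\top} x}}{\sum_{k} e^{v_{ik}^{\top} x}}
```

Here $g_i$ is the top-level gate's weight for branch $i$, $g_{j|i}$ is the local gate's weight for expert $j$ inside that branch, and $\mu_{ij}$ is that expert's output. Following only the largest gate weight at each level gives the hard, one-path routing described in this digest.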

That mirrors how humans route work: first decide the category, then pick the specialist. For complex domains, hierarchy can make routing more interpretable and modular.

Today, this idea shows up outside neural layers too. Model cascades, agent teams, domain routers, and tool-selection trees all use broad-to-specific delegation.

What to remember

One-liner
Routing can be broad-to-specific.
Why it matters
Hierarchy makes specialization modular.
Builder instinct
Bad routing compounds down the tree.
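
A rough way to see how routing errors compound, assuming hard routing and that each gate is right independently with the same probability (both are simplifying assumptions, and `path_accuracy` is a name made up for this sketch):

```python
def path_accuracy(per_gate_accuracy, depth):
    """Chance that every gate on a root-to-leaf path picks the right branch."""
    return per_gate_accuracy ** depth

if __name__ == "__main__":
    for depth in (1, 2, 3, 4):
        # With 90% accurate gates, each added level costs about a tenth
        # of the accuracy that remains.
        print(depth, round(path_accuracy(0.9, depth), 3))
```

Under these assumptions, 90% accurate gates reach the right expert about 81% of the time at depth two and about 73% at depth three.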

Read it like this

Build instinct

Design a router tree for local AI tasks: coding, research, summarization, extraction, and numeric analysis.
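
One possible starting point for that exercise is sketched below. Everything in it is hypothetical: the two branch names (`language`, `compute`), the keyword lists, and the substring scoring are placeholders standing in for whatever classifier or model does the real routing.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    keywords: tuple = ()
    children: list = field(default_factory=list)

def score(node, text):
    """Count how many of the node's keywords appear in the text (substring match)."""
    lowered = text.lower()
    return sum(keyword in lowered for keyword in node.keywords)

def route(root, text):
    """Walk from the root, taking the best-scoring child at each level."""
    path = [root.name]
    node = root
    while node.children:
        best = max(node.children, key=lambda child: score(child, text))
        if score(best, text) == 0:
            break  # no child matches: stop at this level rather than guess
        node = best
        path.append(node.name)
    return path

# Hypothetical tree: a broad category first, then one of the five task specialists.
TREE = Node("root", children=[
    Node("language",
         ("summarize", "summary", "extract", "research", "sources", "paper"),
         [
             Node("research", ("research", "sources", "compare", "cite")),
             Node("summarization", ("summarize", "summary", "shorten")),
             Node("extraction", ("extract", "fields", "parse", "json")),
         ]),
    Node("compute",
         ("code", "function", "bug", "calculate", "average", "statistics"),
         [
             Node("coding", ("code", "function", "bug", "refactor")),
             Node("numeric analysis",
                  ("calculate", "average", "statistics", "regression")),
         ]),
])

if __name__ == "__main__":
    print(route(TREE, "Fix the bug in this function"))
    print(route(TREE, "Summarize this paper in three bullets"))
```

Stopping at the parent when no child matches is one design choice for limiting compounding errors: an uncertain request stays at the broad level instead of being forced onto a specialist.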
