PAPER 31 · Bonus · MoE roots

Hierarchical Mixtures of Experts

Jordan and Jacobs 1994 Paper

Extends the mixture-of-experts model into a hierarchy of nested gates.

Core concept

Hierarchical MoE routes inputs through a tree: broad decision first, specialist decision second.

Why it mattered

It shows that expert routing can be structured, not just a flat pick among specialists.

Visual shortcut · Router tree

Hierarchical MoE routes like an org chart: category first, specialist second.

How it works

Make a broad routing decision.

Route inside the selected branch.

Use a narrower expert.

Let hierarchy organize specialization.
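The steps above can be sketched as a two-level tree in plain Python. This is a toy under stated assumptions, not the paper's training procedure: the weights are hand-picked rather than learned, the experts are simple linear functions, and every name (TOP_GATE, BRANCH_GATES, EXPERTS, route_hard, route_soft) is hypothetical. route_hard follows the steps literally by committing to one branch and one expert; route_soft blends all experts by their gate probabilities, which is closer to the soft gating of the original model.

```python
import math

def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical toy parameters: 2 branches, 2 experts per branch, 2-d input.
TOP_GATE = [[1.0, 0.0], [-1.0, 0.0]]      # one weight vector per branch
BRANCH_GATES = [
    [[0.0, 1.0], [0.0, -1.0]],            # gate inside branch 0
    [[0.0, 1.0], [0.0, -1.0]],            # gate inside branch 1
]
EXPERTS = [
    [[1.0, 1.0], [1.0, -1.0]],            # linear experts in branch 0
    [[-1.0, 1.0], [-1.0, -1.0]],          # linear experts in branch 1
]

def route_hard(x):
    """Pick the most probable branch, then the most probable expert inside it."""
    top = softmax([dot(w, x) for w in TOP_GATE])
    i = max(range(len(top)), key=top.__getitem__)
    inner = softmax([dot(w, x) for w in BRANCH_GATES[i]])
    j = max(range(len(inner)), key=inner.__getitem__)
    return (i, j), dot(EXPERTS[i][j], x)

def route_soft(x):
    """Blend every expert, weighted by the product of gate probabilities on its path."""
    top = softmax([dot(w, x) for w in TOP_GATE])
    out = 0.0
    for i, g_i in enumerate(top):
        inner = softmax([dot(w, x) for w in BRANCH_GATES[i]])
        for j, g_ij in enumerate(inner):
            out += g_i * g_ij * dot(EXPERTS[i][j], x)
    return out
```

With these toy weights the first input coordinate decides the branch and the second decides the expert inside it, so each level has a distinct job. When the gates are confident, the soft blend and the hard pick give nearly the same output.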

The quick digest

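Flat MoE asks one gate to choose among experts. Hierarchical MoE organizes experts into a tree. An input first passes through a broad gate, then a narrower one, until it reaches the relevant specialist. In Jordan and Jacobs's formulation the gates are soft: every expert contributes, weighted by the product of the gate probabilities along its path.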
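For reference, the two-level model is usually written as a nested weighted sum. The notation below (x for the input, g_i for the top-level gate, g_{j|i} for the gate inside branch i, \mu_{ij} for the expert outputs, v_i and v_{ij} for the gate parameters) follows the common presentation of the paper and is worth checking against the source before relying on it.

```latex
\mu(x) = \sum_{i} g_i(x) \sum_{j} g_{j|i}(x)\, \mu_{ij}(x),
\qquad
g_i(x) = \frac{\exp(v_i^{\top} x)}{\sum_{k} \exp(v_k^{\top} x)},
\qquad
g_{j|i}(x) = \frac{\exp(v_{ij}^{\top} x)}{\sum_{k} \exp(v_{ik}^{\top} x)}
```

Each gate is a softmax, so the weights g_i g_{j|i} over all leaves sum to 1. A flat mixture is the special case with a single level of gating.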

That mirrors how humans route work: first decide the category, then pick the specialist. For complex domains, hierarchy can make routing more interpretable and modular.

Today, this idea shows up outside neural layers too. Model cascades, agent teams, domain routers, and tool-selection trees all use broad-to-specific delegation.

What to remember

One-liner

Routing can be broad-to-specific.

Why it matters

Hierarchy makes specialization modular.

Builder instinct

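Routing errors compound down the tree: once a gate commits to the wrong branch, the gates below it can only choose among the wrong specialists.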
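A rough way to see the compounding, assuming hard routing and gates that err independently (both are simplifications, and path_accuracy is a hypothetical helper): the chance of reaching the right leaf is the per-level accuracy raised to the depth of the tree.

```python
def path_accuracy(per_level_accuracy, depth):
    """Probability that every gate on a root-to-leaf path chooses correctly,
    assuming each level errs independently of the others."""
    return per_level_accuracy ** depth

for depth in (1, 2, 3, 4):
    print(depth, round(path_accuracy(0.9, depth), 4))
# prints:
# 1 0.9
# 2 0.81
# 3 0.729
# 4 0.6561
```

Under these assumptions, a gate that is right 90% of the time delivers the right specialist only about 73% of the time at depth 3, which is one reason to keep router trees shallow.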

Read it like this

First pass: Focus on the tree of gates.
Second pass: Ask what each level of the hierarchy is supposed to learn.
Then build taste: Connect it to modern router models and multi-agent task delegation.

Build instinct

Design a router tree for local AI tasks: coding, research, summarization, extraction, and numeric analysis.
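One possible starting point for this exercise is sketched below. Everything in it is an assumption made for illustration: the two branch names ("language" and "technical"), the keyword lists, and the keyword-overlap scoring, which stands in for whatever learned or model-based gate a real system would use.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    keywords: frozenset = frozenset()
    children: list = field(default_factory=list)

def kw(*words):
    return frozenset(words)

# Hypothetical tree: a broad split first, then the five specialist tasks as leaves.
TREE = Node("root", children=[
    Node("language",
         kw("summarize", "summary", "research", "find", "sources",
            "extract", "fields", "article", "document"),
         [Node("research", kw("research", "find", "sources", "compare")),
          Node("summarization", kw("summarize", "summary", "shorten")),
          Node("extraction", kw("extract", "fields", "parse", "entities"))]),
    Node("technical",
         kw("code", "function", "bug", "python",
            "compute", "mean", "average", "statistics", "numbers"),
         [Node("coding", kw("code", "function", "bug", "python", "refactor")),
          Node("numeric_analysis",
               kw("compute", "mean", "average", "statistics", "numbers"))]),
])

def route(text, node=TREE):
    """Walk from the root to a leaf, picking the child with the most keyword hits.
    Ties (including zero hits) fall to the first child, a known weakness of this toy."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    path = []
    while node.children:
        node = max(node.children, key=lambda child: len(child.keywords & words))
        path.append(node.name)
    return path

print(route("Summarize this article"))               # ['language', 'summarization']
print(route("Fix the bug in this Python function"))  # ['technical', 'coding']
```

A useful follow-up is to look for requests that land in the wrong top-level branch, since no leaf below that branch can recover from the mistake.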
