Hierarchical Mixtures of Experts
Extends mixture-of-experts into hierarchical gating structures.
Hierarchical MoE routes inputs through a tree: broad decision first, specialist decision second.
It shows that expert routing can be structured, not just a flat pick among specialists.
Hierarchical MoE routes like an org chart: category first, specialist second.
The quick digest
Flat MoE asks one gate to choose among experts. Hierarchical MoE organizes experts into a tree. An input first goes through a broad gate, then a narrower one, until it reaches the relevant specialist path.
That mirrors how humans route work: first decide the category, then pick the specialist. For complex domains, hierarchy can make routing more interpretable and modular, since each gate only has to make one narrower decision that can be inspected on its own.
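To make the tree of gates concrete, here is a minimal sketch of a two-level hierarchy in plain Python. It assumes soft gating: every expert contributes, weighted by the product of the gate probabilities along its path (top-level gate times the gate inside its group). The class name `TwoLevelHME`, the linear gates and experts, and the random weights are all illustrative assumptions, not a reference implementation.

```python
import math
import random


def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


class TwoLevelHME:
    """Hypothetical two-level hierarchy: one broad gate, one narrow gate per group."""

    def __init__(self, dim, n_groups, n_experts_per_group, seed=0):
        rng = random.Random(seed)

        def rand_vec():
            return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

        # Assumption: gates and experts are simple linear maps with random weights.
        self.top_gate = [rand_vec() for _ in range(n_groups)]
        self.sub_gates = [[rand_vec() for _ in range(n_experts_per_group)]
                          for _ in range(n_groups)]
        self.experts = [[rand_vec() for _ in range(n_experts_per_group)]
                        for _ in range(n_groups)]

    def path_weights(self, x):
        """Weight of each root-to-expert path: broad gate times narrow gate."""
        top = softmax([dot(w, x) for w in self.top_gate])
        weights = []
        for i, g_i in enumerate(top):
            sub = softmax([dot(w, x) for w in self.sub_gates[i]])
            weights.append([g_i * g_ji for g_ji in sub])
        return weights

    def forward(self, x):
        """Blend all expert outputs using the path weights."""
        weights = self.path_weights(x)
        return sum(weights[i][j] * dot(self.experts[i][j], x)
                   for i in range(len(weights))
                   for j in range(len(weights[i])))


model = TwoLevelHME(dim=3, n_groups=2, n_experts_per_group=3)
x = [0.5, -1.0, 2.0]
weights = model.path_weights(x)
y = model.forward(x)
```

Because each level is a softmax, the path weights over all experts sum to 1, so the output is a weighted average of the expert outputs. A hard-routing variant would instead follow only the highest-scoring branch at each level.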
Today, this idea shows up outside neural layers too. Model cascades, agent teams, domain routers, and tool-selection trees all use broad-to-specific delegation.
What to remember
Read it like this
- First pass: Focus on the tree of gates.
- Second pass: Ask what each level of the hierarchy is supposed to learn.
- Then build taste: Connect it to modern router models and multi-agent task delegation.
Design a router tree for local AI tasks: coding, research, summarization, extraction, and numeric analysis.
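One possible starting point for this exercise, sketched in plain Python. It uses hard routing (follow the best-scoring child at each level) with naive keyword counts standing in for learned gates. The grouping into `technical` and `documents`, the keyword lists, and the names `Node` and `route` are all assumptions made for illustration; a real router would likely use a classifier or a small model at each gate.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A gate (if it has children) or a specialist leaf (if it has none)."""
    name: str
    keywords: tuple = ()
    children: list = field(default_factory=list)

    def score(self, text):
        # Naive stand-in for a learned gate: count keyword hits in the request.
        words = text.lower().split()
        return sum(words.count(k) for k in self.keywords)


def route(node, text):
    """Walk from the root to a leaf, picking the best-scoring child each time.

    Ties go to the first child listed, because max() keeps the first maximum.
    """
    path = [node.name]
    while node.children:
        node = max(node.children, key=lambda child: child.score(text))
        path.append(node.name)
    return path


# Hypothetical tree: a broad category first, then the specialist.
TREE = Node("root", children=[
    Node("technical",
         ("code", "bug", "function", "calculate", "average", "statistics"),
         [
             Node("coding", ("code", "bug", "function")),
             Node("numeric_analysis", ("calculate", "average", "statistics")),
         ]),
    Node("documents",
         ("sources", "compare", "find", "summarize", "shorten", "extract", "fields"),
         [
             Node("research", ("sources", "compare", "find")),
             Node("summarization", ("summarize", "shorten")),
             Node("extraction", ("extract", "fields")),
         ]),
])

print(route(TREE, "Fix the bug in this function"))
# ['root', 'technical', 'coding']
print(route(TREE, "Extract the invoice fields from this PDF"))
# ['root', 'documents', 'extraction']
```

The returned path doubles as a trace of which gate made which decision, which is where the interpretability benefit shows up. Worth asking as you refine it: what should happen when no keyword matches at some level, and should a request ever be sent down two branches at once?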