PAPER 19 · Open MoE

Mixtral of Experts

Mistral AI · 2024 · Paper

An open-weight mixture-of-experts (MoE) model that showed sparse models can match dense-model quality while activating only a fraction of their parameters per token.

Core concept

Mixtral brought open-weight MoE into practical LLM use: each layer holds several experts (eight in Mixtral 8x7B), and a router activates only a few of them (two) for each token.

Why it mattered

It made sparse open models feel deployable rather than purely academic.

Visual shortcut · Open MoE in practice (diagram: a router in front of experts A, B, C, and D)

Mixtral made sparse expert models feel tangible for open-model builders.

How it works
Route each token to a subset of experts.
Run only active experts.
Combine their outputs.
Compare quality and serving cost against dense models.
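
The routing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not Mixtral's actual implementation: the helper names (make_expert, route_token), the toy "experts" that merely scale their input, and the hand-picked router weights are all assumptions made for the example. It scores every expert, keeps the top k, normalises the gate weights over the selected experts only, and mixes their outputs.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical toy expert: scales the token vector by a constant."""
    return lambda vector: [scale * x for x in vector]


def route_token(token, router_weights, experts, k=2):
    """Send one token through its top-k experts and mix the results."""
    # Router: one logit per expert (dot product with that expert's router row).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the k highest-scoring experts.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Gate weights are normalised over the selected experts only.
    gates = softmax([logits[i] for i in chosen])
    # Run just the active experts and combine their weighted outputs.
    output = [0.0] * len(token)
    for gate, index in zip(gates, chosen):
        expert_output = experts[index](token)
        output = [o + gate * e for o, e in zip(output, expert_output)]
    return output, chosen, gates


# Toy setup (assumed values): four experts, two-dimensional tokens.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
output, chosen, gates = route_token([0.5, 2.0], router_weights, experts, k=2)
print(chosen, gates, output)
```

Only the two chosen experts are ever called, which is where the compute saving comes from. All four still have to exist in memory, because the next token may pick a different pair.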

The quick digest

Mixtral uses a sparse expert architecture where each token routes through a small subset of experts. The model has a large total parameter count, but inference only activates part of it for each token.

The exciting part was not just benchmark quality. It was the open-weight deployment story: builders could actually run and compare a strong MoE model, then see the tradeoffs between memory footprint, active compute, serving speed, and answer quality.

Mixtral is the paper to read to understand why a single “parameter count” stopped being a complete model descriptor. For a sparse model, ask four questions: how many parameters in total, how many are active per token, how much memory the weights need, and which serving stacks support it.
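
To make the total-versus-active distinction concrete, here is a small back-of-the-envelope sketch. The figures are the rounded numbers reported for Mixtral 8x7B (about 47B total parameters, about 13B active per token). The function name sparse_model_profile and the default of 2 bytes per parameter (16-bit weights) are assumptions for illustration, and the estimate covers weights only, ignoring the KV cache and activations.

```python
def sparse_model_profile(total_params, active_params, bytes_per_param=2):
    """Rough weight-memory and active-fraction estimate for a sparse model."""
    return {
        # Memory must hold every expert, so it follows the total count.
        "weight_memory_gb": total_params * bytes_per_param / 1e9,
        # Per-token compute follows the active count instead.
        "active_fraction": active_params / total_params,
    }


# Rounded Mixtral 8x7B figures; 16-bit weights are an assumption.
profile = sparse_model_profile(total_params=47e9, active_params=13e9)
print(f"{profile['weight_memory_gb']:.0f} GB of weights, "
      f"{profile['active_fraction']:.0%} of parameters active per token")
```

Under these assumptions the model needs memory on the scale of a roughly 47B dense model while doing per-token compute closer to a roughly 13B one.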

What to remember

One-liner
Open MoE made sparse models tangible.
Why it matters
Memory scales with total parameters; compute per token scales with active parameters.
Builder instinct
The quality of the serving stack decides whether MoE's savings show up in practice.

Read it like this

Build instinct

Run a Mixtral-family model if available and compare memory use, tokens/sec, and output quality against a dense model.
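
A minimal harness for the tokens/sec part of that comparison might look like the sketch below. It assumes you can wrap each model in a callable generate(prompt, max_new_tokens) that returns the list of generated tokens. That callable, the name measure_throughput, and the stand-in fake_generate are hypothetical and not part of any particular library. Memory use and output quality need to be measured separately with whatever tools your serving stack provides.

```python
import time


def measure_throughput(generate, prompts, max_new_tokens=64):
    """Time a generate callable over some prompts and report tokens/sec."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        # generate is assumed to return the list of newly generated tokens.
        total_tokens += len(generate(prompt, max_new_tokens))
    elapsed = time.perf_counter() - start
    rate = total_tokens / elapsed if elapsed > 0 else float("inf")
    return {"tokens": total_tokens, "seconds": elapsed, "tokens_per_sec": rate}


def fake_generate(prompt, max_new_tokens):
    """Stand-in model so the harness runs without any weights loaded."""
    return ["tok"] * max_new_tokens


result = measure_throughput(fake_generate, ["hello", "world"], max_new_tokens=8)
print(result)
```

Run the same prompts and the same max_new_tokens through both the sparse and the dense model, on the same hardware, so the tokens/sec figures are comparable.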
