PAPER 19 · Open MoE

Mixtral of Experts

Mistral AI 2024 Paper

Open-weight MoE that showed a sparse model can reach dense-model quality while activating only a fraction of its parameters per token.

Core concept

Mixtral brought open-weight MoE into practical LLM use: eight experts per layer, only two active per token.

Why it mattered

It made sparse open models feel deployable rather than purely academic.

Visual shortcut · Open MoE in practice

Mixtral made sparse expert models feel tangible for open-model builders.

How it works

Route each token to a subset of experts.

Run only active experts.

Combine their outputs.

Compare quality and serving cost against dense models.
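The first three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not Mixtral's implementation: the experts are hypothetical functions that only scale their input, the router scores are made-up numbers passed in directly (in a real model they come from a learned router layer), and the names `top_k_route` and `moe_forward` are invented for this example. The gating uses a softmax over the selected top-k scores, which matches the form described in the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and normalise their scores.

    Softmax is taken over the selected logits only, so the k weights sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(token, router_logits, experts, k=2):
    """Run only the routed experts and return their weighted sum."""
    output = [0.0] * len(token)
    for index, weight in top_k_route(router_logits, k):
        expert_out = experts[index](token)  # unselected experts are never called
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output

# Toy setup (all values hypothetical): 8 "experts" that just scale the input.
calls = []

def make_expert(index):
    def expert(vector):
        calls.append(index)  # record which experts actually ran
        return [(index + 1) * x for x in vector]
    return expert

experts = [make_expert(i) for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 0.2]  # made-up router scores
token = [1.0, 2.0]

routes = top_k_route(router_logits, k=2)
output = moe_forward(token, router_logits, experts, k=2)
print("routes:", routes)
print("experts that ran:", calls)
print("output:", output)
```

Only two of the eight toy experts are ever called, which is the whole point: compute scales with the number of active experts, not the total.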

The quick digest

Mixtral uses a sparse mixture-of-experts architecture where a router sends each token through a small subset of experts in every layer. The model has a large total parameter count (about 47B), but inference activates only part of it (about 13B) for each token.

The exciting part was not just benchmark quality. It was the open-weight deployment story: builders could actually run and compare a strong MoE model, then see the tradeoffs between memory footprint, active compute, serving speed, and answer quality.

Mixtral is the paper you read to understand why “parameter count” stopped being a complete model descriptor. For sparse models, you always ask four things: total parameters, active parameters, memory required, and serving support.
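A quick back-of-the-envelope calculation shows why total and active parameters answer different questions. The sketch below assumes the approximate figures reported for Mixtral 8x7B (about 47B total, about 13B active per token) and 2 bytes per parameter for 16-bit weights. It counts weights only and ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound. The function name `describe_sparse_model` is made up for this example.

```python
def describe_sparse_model(total_params, active_params, bytes_per_param=2):
    """Separate the memory question (total) from the compute question (active)."""
    return {
        # Every expert must be resident, because different tokens pick different experts.
        "weight_memory_gb": total_params * bytes_per_param / 1e9,
        # Per-token compute tracks only the parameters a token actually touches.
        "active_fraction": active_params / total_params,
    }

# Approximate, rounded figures; assumptions for illustration.
TOTAL_PARAMS = 47e9
ACTIVE_PARAMS = 13e9

summary = describe_sparse_model(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"weights at 16-bit: ~{summary['weight_memory_gb']:.0f} GB")
print(f"active per token:  ~{summary['active_fraction']:.0%} of parameters")
```

Under these assumptions the model needs the memory of a roughly 47B-parameter model while doing the per-token arithmetic of a roughly 13B-parameter one.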

What to remember

One-liner

Open MoE made sparse models tangible.

Why it matters

Memory and active compute are separate concerns.

Builder instinct

Serving quality decides whether MoE feels efficient.

Read it like this

First pass: Look at total versus active parameters.
Second pass: Compare benchmarks with dense models of similar inference cost.
Then build taste: Ask how much hardware memory is required even though only some experts activate per token.

Build instinct

Run a Mixtral-family model if available and compare memory use, tokens/sec, and output quality against a dense model.
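One way to structure the tokens/sec part of that comparison is a small timing harness that is independent of any serving stack. The sketch below assumes a hypothetical `generate(prompt, max_new_tokens)` callable that returns the list of generated tokens; you would wrap whatever inference library you actually use behind that interface. The two stub generators only sleep for a fixed time to stand in for real models, so the numbers they produce are meaningless.

```python
import time

def tokens_per_second(generate, prompt, max_new_tokens, runs=3):
    """Return the best observed throughput over several runs.

    `generate` is a hypothetical callable: (prompt, max_new_tokens) -> list of tokens.
    """
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(tokens) / elapsed)
    return best

# Stand-in "models": they sleep, then return placeholder tokens.
def make_stub(seconds_per_call):
    def generate(prompt, max_new_tokens):
        time.sleep(seconds_per_call)
        return ["tok"] * max_new_tokens
    return generate

fast_stub = make_stub(0.01)
slow_stub = make_stub(0.05)

fast_tps = tokens_per_second(fast_stub, "hello", max_new_tokens=16)
slow_tps = tokens_per_second(slow_stub, "hello", max_new_tokens=16)
print(f"fast stub: {fast_tps:.0f} tok/s, slow stub: {slow_tps:.0f} tok/s")
```

Use the same prompt, token budget, and hardware for both models, and record peak memory and output quality separately, since throughput alone does not show whether the sparse model is the better deal.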
