PAPER 06 · Compute-optimal training

Training Compute-Optimal Large Language Models

Hoffmann et al., 2022 · Paper

The Chinchilla paper: for a fixed compute budget, train smaller models on more tokens.

Core concept

Many famous large models were too big for how little they were trained; for the same compute, smaller models trained on more tokens could perform better.

Why it mattered

It shifted attention from parameter count to compute-optimal balance between model size and data.

Visual shortcut · Feed the model enough
(Diagram: parameters and tokens; compute-optimal means balanced.)

Chinchilla says the biggest model is not always the best use of compute; the model also needs enough reading experience.

How it works
Notice models were parameter-heavy.
Train a smaller model on many more tokens.
Compare at the same compute budget.
Conclude balance beats bragging rights.

The quick digest

Chinchilla is the “you are feeding the model wrong” paper. Earlier scaling work encouraged bigger models, but DeepMind found that many models had grown parameters faster than training data. They were like giant students who had not read enough books.

The Chinchilla model has 70 billion parameters, a quarter the size of DeepMind's 280-billion-parameter Gopher, but it was trained on about 1.4 trillion tokens instead of roughly 300 billion. At about the same compute budget, it performed better because the budget was balanced: enough capacity, enough experience.
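To make "the same compute budget" concrete, here is a minimal sketch that uses the common rule of thumb that training costs about 6 FLOPs per parameter per token (C ≈ 6ND). The model sizes are rounded public figures (Gopher: 280B parameters, about 300B tokens; Chinchilla: 70B parameters, about 1.4T tokens), and the helper name `train_flops` is made up for illustration. Treat the output as a back-of-the-envelope estimate, not a measured cost.

```python
# Rule of thumb (an approximation): training FLOPs C ≈ 6 * N * D,
# where N is the parameter count and D is the number of training tokens.

def train_flops(params: float, tokens: float) -> float:
    """Rough training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Rounded, publicly reported sizes (assumed here for illustration).
models = {
    "Gopher": {"params": 280e9, "tokens": 300e9},
    "Chinchilla": {"params": 70e9, "tokens": 1.4e12},
}

summary = {}
for name, m in models.items():
    summary[name] = {
        "flops": train_flops(m["params"], m["tokens"]),
        "tokens_per_param": m["tokens"] / m["params"],
    }
    print(
        f"{name}: {summary[name]['flops']:.2e} FLOPs, "
        f"{summary[name]['tokens_per_param']:.1f} tokens per parameter"
    )
```

Under this approximation the two budgets land within roughly 20% of each other, while Chinchilla sees about 20 tokens per parameter against roughly 1 for Gopher. That ratio is the imbalance the paper points at.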

This matters for local AI because it teaches taste. Bigger is not automatically better. A smaller model trained or fine-tuned on the right data can beat a larger one that is undertrained, poorly curated, or mismatched to the task.

What to remember

One-liner
Bigger models can be underfed.
Why it matters
More high-quality tokens can beat more parameters.
Builder instinct
Compute-optimal means balanced, not largest.

Read it like this

Build instinct

Train two toy models with the same compute budget: a bigger model on fewer tokens versus a smaller model on more tokens. Then compare their loss on the same held-out data.
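One way to set up that experiment is to fix a FLOP budget first and derive each model's token count from it. The sketch below does only the planning step, again assuming the C ≈ 6ND approximation (which is rough for tiny models, since it ignores embedding and attention costs). The function names, the toy sizes, and the budget are all hypothetical.

```python
FLOPS_PER_PARAM_PER_TOKEN = 6  # rule-of-thumb constant, an assumption


def tokens_for_budget(flops_budget: int, params: int) -> int:
    """Tokens a model with `params` parameters can see within `flops_budget`."""
    return flops_budget // (FLOPS_PER_PARAM_PER_TOKEN * params)


def plan_pair(flops_budget: int, big_params: int, small_params: int) -> dict:
    """Token counts for a big and a small model at (approximately) equal compute."""
    return {
        "big": {
            "params": big_params,
            "tokens": tokens_for_budget(flops_budget, big_params),
        },
        "small": {
            "params": small_params,
            "tokens": tokens_for_budget(flops_budget, small_params),
        },
    }


# Hypothetical toy setup: a 10M-parameter model versus a 2.5M-parameter one.
# The budget is chosen so the big model sees 100M tokens.
budget = 6 * 10_000_000 * 100_000_000
plan = plan_pair(budget, big_params=10_000_000, small_params=2_500_000)

for name, cfg in plan.items():
    print(f"{name}: {cfg['params']:,} params, {cfg['tokens']:,} tokens")
```

Because the budget is fixed, shrinking the model by 4x buys 4x the tokens. Whichever run reaches the lower held-out loss shows which side of the balance your budget was on.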
