PAPER 05 · Scaling laws

Scaling Laws for Neural Language Models

Kaplan et al. 2020 Paper

One of the first clean empirical scaling frameworks relating language model loss to parameters, data, and compute.

Core concept

Loss falls predictably, roughly as a power law, as you spend more compute, use more data, and train bigger models.

Why it mattered

It gave labs a planning map: if you know your compute budget, you can estimate which mix of model size and data should give the lowest loss.

Visual shortcut · Training as budget math
[Figure: diagram with the labels "compute", "lower loss", "sweet spot", and "small model"]

Scaling laws turn model training into resource allocation: decide how to spend compute across size and data.

How it works
Train many models at different sizes.
Measure loss across compute budgets.
Fit smooth empirical curves.
Use the curves to plan the next training run.
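
The four steps above can be sketched in a few lines. This is a minimal illustration, not the paper's fitting code: the measurements are synthetic (generated from a made-up power law with a little noise), and the helper names `fit_power_law` and `predict_loss` are hypothetical. It fits a straight line in log-log space, which is one simple way to fit a power law, then extrapolates to a larger model.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ~ a * x**(-alpha) by least squares in log-log space."""
    log_x = np.log(np.asarray(x, dtype=float))
    log_loss = np.log(np.asarray(loss, dtype=float))
    slope, intercept = np.polyfit(log_x, log_loss, deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(a, alpha, x):
    """Evaluate the fitted curve at new values of x."""
    return a * np.asarray(x, dtype=float) ** (-alpha)

# Steps 1-2: "train many models, measure loss". These numbers are SYNTHETIC,
# drawn from an assumed power law (a=20, alpha=0.08) with small noise.
model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
true_a, true_alpha = 20.0, 0.08
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.005, size=model_sizes.size))
measured_loss = true_a * model_sizes ** (-true_alpha) * noise

# Step 3: fit a smooth empirical curve.
a_fit, alpha_fit = fit_power_law(model_sizes, measured_loss)

# Step 4: use the curve to plan the next, larger run.
next_run_loss = float(predict_loss(a_fit, alpha_fit, 1e9))

print(f"fitted exponent alpha = {alpha_fit:.3f}")
print(f"predicted loss at 1e9 parameters = {next_run_loss:.3f}")
```

Real measurements are noisier and eventually bend away from a pure power law, so an extrapolation like this is a planning estimate rather than a guarantee.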

The quick digest

This paper treats language model training like an empirical science of budgets. Instead of asking “what architecture trick should we try?” it asks: what happens if we systematically vary model size, dataset size, and compute?

The answer is that loss follows surprisingly smooth curves. Bigger models, more data, and more compute tend to improve performance in predictable ways. That made scaling feel less like superstition and more like capital allocation.
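
In symbols, the smooth curves are power laws. As a sketch of the form the paper fits when the other two resources are not the bottleneck, with N the number of (non-embedding) parameters, D the dataset size in tokens, and C_min the compute spent when it is allocated efficiently:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) \approx \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
```

The constants N_c, D_c, C_c^min and the exponents are fitted to experiments rather than derived. The fitted exponents are small, on the order of 0.05 to 0.1, which is why each constant-factor drop in loss takes a large multiplicative increase in resources. Check the source for the exact fitted values.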

The important nontechnical lesson: there is no free lunch. If you make the model bigger but do not feed it enough data, or if you have data but not enough compute, you leave performance on the table. Scaling laws are about finding the efficient frontier.
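
To make the efficient frontier concrete, here is a rough sketch of budget math. It assumes a joint loss of the form the paper proposes for model size N and data D, plus the common approximation that training compute is about 6 · N · D FLOPs. The constants below are placeholders chosen for illustration, not the paper's fitted values, and `best_split` is a hypothetical helper that simply scans model sizes at a fixed budget.

```python
import numpy as np

# PLACEHOLDER constants for illustration only (not the paper's fitted values).
ALPHA_N, ALPHA_D = 0.08, 0.10
N_C, D_C = 1e14, 1e13

def loss(n_params, n_tokens):
    """Joint loss of the form [(N_c/N)^(a_N/a_D) + D_c/D]^a_D."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

def best_split(compute_flops, n_grid=400):
    """Scan model sizes at a fixed budget, assuming compute ~ 6 * N * D."""
    n_params = np.logspace(5, 12, n_grid)          # candidate model sizes
    n_tokens = compute_flops / (6.0 * n_params)    # tokens the budget still buys
    losses = loss(n_params, n_tokens)
    i = int(np.argmin(losses))
    return float(n_params[i]), float(n_tokens[i]), float(losses[i])

for budget in (1e16, 1e18, 1e20):
    n, d, l = best_split(budget)
    print(f"budget {budget:.0e} FLOPs -> N ~ {n:.2e}, D ~ {d:.2e}, loss ~ {l:.3f}")
```

At each budget the scan finds an interior optimum: a model that is too small wastes data, and a model that is too large starves for tokens. Where exactly the optimum sits depends entirely on the fitted constants, so the printed numbers here say nothing about real models.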

What to remember

One-liner
Scaling is predictable enough to plan around.
Why it matters
Parameters, data, and compute are a budget triangle.
Builder instinct
Loss curves help you plan training runs; they do not tell you which product to build.

Read it like this

Build instinct

Make a tiny scaling experiment: train small models on different token counts and plot validation loss.
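
One way to try this on a laptop, sketched under heavy assumptions: a smoothed bigram counter stands in for a small neural model, and the data comes from a randomly generated Markov chain rather than real text. Every name here (`sample_tokens`, `train_bigram`, `validation_loss`) is hypothetical. The point is only the shape of the experiment: fix a validation set, vary the number of training tokens, and record validation loss.

```python
import numpy as np

VOCAB = 50
rng = np.random.default_rng(0)

# Assumed data source: a random first-order Markov chain over VOCAB tokens.
transition = rng.dirichlet(np.full(VOCAB, 0.1), size=VOCAB)
cdf = np.cumsum(transition, axis=1)

def sample_tokens(n_tokens, rng):
    """Draw a token stream from the Markov chain by inverse-CDF sampling."""
    tokens = np.empty(n_tokens, dtype=np.int64)
    tokens[0] = rng.integers(VOCAB)
    uniforms = rng.random(n_tokens)
    for i in range(1, n_tokens):
        nxt = np.searchsorted(cdf[tokens[i - 1]], uniforms[i])
        tokens[i] = min(nxt, VOCAB - 1)  # guard against float rounding at the top
    return tokens

def train_bigram(tokens):
    """Stand-in 'model': bigram probabilities with add-one smoothing."""
    counts = np.zeros((VOCAB, VOCAB))
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    return (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + VOCAB)

def validation_loss(probs, tokens):
    """Average next-token cross-entropy in nats."""
    return float(-np.mean(np.log(probs[tokens[:-1], tokens[1:]])))

train_stream = sample_tokens(30_000, rng)
val_stream = sample_tokens(5_000, rng)

token_counts = [300, 1_000, 3_000, 10_000, 30_000]
losses = [validation_loss(train_bigram(train_stream[:n]), val_stream)
          for n in token_counts]

for n, l in zip(token_counts, losses):
    print(f"{n:>6} training tokens -> validation loss {l:.3f}")
```

Plotting `token_counts` against `losses` on log-log axes is the natural next step. A bigram counter saturates quickly once it has seen enough data, so expect the curve to flatten; swapping in small neural models of different sizes is what turns this into a real scaling experiment.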
