PAPER 22 · Synthetic data

Textbooks Are All You Need

Gunasekar et al., 2023 · Paper

High-quality synthetic “textbook” data can let small models outperform much larger models on targeted domains.

Core concept

Small models can become surprisingly strong when trained on clean, educational, synthetic “textbook” data.

Why it mattered

It showed that data quality and curriculum design can substitute for some raw scale on targeted skills.

Visual shortcut · Teach small models well

The paper says small models can learn a lot when the training data is more like a good class than a web scrape.

How it works

Generate high-quality lesson-like data.

Include examples and exercises.

Train a compact model.

Evaluate on the targeted skill.
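The first two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the topic and audience lists, the prompt wording, and the helper names `build_lesson_prompts` and `to_training_record` are all hypothetical, and the call to a generator model is left out. The sketch only shows one way to vary prompts so lessons do not all look alike, and one way to pack a lesson with its exercise into a training record.

```python
import itertools
import json
import random

# Hypothetical curriculum: replace with topics and audiences for your own narrow skill.
TOPICS = ["list comprehensions", "recursion", "string slicing"]
AUDIENCES = ["a first-year student", "a self-taught hobbyist"]


def build_lesson_prompts(topics, audiences, seed=0):
    """Cross topics with audiences so generated lessons vary in content and tone."""
    rng = random.Random(seed)  # fixed seed keeps the prompt order reproducible
    pairs = list(itertools.product(topics, audiences))
    rng.shuffle(pairs)
    return [
        f"Write a short textbook section on {topic} for {audience}. "
        "Explain the idea, show one worked example, "
        "then give one exercise with a solution."
        for topic, audience in pairs
    ]


def to_training_record(lesson_text, exercise, solution):
    """Pack one lesson and its exercise into a single JSON line for fine-tuning."""
    text = f"{lesson_text}\n\nExercise: {exercise}\n\nSolution:\n{solution}"
    return json.dumps({"text": text})


prompts = build_lesson_prompts(TOPICS, AUDIENCES)
record = to_training_record(
    "A list comprehension builds a list from an iterable in one expression.",
    "Square every number in [1, 2, 3].",
    "[n * n for n in [1, 2, 3]]",
)
print(len(prompts))
```

Each prompt would go to whatever generator model you use, and each returned lesson would pass through `to_training_record` before training. Reviewing a sample of records by hand before training is a cheap way to catch lessons that are wrong or repetitive.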

The quick digest

This paper asks: what if, instead of feeding a small model random scraped code, we feed it high-quality lessons, examples, and exercises? The authors filter existing code for educational value and generate textbook-style synthetic lessons and exercises designed to teach programming concepts clearly.

The result is that small models can perform far above what their size suggests when the training data is concentrated and pedagogical. The paper's 1.3B-parameter model, phi-1, reaches 50.6% pass@1 on HumanEval. The model is not bigger; the lesson plan is better.

For local AI, this is one of the most important intuitions. If your task is narrow, you may not need a giant model. You may need a small model trained on the right curriculum with the right evals.
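For code tasks, "the right evals" usually means functional correctness: does a generated solution pass its tests? A common way to report this is pass@k, the chance that at least one of k sampled solutions passes. Below is a small sketch of the standard unbiased estimator used with benchmarks such as HumanEval, where n samples are drawn per problem and c of them pass. The function name and the input checks are illustrative choices.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples passes), given c of n samples passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples must include a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of them pass the tests.
print(round(pass_at_k(10, 3, 1), 4))
print(round(pass_at_k(10, 3, 5), 4))
```

With k = 1 the estimate reduces to the plain fraction of samples that pass, which is the pass@1 number most papers quote. The per-problem estimates are then averaged over the benchmark.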

What to remember

One-liner

Small models can learn deeply from clean curricula.

Why it matters

Synthetic data works when it teaches, not when it floods.

Builder instinct

For narrow skills, dataset design can beat scale.

Read it like this

First pass: Focus on how the dataset was constructed.
Second pass: Compare model size against benchmark performance.
Then build taste: Ask which workflows could use a “textbook” dataset.

Build instinct

Create 100 high-quality examples for one narrow skill, fine-tune a small model on them, and evaluate it against a larger general model on the same held-out tasks.
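A comparison like this needs a harness that scores both models on the same tasks. The sketch below is one possible shape for it, under loud assumptions: the two tasks are toy examples, and `stub_small` and `stub_large` return canned answers so the harness runs end to end. They say nothing about how real models behave. In practice each stub would be replaced by a call to your fine-tuned small model or the larger baseline.

```python
from typing import Callable, Dict, List, Tuple

# Each task: (prompt, function name, list of (args, expected) checks). Toy examples only.
Task = Tuple[str, str, List[Tuple[tuple, object]]]

TASKS: List[Task] = [
    ("Write add(a, b) returning the sum.", "add", [((1, 2), 3), ((-1, 1), 0)]),
    ("Write is_even(n) returning True for even n.", "is_even", [((4,), True), ((7,), False)]),
]


def passes(code: str, func_name: str, checks) -> bool:
    """Run generated code in a scratch namespace and compare outputs to expectations."""
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)  # untrusted model output should run in a sandbox
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in checks)
    except Exception:
        return False  # syntax errors, missing functions, and crashes all count as failures


def score(model: Callable[[str], str], tasks: List[Task]) -> float:
    """Fraction of tasks whose single generated solution passes every check."""
    solved = sum(passes(model(prompt), name, checks) for prompt, name, checks in tasks)
    return solved / len(tasks)


# Canned stand-ins so the harness is runnable; NOT real model behavior.
def stub_small(prompt: str) -> str:
    if "add" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def is_even(n):\n    return n % 2 == 0"


def stub_large(prompt: str) -> str:
    if "add" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def is_even(n):\n    return n % 2 == 1"  # deliberately wrong, to exercise the scorer


print(score(stub_small, TASKS), score(stub_large, TASKS))
```

Keeping the evaluation tasks separate from the 100 training examples matters here: if the small model has seen the eval tasks during fine-tuning, the comparison stops measuring the skill and starts measuring memorization.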
