PAPER 11 · Instruction tuning

Training Language Models to Follow Instructions with Human Feedback

Ouyang et al., 2022 · Paper

The modern post-training blueprint: supervised instruction tuning plus preference optimization/RLHF.

Core concept

InstructGPT showed that assistant behavior is trained with human feedback, not automatically produced by pretraining.

Why it mattered

It explains the leap from autocomplete model to helpful chatbot product.

Visual shortcut · Autocomplete becomes assistant
[Diagram labels: demo data, reward model, PPO tune, rank outputs. Caption: human feedback reshapes autocomplete.]

Instruction tuning shapes a model from “continue this text” toward “help this user.”

How it works
Collect examples of good assistant answers.
Ask humans to compare outputs.
Train a reward model from preferences.
Optimize the model toward preferred behavior.
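
The last two steps can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_reward` is a hypothetical hand-written scorer standing in for a learned reward model, and `best_of_n` is a deliberately simpler substitute for the PPO step (it picks the best of several candidates instead of updating model weights). Only the pairwise loss, which rewards ranking the human-preferred answer higher, reflects the standard preference-modeling idea.

```python
import math

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)): small when the reward model
    already ranks the human-preferred answer higher."""
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def toy_reward(answer):
    """Hypothetical stand-in for a learned reward model: favors answers that
    contain a concrete step and penalizes length."""
    has_step = 1.0 if "step" in answer.lower() else 0.0
    return has_step - 0.01 * len(answer.split())

def best_of_n(candidates, reward_fn):
    """Simplified stand-in for the optimization step (not PPO): keep the
    candidate that the reward function scores highest."""
    return max(candidates, key=reward_fn)

# Made-up candidate answers for illustration only.
candidates = [
    "It depends on many things and there is a lot one could say about it.",
    "Step 1: open Settings. Step 2: choose Reset.",
]
chosen = best_of_n(candidates, toy_reward)
print(chosen)
print(pairwise_loss(toy_reward(candidates[1]), toy_reward(candidates[0])))
```

Best-of-n is used here only because it fits in a few lines; a real pipeline would instead update the model's weights so that high-reward answers become more likely.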

The quick digest

A base language model is trained to continue text. That is not the same as following instructions, being concise, admitting uncertainty, or giving users what they actually wanted. InstructGPT adds a behavior-shaping pipeline on top of the base model.

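The pipeline is simple in spirit: humans write good answers, humans compare model answers, a reward model learns those preferences, and the language model is optimized toward answers people prefer. The resulting model can be much smaller than a base model and still be preferred: in the paper's human evaluations, outputs from a 1.3B-parameter InstructGPT model were preferred over outputs from the 175B-parameter GPT-3.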
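
The final "optimize toward answers people prefer" step is usually written as reward maximization with a penalty for drifting away from the supervised model. The form below is a sketch of that objective; the symbol names follow common convention and should be checked against the source. Here r_\theta is the reward model, \pi^{\mathrm{SFT}} is the supervised fine-tuned model, \pi_\phi^{\mathrm{RL}} is the policy being trained, and \beta and \gamma are weighting coefficients.

```latex
\mathrm{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
  \left[ r_\theta(x,y)
    - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

The \beta term keeps the tuned model close to the supervised one so it cannot chase reward-model quirks too far; the \gamma term mixes in pretraining data to limit regressions on general language tasks.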

This is one of the papers that turns LLMs into products. It says the model’s personality, helpfulness, and refusal behavior are not just prompts; they are outcomes of training data, preference collection, and optimization choices.

What to remember

One-liner
Base models continue text; instruction models serve users.
Why it matters
Human preferences are product data.
Builder instinct
Assistant behavior is trained, not wished into existence.

Read it like this

Build instinct

Collect 30 preferred/rejected answer pairs for one workflow and train or simulate a tiny preference ranking loop.
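
One way to simulate that loop without any ML library is a linear scorer trained on the pairwise logistic loss. This is a minimal sketch under stated assumptions: the three features in `features` are hypothetical hand-picked signals, the three entries in `PAIRS` are made-up placeholders for the 30 real pairs, and the learning rate and epoch count are arbitrary.

```python
import math
import random

def features(answer):
    """Hypothetical hand-picked features; replace with signals from your workflow."""
    words = answer.split()
    return [
        min(len(words), 100) / 100.0,                         # capped, scaled length
        1.0 if any(ch.isdigit() for ch in answer) else 0.0,   # mentions a concrete number
        1.0 if "sorry" in answer.lower() else 0.0,            # apologetic filler
    ]

def score(weights, answer):
    """Linear reward: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features(answer)))

def train(pairs, epochs=200, lr=0.5, seed=0):
    """Stochastic gradient descent on -log(sigmoid(score(preferred) - score(rejected)))."""
    rng = random.Random(seed)
    weights = [0.0, 0.0, 0.0]
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for preferred, rejected in pairs:
            fp, fr = features(preferred), features(rejected)
            margin = sum(w * (a - b) for w, a, b in zip(weights, fp, fr))
            p = 1.0 / (1.0 + math.exp(-margin))  # modeled P(preferred beats rejected)
            for i in range(len(weights)):
                # Step along the negative gradient of the pairwise loss.
                weights[i] += lr * (1.0 - p) * (fp[i] - fr[i])
    return weights

def pair_accuracy(weights, pairs):
    """Fraction of pairs where the preferred answer gets the higher score."""
    return sum(score(weights, a) > score(weights, b) for a, b in pairs) / len(pairs)

# Placeholder (preferred, rejected) pairs; substitute your own collected data.
PAIRS = [
    ("Refund issued: 42 dollars back to your card in 3 days.",
     "Sorry, sorry, I cannot really say anything about refunds."),
    ("Your order ships in 2 days.",
     "Sorry for the trouble, it will ship at some point."),
    ("Reset takes 5 minutes via the settings page.",
     "Sorry, resetting is something that may be possible."),
]

weights = train(PAIRS)
print(weights, pair_accuracy(weights, PAIRS))
```

With real data, hold out a few pairs and report accuracy on those rather than on the training pairs; the learned weights also show which features the raters were implicitly rewarding.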
