PAPER 11 · Instruction tuning

Training Language Models to Follow Instructions with Human Feedback

Ouyang et al., 2022 · Paper

The modern post-training blueprint: supervised instruction tuning followed by preference optimization with reinforcement learning from human feedback (RLHF).

Core concept

InstructGPT showed that assistant behavior is trained with human feedback, not automatically produced by pretraining.

Why it mattered

It explains the leap from autocomplete model to helpful chatbot product.

Visual shortcut · Autocomplete becomes assistant

Instruction tuning shapes a model from “continue this text” toward “help this user.”

How it works

Collect human-written examples of good assistant answers and fine-tune the base model on them.

Ask humans to compare outputs.

Train a reward model from preferences.

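Optimize the model toward preferred behavior with reinforcement learning (PPO in the paper), using the reward model's score as the training signal.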
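The comparison step is easiest to picture as data. Below is a minimal sketch, assuming a labeler returns one best-to-worst ranking of several model outputs for a prompt, which is then expanded into pairwise comparisons for reward-model training. The names `PreferencePair` and `ranking_to_pairs` are hypothetical and chosen for illustration; they do not come from the paper's code.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison (hypothetical record type for illustration)."""
    prompt: str
    chosen: str    # the output the labeler ranked higher
    rejected: str  # the output the labeler ranked lower


def ranking_to_pairs(prompt: str, ranked_outputs: list[str]) -> list[PreferencePair]:
    """Expand one best-to-worst ranking into every pairwise comparison."""
    # combinations() preserves input order, so `better` always precedes `worse`.
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked_outputs, 2)
    ]


pairs = ranking_to_pairs(
    "Explain what a reward model is.",
    ["clear two-sentence answer", "rambling answer", "off-topic answer"],
)
print(len(pairs))  # a ranking of 3 outputs yields 3 pairs
```

A ranking of K outputs yields K(K-1)/2 pairs, so a single labeling task produces several training comparisons at once.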

The quick digest

A base language model is trained to continue text. That is not the same as following instructions, being concise, admitting uncertainty, or giving users what they actually wanted. InstructGPT adds a behavior-shaping pipeline on top of the base model.

The pipeline is simple in spirit: humans write good answers, humans compare model answers, a reward model learns those preferences, and the language model is optimized toward answers people prefer. The payoff is not tied to scale: in the paper, labelers preferred outputs from a 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3.
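Two small formulas carry most of that pipeline, sketched here in plain Python. The first is the pairwise loss used to train a reward model on comparisons. The second is a reward shaped by a penalty for drifting away from the supervised model. The paper applies a per-token KL penalty; collapsing it to one sequence-level log-probability difference is a simplification made here for brevity. The function names and the `beta` default are illustrative assumptions, not values from the paper.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen answer scores higher."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-m)) equals -log(sigmoid(m)); branch on sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(
    reward: float,
    logp_policy: float,
    logp_reference: float,
    beta: float = 0.1,  # arbitrary illustrative strength, not the paper's setting
) -> float:
    """Reward-model score minus a penalty for moving away from the reference model."""
    return reward - beta * (logp_policy - logp_reference)


print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))  # True
```

The penalty term is what keeps the optimized model from chasing reward-model quirks at the cost of fluent, sensible text.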

This is one of the papers that turns LLMs into products. It says the model’s personality, helpfulness, and refusal behavior are not just prompts; they are outcomes of training data, preference collection, and optimization choices.

What to remember

One-liner

Base models continue text; instruction models serve users.

Why it matters

Human preferences are product data.

Builder instinct

Assistant behavior is trained, not wished into existence.

Read it like this

First pass: Follow the three-stage pipeline: supervised fine-tuning, reward modeling, reinforcement learning.
Second pass: Pay attention to who gives feedback and what they reward.
Then build taste: Ask which parts can be replaced by DPO or simpler preference methods.
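For the DPO question, it helps to see how little machinery the replacement needs. DPO comes from later work (Rafailov et al., 2023), not from this paper. The sketch below assumes you already have summed log-probabilities of each answer under the policy and under a frozen reference model. The function name, argument names, and the `beta` default are hypothetical.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, an assumption
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy likes each answer than the reference does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss drops once the policy favors the chosen answer more than the reference does.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0) < dpo_loss(-10.0, -10.0, -10.0, -10.0))  # True
```

No separate reward model and no reinforcement learning loop appear here: the preference pairs train the policy directly, which is the part of the original pipeline DPO replaces.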

Build instinct

Collect 30 preferred/rejected answer pairs for one workflow and train or simulate a tiny preference ranking loop.
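One way to simulate that loop without any ML library is sketched below. It assumes a linear scorer over three hand-written text features in place of a neural reward model, and four toy pairs stand in for the 30 you would collect. The features, example pairs, and hyperparameters are all invented for illustration.

```python
import math


def features(answer: str) -> list[float]:
    """Toy hand-written features; a real reward model learns its own."""
    words = answer.lower().split()
    return [
        min(len(words), 50) / 50.0,          # length, capped and scaled
        1.0 if "because" in words else 0.0,  # gives a reason
        1.0 if "maybe" in words else 0.0,    # vague hedge
    ]


def score(weights: list[float], answer: str) -> float:
    return sum(w * f for w, f in zip(weights, features(answer)))


def train(pairs: list[tuple[str, str]], epochs: int = 200, lr: float = 0.5) -> list[float]:
    """Gradient descent on -log(sigmoid(score(chosen) - score(rejected)))."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = score(weights, chosen) - score(weights, rejected)
            # Derivative of the loss with respect to the margin is -sigmoid(-margin).
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            fc, fr = features(chosen), features(rejected)
            weights = [w - lr * grad_scale * (c - r) for w, c, r in zip(weights, fc, fr)]
    return weights


# (chosen, rejected) pairs: hypothetical examples for one support workflow.
pairs = [
    ("Use a list because order matters here.", "maybe use something"),
    ("Restart the service because the config is cached.", "maybe it works maybe not"),
    ("The test fails because the fixture is missing.", "it fails"),
    ("Pin the version because the API changed in the new release.", "maybe upgrade"),
]
weights = train(pairs)
agreement = sum(score(weights, c) > score(weights, r) for c, r in pairs) / len(pairs)
print(agreement)  # fraction of training pairs the scorer ranks like the labeler
```

With real data, hold out some pairs and measure agreement on those instead. The interesting part of the exercise is seeing which behaviors your own preferences end up rewarding.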
