PAPER 12 · Preference alignment

Direct Preference Optimization

Rafailov et al. 2023 Paper

A simpler, more stable alternative to PPO-style RLHF that optimizes the model on preference pairs directly, using a classification-style loss instead of a reinforcement learning loop.

Core concept

DPO lets you tune a model from preferred/rejected answer pairs without running a complicated reinforcement learning loop.

Why it mattered

It made preference tuning simpler and more accessible for smaller teams.

Visual shortcut · Preference tuning without the big RL machine
[Diagram: a prompt with a chosen and a rejected response feeds a tune step that learns the preference gap.]

DPO compresses preference tuning into a direct lesson: make the preferred answer more likely than the rejected one.

How it works
Collect chosen/rejected response pairs.
Compare model probabilities for both answers.
Push the model toward the chosen one.
Keep it anchored to a reference model.
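
The four steps above can be sketched for a single preference pair. This is a minimal illustration, not the paper's reference code: the function names, the example log-probabilities, and the `beta` value of 0.1 are assumptions made here, and a real run would get the log-probabilities from the model being tuned and from a frozen reference model.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    All four inputs are sequence log-probabilities of a response given the
    prompt. beta controls how strongly the model is anchored to the reference
    (0.1 here is an illustrative choice, not a recommendation).
    """
    # How much more (or less) likely each answer is under the tuned model
    # than under the reference model.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected

    # The preference gap: positive when the tuned model favors the chosen
    # answer more than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical token log-probabilities for one pair.
policy_chosen = sequence_logprob([-0.5, -1.0, -0.7])
policy_rejected = sequence_logprob([-1.5, -2.0, -1.8])
ref_chosen = sequence_logprob([-0.9, -1.2, -1.0])
ref_rejected = sequence_logprob([-1.1, -1.6, -1.4])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
```

When the tuned model and the reference agree exactly, the margin is zero and the loss is log 2; the loss falls as the tuned model widens the gap in favor of the chosen answer.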

The quick digest

RLHF is powerful but operationally messy: train a reward model, run reinforcement learning, keep the model from drifting, debug instability. DPO asks whether you can get much of the same preference-shaping effect directly from pairs of good and bad answers.

The paper's answer is largely yes. Given a prompt, a preferred response, and a rejected response, DPO adjusts the model so the preferred response becomes more likely relative to the rejected one while staying anchored to a reference model.
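
Written out, the objective that does this is a single loss over preference pairs. The notation below follows common usage and is an assumption of this digest: x is the prompt, y_w the preferred response, y_l the rejected response, \pi_\theta the model being tuned, \pi_{\mathrm{ref}} the frozen reference model, \sigma the logistic function, and \beta a coefficient controlling how tightly the model is anchored to the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the preferred response's likelihood ratio against the reference and lowers the rejected response's ratio, which is the "more likely relative to the rejected one" behavior described above.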

For builders, the practical upshot is significant: if you can collect clean preference pairs, you can steer style, helpfulness, refusal patterns, and domain behavior without training a separate reward model or running a full RLHF pipeline.

What to remember

One-liner
Preference pairs can directly steer a model.
Why it matters
Simpler alignment loops made tuning more accessible.
Builder instinct
The dataset becomes the steering wheel.

Read it like this

Build instinct

Create chosen/rejected pairs for one writing style and run a small DPO fine-tune or mock ranking eval.
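
A mock ranking eval for this exercise could look like the sketch below. Everything in it is hypothetical: the two example pairs, the filler-phrase list, and `toy_style_score`, which is only a stand-in so the script runs without a model. In a real eval the scorer would be something like the model's log-probability of the response given the prompt.

```python
# Hypothetical preference pairs for a "plain, concise" writing style.
pairs = [
    {
        "prompt": "Explain what a cache is.",
        "chosen": "A cache stores recent results so repeated requests are faster.",
        "rejected": (
            "In essence, a cache is a kind of mechanism that, "
            "broadly speaking, keeps earlier results around."
        ),
    },
    {
        "prompt": "Summarize the meeting.",
        "chosen": "We agreed to ship on Friday.",
        "rejected": (
            "It is worth noting that, after much discussion, "
            "the team agreed to ship on Friday."
        ),
    },
]

# Illustrative filler phrases the target style should avoid.
FILLERS = ("in essence", "broadly speaking", "it is worth noting")


def toy_style_score(prompt, response):
    """Stand-in scorer: fewer words and fewer filler phrases score higher.

    The prompt is unused here; it is kept in the signature so a model-based
    scorer can be swapped in without changing the eval loop.
    """
    text = response.lower()
    filler_hits = sum(text.count(phrase) for phrase in FILLERS)
    return -len(response.split()) - 5 * filler_hits


def ranking_accuracy(pairs, score_fn):
    """Fraction of pairs where the chosen response strictly outscores the rejected one."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    wins = sum(
        score_fn(p["prompt"], p["chosen"]) > score_fn(p["prompt"], p["rejected"])
        for p in pairs
    )
    return wins / len(pairs)


accuracy = ranking_accuracy(pairs, toy_style_score)
print(f"ranking accuracy: {accuracy:.2f}")
```

Running the same eval before and after a small DPO fine-tune, with the model's own scores in place of the toy scorer, gives a quick check on whether the pairs are steering the model in the intended direction.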
