DAY 05 · SFT VS DPO — TEACHING TASTE

One method shows the model what to say. The other shows it what's better.

Everything so far has been supervised fine-tuning — show the model good answers, it learns to imitate them. But imitation has a ceiling: it can only teach what a "right" answer looks like, not which of two decent answers is the one you'd actually prefer. Today is the second gear: preference tuning, and the method (DPO) that made it simple enough to run on your desk.

Core idea

SFT (supervised fine-tuning) trains on examples of the ideal answer — "produce this." DPO (Direct Preference Optimization) trains on pairs — a better answer and a worse answer for the same prompt — and teaches the model to lean toward the better one and away from the worse. SFT teaches a target; DPO teaches a preference between two real options.

Why it matters

Some things are easy to demonstrate and hard to define. "Be helpful but not pushy," "match our tone," "don't over-hedge" — you may not be able to write the perfect answer, but you can reliably say which of two answers is better. DPO turns that judgment into training signal.

Why imitation hits a ceiling

Supervised fine-tuning is powerful but one-sided: it only ever sees good examples. It never learns what to avoid, because you never showed it the near-misses. The model gets very good at copying your "yes" answers, but it has no signal about the subtly-wrong answers that look plausible. For voice and format, imitation is plenty. For judgment and taste — the difference between fine and excellent — you want the model to see both a good and a bad answer side by side and learn the gap between them.

The old way: RLHF (and why it scared people off)

The original recipe for teaching preferences was RLHF: reinforcement learning from human feedback. It's how the first ChatGPT was aligned, and it works, but it's a beast: you train a separate reward model to score answers, then use a finicky reinforcement-learning algorithm (PPO) to push the main model toward high-reward outputs. Several models in play (the model being trained, the reward model, a frozen reference copy, and usually a value model for PPO), unstable training, lots of knobs. Effective, but not something you casually run on a desk on a Saturday.

The breakthrough: DPO

DPO's insight was that you don't need the reward model or the reinforcement learning at all. The same preference data (pairs labeled "this one's better") can be fed into a single, stable training stage that looks a lot like ordinary classification. There's a clever bit of math that proves the detour through a separate reward model is unnecessary; you can optimize the model directly from the preference pairs. The only extra piece is a frozen copy of your SFT model, kept as a reference so the tuned model doesn't drift too far from where it started. In the original DPO paper's experiments, the result matched or beat PPO-based RLHF while being dramatically simpler and cheaper to run.
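To make the "looks a lot like ordinary classification" point concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log-probability of each answer under the model being trained (the policy) and under a frozen copy of the SFT model (the reference). The function name, the argument names, and the beta value of 0.1 are illustrative assumptions, not something this lesson prescribes.

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    beta is an assumed strength setting: larger values penalize
    drifting from the reference model more heavily.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # margin > 0 means the policy has moved toward 'chosen' relative to 'rejected'.
    margin = beta * (chosen_shift - rejected_shift)
    # Loss is -log(sigmoid(margin)), computed in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Before any DPO training the policy equals the reference, so margin is 0
# and the loss is log(2), about 0.693: a coin flip between the two answers.
start = dpo_pair_loss(-10.0, -12.0, -10.0, -12.0)
print(round(start, 3))

# Made-up numbers: the policy now favors 'chosen' more than the reference did,
# so the loss drops below the starting value.
later = dpo_pair_loss(-9.0, -14.0, -10.0, -12.0)
print(later < start)
```

The classification flavor shows up in the last step: the loss is a logistic loss on a single number, the margin, and training pushes that margin positive for every pair.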

In practice this means preference tuning went from "a research team's project" to "a script you can run after your SFT pass." DPO is the reason teaching taste is now within reach on hardware like yours.

The picture

[Figure: SFT maps a prompt to one ideal answer ("copy this"). DPO compares a better answer (✓) with a worse answer (✗) for the same prompt ("prefer ✓ over ✗"). Target vs. preference.]
SFT points at one good answer. DPO learns the gap between a good one and a bad one.

The order that works: SFT first, then DPO

These aren't rivals — they're stages. You almost always do SFT first to get the model into the right neighborhood (the right voice, format, and basic competence), then optionally apply DPO to sharpen its judgment within that neighborhood. DPO refines a model that's already decent; it's not a substitute for teaching the behavior in the first place. For this week's capstone, SFT alone gets the real estate analyst its voice. DPO is the tool you reach for when you want to teach it, say, to prefer specific, grounded claims over vague hype — a genuine taste judgment.

The non-technical version

SFT is handing a writing student a stack of model essays and saying "write like these." DPO is sitting with them over two of their own drafts and saying "this one's better than that one," over and over, until they work out the reasons for themselves and internalize your taste. Imitation gets them competent. Comparison gets them good. You need the first before the second is worth doing.

~/cuda-week/finetune/preference.jsonl
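# A DPO example: same prompt, two answers, a verdict.
# Shown wrapped and annotated for reading: in the actual .jsonl file each record sits on one line, and JSON allows no comments.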
{"prompt": "Describe this 2bd condo near downtown.",
 "chosen":   "Bright 2-bed, 2-bath condo three blocks from the core. Floor-to-ceiling windows, in-unit laundry, one covered parking spot. Walk to transit and the farmers market; quiet building, low fees.",
 "rejected": "AMAZING once-in-a-lifetime condo!!! You will LOVE this stunning luxury dream home, a true must-see paradise you can't miss!!!"}

# chosen = grounded + specific.  rejected = hype + empty.
# DPO teaches the model to lean toward 'chosen' and away from 'rejected'.
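Before handing a file like this to a trainer, it can help to check that every record really has the three fields. The sketch below is one way to do that, under assumptions: the function name and the specific checks are hypothetical, and the sample record is a shortened version of the condo example above. It expects one complete JSON object per line, which is what the JSONL format requires.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def load_preference_pairs(lines):
    """Parse JSONL lines into preference pairs, raising on malformed records."""
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)  # one complete JSON object per line
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"line {line_number}: missing or empty '{key}'")
        # A pair with identical answers carries no preference signal.
        if record["chosen"].strip() == record["rejected"].strip():
            raise ValueError(f"line {line_number}: chosen and rejected are identical")
        pairs.append(record)
    return pairs

# Shortened version of the condo record, written as a single line.
sample = [
    '{"prompt": "Describe this 2bd condo near downtown.", '
    '"chosen": "Bright 2-bed, 2-bath condo three blocks from the core.", '
    '"rejected": "AMAZING once-in-a-lifetime condo!!!"}',
]
pairs = load_preference_pairs(sample)
print(len(pairs), "valid pair(s)")
```

To check a real file, pass the open file object in place of the sample list, since iterating over a file yields its lines.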

Vocabulary to keep

SFT
Train on ideal answers. Imitation. The first stage.
DPO
Train on better/worse pairs. Preference. The refinement stage.
RLHF
The older, heavier preference method DPO largely replaced.
chosen / rejected
The two answers in a preference pair: the win and the near-miss.

Hands-on direction

Take one prompt from your domain and write both a "chosen" and a "rejected" answer — where the rejected one is plausible, not cartoonishly bad. That's the hard, useful part: good preference data lives on the line between fine and better, not between great and garbage. If you can articulate why "chosen" wins, you've found a taste judgment worth training.
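Once you have written your pair, one way to capture it is as a single JSON line with the same three fields as the preference.jsonl example in this lesson. This is a small sketch, not part of the course code: the helper name is hypothetical, and the angle-bracket strings are placeholders for your own text.

```python
import json

def to_jsonl_line(prompt, chosen, rejected):
    """Serialize one preference pair as a single JSONL line."""
    record = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
    # json.dumps escapes any newlines inside the strings, so the record
    # stays on one line; ensure_ascii=False keeps non-ASCII text readable.
    return json.dumps(record, ensure_ascii=False)

line = to_jsonl_line(
    prompt="<one prompt from your domain>",
    chosen="<the answer you would actually prefer>",
    rejected="<a plausible answer that is merely fine>",
)
print(line)
```

Appending each line, followed by a newline, to a file opened in "a" mode builds up the dataset one judgment at a time.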

SFT teaches the model your answers. DPO teaches it your judgment. Do the first to make it competent, the second to make it yours.