DAY 06 · DID IT ACTUALLY WORK?

The loss went down. That doesn't mean the model got better.

This is the day almost everyone skips, and it's the day that separates a fine-tune you can trust from a demo that fools you. A falling training loss is the most seductive lie in machine learning. Today is about telling whether your model genuinely got better at the job, or whether it just memorized your examples and quietly forgot how to do everything else.

Core idea

A fine-tune can fail in two opposite directions. Overfitting: it memorizes your training examples and can't generalize to new inputs. Catastrophic forgetting: it learns your new behavior but loses general abilities it used to have. Good evaluation catches both — by testing on data the model never trained on, and by checking that old skills still work.

Why it matters

Training loss only tells you the model fits the data it saw. It says nothing about new inputs or retained skills. If you ship on training loss alone, you ship a model that aces the test it already had the answers to and falls apart in the real world.

Overfitting: memorizing the test

Remember the ~10% of examples you held out on Day 4 and never trained on? This is what they're for. You measure the model on those unseen examples, not on the ones it studied. If it does well on training examples but poorly on the held-out ones, it memorized rather than learned — like a student who aced the practice exam because they'd seen the exact questions, then froze on the real one. The fix is usually fewer training steps, a lower learning rate, a smaller rank, or more diverse data.
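One way to make that comparison concrete is to score the same check on both splits and look at the gap. The sketch below assumes you already have a pass/fail result per example; `pass_rate`, `generalization_gap`, the sample scores, and the 0.15 threshold are all illustrative choices, not values from this course.

```python
def pass_rate(results):
    """Fraction of examples that passed; `results` is a list of booleans."""
    if not results:
        raise ValueError("need at least one result")
    return sum(results) / len(results)


def generalization_gap(train_results, held_out_results):
    """Training pass rate minus held-out pass rate.

    A large positive gap suggests memorization rather than learning.
    """
    return pass_rate(train_results) - pass_rate(held_out_results)


# Hypothetical scores: 19/20 on training examples, 11/20 on held-out ones.
train = [True] * 19 + [False] * 1
held_out = [True] * 11 + [False] * 9

gap = generalization_gap(train, held_out)
GAP_THRESHOLD = 0.15  # assumption: pick a threshold that suits your task

print(f"train={pass_rate(train):.2f} held_out={pass_rate(held_out):.2f} gap={gap:.2f}")
if gap > GAP_THRESHOLD:
    print("Likely overfitting: strong on seen examples, weak on unseen ones.")
```

The absolute numbers matter less than the difference between them: a model that scores modestly on both splits is in better shape than one that aces the training examples and stumbles on the held-out ones.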

Catastrophic forgetting: winning the battle, losing the war

The second failure is sneakier. You teach the model your real estate voice beautifully — and discover it's gotten noticeably worse at basic reasoning, or now answers every question like a listing, even "what's 12 times 8?" That's catastrophic forgetting: pulling the weights hard toward your narrow task overwrote general competence. This is exactly why LoRA's gentler, frozen-base approach (Day 2) tends to forget less than a full fine-tune — but it can still happen if you train too long or too hard. You catch it by keeping a handful of "general skill" prompts in your eval and checking they still work after training.
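A regression check of this kind can be very small. The sketch below is a minimal illustration, assuming each general prompt can be paired with a cheap string check; the prompt list, the `regression_failures` helper, and both stand-in "models" are hypothetical and exist only so the example runs without a GPU.

```python
# Hypothetical regression prompts: each pairs a prompt with a cheap check.
REGRESSION_PROMPTS = [
    ("What's 12 times 8?", lambda out: "96" in out),
    ("What is the capital of France?", lambda out: "paris" in out.lower()),
]


def regression_failures(generate, prompts=REGRESSION_PROMPTS):
    """Return the prompts whose output fails its check.

    `generate` is any callable mapping a prompt string to an output string.
    """
    return [prompt for prompt, check in prompts if not check(generate(prompt))]


# Stand-ins for real models so the sketch runs anywhere.
def base_model(prompt):
    answers = {
        "What's 12 times 8?": "12 times 8 is 96.",
        "What is the capital of France?": "Paris.",
    }
    return answers[prompt]


def forgetful_tuned_model(prompt):
    # Answers every question like a listing, the failure described above.
    return "Charming 3-bed with a sun-drenched kitchen, moments from downtown!"


before = regression_failures(base_model)
after = regression_failures(forgetful_tuned_model)

# Only count prompts that worked before training and fail after it.
newly_broken = [p for p in after if p not in before]
print("newly broken:", newly_broken)
```

Comparing against the base model's failures matters: a prompt the base model already got wrong is not evidence of forgetting.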

The picture

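You're aiming for the middle: a model that learned the new behavior, handles prompts it never saw, and kept its general skills. Both edges, overfitting on one side and catastrophic forgetting on the other, look fine on training loss alone.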

Evals that beat "vibes"

The honest minimum is a small, fixed test set you run before and after — same prompts every time, so you're comparing like with like. Three layers, cheapest first:

Held-out task eval. 20–50 prompts the model never trained on, scored for whether the new behavior actually shows up. This is your primary signal.
Side-by-side comparison. Run base model vs fine-tuned model on the same prompts and read them next to each other. For voice and style, your eyes are a legitimate instrument — just keep the prompt set fixed so it's fair.
Regression check. A few general prompts (basic math, a normal question, a refusal) to confirm you didn't break core abilities while teaching the new ones.

You can layer on automated scoring or an "LLM-as-judge" later, but a fixed prompt set you actually read beats an elaborate metric you don't trust. The discipline is consistency: same prompts, before and after, written down.
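The "same prompts, before and after, written down" discipline can be sketched in a few lines. This is a minimal illustration, not a prescribed harness: the prompt set, `run_eval`, `side_by_side`, and the lambda stand-ins for the base and tuned models are all hypothetical, and a real run would call your actual models in place of the lambdas.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical fixed prompt set; keep it identical across every run.
PROMPTS = [
    "Write a listing for a 2-bed condo near the river.",
    "Describe a fixer-upper with a large yard.",
]


def run_eval(generate, label, prompts=PROMPTS):
    """Run every prompt through `generate` and return a record to save."""
    return {
        "label": label,
        "outputs": [{"prompt": p, "output": generate(p)} for p in prompts],
    }


def side_by_side(before, after):
    """Pair up outputs from two runs over the same prompts, for reading."""
    rows = []
    for b, a in zip(before["outputs"], after["outputs"]):
        if b["prompt"] != a["prompt"]:
            raise ValueError("prompt sets differ; the comparison is not fair")
        rows.append((b["prompt"], b["output"], a["output"]))
    return rows


# Stand-in models so the sketch runs anywhere.
base_run = run_eval(lambda p: f"[base] {p}", "base")
tuned_run = run_eval(lambda p: f"[tuned] {p}", "tuned")

# Write the results down so later runs can be compared against them.
out_dir = Path(tempfile.mkdtemp())
for run in (base_run, tuned_run):
    (out_dir / f"{run['label']}.json").write_text(json.dumps(run, indent=2))

for prompt, base_out, tuned_out in side_by_side(base_run, tuned_run):
    print(f"PROMPT: {prompt}\n  base:  {base_out}\n  tuned: {tuned_out}\n")
```

The sketch writes to a temporary directory only to stay self-contained; in practice you would save the records somewhere permanent next to your training config.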

The knobs when something's wrong

If you overfit: train for fewer epochs, lower the learning rate, reduce rank, or add more varied data.
If you forgot general skills: train less aggressively, lower the learning rate, or mix a little general data back into your set.
If it just didn't learn the behavior at all: more or better examples (back to Day 4), higher rank, or a higher alpha so the adapter speaks louder.

Almost every fix is "adjust one thing, re-run, re-eval", which is cheap on a Spark, so you can afford to be disciplined.
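The symptom-to-knob mapping above can be written down as a small lookup, which makes it harder to skip a step when you are tired. This is a rough sketch: `suggest_fixes`, the 0-to-1 score convention, and the `gap_limit` and `floor` thresholds are assumptions for illustration, and only the suggested adjustments come from the text.

```python
def suggest_fixes(train_score, held_out_score, regression_score,
                  gap_limit=0.15, floor=0.7):
    """Map eval scores (all in 0..1) to the adjustments listed above.

    `gap_limit` and `floor` are illustrative assumptions; tune them.
    """
    fixes = []
    if train_score < floor:
        # The behavior never showed up, even on examples the model saw.
        fixes.append("did not learn: more or better examples, higher rank, or higher alpha")
    elif train_score - held_out_score > gap_limit:
        # Strong on seen examples, weak on unseen ones.
        fixes.append("overfit: fewer epochs, lower learning rate, lower rank, or more varied data")
    if regression_score < floor:
        # General prompts that used to work now fail.
        fixes.append("forgot: train less aggressively, lower learning rate, or mix in general data")
    return fixes or ["looks healthy: change nothing and keep the eval set fixed"]


# Hypothetical score triples: (train, held-out, regression).
for scores in [(0.95, 0.55, 0.90), (0.95, 0.90, 0.40), (0.30, 0.30, 0.95)]:
    print(scores, "->", suggest_fixes(*scores))
```

Whatever it suggests, change one knob at a time and re-run the same eval set, so you can tell which adjustment made the difference.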

The non-technical version

A new hire who memorized last quarter's reports word-for-word looks brilliant until you hand them this quarter's numbers. Another hire learned your house style so hard they now write the lunch order like a quarterly report. You don't find either problem by admiring their training exercises — you find it by giving them fresh work and watching what happens. Evaluation is fresh work, given on purpose, before you let the model loose on real users.

~/cuda-week/finetune/eval_plan.md

# Run this BEFORE you call the fine-tune "done"

held_out:     20-50 prompts never trained on  -> did the behavior transfer?
side_by_side: base vs tuned, same prompts     -> is it actually better?
regression:   5 general prompts (math, q&a)   -> did I break anything?

# Same prompt set every run. Write the results down.
# "Training loss went down" is NOT on this list on purpose.

Vocabulary to keep

Overfitting

Memorized the training data; fails on new inputs.

Catastrophic forgetting

Learned the new task but lost general abilities.

Held-out eval

Testing on examples the model never trained on.

Regression check

Confirming old skills still work after training.

Hands-on direction

Build your eval set before the capstone training run, not after — writing it first keeps you honest, because you can't quietly redefine "good" to match whatever the model produced. Twenty held-out prompts and five regression prompts is enough. Tomorrow you'll run them against the model you train and finally answer the only question that matters: did it actually work?
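One lightweight way to hold yourself to "written first, never quietly changed" is to record a fingerprint of the eval set before training and check it again before scoring. The sketch below uses a SHA-256 digest from Python's standard library; the `fingerprint` helper and the sample prompts are hypothetical, and this technique is a suggestion rather than part of the course material.

```python
import hashlib
import json


def fingerprint(prompts):
    """Stable SHA-256 hex digest of an ordered list of prompt strings."""
    payload = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical eval prompts; a real set would hold about 25 of them.
eval_prompts = ["Write a listing for a 2-bed condo.", "What's 12 times 8?"]
frozen = fingerprint(eval_prompts)  # record this before training

# Later, before scoring: confirm nobody edited the set.
edited = eval_prompts[:-1] + ["What's 2 plus 2?"]
print("unchanged:", fingerprint(eval_prompts) == frozen)
print("edited set detected:", fingerprint(edited) != frozen)
```

Because the digest covers order as well as content, reordering or rewording a single prompt changes it, which is the point.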

Training loss measures how well the model fits answers it already saw. Everything that matters happens on the answers it hasn't.