DAY 02 · LORA — THE LOW-RANK TRICK

Don't repaint the house to change one room.

A full fine-tune of an 8B model rewrites all 8 billion of its weights: slow, memory-hungry, and easy to wreck. LoRA is the trick that broke that open: freeze the entire model, bolt on two tiny matrices, and train only those. You change the model's behavior by training what is typically less than 1% of it. Today is the one mental model that makes the rest of the week click.

Core idea

LoRA (Low-Rank Adaptation) freezes the original weights and learns a small, low-rank "diff" that gets added on top. The big model never moves. Two skinny matrices learn the adjustment. At the end you have a small adapter file, often a few megabytes and growing with the rank and the number of layers you adapt, that carries your entire fine-tune.

Why it matters

This is what makes fine-tuning fit on one desk. A full fine-tune has to hold gradients and optimizer state for billions of weights; LoRA needs them only for the few million adapter weights, while the frozen base just sits in memory. For many behavior changes the results are comparable at a fraction of the cost, and you can keep dozens of adapters for one base model.
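A rough back-of-the-envelope estimate makes the gap concrete. This sketch assumes an 8B-parameter model, an Adam-style optimizer that keeps two extra values per trained weight, 4 bytes per value for everything that trains, and 2 bytes per frozen weight. It ignores activations and framework overhead, and the 20-million-parameter adapter is a hypothetical figure picked for illustration, so read the output as orders of magnitude, not measurements.

```python
# Back-of-the-envelope training memory, ignoring activations and overhead.
# Every byte size and the adapter size below is an illustrative assumption.

GB = 1e9

def training_memory_gb(trained_params, frozen_params=0,
                       bytes_per_trained=4, bytes_per_frozen=2):
    """Memory for trained parameters (weight + gradient + two optimizer
    moments each) plus read-only storage for frozen parameters."""
    per_trained = bytes_per_trained * 4  # weight, gradient, 2 optimizer moments
    total_bytes = trained_params * per_trained + frozen_params * bytes_per_frozen
    return total_bytes / GB

base_params = 8_000_000_000      # the 8B model from the text
adapter_params = 20_000_000      # hypothetical LoRA adapter size

full_ft = training_memory_gb(trained_params=base_params)
lora = training_memory_gb(trained_params=adapter_params, frozen_params=base_params)

print(f"full fine-tune: ~{full_ft:.0f} GB")
print(f"LoRA:           ~{lora:.1f} GB")
```

Under these assumptions, nearly all of the LoRA figure is the frozen base model itself; the part that actually trains barely registers.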

Why you don't need to move every weight

Here's the insight the LoRA paper started from. When you fine-tune a model for a specific task, the change you make to the weights (the difference between before and after) tends to be "low rank"; the paper hypothesized this and its experiments backed it up. In plain terms: the adjustment is simple. It can be described with far fewer numbers than the full weight matrix contains. You're not teaching the model a whole new brain; you're nudging it in a particular direction, and a nudge doesn't need billions of parameters to express.

So instead of editing the giant weight matrix directly, LoRA represents that nudge as two small matrices multiplied together. A big matrix that's, say, 4096×4096 has ~16.8 million numbers. Two skinny matrices of 4096×8 and 8×4096 have ~65 thousand between them. That little "8" is the rank: how much room the adapter has to express change. You train the skinny pair; the 16.8-million-number original stays frozen.
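To see "far fewer numbers" in miniature, here is a small sketch in plain Python. It multiplies a 4×1 column by a 1×4 row (rank 1, chosen only to keep the printout readable) and then checks the 4096 arithmetic from the paragraph above. The helper names `matmul`, `full_count`, and `lora_count` are made up for this example.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# A rank-1 "nudge": 4 + 4 = 8 numbers describe a full 4x4 matrix.
A = [[1.0], [2.0], [0.5], [-1.0]]   # 4 x 1
B = [[3.0, 0.0, -2.0, 1.0]]         # 1 x 4
delta = matmul(A, B)                # 4 x 4; every row is a multiple of B's row

for row in delta:
    print(row)

def full_count(d_in, d_out):
    """Numbers in the full weight matrix."""
    return d_in * d_out

def lora_count(d_in, d_out, rank):
    """Numbers in the two skinny matrices: (d_in x rank) and (rank x d_out)."""
    return d_in * rank + rank * d_out

print(full_count(4096, 4096))       # 16777216
print(lora_count(4096, 4096, 8))    # 65536
```

The 4×4 result looks like sixteen independent numbers, but it only ever had eight degrees of freedom. That is the whole bet LoRA makes, scaled up.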

The picture

[Diagram] W (frozen, 8B weights) + A × B (trainable adapter, tiny) = adapted model: W + (A × B)
The frozen giant plus a small learned correction. Only A and B ever train.
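The picture can be sketched as a forward pass. This is a minimal illustration with tiny stand-in matrices, not how a real library implements it: the sizes, the values, and the choice to start B at zero (so the adapter begins as a no-op) are all assumptions for the example, and the "training step" is faked by hand-editing B.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Elementwise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Tiny stand-ins: width 3, rank 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]          # frozen: never modified below
A = [[0.5], [-0.5], [1.0]]     # 3 x 1, trainable
B = [[0.0, 0.0, 0.0]]          # 1 x 3, trainable, zero at the start (assumption)

def forward(x, W, A, B):
    base = matmul(x, W)                    # what the frozen model would do
    correction = matmul(matmul(x, A), B)   # (x A) B: never builds the full A x B
    return add(base, correction)

x = [[1.0, 2.0, 3.0]]
before = forward(x, W, A, B)   # identical to the frozen model's output

B = [[0.5, 0.0, -1.0]]         # pretend one training step updated B
after = forward(x, W, A, B)

print(before)
print(after)
```

Computing `(x A) B` instead of `x (A B)` keeps every intermediate result skinny, which is part of why the adapter stays cheap at training time.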

Two knobs you'll actually touch

Rank (r) is how much capacity the adapter has. Higher rank = more room to learn, more parameters, more risk of overfitting on small data. For most behavior fine-tunes, ranks of 8 to 32 are the working range. Bigger isn't automatically better; a rank of 16 on a few hundred examples often beats a rank of 128 that just memorizes them.
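To get a feel for what turning the rank knob costs, this small sketch counts the adapter's trainable numbers for one 4096×4096 matrix at a few ranks. The 4096 width is just the running example from earlier, and `lora_params` is a name invented here.

```python
def lora_params(d_in, d_out, rank):
    """Trainable numbers in a (d_in x rank) and (rank x d_out) pair."""
    return rank * (d_in + d_out)

d = 4096
full = d * d  # numbers in the frozen matrix

for rank in (8, 16, 32, 128):
    n = lora_params(d, d, rank)
    share = 100 * n / full
    print(f"r={rank:>3}: {n:>9,} trainable numbers ({share:.2f}% of this matrix)")
```

The count grows linearly with rank, so r=128 carries sixteen times the parameters of r=8. The percentages are per adapted matrix; across a whole model the share is usually lower, because only some layers get adapters.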

Alpha is how loudly the adapter speaks. It scales the adapter's contribution before it's added back to the frozen weights; in the standard formulation the multiplier is alpha divided by the rank, so what matters is alpha relative to r. Turn it up and the fine-tune's influence grows; turn it down and the model leans on its original behavior. A common starting point is alpha equal to the rank, or twice it. These two numbers are most of the dial-turning you'll do this week.
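In the common convention, the adapter's output is multiplied by alpha / r before it is added to the frozen path. A minimal sketch of that arithmetic follows; the function names are invented for this example, and some libraries offer variants of the scaling rule, so check what yours does.

```python
def lora_scale(alpha, rank):
    """Multiplier on the adapter's output in the common alpha / r convention."""
    return alpha / rank

def adapted_value(frozen, adapter, alpha, rank):
    """One entry of the adapted weight: frozen + (alpha / r) * adapter."""
    return frozen + lora_scale(alpha, rank) * adapter

print(lora_scale(alpha=16, rank=16))   # alpha equal to rank
print(lora_scale(alpha=32, rank=16))   # alpha twice the rank
print(lora_scale(alpha=16, rank=64))   # rank raised, alpha left alone

print(adapted_value(frozen=1.0, adapter=0.5, alpha=32, rank=16))
```

The third line is the one to remember: if you raise the rank and leave alpha where it was, the adapter gets quieter, which is why the two knobs are usually set together.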

The quiet superpower: adapters are swappable

Because the base model never changes, your fine-tune is just a small file sitting on top of it. That means one base model can wear many adapters — a real-estate-voice adapter, a legal-summary adapter, a JSON-formatter adapter — and you swap them like lenses without storing three full copies of an 8B model. This is also why serving stacks can host many fine-tunes cheaply: same frozen weights underneath, different little hats on top.
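Swapping can be sketched as one shared base plus a dictionary of adapters. Everything here is a toy: the 2×2 base, the adapter values, and the adapter names (borrowed from the examples above) are made up, and real serving stacks handle this with far more machinery.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Elementwise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# One frozen base, shared by every adapter.
base_W = [[1.0, 0.0],
          [0.0, 1.0]]

# Each adapter is just its own small (A, B) pair. Values are illustrative.
adapters = {
    "real_estate_voice": ([[1.0], [0.0]], [[0.5, 0.0]]),
    "legal_summary":     ([[0.0], [1.0]], [[0.0, -0.5]]),
    "json_formatter":    ([[1.0], [1.0]], [[0.25, 0.25]]),
}

def forward(x, adapter_name=None):
    """Run the frozen base, then add the chosen adapter's correction, if any."""
    y = matmul(x, base_W)
    if adapter_name is not None:
        A, B = adapters[adapter_name]
        y = add(y, matmul(matmul(x, A), B))
    return y

x = [[1.0, 1.0]]
print("base only:", forward(x))
for name in adapters:
    print(name, forward(x, name))
```

Switching behavior is a dictionary lookup; `base_W` is loaded once and never copied or modified.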

"Learns less, forgets less"

There's a real tradeoff to be honest about. Because LoRA only edits a small, low-rank slice of the model, it learns less aggressively than a full fine-tune — and that's often a feature. By leaving the original weights frozen, it tends to forget less of the model's general ability. A full fine-tune has more power to specialize, and more power to bulldoze the things the model already knew. For most people, most of the time, LoRA's gentler touch is exactly what you want.

The non-technical version

Imagine a 1,000-page operations manual for a company. You want to change how the company handles one type of customer. You could rewrite all 1,000 pages — expensive, slow, and you'll probably introduce mistakes on page 738 that nobody catches. Or you could clip a two-page addendum to the front: "for this situation, do it this way instead." The manual stays intact, the change is small and clearly scoped, and you can keep a drawer of different addenda for different situations. LoRA is the addendum.

Vocabulary to keep

LoRA: Freeze the model, train a small low-rank adapter added on top.
Rank (r): The adapter's capacity. 8–32 covers most behavior fine-tunes.
Alpha: How strongly the adapter's change is applied to the frozen weights.
Adapter: The small trained file that carries your whole fine-tune, often a few MB.

Hands-on direction

You don't train today — you build the intuition. Picture the job from Day 1 and ask: is this a small nudge to behavior, or a deep change to what the model knows? LoRA is built for nudges. If your honest answer is "I need it to know an entirely new domain it's never seen," LoRA will struggle, and that's a signal to revisit the Day 1 decision before you spend GPU time.

~/cuda-week/finetune/lora_notes.md
# LoRA, in five lines

1. Freeze the whole base model.
2. Add two skinny matrices (A, B) to the layers you target.
3. Train ONLY A and B  ->  <1% of the parameters.
4. rank  = how much the adapter can learn   (start 16)
   alpha = how loud the adapter speaks       (start 16-32)
5. Save the adapter (a few MB). Swap it like a lens.

A full fine-tune rewrites the brain. LoRA tucks a sticky note into the exact right page and trains only the note.