DAY 03 · QLORA — FINE-TUNING ON ONE DESK ⭐

The trick that turned a cluster job into a coffee break.

LoRA shrank what you have to train. QLoRA shrank what you have to hold in memory while you train it. Put the two together and a model that needed a rack of GPUs in 2022 fine-tunes on the box on your desk in 2026. This is the single most important unlock in the whole week — and the day you set up the tools you'll use for the capstone.

Core idea

QLoRA keeps the frozen base model in memory at 4-bit precision instead of 16-bit — roughly a quarter of the space — then trains LoRA adapters on top of it in full precision. The big thing you're not changing gets squeezed small; the small thing you are changing stays accurate. You lose almost nothing and save enormous memory.

Why it matters

The original QLoRA result took fine-tuning a 65B model from needing >780 GB of GPU memory down to under 48 GB — with no measurable quality loss versus a full 16-bit fine-tune. That's the difference between "rent a datacenter" and "use the machine you own."

The three ideas stacked inside QLoRA

1. 4-bit NF4 quantization. The frozen weights are stored in a custom 4-bit format called NF4 (Normalized Float 4), designed specifically for the bell-curve way neural network weights are distributed. It packs each weight into 4 bits while keeping more of the important detail than a naive 4-bit scheme would. The base model is now ~4× smaller in memory.

2. Train the adapter in full precision. You never train the 4-bit weights — they stay frozen. The LoRA adapters on top are kept in bfloat16, the accurate format, so the part actually learning isn't handicapped by the compression. During the math, the 4-bit weights are temporarily expanded back up to do their work, then dropped again. Compression for storage, precision for learning.

3. Double quantization. Quantizing produces little bookkeeping numbers (the scales). QLoRA quantizes those too, saving roughly another 0.37 bits per parameter. It sounds like a rounding error; across billions of parameters it's gigabytes.

The picture

Squeeze the frozen part to 4-bit; keep the learning part accurate.

Where the DGX Spark changes the story

QLoRA was invented so people could fine-tune on a single consumer GPU with 16–24 GB. Your DGX Spark has 128 GB of unified memory, all exposed as one pool, so you rarely fight the memory ceiling that QLoRA was built to dodge. What you get instead is headroom: you can fine-tune comfortably in the 8–20B sweet spot, and QLoRA stretches you all the way up to 70B-class models if you want to push it. Tuning Llama 3.1 8B with LoRA on the Spark has been clocked around 53,000 tokens/sec — fine-tunes that finish while your coffee's still warm.

Unsloth — the tool you'll actually run

You won't hand-write the quantization math. Unsloth is an open-source library that wraps all of this — QLoRA, the NF4 format, the LoRA bookkeeping — behind a few lines of Python, and it's tuned for NVIDIA hardware with custom Triton kernels (remember Triton from the Week 9 preview — Python-level GPU code). On the Spark it delivers roughly a 2.5× training speed-up over plain Hugging Face transformers while using less memory. Today you install it; Day 7 you point it at the real estate dataset.

~/cuda-week/finetune/setup_qlora.sh

# Day 3: stand up the toolchain (run on the Spark)
pip install unsloth

python - <<'PY'
from unsloth import FastLanguageModel

# Load Llama 3.1 8B already in 4-bit (QLoRA-ready)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name   = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = True,        # <- the Q in QLoRA
)

# Attach LoRA adapters (the only thing that will train)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                     # rank  (Day 2)
    lora_alpha = 16,            # alpha (Day 2)
    target_modules = ["q_proj","k_proj","v_proj","o_proj",
                      "gate_proj","up_proj","down_proj"],
)
print("QLoRA model ready. Trainable params:")
model.print_trainable_parameters()   # expect <1% of total
PY

Hands-on direction

Success today is not a fine-tuned model — it's a clean run of the script above with print_trainable_parameters() showing well under 1% of the model is trainable. That one line is the whole lesson made concrete: the 4-bit frozen giant is along for the ride, and the tiny bf16 adapter is the only thing learning. If this runs, your Day 7 capstone is already de-risked.

The non-technical version

You want to annotate a priceless 1,000-page reference book, but you can't fit the original on your small desk. So you make a high-quality 4-bit photocopy — smaller, lighter, good enough to read every word — and you do your annotating on crisp full-quality sticky notes attached to it. The original never gets touched. Your notes are sharp. And the whole thing fits where the original never could. That photocopy-plus-sharp-notes is QLoRA.

Vocabulary to keep

QLoRA

4-bit frozen base model + full-precision LoRA adapter trained on top.

NF4

A 4-bit number format shaped to how neural-net weights are distributed.

Double quantization

Quantizing the quantization constants too — extra savings, free.

Unsloth

The library that runs all this fast on NVIDIA hardware.

LoRA decided what trains. QLoRA decided what fits. Together they put model training on a desk instead of in a datacenter.