In Week 1 you prompted it. Today you train it.
Week 1's capstone built a real estate analyst by leaning on a long prompt every single call. Today you bake that behavior into the weights. You'll QLoRA fine-tune Llama 3.1 8B on a real estate dataset, give it one consistent house voice, evaluate it honestly, merge the adapter, and serve it on your DGX. By dinner you have a model that drafts listings and reasons about housing the way you taught it — no babysitting prompt required.
The goal is not "the model answered." It's a model whose behavior changed: ask it cold, with no system prompt, and it still writes in your analyst's voice and reasons like a housing pro. That's the proof a fine-tune did something a prompt couldn't — the behavior is now part of the model, not part of the request.
This is the week's whole argument made real. You take everything — the Day 1 decision, LoRA, QLoRA, the dataset discipline, the eval — and run it end to end on hardware you own. You finish on the far side of the line: from someone who uses models to someone who makes them.
The pipeline
The acceptance test
By the end you want three artifacts: a trained adapter, a before/after eval on held-out prompts, and a served model you can chat with in Ollama. The adapter proves training ran. The eval proves it got better, not just different. The served model proves the behavior survived merging and is usable. Hit all three and you've genuinely fine-tuned a model — not run a demo.
Step 1 — the dataset (Day 4 made real)
Use a public real estate dataset as raw material — listing descriptions, property attributes, market notes — and shape it into clean user→assistant pairs where every assistant reply is written in one consistent analyst voice: specific, grounded, no hype. Aim for a few hundred good examples and hold out ~10% for evaluation. This is the hour that decides everything; the training is the easy part.
Step 2 — QLoRA fine-tune (Days 2–3 made real)
~/cuda-week/finetune/train_analyst.pyfrom unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
model, tokenizer = FastLanguageModel.from_pretrained(
"unsloth/Meta-Llama-3.1-8B-Instruct",
max_seq_length=2048, load_in_4bit=True) # QLoRA (Day 3)
model = FastLanguageModel.get_peft_model(
model, r=16, lora_alpha=16, # LoRA knobs (Day 2)
target_modules=["q_proj","k_proj","v_proj","o_proj",
"gate_proj","up_proj","down_proj"])
data = load_dataset("json", data_files="analyst_train.jsonl", split="train")
trainer = SFTTrainer(
model=model, tokenizer=tokenizer, train_dataset=data,
args=SFTConfig(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5, num_train_epochs=2, # watch for overfit (Day 6)
learning_rate=2e-4, logging_steps=5,
output_dir="analyst_lora"))
trainer.train() # ~minutes on the Spark
model.save_pretrained("analyst_lora") # the adapter (a few MB)
Step 3 — evaluate honestly (Day 6 made real)
Run your held-out prompts through the base model and the fine-tuned model with no system prompt, side by side. The tell that it worked: the base model writes generic copy or asks for instructions, and yours writes in the analyst voice unprompted. Run your five regression prompts too — confirm it can still do basic reasoning and didn't forget how to be a normal model.
Step 4 — merge and serve (Week 2 made real)
Merge the adapter into the base weights to get a standalone model, export it to GGUF, and load it into Ollama — the same serving muscle from Week 2. Now it's a model on your network that anyone with access can call, with your fine-tune baked in.
~/cuda-week/finetune/ship_it.sh# Merge adapter -> standalone GGUF -> Ollama (run on the Spark)
python - <<'PY'
from unsloth import FastLanguageModel
model, tok = FastLanguageModel.from_pretrained("analyst_lora", load_in_4bit=False)
model.save_pretrained_gguf("analyst_gguf", tok, quantization_method="q4_k_m")
PY
# Point Ollama at the merged model and chat with NO system prompt
ollama create analyst -f analyst_gguf/Modelfile
ollama run analyst "Write a listing for a 4bd colonial in Westchester, needs a kitchen update."
# If it sounds like your analyst with zero instructions -> the fine-tune worked.
Run it in passes
- Pass 1: tiny run — 30 examples, 1 epoch — just to prove the pipeline executes end to end.
- Pass 2: the real dataset, 2 epochs, then your full eval. Compare to base.
- Pass 3: if it overfit or forgot, turn one knob (epochs, learning rate, rank) and re-run. The Spark makes this cheap.
- Pass 4 (optional): add a small DPO pass (Day 5) to teach it to prefer grounded specifics over hype.
What to publish to yourself
Finish with a side-by-side: the same prompt answered by stock Llama 3.1 8B and by your fine-tune, no system prompt on either, plus one line of training stats (examples, epochs, minutes, tokens/sec). That artifact is the whole week in one image — proof that the behavior moved from the prompt into the model, on your desk.
A good capstone result is not "the model wrote a listing." It's "the base model wrote generic copy and asked for direction, and mine wrote in the analyst's voice with nothing but the question." That contrast — same prompt, two models, visibly different defaults — is the entire point of the week, made undeniable.
Capstone checklist
- [ ] Build dataset: a few hundred clean pairs, one consistent voice, ~10% held out
- [ ] QLoRA fine-tune Llama 3.1 8B with Unsloth — adapter saved
- [ ] Eval: held-out prompts, base vs tuned, no system prompt
- [ ] Regression check: general skills still intact
- [ ] Merge adapter → GGUF → Ollama, serving on the DGX
- [ ] Side-by-side artifact + training stats written down
A prompt makes a model act the part for one request. A fine-tune makes it the part. Today you crossed from renting behavior to owning it.