The model is a commodity. Your dataset is the moat.
Days 2 and 3 were the machinery. Today is the thing that actually decides whether your fine-tune is good: the data. Everyone can download the same base model and run the same Unsloth script. What makes your model yours — and better — is examples nobody else has, formatted with care. Most failed fine-tunes are data failures wearing a training-code costume.
A fine-tune learns the pattern in your examples — including the patterns you didn't mean to teach. The model copies your data's voice, structure, length, and habits. Clean, consistent, on-target examples make a clean model. Messy, contradictory, or sloppy examples make a confidently sloppy model. The training code is the easy 10%; the dataset is the hard, valuable 90%.
You can't out-hyperparameter bad data. A rank tweak won't save examples that disagree with each other. The single highest-leverage hour this week is the one you spend making 100 examples genuinely good — not the one you spend tuning learning rate.
What a training example actually is
For instruction fine-tuning, each example is a little conversation: a user turn and the ideal assistant reply you want the model to learn to produce. You're not writing facts for the model to memorize — you're showing it, over and over, "when the input looks like this, the right response looks like this." A few hundred consistent demonstrations and the model internalizes the pattern.
The reply side is where your moat lives. If you want a specific voice, every example's answer must be written in that voice. If you want tight three-sentence summaries, every answer is a tight three-sentence summary. The model has no idea what you "meant" — it learns what you showed it. Your examples are the spec.
The chat template — the boring detail that breaks everything
Every instruct model was trained with special tokens marking where the user turn starts, where the assistant turn starts, and where each ends. Llama, Qwen, and Mistral each use a different set of these markers. When you fine-tune, your data has to be wrapped in that model's exact template. Otherwise you're training it on a format it doesn't recognize, and the results come out subtly, maddeningly wrong.
This is the most common silent failure in fine-tuning. The good news: tools like Unsloth apply the correct chat template for you when you tell them which model you're using. The lesson is to know this layer exists, so when a fine-tune comes out garbled, "did the template match the model?" is your first question, not your last.
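To make that layer concrete, here is a minimal sketch of what "wrapping in a template" means. It hand-renders the ChatML-style markers used by Qwen-family instruct models; the function name `render_chatml` and the sample messages are made up for illustration. Other model families use different markers, so treat this as a picture of the idea, not something to paste into a training run. In real code you would normally let the tokenizer do this, for example with Hugging Face's `tokenizer.apply_chat_template(messages, tokenize=False)`.

```python
# Illustrative only: a hand-written ChatML-style renderer.
# Real projects should use the template that ships with the model's tokenizer.

def render_chatml(messages):
    """Wrap a list of {"role", "content"} dicts in ChatML turn markers."""
    parts = []
    for msg in messages:
        # Each turn: start marker + role, newline, content, end marker.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    return "".join(parts)


# Hypothetical example conversation (one user turn, one assistant turn).
example = [
    {"role": "user", "content": "Summarize: the meeting moved to Friday."},
    {"role": "assistant", "content": "The meeting is now on Friday."},
]

rendered = render_chatml(example)
print(rendered)
```

If you print the templated text for one example before training and the markers don't look like the ones your base model expects, you have found the mismatch before it cost you a training run.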
How many examples, really?
For a focused behavior change — a voice, a format, a narrow task — you can see real results with as few as 100 high-quality examples, and a few hundred to a couple thousand is a comfortable, strong range. More is not automatically better. A thousand sloppy examples lose to a hundred excellent ones, because the model faithfully learns the sloppiness too. Quality and consistency beat raw count almost every time at this scale.
Quality rules that actually move the needle
- Consistency over volume. If two examples answer similar inputs in contradictory styles, you're teaching confusion. Pick one way and hold it.
- Diversity of inputs, consistency of outputs. Vary the questions widely; keep the manner of the answers uniform. That's how the model learns the behavior, not just the examples.
- Match the length you want. The model copies answer length. Want concise? Every example is concise. Long, rambling demonstrations train a long, rambling model.
- No leftover junk. Stray HTML, broken encoding, duplicated rows, half-finished answers — the model can't tell trash from signal. It learns all of it.
- Hold some out. Keep ~10% aside, never trained on, so that on Day 6 you can test whether the model generalized or just memorized.
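Several of the rules above can be checked mechanically. The sketch below assumes single-turn examples in the `{"messages": [...]}` shape used in this lesson. The helper names (`find_problems`, `clean_and_split`), the HTML regex, and the toy data are all hypothetical, and the checks are deliberately crude. It drops exact duplicates and obviously broken rows, then sets aside roughly 10% as a hold-out set.

```python
import json
import random
import re

# Crude detector for leftover markup such as <p> or </div>.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def find_problems(example):
    """Return a list of issues found in one {"messages": [...]} example."""
    problems = []
    messages = example.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"unexpected roles: {roles}")
    for m in messages:
        text = m.get("content", "")
        if not text.strip():
            problems.append(f"empty {m.get('role')} content")
        if HTML_TAG.search(text):
            problems.append("stray HTML tag")
        if "\ufffd" in text:  # Unicode replacement character
            problems.append("broken encoding")
    return problems


def clean_and_split(examples, holdout_fraction=0.1, seed=0):
    """Drop duplicates and flagged rows, then return (train, holdout)."""
    seen, kept = set(), []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)
        if key in seen or find_problems(ex):
            continue
        seen.add(key)
        kept.append(ex)
    random.Random(seed).shuffle(kept)  # fixed seed keeps the split reproducible
    n_holdout = max(1, round(len(kept) * holdout_fraction))
    return kept[n_holdout:], kept[:n_holdout]


def make(question, answer):
    return {"messages": [{"role": "user", "content": question},
                         {"role": "assistant", "content": answer}]}


# Toy data: 20 clean rows plus three rows that should be rejected.
raw = [make(f"Question {i}", f"Answer {i}.") for i in range(20)]
raw.append(make("Question 0", "Answer 0."))      # exact duplicate
raw.append(make("Question 20", "<p>Answer</p>"))  # stray HTML
raw.append(make("Question 21", "   "))            # half-finished answer

train, holdout = clean_and_split(raw)
print(len(train), len(holdout))  # prints: 18 2
```

A script like this only catches the mechanical junk. Contradictory styles and off-voice answers still need a human read.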
The non-technical version
Training a new hire by example: if every sample report you hand them is sharp, consistent, and in the house style, they'll write sharp, consistent, on-style reports. If half your samples are great and half are rushed garbage you grabbed off an old drive, they'll average the two and you'll wonder why their work is uneven. The model is the most literal trainee you'll ever have. It copies exactly what you show it — your standards and your shortcuts alike.
~/cuda-week/finetune/example.jsonl
# One line = one training example (before templating).
# Shown wrapped here for readability; in the file it is a single line.
{"messages": [
{"role": "user",
"content": "Write a listing description for a 3bd/2ba bungalow in Austin, 1850 sqft, updated kitchen, big backyard."},
{"role": "assistant",
"content": "Tucked into a quiet Austin street, this 3-bed, 2-bath bungalow lives larger than its 1,850 square feet. The updated kitchen anchors an easy, open flow, and the oversized backyard is the real headline — room to garden, host, or just breathe. Move-in ready, walkable, and priced to go."}
]}
# Notice: the VOICE is the product. Every example's reply
# is written the same confident, concise way. That consistency
# is what the model actually learns.
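In an actual `.jsonl` file, each example sits on exactly one line and there are no comments. A minimal sketch of writing and reading that format with the standard library follows. The helper names, the temp-directory path, and the short sample texts are assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical examples in the same {"messages": [...]} shape as above.
examples = [
    {"messages": [
        {"role": "user", "content": "Write a listing description for a 2bd/1ba condo."},
        {"role": "assistant", "content": "Bright, efficient, and close to everything."},
    ]},
    {"messages": [
        {"role": "user", "content": "Write a listing description for a 4bd/3ba ranch."},
        {"role": "assistant", "content": "Room to spread out,\nindoors and out."},
    ]},
]


def write_jsonl(path, rows):
    """Write one JSON object per line; json.dumps escapes embedded newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
write_jsonl(path, examples)
loaded = read_jsonl(path)
print(len(loaded), loaded[0]["messages"][0]["role"])  # prints: 2 user
```

Writing through `json.dumps` rather than by hand means quotes and newlines inside a reply are escaped correctly, so one malformed row can't silently break the rest of the file.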
Write ten training examples by hand for the behavior you want — really write them, don't generate them. Ten is enough to feel the discipline: holding one voice, one length, one format across all ten is harder than it sounds, and that difficulty is exactly the work that makes a fine-tune good. The capstone dataset will be bigger, but this is where you learn what "good" looks like.
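Once your ten are written, a quick length check shows whether you really held one length across them. This is a rough sketch: the function names, the 50% tolerance, and the placeholder replies are assumptions, and word count is only a proxy for what the model will see as tokens.

```python
import statistics


def word_counts(replies):
    """Word count of each assistant reply."""
    return [len(reply.split()) for reply in replies]


def length_report(replies):
    """Summary statistics for reply length, in words."""
    counts = word_counts(replies)
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "stdev": statistics.pstdev(counts),
    }


def length_outliers(replies, tolerance=0.5):
    """Indices of replies whose length is more than `tolerance` (as a
    fraction of the median) away from the median length."""
    counts = word_counts(replies)
    median = statistics.median(counts)
    return [i for i, c in enumerate(counts) if abs(c - median) > tolerance * median]


# Placeholder replies: nine of about 50 words and one that rambles to 120.
replies = [" ".join(["word"] * n) for n in (48, 50, 52, 49, 51, 50, 47, 53, 50, 120)]

print(length_report(replies)["max"])  # prints: 120
print(length_outliers(replies))       # prints: [9]
```

Any index it flags is an example to rewrite or cut, since the model would treat that length as part of the spec too.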
You can't fix bad data with a good learning rate. The dataset is the part of the fine-tune that's actually yours — guard its quality like the asset it is.