Let a small model draft. Let the big model judge.
Speculative decoding speeds up generation by asking a cheaper model to propose tokens, then having the larger model verify them in chunks.
Instead of making the big model produce every token one at a time, a small draft model guesses several tokens ahead. The large model checks those guesses, keeps the longest prefix that matches what it would have produced itself, and discards the rest.
If many drafted tokens survive, the user gets the same answer faster. The win is not a different model; it is a better division of labor during decoding.
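To make the accept-and-reject step concrete, here is a minimal sketch of one speculative step under greedy decoding. Everything in it is an assumption for illustration: `accept_prefix`, the `NextToken` interface, and the toy target model are hypothetical names, not any library's API. A real system scores all drafted positions in one forward pass of the large model; the loop below only simulates that check.

```python
from typing import Callable, List

Token = str
# Hypothetical interface: given the tokens so far, return the model's greedy next token.
NextToken = Callable[[List[Token]], Token]


def accept_prefix(context: List[Token], drafted: List[Token], target_next: NextToken) -> List[Token]:
    """Return the tokens emitted by one speculative step (greedy matching)."""
    kept: List[Token] = []
    for tok in drafted:
        expected = target_next(context + kept)
        if tok != expected:
            # First mismatch: drop the remaining drafts and emit the big model's token.
            kept.append(expected)
            return kept
        kept.append(tok)
    # Every draft survived, so the same verification also yields one extra token.
    kept.append(target_next(context + kept))
    return kept


# Toy "large model" that always continues a fixed sentence.
SENTENCE = "the cat sat on the mat".split()


def toy_target(context: List[Token]) -> Token:
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


# Draft is wrong at the third token: two drafts kept, then the correction.
print(accept_prefix(["the"], ["cat", "sat", "in", "the"], toy_target))  # ['cat', 'sat', 'on']
# Draft is fully right: both drafts kept, plus one bonus token.
print(accept_prefix(["the", "cat"], ["sat", "on"], toy_target))  # ['sat', 'on', 'the']
```

Either way the emitted tokens are exactly what the large model would have produced on its own, which is why the answer stays the same.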
The picture
The bottleneck: one expensive step at a time
Autoregressive generation is sequential. Token 100 depends on token 99. That makes LLM decoding feel like walking down a hallway one door at a time.
Speculative decoding asks whether a cheaper model can run ahead and guess several doors. The large model then checks all of those guesses in a single forward pass, which is usually far cheaper than generating the same tokens one step at a time.
When speculation helps
It helps when the draft model is cheap and often right. Boilerplate, predictable completions, code patterns, and constrained outputs can be friendly territory. It helps less when the draft model guesses poorly or the verification overhead eats the win.
This is why “2-3x faster” is not a law. It is a workload-dependent prize.
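A rough back-of-the-envelope model shows why the prize depends on the workload. The sketch below assumes each drafted token is accepted independently with probability `alpha`, that the draft proposes `k` tokens per step, that one draft step costs `draft_cost` relative to one large-model step, and that verifying a chunk costs about one large-model step. These are simplifying assumptions, and the function names are made up for this example.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per large-model verification pass.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is the cost of one draft step relative to one large-model step.
    Assumes a verification pass costs the same as one ordinary decoding step.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1)


# Cheap, accurate draft: a clear win.
print(round(estimated_speedup(alpha=0.8, k=4, draft_cost=0.05), 2))
# Pricier, inaccurate draft: slower than not speculating at all.
print(round(estimated_speedup(alpha=0.3, k=4, draft_cost=0.2), 2))
```

Under these assumptions the first setup comes out near 2.8x while the second lands below 1x, so the same technique can be a speedup or a slowdown depending on acceptance rate and draft cost.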
How to think like a serving engineer
Speculative decoding is a reminder that serving work is full of systems trades. You are not changing the model’s knowledge. You are changing how much expensive compute you spend per accepted token.
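One way to put a number on "expensive compute per accepted token" is to count passes and divide by tokens emitted. This is a small sketch with hypothetical names and made-up counts; the unit is one large-model step, so plain decoding costs 1.0 per token by definition.

```python
def cost_per_token(big_passes: int, draft_steps: int, tokens_emitted: int, draft_cost: float) -> float:
    """Average cost per emitted token, measured in large-model steps.

    draft_cost is the assumed cost of one draft step relative to one large-model step.
    """
    return (big_passes + draft_steps * draft_cost) / tokens_emitted


# Illustrative counts only: 100 verification passes, 400 draft steps, 300 tokens out.
print(cost_per_token(big_passes=100, draft_steps=400, tokens_emitted=300, draft_cost=0.05))
```

Anything below 1.0 means the speculative run spent less large-model compute per token than plain decoding would have.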
The non-technical version
Think of the small model as an intern drafting the next few words and the large model as the editor. If the intern is usually right, the editor can approve a chunk at once. If the intern guesses badly, the editor spends time rejecting work and the shortcut stops being a shortcut.
That is the whole idea: speculation is not automatically faster. It is faster when cheap guesses match what the big model would have said anyway.
Where it tends to shine
- Structured outputs where the next tokens are predictable.
- Boilerplate prose, code patterns, and repetitive completions.
- Workloads where first-token latency is already acceptable and generation length is the pain.
- Setups where the draft model is cheap enough that wrong guesses do not dominate the bill.
The experiment to run
Compare normal decoding against speculative decoding with the same prompts and output lengths. Record accepted tokens, rejected tokens, total latency, and output quality. If the answer is equally good and the accepted-token rate is high, speculation earned its keep.
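The comparison can be sketched as a small harness. The `Generate` signature below is an assumption: it stands in for whatever your serving stack exposes, and the two `fake_*` functions are placeholders so the example runs on its own. Swap in real calls to measure anything meaningful.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical signature: prompt -> (output_text, drafted_tokens, accepted_tokens).
# A plain decoder reports 0 drafted and 0 accepted.
Generate = Callable[[str], Tuple[str, int, int]]


@dataclass
class RunStats:
    seconds: float
    drafted: int
    accepted: int
    outputs: List[str]

    @property
    def rejected(self) -> int:
        return self.drafted - self.accepted

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.drafted if self.drafted else 0.0


def run(generate: Generate, prompts: List[str]) -> RunStats:
    """Run every prompt through one decoder and collect totals."""
    start = time.perf_counter()
    drafted = accepted = 0
    outputs: List[str] = []
    for prompt in prompts:
        text, d, a = generate(prompt)
        outputs.append(text)
        drafted += d
        accepted += a
    return RunStats(time.perf_counter() - start, drafted, accepted, outputs)


# Placeholder decoders so the sketch is runnable; replace with real ones.
def fake_baseline(prompt: str) -> Tuple[str, int, int]:
    return prompt.upper(), 0, 0


def fake_speculative(prompt: str) -> Tuple[str, int, int]:
    return prompt.upper(), 10, 7


prompts = ["hello", "world"]
base = run(fake_baseline, prompts)
spec = run(fake_speculative, prompts)
print(f"same outputs: {base.outputs == spec.outputs}")
print(f"accepted: {spec.accepted}, rejected: {spec.rejected}, rate: {spec.acceptance_rate:.0%}")
print(f"latency: baseline {base.seconds:.4f}s, speculative {spec.seconds:.4f}s")
```

Exact string equality is a fair check under greedy decoding. With sampling, outputs can differ from run to run, so compare quality with whatever evaluation you already trust rather than expecting identical text.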
Vocabulary to keep
If the draft model is wrong half the time, speculation may be theater. If it is right most of the time, the big model gets to move in larger steps.
drafted_tokens = 1000
accepted_tokens = 720
acceptance_rate = accepted_tokens / drafted_tokens
print(f"acceptance: {acceptance_rate:.0%}")
# High acceptance: speculation can help.
# Low acceptance: you are paying for guesses you throw away.
Speculation is useful when cheap guesses are often right enough to save expensive steps.