PAPER 02 · Intuition builder

The Illustrated Transformer

Jay Alammar · 2018 · Article

The best visual walkthrough of attention and tensor flow before you dive into code.

Core concept

This is the visual explainer that turns the Transformer from math vocabulary into a machine you can picture.

Why it mattered

If the original paper is the blueprint, this is the animation that shows where the tensors flow.

Visual shortcut · The picture to keep
[Figure: tokens → Q, K, V views → attention mix. A picture-first mental model.]

The article turns the Transformer into a visible pipeline: text becomes vectors, vectors exchange context, and the final vector becomes a probability distribution.

How it works
Translate every word into a vector.
Add position so order is not lost.
Use attention to blend relevant context.
Stack the block several times (six layers in the original paper) so each pass refines the representation.
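
The first two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's code: the vocabulary, the embedding table, and the names `embed` and `positional_encoding` are hypothetical, the random values stand in for learned weights, and the sinusoidal scheme from the original paper is assumed (learned position embeddings are another common choice).

```python
import math
import random


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def embed(tokens, table, d_model):
    """Look up each token's vector and add the encoding for its position."""
    out = []
    for pos, tok in enumerate(tokens):
        pe = positional_encoding(pos, d_model)
        out.append([e + p for e, p in zip(table[tok], pe)])
    return out


random.seed(0)
D_MODEL = 4
vocab = ["the", "cat", "sat"]
# Hypothetical embedding table: random numbers stand in for learned weights.
table = {w: [random.uniform(-1, 1) for _ in range(D_MODEL)] for w in vocab}

x = embed(["the", "cat", "sat"], table, D_MODEL)

# The same word at two positions gets two different input vectors,
# which is how word order survives the lookup.
same_word = embed(["cat", "cat"], table, D_MODEL)
```

Comparing `same_word[0]` with `same_word[1]` shows why the position step matters: without it, both rows would be identical.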

The quick digest

The Illustrated Transformer walks through the same architecture as Attention Is All You Need, with diagrams instead of dense notation. It shows tokens becoming embeddings, embeddings getting position information, and then each layer repeatedly mixing context through attention and transforming each token with a small feed-forward network.

The most useful part is the query/key/value mental model. A token creates a query, compares it with the keys of every token in the sequence (its own included), and uses the resulting scores, normalized with a softmax, to blend their values. That sounds abstract, but it is just a learned “what should I pay attention to?” system.
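
The query/key/value idea can be written as one small function. What follows is a sketch of single-head scaled dot-product self-attention under simplifying assumptions: plain Python lists instead of a tensor library, no biases, no masking, and hand-picked projection matrices `w_q`, `w_k`, `w_v` that are purely illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the rows of x."""
    q = matmul(x, w_q)  # what each token is looking for
    k = matmul(x, w_k)  # what each token offers
    v = matmul(x, w_v)  # what each token hands over if attended to
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Three tokens, two dimensions each (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.5, 0.0], [0.0, 0.5]]
w_v = [[1.0, 1.0], [0.0, 1.0]]

output, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` says how much one token draws from every token, and each row of `output` is the corresponding blend of value vectors.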

You come away with a picture of the repeated block: attention gathers context, the feed-forward layer processes it, residual connections preserve useful signals, and a final linear layer followed by a softmax turns the decoder's output vector into probabilities for the next token.
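
The repeated block can be sketched end to end. The code below is a rough approximation under stated assumptions: one attention head, no biases, no masking, no encoder-decoder attention, layer normalization without learned scale or shift, and random weights in place of trained ones. All names and sizes (`block`, `D_MODEL`, `D_FF`, `VOCAB`, `N_LAYERS`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def add_and_norm(x, sublayer_out):
    """Residual connection, then layer normalization, token by token."""
    return [layer_norm([a + b for a, b in zip(rx, rs)])
            for rx, rs in zip(x, sublayer_out)]


def attention(x, p):
    q, k, v = matmul(x, p["w_q"]), matmul(x, p["w_k"]), matmul(x, p["w_v"])
    scale = math.sqrt(len(k[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)


def feed_forward(x, p):
    """Position-wise network: expand, apply ReLU, project back."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, p["w_1"])]
    return matmul(hidden, p["w_2"])


def block(x, p):
    x = add_and_norm(x, attention(x, p))       # gather context
    return add_and_norm(x, feed_forward(x, p))  # process each token


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_MODEL, D_FF, VOCAB, N_LAYERS = 4, 8, 5, 2
layers = [
    {
        "w_q": rand_matrix(D_MODEL, D_MODEL, rng),
        "w_k": rand_matrix(D_MODEL, D_MODEL, rng),
        "w_v": rand_matrix(D_MODEL, D_MODEL, rng),
        "w_1": rand_matrix(D_MODEL, D_FF, rng),
        "w_2": rand_matrix(D_FF, D_MODEL, rng),
    }
    for _ in range(N_LAYERS)
]
w_out = rand_matrix(D_MODEL, VOCAB, rng)  # final linear layer

x = rand_matrix(3, D_MODEL, rng)  # three already-embedded tokens
for p in layers:
    x = block(x, p)

# Final linear layer plus softmax on the last position's vector.
logits = matmul([x[-1]], w_out)[0]
next_token_probs = softmax(logits)
```

The shape of `x` never changes as it passes through the stack, which is what lets the same block be repeated any number of times.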

What to remember

One-liner
Use it to see the machine before you study the math.
Why it matters
Query/key/value is just “what am I looking for, what do others offer, what do I take?”
Builder instinct
The repeated block is the core mental model.

Read it like this

Build instinct

Recreate the diagrams with toy tensors in a notebook: three tokens, tiny vectors, one attention head.
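
One way to start that exercise is a notebook cell that prints every intermediate tensor for each token. The sketch below makes simplifying assumptions: the query, key, and value vectors are hand-picked rather than produced by learned projections, and the token names and numbers are illustrative only.

```python
import math

# Three toy tokens with hand-picked 2-d vectors (illustrative values).
tokens = ["a", "b", "c"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


d_k = len(keys[0])
trace = []
for tok, q in zip(tokens, queries):
    # 1. Score: dot product of this token's query with every key.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    # 2. Scale by the square root of the key dimension.
    scaled = [s / math.sqrt(d_k) for s in scores]
    # 3. Softmax: turn scores into attention weights.
    weights = softmax(scaled)
    # 4. Mix: weighted sum of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    trace.append({"token": tok, "scores": scores,
                  "weights": weights, "output": output})
    print(tok,
          "scores:", scores,
          "weights:", [round(w, 3) for w in weights],
          "output:", [round(o, 3) for o in output])
```

Changing one key or value and rerunning the cell is a quick way to see which parts of the picture move.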
