Attention Is All You Need
The original Transformer paper. It replaced recurrence with self-attention and made the modern LLM stack possible.
A Transformer lets every token compare itself with every other token in the sequence at the same time, then decide which context matters.
It removed the old “read one step at a time” bottleneck and made language models much easier to train on GPUs at huge scale.
The paper is easiest to understand as a change in traffic pattern: stop passing one memory down a line; let every token build context from the whole table at once.
The quick digest
Before this paper, strong sequence models (mostly recurrent networks such as LSTMs) read text like a careful person moving left-to-right, carrying a hidden memory forward. That worked, but it was slow and awkward: each step had to wait for the one before it, so training could not run in parallel across positions, and long-range relationships were hard to keep crisp.
The Transformer says: put all the tokens on the table at once. For each token, ask which other tokens help explain it. In “the dog chased the ball because it was excited,” the word “it” needs to know whether dog or ball matters. Attention is the scoring system that learns those relationships. Multi-head attention just means the model runs several relationship detectors in parallel: one head might track grammar, another might track references, another might track phrase structure.
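The scoring step described above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, written with plain Python lists to stay dependency-free. The token names and the 2-dimensional vectors are made up for the example, and the same vectors are reused as queries, keys, and values; a real model would derive them from learned projections.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))
        ])
    return outputs, weights

# Hypothetical toy vectors: "it" is deliberately placed close to "dog".
tokens = ["dog", "ball", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(x, x, x)

for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row shows how one token spreads its attention over all three tokens. Because the toy vector for "it" was chosen to resemble "dog", its row puts more weight on "dog" than on "ball"; in a trained model that preference would be learned rather than hand-set.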
The paper also adds the supporting parts that make this usable: positional encodings so the model knows word order, residual connections and layer normalization so information survives many layers, and position-wise feed-forward blocks so each token can be transformed after it gathers context. The punchline: this architecture parallelizes well, trained faster than the recurrent and convolutional models it was compared against, and became the skeleton underneath GPT-style LLMs.
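Of the supporting parts, positional encoding is the easiest to see in code. The sketch below follows the sinusoidal scheme from the paper, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The sequence length, the model width, and the all-zero embeddings are placeholder values chosen only to make the output easy to read.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table of sinusoidal position signals."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)

# Placeholder embeddings (all zeros) so the sum shows the position signal alone.
embeddings = [[0.0] * 6 for _ in range(4)]
inputs = [
    [e + p for e, p in zip(emb_row, pe_row)]
    for emb_row, pe_row in zip(embeddings, pe)
]

for pos, row in enumerate(inputs):
    print(pos, [round(v, 3) for v in row])
```

The encoding is simply added to the token embeddings before the first layer, so two identical words at different positions enter the model as different vectors.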
What to remember
Read it like this
- First pass: Start with Figure 1 and stay there until you can point to where information mixes between tokens.
- Second pass: Read the sections on scaled dot-product attention and multi-head attention.
- Then build taste: Save the BLEU tables for last; the architecture is the enduring contribution.
Build one causal attention head and print the attention matrix, or plot it as a heatmap. If the pattern makes intuitive sense, the whole paper becomes less scary.
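One possible starting point for that exercise, using only the standard library. Everything here is an assumption made for illustration: the sentence, the sizes `d_model` and `d_k`, and the random (untrained) embeddings and projection matrices `w_q` and `w_k`. With untrained weights the numbers carry no linguistic meaning; the point is to see the causal shape of the matrix.

```python
import math
import random

def softmax(scores):
    """Convert scores to weights summing to 1; -inf scores become 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def causal_attention_weights(x, w_q, w_k):
    """Attention matrix in which token i can only look at tokens 0..i."""
    queries = [project(w_q, tok) for tok in x]
    keys = [project(w_k, tok) for tok in x]
    d_k = len(keys[0])
    matrix = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        matrix.append(softmax(scores))
    return matrix

random.seed(0)  # fixed seed so repeated runs print the same matrix
tokens = ["the", "dog", "chased", "the", "ball"]
d_model, d_k = 4, 3  # hypothetical sizes, far smaller than a real model

x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in tokens]
w_q = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_k)]
w_k = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_k)]

attn = causal_attention_weights(x, w_q, w_k)
for tok, row in zip(tokens, attn):
    print(f"{tok:>7}", " ".join(f"{w:.2f}" for w in row))
```

The printed matrix is lower-triangular: the first token can only attend to itself, and every later row spreads its weight over the tokens that came before it. That mask is what lets a decoder train on all positions in parallel without peeking at the answer.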