PAPER 01 · Transformer core

Attention Is All You Need

Vaswani et al., 2017 · Paper

The original Transformer paper. It replaced recurrence with self-attention and made the modern LLM stack possible.

Core concept

At its core, a Transformer lets every token compare itself with every other token in the sequence at the same time, then weigh which context matters.

Why it mattered

It removed the old “read one step at a time” bottleneck and made language models much easier to train on GPUs at huge scale.

Visual shortcut · The whole move

The paper is easiest to understand as a change in traffic pattern: stop passing one memory down a line; let every token build context from the whole table at once.

How it works

Put all tokens in the sequence on the table together.

For each token, score which other tokens matter.

Run several relationship detectors in parallel as attention heads.

Add positional encodings, residual connections, and feed-forward layers so the model knows word order and can be stacked deeply.
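
For reference, the steps above correspond to a handful of equations in the paper. The notation follows the paper: Q, K, and V are the query, key, and value matrices, d_k is the key dimension, h is the number of heads, and the W matrices are learned projections.

```latex
% Score which tokens matter, then mix their values
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

% Several heads in parallel, concatenated and projected
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}

% Residual connection around each sub-layer, followed by layer normalization
\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)

% Position-wise feed-forward block
\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2
```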

The quick digest

Before this paper, the strongest sequence models were mostly recurrent networks such as LSTMs, which read text like a careful person moving left-to-right, carrying a hidden memory forward. That worked, but it was slow and awkward: later words had to wait for earlier words, so computation within a sequence could not be parallelized, and relationships between distant words tended to fade as the memory was overwritten.

The Transformer says: put all the tokens on the table at once. For each token, ask which other tokens help explain it. In “the dog chased the ball because it was excited,” the word “it” needs to know whether it refers to the dog or the ball. Attention is the scoring system that learns those relationships. Multi-head attention just means the model runs several relationship detectors in parallel: one head might track grammar, another might track references, another might track phrase structure.
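
A minimal sketch of that scoring system, using only the Python standard library. Everything here is illustrative: the helper names (softmax, matmul, attention, multi_head_attention) are made up for this example, the embeddings are random stand-ins, and random matrices take the place of the learned projections. That means the printed weights will not actually resolve “it” to “dog”; the point is only to show the shape of the computation.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # one score per (query token, key token) pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, num_heads, rng):
    """Run num_heads independent heads and concatenate their outputs."""
    d_model = len(x[0])
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Assumption: random projections stand in for learned W_Q, W_K, W_V.
        w_q, w_k, w_v = (random_matrix(d_model, d_head, rng) for _ in range(3))
        out, weights = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        head_outputs.append(out)
        head_weights.append(weights)
    # Concatenate each token's per-head outputs along the feature axis.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)  # stand-in for the output projection
    return matmul(concat, w_o), head_weights

rng = random.Random(0)
tokens = ["the", "dog", "chased", "the", "ball", "because", "it", "was", "excited"]
x = random_matrix(len(tokens), 8, rng)  # stand-in embeddings, d_model = 8
output, head_weights = multi_head_attention(x, num_heads=2, rng=rng)

# How much the token "it" attends to every token, according to head 0.
for token, w in zip(tokens, head_weights[0][tokens.index("it")]):
    print(f"{token:>8} {w:.3f}")
```

Each head produces its own weight matrix over the same tokens, which is what lets different heads specialize once the projections are trained.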

The paper also adds the supporting parts that make this usable: positional encodings so the model knows word order, residual connections so information survives many layers, and feed-forward blocks so each token can be transformed after it gathers context. The punchline: this architecture trains fast, parallelizes well, and became the skeleton underneath GPT-style LLMs.
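
Two of those supporting parts are small enough to sketch directly. The position table below follows the sinusoidal formula from the paper (sine on even dimensions, cosine on odd ones). The function names are invented for this example, and the residual helper is a simplification: the paper also applies layer normalization after the addition, which is left out here.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position table: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2j and 2j+1 share the frequency 1 / 10000^(2j / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position table to token embeddings element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

def residual(x, sublayer):
    """Return x + sublayer(x), so the input survives the sub-layer.

    Simplification: layer normalization is omitted.
    """
    return [[a + b for a, b in zip(x_row, s_row)]
            for x_row, s_row in zip(x, sublayer(x))]

# Two identical embeddings become distinguishable once positions are added.
same_word_twice = [[0.5] * 6, [0.5] * 6]
for row in add_positions(same_word_twice):
    print([round(value, 3) for value in row])
```

Without the position table, attention would treat the sequence as an unordered set, since the scores depend only on token content.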

What to remember

One-liner

Attention is learned context routing.

Why it matters

The breakthrough was parallel comparison, not just better word prediction.

Builder instinct

Modern LLMs are mostly this idea scaled, simplified, and optimized.

Read it like this

First pass: Study Figure 1, the architecture diagram, until you can point to where information mixes between tokens.
Second pass: Read the sections on scaled dot-product attention and multi-head attention.
Then build taste: Save the BLEU tables for last; the architecture is the enduring contribution.

Build instinct

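Build one causal attention head and print or plot its attention matrix as a heatmap. The matrix should be lower-triangular, because each token can only attend to itself and earlier tokens. If the heatmap makes intuitive sense, the whole paper becomes less scary.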
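
One possible starting point, in plain Python with no dependencies. This is a sketch under stated assumptions: the queries and keys are random rather than learned, the function names are hypothetical, and the "heatmap" is just ASCII shading, where darker characters mean larger weights.

```python
import math
import random

def causal_attention_weights(q, k):
    """Return the attention matrix for one head with a causal mask."""
    d_k = len(k[0])
    weights = []
    for i, q_i in enumerate(q):
        # Token i may only look at positions 0..i (itself and earlier tokens).
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Future positions are masked out, so their weight is exactly 0.
        weights.append([e / total for e in exps] + [0.0] * (len(k) - i - 1))
    return weights

def print_heatmap(tokens, weights):
    """Print one row per query token; darker characters mean more attention."""
    shades = " .:-=+*#%@"
    for token, row in zip(tokens, weights):
        cells = "".join(shades[min(int(w * len(shades)), len(shades) - 1)]
                        for w in row)
        print(f"{token:>8} |{cells}|")

rng = random.Random(0)
tokens = ["the", "dog", "chased", "the", "ball"]
d_k = 4
# Assumption: random queries and keys stand in for learned projections.
q = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in tokens]
k = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in tokens]

weights = causal_attention_weights(q, k)
print_heatmap(tokens, weights)
```

The first row always puts all of its weight on the first token, since there is nothing earlier to attend to, and every row sums to 1.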
