PAPER 09 · GPU efficiency

FlashAttention

Dao et al. 2022 Paper

Memory-efficient exact attention that made longer contexts and higher throughput more practical.

Core concept

FlashAttention makes attention faster by avoiding wasteful reads and writes between the GPU's large, slow main memory and its small, fast on-chip memory.

Why it mattered

It unlocked longer contexts and better throughput without changing the model’s mathematical output.

Visual shortcut · Same math, smarter memory
Fast on-chip tile: move less data, keep the math.

FlashAttention wins by changing where data lives during computation, not by changing what attention means.

How it works
See that memory traffic is the bottleneck.
Split attention into tiles.
Avoid writing giant intermediate matrices.
Return the same result faster and with less memory.
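
The steps above rely on being able to compute a softmax one tile at a time. One common way to write that running update is sketched below. The symbols are assumptions introduced for illustration: K_j and V_j are the j-th key and value tiles, d is the head dimension, T is the number of tiles, and m, ℓ, and O are the running row maximum, running denominator, and running unnormalized output, all applied per query row.

```latex
\begin{aligned}
m_0 &= -\infty, \qquad \ell_0 = 0, \qquad O_0 = 0 \\
S_j &= Q K_j^{\top} / \sqrt{d} \\
m_j &= \max\bigl(m_{j-1},\ \operatorname{rowmax}(S_j)\bigr) \\
P_j &= \exp(S_j - m_j) \\
\ell_j &= e^{m_{j-1} - m_j}\,\ell_{j-1} + \operatorname{rowsum}(P_j) \\
O_j &= e^{m_{j-1} - m_j}\,O_{j-1} + P_j V_j \\
\operatorname{Attention}(Q, K, V) &= O_T / \ell_T
\end{aligned}
```

The factor e^{m_{j-1} - m_j} rescales everything accumulated so far whenever a new tile raises the running maximum, which is why the final result matches the ordinary softmax rather than approximating it.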

The quick digest

This paper is not about making attention approximate or smarter. It is about making the same attention computation respect how GPUs actually work. Standard attention materializes an N × N score matrix for a sequence of N tokens, and writing that matrix to GPU memory and reading it back is expensive.
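
To get a feel for how large those intermediates become, here is a back-of-the-envelope sketch. The function name, the 16-head configuration, and the 2-byte (fp16) element size are assumptions chosen for illustration, not figures from the paper.

```python
def score_matrix_bytes(seq_len, num_heads=1, batch=1, bytes_per_element=2):
    """Bytes needed to store one full seq_len x seq_len score matrix
    per head and per batch item (2 bytes per element assumes fp16)."""
    return batch * num_heads * seq_len * seq_len * bytes_per_element


# Illustrative configuration: 16 heads, batch size 1.
for n in (1024, 4096, 16384):
    gib = score_matrix_bytes(n, num_heads=16) / 2**30
    print(f"seq_len={n:>6}: {gib:.2f} GiB")
```

Each 4x increase in sequence length multiplies the matrix size by 16, which is the quadratic growth that makes long contexts painful when the full matrix is stored.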

FlashAttention tiles the computation so the GPU works on manageable chunks, keeps each chunk in fast on-chip memory while it is being processed, and avoids materializing the full attention matrix. The result is exact attention with much less memory traffic.
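
A minimal pure-Python sketch of the idea follows. It is not the paper's CUDA kernel: it tiles only over keys and values, handles one query row at a time, and uses plain lists, so it shows the bookkeeping rather than the speedup. The function names and the block parameter are hypothetical. The reference version builds every score row in full; the tiled version only ever holds one block of scores plus a running maximum, denominator, and accumulator.

```python
import math


def naive_attention(Q, K, V):
    """Reference softmax attention that materializes the full score matrix."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    scores = [[scale * sum(q[i] * k[i] for i in range(d)) for k in K] for q in Q]
    out = []
    for row in scores:
        m = max(row)
        weights = [math.exp(s - m) for s in row]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, but keys/values are visited one block at a time."""
    d = len(Q[0])
    dv = len(V[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        m = float("-inf")  # running maximum of the scores seen so far
        denom = 0.0        # running softmax denominator
        acc = [0.0] * dv   # running unnormalized output row
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            s = [scale * sum(q[i] * k[i] for i in range(d)) for k in k_blk]
            m_new = max(m, max(s))
            # Rescale earlier partial sums to the new maximum.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(m - m_new)
            p = [math.exp(x - m_new) for x in s]
            denom = denom * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / denom for a in acc])
    return out
```

Comparing `tiled_attention(Q, K, V, block=b)` against `naive_attention(Q, K, V)` for several values of `b` should give matching outputs up to floating-point rounding, which is the sense in which the tiled computation is exact.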

The simple lesson: in deep learning systems, speed is often about memory movement, not raw arithmetic. FlashAttention matters because better kernels can make previously impractical context lengths, batch sizes, and serving costs suddenly practical.
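
A rough way to check whether an operation is limited by memory or by arithmetic is to estimate both times and compare them. The sketch below does that for a softmax over an N × N fp16 matrix. Every number in it is an illustrative assumption: the operation count per element, the hypothetical accelerator's 100 TFLOP/s of compute, and its 1 TB/s of memory bandwidth do not describe any specific GPU.

```python
def bottleneck(flops, bytes_moved, peak_flops_per_s, bytes_per_s):
    """Compare estimated arithmetic time with estimated data-movement time."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / bytes_per_s
    limit = "memory" if memory_s > compute_s else "compute"
    return limit, compute_s, memory_s


# Illustrative workload: softmax over an n x n fp16 score matrix.
n = 4096
elements = n * n
flops = 5 * elements        # assumed: roughly 5 operations per element
bytes_moved = 4 * elements  # read 2 bytes and write 2 bytes per element

# Hypothetical accelerator: 100 TFLOP/s compute, 1 TB/s memory bandwidth.
limit, compute_s, memory_s = bottleneck(flops, bytes_moved, 100e12, 1e12)
print(f"{limit}-bound: compute {compute_s * 1e6:.2f} us, memory {memory_s * 1e6:.2f} us")
```

Under these assumptions the data movement takes far longer than the arithmetic, so reducing memory traffic helps more than speeding up the math.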

What to remember

One-liner
The same math can be much faster if memory movement is smarter.
Why it matters
GPU bottlenecks are often IO, not arithmetic.
Builder instinct
Kernel work can unlock product capability.

Read it like this

Build instinct

Benchmark ordinary attention versus a FlashAttention-backed implementation on sequence lengths that stress memory.
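
One possible shape for such a benchmark is sketched below. It is a framework-agnostic timing harness, and the names `time_call`, `benchmark`, and `make_inputs` are hypothetical. The two implementations are passed in as callables, for example a plain attention function and a FlashAttention-backed one from whatever library is in use.

```python
import time


def time_call(fn, args, repeats=3):
    """Return the best wall-clock time over several runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(implementations, seq_lens, make_inputs, repeats=3):
    """Time each named implementation at each sequence length.

    implementations: dict mapping a label to a callable
    make_inputs: function taking seq_len and returning a tuple of arguments
    """
    results = []
    for n in seq_lens:
        args = make_inputs(n)
        row = {"seq_len": n}
        for name, fn in implementations.items():
            row[name] = time_call(fn, args, repeats)
        results.append(row)
    return results
```

When timing GPU code, kernels usually launch asynchronously, so each callable should synchronize the device before returning or the measured time will be too short. Recording peak memory alongside time is also worthwhile, because the ordinary implementation may run out of memory at lengths the tiled one still handles.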
