The KV cache is memory management, not magic.
Paged Attention is the memory trick that made vLLM practical. It treats the KV cache more like an operating system treats memory: broken into reusable blocks instead of fragile contiguous slabs.
Every active sequence needs key/value cache memory. Paged Attention stores that cache in blocks so memory can be allocated, reused, and shared more efficiently.
Serving fails when memory fragments or the KV cache outgrows the GPU. Better cache management means more concurrent users, longer contexts, and fewer wasted gigabytes.
The picture
What the KV cache is doing
During generation, the model does not want to recompute the entire past for every new token. It keeps key/value tensors from previous tokens so attention can look back cheaply. That cache grows with context length and active users.
In serving, the KV cache can become the real bottleneck. The model weights may fit comfortably, but concurrent contexts eat memory fast.
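To put rough numbers on that, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte fp16 values) is an assumed 7B-class configuration picked for illustration, and the function name is hypothetical. The user count and context length are also just example values.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed model shape (illustrative only).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

active_users = 20
avg_context_tokens = 2_000
total = per_token * avg_context_tokens * active_users

print(f"per token: {per_token / 2**20:.2f} MiB")
print(f"per user:  {per_token * avg_context_tokens / 2**30:.2f} GiB")
print(f"all users: {total / 2**30:.2f} GiB")
```

Under these assumptions each token costs 0.5 MiB, so twenty 2,000-token conversations need roughly 19.5 GiB of cache, which is more than the fp16 weights of a 7B-parameter model (about 14 GB).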
Why paging is the right metaphor
Operating systems do not require every program to own one perfect contiguous slab of memory. They break memory into pages. vLLM borrows that mental model for attention cache blocks.
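Because blocks are fixed-size and handed out only as a sequence grows, the server can reuse freed blocks, handle different sequence lengths, and limit waste to the last partially filled block of each sequence. It is not glamorous, but it is one of the reasons vLLM can keep more requests in flight than naive serving that reserves one contiguous region per request.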
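The paging idea can be sketched as a toy allocator: a pool of fixed-size blocks, a free list, and a per-sequence block table that maps a sequence to whichever blocks it happens to hold. This is a simplified illustration, not vLLM's actual allocator. The class name, method names, and block size of 16 are all assumptions made for the example.

```python
class BlockAllocator:
    """Toy pool of fixed-size KV blocks (illustrative, not vLLM's real code)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused blocks
        self.block_tables = {}  # seq_id -> list of block ids, in order
        self.num_tokens = {}    # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        # Grab a new block only when every block the sequence holds is full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def free(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the pool for reuse.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block
print(alloc.block_tables)    # blocks need not be adjacent or ordered

alloc.free("a")
print(len(alloc.free_blocks), "blocks free after sequence 'a' finishes")
```

The blocks a sequence owns do not have to sit next to each other, which is the point: any free block can serve any sequence.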
The product implication
Long context is not free. More users are not free. Every extra token kept alive has a memory cost. Paged Attention does not remove that cost; it makes the cost manageable enough to build with.
The non-technical version
The KV cache is the model’s scratchpad for every active conversation. Without it, the model would keep rereading the entire conversation history every time it writes the next word. With it, the model remembers the expensive parts it already computed.
The catch is that every active conversation needs scratchpad space. Paged Attention makes that scratchpad more like a set of reusable notebook pages instead of one giant notebook assigned to each user.
Why this changed serving
- Uneven request lengths waste less memory because cache blocks are allocated as needed and reused once freed.
- Long prompts become easier to mix with short prompts.
- The server can admit more active sequences before memory runs out.
- Continuous batching becomes more practical because active requests can enter and leave cleanly.
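A small calculation makes the first bullet concrete. The comparison below assumes a naive baseline that reserves a full maximum-length slab per request, and a paged scheme that rounds each request up to a whole number of blocks. The request lengths, the 2,048-token maximum, and the block size of 16 are made-up example values.

```python
import math


def reserved_tokens_contiguous(lengths, max_len):
    # Assumed naive baseline: every request reserves room for max_len tokens.
    return len(lengths) * max_len


def reserved_tokens_paged(lengths, block_size):
    # Paged scheme: each request rounds up to a whole number of blocks.
    return sum(math.ceil(n / block_size) * block_size for n in lengths)


lengths = [120, 900, 35, 2_000, 410]  # tokens actually used per request
used = sum(lengths)
contiguous = reserved_tokens_contiguous(lengths, max_len=2_048)
paged = reserved_tokens_paged(lengths, block_size=16)

print(f"used:       {used} token slots")
print(f"contiguous: {contiguous} reserved ({contiguous - used} wasted)")
print(f"paged:      {paged} reserved ({paged - used} wasted)")
```

With these numbers the contiguous baseline reserves 10,240 token slots for 3,465 tokens of real data, while the paged scheme reserves 3,504.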
What to watch on your box
If concurrency fails before compute looks maxed, suspect memory. If long-context users make everyone slower, suspect KV pressure. The serving question is not just “how fast is the GPU?” It is also “how much conversation state can I keep alive?”
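One way to turn "how much conversation state can I keep alive?" into a number is a rough capacity budget. The sketch below is an estimate under stated assumptions, not a measurement: an 80 GiB GPU, 14 GiB of weights, 10% of memory held back for activations and overhead, and 512 KiB of KV cache per token. The function name and all parameters are hypothetical.

```python
GIB = 2**30


def max_live_tokens(gpu_bytes: int, weight_bytes: int,
                    kv_bytes_per_token: int, reserve_percent: int = 10) -> int:
    """Rough upper bound on tokens of KV cache that fit at once."""
    # Hold back a slice of memory for activations and other overhead.
    usable = gpu_bytes * (100 - reserve_percent) // 100 - weight_bytes
    return max(0, usable // kv_bytes_per_token)


tokens = max_live_tokens(
    gpu_bytes=80 * GIB,             # assumed GPU size
    weight_bytes=14 * GIB,          # assumed model weights
    kv_bytes_per_token=512 * 1024,  # assumed per-token KV cost
)
print(f"{tokens} tokens of live context")
print(f"about {tokens // 2_000} conversations at 2,000 tokens each")
```

If observed concurrency tops out near a figure like this while GPU utilization is still low, that points at KV memory rather than compute.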
Vocabulary to keep
When a serving run crashes at higher concurrency, do not only blame compute. Ask whether the KV cache filled memory first.
    active_users = 20
    avg_context_tokens = 2_000
    # More users * longer context = more KV cache pressure
    # Paged Attention helps manage the blocks; it does not make memory infinite.
Paged Attention made LLM serving feel less like one giant tensor and more like an operating system problem.