DAY 03 · CONTINUOUS BATCHING

The GPU should not wait for everyone to finish their sentence.

Continuous batching is the core serving trick behind modern LLM systems. It lets new requests join active work instead of waiting for a whole static batch to complete.

Core idea

Static batching groups requests, runs them together, and waits until the longest one finishes before starting the next group. Continuous batching keeps the GPU busy by adding and removing sequences as users arrive and finish.

Why it matters

LLM requests have wildly different lengths. Without continuous batching, short requests wait behind long generations, and the slots they free sit empty until the whole batch finishes.

The picture

The shape to keep in your head for continuous batching.

Static batching was built for neat rectangles

Classic batching works beautifully when every item has roughly the same shape. LLM serving is messier. One user asks for a three-word answer; another asks for a market memo; a third pastes a long document. Static batches force that uneven work into a rigid schedule.

Continuous batching treats the active batch as a living thing. Finished sequences leave. New sequences enter. The server keeps feeding the GPU without waiting for a full batch boundary.
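
To put a number on the difference, here is a minimal sketch that counts decode steps under each policy. It is a toy model, not how any real server is implemented: every request is assumed to be waiting at time zero, each active sequence emits exactly one token per step, and the function names, lengths, and batch size are made up for illustration.

```python
from collections import deque

def static_batch_steps(lengths, batch_size):
    """Decode steps when each batch must fully drain before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        steps += max(batch)  # the whole batch waits for its longest sequence
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled on the very next step."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step: every active sequence emits one token;
        # sequences that just emitted their last token leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps

# Hypothetical output lengths (tokens to generate), chosen to be uneven.
lengths = [3, 40, 5, 4, 6, 38, 2, 7]
print(static_batch_steps(lengths, 4), continuous_batch_steps(lengths, 4))  # 78 42
```

Same work, same batch size, but the static schedule pays for the longest sequence in every batch, while the continuous one is bounded mostly by the longest single request.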

Why this matters for local hardware

On a DGX Spark, you do not have infinite GPUs to hide bad scheduling. The easiest way to waste the box is to serve one request at a time. Continuous batching is how one model becomes a shared resource.

This is also why vLLM became such a big deal: it packaged the scheduling and memory ideas into something builders could actually run.

The user-facing tradeoff

Batching improves throughput, but it can add queueing delay. The art is choosing policies that keep the GPU full without making the first token feel dead. That is why Day 2’s metrics matter before Day 3’s tuning.

The non-technical version

Static batching is like waiting for a full elevator, riding together, then making everyone exit before the next group can enter. Continuous batching is like an escalator: people step on and off while the machine keeps moving.

That is why the idea is so powerful for chat. Conversations are uneven. Some end quickly, some go long, and new ones arrive at weird times. The scheduler keeps the escalator full without freezing the whole system at every stop.

What vLLM is really buying you

It keeps active requests grouped so the GPU sees useful parallel work.
It admits new requests without waiting for every existing request to finish.
It pairs batching with KV-cache management, because scheduling and memory are tied together.
It exposes a normal API so the hard machinery hides behind a simple service shape.
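
The link between scheduling and memory in the list above can be sketched as an admission check: a request joins the batch only if there is both a free slot and enough KV-cache room. This is an illustrative toy, not vLLM's code. The block size, the `admit` function, and the choice to reserve worst-case memory up front are all assumptions made for the sketch; production schedulers are typically more dynamic than this.

```python
from collections import deque

BLOCK_TOKENS = 16  # hypothetical number of tokens stored per KV-cache block

def blocks_needed(num_tokens, block_tokens=BLOCK_TOKENS):
    """Ceiling division: blocks required to hold num_tokens of KV cache."""
    return -(-num_tokens // block_tokens)

def admit(waiting, free_blocks, free_slots):
    """Admit requests in arrival order while a batch slot AND memory are available.

    waiting: deque of (prompt_tokens, max_new_tokens) tuples.
    Returns (admitted_requests, remaining_free_blocks).
    """
    admitted = []
    while waiting and free_slots > 0:
        prompt_tokens, max_new_tokens = waiting[0]
        need = blocks_needed(prompt_tokens + max_new_tokens)
        if need > free_blocks:
            break  # head of the queue does not fit; stop rather than reorder
        waiting.popleft()
        admitted.append((prompt_tokens, max_new_tokens))
        free_blocks -= need
        free_slots -= 1
    return admitted, free_blocks

queue = deque([(100, 28), (10, 6), (500, 500)])
admitted, left = admit(queue, free_blocks=10, free_slots=4)
print(len(admitted), left, len(queue))  # 2 1 1
```

Two slots are still free at the end, but the third request stays queued because memory, not batch width, is the limit. That is the sense in which scheduling and KV-cache management cannot be tuned separately.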

How to know it is working

Run the same load test against naive one-at-a-time serving and then against vLLM. You are looking for aggregate throughput (tokens/sec summed across all requests) to rise much faster than per-request latency does. If tokens/sec improves but p95 latency becomes ugly, the scheduler is winning on the server and losing on the product.
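
One way to make that comparison mechanical is to reduce each run to two numbers. The sketch below assumes your load-test tool can record a start time, an end time, and an output token count per request; the `summarize` helper and the tuple layout are hypothetical, and the timings in the usage lines are invented to exercise the function, not measurements.

```python
import math

def summarize(requests):
    """requests: list of (start_s, end_s, output_tokens) tuples from one run."""
    latencies = sorted(end - start for start, end, _ in requests)
    wall_clock = max(end for _, end, _ in requests) - min(start for start, _, _ in requests)
    total_tokens = sum(tokens for _, _, tokens in requests)
    # Nearest-rank p95: the smallest latency that at least 95% of requests met.
    rank = math.ceil(0.95 * len(latencies))
    return {
        "tokens_per_sec": total_tokens / wall_clock,
        "p95_latency_s": latencies[rank - 1],
    }

# Invented timings: four requests arrive together at t=0.
one_at_a_time = [(0, 2, 100), (0, 4, 100), (0, 6, 100), (0, 8, 100)]  # served in turn
batched = [(0, 3, 100), (0, 3, 100), (0, 3, 100), (0, 3, 100)]        # served together

print(summarize(one_at_a_time))
print(summarize(batched))
```

Measure latency from the moment the request arrives, not from the moment the server starts working on it. Otherwise queueing delay disappears from the numbers and the one-at-a-time run looks better than it feels.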

Vocabulary to keep

Static batch

Everyone starts together; everyone waits for the batch rhythm.

Continuous batch

Requests enter and leave while decoding continues.

GPU utilization

Higher because empty slots get reused quickly.

Tradeoff

More throughput can mean more queueing delay and more scheduling complexity.

Hands-on direction

If the batch looks like a train that only leaves the station every few minutes, it is static. If it looks like an escalator with people stepping on and off, it is continuous.

Mental simulation

time 0: A enters, starts decoding
time 1: B enters while A is still decoding
time 2: C enters; A finishes and leaves
time 3: D takes A's slot

The GPU keeps working on active sequences instead of idling until a whole batch drains and a new one is assembled.
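
The timeline above can be replayed in a few lines. This is a sketch of the idea rather than a real scheduler: the three-slot limit, the `simulate` function, and the per-request token counts are assumptions picked so that the trace matches the timeline.

```python
def simulate(arrivals, num_slots):
    """arrivals: dict of name -> (arrival_time, tokens_to_generate).

    Returns one trace line per time step.
    """
    slots = [None] * num_slots  # each slot holds [name, remaining_tokens] or None
    pending = sorted(arrivals.items(), key=lambda kv: kv[1][0])
    trace = []
    t = 0
    while pending or any(slots):
        events = []
        # Admit requests that have arrived into whatever slots are free.
        for i in range(num_slots):
            if slots[i] is None and pending and pending[0][1][0] <= t:
                name, (_, tokens) = pending.pop(0)
                slots[i] = [name, tokens]
                events.append(f"{name} enters slot {i}")
        busy = any(slots)
        # One decode step: each active sequence emits a token; finished ones leave.
        for i, slot in enumerate(slots):
            if slot is not None:
                slot[1] -= 1
                if slot[1] == 0:
                    events.append(f"{slot[0]} finishes")
                    slots[i] = None
        fallback = "decoding continues" if busy else "idle"
        trace.append(f"time {t}: " + "; ".join(events or [fallback]))
        t += 1
    return trace

# Hypothetical token counts; A needs 3 tokens so it finishes during time 2.
requests = {"A": (0, 3), "B": (1, 5), "C": (2, 3), "D": (3, 2)}
for line in simulate(requests, num_slots=3):
    print(line)
```

The line to watch is time 3: D lands in slot 0, the slot A gave up one step earlier, while B and C keep decoding untouched.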

Continuous batching turns scattered chats into a steady stream of GPU work.