The GPU should not wait for everyone to finish their sentence.
Continuous batching is the core serving trick behind modern LLM systems. It lets new requests join active work instead of waiting for a whole static batch to complete.
Static batching groups requests, runs them together, and waits for the longest one to finish before the next group starts. Continuous batching keeps the GPU busy by adding and removing sequences as users arrive and finish.
LLM requests have wildly different lengths. Without continuous batching, short prompts wait behind long generations and the GPU spends too much time underfilled.
The picture
Static batching was built for neat rectangles
Classic batching works beautifully when every item has roughly the same shape. LLM serving is messier. One user asks for a three-word answer; another asks for a market memo; a third pastes a long document. Static batches force that uneven work into a rigid schedule.
Continuous batching treats the active batch as a living thing. Finished sequences leave. New sequences enter. The server keeps feeding the GPU without waiting for a full batch boundary.
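To make the difference concrete, here is a minimal sketch that counts decode steps under both policies. It is a toy model, not how any real server is implemented: every request is assumed to be queued at time zero, one step produces one token per active sequence, and the function names, batch size, and output lengths are all made up for illustration.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch)  # the whole batch waits for its longest sequence
    return total


def continuous_batch_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue right away."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # One decode step yields one token per active sequence;
        # sequences that reach zero remaining tokens leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


lengths = [3, 50, 4, 5, 40, 2]  # hypothetical output lengths, in tokens
print(static_batch_steps(lengths, batch_size=2))      # 95
print(continuous_batch_steps(lengths, batch_size=2))  # 52
```

With these made-up lengths, the static schedule spends 95 steps because each pair waits on its longest member, while the continuous schedule finishes the same 104 tokens in 52 steps by keeping both slots full.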
Why this matters for local hardware
On a DGX Spark, you do not have infinite GPUs to hide bad scheduling. The easiest way to waste the box is to serve one request at a time. Continuous batching is how one model becomes a shared resource.
This is also why vLLM became such a big deal: it packaged the scheduling and memory ideas into something builders could actually run.
The user-facing tradeoff
Batching improves throughput, but it can add queueing delay. The art is choosing policies that keep the GPU full without making the first token feel dead. That is why Day 2’s metrics matter before Day 3’s tuning.
The non-technical version
Static batching is like waiting for a full elevator, riding together, then making everyone exit before the next group can enter. Continuous batching is like an escalator: people step on and off while the machine keeps moving.
That is why the idea is so powerful for chat. Conversations are uneven. Some end quickly, some go long, and new ones arrive at weird times. The scheduler keeps the escalator full without freezing the whole system at every stop.
What vLLM is really buying you
- It keeps active requests grouped so the GPU sees useful parallel work.
- It admits new requests without waiting for every existing request to finish.
- It pairs batching with KV-cache management, because scheduling and memory are tied together.
- It exposes a normal API so the hard machinery hides behind a simple service shape.
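The link between scheduling and memory in the list above can be sketched with some arithmetic. Each cached token costs roughly 2 (keys and values) x layers x KV heads x head dimension x bytes per value. The admission rule below is a deliberately simple assumption that reserves each request's worst-case length up front; it is not vLLM's actual allocator, which manages the cache in fixed-size blocks. The model shape, memory budget, and request sizes are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def admit(waiting, active_tokens, budget_bytes, per_token_bytes):
    """Admit queued requests, in order, while their worst-case KV footprint fits."""
    admitted = []
    used = active_tokens * per_token_bytes
    for req_id, max_tokens in waiting:
        need = max_tokens * per_token_bytes
        if used + need > budget_bytes:
            break  # first-come, first-served: stop at the first request that does not fit
        admitted.append(req_id)
        used += need
    return admitted


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 4 * 2**30  # assume 4 GiB is set aside for the KV cache
queue = [("a", 8000), ("b", 16000), ("c", 16000)]  # (request id, max tokens)

print(per_token)  # 131072 bytes, i.e. 128 KiB per token
print(admit(queue, active_tokens=4000, budget_bytes=budget, per_token_bytes=per_token))  # ['a', 'b']
```

The point of the sketch is that the scheduler cannot admit a request just because a batch slot is free; it also needs room in the cache for the tokens that request will produce.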
How to know it is working
Run the same load test against naive one-at-a-time serving and then against vLLM. You are looking for aggregate throughput (tokens/sec) to rise much faster than per-request latency does. If tokens/sec improves but p95 latency becomes ugly, the scheduler is winning on the server and losing on the product.
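One way to compare the two runs is to log a start time, an end time, and a token count for each request, then reduce the log to the two numbers above. This is a sketch under assumptions: the record fields (`start`, `end`, `tokens`) are hypothetical names rather than the output of any particular load-testing tool, and p95 is computed with the simple nearest-rank method.

```python
def summarize(requests):
    """requests: list of dicts with 'start' and 'end' in seconds, plus 'tokens'."""
    wall = max(r["end"] for r in requests) - min(r["start"] for r in requests)
    latencies = sorted(r["end"] - r["start"] for r in requests)
    # Nearest-rank p95: ceil(0.95 * n), done in integer math to avoid float rounding.
    rank = (95 * len(latencies) + 99) // 100
    return {
        "tokens_per_sec": sum(r["tokens"] for r in requests) / wall,
        "p95_latency_s": latencies[rank - 1],
    }


# Made-up log: four requests start together and finish one second apart.
log = [{"start": 0, "end": e, "tokens": 10} for e in (1, 2, 3, 4)]
print(summarize(log))  # {'tokens_per_sec': 10.0, 'p95_latency_s': 4}
```

Run it once on the one-at-a-time log and once on the vLLM log, and compare both numbers side by side rather than throughput alone.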
Vocabulary to keep
If the batch looks like a train that only leaves the station every few minutes, it is static. If it looks like an escalator with people stepping on and off, it is continuous.
time 0: A enters, starts decoding
time 1: B enters while A is still decoding
time 2: C enters; A finishes and leaves
time 3: D takes A's slot
The GPU keeps working on active sequences instead of restarting from scratch.
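The timeline above can be replayed with a small event loop. This is an illustrative sketch, not a real scheduler: the three-slot limit and the token counts per request are assumptions chosen so the trace matches the picture, and `trace` is a made-up helper name.

```python
from collections import deque


def trace(arrivals, slots):
    """arrivals: list of (time, name, tokens). Returns a list of event strings."""
    pending = deque(sorted(arrivals))  # requests that have not arrived yet
    waiting = deque()                  # arrived, but no free slot so far
    active = {}                        # name -> tokens still to generate
    log = []
    t = 0
    while pending or waiting or active:
        # Move requests whose arrival time has come into the waiting queue.
        while pending and pending[0][0] <= t:
            _, name, tokens = pending.popleft()
            waiting.append((name, tokens))
        # Fill free slots without waiting for the rest of the batch to finish.
        while waiting and len(active) < slots:
            name, tokens = waiting.popleft()
            active[name] = tokens
            log.append(f"time {t}: {name} enters")
        # One decode step: every active sequence produces one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]
                log.append(f"time {t}: {name} finishes")
        t += 1
    return log


# Assumed workload: A needs 3 tokens, B needs 4, C and D need 2 each.
for line in trace([(0, "A", 3), (1, "B", 4), (2, "C", 2), (3, "D", 2)], slots=3):
    print(line)
```

In this run A finishes during step 2, so when D arrives at time 3 there is a free slot and it starts decoding immediately while B and C are still in flight.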
Continuous batching turns scattered chats into a steady stream of GPU work.