DAY 01 · INFERENCE VS SERVING

Inference is one answer. Serving is a system.

The first move in Week 2 is separating a model call from a product surface. A single prompt proves the model runs; serving proves many people can use it without wasting the GPU or hating the latency.

Core idea

Inference is the act of generating tokens for one request. Serving is everything around that act: queues, batching, scheduling, model loading, memory management, metrics, retries, and user-facing latency.

Why it matters

Many teams benchmark the wrong thing. They celebrate one fast response, then discover the system falls apart when ten users arrive at once. Serving is where GPU utilization becomes product margin.

The picture

The shape to keep in your head for Inference vs Serving.

The mistake almost everyone makes

When you run a model from a terminal, you are usually testing the happy path: one prompt, one user, one model, one answer. That is useful, but it hides the real engineering problem. A product has overlapping users, uneven prompt lengths, slow clients, retries, cold starts, and bursts.

Serving starts when the question changes from “can this model answer?” to “can this system answer for many people, predictably, without wasting the expensive part?”

The queue is the product surface

The moment two requests arrive at once, you need a policy. Which one runs first? Can they share a batch? Does the long prompt block the short one? Do you reject new work when memory is tight, or let latency climb until everyone suffers?
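To make one of these policy questions concrete, here is a minimal sketch that compares two orderings for a single worker that runs one request at a time: first-come-first-served and shortest-job-first. The service times are made-up numbers, all requests are assumed to arrive at time zero, and real servers batch requests rather than running them strictly one by one. The only point is that the ordering rule decides who waits.

```python
def completion_times(service_times, order):
    """Finish time per request index when one worker runs jobs in `order`."""
    clock = 0.0
    finished = {}
    for idx in order:
        clock += service_times[idx]
        finished[idx] = clock
    return finished


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical service times in seconds; request 0 is a long prompt that arrived first.
service_times = [8.0, 1.0, 1.0, 1.0]

fifo_order = list(range(len(service_times)))                    # arrival order
sjf_order = sorted(fifo_order, key=lambda i: service_times[i])  # shortest first

fifo = completion_times(service_times, fifo_order)
sjf = completion_times(service_times, sjf_order)

# All requests arrive at t=0, so finish time equals latency.
fifo_mean = mean(fifo.values())
sjf_mean = mean(sjf.values())

print(f"FIFO mean latency: {fifo_mean:.2f}s (short request 1 done at {fifo[1]:.1f}s)")
print(f"SJF  mean latency: {sjf_mean:.2f}s (long request 0 done at {sjf[0]:.1f}s)")
```

In this toy case the long prompt finishes at the same time under both policies, but the short requests stop waiting behind it. Shortest-first has its own cost: under steady load, long requests can be starved, which is exactly the kind of trade-off a policy has to state out loud.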

A good serving stack makes those policies explicit. vLLM, TensorRT-LLM, Triton Inference Server, and production gateways all exist because raw model inference is only one piece of the product.

What changes on the DGX Spark

A single Qwen3 8B chat may barely wake the GPU. That does not mean the hardware is overkill; it means the workload is under-shaped. Week 2 is about shaping the workload so the box behaves like shared infrastructure instead of a private calculator.

The non-technical version

Imagine a restaurant with one chef and one table. Cooking one meal at a time is easy. The moment ten tables arrive, the job changes: someone has to seat people, pace orders, keep the kitchen full, avoid cold food, and decide when the room is too full to accept more guests.

The model is the chef. Serving is the whole restaurant system. If the chef is brilliant but the host stand is chaos, customers still have a bad night.

What to notice when you run it

If the first request feels fast but the fifth request feels stuck, you are seeing queueing.
If GPU utilization is low while users are waiting, the server is not feeding the GPU work efficiently.
If memory jumps before compute saturates, your bottleneck may be context and KV cache, not raw math.
If streaming starts quickly but the answer drags, separate time-to-first-token from full completion time.
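
The memory symptom above is easier to reason about with rough arithmetic. The sketch below estimates KV cache size for a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values). These numbers are illustrative assumptions, not the configuration of any particular model, and the estimate ignores weights, activations, and allocator overhead.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for `tokens` cached tokens of one sequence."""
    # Factor of 2: each token stores one key and one value vector
    # per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only to keep the arithmetic easy to follow.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
per_sequence = kv_cache_bytes(8192, LAYERS, KV_HEADS, HEAD_DIM)
sixteen_users = 16 * per_sequence

print(f"per token:         {per_token / 1024:.0f} KiB")
print(f"8192-token chat:   {per_sequence / 1024**3:.0f} GiB")
print(f"16 such sequences: {sixteen_users / 1024**3:.0f} GiB")
```

Under these assumptions, one long conversation costs about as much memory as a small model's worth of weights, and sixteen of them cost sixteen times that. Memory filling up while compute sits idle is the expected result, not a bug.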

The operator question

Do not ask only, “Can the model answer?” Ask, “How many people can this setup serve before the experience bends?” That second question is where Week 2 lives.

Vocabulary to keep

Inference

One request enters the model and tokens come out.

Serving

A system schedules many requests through one or more models.

Utilization

The share of time the expensive silicon spends doing useful work.
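
As a rough illustration of that definition, the sketch below computes utilization as the busy fraction of a measurement window. The `busy_intervals` list is hypothetical data standing in for whatever your server or profiler reports; overlapping intervals are counted once.

```python
def utilization(busy_intervals, window_start, window_end):
    """Fraction of the window covered by at least one busy interval."""
    covered = 0.0
    cursor = window_start  # everything before `cursor` is already counted
    for start, end in sorted(busy_intervals):
        start = max(start, cursor)   # skip time already counted
        end = min(end, window_end)   # clip to the window
        if end > start:
            covered += end - start
            cursor = end
    return covered / (window_end - window_start)


# Hypothetical (start, end) times in seconds when the GPU was running work.
busy_intervals = [(0.0, 2.0), (1.0, 3.0), (6.0, 7.0)]

util = utilization(busy_intervals, window_start=0.0, window_end=10.0)
print(f"utilization: {util:.0%}")
```

A low number here while requests are queued points at the feeding problem, not at the hardware.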

User experience

The latency, streaming feel, and reliability people actually notice.
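
Two of the latencies people notice in a streamed answer are time-to-first-token and full completion time. The sketch below measures both around any iterable of tokens. `fake_token_stream` is a hypothetical stand-in for a real streaming response, with made-up delays; in practice you would pass the iterator your client library returns.

```python
import time


def fake_token_stream(n_tokens, first_delay=0.05, per_token_delay=0.01):
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay)           # queueing and prompt processing
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(per_token_delay)   # one generation step per token
        yield f"tok{i}"


def measure_stream(stream, clock=time.perf_counter):
    """Consume a token stream and report first-token and total latency."""
    start = clock()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = clock() - start    # time until the first token arrived
        count += 1
    total = clock() - start
    return {"ttft_s": ttft, "total_s": total, "tokens": count}


stats = measure_stream(fake_token_stream(5))
print(stats)
```

Reporting the two numbers separately keeps a slow start and a slow tail from hiding behind one average.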

Hands-on direction

A useful serving diagram has at least one queue, one GPU-bound step, and one user-facing latency measurement. If your sketch is only “prompt -> model -> answer,” you are still thinking about inference, not serving.

~/cuda-week/serving/request_log.md

# Draw this before touching vLLM

User -> API endpoint -> queue -> model server -> GPU -> streamed tokens

For each arrow, write down:
- what could wait here?
- what could fail here?
- what should we measure here?

The product is not the model call. The product is the serving loop around the model call.
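
One way to start answering "what should we measure here?" is to record a timestamp at each arrow and derive the waits from the differences. The sketch below is a minimal version of such a request log entry. The field names and example timestamps are hypothetical, not taken from any particular serving framework.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    request_id: str
    received_s: float     # API endpoint accepted the request
    dequeued_s: float     # request left the queue for the model server
    first_token_s: float  # first streamed token was sent
    finished_s: float     # last token was sent

    def queue_wait(self):
        return self.dequeued_s - self.received_s

    def time_to_first_token(self):
        return self.first_token_s - self.received_s

    def generation_time(self):
        return self.finished_s - self.first_token_s

    def total_latency(self):
        return self.finished_s - self.received_s


# Hypothetical timestamps in seconds, relative to request arrival.
log = RequestLog("req-1", received_s=0.0, dequeued_s=0.5,
                 first_token_s=0.75, finished_s=2.25)

print(f"queue wait:      {log.queue_wait():.2f}s")
print(f"first token:     {log.time_to_first_token():.2f}s")
print(f"generation time: {log.generation_time():.2f}s")
print(f"total latency:   {log.total_latency():.2f}s")
```

With even this much per request, the diagram stops being a sketch and starts telling you which arrow the time went to.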