One user is wasted GPU. Today we go from solo to service.
Week 1 was about understanding what CUDA is and why your DGX Spark is the right hardware. Week 2 is about using it like a service, not a toy. By Sunday you'll have a real serving stack on your desk — vLLM running Qwen3 7B at 1,000+ tokens/sec across multiple concurrent users, with metrics, batching, and the production techniques every AI company runs in their datacenter.
A single user chatting with Qwen3 7B on your DGX uses about 10% of the GPU's capacity. The rest is idle silicon — gross margin sitting on a shelf. The techniques in this week are how OpenAI, Anthropic, and every production AI company turn that idle capacity into revenue. You'll do the same on your desk.
The week, at a glance
Week 2 is live as a full daily arc. Each lesson keeps the same promise: plain-English mental model first, then the serving habit, metric, or experiment that makes it real on your own DGX.
Inference vs Serving
The conceptual move — "inference" is one prompt, "serving" is a queue. Why they're different engineering problems and where teams confuse them.
The Three Numbers That Matter
Tokens/sec, time-to-first-token, latency vs throughput. What to measure, what to ignore, what your users actually feel.
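To make those numbers concrete, here is a minimal sketch of how they could be computed from per-request timestamps. The `RequestTrace` record and its field names are made up for illustration; a real benchmark harness will log things differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (in seconds) for one request. Field names are illustrative."""
    sent_at: float         # when the client sent the prompt
    first_token_at: float  # when the first generated token arrived
    done_at: float         # when the last generated token arrived
    output_tokens: int     # how many tokens were generated


def time_to_first_token(t: RequestTrace) -> float:
    # What the user feels as "did it start answering yet?"
    return t.first_token_at - t.sent_at


def end_to_end_latency(t: RequestTrace) -> float:
    # Total wait for one complete answer.
    return t.done_at - t.sent_at


def aggregate_tokens_per_sec(traces: List[RequestTrace]) -> float:
    # Throughput of the whole service: all generated tokens divided by
    # the wall-clock window that covers every request.
    start = min(t.sent_at for t in traces)
    end = max(t.done_at for t in traces)
    return sum(t.output_tokens for t in traces) / (end - start)


# Two overlapping requests with made-up timings.
traces = [
    RequestTrace(sent_at=0.0, first_token_at=0.2, done_at=4.0, output_tokens=200),
    RequestTrace(sent_at=1.0, first_token_at=1.5, done_at=5.0, output_tokens=200),
]
for t in traces:
    print(f"TTFT {time_to_first_token(t):.2f}s, latency {end_to_end_latency(t):.2f}s")
print(f"aggregate {aggregate_tokens_per_sec(traces):.1f} tokens/sec")
```

Latency is measured per request, throughput across all of them, so the two can move in opposite directions as load increases.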
Continuous Batching
The trick that turns one GPU into a multi-user service. Why your inference is wasting 90% of the chip and how vLLM fixes it.
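A toy simulation shows the idea. It assumes the GPU can advance a fixed number of sequences by one token per step, and it ignores prefill, memory limits, and everything else a real scheduler handles. This is a sketch of the scheduling concept, not of vLLM's actual scheduler.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """Refill a slot from the queue as soon as its request finishes."""
    waiting = deque(lengths)  # tokens still to generate, per queued request
    running: List[int] = []
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished ones leave.
        running = [r - 1 for r in running if r > 1]
    return steps


# Made-up workload: a few long answers mixed with short ones, 2 slots.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print("static:    ", static_batching_steps(lengths, slots=2))
print("continuous:", continuous_batching_steps(lengths, slots=2))
```

In the static version, a short request that finishes early leaves its slot empty until the longest request in the batch is done. The continuous version hands that slot to the next request in the queue right away.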
Paged Attention
What vLLM actually does under the hood. The KV-cache memory trick borrowed from operating systems that made modern serving possible.
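The operating-systems analogy can be sketched in a few lines: instead of reserving one contiguous region per conversation, the cache hands out small fixed-size blocks on demand and keeps a per-sequence table of which blocks it owns. The `PagedKVCache` class below is a hypothetical toy that only does the bookkeeping; it stores no tensors and is not vLLM's implementation.

```python
from typing import Dict, List, Tuple


class PagedKVCache:
    """Toy block allocator: bookkeeping only, no actual key/value tensors."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # seq id -> physical blocks
        self.seq_lens: Dict[str, int] = {}            # seq id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def physical_slot(self, seq_id: str, pos: int) -> Tuple[int, int]:
        # Translate a logical token position into (block id, offset in block).
        block = self.block_tables[seq_id][pos // self.block_size]
        return block, pos % self.block_size

    def free(self, seq_id: str) -> None:
        # A finished sequence returns its blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("a")  # 17 tokens need 2 blocks
for _ in range(3):
    cache.append_token("b")  # 3 tokens need 1 block
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]), len(cache.free_blocks))
```

Because a sequence only holds the blocks it has actually filled, short conversations stop paying for worst-case length, which leaves room for more concurrent users in the same memory.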
Speculative Decoding
How to make a model run 2–3× faster than itself. Let a tiny draft model guess a few tokens ahead, verify them with the big one in a single pass, and only pay for tokens that survive.
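Here is a minimal sketch of the greedy version of that idea, with plain Python functions standing in for the two models. The toy `target` and `draft` functions are invented for the example. Production systems that sample tokens use a probabilistic accept/reject rule rather than the exact-match check shown here.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # context -> next token id


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       num_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model guesses k tokens ahead.
        guesses: List[int] = []
        for _ in range(k):
            guesses.append(draft(tokens + guesses))
        # 2. The target checks all guesses. A real engine scores every position
        #    in one forward pass; here we call it per position but count one pass.
        target_passes += 1
        for g in guesses:
            expected = target(tokens)
            tokens.append(expected)  # always the target's own choice
            if expected != g or len(tokens) >= goal:
                break                # first wrong guess ends this round
        else:
            # Every guess matched: the same pass also yields one bonus token.
            tokens.append(target(tokens))
    return tokens[:goal], target_passes


def target(ctx: List[int]) -> int:
    # Toy "big model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    # Toy "draft model": agrees with the target except right after a 7.
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


out, passes = speculative_decode(target, draft, prompt=[0], num_tokens=12)
print(out, "target passes:", passes)
```

The output is exactly what the target would have produced on its own; the saving is that it took fewer target passes than generated tokens.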
TensorRT-LLM
NVIDIA's production inference compiler: the step up from vLLM that big tech runs in its datacenters. When it's worth the complexity, when it isn't.
Capstone — Serve a Real Model
Deploy vLLM with continuous batching on your DGX. Hit 1,000+ tokens/sec aggregate throughput. Benchmark and graph latency under load. Your own production-grade inference service.
What you'll be able to answer by Sunday night
- What the difference between "inference" and "serving" actually is, and why it matters for cost.
- Why your DGX Spark can comfortably serve 20-30 simultaneous Qwen3 7B users — and the math behind that number.
- How vLLM works under the hood, in language a CEO can follow.
- When to use vLLM, when to graduate to TensorRT-LLM, and when neither is worth the trouble.
- What "1,000 tokens per second aggregate throughput" feels like when you watch it happen on your own hardware.
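As a first rough check on how the user count and the aggregate throughput target relate, you can divide one by the other. The sketch below assumes the 1,000 tokens/sec figure is shared evenly across users, which is a simplification and not necessarily the full math the lessons work through.

```python
def per_user_tokens_per_sec(aggregate_tps: float, users: int) -> float:
    """Even split of aggregate throughput across concurrent users (assumption)."""
    return aggregate_tps / users


# Figures taken from this page: 1,000 tokens/sec aggregate, 20 to 30 users.
for users in (20, 30):
    rate = per_user_tokens_per_sec(1000, users)
    print(f"{users} users -> about {rate:.0f} tokens/sec each")
```

Under that even-split assumption, each user still sees roughly 33 to 50 tokens per second.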
What you need before Day 1
Carryover from Week 1
- DGX Spark on your network, Ollama installed, CUDA verified.
- Comfort SSH-ing in and running a Python script.
- The intuition from Week 1, Day 5 about throughput vs latency. We're going to make those numbers move with your own hands this week.
New this week
- ~100 GB of free disk for vLLM's model cache and benchmark logs.
- Patience for one config that won't work on the first try — production inference always demands a debug round.
Week 1 taught you what your hardware is. Week 2 teaches you what it can do for other people. Same chip, ten times the value — if you know how to share the GPU correctly.