WEEK 02 OF 12 · INFERENCE AT SCALE

One user is wasted GPU. Today we go from solo to service.

Week 1 was about understanding what CUDA is and why your DGX Spark is the right hardware. Week 2 is about using it like a service, not a toy. By Sunday you'll have a real serving stack on your desk: vLLM running Qwen3 7B at 1,000+ tokens/sec of aggregate throughput across multiple concurrent users, with metrics, batching, and the production techniques AI companies run in their datacenters.

The arc: Wk 1 · CUDA, what & why → Wk 2 · Inference at scale → Wk 3 · Fine-tuning

Why this week matters

A single user chatting with Qwen3 7B on your DGX uses about 10% of the GPU's capacity. The rest is idle silicon: gross margin sitting on a shelf. This week's techniques are how OpenAI, Anthropic, and other production AI companies turn that idle capacity into revenue. You'll do the same on your desk.

The week, at a glance

Week 2 is live as a full daily arc. Each lesson keeps the same promise: plain-English mental model first, then the serving habit, metric, or experiment that makes it real on your own DGX.

DAY 01 · MON

Inference vs Serving

The conceptual move — "inference" is one prompt, "serving" is a queue. Why they're different engineering problems and where teams confuse them.

Lesson live
DAY 02 · TUE

The Three Numbers That Matter

Tokens/sec, time-to-first-token, latency vs throughput. What to measure, what to ignore, what your users actually feel.

Lesson live
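
Day 2's numbers can all be derived from per-token timestamps. The sketch below is a minimal illustration, not the course's own tooling: the `RequestTrace` class, its field names, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request (hypothetical structure)."""
    sent_at: float
    token_times: list  # arrival time of each generated token


def time_to_first_token(trace):
    # What the user feels first: the wait before anything appears.
    return trace.token_times[0] - trace.sent_at


def end_to_end_latency(trace):
    # Total wait until the last token arrives.
    return trace.token_times[-1] - trace.sent_at


def aggregate_tokens_per_sec(traces):
    # Throughput across all users: total tokens over the wall-clock window.
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


# Made-up timings for two concurrent users.
user_a = RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0])
user_b = RequestTrace(sent_at=0.0, token_times=[1.0, 1.5, 2.0])

ttft_a = time_to_first_token(user_a)
ttft_b = time_to_first_token(user_b)
latency_a = end_to_end_latency(user_a)
throughput = aggregate_tokens_per_sec([user_a, user_b])
```

Aggregate throughput describes the service as a whole, while time-to-first-token and latency describe what each individual user experiences. A setup can score well on one and poorly on the other.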
DAY 03 · WED ⭐

Continuous Batching

The trick that turns one GPU into a multi-user service. Why single-user inference wastes about 90% of the chip and how vLLM fixes it.

Lesson live
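
The scheduling idea behind Day 3 can be sketched as a toy simulation. It assumes each decode step produces one token for every running request and counts steps only; it illustrates the idea and is not vLLM's actual scheduler. The function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    # A static batch runs until its longest request finishes,
    # so short requests leave their slots idle while they wait.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    # Free slots are refilled from the queue at every decode step.
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        active = [remaining - 1 for remaining in active]  # one decode step
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# Output lengths in tokens for four queued requests (made-up values).
lengths = [2, 8, 2, 8]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these made-up lengths, the static schedule takes 16 steps and the continuous one takes 12, because a finished short request gives up its slot immediately instead of waiting for its batch partner.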
DAY 04 · THU

Paged Attention

What vLLM actually does under the hood. The KV-cache memory trick borrowed from operating systems that made modern serving possible.

Lesson live
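
The operating-systems analogy from Day 4 can be sketched as a block table: the KV cache is cut into fixed-size blocks, and each sequence holds a list of block ids instead of one contiguous region. The `PagedKVCache` class below is a hypothetical bookkeeping toy that tracks block ownership only. It is not vLLM's implementation and stores no tensors.

```python
class PagedKVCache:
    """Toy block allocator for KV-cache bookkeeping (hypothetical)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full
        # (or when the sequence has no block yet).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        # A finished sequence returns its blocks to the pool right away.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")  # 20 tokens need 2 blocks of 16
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 block

blocks_for_a = len(cache.block_tables["a"])
free_before = len(cache.free_blocks)
cache.free("a")
free_after = len(cache.free_blocks)
```

Memory is claimed one block at a time as a sequence grows, so nothing is reserved up front for a maximum length that most requests never reach.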
DAY 05 · FRI

Speculative Decoding

How to make a model run 2–3× faster than itself. Let a tiny draft model guess several tokens ahead, verify them all in one pass with the big one, only pay for tokens that survive.

Lesson live
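
Day 5's draft-and-verify loop can be sketched with two toy "models". This is the greedy-verification variant, assuming deterministic decoding; production systems that sample use a probabilistic acceptance rule instead. The functions `target_next` and `draft_next` are hypothetical stand-ins, and the verification loop stands in for what a real system does in a single batched forward pass.

```python
def target_next(ctx):
    # Stand-in for the big model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    # Stand-in for the draft model: agrees with the target,
    # except that it guesses wrong right after a 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


def speculative_step(target, draft, context, k):
    # 1. The draft model proposes k tokens, one after another.
    proposal = []
    ctx = list(context)
    for _ in range(k):
        token = draft(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target checks every proposed position.
    accepted = []
    ctx = list(context)
    for token in proposal:
        expected = target(ctx)
        if token != expected:
            # The first mismatch is replaced by the target's own token
            # and the rest of the draft is discarded.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. If every draft token survived, the target adds one more.
    accepted.append(target(ctx))
    return accepted


new_tokens = speculative_step(target_next, draft_next, context=[0], k=4)
```

Here the draft gets three of its four guesses right, so one verification round yields four tokens, and the output is exactly what the target model alone would have produced.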
DAY 06 · SAT

TensorRT-LLM

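NVIDIA's production inference compiler and runtime: an alternative to vLLM, not a layer on top of it, that trades setup complexity for hardware-specific optimization and that large deployments run in their datacenters. When it's worth the complexity, when it isn't.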

Lesson live
DAY 07 · SUN ⭐

Capstone — Serve a Real Model

Deploy vLLM with continuous batching on your DGX. Hit 1,000+ tokens/sec aggregate throughput. Benchmark and graph latency under load. Your own production-grade inference service.

Lesson live

What you'll be able to answer by Sunday night

What you need before Day 1

Carryover from Week 1

New this week

The big idea

Week 1 taught you what your hardware is. Week 2 teaches you what it can do for other people. Same chip, ten times the value — if you know how to share the GPU correctly.
