WEEK 02 OF 12 · INFERENCE AT SCALE

One user is wasted GPU. Today we go from solo to service.

Week 1 was about understanding what CUDA is and why your DGX Spark is the right hardware. Week 2 is about using it like a service, not a toy. By Sunday you'll have a real serving stack on your desk: vLLM running Qwen3 7B at 1,000+ tokens/sec in aggregate across multiple concurrent users, with metrics, batching, and the production techniques AI companies run in their datacenters.

The arc: Wk 1 · CUDA, what & why → Wk 2 · Inference at scale → Wk 3 · Fine-tuning

Why this week matters

A single user chatting with Qwen3 7B on your DGX uses about 10% of the GPU's capacity. The rest is idle silicon — gross margin sitting on a shelf. The techniques in this week are how OpenAI, Anthropic, and every production AI company turn that idle capacity into revenue. You'll do the same on your desk.

The week, at a glance

Week 2 is live as a full daily arc. Each lesson keeps the same promise: plain-English mental model first, then the serving habit, metric, or experiment that makes it real on your own DGX.

DAY 01 · MON

Inference vs Serving

The conceptual move — "inference" is one prompt, "serving" is a queue. Why they're different engineering problems and where teams confuse them.
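
A minimal sketch of the distinction, in plain Python with no model attached. `run_inference` stands in for one prompt in, one reply out; `serve` processes requests that arrive at different times and records how long each one queued. Both names and the one-second generation time are hypothetical, chosen only for illustration.

```python
from collections import deque


def run_inference(prompt: str) -> str:
    """Stand-in for a single model call (hypothetical, no real model)."""
    return f"reply to: {prompt}"


def serve(arrivals, seconds_per_request=1.0):
    """Process (arrival_time, prompt) pairs one at a time, in arrival order.

    Returns (prompt, wait_seconds) pairs, where wait is the time a request
    spent queued before it started running.
    """
    queue = deque(sorted(arrivals))
    clock = 0.0
    waits = []
    while queue:
        arrived, prompt = queue.popleft()
        start = max(clock, arrived)  # the GPU may still be busy with an earlier request
        run_inference(prompt)
        clock = start + seconds_per_request
        waits.append((prompt, start - arrived))
    return waits


# Three users arrive almost together; a one-at-a-time server makes them queue.
waits = serve([(0.0, "a"), (0.1, "b"), (0.2, "c")])
for prompt, wait in waits:
    print(f"{prompt}: waited {wait:.1f}s")
```

The model call is identical in both cases. What changes is the queueing time, which is the part serving engineering has to manage.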

DAY 02 · TUE

The Three Numbers That Matter

Tokens/sec, time-to-first-token, latency vs throughput. What to measure, what to ignore, what your users actually feel.
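
As a rough sketch of how these numbers fall out of raw timestamps: the helper names below are hypothetical and the timestamps are made up, not measured on a DGX.

```python
def summarize_request(sent_at, token_times):
    """Per-request metrics for one streamed response.

    sent_at: when the request was sent (seconds).
    token_times: arrival time of each output token (seconds, ascending).
    """
    ttft = token_times[0] - sent_at
    latency = token_times[-1] - sent_at
    # Decode speed excludes the wait for the first token.
    decode_window = token_times[-1] - token_times[0]
    decode_tps = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return {"ttft_s": ttft, "latency_s": latency, "decode_tokens_per_s": decode_tps}


def aggregate_throughput(requests):
    """Total tokens/sec across all requests over the wall-clock window they span."""
    first_sent = min(sent for sent, _ in requests)
    last_token = max(times[-1] for _, times in requests)
    total_tokens = sum(len(times) for _, times in requests)
    return total_tokens / (last_token - first_sent)


# Two users served at the same time (illustrative timestamps only).
requests = [
    (0.0, [0.5, 1.0, 1.5, 2.0]),
    (0.0, [0.5, 1.0, 1.5, 2.0]),
]
per_user = summarize_request(*requests[0])
total = aggregate_throughput(requests)
print(per_user, total)
```

Each user here sees 2 tokens/sec of decode speed while the server as a whole delivers 4 tokens/sec. That gap between what one user feels and what the box produces is the latency versus throughput trade.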

DAY 03 · WED ⭐

Continuous Batching

The trick that turns one GPU into a multi-user service. Why your inference is wasting 90% of the chip and how vLLM fixes it.
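
A toy scheduler, not vLLM's actual implementation, can show where the waste comes from. It counts decode steps for the same set of requests under two policies. Request lengths and the slot count are made-up values.

```python
def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a finished request's slot is refilled right away."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < slots:
            running.append(waiting.pop(0))
        steps += 1
        # One decode step emits one token per running request;
        # requests that just produced their last token leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Tokens each request needs to generate (illustrative values).
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)
```

With these inputs the static policy needs 16 steps and the continuous policy needs 10, because short requests no longer hold a slot hostage while the long one finishes.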

DAY 04 · THU

Paged Attention

What vLLM actually does under the hood. The KV-cache memory trick borrowed from operating systems that made modern serving possible.
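
The operating-system analogy can be sketched as a tiny block allocator. This is an illustration of the paging idea only: the class name, block size, and pool size are assumptions, and no real KV tensors are stored.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("user-a")  # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("user-b")  # 5 tokens occupy 1 block
free_while_running = len(cache.free_blocks)
cache.free("user-a")
free_after_release = len(cache.free_blocks)
print(free_while_running, free_after_release)
```

Memory is claimed a block at a time as a sequence grows, rather than reserved up front for the longest possible reply, so the space a short conversation never uses stays available to other users.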

DAY 05 · FRI

Speculative Decoding

How to make a model run 2–3× faster than itself. A tiny draft model guesses several tokens ahead, the big model verifies them all in a single pass, and you only pay for tokens that survive.
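
A minimal greedy version of the draft-and-verify loop, using toy next-token functions in place of real models. Everything here is hypothetical scaffolding: `target_next` plays the big model, `draft_next` plays the small one, and a real engine would score all drafted positions in one batched forward pass rather than a Python loop.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding with toy next-token functions.

    Returns the generated tokens and the number of target verification passes.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens, one after another (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks each drafted position (one pass in a real engine).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # the target's token replaces the miss
                break
            accepted.append(draft[i])
        else:
            # Every draft token matched, so the target's next token comes free.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[len(prompt):][:max_new_tokens], target_passes


def target_next(tokens):
    """Toy big model: always counts upward."""
    return tokens[-1] + 1


def draft_next(tokens):
    """Toy small model: agrees with the target except on multiples of 3."""
    nxt = tokens[-1] + 1
    return nxt if nxt % 3 else 0


output, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=10)
print(output, passes)
```

The output matches what the target would have produced alone, yet it took 4 target passes instead of 10. The speedup depends entirely on how often the draft guesses right.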

DAY 06 · SAT

TensorRT-LLM

NVIDIA's production inference compiler, the step up from vLLM that big tech runs in its datacenters. When it's worth the complexity, when it isn't.

DAY 07 · SUN ⭐

Capstone — Serve a Real Model

Deploy vLLM with continuous batching on your DGX. Hit 1,000+ tokens/sec aggregate throughput. Benchmark and graph latency under load. Your own production-grade inference service.
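
One possible shape for the benchmark half of the capstone, using only the Python standard library. It assumes a vLLM OpenAI-compatible server is already listening on localhost port 8000 and that `MODEL` is replaced with the name the server was launched with; the function names and prompt are placeholders, and the lesson's own harness may look different.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: local vLLM OpenAI-compatible endpoint, placeholder model name.
URL = "http://localhost:8000/v1/completions"
MODEL = "your-model-name"


def send_request(prompt, max_tokens=128):
    """POST one completion request; return the number of generated tokens."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["usage"]["completion_tokens"]


def load_test(prompts, concurrency, send=send_request):
    """Run prompts with `concurrency` requests in flight; return aggregate tokens/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


def sweep(levels=(1, 4, 16, 32), requests_per_level=64):
    """Measure aggregate throughput at several concurrency levels."""
    prompts = ["Explain continuous batching in one paragraph."] * requests_per_level
    return {level: load_test(prompts, level) for level in levels}
```

Calling `sweep()` against a running server returns one aggregate tokens/sec figure per concurrency level, which is the raw material for a throughput-under-load graph. The `send` parameter exists so the harness can be exercised with a fake sender before pointing it at real hardware.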

What you'll be able to answer by Sunday night

What the difference between "inference" and "serving" actually is, and why it matters for cost.
Why your DGX Spark can comfortably serve 20–30 simultaneous Qwen3 7B users, and the math behind that number.
How vLLM works under the hood, in language a CEO can follow.
When to use vLLM, when to graduate to TensorRT-LLM, and when neither is worth the trouble.
What "1,000 tokens per second aggregate throughput" feels like when you watch it happen on your own hardware.
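
For the concurrent-users question above, one common back-of-envelope approach is to divide the memory left after loading weights by the KV-cache footprint of a single user. Every input below is an assumed value for illustration (layer count, KV heads, head size, context length, memory budget), not a figure from the course or a measurement, and the lesson's own derivation may use different inputs. Memory is also only one limit; bandwidth is another.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV-cache bytes for one token: a key and a value vector in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_concurrent_users(memory_budget_bytes, weight_bytes, tokens_per_user, per_token_bytes):
    """How many full-length contexts fit in the memory left after the weights."""
    kv_budget = memory_budget_bytes - weight_bytes
    return max(0, kv_budget // (tokens_per_user * per_token_bytes))


# Assumed model shape: 36 layers, 8 KV heads, head size 128, 16-bit values.
per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)

users = max_concurrent_users(
    memory_budget_bytes=48_000_000_000,  # assumed share of memory given to serving
    weight_bytes=14_000_000_000,         # about 7B parameters at 2 bytes each
    tokens_per_user=8192,                # assumed context length per user
    per_token_bytes=per_token,
)
print(per_token, users)
```

Under these assumptions each token costs 147,456 bytes of cache and 28 users fit. Halving the context length or doubling the memory budget roughly doubles the answer, which is why the inputs matter more than the formula.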

What you need before Day 1

Carryover from Week 1

DGX Spark on your network, Ollama installed, CUDA verified.
Comfort SSH-ing in and running a Python script.
The intuition from Wk 1 Day 5 about throughput vs latency — we're going to make those numbers move with your own hands this week.

New this week

~100 GB of free disk for vLLM's model cache and benchmark logs.
Patience for one config that won't work on the first try — production inference always demands a debug round.
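
A quick way to confirm the disk requirement before Day 1, run on the DGX itself. The home directory is an assumed location; point the check at whichever volume will hold your model cache.

```python
import os
import shutil


def has_free_space(path, required_gb=100):
    """Return (enough_space, free_gb) for the volume that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb, free_gb


ok, free_gb = has_free_space(os.path.expanduser("~"))
print(f"{free_gb:.0f} GB free:", "ready" if ok else "clear some space first")
```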

The big idea

Week 1 taught you what your hardware is. Week 2 teaches you what it can do for other people. Same chip, ten times the value — if you know how to share the GPU correctly.
