Ship the local model as if other people will use it.
The Week 2 capstone is a real serving shape: model server, concurrent requests, benchmark numbers, and a graph that explains what your DGX Spark can handle.
The goal is not another chat demo. The goal is a service you can reason about: Qwen3 7B behind vLLM, load tests, aggregate throughput, latency under concurrency, and a clean explanation of the limits.
This is the moment the box becomes infrastructure. You stop saying “I can run a model” and start saying “I can serve users.”
The acceptance test
By the end of the capstone, you want three artifacts: a running local endpoint, a small benchmark log, and a short interpretation of what the numbers mean.
The endpoint proves the model serves. The benchmark proves it survives concurrency. The interpretation proves you understand the system instead of just running commands.
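A minimal sketch of the first check, proving the endpoint serves. It assumes the server exposes the OpenAI-compatible `/v1/completions` route that vLLM's server provides, listening on `http://localhost:8000`. The base URL, the `MODEL_NAME` placeholder, and the helper names are illustrative assumptions, so substitute whatever your own server reports.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server runs locally on vLLM's default port
MODEL_NAME = "your-model-id"        # placeholder: use the model id your server was started with


def build_completion_request(prompt, max_tokens=32, base_url=BASE_URL, model=MODEL_NAME):
    """Build (but do not send) a POST to the OpenAI-compatible completions route."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(prompt="Say hello in one sentence.", timeout=60):
    """Send one request and return the generated text; raises on HTTP or network errors."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["text"]
```

Calling `smoke_test()` with the server running should return a short completion. If it raises instead, fix that before running any load test.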
What to deploy
Use Qwen3 7B because it is small enough to serve comfortably and real enough to expose serving behavior. The model choice is less important than the service envelope: prompt size, output length, concurrency, time-to-first-token, throughput, and failure mode.
You can swap the model later. The serving discipline stays.
What “done” feels like
You should be able to say: at this concurrency and output length, my DGX Spark feels safe; above this point, latency climbs or memory pressure appears. That sentence is more valuable than a single hero screenshot.
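One way to turn that sentence into a repeatable check is sketched below. It assumes you already have one result per concurrency level and that you choose your own latency budget. The `LevelResult` fields and the function name are hypothetical, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class LevelResult:
    concurrency: int      # number of simultaneous requests in this pass
    p95_latency_s: float  # 95th-percentile total latency, in seconds
    errors: int           # failed or timed-out requests in this pass


def highest_safe_concurrency(results, latency_budget_s):
    """Return the largest concurrency level that, like every lower level tested,
    stayed within the latency budget with zero errors. Returns None if even the
    lowest level failed."""
    safe = None
    for level in sorted(results, key=lambda r: r.concurrency):
        if level.errors > 0 or level.p95_latency_s > latency_budget_s:
            break  # the first level that bends ends the safe range
        safe = level.concurrency
    return safe
```

The function stops at the first level that misses the budget, so a lucky pass at a higher concurrency does not widen the envelope.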
The capstone story
You are building the smallest honest version of production serving: a model behind an API, multiple users hitting it at once, measurements that separate server capacity from user experience, and a written service envelope.
The point is not to pretend your desk is a hyperscaler. The point is to understand the same shape at human scale. Queue, scheduler, memory, GPU, metrics, tradeoffs. That is the whole game in miniature.
Run it in passes
- Pass 1: one request, just to prove the endpoint and model are alive.
- Pass 2: five concurrent requests, to see batching start to matter.
- Pass 3: twenty concurrent requests, to expose queueing and tail latency.
- Pass 4: one long-context case, to see KV-cache pressure show up.
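The passes above can be driven by a small harness like the sketch below. It deliberately takes the request function as a parameter, so `fake_send` stands in for a real HTTP call and the example runs without a server. The function names and the record fields are assumptions made for illustration.

```python
import asyncio
import time


async def run_pass(send, prompts, concurrency):
    """Run `send(prompt)` for every prompt with at most `concurrency` in flight.

    `send` is any coroutine function that returns the number of output tokens.
    Returns (records, wall_s): one dict per prompt plus the wall-clock time.
    """
    limiter = asyncio.Semaphore(concurrency)

    async def one(prompt):
        async with limiter:
            started = time.perf_counter()
            try:
                tokens = await send(prompt)
            except Exception as exc:  # count failures instead of aborting the pass
                return {"ok": False, "latency_s": time.perf_counter() - started,
                        "output_tokens": 0, "error": repr(exc)}
            return {"ok": True, "latency_s": time.perf_counter() - started,
                    "output_tokens": tokens}

    wall_start = time.perf_counter()
    records = await asyncio.gather(*(one(p) for p in prompts))
    return list(records), time.perf_counter() - wall_start


async def fake_send(prompt):
    """Stand-in for a real request: sleeps briefly, then reports a token count."""
    await asyncio.sleep(0.01)
    return len(prompt.split())
```

For example, `asyncio.run(run_pass(fake_send, ["a b c"] * 5, concurrency=5))` mimics Pass 2. To hit a real server, replace `fake_send` with a coroutine that performs the HTTP call; a blocking client can be wrapped with `asyncio.to_thread`.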
What to publish to yourself
Finish with a small table and one paragraph: concurrency, aggregate tokens/sec, p50 TTFT, p95 total latency, error count, and what broke first. That artifact is the real deliverable because it turns a pile of commands into operational judgment.
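Each row of that table can be computed from the raw per-request log, as in this sketch. It assumes each request was recorded as a dict with the keys `ok`, `ttft_s`, `latency_s`, and `output_tokens`; those names are hypothetical. It uses the nearest-rank percentile definition, which is one of several common conventions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(concurrency, records, wall_s):
    """Build one table row from the records of a single pass.

    `wall_s` is the wall-clock duration of the whole pass, so tokens_per_s is
    aggregate server throughput rather than per-request speed.
    """
    ok = [r for r in records if r["ok"]]
    return {
        "concurrency": concurrency,
        "tokens_per_s": sum(r["output_tokens"] for r in ok) / wall_s,
        "p50_ttft_s": percentile([r["ttft_s"] for r in ok], 50) if ok else None,
        "p95_latency_s": percentile([r["latency_s"] for r in ok], 95) if ok else None,
        "errors": len(records) - len(ok),
    }
```

Dividing total output tokens by wall-clock time, rather than averaging per-request rates, is what separates server capacity from what each user experiences.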
Vocabulary to keep
A good capstone result is not “the model answered.” It is “the model served 20 concurrent users with acceptable latency, and here is where it started to bend.”
[ ] Start model server
[ ] Send one request
[ ] Send 5 concurrent requests
[ ] Send 20 concurrent requests
[ ] Record TTFT, total latency, output tokens
[ ] Write the service envelope in plain English
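Recording TTFT requires a streamed response, because the clock has to stop when the first piece of output arrives. The sketch below times any iterator of text chunks. `open_stream` is a hypothetical zero-argument callable that sends the request and returns that iterator, and the `clock` parameter exists only so the timing can be tested.

```python
import time


def time_stream(open_stream, clock=time.perf_counter):
    """Time one streamed response.

    The clock starts before `open_stream()` is called, so connection and queueing
    time count toward TTFT, which is what a user would actually feel.
    """
    started = clock()
    ttft_s = None
    pieces = []
    for chunk in open_stream():
        if ttft_s is None and chunk:  # first non-empty chunk marks time-to-first-token
            ttft_s = clock() - started
        pieces.append(chunk)
    latency_s = clock() - started
    return {
        "ttft_s": ttft_s,          # None if nothing was generated
        "latency_s": latency_s,    # total time until the stream ended
        "chunks": len(pieces),     # only a proxy for output tokens
        "text": "".join(pieces),
    }
```

The chunk count is not guaranteed to equal the token count, so prefer the token usage reported by the server when it is available.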
A model on your desk is a toy until you can describe its service envelope.