DAY 07 · SERVING CAPSTONE

Ship the local model like other people can use it.

The Week 2 capstone is a real serving shape: model server, concurrent requests, benchmark numbers, and a graph that explains what your DGX Spark can handle.

Core idea

The goal is not another chat demo. The goal is a service you can reason about: Qwen3 7B behind vLLM, load tests, aggregate throughput, latency under concurrency, and a clean explanation of the limits.

Why it matters

This is the moment the box becomes infrastructure. You stop saying “I can run a model” and start saying “I can serve users.”

The picture

The shape to keep in your head for Capstone — Serve a Real Model.

The acceptance test

By the end of the capstone, you want three artifacts: a running local endpoint, a small benchmark log, and a short interpretation of what the numbers mean.

The endpoint proves the model serves. The benchmark proves it survives concurrency. The interpretation proves you understand the system instead of just running commands.

What to deploy

Use Qwen3 7B because it is small enough to serve comfortably and real enough to expose serving behavior. The model choice is less important than the service envelope: prompt size, output length, concurrency, time-to-first-token, throughput, and failure mode.

You can swap the model later. The serving discipline stays.

What “done” feels like

You should be able to say: at this concurrency and output length, my DGX Spark feels safe; above this point, latency climbs or memory pressure appears. That sentence is more valuable than a single hero screenshot.

The capstone story

You are building the smallest honest version of production serving: a model behind an API, multiple users hitting it at once, measurements that separate server capacity from user experience, and a written service envelope.

The point is not to pretend your desk is a hyperscaler. The point is to understand the same shape at human scale. Queue, scheduler, memory, GPU, metrics, tradeoffs. That is the whole game in miniature.

Run it in passes

Pass 1: one request, just to prove the endpoint and model are alive.
Pass 2: five concurrent requests, to see batching start to matter.
Pass 3: twenty concurrent requests, to expose queueing and tail latency.
Pass 4: one long-context case, to see KV-cache pressure show up.

What to publish to yourself

Finish with a small table and one paragraph: concurrency, aggregate tokens/sec, p50 TTFT, p95 total latency, error count, and what broke first. That artifact is the real deliverable because it turns a pile of commands into operational judgment.

Vocabulary to keep

Endpoint

A local API that accepts chat/completion requests.

Load test

Concurrent traffic that resembles users.

Metrics

TTFT, total latency, tokens/sec, errors.

Envelope

The safe operating range for your hardware.

Hands-on direction

A good capstone result is not “the model answered.” It is “the model served 20 concurrent users with acceptable latency, and here is where it started to bend.”

Capstone checklist

[ ] Start model server
[ ] Send one request
[ ] Send 5 concurrent requests
[ ] Send 20 concurrent requests
[ ] Record TTFT, total latency, output tokens
[ ] Write the service envelope in plain English

A model on your desk is a toy until you can describe its service envelope.