DAY 02 · SERVING METRICS

Tokens/sec is not enough. Users feel time.

Today is the scoreboard. You need three numbers in your head before tuning anything: aggregate throughput, time-to-first-token, and end-to-end latency.

Core idea

Throughput tells you how much work the server completes. Time-to-first-token tells you how quickly the user sees life. End-to-end latency tells you how long the full answer takes.

Why it matters

A system can have great throughput and feel slow, or feel snappy for one user and collapse under load. The trick is learning which number you are optimizing and what tradeoff you are accepting.

The picture

The shape to keep in your head for The Three Numbers That Matter.

Three numbers, three audiences

Users care about time-to-first-token because it tells them the system is alive. Engineers care about tail latency, the slow end of the end-to-end latency distribution, because it reveals contention and bad scheduling. Operators care about aggregate tokens/sec because it is the capacity number that maps to cost.

A production dashboard needs all three. A single “tok/s” screenshot is not a serving benchmark.

Why averages lie

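Average latency hides pain. If nine requests finish in about a second and one takes eleven, the mean is about two seconds and still looks fine, even though one user in ten waited far longer. That is why serving teams watch p50, p95, and p99: the latencies that 50%, 95%, and 99% of requests come in under. The tail is where product trust dies.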

In local experiments, you can start simple: record each request duration, sort the list, and inspect the slowest few. The point is to build the habit before the stack gets complicated.
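As a minimal sketch of that habit, the snippet below sorts a list of request durations and pulls out the slowest few. The durations are made-up illustrative values (nine fast requests and one slow one), not measurements from any real server.

```python
import statistics

# Hypothetical durations in milliseconds: nine fast requests and one slow one.
durations_ms = [900, 950, 1000, 1000, 1050, 1100, 1100, 1150, 1200, 11000]

mean_ms = statistics.mean(durations_ms)      # pulled upward by the single outlier
median_ms = statistics.median(durations_ms)  # what a typical request saw

# Sort slowest-first and look at the top of the list.
slowest_first = sorted(durations_ms, reverse=True)
worst_three = slowest_first[:3]

print(f"mean={mean_ms} ms  median={median_ms} ms  worst three={worst_three}")
```

The mean and the median both look tolerable here. Only the sorted list shows that one request took roughly ten times longer than the rest.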

The first real measurement loop

You do not need a full observability stack today. You need a tiny harness that sends repeated prompts and, for each request, records start time, first-token time, end time, and output length. That one CSV becomes the seed of your Week 2 taste.
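One possible shape for that harness is sketched below. It is not tied to any particular server: fake_stream is a hypothetical stand-in for whatever streaming client you use, and the prompt token count is a rough whitespace split rather than real tokenization. The column names follow the metrics CSV used in this lesson.

```python
import csv
import io
import time

FIELDS = ["request_id", "prompt_tokens", "output_tokens", "ttft_ms", "total_ms"]

def fake_stream(prompt):
    """Hypothetical stand-in for a streaming model client: yields tokens one at a time."""
    time.sleep(0.02)               # pretend queue wait and prefill before the first token
    for word in ("tokens", "arrive", "one", "by", "one"):
        time.sleep(0.005)          # pretend per-token decode time
        yield word

def measure(request_id, prompt, stream_fn):
    """Time one streamed request and return a row for the metrics CSV."""
    start = time.perf_counter()
    first_token_at = None
    output_tokens = 0
    for _token in stream_fn(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        output_tokens += 1
    end = time.perf_counter()
    return {
        "request_id": request_id,
        # Assumption: whitespace split as a rough proxy for the real token count.
        "prompt_tokens": len(prompt.split()),
        "output_tokens": output_tokens,
        # Left blank if the stream produced no tokens at all.
        "ttft_ms": "" if first_token_at is None else round((first_token_at - start) * 1000),
        "total_ms": round((end - start) * 1000),
    }

def write_csv(rows, handle):
    """Write the measured rows with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

prompts = ["Explain KV cache in one paragraph."] * 3
rows = [measure(i + 1, prompt, fake_stream) for i, prompt in enumerate(prompts)]

buffer = io.StringIO()  # swap for open("metrics.csv", "w", newline="") to keep the file
write_csv(rows, buffer)
print(buffer.getvalue())
```

To point this at a real server, replace fake_stream with a generator that yields tokens as your client receives them. The timing logic in measure does not need to change.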

How to read the numbers like a product person

High throughput means the business likes the server. Low time-to-first-token means the user trusts the interface. Low tail latency means the system is fair under pressure. You want all three, but they do not always move together.

If you pack the GPU harder, aggregate tokens/sec may rise while first-token delay gets worse. If you optimize for the first user only, the system may feel snappy in a demo and expensive in production. The work is learning which tradeoff you are making.

A simple scoring habit

p50: what a normal request feels like.
p95: what a stressed request feels like (19 in 20 requests finish faster).
max: what your worst user experienced.
errors: the requests that never became latency numbers because they failed.
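The four-line score above can be computed with a few lines of code. This is a sketch under stated assumptions: failed requests are represented as None, percentiles use the nearest-rank method (one of several common definitions), and the sample run is illustrative data rather than a real measurement.

```python
import math

def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def score(total_ms_per_request):
    """Summarize one run. None marks a request that failed and has no latency."""
    ok = sorted(ms for ms in total_ms_per_request if ms is not None)
    errors = len(total_ms_per_request) - len(ok)
    if not ok:
        return {"p50": None, "p95": None, "max": None, "errors": errors}
    return {
        "p50": nearest_rank(ok, 50),
        "p95": nearest_rank(ok, 95),
        "max": ok[-1],
        "errors": errors,
    }

# Illustrative run of ten requests; the None entry is a request that failed.
run = [4200, 3900, 6100, None, 4100, 4300, 4000, 4500, 3800, 4400]
summary = score(run)
print(summary)
```

With only nine successful requests, p95 and max land on the same value. Percentiles far into the tail only become meaningful once the run is large enough to have a tail.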

What good looks like locally

For Week 2, “good” is not a universal benchmark number. Good is a repeatable measurement loop. You should be able to run the same prompt set at 1, 5, 10, and 20 concurrent users and explain when the latency-versus-concurrency curve starts to bend.
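A concurrency sweep can be rehearsed before any model is involved. The sketch below uses asyncio against a simulated server; SLOTS, SERVICE_S, and TOKENS_PER_REQUEST are made-up parameters for illustration, and fake_request is a hypothetical placeholder for a real client call.

```python
import asyncio
import time

SLOTS = 4                 # hypothetical: the fake server handles 4 requests at once
SERVICE_S = 0.05          # hypothetical: each request takes 50 ms once it has a slot
TOKENS_PER_REQUEST = 20   # hypothetical fixed output length

async def fake_request(slots):
    """Simulated request: wait for a free slot, then 'generate' for SERVICE_S."""
    start = time.perf_counter()
    async with slots:
        await asyncio.sleep(SERVICE_S)
    return (time.perf_counter() - start) * 1000  # total_ms, including queue wait

async def run_level(users):
    """Fire `users` requests at once and summarize that concurrency level."""
    slots = asyncio.Semaphore(SLOTS)
    wall_start = time.perf_counter()
    latencies = await asyncio.gather(*(fake_request(slots) for _ in range(users)))
    wall_s = time.perf_counter() - wall_start
    return {
        "users": users,
        "worst_ms": max(latencies),
        "tokens_per_s": users * TOKENS_PER_REQUEST / wall_s,
    }

async def sweep(levels=(1, 5, 10, 20)):
    """Run the levels one after another so they do not interfere with each other."""
    return [await run_level(n) for n in levels]

results = asyncio.run(sweep())
for row in results:
    print(f"{row['users']:>2} users  worst={row['worst_ms']:.0f} ms  "
          f"throughput={row['tokens_per_s']:.0f} tok/s")
```

In this simulation the bend appears once the user count exceeds SLOTS: throughput stops climbing and the worst latency grows with every extra wave of queued requests. A real server bends for different reasons, but the measurement loop has the same shape.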

Vocabulary to keep

TTFT

How long until the first streamed token appears.

Throughput

Total output tokens per second across all users.

Tail latency

How bad it gets for the unlucky requests.

Saturation

The load level where adding users stops raising throughput and makes everyone slower.

Hands-on direction

After ten requests, sort by total_ms and ask why the slowest request was slow. Long prompt? Long answer? Queue wait? Cold model? That question is the beginning of inference engineering.

~/cuda-week/serving/metrics.csv

request_id,prompt_tokens,output_tokens,ttft_ms,total_ms
1,42,180,310,4200
2,318,96,740,3900
3,64,260,330,6100

# Add columns later:
# concurrent_users, model, batch_size, gpu_util, kv_cache_gb
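To practice the sort-and-ask habit on the three sample rows above, a short script along these lines works. The decode_tok_s column is an added, approximate figure (output tokens divided by the time after the first token), not something the CSV itself contains.

```python
import csv
import io

# The three sample rows from metrics.csv, inlined so the sketch runs on its own.
METRICS_CSV = """request_id,prompt_tokens,output_tokens,ttft_ms,total_ms
1,42,180,310,4200
2,318,96,740,3900
3,64,260,330,6100
"""

def load_rows(handle):
    """Parse the CSV into integer rows and add a rough per-request decode rate."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {key: int(value) for key, value in raw.items()}
        decode_s = (row["total_ms"] - row["ttft_ms"]) / 1000
        # Approximation: treats all output tokens as produced after the first token.
        row["decode_tok_s"] = row["output_tokens"] / decode_s if decode_s > 0 else 0.0
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(METRICS_CSV))
slowest_first = sorted(rows, key=lambda r: r["total_ms"], reverse=True)

for row in slowest_first:
    print(f"request {row['request_id']}: total={row['total_ms']} ms  "
          f"ttft={row['ttft_ms']} ms  prompt={row['prompt_tokens']}  "
          f"output={row['output_tokens']}  decode~{row['decode_tok_s']:.1f} tok/s")
```

On these rows, request 3 is slowest mainly because it produced the longest answer at a decode rate of about 45 tok/s. Request 2 tells a different story: it has the longest prompt, the highest TTFT, and the lowest decode rate at about 30 tok/s, even though its total time is the shortest.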

Serving taste starts when you stop asking “is it fast?” and ask “fast for whom, under what load?”