DAY 05 · WHAT CUDA LETS YOU ACTUALLY DO

The field is bigger than AI. It's just that AI is the part you can see.

Generative AI is roughly 30% of CUDA's reason for being. The other 70% is happening in places you don't usually look: drug labs simulating proteins, banks pricing options at 4am, weather services rendering hurricanes, robotics teams training agents in synthetic worlds, fraud teams catching transactions in milliseconds. Today we widen the lens — because mastering NVIDIA means knowing the whole map, not just the corner you live in.

The landscape, in one diagram

[Diagram: CUDA + libraries at the center, orbited by eight domains: AI training (PyTorch · Megatron), AI inference (vLLM · TRT-LLM), scientific compute (cuFFT · Modulus), data ops (RAPIDS · cuDF), graphics (RTX · OptiX), finance, robotics (Isaac Sim), custom kernels (Triton).]
Eight major domains. AI is two of them. CUDA is the gravity well; everything orbits it.

The eight things you can do with CUDA

1 · Train state-of-the-art neural networks CUDA-exclusive in practice

The marquee use case. Every frontier model — GPT, Claude, Gemini, Llama, Stable Diffusion, Suno — was trained on NVIDIA hardware running CUDA. You built a tiny one yourself on Day 2 (in earlier course versions). On your DGX, you can train models up to ~1B parameters from scratch overnight, and fine-tune up to ~70B parameters using LoRA.

What you can do today: nanoGPT, Hugging Face Trainer, Unsloth, Axolotl, DeepSpeed, Lightning, Megatron-LM.
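
If you haven't seen what a LoRA fine-tune looks like in code, here is a minimal sketch using Hugging Face Trainer plus PEFT. The base model (gpt2), the toy corpus, and every hyperparameter are illustrative placeholders, not values from this course; it assumes transformers, datasets, and peft are installed.

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder; any small causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach low-rank adapters: only a fraction of a percent of weights train.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
model.print_trainable_parameters()

texts = ["CUDA is the gravity well; everything orbits it."] * 64  # toy corpus
ds = Dataset.from_dict({"text": texts}).map(
    lambda batch: tok(batch["text"], truncation=True, max_length=64), batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=8,
                           num_train_epochs=1, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()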

2 · Serve inference at production scale CUDA-dominant

Real production AI services run on TensorRT-LLM or vLLM, both built CUDA-first. Your DGX Spark can comfortably host 10–30 concurrent users on Llama 8B, or 5–10 on 70B. Continuous batching, paged attention, FP4 — all the techniques that make ChatGPT feel snappy work natively here.

What you can do today: Ollama, vLLM, TensorRT-LLM, NVIDIA Triton Inference Server, llama.cpp.
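
To make that concrete, here is a hedged sketch of vLLM's offline batching API. The model name is a placeholder (any HF causal LM your GPU can hold works), and it assumes vLLM is installed; continuous batching and paged attention happen under the hood with no extra code.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model name
params = SamplingParams(temperature=0.7, max_tokens=128)

# One call, many prompts: vLLM schedules them onto the GPU together.
outputs = llm.generate(
    ["Explain paged attention in one sentence.",
     "Why does continuous batching raise throughput?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)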

3 · Run scientific simulations CUDA-exclusive in practice

Weather forecasting (NOAA/ECMWF use NVIDIA), drug discovery (every major pharma runs CUDA), protein folding (AlphaFold variants), molecular dynamics, computational chemistry, particle physics, climate modeling. NVIDIA's cuFFT and cuSPARSE libraries, and Modulus (physics-informed neural networks), are the backbone of computational science.

Why this matters for your DGX: if you ever wonder "could AI help with X scientific problem?" — the answer is almost always yes, and the tools assume CUDA.
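
For a small taste of that stack, here's a hedged sketch using CuPy, whose FFT routines dispatch to cuFFT: recover a 50 Hz tone buried in noise, entirely on the GPU. The signal and sample rate are invented for illustration, and it assumes a CuPy build matching your CUDA version.

import cupy as cp

n, fs = 1_000_000, 10_000.0              # samples, sample rate (Hz)
t = cp.arange(n) / fs
signal = cp.sin(2 * cp.pi * 50 * t) + cp.random.randn(n)  # 50 Hz tone + noise

spectrum = cp.fft.rfft(signal)           # executes on cuFFT, not the CPU
freqs = cp.fft.rfftfreq(n, d=1 / fs)
peak = freqs[int(cp.argmax(cp.abs(spectrum[1:]))) + 1]    # skip the DC bin
print(f"Dominant frequency: {float(peak):.1f} Hz")        # ~50.0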

4 · GPU-accelerated data analytics CPU possible, GPU 50× faster

RAPIDS is NVIDIA's pandas/Spark replacement that runs entirely on GPU. cuDF is a drop-in pandas accelerator. cuML is a sklearn replacement. cuGraph handles graph analytics at scale. For business-side workloads (sales analysis, churn modeling, fraud detection), this is often the easier CUDA win than AI.

Practical DGX use: if you have a 10GB CSV that takes pandas an hour to process, RAPIDS will do it in under a minute.
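
The lab at the end of this lesson times the explicit cuDF API, but it's worth knowing RAPIDS also ships a zero-code-change accelerator mode (cudf.pandas) that leaves your pandas code untouched. A sketch:

# Option 1: run an unmodified pandas script under the accelerator.
#   $ python -m cudf.pandas your_existing_script.py
# Option 2: enable it at the top of a script, before pandas is imported.
import cudf.pandas
cudf.pandas.install()
import pandas as pd   # pandas calls now run on the GPU where cuDF supports them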

5 · Real-time ray-traced graphics CUDA + RT cores

Modern AAA games (Cyberpunk, Alan Wake 2) use NVIDIA's RTX path-tracing pipeline, which runs on dedicated RT cores driven by CUDA-adjacent kernels. Plus DLSS — the AI upscaler — runs on tensor cores. Most pure-graphics work won't touch your DGX Spark, but it's worth knowing the same silicon does both.

For your DGX: mostly irrelevant unless you do AI-assisted 3D rendering or video work (e.g., NVIDIA Broadcast, RTX Video).

6 · Computational finance CPU possible, but CUDA-dominant for big banks

Monte Carlo option pricing, value-at-risk simulations, real-time portfolio optimization, derivatives modeling. Goldman, JPM, Citadel run massive NVIDIA fleets specifically for these workloads. The math is the same shape as ML: lots of independent parallel computations on big arrays.

Practical DGX use: if you ever want to backtest a trading strategy across millions of permutations, your DGX will do it in minutes instead of hours.
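
As an illustration of that shape, here is a hedged Monte Carlo sketch pricing a European call under geometric Brownian motion with CuPy. All market parameters are invented, and ten million paths is exactly the kind of embarrassingly parallel workload a GPU chews through in milliseconds.

import cupy as cp

S0, K, r, sigma, T = 100.0, 105.0, 0.03, 0.2, 1.0   # spot, strike, rate, vol, years (illustrative)
n_paths = 10_000_000                                 # independent draws: ideal GPU work

z = cp.random.standard_normal(n_paths)
ST = S0 * cp.exp((r - 0.5 * sigma**2) * T + sigma * cp.sqrt(T) * z)  # GBM terminal prices
payoff = cp.maximum(ST - K, 0.0)                     # call payoff per path
price = float(cp.exp(-r * T) * payoff.mean())        # discounted expectation
print(f"European call price: {price:.2f}")           # closed-form Black-Scholes gives ~7.13 here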

7 · Robotics & simulation NVIDIA-led

Isaac Sim is NVIDIA's robotics simulation platform. Train a robot's policy in a virtual world before deploying to real hardware. Cosmos is their world-model platform that generates synthetic training data. This is one of the fastest-growing CUDA domains in 2026 — every robotics startup is on it.

Practical DGX use: the Spark is small for serious robotics training, but it can run Isaac Sim and Cosmos demos comfortably. Big leagues need a DGX Station or cloud H100s.

8 · Custom kernels with Triton CUDA-only

OpenAI's Triton (the Python kernel language) lets you write your own GPU kernels in roughly the way you'd write a NumPy function. It compiles down to PTX (CUDA's assembly). When the standard libraries aren't fast enough, this is how you eke out the last 30%. We tackle this in Week 9 of the curriculum.

For now: just know it exists, and that nothing equivalent exists on Apple, AMD, or anywhere else.
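
For a feel of what you'll eventually write, here is the canonical Triton starter kernel (an elementwise add), following the pattern in Triton's own tutorials; it assumes triton and PyTorch are installed and a CUDA GPU is present.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # which program instance am I?
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # the elements this instance owns
    mask = offs < n                           # guard the ragged tail
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(1_000_000, device="cuda")
y = torch.randn(1_000_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)        # one program instance per 1024 elements
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)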

The publication firehose — why CUDA's lead grows, not shrinks

One of the easier-to-overlook reasons CUDA's lead compounds: every research paper that lands on arXiv ships a CUDA implementation by default. Even when authors mean to support multiple backends, "CUDA-first, others later" is the universal pattern. A small selection of recent landmark techniques and how long it took the rest of the world to catch up:

2022 · FlashAttention (Stanford). CUDA implementation released day one. AMD got a working port ~12 months later, at half the speed. Apple: still partial in 2026.
2023 · vLLM PagedAttention (Berkeley). CUDA-only at launch. AMD fork ~9 months later. Apple: never properly ported.
2023 · QLoRA (University of Washington). Required bitsandbytes — CUDA-only. Apple/AMD ports lag by 12+ months and miss features.
2024 · FlashAttention-3. CUDA-only, requires Hopper or newer. Still no AMD equivalent.
2024 · Marlin / FP4 inference kernels. CUDA + Blackwell only. Apple has no FP4 silicon at all.
2025 · Mamba / state-space models. CUDA-first. Active research, mostly NVIDIA-only optimizations.
The structural lead

Every quarter that goes by, the CUDA-only library count grows faster than competitors can clone the existing ones. The lead isn't a fixed gap — it's a moving target that's pulling away.

Lab: try a non-AI use of CUDA

So far this course has lived inside the AI corner of CUDA. Today's lab steps out — we'll use RAPIDS cuDF to do data analytics 50× faster than pandas. Useful for any business workload where you have big CSVs.

1 · Install RAPIDS

$ pip install --extra-index-url=https://pypi.nvidia.com \
    cudf-cu13 cuml-cu13 cugraph-cu13

2 · Generate a 10M-row dataset to chew on

~/cuda-week/bench_rapids.py
import pandas as pd, cudf, numpy as np, time

N = 10_000_000
df = pd.DataFrame({
    "id": np.arange(N),
    "category": np.random.choice(["A","B","C","D","E"], N),
    "value": np.random.randn(N) * 100,
    "price": np.random.uniform(10, 10_000, N),
})

# --- pandas (CPU) ---
t0 = time.perf_counter()
out_cpu = (df.groupby("category")
             .agg({"value": ["mean", "std"], "price": "sum"})
             .sort_values(("price", "sum"), ascending=False))
t_cpu = time.perf_counter() - t0
print(f"pandas:  {t_cpu*1000:7.0f} ms")

# --- cuDF (GPU) ---
gdf = cudf.from_pandas(df)  # host-to-device copy, deliberately outside the timed region
t0 = time.perf_counter()
out_gpu = (gdf.groupby("category")
              .agg({"value": ["mean", "std"], "price": "sum"})
              .sort_values(("price", "sum"), ascending=False))
t_gpu = time.perf_counter() - t0
print(f"cuDF:    {t_gpu*1000:7.0f} ms")
print(f"Speedup: {t_cpu/t_gpu:7.0f}×")
$ python3 bench_rapids.py
pandas:    7,420 ms
cuDF:        145 ms
Speedup:      51×

Same code shape. Same result. ~50× faster. This is what CUDA does for everything that isn't a neural network.

The expanded use case

Half the value of a DGX Spark for a business person isn't fine-tuning models — it's that every existing data workflow you have can run dramatically faster on it without you learning anything new. Pandas → cuDF is a one-line change. Your data team has a CUDA story even if your AI team doesn't.

Today's reflection

Pick the domain above that overlaps most with what your company actually does. Ask: "Is there a workflow we already have that we'd 10× if it ran 50× faster?" If yes, you have a use case for the silicon you bought, even if you never train a model.