DAY 01 · WHAT CUDA ACTUALLY IS

A two-decade bet that became the most valuable software platform in the world.

Everyone says "CUDA" like everyone says "the cloud" — confidently and vaguely. Today we end the vagueness. By the end of this hour you'll know what the word stands for, what NVIDIA actually shipped in 2007, what's on your DGX Spark when you type import torch, and why a two-decade head start is worth more than the silicon itself.

The one-sentence definition

CUDA is NVIDIA's platform — a programming language, a runtime, and a library catalog — for using GPUs as general-purpose parallel computers, not just graphics cards.

The word is an acronym most people forget: Compute Unified Device Architecture. Three things to underline:

  1. Compute — not graphics. CUDA's whole reason to exist was to take a chip designed for video games and aim it at math.
  2. Unified — one programming model that works across every NVIDIA GPU made since 2006. Code written for the very first CUDA-capable cards still compiles and runs on your Blackwell DGX Spark.
  3. Architecture — not just an API. CUDA is the contract between the chip designers and the software world. Every NVIDIA chip is built to honor it.

The origin story (it's a real bet)

In 2006, NVIDIA was a graphics company. Their chips drew triangles for games. Researchers had started noticing — slowly, awkwardly — that you could trick the chips into doing math by pretending the math was a picture. This was called GPGPU and it was painful.

NVIDIA's CEO Jensen Huang made a call: instead of letting the trick stay a trick, build an actual programming environment for it. Spend years and billions on a software stack for a market that didn't exist yet. Subsidize academic research. Train a generation of grad students to think in CUDA before they thought in anything else.

For six years, this looked like a slow, unprofitable distraction. Then 2012 happened — AlexNet, the paper that started the deep learning era, was trained on two NVIDIA GTX 580s using CUDA. Every breakthrough since has run on the same stack.

2006 · CUDA announced, alongside the G80 architecture, the first CUDA-capable consumer silicon.
       Almost nobody outside graphics labs notices.
2007 · CUDA 1.0 toolkit ships.
       The first public version of the toolkit: a programmable parallel processor for the masses.
2008 · cuBLAS ships, the first hand-tuned GPU linear algebra library.
2009 · Fermi architecture brings proper double-precision floats.
       CUDA stops being "graphics card with extras" and becomes a real HPC tool.
2012 · AlexNet wins ImageNet, trained in CUDA.
       The deep learning revolution begins. Every paper for the next decade is "we trained X on Y NVIDIA GPUs."
2014 · cuDNN ships: neural-net primitives, hand-written by NVIDIA engineers.
2017 · Tensor Cores arrive on Volta; the Transformers paper drops.
       The two events that quietly defined the next decade.
2020 · A100 + Ampere. Pandemic supercomputers run on it.
2022 · ChatGPT.
       Trained on tens of thousands of NVIDIA A100s. CUDA becomes a household word, sort of.
2024 · Blackwell + FP4, the chip generation in your DGX Spark.
2025 · DGX Spark: Blackwell on a desk for $4.7k.
       The first time CUDA-class compute is sitting next to a couch instead of in a datacenter.
2026 · You're reading this.
       ~5M CUDA developers worldwide. Every major AI framework, model, and tool is CUDA-first.

The point of the timeline

CUDA is older than the iPhone. Every competitor is trying to catch up to a software platform with two decades of accumulated tooling, papers, optimizations, and trained engineers. Hardware can be replicated. Time cannot.

The three layers — what's actually on your DGX

When you type import torch, you are reaching down through a stack like this:

Your code                           PyTorch / TensorFlow / Ollama / your script
Layer 3 · Libraries                 cuBLAS · cuDNN · CUTLASS · NCCL · cuRAND · cuFFT · Thrust
Layer 2 · CUDA Runtime & Compiler   nvcc · PTX · streams · memory · the parallel programming model
Layer 1 · CUDA Driver               talks to the silicon
GB10 Blackwell                      your GPU
"CUDA" is shorthand for everything from Layer 1 to Layer 3. People usually mean Layer 3 when they say "the moat."

Layer 1 — the driver

Kernel-mode code that knows the actual silicon. Manages memory, schedules work, talks to the chip. You never write to it directly.

Layer 2 — the runtime & compiler

This is "CUDA the language." The C++ extensions you may have seen — __global__, <<<blocks, threads>>> — get compiled by nvcc into a portable assembly called PTX, then JIT-compiled to your specific GPU at runtime. This is the layer Apple and AMD have to clone, and the layer that's hardest to clone.

Layer 3 — the libraries

Nearly two decades of NVIDIA engineers writing the fastest possible implementations of every operation AI and HPC need. cuBLAS for matrix math. cuDNN for neural net primitives. NCCL for multi-GPU communication. The actual moat lives here — we'll spend Day 2 entirely on this.
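
To make Layer 3 concrete, here is a hedged sketch of calling cuBLAS directly from CUDA C++. The matrix size and fill values are arbitrary choices for illustration, and the call assumes the cuBLAS that ships with your CUDA toolkit:

// gemm.cu : calling a Layer-3 library (cuBLAS) by hand
// compile with: nvcc gemm.cu -lcublas -o gemm
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;                              // multiply two n x n matrices
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C, in cuBLAS's column-major convention
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %.1f\n", C[0]);                   // every entry should be 1024.0
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

The point is not this snippet; it is that the sgemm behind it has been re-tuned for every architecture since it shipped, and a PyTorch matmul ultimately lands in the same family of calls.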

Lab: see your stack

Quick fifteen-minute hands-on so you can look at the layers we just drew.

1 · The driver and runtime versions

$ ssh you@dgx-spark.local
$ nvidia-smi
+----------------------------------------+
| NVIDIA-SMI 580.xx  Driver  580.xx      |
| CUDA Version: 13.0                     |
+----------------------------------------+
| GB10  Blackwell   128 GiB              |
+----------------------------------------+

Driver Version = Layer 1. CUDA Version = the newest Layer-2 runtime that driver can support, i.e. Layer 2's API contract. Both ship pre-installed with DGX OS.
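
If you want the same two numbers from inside a program rather than from nvidia-smi, the CUDA runtime API exposes both; a small sketch (file name illustrative):

// versions.cu : query Layer 1 and Layer 2 versions programmatically
// compile with: nvcc versions.cu -o versions
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);      // Layer 1: highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtime);    // Layer 2: the runtime this binary was built against
    printf("Driver supports CUDA : %d.%d\n", driver / 1000, (driver % 100) / 10);
    printf("Runtime version      : %d.%d\n", runtime / 1000, (runtime % 100) / 10);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0: the GB10 on a DGX Spark
    printf("Device 0             : %s\n", prop.name);
    return 0;
}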

2 · The compiler

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 13.0, V13.0.xx

That's the Layer-2 compiler. If you ever wrote raw CUDA, this is the thing that would compile it.

3 · The libraries

$ ls /usr/lib/aarch64-linux-gnu/ | grep -E '^libcu|^libnccl' | head -20
libcublas.so.13
libcublasLt.so.13
libcudart.so.13
libcudnn.so.9
libcufft.so.12
libcurand.so.10
libcusparse.so.12
libnccl.so.2

Each .so file is one of those Layer 3 libraries — millions of lines of hand-tuned code, sitting on your disk, written largely by NVIDIA engineers, free for you to link against. This is the moat made physical.

4 · Confirm PyTorch sees the whole stack

$ python3 -c "
import torch
print('PyTorch  :', torch.__version__)
print('CUDA OK  :', torch.cuda.is_available())
print('Device   :', torch.cuda.get_device_name(0))
print('TF32 OK  :', torch.backends.cuda.matmul.allow_tf32)  # may cuBLAS use TF32 for fp32 matmuls?
print('cuDNN    :', torch.backends.cudnn.version())
"
PyTorch  : 2.6.0+cu130
CUDA OK  : True
Device   : NVIDIA GB10
TF32 OK  : True
cuDNN    : 91200

Aha moment

When that last line prints a real version number, your Python code has a paved road from a one-line script all the way down to silicon designed for it. A MacBook can run PyTorch too, but that road is rocky, partly missing, and the trucks driving on it weren't built for it. Tomorrow we'll see why that distinction is worth a market cap.

What "CUDA" is not

Three confusions worth killing right now:

It is not just "the API for the GPU"

OpenGL, DirectX, Metal, Vulkan are graphics APIs. CUDA is something else — it's a compute programming model. The two intersect (CUDA can do graphics, Vulkan now does compute) but they were built for different audiences and have different ecosystems.

It is not the same thing as "PyTorch"

PyTorch sits on top of CUDA. So do TensorFlow, JAX, TensorRT, vLLM, Ollama, and basically everything else in the NVIDIA world. Those frameworks call into CUDA libraries. CUDA is the road; PyTorch is the truck.

It is not just for AI

CUDA powers weather simulation, drug discovery, oil reservoir modeling, real-time ray tracing, computational finance, particle physics. The AI moment made it famous, but the customer base was already there.

So what — why this matters for you

A consumer asks: "I want to do AI on my desk — should I buy a Mac Studio or a DGX Spark?" The honest answer requires you to internalize one thing: buying a GPU is buying access to CUDA. The hardware is the ticket; the platform is the ride.

Apple's chips are excellent silicon. But Apple has chosen, deliberately, not to ship a CUDA equivalent. Their stack is Metal with Metal Performance Shaders (MPS) for compute, plus MLX for ML, and it's a fraction of the size of NVIDIA's. "Same memory, half the software" is a fair short summary. We'll see exactly how that plays out on Day 3.

Today's reflection

Run the four lab commands above. Save the outputs in a file called day1-cuda-stack.txt. Each line you saw is a piece of software NVIDIA spent years and billions building, running on your machine right now, and it isn't available from any other vendor at any price. That, in concrete form, is what you spent $4.7k on.