A 17-year-old bet that became the most valuable software platform in the world.
Everyone says "CUDA" like everyone says "the cloud" — confidently and vaguely. Today we end the vagueness. By the end of this hour you'll know what the word stands for, what NVIDIA actually shipped in 2007, what's on your DGX Spark when you say import torch, and why a 17-year head start is worth more than the silicon itself.
The one-sentence definition
CUDA is NVIDIA's platform — a programming language, a runtime, and a library catalog — for using GPUs as general-purpose parallel computers, not just graphics cards.
The word is an acronym most people forget: Compute Unified Device Architecture. Three things to underline:
- Compute — not graphics. CUDA's whole reason to exist was to take a chip designed for video games and aim it at math.
- Unified — one programming model spanning every CUDA-capable NVIDIA GPU since 2006. The model you learn today is the same one that drove the first Tesla cards, and code written against it in 2007 still compiles for your Blackwell DGX Spark.
- Architecture — not just an API. CUDA is the contract between the chip designers and the software world. Every NVIDIA chip is built to honor it.
The origin story (it's a real bet)
In 2006, NVIDIA was a graphics company. Their chips drew triangles for games. Researchers had started noticing — slowly, awkwardly — that you could trick the chips into doing math by pretending the math was a picture. This was called GPGPU and it was painful.
NVIDIA's CEO Jensen Huang made a call: instead of letting the trick stay a trick, build an actual programming environment for it. Spend years and billions on a software stack for a market that didn't exist yet. Subsidize academic research. Train a generation of grad students to think in CUDA before they thought in anything else.
For about a decade, this looked like a slow, unprofitable distraction. Then 2012 happened — AlexNet, the paper that started the deep learning era, was trained on two NVIDIA GTX 580s using CUDA. Nearly every major deep learning breakthrough since has run on the same stack.
CUDA is older than the iPhone. Every competitor is trying to catch up to a software platform with seventeen years of accumulated tooling, papers, optimizations, and trained engineers. Hardware can be replicated. Time cannot.
The three layers — what's actually on your DGX
When you type import torch, you are reaching down through a stack like this:
Layer 1 — the driver
Kernel-mode code that knows the actual silicon. Manages memory, schedules work, talks to the chip. You never write to it directly.
Layer 2 — the runtime & compiler
This is "CUDA the language." The C++ extensions you may have seen — __global__, <<<blocks, threads>>> — get compiled by nvcc into a portable assembly called PTX, then JIT-compiled to your specific GPU at runtime. This is the layer Apple and AMD have to clone, and the layer that's hardest to clone.
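To make Layer 2 concrete, here is a minimal sketch of the kind of source nvcc consumes, wrapped in a small Python driver. The kernel is the classic vector add; the names (add, compile_to_ptx) and file paths are ours, and the script degrades gracefully when nvcc isn't installed.

```python
# A minimal CUDA C++ kernel (the Layer 2 "language") and a helper that asks
# nvcc to lower it to PTX. Kernel and helper names are our own.
import os
import shutil
import subprocess
import tempfile

KERNEL = r"""
extern "C" __global__ void add(const float *a, const float *b,
                               float *out, int n) {
    // Each thread computes exactly one element of the output.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}
"""

def compile_to_ptx(src):
    """Compile CUDA C++ to PTX with nvcc; return None if nvcc is absent."""
    if shutil.which("nvcc") is None:
        return None
    with tempfile.TemporaryDirectory() as d:
        cu = os.path.join(d, "add.cu")
        ptx = os.path.join(d, "add.ptx")
        with open(cu, "w") as f:
            f.write(src)
        subprocess.run(["nvcc", "-ptx", cu, "-o", ptx], check=True)
        with open(ptx) as f:
            return f.read()

ptx = compile_to_ptx(KERNEL)
print("PTX generated" if ptx else "nvcc not found; source shown above only")
```

On a host this kernel would be launched as add<<<blocks, threads>>>(...); the PTX that nvcc emits is exactly what the driver JIT-compiles for your specific GPU at runtime.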
Layer 3 — the libraries
Fifteen+ years of NVIDIA engineers writing the fastest possible implementations of every operation AI and HPC need. cuBLAS for matrix math. cuDNN for neural net primitives. NCCL for multi-GPU communication. The actual moat lives here — we'll spend Day 2 entirely on this.
Lab: see your stack
A quick fifteen-minute hands-on so you can see the layers we just drew.
1 · The driver and runtime versions
```
$ ssh you@dgx-spark.local
$ nvidia-smi
+----------------------------------------+
| NVIDIA-SMI 580.xx    Driver 580.xx     |
| CUDA Version: 13.0                     |
+----------------------------------------+
| GB10    Blackwell    128 GiB           |
+----------------------------------------+
```
Driver Version = Layer 1. CUDA Version = Layer 2's API contract. Both shipped pre-installed on your DGX OS.
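If you'd rather script this check, nvidia-smi has a machine-readable query mode. A small sketch using only the standard library; it prints a notice on machines without the driver:

```python
# Query the Layer 1 driver version programmatically via nvidia-smi's CSV mode.
import shutil
import subprocess

def driver_version():
    """Return the installed NVIDIA driver version, or None without a driver."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

v = driver_version()
print(f"Driver: {v}" if v else "nvidia-smi not found (no Layer 1 here)")
```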
2 · The compiler
```
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 13.0, V13.0.xx
```
That's the Layer-2 compiler. If you ever wrote raw CUDA, this is the thing that would compile it.
3 · The libraries
```
$ ls /usr/lib/aarch64-linux-gnu/ | grep -E '^libcu|^libnccl' | head -20
libcublas.so.13
libcublasLt.so.13
libcudart.so.13
libcudnn.so.9
libcufft.so.12
libcurand.so.10
libcusparse.so.12
libcutlass.so
libnccl.so.2
libnvjpeg.so.13
```

(The DGX Spark's GB10 pairs Blackwell with an Arm CPU, so the libraries live under aarch64-linux-gnu, not x86_64.)
Each .so file is one of those Layer 3 libraries — millions of lines of hand-tuned code, sitting on your disk, written largely by NVIDIA engineers, free for you to link against. This is the moat made physical.
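You can also let the dynamic linker do the hunting. This sketch uses only the Python standard library, so it runs on any machine; on a box without the CUDA stack every entry simply comes back missing. The short names are the form ctypes expects, with the lib prefix and .so suffix dropped.

```python
# Probe which Layer 3 libraries the dynamic linker can locate.
import ctypes.util

LAYER3 = {
    "cudart": "CUDA runtime",
    "cublas": "dense linear algebra",
    "cudnn":  "neural-network primitives",
    "nccl":   "multi-GPU communication",
}

def probe():
    """Map each short library name to the soname found, or None if absent."""
    return {name: ctypes.util.find_library(name) for name in LAYER3}

for name, found in probe().items():
    print(f"{name:8s} {LAYER3[name]:28s} -> {found or 'not found'}")
```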
4 · Confirm PyTorch sees the whole stack
```
$ python3 -c "
import torch
print('PyTorch :', torch.__version__)
print('CUDA OK :', torch.cuda.is_available())
print('Device  :', torch.cuda.get_device_name(0))
print('TF32 on :', torch.backends.cuda.matmul.allow_tf32)
print('cuDNN   :', torch.backends.cudnn.version())
"
PyTorch : 2.6.0+cu130
CUDA OK : True
Device  : NVIDIA GB10
TF32 on : True
cuDNN   : 91200
```

(allow_tf32 is a cuBLAS matmul precision flag, not a version number, so it is labeled for what it is.)
When that last line prints a real number, your Python code now has a paved road from a one-line script all the way down to silicon designed for it. Apple's MacBook can run PyTorch — but that road is rocky, partly missing, and the trucks driving on it weren't built for it. Tomorrow we'll see why that distinction is worth a market cap.
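The paved road fits in a few lines. A sketch, assuming PyTorch is installed; it falls back to the CPU, or to a message, when it isn't:

```python
# One matrix multiply, routed into cuBLAS when CUDA is present and into the
# CPU BLAS otherwise. The user-facing code does not change either way.
try:
    import torch
except ImportError:
    torch = None

def matmul_demo(n=256):
    if torch is None:
        return "PyTorch not installed"
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=dev)
    b = torch.randn(n, n, device=dev)
    c = a @ b  # on CUDA, this single line dispatches into cuBLAS
    return f"{dev}: multiplied two {n}x{n} matrices -> {tuple(c.shape)}"

print(matmul_demo())
```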
What "CUDA" is not
Three confusions worth killing right now:
It is not just "the API for the GPU"
OpenGL, DirectX, Metal, Vulkan are graphics APIs. CUDA is something else — it's a compute programming model. The two intersect (CUDA can do graphics, Vulkan now does compute) but they were built for different audiences and have different ecosystems.
It is not the same thing as "PyTorch"
PyTorch sits on top of CUDA. So do TensorFlow, JAX, TensorRT, vLLM, Ollama, and basically everything else. Those frameworks call into CUDA libraries. CUDA is the road; PyTorch is the truck.
It is not just for AI
CUDA powers weather simulation, drug discovery, oil reservoir modeling, real-time ray tracing, computational finance, particle physics. The AI moment made it famous, but the customer base was already there.
So what — why this matters for you
A consumer asks: "I want to do AI on my desk — should I buy a Mac Studio or a DGX Spark?" The honest answer requires you to internalize one thing: buying a GPU is buying access to CUDA. The hardware is the ticket; the platform is the ride.
Apple's chips are excellent silicon. But Apple has chosen, deliberately, not to ship a CUDA equivalent. Their stack is Metal for graphics plus Metal Performance Shaders (MPS) and MLX for ML, and it's a fraction of the size of NVIDIA's. "Same memory, half the software" is a fair short summary. We'll see exactly how that plays out on Day 3.
Today's reflection
Run the four lab commands above and save the outputs in a file called day1-cuda-stack.txt. Each line you saw is proprietary software running on your machine right now: free to use, but available from no other vendor at any price. That, in concrete form, is what you spent $4.7k on.