DAY 02 · THE SOFTWARE STACK — THE REAL MOAT

Why the moat isn't the chips. It's the libraries.

Anyone with a few billion dollars can fab a GPU. Almost nobody can ship the seventeen years of hand-written, paper-published, battle-tested numerical libraries on top. Today we tour that catalog — the real reason every AI lab is locked in — and watch one of those libraries do something on your DGX Spark that would melt a CPU.

The mental model: NVIDIA's library stack is a city

Imagine you wanted to compete with Manhattan. You can buy land. You can pour concrete. What you can't buy is the two centuries of businesses, institutions, and skills that have accumulated in one place. CUDA's library catalog is that city. It's not just one product — it's fifteen-plus standalone libraries, each the fastest known implementation of its job, plus a thriving ecosystem of third-party libraries built on top.

The catalog — what every framework actually calls

Foundational libraries (NVIDIA's own)

cuBLAS
Linear algebra · matrix math
The fastest known implementation of "multiply this matrix by that matrix" on NVIDIA hardware. Every neural net forward pass calls this thousands of times (see the dispatch sketch after this catalog).
since 2008 · near-peak FLOPS on large, regular shapes
cuDNN
Neural network primitives
Hand-tuned implementations of every common deep learning op — convolutions, attention, normalization. PyTorch and TensorFlow both call into it for the hot paths.
since 2014 · the original AI moat
CUTLASS
Building blocks for custom matmul
When the standard libraries aren't fast enough, this is what NVIDIA hands you to build your own kernel. Used inside FlashAttention, vLLM, etc.
open-source · the power-user library
NCCL
Multi-GPU communication
Lets thousands of GPUs share gradients and weights in lockstep. Without it, training at GPT-4 scale doesn't happen. Even on your single DGX Spark, it's the layer the chip would use to talk to peers.
the spinal cord of every supercomputer
cuFFT / cuRAND / cuSPARSE
FFTs · random numbers · sparse matrices
The plumbing nobody talks about. Used in scientific computing, simulation, signal processing, and increasingly in newer ML architectures.
decades of HPC heritage
Thrust / CUB
Parallel data structures & algorithms
If you ever need to sort, scan, or reduce arrays of billions of elements on a GPU, these are how you do it without writing the kernel yourself.
C++ STL for parallel work
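
To make the dispatch concrete, here is a minimal sketch (not the day's lab, just an illustration, assuming a CUDA build of PyTorch) of where one framework hands work to three of these libraries:

import torch

torch.backends.cudnn.benchmark = True    # let cuDNN autotune per input shape

# matmul → cuBLAS
a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")
c = a @ b

# convolution → cuDNN
x = torch.randn(8, 3, 224, 224, device="cuda")   # a batch of images
w = torch.randn(64, 3, 3, 3, device="cuda")      # conv filters
y = torch.nn.functional.conv2d(x, w, padding=1)

# large sorts on CUDA are backed by CUB under the hood
s, idx = torch.sort(torch.randn(1_000_000, device="cuda"))

# (multi-GPU collectives would go through NCCL, via
#  torch.distributed.init_process_group(backend="nccl"))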

Higher-level libraries (NVIDIA + ecosystem)

TensorRT
Production inference compiler
Takes a trained model, picks the optimal kernels for your specific GPU, and produces an optimized engine that runs inference 2–10× faster than vanilla PyTorch.
how every "fast" cloud LLM ships
TensorRT-LLM
Production LLM inference
TensorRT specialized for transformer models — supports continuous batching, paged attention, FP4. Big-tech-grade serving on your DGX Spark.
free, open-source
Triton (NVIDIA's)
Inference server
Hosts your models behind an HTTP/gRPC endpoint with batching, queuing, scaling. Industrial-grade serving in a box.
the boring infrastructure win
Triton (OpenAI's)
Python kernel language
A Pythonic way to write your own CUDA kernels — one of the most exciting CUDA-adjacent projects of the decade. We build with this in Week 9; a one-kernel taste follows this catalog.
different "Triton," same green ecosystem
FlashAttention / xFormers
Attention kernels
Hand-written CUDA implementations of attention that broke the memory wall in transformers. Without these, long-context LLMs would be impractical. CUDA-only.
2022 paper that ate the world
vLLM / Unsloth / Axolotl
High-level training & serving
Open-source. Written by tiny teams. Each one became indispensable within months. Each one calls cuBLAS / cuDNN / CUTLASS / FlashAttention under the hood.
CUDA-first, often CUDA-only
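
Since Triton-the-kernel-language returns in Week 9, here is the smallest possible taste: a hand-written vector-add kernel. A sketch, assuming the triton package is available (recent PyTorch builds on Linux ship it as a dependency):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # which block am I?
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # my slice of the array
    mask = offs < n                           # don't run off the end
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)        # enough blocks to cover n
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
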
The compounding moat

Every line in those cards is also a hiring funnel. Every grad student who writes a paper using cuDNN learns CUDA. Every researcher who ships open-source on top of vLLM trains the next cohort. Apple has a fraction of the equivalent libraries because they've had a fraction of the academic and open-source mindshare. Mindshare compounds harder than silicon ever could.

Why this is hard to replicate

Three reasons, in order of severity:

1 · Each library is years of paid PhD-level work

cuDNN alone is hundreds of thousands of lines of hand-tuned C++ and CUDA, with a separate fast path for every (op × precision × shape × architecture) combination. NVIDIA has employed dozens of full-time numerical specialists on it for ten years. AMD's counterpart (MIOpen) is decent but not equivalent.

2 · The ecosystem self-reinforces

FlashAttention came from Stanford. vLLM came from Berkeley. Unsloth came from a tiny startup. None of them work on AMD without significant porting. The reason isn't malice — it's that the people writing these tools learned CUDA first, ran their experiments on CUDA, published in CUDA, and the next cohort inherited the same stack.

3 · The chip designers and the library writers are in the same building

When NVIDIA designs a new architecture (Blackwell, say), the cuBLAS team has already been writing kernels for it for two years. By launch day, the library is ready. AMD's library team gets the chip after launch and starts catching up.

Lab: feel the library do its thing

Quick benchmark to make Day 2 concrete. We'll multiply two large matrices two ways: NumPy on the CPU, and PyTorch + cuBLAS on your DGX Spark's GPU. (Pure Python is left out on purpose: a naive triple loop at this size would run for hours.)

~/cuda-week/bench_matmul.py
import time, torch, numpy as np

N = 4096   # 4096 × 4096 matrices
A_np = np.random.randn(N, N).astype(np.float32)
B_np = np.random.randn(N, N).astype(np.float32)

# --- CPU via NumPy (which uses MKL/OpenBLAS — already very fast) ---
t0 = time.perf_counter()
C_np = A_np @ B_np
t_cpu = time.perf_counter() - t0
print(f"CPU NumPy:  {t_cpu*1000:7.1f} ms")

# --- GPU via PyTorch (which calls cuBLAS) ---
A = torch.from_numpy(A_np).cuda()
B = torch.from_numpy(B_np).cuda()
torch.cuda.synchronize()

# warmup — the first call pays one-time CUDA context and cuBLAS setup costs
_ = A @ B; torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(10):
    C = A @ B
torch.cuda.synchronize()
t_gpu = (time.perf_counter() - t0) / 10
print(f"GPU cuBLAS: {t_gpu*1000:7.1f} ms")

# --- the punchline ---
print(f"Speedup:    {t_cpu/t_gpu:7.0f}×")
flops = 2 * N**3
print(f"GPU TFLOPS: {flops/t_gpu/1e12:7.1f}")
$ python3 bench_matmul.py
CPU NumPy:   1820.4 ms
GPU cuBLAS:     5.7 ms
Speedup:        319×
GPU TFLOPS:    24.1
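
Where does that TFLOPS figure come from? An N × N matmul does roughly N multiplies and N adds for each of its N² outputs, about 2N³ floating-point operations in total. A quick check against the sample output above (your timings will differ):

N = 4096
flops = 2 * N**3              # ≈ 1.37e11 FLOPs per matmul
t_gpu = 5.7e-3                # seconds, from the sample run
print(flops / t_gpu / 1e12)   # ≈ 24.1 TFLOPS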

Your numbers will be in the same ballpark. Two things worth noting:

What you just measured

Every framework you care about — PyTorch, JAX, Ollama, vLLM, anything that does math at scale — is, at its hot path, doing the exact thing you just benchmarked. cuBLAS is the silent floor under most of modern AI. Apple has its own BLAS (in the Accelerate framework), but it's smaller, slower, and lacks cuBLAS's deep PyTorch integration.

The third-party CUDA-only universe

NVIDIA's own libraries are only half the story. The other half is open-source projects that only run on CUDA — not because the authors hate other vendors, but because building for CUDA is the default and porting is hard. A partial list of what you'd lose without CUDA:

Tool · What it does · Apple/AMD parity?
FlashAttention 3 · Memory-efficient attention for long contexts · None / partial via custom forks
vLLM · Production LLM serving with continuous batching · None on Apple, partial on AMD
Unsloth · 2× faster LoRA fine-tuning · CUDA-only
TensorRT-LLM · NVIDIA's production LLM compiler · NVIDIA-only by definition
bitsandbytes (4/8-bit quant) · Run big models in less memory · Limited Apple support, lagging features
DeepSpeed / Megatron-LM · Multi-GPU training at scale · CUDA-only
Liger-Kernel / Apex · Hand-tuned training kernels · CUDA-only
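
To feel how hard that "CUDA-only" column bites, try the first row yourself. A minimal sketch, assuming you've installed the flash-attn package (its wheels only build against CUDA):

import torch
from flash_attn import flash_attn_func   # this import alone fails off-CUDA

# layout: (batch, seqlen, heads, head_dim) — fp16/bf16 required
q = torch.randn(1, 4096, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 4096, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 4096, 8, 64, device="cuda", dtype=torch.float16)

out = flash_attn_func(q, k, v, causal=True)   # the kernel from the 2022 paper
print(out.shape)   # torch.Size([1, 4096, 8, 64])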

This is the part that doesn't fit on a spec sheet. When you read a benchmark blog post that says "Apple is 80% as fast as NVIDIA" — they're usually testing the 20% of tasks where Apple has parity. The other 80% won't even run.

How to read this for your own decisions

For the rest of the week, you can use one mental shortcut: "is this AI technique CUDA-only?" If yes, NVIDIA wins by default for any team doing it. The list above is most of modern AI. That's the moat.

Strategic lens

When evaluating a chip competitor — AMD, Apple, a Chinese startup — ask: "How many of those library entries do they have at parity? How many do they have at all?" The answer is almost never above 60%. That gap is what the chip executives lose sleep over, and what NVIDIA prices into its 70% gross margin.

Today's reflection

Run the benchmark. Save the output as day2-cublas.txt. Look at the speedup number. That's not "GPU vs CPU." That's seventeen years of paid library work, condensed into a single number.