DAY 02 · THE SOFTWARE STACK — THE REAL MOAT

Why the moat isn't the chips. It's the libraries.

Anyone with a few billion dollars can fab a GPU. Almost nobody can ship the seventeen years of hand-written, paper-published, battle-tested numerical libraries on top. Today we tour that catalog — the real reason every AI lab is locked in — and watch one of those libraries do something on your DGX Spark that would melt a CPU.

The mental model: NVIDIA's library stack is a city

Imagine you wanted to compete with Manhattan. You can buy land. You can pour concrete. What you can't buy is the two centuries of businesses, institutions, and skills that have accumulated in one place. CUDA's library catalog is that city. It's not just one product — it's fifteen-plus standalone libraries, each the fastest known implementation of its job, plus a thriving ecosystem of third-party libraries built on top.

The catalog — what every framework actually calls

Foundational libraries (NVIDIA's own)

cuBLAS
Linear algebra · matrix math
The fastest known implementation of "multiply this matrix by that matrix" on NVIDIA hardware. Every neural net forward pass calls this thousands of times (see the dispatch sketch after this catalog).
since 2008 · near-peak FLOPS on large, regular shapes
cuDNN
Neural network primitives
Hand-tuned implementations of every common deep learning op — convolutions, attention, normalization. PyTorch and TensorFlow both call into it for the hot paths.
since 2014 · the original AI moat
CUTLASS
Building blocks for custom matmul
When the standard libraries aren't fast enough, this is what NVIDIA hands you to build your own kernel. Used inside FlashAttention, vLLM, etc.
open-source · the power-user library
NCCL
Multi-GPU communication
Lets thousands of GPUs share gradients and weights in lockstep. Without it, training at GPT-4 scale doesn't happen. Even on your single DGX Spark, it's the layer the chip would use to talk to peers.
the spinal cord of every supercomputer
cuFFT / cuRAND / cuSPARSE
FFTs · random numbers · sparse matrices
The plumbing nobody talks about. Used in scientific computing, simulation, signal processing, and increasingly in newer ML architectures.
decades of HPC heritage
Thrust / CUB
Parallel data structures & algorithms
If you ever need to sort, scan, or reduce arrays of billions of elements on a GPU, these are how you do it without writing the kernel yourself.
C++ STL for parallel work
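
To make the dispatch concrete, here is a minimal sketch (not the day's lab, just an illustration, assuming a CUDA build of PyTorch) of where one framework hands work to three of these libraries:

import torch

torch.backends.cudnn.benchmark = True    # let cuDNN autotune per input shape

# matmul → cuBLAS
a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")
c = a @ b

# convolution → cuDNN
x = torch.randn(8, 3, 224, 224, device="cuda")   # a batch of images
w = torch.randn(64, 3, 3, 3, device="cuda")      # conv filters
y = torch.nn.functional.conv2d(x, w, padding=1)

# large sorts on CUDA are backed by CUB under the hood
s, idx = torch.sort(torch.randn(1_000_000, device="cuda"))

# (multi-GPU collectives would go through NCCL, via
#  torch.distributed.init_process_group(backend="nccl"))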

Higher-level libraries (NVIDIA + ecosystem)

TensorRT
Production inference compiler
Takes a trained model, picks the optimal kernels for your specific GPU, and produces an optimized engine that runs inference 2–10× faster than vanilla PyTorch.
how every "fast" cloud LLM ships
TensorRT-LLM
Production LLM inference
TensorRT specialized for transformer models — supports continuous batching, paged attention, FP4. Big-tech-grade serving on your DGX Spark.
free, open-source
Triton (NVIDIA's)
Inference server
Hosts your models behind an HTTP/gRPC endpoint with batching, queuing, scaling. Industrial-grade serving in a box.
the boring infrastructure win
Triton (OpenAI's)
Python kernel language
A Pythonic way to write your own CUDA kernels — one of the most exciting CUDA-adjacent projects of the decade. We build with this in Week 9; a one-kernel taste follows this catalog.
different "Triton," same green ecosystem
FlashAttention / xFormers
Attention kernels
Hand-written CUDA implementations of attention that broke the memory wall in transformers. Without these, long-context LLMs would be impractical. CUDA-only.
2022 paper that ate the world
vLLM / Unsloth / Axolotl
High-level training & serving
Open-source. Written by tiny teams. Each one became indispensable within months. Each one calls cuBLAS / cuDNN / CUTLASS / FlashAttention under the hood.
CUDA-first, often CUDA-only
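
Since Triton-the-kernel-language returns in Week 9, here is the smallest possible taste: a hand-written vector-add kernel. A sketch, assuming the triton package is available (recent PyTorch builds on Linux ship it as a dependency):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # which block am I?
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # my slice of the array
    mask = offs < n                           # don't run off the end
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)        # enough blocks to cover n
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
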
The compounding moat

Every line in those cards is also a hiring funnel. Every grad student who writes a paper using cuDNN learns CUDA. Every researcher who ships open-source on top of vLLM trains the next cohort. Apple has a fraction of the equivalent libraries because they've had a fraction of the academic and open-source mindshare. Mindshare compounds harder than silicon ever could.

Why this is hard to replicate

Three reasons, in order of severity:

1 · Each library is years of paid PhD-level work

cuDNN alone is hundreds of thousands of lines of hand-tuned C++ and CUDA, with a separate fast path for every (op × precision × shape × architecture) combination. NVIDIA has employed dozens of full-time numerical specialists on it for ten years. AMD's counterpart (MIOpen) is decent but not equivalent.

2 · The ecosystem self-reinforces

FlashAttention came from Stanford. vLLM came from Berkeley. Unsloth came from a tiny startup. None of them work on AMD without significant porting. The reason isn't malice — it's that the people writing these tools learned CUDA first, ran their experiments on CUDA, published in CUDA, and the next cohort inherited the same stack.

3 · The chip designers and the library writers are in the same building

When NVIDIA designs a new architecture (Blackwell, say), the cuBLAS team has already been writing kernels for it for two years. By launch day, the library is ready. AMD's library team gets the chip after launch and starts catching up.

Lab: feel the library do its thing

Quick benchmark to make Day 2 concrete. We'll multiply two large matrices two ways: NumPy on the CPU, and PyTorch + cuBLAS on your DGX Spark's GPU. (Pure Python is left out on purpose: a naive triple loop at this size would run for hours.)

~/cuda-week/bench_matmul.py
import time, torch, numpy as np

N = 4096   # 4096 × 4096 matrices
A_np = np.random.randn(N, N).astype(np.float32)
B_np = np.random.randn(N, N).astype(np.float32)

# --- CPU via NumPy (which uses MKL/OpenBLAS — already very fast) ---
t0 = time.perf_counter()
C_np = A_np @ B_np
t_cpu = time.perf_counter() - t0
print(f"CPU NumPy:  {t_cpu*1000:7.1f} ms")

# --- GPU via PyTorch (which calls cuBLAS) ---
A = torch.from_numpy(A_np).cuda()
B = torch.from_numpy(B_np).cuda()
torch.cuda.synchronize()

# warmup — the first call pays one-time CUDA context and cuBLAS setup costs
_ = A @ B; torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(10):
    C = A @ B
torch.cuda.synchronize()
t_gpu = (time.perf_counter() - t0) / 10
print(f"GPU cuBLAS: {t_gpu*1000:7.1f} ms")

# --- the punchline ---
print(f"Speedup:    {t_cpu/t_gpu:7.0f}×")
flops = 2 * N**3
print(f"GPU TFLOPS: {flops/t_gpu/1e12:7.1f}")
$ python3 bench_matmul.py
CPU NumPy:   1820.4 ms
GPU cuBLAS:     5.7 ms
Speedup:        319×
GPU TFLOPS:    24.1
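
Where does that TFLOPS figure come from? An N × N matmul does roughly N multiplies and N adds for each of its N² outputs, about 2N³ floating-point operations in total. A quick check against the sample output above (your timings will differ):

N = 4096
flops = 2 * N**3              # ≈ 1.37e11 FLOPs per matmul
t_gpu = 5.7e-3                # seconds, from the sample run
print(flops / t_gpu / 1e12)   # ≈ 24.1 TFLOPS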

Your numbers will be in the same ballpark. Two things worth noting:

What you just measured

Every framework you care about — PyTorch, JAX, Ollama, vLLM, anything that does math at scale — is, at its hot path, doing the exact thing you just benchmarked. cuBLAS is the silent floor under most of modern AI. Apple has its own BLAS (in the Accelerate framework), but it's smaller, slower, and lacks cuBLAS's deep PyTorch integration.

The third-party CUDA-only universe

NVIDIA's own libraries are only half the story. The other half is open-source projects that only run on CUDA — not because the authors hate other vendors, but because building for CUDA is the default and porting is hard. A partial list of what you'd lose without CUDA:

Tool · What it does · Apple/AMD parity?
FlashAttention 3 · Memory-efficient attention for long contexts · None / partial via custom forks
vLLM · Production LLM serving with continuous batching · None on Apple, partial on AMD
Unsloth · 2× faster LoRA fine-tuning · CUDA-only
TensorRT-LLM · NVIDIA's production LLM compiler · NVIDIA-only by definition
bitsandbytes (4/8-bit quant) · Run big models in less memory · Limited Apple support, lagging features
DeepSpeed / Megatron-LM · Multi-GPU training at scale · CUDA-only
Liger-Kernel / Apex · Hand-tuned training kernels · CUDA-only
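
To feel how hard that "CUDA-only" column bites, try the first row yourself. A minimal sketch, assuming you've installed the flash-attn package (its wheels only build against CUDA):

import torch
from flash_attn import flash_attn_func   # this import alone fails off-CUDA

# layout: (batch, seqlen, heads, head_dim) — fp16/bf16 required
q = torch.randn(1, 4096, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 4096, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 4096, 8, 64, device="cuda", dtype=torch.float16)

out = flash_attn_func(q, k, v, causal=True)   # the kernel from the 2022 paper
print(out.shape)   # torch.Size([1, 4096, 8, 64])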

This is the part that doesn't fit on a spec sheet. When you read a benchmark blog post that says "Apple is 80% as fast as NVIDIA" — they're usually testing the 20% of tasks where Apple has parity. The other 80% won't even run.

How to read this for your own decisions

For the rest of the week, you can use one mental shortcut: "is this AI technique CUDA-only?" If yes, NVIDIA wins by default for any team doing it. The list above is most of modern AI. That's the moat.

Strategic lens

When evaluating a chip competitor — AMD, Apple, a Chinese startup — ask: "How many of those library entries do they have at parity? How many do they have at all?" The answer is almost never above 60%. That gap is what the chip executives lose sleep over, and what NVIDIA prices into its 70% gross margin.

Today's reflection

Run the benchmark. Save the output as day2-cublas.txt. Look at the speedup number. That's not "GPU vs CPU." That's seventeen years of paid library work, condensed into a single number.