Why the moat isn't the chips. It's the libraries.
Anyone with a few billion dollars can fab a GPU. Almost nobody can ship the seventeen years of hand-written, paper-published, battle-tested numerical libraries on top. Today we tour that catalog — the real reason every AI lab is locked in — and watch one of those libraries do something on your DGX Spark that would melt a CPU.
The mental model: NVIDIA's library stack is a city
Imagine you wanted to compete with Manhattan. You can buy land. You can pour concrete. What you can't buy is the two centuries of businesses, institutions, and hard-won expertise that have accumulated in that one place. CUDA's library catalog is that city. It's not just one product — it's fifteen-plus standalone libraries, each one the world's fastest implementation of its job, plus a thriving ecosystem of third-party libraries built on top.
The catalog — what every framework actually calls
Foundational libraries (NVIDIA's own)
Higher-level libraries (NVIDIA + ecosystem)
Every library in that catalog is also a hiring funnel. Every grad student who writes a paper using cuDNN learns CUDA. Every researcher who ships open-source on top of vLLM trains the next cohort. Apple has a fraction of the equivalent libraries because it has had a fraction of the academic and open-source mindshare. Mindshare compounds harder than silicon ever could.
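None of this is hypothetical on your own machine. A stock CUDA build of PyTorch will tell you which of the foundational libraries it was compiled against; a quick check (exact output depends on your build):

```python
import torch

# The build string lists the NVIDIA libraries this PyTorch binary links
# against (cuBLAS, cuDNN, NCCL, ...). Exact contents vary by build.
print(torch.__config__.show())

print("CUDA runtime :", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
print("GPU          :", torch.cuda.get_device_name(0))
```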
Why this is hard to replicate
Three reasons, in order of severity:
1 · Each library is years of paid PhD-level work
cuDNN alone is hundreds of thousands of lines of hand-tuned C++ and CUDA, with a separate fast path for every (op × precision × shape × architecture) combination. NVIDIA has employed dozens of full-time numerical specialists on it for ten years. AMD's counterpart (MIOpen) is decent, but it is not at parity.
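You can watch a sliver of that shape-specialization from Python. Setting `torch.backends.cudnn.benchmark = True` asks cuDNN to time its candidate kernels for your exact convolution shape and cache the winner; a rough sketch (the shapes here are arbitrary, and the gap varies by GPU):

```python
import time, torch

x = torch.randn(32, 64, 224, 224, device="cuda")
conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda()

def avg_ms(label):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(50):
        conv(x)
    torch.cuda.synchronize()
    print(f"{label}: {(time.perf_counter() - t0) / 50 * 1000:.2f} ms")

# Default: cuDNN picks a kernel heuristically for this (op, precision, shape, arch).
torch.backends.cudnn.benchmark = False
conv(x)                      # warmup
avg_ms("heuristic kernel pick")

# Benchmark mode: cuDNN tries its candidate kernels for this exact shape
# and caches the fastest, i.e. the "separate fast path" made visible.
torch.backends.cudnn.benchmark = True
conv(x)                      # first call runs the autotune
avg_ms("autotuned kernel pick")
```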
2 · The ecosystem self-reinforces
FlashAttention came from Stanford. vLLM came from Berkeley. Unsloth came from a tiny startup. None of them work on AMD without significant porting. The reason isn't malice — it's that the people writing these tools learned CUDA first, ran their experiments on CUDA, published in CUDA, and the next cohort inherited the same stack.
3 · The chip designers and the library writers are in the same building
When NVIDIA designs a new architecture (Blackwell), the cuBLAS team has been writing kernels for it for two years already. By launch day, the library is ready. AMD's library team gets the chip after launch and starts catching up.
Lab: feel the library do its thing
Quick benchmark to make Day 2 concrete. We'll multiply two large matrices two ways: NumPy on the CPU (which already calls a tuned BLAS) and PyTorch + cuBLAS on your DGX Spark's GPU.
~/cuda-week/bench_matmul.py

```python
import time, torch, numpy as np

N = 4096  # 4096 × 4096 matrices
A_np = np.random.randn(N, N).astype(np.float32)
B_np = np.random.randn(N, N).astype(np.float32)

# --- CPU via NumPy (which uses MKL/OpenBLAS — already very fast) ---
t0 = time.perf_counter()
C_np = A_np @ B_np
t_cpu = time.perf_counter() - t0
print(f"CPU NumPy:  {t_cpu*1000:7.1f} ms")

# --- GPU via PyTorch (which calls cuBLAS) ---
A = torch.from_numpy(A_np).cuda()
B = torch.from_numpy(B_np).cuda()
torch.cuda.synchronize()

# warmup — first call compiles kernels
_ = A @ B; torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(10):
    C = A @ B
torch.cuda.synchronize()
t_gpu = (time.perf_counter() - t0) / 10
print(f"GPU cuBLAS: {t_gpu*1000:7.1f} ms")

# --- the punchline ---
print(f"Speedup:    {t_cpu/t_gpu:7.0f}×")
flops = 2 * N**3
print(f"GPU TFLOPS: {flops/t_gpu/1e12:7.1f}")
```
```
$ python3 bench_matmul.py
CPU NumPy:   1820.4 ms
GPU cuBLAS:      5.7 ms
Speedup:         319×
GPU TFLOPS:     24.1
```
Your numbers will be in the same ballpark. Two things worth noting:
- NumPy on the CPU is already very fast — it calls MKL/OpenBLAS, the CPU equivalent of cuBLAS. The 300× speedup is GPU vs tuned CPU, not vs naive Python.
- That 24 TFLOPS measurement is your DGX hitting roughly 80% of theoretical peak. That's cuBLAS doing its job. A naive hand-written kernel would get maybe 30%.
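If you want to reproduce that percentage, divide the measured TFLOPS by the FP32 peak on your chip's spec sheet. The peak below is a placeholder for illustration, not an official DGX Spark figure:

```python
measured_tflops = 24.1   # from the benchmark output above
peak_tflops = 30.0       # placeholder; substitute your chip's FP32 spec
print(f"Utilization: {measured_tflops / peak_tflops:.0%}")   # ~80%
```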
Every framework you care about — PyTorch, JAX, Ollama, vLLM, anything that does math at scale — is, at its hot path, doing the exact thing you just benchmarked. cuBLAS is the silent floor under most of modern AI. Apple has its own version (the BLAS in its Accelerate framework), but it's smaller in scope, slower, and lacks cuBLAS's deep PyTorch integration.
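If you want to catch cuBLAS in the act, profile a matmul and look at the kernel names PyTorch launches. The exact names are architecture-specific, so treat whatever the table prints as illustrative:

```python
import torch
from torch.profiler import profile, ProfilerActivity

A = torch.randn(4096, 4096, device="cuda")
B = torch.randn(4096, 4096, device="cuda")
_ = A @ B                      # warmup keeps one-time init out of the trace
torch.cuda.synchronize()

# The top CUDA kernel should be a cuBLAS/CUTLASS GEMM with an
# architecture-specific name (typically containing "gemm").
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    _ = A @ B
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```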
The third-party CUDA-only universe
NVIDIA's own libraries are only half the story. The other half is open-source projects that only run on CUDA — not because the authors hate other vendors, but because building for CUDA is the default and porting is hard. A partial list of what you'd lose without CUDA:
| Tool | What it does | Apple/AMD parity? |
|---|---|---|
| FlashAttention 3 | Memory-efficient attention for long contexts | None / partial via custom forks |
| vLLM | Production LLM serving with continuous batching | None on Apple, partial on AMD |
| Unsloth | 2× faster LoRA fine-tuning | CUDA-only |
| TensorRT-LLM | NVIDIA's production LLM compiler | NVIDIA-only by definition |
| bitsandbytes (4/8-bit quant) | Run big models in less memory | Limited Apple support, lagging features |
| DeepSpeed / Megatron-LM | Multi-GPU training at scale | CUDA-only |
| Liger-Kernel / Apex | Hand-tuned training kernels | CUDA-only |
This is the part that doesn't fit on a spec sheet. When a benchmark blog post claims "Apple is 80% as fast as NVIDIA," it's usually testing the 20% of tasks where Apple has parity. The other 80% won't even run.
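To make one row of that table concrete, here's roughly what the bitsandbytes path looks like through Hugging Face transformers. The checkpoint name is only an example, and outside CUDA this path is limited at best, per the table above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantized load; this path depends on bitsandbytes' CUDA kernels.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # example checkpoint, swap in your own
    quantization_config=bnb_config,
    device_map="auto",
)
```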
How to read this for your own decisions
For the rest of the week, you can use one mental shortcut: "is this AI technique CUDA-only?" If yes, NVIDIA wins by default for any team doing it. The list above is most of modern AI. That's the moat.
When evaluating a chip competitor — AMD, Apple, a Chinese startup — ask: "How many of those library entries do they have at parity? How many do they have at all?" The answer is almost never above 60%. That gap is what the chip executives lose sleep over, and what NVIDIA prices into its 70% gross margin.
Today's reflection
Run the benchmark. Save the output as day2-cublas.txt. Look at the speedup number. That's not "GPU vs CPU." That's seventeen years of paid library work, condensed into a single number.