DAY 03 · DGX SPARK VS MAC STUDIO ⭐

Same 128 GB. Different machine.

Both ship with 128 GB of unified memory. Both cost about the same. Both can run a 70-billion-parameter model. And they are not interchangeable. Today is the head-to-head you came for — the one that explains, in numbers and in software, why CUDA is the answer to "why this box and not that one."

The illusion of equivalence

Open the spec sheets next to each other and you'd think you were comparing two flavors of the same machine.

NVIDIA
DGX Spark · GB10 Blackwell
$2,999 · ARM + GPU
Unified memory: 128 GB LPDDR5x
Memory bandwidth: ~273 GB/s
GPU compute (BF16): ~280 TFLOPS
FP4 (sparse): ~1,000 TFLOPS
Tensor Cores: 5th-gen, FP4
CPU: 20-core ARM
Power (max): ~240 W
Software: CUDA — full stack
PyTorch backend: CUDA · first-class
vLLM / TensorRT-LLM: native
Multi-GPU scaling: 200 Gbps ConnectX-7 (pair two Sparks)

vs

APPLE
Mac Studio · M3 Ultra
$3,999 · all-Apple silicon
Unified memory: 128 GB LPDDR5
Memory bandwidth: ~800 GB/s
GPU compute (BF16): ~30 TFLOPS
FP4 (sparse): no native FP4
Tensor Cores: no equivalent
CPU: 28-core ARM (perf)
Power (max): ~270 W
Software: Metal · MLX
PyTorch backend: MPS · partial
vLLM / TensorRT-LLM: unsupported
Multi-GPU scaling: single-box only

Three numbers tell the story: memory bandwidth (Mac ~800 GB/s vs DGX ~273 GB/s), BF16 compute (DGX ~280 TFLOPS vs Mac ~30), and FP4 compute (DGX ~1,000 TFLOPS vs Mac none).

The first two define what each machine is good at. The third defines what's even possible.

Bandwidth vs compute — why both numbers matter

Last week (or in your head from Day 1, Wk 0) we drew the picture: a GPU is a kitchen of line cooks (compute) plus a pantry (memory). Two ways to be fast: more cooks, or a faster pantry.

Memory bandwidth (GB/s): Mac 800 · DGX 273
Compute, BF16 (TFLOPS): Mac 30 · DGX 280
Compute, FP4 (TFLOPS): Mac none · DGX ~1,000
Mac has the faster pantry. DGX has way more cooks, plus a special team that only knows FP4.

What "memory-bound" vs "compute-bound" means in practice

LLM inference, especially at small batch sizes, is memory-bound — the bottleneck is reading weights, not multiplying them. That's why Mac's 800 GB/s pantry can keep up with NVIDIA on single-user chat for standard models.

Almost everything else in AI — training, fine-tuning, batched serving, custom kernels, FP4 inference, attention kernels, scientific computing — is compute-bound. Then NVIDIA wins by 5-30×, and the software stack widens the gap further.
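If you want the same point as arithmetic: a workload is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine's own compute-to-bandwidth ratio. A minimal Python sketch, using the rounded spec-sheet numbers above; the intensity values for decode (~2) and big-batch matmuls (~2,000) are ballpark assumptions, not measurements:

import torch  # not needed here; plain arithmetic is enough

# Rounded spec-sheet numbers from the comparison above (not measured values).
specs = {
    "DGX Spark":  {"bw_gbs": 273, "bf16_tflops": 280},
    "Mac Studio": {"bw_gbs": 800, "bf16_tflops": 30},
}

def regime(bw_gbs, tflops, flops_per_byte):
    # Memory-bound if the workload's FLOPs-per-byte sits below the
    # machine's own compute-to-bandwidth ratio; compute-bound otherwise.
    machine_ratio = tflops * 1e12 / (bw_gbs * 1e9)
    return "memory-bound" if flops_per_byte < machine_ratio else "compute-bound"

for name, s in specs.items():
    decode = regime(s["bw_gbs"], s["bf16_tflops"], 2)      # ~2 FLOPs per weight byte
    matmul = regime(s["bw_gbs"], s["bf16_tflops"], 2000)   # heavy weight reuse
    print(f"{name}: single-user decode is {decode}, big-batch matmul is {matmul}")

Decode lands on the memory-bound side for both boxes, which is why the 800 GB/s pantry keeps the Mac in the game; big matmuls land compute-bound, which is where 280 vs 30 TFLOPS takes over.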

The cleanest way to remember it

Mac is great at running models someone else trained.
DGX is great at everything else.

Workload-by-workload — pick what you actually want to do

What's your workload?

Click an option. The card below tells you which machine wins, by how much, and why.

Chat with Llama 8B FP16 · single user, casual
Chat with Llama 70B FP4 · single user, smart model
Serve 20 users at once · internal team app
LoRA fine-tune Llama 8B · overnight job
Train a model from scratch · 10M-param transformer
Run Stable Diffusion XL · image gen
Real-time voice agent · Whisper + LLM + TTS
Scientific simulation · CUDA-only libraries

Verdict (shown for the first workload: single-user Llama 8B chat at FP16)

DGX Spark wins · ~1.5×

Memory-bandwidth-bound workload. Mac's faster pipe almost catches up; NVIDIA edges out via cuBLAS and tensor cores.

DGX ~120 tok/s · Mac ~80 tok/s
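
To check a number like this on your own box, Ollama prints per-reply throughput when run with --verbose (the model tag below is just an example):

$ ollama run llama3.1:8b --verbose

Ask it anything; after each reply Ollama prints timing lines, and the "eval rate" line (tokens per second) is the figure the verdict card is quoting.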

Software, in plain English — what runs on each

What you want to use, and how each box handles it (DGX Spark with CUDA vs Mac Studio):
PyTorch: DGX ✅ first-class, every feature · Mac ⚠️ via MPS, ~80% feature parity, slower
Ollama / llama.cpp: DGX ✅ native CUDA · Mac ✅ excellent Metal support
Hugging Face Transformers: DGX ✅ everything works · Mac ⚠️ most works, some breaks on MPS
vLLM (production serving): DGX ✅ native · Mac ❌ unsupported
TensorRT-LLM: DGX ✅ NVIDIA-only · Mac ❌ unsupported
FlashAttention: DGX ✅ CUDA-only · Mac ❌ workarounds only
Unsloth (fast fine-tuning): DGX ✅ CUDA-only · Mac ❌ unsupported
bitsandbytes (4/8-bit): DGX ✅ full · Mac ⚠️ limited
AWQ / GPTQ quantization: DGX ✅ standard · Mac ⚠️ limited
Stable Diffusion / SDXL: DGX ✅ first-day support · Mac ⚠️ works, slower, buggier
MLX (Apple's native): DGX ❌ irrelevant · Mac ✅ Apple-only, growing fast
JAX: DGX ✅ via XLA · Mac ⚠️ via Metal, partial
cuBLAS / cuDNN / CUTLASS: DGX ✅ the moat · Mac ❌ (Apple has Accelerate, smaller)

Three out of every four things in modern AI ship "CUDA-first, Apple-eventually-maybe." The Apple stack is real and improving — MLX is impressive — but it's a junior version of CUDA running about 18 months behind the frontier.
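
In PyTorch terms, code meant to run on either box usually starts with a backend pick like the one below; a minimal sketch using only standard PyTorch calls, nothing vendor-specific assumed:

import torch

# Pick whichever accelerator backend this machine actually has.
if torch.cuda.is_available():              # DGX Spark: CUDA, full op coverage
    device = torch.device("cuda")
elif torch.backends.mps.is_available():    # Mac Studio: MPS, partial op coverage
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(2048, 2048, device=device)
y = x @ x                                  # runs on either backend
print(device, y.shape)

The pick itself is one if-statement; everything after it is where the two paths diverge in coverage and speed.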

The decision tree, honest version

Will you only run pretrained models?
  yes → Big-batch serving or production?
    no → Mac Studio: single-user, quiet, 800 GB/s
    yes → DGX Spark: vLLM, TensorRT-LLM, batches
  no, I want to train, fine-tune, or hack → Need cutting-edge libraries?
    yes (almost always) → DGX Spark: the CUDA-only universe
    rarely → Mac (carefully): light fine-tunes via MLX

Three branches. Two of them point green.

The case for Mac (steelmanned)

It's not a one-sided fight. Mac genuinely wins for some use cases:

Single-user chat with standard pretrained models, where the 800 GB/s memory pipe is the bottleneck that matters.
A quiet machine that is also a first-rate everyday desktop.
Light fine-tunes and experiments through MLX, which is improving fast.
Anything that never touches the CUDA-only half of the ecosystem.

The case for DGX Spark

Everything that is compute-bound or CUDA-only: training, fine-tuning, batched serving, custom kernels, FP4 inference.
The full software stack with no workarounds: first-class PyTorch, vLLM, TensorRT-LLM, FlashAttention, Unsloth.
A path to scale beyond one box over the 200 Gbps link when a single Spark stops being enough.
$1,000 less for the same 128 GB of unified memory.

The framing that makes it click

A Mac Studio is the best AI consumer machine. A DGX Spark is the cheapest AI builder machine. Different products. The spec-sheet overlap is a trap; the software ecosystems live in different worlds.

Lab: prove the gap on your own DGX

1 · Quick FP4 sanity check (Apple cannot do this)

$ ollama pull llama3.3:70b
$ ollama run llama3.3:70b --verbose
>>> /show info
  Model
    architecture        llama
    parameters          70.6B
    quantization        Q4_K_M

A 70B model served as a 4-bit quant on a GPU whose fifth-gen tensor cores treat 4-bit (FP4) as a native format. A Mac runs the same 4-bit model through Metal kernels with no equivalent hardware path; it works, but at roughly half the throughput.

2 · Try to install vLLM (this won't work on a Mac)

$ pip install vllm
$ python3 -c "import vllm; print(vllm.__version__)"
0.7.x

That installs cleanly. The same command on a Mac fails partway through with a CUDA dependency error. The error itself is the lesson.
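
Once the import works, a minimal smoke test looks roughly like this; the model name is just an example (any Hugging Face model that fits in memory works, and gated models need a login):

$ python3 - <<'PY'
from vllm import LLM, SamplingParams

# vLLM loads the weights once, then batches incoming prompts under the hood.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, temperature=0.7)
out = llm.generate(["Why does batching raise GPU throughput?"], params)
print(out[0].outputs[0].text)
PY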

3 · Match a Mac on memory bandwidth (you can't)

$ python3 - <<'PY'
import torch, time

# Streaming add: read x, read y, write z. Pure memory traffic, almost no math.
N = 10_000_000                       # 10M float32 values ≈ 40 MB per tensor
x = torch.randn(N, device="cuda"); y = torch.randn(N, device="cuda")
z = x + y                            # warm-up launch, excluded from the timing
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    z = x + y
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / 100
gbs = 3 * x.numel() * 4 / dt / 1e9   # 3 tensors touched × 4 bytes per element
print(f"Achieved memory bandwidth: {gbs:.0f} GB/s")
PY
Achieved memory bandwidth: 245 GB/s

You'll see ~240–260 GB/s, roughly 90% of the DGX Spark's theoretical peak. A Mac Studio M3 Ultra running the same loop on the MPS backend lands around ~720 GB/s. This is the one number where Mac unambiguously wins.
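
If you have a Mac nearby, the measurement ports almost line for line; a sketch assuming a PyTorch build with the MPS backend enabled:

$ python3 - <<'PY'
import torch, time

# Same streaming-add benchmark, pointed at Apple's MPS backend.
N = 10_000_000
x = torch.randn(N, device="mps"); y = torch.randn(N, device="mps")
z = x + y                            # warm-up, excluded from the timing
torch.mps.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    z = x + y
torch.mps.synchronize()
dt = (time.perf_counter() - t0) / 100
print(f"Achieved memory bandwidth: {3 * N * 4 / dt / 1e9:.0f} GB/s")
PY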

Today's reflection

Pick the workload from the picker that's closest to your reality. Note the verdict and the gap. Imagine running it twice a day for a year. That's the practical impact of CUDA on your week, expressed in tokens per second.
← Day 2 · The Software Stack