DAY 03 · DGX SPARK VS MAC STUDIO ⭐

Same 128 GB. Different machine.

Both ship with 128 GB of unified memory. Both cost about the same. Both can run a 70-billion-parameter model. And they are not interchangeable. Today is the head-to-head you came for — the one that explains, in numbers and in software, why CUDA is the answer to "why this box and not that one."

The illusion of equivalence

Open the spec sheets next to each other and you'd think you were comparing two flavors of the same machine.

NVIDIA
DGX Spark · GB10 Blackwell
$2,999 · ARM + GPU
Unified memory: 128 GB LPDDR5x
Memory bandwidth: ~273 GB/s
GPU compute (BF16): ~280 TFLOPS
FP4 (sparse): ~1,000 TFLOPS
Tensor Cores: 5th-gen, FP4
CPU: 20-core ARM
Power (max): ~240 W
Software: CUDA — full stack
PyTorch backend: CUDA · first-class
vLLM / TensorRT-LLM: native
Multi-GPU scaling: 200 Gbps ConnectX-7 (pair two Sparks)

vs

APPLE
Mac Studio · M3 Ultra
$3,999 · all-Apple silicon
Unified memory: 128 GB LPDDR5
Memory bandwidth: ~800 GB/s
GPU compute (BF16): ~30 TFLOPS
FP4 (sparse): no native FP4
Tensor Cores: no equivalent
CPU: 28-core ARM (perf)
Power (max): ~270 W
Software: Metal · MLX
PyTorch backend: MPS · partial
vLLM / TensorRT-LLM: unsupported
Multi-GPU scaling: single-box only

Three numbers tell the story: memory bandwidth (Mac ~800 GB/s vs DGX ~273 GB/s), BF16 compute (DGX ~280 TFLOPS vs Mac ~30), and FP4 compute (DGX ~1,000 TFLOPS vs Mac none).

The first two define what each machine is good at. The third defines what's even possible.

Bandwidth vs compute — why both numbers matter

Last week (or in your head from Day 1, Wk 0) we drew the picture: a GPU is a kitchen of line cooks (compute) plus a pantry (memory). Two ways to be fast: more cooks, or a faster pantry.

Memory bandwidth (GB/s): Mac 800 · DGX 273
Compute, BF16 (TFLOPS): Mac 30 · DGX 280
Compute, FP4 (TFLOPS): Mac none · DGX ~1,000
Mac has the faster pantry. DGX has way more cooks, plus a special team that only knows FP4.

What "memory-bound" vs "compute-bound" means in practice

LLM inference, especially at small batch sizes, is memory-bound — the bottleneck is reading weights, not multiplying them. That's why Mac's 800 GB/s pantry can keep up with NVIDIA on single-user chat for standard models.

Almost everything else in AI — training, fine-tuning, batched serving, custom kernels, FP4 inference, attention kernels, scientific computing — is compute-bound. Then NVIDIA wins by 5-30×, and the software stack widens the gap further.
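If you want the same point as arithmetic: a workload is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine's own compute-to-bandwidth ratio. A minimal Python sketch, using the rounded spec-sheet numbers above; the intensity values for decode (~2) and big-batch matmuls (~2,000) are ballpark assumptions, not measurements:

import torch  # not needed here; plain arithmetic is enough

# Rounded spec-sheet numbers from the comparison above (not measured values).
specs = {
    "DGX Spark":  {"bw_gbs": 273, "bf16_tflops": 280},
    "Mac Studio": {"bw_gbs": 800, "bf16_tflops": 30},
}

def regime(bw_gbs, tflops, flops_per_byte):
    # Memory-bound if the workload's FLOPs-per-byte sits below the
    # machine's own compute-to-bandwidth ratio; compute-bound otherwise.
    machine_ratio = tflops * 1e12 / (bw_gbs * 1e9)
    return "memory-bound" if flops_per_byte < machine_ratio else "compute-bound"

for name, s in specs.items():
    decode = regime(s["bw_gbs"], s["bf16_tflops"], 2)      # ~2 FLOPs per weight byte
    matmul = regime(s["bw_gbs"], s["bf16_tflops"], 2000)   # heavy weight reuse
    print(f"{name}: single-user decode is {decode}, big-batch matmul is {matmul}")

Decode lands on the memory-bound side for both boxes, which is why the 800 GB/s pantry keeps the Mac in the game; big matmuls land compute-bound, which is where 280 vs 30 TFLOPS takes over.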

The cleanest way to remember it

Mac is great at running models someone else trained.
DGX is great at everything else.

Workload-by-workload — pick what you actually want to do

What's your workload?

Click an option. The card below tells you which machine wins, by how much, and why.

Chat with Llama 8B FP16 · single user, casual
Chat with Llama 70B FP4 · single user, smart model
Serve 20 users at once · internal team app
LoRA fine-tune Llama 8B · overnight job
Train a model from scratch · 10M-param transformer
Run Stable Diffusion XL · image gen
Real-time voice agent · Whisper + LLM + TTS
Scientific simulation · CUDA-only libraries

Verdict (shown for the first workload: single-user Llama 8B chat at FP16)

DGX Spark wins · ~1.5×

Memory-bandwidth-bound workload. Mac's faster pipe almost catches up; NVIDIA edges out via cuBLAS and tensor cores.

DGX ~120 tok/s · Mac ~80 tok/s
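
To check a number like this on your own box, Ollama prints per-reply throughput when run with --verbose (the model tag below is just an example):

$ ollama run llama3.1:8b --verbose

Ask it anything; after each reply Ollama prints timing lines, and the "eval rate" line (tokens per second) is the figure the verdict card is quoting.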

Software, in plain English — what runs on each

What you want to use, and how each box handles it (DGX Spark with CUDA vs Mac Studio):
PyTorch: DGX ✅ first-class, every feature · Mac ⚠️ via MPS, ~80% feature parity, slower
Ollama / llama.cpp: DGX ✅ native CUDA · Mac ✅ excellent Metal support
Hugging Face Transformers: DGX ✅ everything works · Mac ⚠️ most works, some breaks on MPS
vLLM (production serving): DGX ✅ native · Mac ❌ unsupported
TensorRT-LLM: DGX ✅ NVIDIA-only · Mac ❌ unsupported
FlashAttention: DGX ✅ CUDA-only · Mac ❌ workarounds only
Unsloth (fast fine-tuning): DGX ✅ CUDA-only · Mac ❌ unsupported
bitsandbytes (4/8-bit): DGX ✅ full · Mac ⚠️ limited
AWQ / GPTQ quantization: DGX ✅ standard · Mac ⚠️ limited
Stable Diffusion / SDXL: DGX ✅ first-day support · Mac ⚠️ works, slower, buggier
MLX (Apple's native): DGX ❌ irrelevant · Mac ✅ Apple-only, growing fast
JAX: DGX ✅ via XLA · Mac ⚠️ via Metal, partial
cuBLAS / cuDNN / CUTLASS: DGX ✅ the moat · Mac ❌ (Apple has Accelerate, smaller)

Three out of every four things in modern AI ship "CUDA-first, Apple-eventually-maybe." The Apple stack is real and improving — MLX is impressive — but it's a junior version of CUDA running about 18 months behind the frontier.
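
In PyTorch terms, code meant to run on either box usually starts with a backend pick like the one below; a minimal sketch using only standard PyTorch calls, nothing vendor-specific assumed:

import torch

# Pick whichever accelerator backend this machine actually has.
if torch.cuda.is_available():              # DGX Spark: CUDA, full op coverage
    device = torch.device("cuda")
elif torch.backends.mps.is_available():    # Mac Studio: MPS, partial op coverage
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(2048, 2048, device=device)
y = x @ x                                  # runs on either backend
print(device, y.shape)

The pick itself is one if-statement; everything after it is where the two paths diverge in coverage and speed.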

The decision tree, honest version

Will you only run pretrained models?
  yes → Big-batch serving or production?
    no → Mac Studio: single-user, quiet, 800 GB/s
    yes → DGX Spark: vLLM, TensorRT-LLM, batches
  no, I want to train, fine-tune, or hack → Need cutting-edge libraries?
    yes (almost always) → DGX Spark: the CUDA-only universe
    rarely → Mac (carefully): light fine-tunes via MLX

Three branches. Two of them point green.

The case for Mac (steelmanned)

It's not a one-sided fight. Mac genuinely wins for some use cases:

Single-user chat with standard pretrained models, where the 800 GB/s memory pipe is the bottleneck that matters.
A quiet machine that is also a first-rate everyday desktop.
Light fine-tunes and experiments through MLX, which is improving fast.
Anything that never touches the CUDA-only half of the ecosystem.

The case for DGX Spark

Everything that is compute-bound or CUDA-only: training, fine-tuning, batched serving, custom kernels, FP4 inference.
The full software stack with no workarounds: first-class PyTorch, vLLM, TensorRT-LLM, FlashAttention, Unsloth.
A path to scale beyond one box over the 200 Gbps link when a single Spark stops being enough.
$1,000 less for the same 128 GB of unified memory.

The framing that makes it click

A Mac Studio is the best AI consumer machine. A DGX Spark is the cheapest AI builder machine. Different products. The spec-sheet overlap is a trap; the software ecosystems live in different worlds.

Lab: prove the gap on your own DGX

1 · Quick FP4 sanity check (Apple cannot do this)

$ ollama pull llama3.3:70b
$ ollama run llama3.3:70b --verbose
>>> /show info
  Model
    architecture        llama
    parameters          70.6B
    quantization        Q4_K_M

A 70B model served as a 4-bit quant on a GPU whose fifth-gen tensor cores treat 4-bit (FP4) as a native format. A Mac runs the same 4-bit model through Metal kernels with no equivalent hardware path; it works, but at roughly half the throughput.

2 · Try to install vLLM (this won't work on a Mac)

$ pip install vllm
$ python3 -c "import vllm; print(vllm.__version__)"
0.7.x

That installs cleanly. The same command on a Mac fails partway through with a CUDA dependency error. The error itself is the lesson.
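
Once the import works, a minimal smoke test looks roughly like this; the model name is just an example (any Hugging Face model that fits in memory works, and gated models need a login):

$ python3 - <<'PY'
from vllm import LLM, SamplingParams

# vLLM loads the weights once, then batches incoming prompts under the hood.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, temperature=0.7)
out = llm.generate(["Why does batching raise GPU throughput?"], params)
print(out[0].outputs[0].text)
PY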

3 · Match a Mac on memory bandwidth (you can't)

$ python3 - <<'PY'
import torch, time

# Streaming add: read x, read y, write z. Pure memory traffic, almost no math.
N = 10_000_000                       # 10M float32 values ≈ 40 MB per tensor
x = torch.randn(N, device="cuda"); y = torch.randn(N, device="cuda")
z = x + y                            # warm-up launch, excluded from the timing
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    z = x + y
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / 100
gbs = 3 * x.numel() * 4 / dt / 1e9   # 3 tensors touched × 4 bytes per element
print(f"Achieved memory bandwidth: {gbs:.0f} GB/s")
PY
Achieved memory bandwidth: 245 GB/s

You'll see ~240–260 GB/s, roughly 90% of the DGX Spark's theoretical peak. A Mac Studio M3 Ultra running the same loop on the MPS backend lands around ~720 GB/s. This is the one number where Mac unambiguously wins.
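
If you have a Mac nearby, the measurement ports almost line for line; a sketch assuming a PyTorch build with the MPS backend enabled:

$ python3 - <<'PY'
import torch, time

# Same streaming-add benchmark, pointed at Apple's MPS backend.
N = 10_000_000
x = torch.randn(N, device="mps"); y = torch.randn(N, device="mps")
z = x + y                            # warm-up, excluded from the timing
torch.mps.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    z = x + y
torch.mps.synchronize()
dt = (time.perf_counter() - t0) / 100
print(f"Achieved memory bandwidth: {3 * N * 4 / dt / 1e9:.0f} GB/s")
PY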

Today's reflection

Pick the workload from the picker that's closest to your reality. Note the verdict and the gap. Imagine running it twice a day for a year. That's the practical impact of CUDA on your week, expressed in tokens per second.
← Day 2 · The Software Stack