
Your $1.61/hr Azure H100 spot VM is faster than your £3,000 MacBook Pro at LLM inference

Azure · Spot GPU · Apple Silicon · MLX · vLLM · Benchmarking

A side-by-side I didn't expect to do: Apple M5 Max MacBook Pro running an MLX-optimised Qwen model vs an Azure NC40ads_H100_v5 spot VM running the same parameter-count model on vLLM.

The headline numbers:

| Setup | Throughput (batch=1) | Cost |
| --- | --- | --- |
| Apple M5 Max MacBook Pro, Qwen 3.5-35B MLX | 118 tok/s | £3,000+ up front |
| Azure NC40ads_H100_v5 spot, Qwen3.6-35B FP8 | 143 tok/s | $1.61/hr (~£110/mo at 20hr/week) |

The cloud VM is 21% faster at single-stream inference, on dramatically cheaper hardware (per hour of use), with the option to scale to batch=128 (4,715 tok/s aggregate) which the laptop cannot.

This isn't a fair fight in any sense. The M5 Max is portable, runs offline, has no spot eviction risk, and doesn't need cloud-init or Terraform. But for sustained AI inference workloads, the cost shape is so different that the comparison is worth making.

What's behind the gap

Three things explain why the laptop, a much smaller, much cheaper, much more constrained piece of hardware, loses to the big cloud GPU even at batch=1, the regime where the laptop should be at its most competitive:

1. HBM3 bandwidth crushes unified memory

The H100 NVL has 3.35 TB/s of HBM3 memory bandwidth. The M5 Max has roughly 500-550 GB/s of unified memory bandwidth. That's a 6× gap.

LLM inference at batch=1 is memory-bandwidth-bound, not compute-bound. The model spends most of its time reading weights from HBM into the SMs, multiplying, writing back. The faster the memory, the faster the inference.

6× more memory bandwidth doesn't translate to 6× more tokens/second because of other bottlenecks (KV cache management, sampling overhead, batch=1 underutilisation), but it does explain why the H100 wins at batch=1 despite being a much bigger chip with much more theoretical compute headroom.
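To make the memory-bound intuition concrete, here's a back-of-the-envelope roofline sketch. It's illustrative, not benchmark code: the bandwidth figures are the ones quoted above, and the active-weight size is a placeholder that depends on quantisation and, for MoE models, on how many parameters actually fire per token.

```python
# Back-of-the-envelope roofline for batch=1 decoding. Every decoded token has to
# stream the active weights through memory at least once, so bandwidth divided by
# bytes read per token gives a hard ceiling on single-stream tokens/sec.
def decode_ceiling_tok_s(mem_bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on tokens/sec, ignoring KV cache traffic and sampling overhead."""
    return mem_bandwidth_gb_s / active_weight_gb

# Illustrative inputs only -- the 10 GB active-weight figure is a placeholder.
print(decode_ceiling_tok_s(3350, 10))  # H100-class HBM3: ~335 tok/s ceiling
print(decode_ceiling_tok_s(550, 10))   # unified-memory-class bandwidth: ~55 tok/s ceiling
```

Whatever active-weight figure you plug in, the ratio between the two ceilings stays at roughly the bandwidth ratio, which is why the gap survives all the real-world overheads.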

2. FP8 quantisation on H100 is hardware-accelerated

H100 has dedicated FP8 tensor cores. The M5 Max runs MLX's 4-bit quantised paths without equivalent dedicated low-precision hardware, which is a different (and somewhat less favourable) precision tradeoff for code generation specifically.

In our Terraform corpus, FP8 models consistently outscored their BF16 equivalents. The M5 Max running MLX's 4-bit quant of Qwen 3.5-35B almost certainly scores below the cloud H100's FP8 of Qwen3.6-35B on the same corpus, but I haven't run the side-by-side. (If anyone with an M5 Max wants to run our corpus, the harness is open.)
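For anyone who does want to run the laptop side, the MLX path is a few lines with mlx-lm. The model id below is a hypothetical 4-bit conversion, not a confirmed Hugging Face path, and the prompt is just an example in the style of our Terraform corpus.

```python
# Minimal mlx-lm sketch for the laptop side of the comparison.
# The model id is hypothetical -- substitute whatever 4-bit MLX conversion you have locally.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-35B-4bit")  # hypothetical repo id
print(generate(
    model,
    tokenizer,
    prompt="Write a Terraform azurerm_linux_virtual_machine block for a spot VM.",
    max_tokens=256,
))
```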

3. Batch size scaling

Batch=1 is the M5 Max's best case. The cloud H100 can scale to batch=128 (4,715 tok/s aggregate) for sustained workloads: embeddings pipelines, multi-user chat, batch document processing. The laptop physically cannot run batch=128 without OOM-ing, regardless of patience.

For a single user doing single requests, the throughput delta is 21%. For any team or any sustained workload, the delta is closer to 40× because the cloud GPU can fill itself with parallel work.
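Here's a minimal vLLM sketch of what "fill the GPU with parallel work" looks like on the cloud side. The model id is a placeholder for whichever FP8 checkpoint you actually serve; max_num_seqs caps how many sequences run in flight.

```python
# Minimal vLLM batch-inference sketch. The model id is a placeholder, not a
# confirmed checkpoint name; everything else is standard vLLM offline-batch usage.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/qwen-35b-fp8",  # placeholder: whichever FP8 checkpoint you serve
    quantization="fp8",
    max_num_seqs=128,               # allow up to 128 sequences in flight
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

prompts = [f"Summarise document {i}." for i in range(128)]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text[:80])
```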

The cost shape comparison

Hardware costs don't compare like-for-like (CapEx vs OpEx, owned vs rented), but a 3-year TCO is illuminating. This assumes the H100 spot VM is spun up on demand for around 20 hours per week of actual use, not left running 24/7 or through full working hours.

| | Apple M5 Max (Pro, 64GB+) | Azure NC40ads spot (on-demand, ~20hr/wk) |
| --- | --- | --- |
| Up-front cost | £3,000+ | £0 |
| Year 1 cash out | £3,000 | £1,320 |
| Year 2 cash out | £0 | £1,320 |
| Year 3 cash out | £0 | £1,320 |
| 3-year total | £3,000 | £3,960 |
| Tax write-off (UK SME) | Capital allowance, 18% pa | OpEx, 100% in year |
| Failure mode | Hardware repair or replace | Stop the cron, swap region |
| Multi-user concurrent | No | Yes (batch=128) |
| Eviction risk | None | A few percent per month |

Year 1, the cloud option is 2.3× cheaper. Over 3 years the costs are roughly even, with a small edge to the laptop. Over any timescale, the cloud VM wins for any team-shared workload, because batch=128 (4,715 tok/s aggregate) is on the table while the laptop cannot get anywhere near it.
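The arithmetic behind the cloud column, for anyone who wants to plug in their own hours. The only added assumption is the USD-to-GBP rate (~1.27), which is what turns $1.61/hr at 20 hr/week into the ~£1,320/year and ~£110/month figures.

```python
# Cash-out arithmetic for the cloud column above. The exchange rate is an assumption.
spot_usd_per_hr = 1.61
hours_per_week = 20
usd_per_gbp = 1.27  # assumed rate

annual_usd = spot_usd_per_hr * hours_per_week * 52   # ~$1,674
annual_gbp = annual_usd / usd_per_gbp                # ~£1,318, i.e. roughly £110/month
print(round(annual_usd), round(annual_gbp), round(annual_gbp / 12))
```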

Where each one wins

The M5 Max wins for:

  • A single developer who wants offline inference (planes, trains, conferences)
  • Compliance scenarios where no inference can leave the device
  • Privacy-first personal workloads
  • Always-available, sub-second startup
  • The "I don't want to manage cloud" simplicity tax

The cloud H100 wins for:

  • Any team-shared workload (concurrent users)
  • Sustained throughput beyond batch=1
  • Models that don't fit in the laptop's unified memory (anything >35B in BF16 or >70B in 4-bit)
  • Production inference where uptime requirements matter
  • Cost-per-token at scale ($1.61/hr across 4,715 tok/s works out to roughly $0.095 per million output tokens)

Making spot work (the unobvious bit)

"Spot VM" makes most architects think of getting evicted halfway through a job. Worth being explicit about how this actually behaves in practice, because the eviction risk is real but manageable with two patterns.

Pick a region that's off-peak when you work. Eviction rates aren't uniform: they cluster at peak demand times in each region. For inference workloads (where latency to your geography rarely matters), deploy the H100 somewhere that's quiet during your working hours. UK team? East Asia spot capacity is plentiful 8am-5pm UK time. In our testing we rarely see evictions outside the local peak.

Price varies more than you'd think. Same NC40ads_H100_v5 SKU ranges from around $1.50/hr in quiet APAC regions up to ~$7/hr in busy UK and US zones. The $1.61/hr figure in this post is the low end, achievable in several APAC regions during European working hours. The cheap regions tend to be newer with marginally higher eviction risk; the expensive regions have rock-solid capacity. For interactive on-demand use the cheap regions win on TCO even after factoring in the occasional eviction restart. Our spot pricing tool shows the current cross-region rates so you can pick the right region for your timezone.
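If you'd rather query the rates yourself, Azure's public Retail Prices API exposes them without authentication. A sketch, assuming the ARM SKU name Standard_NC40ads_H100_v5 and filtering spot meters client-side:

```python
# Cross-region spot price check via the Azure Retail Prices API (no auth needed).
# The SKU name is an assumption to verify against your subscription's SKU list.
import requests

URL = "https://prices.azure.com/api/retail/prices"
FILTER = (
    "serviceName eq 'Virtual Machines' "
    "and armSkuName eq 'Standard_NC40ads_H100_v5' "
    "and priceType eq 'Consumption'"
)

items, url, params = [], URL, {"$filter": FILTER}
while url:
    page = requests.get(url, params=params).json()
    items.extend(page.get("Items", []))
    url, params = page.get("NextPageLink"), None  # follow pagination if present

spot = [i for i in items if "Spot" in i.get("meterName", "")]
for item in sorted(spot, key=lambda i: i["retailPrice"]):
    print(f"{item['armRegionName']:<20} ${item['retailPrice']:.2f}/hr")
```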

Automate the redeploy. The pattern: a Terraform pipeline that re-applies when the scheduledevents metadata probe fires, driven by a 5-second poll for Preempt events (a sketch of the poll is below). The VM comes back inside 3-5 minutes from cold-init with model weights pulled from blob. Net experience: an occasional 5-minute hiccup, not "the workload won't run."
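A minimal sketch of that watcher, assuming it runs on the spot VM itself. The Scheduled Events endpoint and the Preempt event type are Azure's; the redeploy hook is a hypothetical placeholder for whatever kicks off your Terraform re-apply (a CI webhook, a queue message, and so on).

```python
# Poll Azure Scheduled Events for a Preempt notice and trigger a redeploy.
# trigger_redeploy() is a hypothetical placeholder -- wire it to your own pipeline.
import time
import requests

EVENTS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

def trigger_redeploy() -> None:
    print("Preempt scheduled -- kicking off Terraform re-apply")  # placeholder

def watch_for_preempt(interval_s: float = 5.0) -> None:
    while True:
        events = requests.get(
            EVENTS_URL, headers={"Metadata": "true"}, timeout=2
        ).json().get("Events", [])
        if any(e.get("EventType") == "Preempt" for e in events):
            trigger_redeploy()
            return
        time.sleep(interval_s)

if __name__ == "__main__":
    watch_for_preempt()
```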

What spot won't survive: regions with 5%+ eviction rates (typically new regions during their peak hours), or workloads that need uninterrupted hours-long runs like training jobs. For interactive single-developer inference at 20 hrs/week, neither applies. Pick the right region, wire up the redeploy, and spot economics work without drama.

A nuance worth flagging

This comparison is unfair to the M5 Max in one specific way: it's a portable, near-silent, battery-powered device that you also use as your laptop. The cloud H100 is a 700W, rack-mounted, datacenter-bound piece of hardware that does nothing else.

Apple Silicon's value proposition isn't "fastest LLM inference". It's "competitive LLM inference in a 2kg device that runs on battery." That's an engineering achievement. The fact that the cloud H100 is 21% faster at single-stream inference doesn't take away from how impressive the M5 Max is for what it is.

But: if your use case is "sustained LLM inference for a developer team," the laptop isn't actually the right tool. The cloud H100 spot VM is faster, scales beyond batch=1, supports the whole team rather than just one developer, and pays for itself in year one against a fresh laptop purchase. The laptop is the wrong cost shape for the workload.

The wider point

This is one of those comparisons that surprises people. "My laptop has an Apple Silicon chip; surely it's good enough?" Sometimes yes: for personal experiments, for offline work, for single-user development. For team workloads, the cost structure of cloud spot GPU makes it the obvious choice, and the throughput surprise (cloud is also faster, not just bigger) closes the case.

If you've been pricing AI inference architectures off your team's laptops, the £110/month on-demand cloud H100 deserves a spot in the comparison. The numbers aren't where you'd expect them.

Want help sizing AI inference for your team? Request a free assessment.
