5 min read

100× cheaper than GPT-4o: what a $1.61/hr H100 spot VM actually delivers for IaC

Azure · Spot GPU · vLLM · FP8 · Qwen · Tokens · FinOps

Headline: a single NC40ads_H100_v5 spot VM running Qwen2.5-Coder-32B FP8 on vLLM serves output tokens at $0.095 per million at peak throughput. GPT-4o on Azure OpenAI is $10/M output. The ratio is ~105×.

That's the headline number, and it's true under specific conditions. What follows is what those conditions actually are, why most workloads don't hit them, and what the realistic operating cost looks like.

How $0.095/M is calculated

Three inputs:

  1. Spot price: NC40ads_H100_v5 in Indonesia Central / Southeast Asia, $1.61/hr (it varies between $1.51 and $1.61).
  2. Throughput: Qwen2.5-Coder-32B FP8 on vLLM, batch=128 concurrent users, 4,715 tok/s aggregate output.
  3. Arithmetic: 4,715 tok/s × 3,600 s = 16.97M tokens/hr. $1.61/hr ÷ 16.97M = $0.0949 per million output tokens.

That's at peak utilisation. Drop to batch=32 (still strong, 1,317 tok/s) and you're at $0.34/M. Drop to batch=8 (interactive sweet spot, 595 tok/s) and you're at $0.75/M. Drop to batch=1 and you're at $9.71/M, within striking distance of GPT-4o's $10/M.
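The arithmetic above is simple enough to rerun against your own numbers. A minimal sketch, using only the spot price and throughput figures quoted in this post (the function and variable names are illustrative):

```python
def cost_per_million(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per million output tokens for a VM billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


SPOT_PRICE_PER_HOUR = 1.61  # NC40ads_H100_v5 spot price quoted above

# Aggregate output throughput (tok/s) at each batch size, as quoted above.
THROUGHPUT_BY_BATCH = {128: 4715, 32: 1317, 8: 595}

for batch, tok_s in THROUGHPUT_BY_BATCH.items():
    cost = cost_per_million(SPOT_PRICE_PER_HOUR, tok_s)
    print(f"batch={batch:>3}: {tok_s:>5} tok/s -> ${cost:.3f}/M output tokens")
```

Swap in your own measured tok/s and the current spot price to see where your workload lands.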

The 100× ratio only holds at sustained batch=128.

What batch=128 actually means

Batch size in vLLM is the number of concurrent inference requests sharing the GPU's KV cache and being processed in parallel. At batch=128, the H100 NVL is genuinely full: every SM has work and the HBM3 memory bandwidth is saturated.

To hit batch=128 sustained, you need:

  • 128 simultaneous in-flight requests, continuously. Each has its own KV cache slot. The GPU never sits idle waiting for the next request.
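  • Average prompt length matched to KV-cache capacity. If your prompts are 8K input, 128 simultaneous prompts is ~1M tokens of KV cache before any generation starts. Qwen2.5-Coder-32B FP8 has ~30GB of model weights on a 94GB H100 NVL, leaving at most ~64GB for KV cache. Assuming the model's published configuration (64 layers, 8 KV heads, head dimension 128), each token costs roughly 256 KiB of KV cache at FP16, or 128 KiB with an FP8 KV cache, so 64GB holds roughly 250K–500K tokens. That fits 128 requests of about 2K–4K tokens each, not 8K; longer prompts force a smaller effective batch.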
  • A workload that genuinely loads the GPU like that. This is the part most teams underestimate.
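The KV-cache budget in the second bullet can be sanity-checked with a back-of-envelope estimate. The sketch below assumes Qwen2.5-32B's published configuration (64 layers, 8 KV heads, head dimension 128), which is not stated in this post, and it ignores activation memory and vLLM's own overheads. Under those assumptions, 128 prompts of 8K tokens need well over 64GB even with an FP8 KV cache, so the prompt length that fits at batch=128 is closer to 2K–4K tokens.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_tokens_per_request(budget_gb: float, batch: int,
                           bytes_per_token: int) -> int:
    """Longest sequence each request can hold if the budget is split evenly."""
    return int(budget_gb * 1e9 // (batch * bytes_per_token))


# Assumed architecture (from the public model config, not from this post).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
KV_BUDGET_GB = 64      # ~94GB HBM minus ~30GB of FP8 weights, as quoted above
BATCH = 128
PROMPT_TOKENS = 8192   # the "8K input" case from the bullet

for label, width in (("FP16 KV cache", 2), ("FP8 KV cache", 1)):
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, width)
    needed_gb = BATCH * PROMPT_TOKENS * per_token / 1e9
    fits = max_tokens_per_request(KV_BUDGET_GB, BATCH, per_token)
    print(f"{label}: 8K prompts need ~{needed_gb:.0f}GB; "
          f"{KV_BUDGET_GB}GB fits ~{fits} tokens per request at batch={BATCH}")
```

Treat the result as an upper bound: real deployments reserve memory for activations and rarely hand the whole remainder to the KV cache.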

The workloads that do sustain batch=128:

  • Production embeddings pipelines processing 1,000+ documents/minute
  • Batch document summarisation jobs
  • High-traffic chatbot RAG (200+ concurrent users)
  • Code completion across 200+ developers
  • Automated content generation at scale (10K+ articles/day)

The workloads that don't:

  • 5–50-developer teams using AI coding assistants. Bursty. Long idle periods between agentic actions.
  • Occasional batch jobs. Idle most of the time.
  • Most enterprise IaC generation. Engineer fires a request, reads the output, edits, re-runs. GPU at batch=2–4, mostly.
  • Single-user development. Always batch=1.

What realistic utilisation costs

Honest operating economics for a developer-tooling team self-hosting one NC40ads:

| Workload pattern | Avg batch size | Tok/s aggregate | $/M output |
|---|---|---|---|
| 5-dev bursty (Cline + Continue.dev) | ~3 | 200 | $2.24 |
| 25-dev mixed | ~12 | 800 | $0.56 |
| 50-dev sustained | ~30 | 1,250 | $0.36 |
| 100-dev or batch processing | ~60 | 2,200 | $0.20 |
| Production embeddings sustained | ~128 | 4,715 | $0.095 |

The 100× number is the best case: it corresponds to the lowest cost the hardware can reach. The realistic saving for typical developer-tooling teams is more like 10–25× cheaper than GPT-4o, not 100×. Still a significant gap, but a different story.
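Another way to read these figures is to invert the arithmetic and ask what sustained throughput the VM needs before it beats a given per-token API price. A rough sketch, using the $1.61/hr spot price, GPT-4o's $10/M, and the $0.71/M Foundry MaaS price for Llama-3.3-70B quoted later in this post (the function name is illustrative, and the Foundry comparison assumes you would accept the two models as substitutes):

```python
def break_even_tok_s(price_per_hour: float, api_price_per_million: float) -> float:
    """Sustained output tok/s at which self-hosting matches the API price."""
    tokens_per_hour = price_per_hour / api_price_per_million * 1_000_000
    return tokens_per_hour / 3600


SPOT_PRICE_PER_HOUR = 1.61

# Per-million output prices quoted in this post.
API_PRICES = {
    "GPT-4o (AOAI)": 10.00,
    "Llama-3.3-70B (Foundry MaaS)": 0.71,
}

for name, price in API_PRICES.items():
    needed = break_even_tok_s(SPOT_PRICE_PER_HOUR, price)
    print(f"{name}: break-even at ~{needed:.0f} tok/s sustained")
```

Any workload whose average aggregate throughput sits below the break-even figure pays more per token on the spot VM than on the API, regardless of what the peak benchmark says.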

What about Foundry MaaS for the same model class?

Open-weights models on Foundry MaaS are priced more aggressively than Azure OpenAI:

| Model | API $/M output (Foundry or AOAI) | Self-host best case |
|---|---|---|
| Llama-3.3-70B (Foundry MaaS) | $0.71 | $0.286 (NC80adis TP=2 at batch=128) |
| DeepSeek-V3.2-Special | $1.68 | n/a (Foundry-only, 671B too big to self-host on a single H100) |
| Qwen2.5-Coder-32B | not on Foundry | $0.095 (NC40ads, batch=128) |
| GPT-4o (AOAI) | $10.00 | n/a |
| GPT-5 (AOAI) | $10.00 | n/a |
| GPT-5-codex (AOAI) | $14.00 | n/a |
| gpt-5-pro (AOAI) | $120.00 | n/a |

The 100× headline holds against GPT-4o-class frontier APIs. Against Foundry MaaS open-weights models, the ratio is more like 3–8× at sustained utilisation. For smaller open-weights models that aren't on Foundry at all (Qwen3.6, Gemma-4, Qwen2.5-Coder), self-hosting is the only option: there's no per-token alternative to compare against.

The actual rule for when this matters

The $0.095/M number is real but conditional. The decision rule that follows from it:

  1. If your sustained GPU utilisation is below 30%, ignore the headline. Foundry MaaS will be cheaper, simpler, and good enough on quality.
  2. If your sustained utilisation is above 60% on a workload Foundry serves, do the per-token math, not the per-hour math. At 60% utilisation 24/7, the spot VM still beats Foundry on most open-weights comparisons, but the ratio is 5–10×, not 100×.
  3. If your sustained utilisation is above 60% on a workload Foundry doesn't serve (e.g. a specific quantised variant, a code-specific fine-tune, a vision-language model), self-host. The 100× headline doesn't apply, but the qualitative case is decisive.
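The three rules can be written down as a small function, which also makes the gap in them visible. This is only a sketch of the rule as stated: the function name and return strings are illustrative, and the handling of the 30–60% band, which the rules above don't cover, is an assumption.

```python
def recommend(sustained_utilisation: float, foundry_serves_workload: bool) -> str:
    """Apply the decision rule above. Utilisation is a fraction in [0, 1]."""
    if not 0.0 <= sustained_utilisation <= 1.0:
        raise ValueError("sustained_utilisation must be between 0 and 1")

    if sustained_utilisation < 0.30:
        # Rule 1: the headline number is out of reach at this utilisation.
        return "use Foundry MaaS"

    if sustained_utilisation > 0.60:
        if foundry_serves_workload:
            # Rule 2: compare on $/M tokens, not $/hr.
            return "compare per-token cost; spot VM likely wins"
        # Rule 3: no per-token alternative exists.
        return "self-host"

    # Assumption: the post gives no rule for 30-60%, so defer to measurement.
    return "undetermined: measure real utilisation, then compare per-token cost"


print(recommend(0.15, True))
print(recommend(0.75, True))
print(recommend(0.75, False))
print(recommend(0.45, True))
```

Feed it the utilisation you expect on day 90 rather than the figure from a benchmark run.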

The number that gets shared on LinkedIn is the floor of what's possible. The number that should drive your architecture decision is what your actual utilisation will be on day 90, not on your benchmark run.


