
I benchmarked 17 frontier LLMs against a 50-task Terraform corpus for £19

Azure AI Foundry · LLM · Terraform · FinOps · Benchmarking

Headline number first: 17 frontier LLMs × 50 Terraform tasks each = $74.62 (~£59). Strip out one outlier and you're at $23.61 (~£19) for 16 models. That outlier is gpt-5-pro at $51.01 alone, 68% of the entire session spend on a single model. More on that below.

I'd just spent two weeks running the same Terraform corpus on self-hosted H100 spot VMs. Cost per model, per benchmark run: ~£19. So Foundry ran 17 models for what one self-hosted run cost me, at higher quality, with zero infrastructure to maintain.

This post is just the cost data. The full decision framework, including when self-hosted still wins, comes in a follow-up.

What I ran

50 Terraform IaC tasks, each validated through terraform fmt → validate → plan. Two self-correction rounds per task. Concurrency = 5. Models hit via Azure AI Foundry (UK South for AOAI, Sweden Central for MaaS).
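The harness loop, roughly, looks like this. A minimal sketch, assuming a local `terraform` binary and one directory per task; `run_stage`, `score_task`, and the `regenerate` callback are illustrative names, not the actual harness:

```python
import subprocess
from pathlib import Path

STAGES = ["fmt", "validate", "plan"]
MAX_ROUNDS = 2  # self-correction rounds per task

def run_stage(task_dir: Path, stage: str) -> tuple[bool, str]:
    """Run one terraform gate; return (passed, combined output)."""
    args = ["terraform", stage]
    if stage == "fmt":
        args.append("-check")        # fail instead of rewriting files
    if stage == "plan":
        args.append("-input=false")  # never prompt interactively
    proc = subprocess.run(args, cwd=task_dir, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def score_task(task_dir: Path, regenerate, runner=run_stage) -> bool:
    """Pass all three gates, feeding failures back to the model up to MAX_ROUNDS times."""
    for round_no in range(MAX_ROUNDS + 1):
        failure = None
        for stage in STAGES:
            ok, output = runner(task_dir, stage)
            if not ok:
                failure = (stage, output)
                break
        if failure is None:
            return True  # fmt, validate and plan all passed
        if round_no < MAX_ROUNDS:
            regenerate(task_dir, *failure)  # model rewrites the .tf files from the error
    return False
```

The `runner` parameter is there so the loop can be exercised without a real Terraform install; in anger you'd leave it at the default.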

The 17 models tested on Foundry:

  • AOAI: gpt-5, gpt-5-codex, gpt-5-1-codex, gpt-5-2-codex, gpt-5-3-codex, gpt-5-pro, gpt-4-1, gpt-4o, o4-mini
  • Foundry MaaS: DeepSeek-V3.1, V3.2-Special, V4-Flash, R1, Llama-3.3-70B, Llama-4-Maverick, Mistral-Large-3, Kimi-K2.6

Cost per model, ranked

Cheapest first. All figures from the actual Azure usage CSV, MCPP partner pricing.

| Model | Total cost | Per task | Score |
| --- | --- | --- | --- |
| llama-3.3-70b | $0.05 | $0.001 | 80% |
| llama-4-maverick | $0.09 | $0.002 | n/a |
| mistral-large-3 | $0.13 | $0.003 | n/a |
| deepseek-v4-flash | $0.00 | $0.000 | 90%* |
| deepseek-v3-1 | $0.35 | $0.007 | 75% |
| gpt-5-3-codex | $0.74 | $0.015 | 92% |
| deepseek-v3-2-special | $0.86 | $0.017 | 80% |
| gpt-5-2 | $1.01 | $0.020 | n/a |
| deepseek-r1 | $1.73 | $0.035 | 88% |
| o4-mini | $1.99 | $0.040 | n/a |
| gpt-5-1-codex | $2.72 | $0.054 | n/a |
| gpt-5-codex | $3.80 | $0.076 | 90% |
| gpt-5 | $4.16 | $0.083 | 92% |
| gpt-5-2-codex | $4.35 | $0.087 | n/a |
| gpt-5-pro | $51.01 | $1.020 | 86% |

\* deepseek-v4-flash was on a DS6 free-tier promo at run time; GA pricing is now $1.03/$4.12 per M, a 15× direct-API markup.

Total: $74.62. Strip gpt-5-pro: $23.61 / ~£19.
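If you want to reproduce the ranking from your own export, the aggregation is a few lines. A sketch that assumes one billed row per meter; the column names (`MeterName`, `CostInBillingCurrency`) are guesses at the export schema, so check them against your actual CSV:

```python
import csv
from collections import defaultdict
from io import StringIO

def per_model_cost(csv_text: str,
                   model_col: str = "MeterName",
                   cost_col: str = "CostInBillingCurrency") -> dict[str, float]:
    """Sum billed cost per model from an Azure usage export, cheapest first."""
    totals: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(StringIO(csv_text)):
        totals[row[model_col]] += float(row[cost_col])
    return dict(sorted(totals.items(), key=lambda kv: kv[1]))
```

With real data, the separate input and output meter rows for a model (e.g. gpt-5-pro's $49.96 out plus $1.05 in) collapse into one total per model.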

Why gpt-5-pro alone cost £41

gpt-5-pro charged $120/M output tokens at MCPP partner rates (public list price is $300/M). It generated 416K output tokens across 50 tasks. Azure logged it as "gpt 5 pro out glbl Tokens": 416.3K, $49.96. Add input ($1.05) and you get $51.01.
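The arithmetic reconstructs exactly from the figures above:

```python
OUT_TOKENS = 416_300    # "gpt 5 pro out glbl Tokens" from the usage CSV
RATE_PER_M = 120.00     # $/M output tokens at MCPP partner rates
INPUT_COST = 1.05       # input side, as logged

output_cost = OUT_TOKENS / 1_000_000 * RATE_PER_M  # 0.4163M tokens x $120/M
total = round(output_cost + INPUT_COST, 2)
# output_cost rounds to $49.96; total comes to $51.01
```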

Score: 86% (43/50). For comparison: deepseek-r1 scored 88% at $1.73 total, and gpt-5-codex scored 90% at $3.80. gpt-5-pro is the most expensive way to score 86% on this corpus by a factor of ~30.

That's worth saying out loud: the most expensive frontier model in this run was beaten on score by models 30× cheaper. The "pro" tier earns its keep on harder reasoning workloads, but for routine IaC generation it's pure cost without quality return.

Why the bottom of the table is so cheap

Two reasons llama-3.3-70b and llama-4-maverick come in under a penny per task:

  1. Open-weights models on Foundry MaaS are priced aggressively. Llama-3.3-70B at $0.71/M output tokens on Foundry. Same model self-hosted on a 2× H100 NC80adis spot VM works out at $0.286/M output tokens at peak utilisation, but only if you can hit batch=128 sustained. At realistic developer-tooling utilisation, the Foundry rate beats anything you can achieve on hardware you rent by the hour.

  2. Output token volume is the bill. 95% of the $74.62 was output tokens ($71.30 out vs $3.18 input). IaC generation is heavily output-skewed: long Terraform files per task. That makes the per-M-output-token rate the only number that matters for this workload.
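Point 1 is easy to sanity-check. A sketch using the two rates quoted above; the utilisation model (spot-VM cost amortised only over busy time) is my simplification, not Azure's pricing:

```python
FOUNDRY_RATE = 0.71       # $/M output tokens, Llama-3.3-70B on Foundry MaaS
SELF_HOSTED_PEAK = 0.286  # $/M on a 2x H100 spot VM at batch=128 sustained

def effective_self_hosted_rate(utilisation: float) -> float:
    """$/M output tokens when the rented VM is busy only `utilisation` of the time."""
    return SELF_HOSTED_PEAK / utilisation

# Utilisation above which self-hosting undercuts the Foundry rate: ~40%.
breakeven = SELF_HOSTED_PEAK / FOUNDRY_RATE
```

At 10% utilisation, typical for bursty developer tooling, the effective self-hosted rate is $2.86/M, roughly 4× the Foundry price.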

What this means for your team

If your team is bursty (developer tooling, occasional batch jobs, exploratory work), you almost certainly want Foundry, not self-hosted spot GPU. The 17-model session ran in an afternoon. The equivalent self-hosted runs took me two weeks to set up properly, including:

  • Quota requests across regions
  • Cloud-init scripts that don't reformat your NVMe (lost 280GB once)
  • nvidia-smi NVML mismatch debugging after vLLM restart
  • FP8 mirror compatibility (FriendliAI OOMs at 91GB on a 94GB H100; RedHatAI loads clean)
  • Spot eviction handling

Self-hosting earns its keep at sustained high utilisation, models not yet on Foundry, or strict sovereignty requirements. For everything else: Foundry is the answer, and the cost story is more favourable than most architects assume.

The full decision framework, including the six legitimate reasons to still self-host, is the subject of a follow-up post.

Want help sizing AI workload cost on your actual usage? Request a free FinOps assessment.
