Back to Blog
5 min read

A $3.50/hour open model matched Claude Opus on my Terraform benchmark

AzureLLMGLMClaudeSpot GPUOpen WeightsTerraformFinOps

Claude Opus-level intelligence. As many tokens as you want. About $3.50 an hour. Running in your own cloud, on your own data.

That's not a pitch, it's what I measured last week. And the part that surprised me wasn't the price. It was that nothing left my tenant to get it.

What I ran

GLM-5.2, open-weight, about a week old, not even on Azure AI Foundry yet, on a 4× A100 spot box (Standard_NC96ads_A100_v4) in Italy North at roughly $3.50/hour. That's the retail spot rate as of June 2026, and spot moves, so treat it as a ballpark, not a billed line.

That's 320 GB of GPU memory across four A100s. Worth noting: A100 is Ampere, two generations behind the current Blackwell parts. Then my Terraform benchmark: real tasks where the model has to write HCL that passes terraform plan clean against Azure. No multiple choice. Working infrastructure, or it fails.

There are two ways to run a 744-billion-parameter model on one box, and I tested both.

ConfigFits whereSpeedResult
UD-IQ2_M (2.40 bpw, 223 GB)Whole model in the 320 GB of GPU memory (~226 GB used, zero offload)~28 tok/s16/18
UD-Q4_K_XL (4.69 bpw, 436 GB)~a third spills to system RAM~7 tok/sre-ran the 2 misses, both passed

Squeezed down to Unsloth's UD-IQ2_M quant (223 GB, ~2.4 bits per weight), the whole thing fits in GPU memory, served with -ngl 99, no offload, around 226 GB across the four cards. It's fast, about 28 tokens a second, and it scored 16 out of 18, fumbling a MySQL server and a Synapse workspace.

At UD-Q4_K_XL (436 GB, ~4.7 bits per weight), the "full intelligence" quant, the model is too big for 320 GB, so about a third of it spills into system RAM. That's slower, around 7 tokens a second. I re-ran the exact two tasks it had missed at the lower quant, and it nailed both.

Where does that 4× slowdown come from? The moment layers spill out of GPU memory, the compute for those layers falls to the CPU. Every core pegs at 100% while the GPUs sit idle, waiting on system RAM. That CPU saturation is the ~7 tokens a second. It's the clearest illustration I have of why "just add more system RAM" isn't free performance: the bottleneck moves off the cards you paid for and onto the slowest path in the box.

One honest note on that, because it matters: the full-quant result is 16 measured at IQ2 plus the two misses confirmed fixed at Q4, not a fresh 18-task sweep at full precision. So read it as "the two it dropped were a quantisation artefact, not a capability gap", not as a clean 18/18 in a single run.

The comparison

On the tasks I tested, the full-quant open model tied Claude Opus 4.8, the strongest model on my benchmark.

The caveats matter, so here they are plainly. It's a small sample. It's unscientific. And this was not a clean same-run head-to-head: Opus was scored on an earlier pass of the harness, and GLM's full-quant figure is the composite I just described. Opus is still the model to beat across the wider corpus, and a focused 18-task run is not a leaderboard. But "a model you can self-host kept pace with the best closed model money can rent" was not a sentence I could write a year ago.

A clean, identical-harness head-to-head is coming in a follow-up. I'd rather publish that as its own thing than wave my hands here.

Now do the economics

Frontier-adjacent intelligence. Unlimited tokens. No per-token meter running. About $3.50 an hour, entirely inside your own tenant. And this was Ampere, two-generation-old silicon, so the newer parts only widen the gap.

The closed labs charge per token because frontier intelligence has been scarce. This is what it starts to look like when it stops being scarce, and when you can run it somewhere the data never leaves.

I'm not calling this production-ready as it stands. The $3.50/hour is spot: eval and dev economics, on an evictable VM, not something a regulated client should depend on without an eviction-resilient design behind it. Open models still trail on the hardest reasoning, the tail of edge cases is where closed frontier models still pull ahead, and running any of this well in production is real work: GPU scheduling, spot eviction handling, quantisation trade-offs, serving infrastructure. None of that is free.

But the moat just got a lot shallower. For a narrow, repetitive, high-volume workload like Terraform generation, a self-hosted open model on a spot box is now a genuine line item to compare against your API bill, not a science project.

What I'm testing next

Two questions this run deliberately left open. They're the ones that actually decide whether this goes anywhere near production:

  • How many people one box really serves at once. The number above is single-stream. The figure that matters commercially is genuine concurrency: how throughput holds up with eight users hitting the same box, and what that works out to per user-hour. That's the next benchmark.
  • How to run it hardened, in your own tenant. Private networking, no public egress, weights mirrored to internal storage so there's no standing external dependency, data that physically never leaves your environment. For regulated UK and EU workloads, that, not the price, is the real story. It's the post I'm most interested in writing.

If the data can't leave, and the per-token meter never starts, the calculus for a lot of routine AI work changes. That's where I'm pointing next.


Running AI or GPU workloads on Azure and not sure where the spend is going? Our free Cost Review shows your compute and GPU waste, sized in £/month.

Need help with your Azure environment?

Get in touch for a free consultation.

Get in Touch