The bigger Llama isn't worth the bigger VM: Qwen 27B beats Llama 70B for Terraform at a third of the cost

There's a default assumption in enterprise AI architecture: bigger models are better for hard tasks, so use the biggest model that fits your hardware budget. For Terraform IaC generation specifically, that assumption is wrong.

Three data points from our benchmark corpus:

Model	VM	$/hr spot	Plan-pass score
Qwen3.6-27B FP8	NC40ads_H100_v5 (1× H100 NVL)	$1.61	90% (45/50)
Llama-3.3-70B BF16 TP=2	NC80adis_H100_v5 (2× H100 NVL)	$3.15	80% (40/50)
Qwen2.5-72B BF16 TP=2	NC80adis_H100_v5 (2× H100 NVL)	$3.15	78% (39/50)

The 27B beats both 70B-class models, on a VM that costs about half as much per hour, with zero tensor-parallelism complexity.

Why this is the case is more interesting than the headline.

What's actually happening

Three things converge to make the smaller model win:

1. The corpus is narrow

Terraform IaC generation is a constrained task. There are ~150 Azure resource types you'd realistically generate, each with a documented schema, in a syntax (HCL) that's regular and well-represented in training data. The hard part isn't generation; it's getting the right combination of arguments and dependency ordering for resources like managed identities and disk encryption sets.

Bigger models have more general knowledge. Smaller models that have been heavily fine-tuned on code (Qwen2.5-Coder, Qwen3.6 family, DeepSeek-Coder-V2) often outperform larger generalist models on code-specific tasks because their parameter capacity is better-allocated for the workload.

2. The Ethernet tax on TP=2

The 70B BF16 model needs ~140GB of VRAM, which doesn't fit on a single 94GB H100 NVL. Tensor parallelism (TP=2) splits the model across two GPUs and shares activations between them on every layer's forward pass. On Azure NC80adis VMs, the inter-GPU interconnect is standard Ethernet, not NVLink, not InfiniBand.

The result: TP=2 on Ethernet loses 50–70% of theoretical throughput vs an equivalent NVLink-connected pair. This shows up in tokens/sec, but it also shows up in batch-size limits and tail latency. The model serves slower at higher batch sizes, and you can't push the larger model to the same utilisation as a smaller one on a single GPU.

This affects cost per token too. A 70B at TP=2 produces 3,052 tok/s aggregate at batch=128 on NC80adis ($0.286/M output). A 32B FP8 on a single NC40ads produces 4,715 tok/s at batch=128 ($0.095/M). The 32B is 3× cheaper per token while using half the hardware.
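The per-token figures above follow from the hourly price and the sustained throughput. A minimal sketch of that arithmetic, assuming the spot prices from the table and that the VM runs at the quoted batch=128 throughput for the whole hour (the function name is ours, for illustration only):

```python
def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


# Figures quoted in the text: spot $/hr and aggregate tok/s at batch=128.
cost_70b_tp2 = cost_per_million_tokens(3.15, 3052)  # 70B BF16, TP=2, NC80adis
cost_32b_fp8 = cost_per_million_tokens(1.61, 4715)  # 32B FP8, single NC40ads
ratio = cost_70b_tp2 / cost_32b_fp8

print(f"70B TP=2: ${cost_70b_tp2:.3f}/M output")
print(f"32B FP8:  ${cost_32b_fp8:.3f}/M output")
print(f"70B costs {ratio:.1f}x as much per token")
```

Any idle time raises both numbers, so treat these as a floor rather than what a lightly loaded deployment would pay.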

3. FP8 vs BF16 for code generation

This is more nuanced. In our corpus, every FP8 model outscored its BF16 equivalent at the same parameter count. Mistral-Large-2407 FP8 (123B) beat all BF16 models tested. The Qwen3.6 family in FP8 beat the BF16 checkpoints we tested.

The mechanism isn't fully understood, but the pattern is consistent: for code generation specifically, FP8 quantisation appears to come at no quality cost. The slight loss of numerical precision doesn't degrade syntactically constrained outputs the way it might degrade creative text generation. And FP8 doubles your effective KV cache, which lets you run higher batch sizes, which improves throughput, which improves cost.

What this means for your architecture choices

Three rules that follow from this:

Right-size to the workload, not the vendor's "biggest available"

The default "use the biggest model that fits your VM" framing puts you on the wrong side of the cost curve. The right framing is: what's the smallest model that hits your quality bar? For routine Azure IaC, that's a 27B-class FP8 model on a single H100 NVL.
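One way to make "smallest model that hits your quality bar" concrete is to filter candidates by plan-pass rate and then take the cheapest. The sketch below uses the three rows from the benchmark table. The `Candidate` type, the function name, and the use of hourly VM price as the proxy for "smallest" are our assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    dollars_per_hour: float  # spot price of the VM the model runs on
    passed: int              # plans that passed
    total: int               # plans attempted

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


# Rows copied from the benchmark table above.
CANDIDATES = [
    Candidate("Qwen3.6-27B FP8", 1.61, 45, 50),
    Candidate("Llama-3.3-70B BF16 TP=2", 3.15, 40, 50),
    Candidate("Qwen2.5-72B BF16 TP=2", 3.15, 39, 50),
]


def cheapest_meeting_bar(
    candidates: Sequence[Candidate], quality_bar: float
) -> Optional[Candidate]:
    """Return the cheapest candidate whose pass rate meets the bar, or None."""
    eligible = [c for c in candidates if c.pass_rate >= quality_bar]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.dollars_per_hour)


choice = cheapest_meeting_bar(CANDIDATES, quality_bar=0.85)
print(choice.name if choice else "no self-hosted model meets the bar")
```

If no candidate clears the bar, the function returns `None`, which is the signal to look at an escalation path rather than a bigger VM by default.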

Avoid TP=2 unless you have to

TP=2 only earns its keep when the model genuinely won't fit on one GPU. Llama-3.3-70B in BF16 doesn't fit on a 94GB H100 NVL. Llama-3.3-70B in FP8 does fit, and it's faster, cheaper, and scores within 2pp of the BF16 version.

If your model fits in FP8 on a single GPU, do that. Skip the TP=2 complexity entirely. Skip the Ethernet tax. Skip the proximity placement group requirements.
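The fit check is simple arithmetic: parameters times bytes per parameter, compared against GPU memory. A rough sketch, assuming weights dominate the footprint; the `reserve_gb` headroom for KV cache and activations is a placeholder we chose for illustration, not a measured value.

```python
# Bytes per parameter for the two precisions discussed here.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits_on_one_gpu(
    params_billion: float,
    dtype: str,
    gpu_gb: float = 94.0,      # H100 NVL memory, as quoted in the text
    reserve_gb: float = 10.0,  # hypothetical headroom for KV cache and activations
) -> bool:
    """True if the weights plus the reserved headroom fit in one GPU's memory."""
    return weight_gb(params_billion, dtype) + reserve_gb <= gpu_gb


for params, dtype in [(70, "bf16"), (70, "fp8"), (27, "fp8")]:
    verdict = "fits" if fits_on_one_gpu(params, dtype) else "needs TP=2"
    print(f"{params}B {dtype}: ~{weight_gb(params, dtype):.0f} GB weights, {verdict}")
```

A real deployment needs more headroom at high batch sizes, so a model that only just fits by this estimate may still be a poor candidate for a single GPU.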

Use code-specific fine-tunes for code tasks

Qwen2.5-Coder-32B and Qwen3.6-27B aren't general-purpose chatbots. They're fine-tuned for code. They beat larger generalist models on code corpora. If your workload is dominated by code generation, run a code-specific model. If your workload is dominated by reasoning or general knowledge, a larger generalist model might be the right call.

The practical implication for cost

Under our benchmark corpus, on a single H100 NVL:

Workhorse model (Qwen3.6-27B FP8): $1.61/hr spot, 90% score, ~$0.30/M output at typical 25-dev utilisation.
Frontier escalation when needed (gpt-5-codex via Foundry): $14/M output, 90% score on the apply test.

That's the architecture. Cheap workhorse for the 95% of generation that is routine, frontier API for the 5% of hard cases. The 70B model doesn't fit anywhere in this stack for IaC: it's strictly worse than the 27B workhorse and strictly more expensive than the frontier fallback.

The only place the 70B does fit: when the task genuinely needs more reasoning than a 27B has, and the team can't tolerate frontier API calls (data sovereignty, air-gap, etc.). In that narrow band, the 70B at TP=2 is the best self-hosted option. For everyone else, it's the wrong VM size.
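To see what the workhorse-plus-escalation split costs overall, you can blend the two per-token prices. This is a sketch under a loud assumption: that the 95/5 split applies to output tokens, which the figures above do not state. The function name is ours.

```python
def blended_cost_per_million(
    workhorse_cost: float, frontier_cost: float, escalation_share: float
) -> float:
    """Weighted $/M output tokens when a share of tokens is escalated."""
    if not 0.0 <= escalation_share <= 1.0:
        raise ValueError("escalation_share must be between 0 and 1")
    return (1.0 - escalation_share) * workhorse_cost + escalation_share * frontier_cost


# Figures quoted in the text: ~$0.30/M for the workhorse, $14/M for the frontier API.
# ASSUMPTION: 5% of output tokens (not 5% of requests) go to the frontier model.
blended = blended_cost_per_million(0.30, 14.0, escalation_share=0.05)
frontier_part = 0.05 * 14.0

print(f"blended: ${blended:.3f}/M output")
print(f"of which frontier escalation: ${frontier_part:.2f}/M")
```

Under those assumptions the escalation share accounts for most of the blended figure, so the escalation rate is the number worth measuring in your own workload.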

The companion data on apply tests shows the same pattern: Qwen3.6-27B scores 93%, beating most 70B models in the field.
