Four Azure AI Foundry pricing gotchas, and how to avoid each one

Azure AI Foundry's pricing is mostly transparent: input tokens × $/M + output tokens × $/M, billed per request. For most workloads that's the whole story.

But during a recent 17-model benchmark session (50 Terraform tasks per model, $74.62 total bill) we hit four specific gotchas that aren't obvious from the pricing page. Each one has cost real money for someone we know. Here they are, ranked by how expensive each one can get.

1. The `--sku-capacity 1` default: RateLimitReached on the first request

Cost impact: failed deployments, lost development time, retry storms.

Foundry MaaS deployments default to sku_capacity = 1, the smallest unit the SKU offers. For most Global Standard deployments that's around 1K tokens-per-minute. The first parallel benchmark run we tried hit 429 RateLimitReached immediately. We assumed the model was rate-limited at the publisher. It wasn't: our own deployment quota was the limit.

The fix: explicitly set sku_capacity = 100 (or higher) on every MaaS deployment. On Global Standard SKUs, capacity is a ceiling, not a reservation. You pay per token consumed, not per capacity unit allocated. Setting capacity = 100 costs nothing at rest and removes the throttling cliff.

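az cognitiveservices account deployment create \
  --name <foundry-resource> \
  --resource-group <resource-group> \
  --deployment-name deepseek-v3-2-special \
  --model-name deepseek-v3-2-special \
  --model-version <model-version> \
  --model-format <model-format> \
  --sku-capacity 100 \
  --sku-name GlobalStandard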

For dedicated/PTU SKUs, capacity is a reservation and you pay for it whether you use it or not. Different rules; different default.

Bumping deployment-level capacity does NOT fix every throttle. Some partner-MaaS models, like DeepSeek V4 Flash, also enforce a provider-side RPM ceiling on shared capacity. That ceiling is invisible to your deployment quota and to az cognitiveservices usage list. This is covered in Three things to check before you commit to Foundry.
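A capacity check like this can be automated. The following is a minimal sketch, assuming the deployment list has been exported as JSON (for example from az cognitiveservices account deployment list) and that each entry carries a name plus a sku object with name and capacity fields. The field layout and the sample data are assumptions for illustration, not verified output.

```python
import json

MIN_CAPACITY = 100  # threshold taken from the fix described above


def find_low_capacity(deployments, min_capacity=MIN_CAPACITY):
    """Return names of Global Standard deployments below min_capacity."""
    flagged = []
    for deployment in deployments:
        sku = deployment.get("sku") or {}
        # Only pay-per-token SKUs are checked. Reserved (PTU) capacity is
        # billed whether used or not, so raising it is not free.
        if sku.get("name") != "GlobalStandard":
            continue
        if (sku.get("capacity") or 0) < min_capacity:
            flagged.append(deployment.get("name"))
    return flagged


# Hypothetical sample in the assumed JSON shape.
sample = json.loads("""
[
  {"name": "deepseek-v3-2-special", "sku": {"name": "GlobalStandard", "capacity": 1}},
  {"name": "gpt-5-codex", "sku": {"name": "GlobalStandard", "capacity": 100}},
  {"name": "reserved-example", "sku": {"name": "ProvisionedManaged", "capacity": 50}}
]
""")

print(find_low_capacity(sample))
```

Skipping non-Global-Standard SKUs is deliberate: on reserved SKUs a higher capacity number is a cost, not a free ceiling.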

2. The `az cognitiveservices model list` flat dump

Cost impact: deploying a deprecated or pre-release model that breaks at runtime.

az cognitiveservices model list returns every model ever registered in the catalogue: pre-release, deprecated, deprecated-but-still-billing, and active. They all arrive flat in one JSON blob with no status filter.

Specific cases we hit:

qwen3-32b appeared in the CLI output with full SKU metadata: GlobalStandard, capacity allowed, format ChatCompletion. When we tried to deploy it, the request failed with "SKU 'GlobalStandard' is not supported for this model." The model was pre-release and not actually deployable.
mistral-large-2407 and mistral-large-2411 both appear in the list. Both are deprecated as of early 2026. Only Mistral-Large-3 is current.
GPT-3.5-Turbo variants still appear in the list. Many are deprecated and route to GPT-4o-mini under the hood.

The Foundry portal shows a padlock icon on staged-but-not-deployable models. The CLI doesn't expose this state. Trust the portal for "what's actually deployable today."

The fix: when scripting deployments, cross-check against the Foundry portal manually for any model you haven't deployed before. The CLI is fine for known-good model IDs you've deployed previously.
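One way to enforce that split in a deployment script is a hand-maintained allowlist of model IDs you have already deployed successfully. This is a minimal sketch: the KNOWN_GOOD set and the requested list are hypothetical examples, and the allowlist itself still has to be kept current by checking the portal.

```python
# Hypothetical allowlist: model IDs previously deployed and verified by hand.
KNOWN_GOOD = {"Mistral-Large-3", "gpt-5-codex"}


def split_by_allowlist(requested, known_good):
    """Separate requested model IDs into approved and needs-manual-review."""
    approved = [model for model in requested if model in known_good]
    needs_review = [model for model in requested if model not in known_good]
    return approved, needs_review


requested = ["Mistral-Large-3", "qwen3-32b", "mistral-large-2411"]
approved, needs_review = split_by_allowlist(requested, KNOWN_GOOD)

print("deploy:", approved)
print("check in portal first:", needs_review)
```

Matching is exact and case-sensitive here, so an ID that differs only in capitalisation lands in the review list rather than being deployed silently.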

3. The output-token-dominated bill (especially for IaC and other code generation)

Cost impact: order-of-magnitude underestimates of total spend.

Most Foundry pricing tables list input and output rates side by side, with input around 4× cheaper than output. Architects estimate spend as "average of in + out × volume." That's wrong for any output-heavy workload.

In our 17-model benchmark, 95.4% of the $74.62 bill was output tokens: $71.30 went to output, $3.18 to input, and $0.004 to cached input.

That's because IaC generation looks like:

Input prompt: ~1.4K tokens (corpus task description + system prompt)
Output: ~10-15K tokens of Terraform HCL per task

Code generation is generally output-heavy. So are content generation, document drafting, and creative writing. Only RAG-against-large-context, classification, and question-answering workloads have input-heavy bills.

The fix: estimate spend as (output tokens × $/M output) and ignore the input rate as a rounding error, unless you're in a confirmed input-heavy workload. For a code-generation team, the output rate is the only number that matters.
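To see how lopsided the split is, here is a rough per-task estimate. It is a sketch under stated assumptions: 1.4K input tokens and 12K output tokens (inside the 10-15K range above), a $10/M output rate (the gpt-5 standard rate quoted later in this post), and a hypothetical $2.50/M input rate derived only from the "around 4× cheaper" rule of thumb, not from a price list.

```python
def request_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Return (input_cost, output_cost) in dollars for one request."""
    input_cost = input_tokens * input_rate_per_m / 1_000_000
    output_cost = output_tokens * output_rate_per_m / 1_000_000
    return input_cost, output_cost


# Assumed figures; the input rate in particular is illustrative only.
in_cost, out_cost = request_cost(
    input_tokens=1_400,
    output_tokens=12_000,
    input_rate_per_m=2.50,
    output_rate_per_m=10.00,
)
output_share = out_cost / (in_cost + out_cost)

print(f"input ${in_cost:.4f}, output ${out_cost:.4f}, output share {output_share:.1%}")
```

With these assumptions the output side is well over 95% of the per-task cost, which is in line with the split we saw on the real bill.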

4. gpt-5-pro is a different kind of expensive

Cost impact: $51 of a $74 bill on one model.

This one's in a category of its own. gpt-5-pro charges $120/M output tokens. It's the only model in the GPT-5 family priced this way: gpt-5 standard is $10/M output, and gpt-5-codex variants are $14/M output.

In our 17-model run, gpt-5-pro generated 416K output tokens across 50 tasks and was billed $49.96 for output alone. Total spend on the model: $51.01. That's 68% of the entire 17-model session bill on one model.
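The arithmetic is easy to sanity-check. This sketch uses only the figures quoted in this post; the 416K token count is rounded, so the computed output cost lands a few cents away from the billed amount.

```python
OUTPUT_TOKENS = 416_000        # rounded figure quoted above
OUTPUT_RATE_PER_M = 120.00     # gpt-5-pro output rate, $ per million tokens
MODEL_TOTAL = 51.01            # total spend on gpt-5-pro
SESSION_TOTAL = 74.62          # total bill for the 17-model session

expected_output_cost = OUTPUT_TOKENS * OUTPUT_RATE_PER_M / 1_000_000
share_of_session = MODEL_TOTAL / SESSION_TOTAL

print(f"expected output cost: ${expected_output_cost:.2f}")
print(f"share of session bill: {share_of_session:.0%}")
```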

The score? 86% (43/50). Beaten by gpt-5-codex at 90% ($3.80 total) and DeepSeek-V4-Flash at 90% ($0.00 due to the DS6 free-tier promo at run time; V4-Flash subsequently went GA at $1.03/$4.12 per M, a 15× output markup over the direct DeepSeek API). For our workload (Terraform IaC), gpt-5-pro is the most expensive way to get an 86%.

The fix: treat gpt-5-pro like a specialty tool. Use it on hard reasoning workloads where the score gap actually pays back the cost gap. Don't use it for routine code generation, content drafting, or anything where a 90%-scoring $3.80 model is on the menu.
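A quick way to make that comparison concrete is cost per passing task. This sketch uses the totals and scores quoted above; the 45 passes for gpt-5-codex is derived from its 90% score on 50 tasks rather than stated directly.

```python
def cost_per_pass(total_cost, passed_tasks):
    """Dollars spent per task that passed."""
    if passed_tasks <= 0:
        raise ValueError("passed_tasks must be positive")
    return total_cost / passed_tasks


pro = cost_per_pass(51.01, 43)    # gpt-5-pro: 43/50 passed
codex = cost_per_pass(3.80, 45)   # gpt-5-codex: 90% of 50 tasks

print(f"gpt-5-pro:   ${pro:.2f} per passing task")
print(f"gpt-5-codex: ${codex:.3f} per passing task")
print(f"ratio: {pro / codex:.0f}x")
```

On this workload gpt-5-pro costs more than ten times as much per passing task while passing fewer tasks.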

What this looks like in a FinOps practice

If you're advising on Azure AI Foundry spend at any scale, build cost guardrails for these four specifically:

Capacity-validation lint: every Foundry MaaS deployment in your IaC must set sku_capacity ≥ 100.
Catalogue freshness: monthly check of deployable-vs-deprecated models in your most-used regions.
Cost-allocation by output tokens: chargeback should weight by output, not by request count or by input+output average.
gpt-5-pro spend cap: set a separate Azure Cost Management budget alert specifically for gpt-5-pro deployments. If the line item is >£100/week and it's not a confirmed reasoning workload, someone clicked the wrong dropdown.
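The output-weighted chargeback can be sketched in a few lines. The team names and token counts below are hypothetical, and weighting by raw output tokens is only a fair approximation when the teams use similarly priced models; with mixed models you would weight by output cost instead.

```python
def allocate_by_output(total_bill, output_tokens_by_team):
    """Split a bill across teams in proportion to their output tokens."""
    total_tokens = sum(output_tokens_by_team.values())
    if total_tokens == 0:
        raise ValueError("no output tokens recorded; cannot allocate")
    return {
        team: total_bill * tokens / total_tokens
        for team, tokens in output_tokens_by_team.items()
    }


# Hypothetical usage: two teams sharing the $74.62 session bill.
usage = {"platform": 300_000, "data": 100_000}
shares = allocate_by_output(74.62, usage)

for team, amount in shares.items():
    print(f"{team}: ${amount:.2f}")
```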

These aren't theoretical. Every one of them cost us money during a single afternoon's benchmark session. The session bill was small ($74.62), but each gotcha scaled with team size would be £100s to £1,000s of unexpected spend.

The full cost data and per-model breakdown are in the 17-models-for-£19 post.

Want these guardrails built into your Azure environment? Request a free FinOps assessment.
