LLMs hallucinate model names: verify before you download 100GB

I'd just spun up an NC80adis spot VM in Indonesia Central. £3.15/hr, on the clock, model-pull-from-blob ready to go. I asked Claude (or GPT-5, can't remember which, both did this) what the best Qwen Coder model was for a Terraform code-generation workload.

The answer came back fast and confident: "Qwen2.5-Coder-72B-Instruct." Cited as the largest in the Qwen2.5-Coder family, recommended for sustained high-quality code generation, with quantised variants available on HuggingFace.

That model does not exist.

The Qwen2.5-Coder family caps at 32B (Qwen2.5-Coder-32B-Instruct). There is no 72B. There is a Qwen2.5-72B-Instruct (general, not code) and a Qwen2.5-Coder-32B-Instruct (code, smaller). The LLM had hallucinated a model name by combining two real model IDs that share the same vendor.

I'd already typed huggingface-cli download Qwen/Qwen2.5-Coder-72B-Instruct into the VM. It ran for about 30 seconds before HuggingFace returned 404. I caught it. If I'd written that line into a cloud-init script and let it run unattended, it would have failed silently and burned £6 of spot time before I noticed.

This is a class of failure, not an isolated case

Every LLM I've tested does this with model names. Specific patterns:

  • Family-merging hallucinations. Qwen has a Coder family (32B max) and a general family (72B max). LLMs combine them: Qwen2.5-Coder-72B, Qwen3-Coder-235B, etc. The combined name follows the vendor's naming convention so well that humans can't spot the hallucination by inspection.
  • Variant-suffix hallucinations. Real model: Llama-3.3-70B-Instruct. Hallucinated: Llama-3.3-70B-Instruct-Turbo, Llama-3.3-70B-Code, Llama-3.3-70B-FP8 (the FP8 mirrors exist but are vendor-specific paths like RedHatAI/Meta-Llama-3.3-70B-Instruct-FP8-dynamic, not just appended suffixes).
  • Quantisation-suffix hallucinations. "There's a Q4_K_M GGUF on HuggingFace." Sometimes there is; sometimes there isn't; sometimes there's only Q5_K_M and Q8_0. The LLM doesn't know; it pattern-matches on what usually exists.
  • Out-of-date catalogues. Model gets renamed or moved to a different org. LLM cites the old path. Download fails.

The reason this happens: model registries (HuggingFace, Ollama, Azure Foundry) move faster than LLM training data refreshes. Models are added, deprecated, renamed, and re-organised between training cutoffs. The LLM's "knowledge" of the registry is a snapshot frozen at a point in the past, often months or a year out of date.

Why this is expensive on a spot VM

A 100GB model download to NVMe takes roughly:

  • HuggingFace direct: 8–20 minutes (depends on rate limits and your peering)
  • Azure Premium blob → NVMe (after one-time HF→blob stage): 1–3 minutes

If the hallucinated name only surfaces after you've written the download into cloud-init, the sequence looks like this:

  • Spot VM boots, runs cloud-init.
  • Cloud-init runs huggingface-cli download <fake-model>.
  • Download fails with 404.
  • Cloud-init either (a) retries forever, (b) exits with error and leaves the VM in an unconfigured state, or (c) silently continues to the next step.
  • vLLM service starts, fails because the model directory doesn't exist.
  • You SSH in 15 minutes later to debug.
  • Total wasted: £0.50–£1.50 of spot time per incident.

In a research session firing 5–10 model deployments in an afternoon, that adds up to real money. More importantly, it adds up to lost time on a clock.
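To put rough numbers on that, here is a minimal sketch of the arithmetic, using the £3.15/hr rate from the top of the post and the 15-minute delay from the list above. The helper name spot_cost is made up for illustration; the two-hour case is an assumption about how long an unattended failure might sit before anyone looks.

```shell
#!/bin/bash
# spot_cost is a hypothetical helper, not part of any CLI.
# Usage: spot_cost <hourly-rate> <minutes>  ->  cost, to two decimal places
spot_cost() {
  awk -v rate="$1" -v minutes="$2" 'BEGIN { printf "%.2f\n", rate * minutes / 60 }'
}

# One failed deploy noticed after 15 minutes at 3.15/hr:
spot_cost 3.15 15    # prints 0.79
# The same failure left unattended for two hours (assumed duration):
spot_cost 3.15 120   # prints 6.30
```

Multiply the first figure by however many deploys you fire in an afternoon to get the session total.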

The verification pattern that costs nothing

Before any model download, manual or scripted, verify against the actual registry API:

# HuggingFace
curl -sI "https://huggingface.co/api/models/Qwen/Qwen2.5-Coder-32B-Instruct" \
  | head -1
# HTTP/2 200  →  exists
# HTTP/2 404  →  hallucinated

# Ollama
curl -sf "https://registry.ollama.ai/v2/library/qwen3.6/tags/list" \
  | jq -r '.tags[]'
# Lists actual deployable tags

# Azure Foundry: trust the portal padlock icon, not the CLI.
# az cognitiveservices model list returns deprecated and pre-release
# items mixed with active ones
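The quantisation case can be checked the same way once the repo itself resolves. A minimal sketch, assuming jq is installed and that the model-info JSON from the HuggingFace endpoint above carries a siblings array of file names (worth confirming against a repo you know). list_gguf and has_quant are hypothetical helper names, and REPO is a placeholder.

```shell
#!/bin/bash
# list_gguf is a hypothetical helper. It reads a model-info JSON document
# on stdin and prints every file whose name ends in .gguf.
# Assumption: the JSON has a "siblings" array of {"rfilename": "..."} objects.
list_gguf() {
  jq -r '.siblings[]?.rfilename | select(endswith(".gguf"))'
}

# has_quant <tag> reads file names on stdin and succeeds only if
# at least one of them contains the tag (case-insensitive, literal match).
has_quant() {
  grep -qiF -- "$1"
}

# Usage (REPO is a placeholder for a repo ID you have already verified):
#   curl -sf "https://huggingface.co/api/models/${REPO}" | list_gguf | has_quant Q4_K_M \
#     || echo "no Q4_K_M file in ${REPO}" >&2
```

Listing the files first also shows which quantisations the repo does ship, so you can pick a real one instead of guessing again.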

For automated scripts, build verification into the deploy step:

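#!/bin/bash
# Verify that a HuggingFace model ID resolves before starting the download.
set -euo pipefail

MODEL_ID="${1:?usage: $0 <hf-model-id>}"

# HEAD request; -f makes curl exit non-zero on any HTTP error, 404 included.
if ! curl -sIf "https://huggingface.co/api/models/${MODEL_ID}" >/dev/null; then
  echo "ERROR: ${MODEL_ID} not found on HuggingFace (or the API is unreachable)" >&2
  exit 1
fi

# ${MODEL_ID##*/} strips the org prefix, so Qwen/Foo lands in /mnt/nvme/models/Foo.
huggingface-cli download "${MODEL_ID}" --local-dir "/mnt/nvme/models/${MODEL_ID##*/}"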

A four-line check at the top of the script. It stops any model ID that doesn't resolve before the 100GB download starts.
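One refinement worth considering: curl -f treats every failure alike, so a network blip or a rate limit looks the same as a nonexistent model. A minimal sketch that separates the cases by status code follows. classify_status and hf_status are hypothetical helper names, and the handling of 401/403 is an assumption: a registry may answer that way for private, gated, or hidden repos, so the sketch reports them separately rather than guessing.

```shell
#!/bin/bash
# classify_status is a hypothetical helper: it maps an HTTP status code
# (as printed by curl's %{http_code}) to a verdict.
classify_status() {
  case "$1" in
    200)     echo "exists" ;;
    404)     echo "missing" ;;
    401|403) echo "restricted" ;;  # assumption: needs a token to tell private from absent
    *)       echo "unknown" ;;     # 000 (no response), 429, 5xx: retry, don't conclude
  esac
}

# hf_status <model-id> prints the status code of a HEAD request.
# curl prints 000 when it gets no response at all; "|| true" keeps set -e callers alive.
hf_status() {
  curl -sI -o /dev/null -w '%{http_code}' --max-time 10 \
    "https://huggingface.co/api/models/$1" || true
}

# Usage:
#   classify_status "$(hf_status Qwen/Qwen2.5-Coder-32B-Instruct)"
```

Only a "missing" verdict points at a hallucinated name; "unknown" is a reason to retry before rewriting the deploy script.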

The wider lesson

Don't take LLM output as authoritative for anything that's grounded in an external registry. Model names, package names on PyPI, image tags on Docker Hub, Azure resource provider versions, Helm chart versions: all of these change faster than LLM training data is refreshed. The LLM's knowledge of them is approximate.

For each of these, a one-line API check is a cheap insurance policy. The pattern generalises:

LLM suggests <thing-from-external-registry>
  ↓
verify <thing> exists in <registry>
  ↓
proceed if yes, fail loud if no
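As a rough sketch of that flow for two registries, the function below maps a registry kind and a name to a URL whose HTTP status answers "does this exist?". registry_url and verify_exists are hypothetical names. The sketch assumes the PyPI JSON endpoint at pypi.org/pypi/<name>/json alongside the HuggingFace endpoint used earlier; other registries would need their own case added.

```shell
#!/bin/bash
# registry_url and verify_exists are hypothetical helpers for illustration.
# registry_url <kind> <name> prints the URL to probe for that registry.
registry_url() {
  case "$1" in
    hf)   echo "https://huggingface.co/api/models/$2" ;;
    pypi) echo "https://pypi.org/pypi/$2/json" ;;
    *)    echo "unsupported registry: $1" >&2; return 2 ;;
  esac
}

# verify_exists <kind> <name> fails loudly unless the registry answers with success.
verify_exists() {
  local url
  url="$(registry_url "$1" "$2")" || return 2
  if ! curl -sfL -o /dev/null --max-time 10 "$url"; then
    echo "ERROR: $1 '$2' not found (or registry unreachable)" >&2
    return 1
  fi
}

# Usage:
#   verify_exists pypi requests && pip install requests
```

Keeping the URL mapping separate from the network call means the mapping can be tested offline and extended one registry at a time.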

If your CI/CD pipeline takes LLM-generated code or LLM-suggested dependency lists and runs them unverified, you're one hallucinated name away from a confusing, expensive failure, especially in environments where the failure burns money on a clock, like a spot VM in a benchmark session or a Kubernetes deploy at 2am.

The one-line verification is always cheaper than the alternative.
