
VLMs Are Confident Liars

2026-03-02

pipelines · llms · unstructured-data

The Setup

One step in the pipeline was generating text descriptions from images. The vision-language model - Qwen2-VL-7B - would look at each image and describe what it saw: object type, condition, materials, notable features. Those descriptions would later feed a classifier that assigned semantic labels used for downstream prediction.

I was running this across 10 A100 GPUs, processing 12.8 million images. The rest of the team was waiting on this data before they could start model training. Getting the descriptions right mattered.
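The work split cleanly across workers. A minimal sketch of the sharding, with illustrative names (not the actual pipeline code):

```python
def shard(items, num_workers):
    """Split a sequence of work items into num_workers contiguous shards."""
    per_worker = -(-len(items) // num_workers)  # ceiling division
    return [items[i * per_worker:(i + 1) * per_worker]
            for i in range(num_workers)]

# 12.8M image IDs split across 10 GPU workers, 1.28M each.
image_ids = range(12_800_000)
shards = shard(image_ids, num_workers=10)
# Each shard then goes to one GPU process, e.g.:
#   CUDA_VISIBLE_DEVICES=3 python describe.py --shard 3
```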

The first prompt I wrote was, in hindsight, asking for trouble.

The Prompt That Caused Hallucinations

The original prompt was structured and demanding. It asked the model to classify each image into specific predefined categories: object type from a fixed list, count of visible elements, presence or absence of specific features, style classification from a fixed taxonomy, overall condition assessment.
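The original prompt isn't reproduced verbatim here, but it looked roughly like this. The category names are illustrative placeholders, not the real taxonomy:

```python
# A sketch of the original, over-constrained prompt.
ORIGINAL_PROMPT = """\
Classify this image. Answer every field:
1. Object type: choose exactly one of [chair, table, lamp, shelf].
2. Count of visible elements (give an exact number).
3. Features: for each of [armrests, drawers, wheels], answer yes or no.
4. Style: choose exactly one of [modern, rustic, industrial, traditional].
5. Overall condition: choose exactly one of [new, good, worn, damaged].
"""
# Note what's missing: every field demands a committed answer.
# There is no "unclear" or "not visible" option anywhere in the schema.
```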

This seemed reasonable. These were exactly the features that mattered downstream. The model is a vision-language model - looking at images and describing them is what it does.

The problem was the images.

Thumbnail Quality at Scale

The 12.8 million images were thumbnail quality - compressed, low resolution, web-scaled. A model trained on high-resolution photography was being asked to make fine-grained judgments from images where you sometimes can’t clearly tell what’s in the frame.

When a model is forced to choose from a fixed set of categories and can’t see clearly enough to choose confidently, it doesn’t say “I’m not sure.” It picks one. Confidently.

The outputs looked authoritative. That was the problem.

What It Was Getting Wrong

Invented features. The model would describe specific features in images where those features weren’t visible - or didn’t exist. It had learned from training that certain object types typically come with certain features. When forced to classify, it filled in what it expected to see rather than what was actually there.

Wrong counts. Visible elements - things that could be directly counted from the image - were miscounted. The model was extrapolating from partial visual evidence and guessing.

Adjacent objects described as the subject. Many images showed the subject alongside neighboring objects. The model would describe adjacent items as part of the subject, because nothing in the prompt told it where the subject ended and the surroundings began.

None of these errors announced themselves. The output was fluent, specific, and wrong. A failing translation is obvious. A hallucinated feature is not.

Why This Class of Error Is Hard to Catch

With the translation pipeline, errors were detectable: Chinese text where English was expected, format artifacts, untranslated words. These could be caught programmatically.
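For contrast, the kind of check that worked on the translation pipeline is trivial to write. A minimal sketch:

```python
import re

# CJK Unified Ideographs: untranslated Chinese left in "English" output.
CJK = re.compile(r"[\u4e00-\u9fff]")

def looks_untranslated(text: str) -> bool:
    """Flag outputs that still contain Chinese characters."""
    return bool(CJK.search(text))

looks_untranslated("A wooden chair with four legs")  # False
looks_untranslated("A wooden 椅子 with four legs")    # True
```

No equivalent check exists for a fluent English sentence describing a feature the image doesn't contain.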

Hallucinated visual content is different. A confident description of a nonexistent feature is grammatically correct, contextually plausible, and consistent with what the model knows about the domain. You can’t run a regex over it. You can only catch it by knowing what’s actually in the image - which defeats the purpose of having the model describe it.

At 12.8 million images, manual verification isn’t an option. The prompt had to be fixed.

The Fix: Stop Forcing the Model to Guess

The core problem was forced categorization. When you tell a model “classify this as A, B, or C,” you’re telling it to commit to one of three options regardless of what it can actually see. On a blurry thumbnail, that’s a coin flip delivered with confident language.

Two changes fixed most of the hallucinations.

Remove the forced categories. Instead of “classify as type A, B, or C,” the new prompt says “describe what you see.” The model isn’t pushed to choose when evidence is ambiguous. If it can see something clearly, it describes it. If it can’t tell, it describes what it can see without categorizing it.

Add explicit negative instructions. This worked better than expected for this model. Telling it specifically what not to do - “do NOT assume X is present, only mention if clearly visible,” “do NOT guess at counts, only state what you can directly see,” “do NOT use generic phrases” - was more effective than positive instructions about what to include. Negative instructions target the exact failure modes you’ve already observed.

Add scope rules. “Describe ONLY the subject, ignore adjacent objects visible in the frame.” The model had no rule telling it where the subject ended. Once told explicitly, it stopped describing surroundings as if they were part of the subject.
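Put together, the revised prompt looked roughly like this. This is a paraphrase of the three changes above, not the verbatim production prompt:

```python
REVISED_PROMPT = """\
Describe what you see in this image.

Rules:
- Describe ONLY the main subject; ignore adjacent objects in the frame.
- Do NOT assume a feature is present; mention it only if clearly visible.
- Do NOT guess at counts; state a number only if you can directly count it.
- Do NOT use generic filler phrases.
- If the image is too blurry or small to tell, say so instead of guessing.
"""
```

The shape of the change: open-ended description instead of forced categories, explicit negatives targeting observed failure modes, and one scope rule drawing the boundary around the subject.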

The Broader Lesson

VLMs are not lookup tables. They don’t retrieve facts about images. They generate text that is consistent with the image and with everything else they know. When the image is ambiguous, “everything else they know” fills in the gaps. A model that has been trained on millions of examples where feature X commonly appears alongside feature Y will generate descriptions mentioning X even when the image is too blurry to confirm it - because that’s the statistically likely completion.

This is not a bug in the model. It’s how language models work. The prompt has to account for it.

Forcing a model to categorize when it lacks the visual evidence to categorize reliably is the prompt designer’s error, not the model’s. Remove the forced choice, give it permission to describe rather than classify, and tell it explicitly which assumptions to avoid - and the same 7B model that was inventing features starts producing descriptions that are honest about what it can and can’t see.

The model didn’t get better. The instructions got better.


March 2026. Montreal.