
The Problem With Labels Is Everything Before the Labels

2026-03-04

pipelines · llms · unstructured-data

The Goal

The project involved building a multimodal classification pipeline for a large dataset - 1.5 million records, each with structured tabular data, a text description, and on average 9-10 associated images. My job was the unstructured data side: extract meaning from the images and text, convert it into structured features, and feed those features into a prediction model alongside ~300 existing tabular columns.

One step in that process was classification: generate text descriptions from 12.8 million images using a vision-language model, then take those descriptions plus the original text for each record, and assign semantic labels. Up to 50 labels per record, binary yes/no.
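The labeling step can be sketched roughly as follows. This is a minimal illustration, not the production code: the label names, prompt shape, and answer format are all hypothetical stand-ins for the real 50-label taxonomy.

```python
# Sketch of the per-record labeling step: combine the VLM-generated image
# descriptions with the record's original text, ask for a yes/no decision
# per label, and parse the answers into binary features.

LABELS = ["recently_renovated", "natural_light", "dated_interior"]  # 3 of 50

def build_prompt(descriptions: list[str], record_text: str) -> str:
    """Assemble one classification prompt for a single record."""
    joined = "\n".join(f"- {d}" for d in descriptions)
    questions = "\n".join(f"{name}: yes/no" for name in LABELS)
    return (
        "Image descriptions:\n" + joined + "\n\n"
        "Record text:\n" + record_text + "\n\n"
        "Answer yes or no for each label, one per line:\n" + questions
    )

def parse_answers(raw: str) -> dict[str, int]:
    """Turn 'label: yes' lines into a {label: 0/1} dict, defaulting to 0."""
    out = {name: 0 for name in LABELS}
    for line in raw.splitlines():
        if ":" not in line:
            continue
        name, _, answer = line.partition(":")
        name = name.strip()
        if name in out:
            out[name] = 1 if answer.strip().lower().startswith("yes") else 0
    return out
```

Defaulting missing or malformed answers to 0 is a deliberate choice: at 1.5 million records, some fraction of model outputs will not parse, and a binary pipeline needs a deterministic fallback.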

The question nobody had fully answered: what should the 50 labels be?

Problem 1: Who Decides What Matters?

The obvious answer is domain experts. Ask the people who know the domain what signals matter.

But which domain experts? This is where it gets complicated.

The client was a data company - they aggregated records at scale, enriched them with macroeconomic indicators, and built structured datasets. They understood data pipelines. They did not have specialists who could tell you which visual features of a subject correlate with the outcome being predicted.

So you look for external domain experts. But domain experts are rarely neutral. People on opposite sides of the same transaction are trained to interpret the same evidence in opposite directions. A feature one side presents as a strength, the other side uses to negotiate. The same observable characteristic gets opposite labels depending on who’s describing it and why.

Even if you could find neutral observers, “what matters for the outcome” is a research question with no clean consensus. Some signals dominate regardless of context. Others depend on segment, demographics, geography, and timing.

Without a proper domain partner, the label design was necessarily a best guess. We analyzed the descriptions the VLM had produced, looked for recurring themes, cross-referenced with domain literature, and built a taxonomy of 50 labels that seemed to cover the space.
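The theme-mining pass over the VLM output was conceptually simple: surface the terms that recur across descriptions and treat them as candidate label themes. A toy sketch, with an illustrative stopword list and example descriptions rather than the real corpus:

```python
# Rough sketch of the recurring-theme pass: count content words across
# VLM descriptions to surface candidate label themes for the taxonomy.
from collections import Counter

STOPWORDS = {"the", "a", "with", "and", "is", "in", "of"}

def recurring_themes(descriptions: list[str], top_n: int = 5) -> list[str]:
    """Return the top_n most frequent non-stopword terms."""
    words = Counter()
    for d in descriptions:
        for w in d.lower().split():
            w = w.strip(".,")
            if w and w not in STOPWORDS:
                words[w] += 1
    return [w for w, _ in words.most_common(top_n)]
```

In practice this kind of frequency pass only seeds the taxonomy; the cross-referencing with domain literature is what turns recurring terms into defensible labels.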

Problem 2: Description Is Not Impression

A vision-language model describes what it sees. That is not the same as how a human reacts to what they see.

“The space has dark surfaces and limited natural light” is a description. Whether that’s desirable depends entirely on the person evaluating it. For one person it’s a dealbreaker. For another it’s exactly what they wanted. The same objective description maps to opposite valuations depending on who’s reading it.

To capture human impression, you would need a model fine-tuned on human reactions - trained on data where real people rated real subjects and explained why. That data doesn’t exist publicly at scale. We weren’t going to collect it. And even if we had, human reaction is not a single thing - it’s a distribution across people with different preferences.

We tried two prompt strategies. The first asked the model to describe objects: what is visible, what type, what condition. The second asked for impressions: how does the space feel, what would someone notice first. The second approach produced more evocative text but also more hallucination - the model was generating what it expected people to feel rather than what was actually in the image. When you ask a model to feel something, it confabulates.

The labels from impression-style prompts were noisier. The labels from object-description prompts were more reliable but blander. We ran with the object-description approach and accepted that some of the signal we wanted was simply not accessible this way.
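The contrast between the two strategies lived almost entirely in the prompt. These templates are illustrative reconstructions, much shorter than the production prompts, but they show the shape of the difference:

```python
# Illustrative versions of the two prompt styles. The object prompt
# constrains the model to visible evidence; the impression prompt
# invites it to speculate about human reaction.

OBJECT_PROMPT = (
    "List the objects visible in this image. For each, note its type "
    "and apparent condition. Describe only what is actually visible."
)

IMPRESSION_PROMPT = (
    "Describe how this space feels. What would someone notice first? "
    "What overall impression does it leave?"
)
```

The explicit "only what is actually visible" constraint is the load-bearing phrase: it is what kept the object-description outputs blander but more reliable.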

Problem 3: You Cannot See What You Need to See

The features most diagnostic of condition - the ones that most directly signal “this needs work” vs “this is ready to use” - are fine-grained visual details. Surface wear. Hairline cracks. Staining. Aging materials. Outdated components.

These are exactly the features that disappear at thumbnail resolution.

The images in the dataset were thumbnail quality - compressed WebP files, web-scaled. High-resolution versions may have existed somewhere upstream. What we had was what we had.

The most valuable condition signals were invisible to the model. Not because the model was inadequate - a larger model with better vision wouldn’t have helped. The information simply wasn’t in the pixels.

This meant the condition labels we most wanted were the hardest to produce reliably. The model could sometimes infer condition from overall aesthetic - visible aging in materials, style dating - but couldn’t observe the specific details that would most clearly signal it.

Problem 4: Half of What Survived Was Already in the Database

After working through the first three problems - making reasonable label design choices without domain experts, using description rather than impression, accepting that fine detail was invisible - we ran the full classification pipeline and produced 50 labels for 1.3 million records.

Then we cross-referenced those labels against the ~300 tabular columns that already existed.

The results were uncomfortable. Around 24 of the 50 labels were highly redundant with existing tabular features. Presence or absence of specific amenities, structural characteristics, location attributes, size indicators - all of these had direct equivalents in the structured data. The tabular columns were often more precise than what the VLM could produce from images.
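The overlap audit itself is cheap to run, which is part of what made skipping it sting. A minimal sketch of the check, using hypothetical label and column names, with plain agreement between binary vectors standing in for whatever redundancy metric you prefer:

```python
# Minimal sketch of the label-vs-tabular redundancy audit: for each binary
# VLM label, flag tabular columns it agrees with almost everywhere.

def agreement(label_col: list[int], tab_col: list[int]) -> float:
    """Fraction of records where the VLM label and the tabular flag agree."""
    matches = sum(1 for a, b in zip(label_col, tab_col) if a == b)
    return matches / len(label_col)

def redundant_labels(
    labels: dict[str, list[int]],
    tabular: dict[str, list[int]],
    threshold: float = 0.95,
) -> list[tuple[str, str]]:
    """Return (label, column) pairs whose agreement exceeds the threshold."""
    flagged = []
    for lname, lcol in labels.items():
        for tname, tcol in tabular.items():
            if agreement(lcol, tcol) >= threshold:
                flagged.append((lname, tname))
    return flagged
```

Run before label design, this takes minutes and redirects weeks of compute; run after, it only confirms what the model will rediscover on its own.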

This is not a beginner’s oversight. Experienced engineers routinely build feature extraction pipelines before auditing what the existing structured data already captures - because the two workstreams feel separate. One team owns the tabular data. Another builds the unstructured pipeline. In a small team or a solo effort, the same person does both but at different times, months apart. The label design happens early when the tabular data feels like background context. The redundancy only becomes visible late, when both datasets are finally sitting next to each other and you run the overlap analysis. The sequence is the problem, not the skill level.

The XGBoost model confirmed it empirically. SHAP values - which measure each feature’s contribution to the model’s predictions - showed those 24 labels with near-zero importance. The model had learned to ignore them. They weren’t adding signal because the signal was already there in a more precise form.
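The pruning logic is straightforward once you have a SHAP matrix. In the real pipeline the matrix came from a tree explainer over the trained XGBoost model; the sketch below uses a toy matrix so the logic is self-contained, and the threshold value is illustrative:

```python
# Sketch of the SHAP-based pruning step: mean absolute SHAP value per
# feature across all records, then drop features below a near-zero cutoff.

def mean_abs_shap(shap_matrix: list[list[float]]) -> list[float]:
    """Mean |SHAP| per feature column, across all rows (records)."""
    n_rows = len(shap_matrix)
    n_feats = len(shap_matrix[0])
    return [
        sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j in range(n_feats)
    ]

def keep_features(
    names: list[str],
    shap_matrix: list[list[float]],
    min_importance: float = 1e-3,
) -> list[str]:
    """Keep only features whose mean |SHAP| clears the cutoff."""
    scores = mean_abs_shap(shap_matrix)
    return [n for n, s in zip(names, scores) if s >= min_importance]
```

Mean absolute SHAP is used rather than raw SHAP because positive and negative contributions would otherwise cancel, hiding features that matter in both directions.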

This is not obvious upfront. You design labels for what visually describes a subject. You don’t naturally think to audit 300 tabular columns first to check what’s already captured. The right sequence would have been: inventory the tabular data exhaustively, identify what it cannot capture, design labels specifically for those gaps. We did it in the other order and validated after the fact.

What Survived

After removing the redundant labels and the ones too subtle for thumbnail resolution, what remained was roughly 16 labels that were genuinely new - information the tabular data didn’t have and the VLM could actually observe.

They fell into a few categories:

Condition signals the database can’t capture. Whether the subject looked recently updated or dated. Whether the overall state suggested ready-to-use or needing work. The tabular data had construction year and material types, but not renovation recency or overall condition quality.

Style and aesthetic character. Whether the visual aesthetic read as contemporary and minimal, or traditional and classic, or something in between. These are judgments that don’t map to any structured field.

Spatial feel beyond raw measurements. Whether a space felt generous or cramped, light or dark - impressions the VLM could partially capture from composition and visible proportions even at low resolution.

The SHAP values confirmed it. The single most important label in the model was a condition assessment label - rank 21 out of 338 total features. Condition, absent from the tabular data, was exactly what the model was hungry for.

A +0.99% improvement in prediction accuracy came from 16 features out of 50. The other 34 contributed nothing.

What This Actually Taught Us

Running the classification pipeline was the easy part. Days of GPU time, a working VLM, a clean output file. The hard part was everything that came before the first prompt was written - and some of it has no clean solution.

Finding neutral domain experts is harder than it sounds. The people who know a domain best are usually invested in one side of it. Their knowledge is real but their labels are not neutral. Without an independent expert or a research-grade labeling effort, you’re making educated guesses dressed up as design decisions.

Even with perfect domain expertise, some things humans evaluate subjectively cannot be extracted from a model that has no concept of desirable vs undesirable. Description and impression are different tasks. A VLM trained on general vision data will describe what it sees. It will not tell you whether what it sees is good. Fine-tuning on human preference data would help - but that data rarely exists at the scale and specificity you need, and collecting it is a project in itself.

Even with the right labels and the right model, image quality sets a hard ceiling. The signals most worth capturing - condition, wear, subtle quality differences - are exactly the signals that compress away first. You can’t prompt your way around missing pixels.

And after all of that, you still need to check whether the structured data already has what you’re trying to extract. Labels designed in isolation from the existing feature set will duplicate what’s already there. The model will confirm this empirically, but only after weeks of compute.

The 16 labels that added real value were the ones that survived all five filters: a defensible design rationale, an observable visual signal, enough resolution to detect it, a non-subjective enough framing for the model to handle, and a genuine gap in the structured data. That’s a narrow target. Hitting it required working backwards from failure more than planning forward from first principles.

Next time the label design audit happens before the pipeline runs - not after.


March 2026. Montreal.