
Our 68-Hour Estimate Took 7 Days

2026-02-21

Tags: pipelines, llms

We benchmarked our VLM image-to-text pipeline in a notebook: 1,000 documents, a few runs, measured throughput. The math said 68 hours for the full 1.3 million documents on 8 GPUs. We gave the team that timeline. Then we started the production run.

It took 7 days: nearly 2.5x our padded 68-hour estimate, and over 6x what the raw math had promised.

This post is about why small-batch benchmarks systematically lie about production performance, and why the bottleneck was in the last place we expected.

The Setup

The project: extract structured text descriptions from 12.8 million images for a large-scale multimodal classification pipeline. We used Qwen2.5-VL-7B-Instruct, a vision-language model, to look at each image and produce a text description of what it saw. Structured features: type, condition, notable characteristics.

The first run used native HuggingFace Transformers across 10 GPUs. It preserved vision embeddings (3584-dimensional vectors per image) alongside the text descriptions. Slow — about 0.08-0.2 documents/sec/GPU. The full run took approximately 25 days.

Then vLLM added support for multimodal models, including Qwen2.5-VL. vLLM uses continuous batching and PagedAttention to process many requests concurrently. For text-only LLMs, it typically delivers 2-5x speedups. For vision-language models, no one had published benchmarks yet.

We ran our own.
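For reference, the offline vLLM call pattern for a vision-language model looks roughly like this. A minimal sketch: the prompt template and sampling settings are illustrative, not our exact production configuration.

```python
# Minimal sketch of offline multimodal inference with vLLM.
# Prompt template and sampling values are illustrative.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", max_model_len=8192)
params = SamplingParams(temperature=0.0, max_tokens=256)

image = Image.open("example.jpg").convert("RGB")
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image: type, condition, notable characteristics."
    "<|im_end|>\n<|im_start|>assistant\n"
)

# vLLM schedules requests itself (continuous batching + PagedAttention);
# you hand it prompts plus raw images and it returns generated text.
outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    params,
)
print(outputs[0].outputs[0].text)
```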

The Benchmark

1,000 documents. A few batch sizes. Measured throughput on a single GPU.

Batch Size | Throughput (docs/sec)
10         | 0.95
50         | 1.72
100        | 1.68
200        | 1.51

Batch size 50 was optimal: 1.72 documents/sec/GPU. On 8 GPUs: 1.72 × 8 = 13.76 documents/sec total.

1,320,000 documents ÷ 13.76 docs/sec = 95,930 seconds = 26.6 hours.
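Spelled out, the extrapolation we trusted was nothing more than this:

```python
# The naive scale-up math from the single-GPU benchmark.
docs = 1_320_000
per_gpu_rate = 1.72          # docs/sec at batch size 50
gpus = 8

total_rate = per_gpu_rate * gpus      # 13.76 docs/sec
seconds = docs / total_rate           # ~95,930 s
print(f"{seconds / 3600:.1f} hours")  # 26.6 hours
```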

Even being conservative and assuming some overhead, we estimated about 68 hours, or under 3 days (roughly a 2.5x buffer on the raw math). After the 25-day native Transformers run, this felt like a breakthrough.

We were wrong.

What Actually Happened

Metric                                   | Benchmark | Production | Difference
Documents/sec/GPU                        | ~1.72     | 0.25-0.32  | -81% to -85%
Total time (1.3M docs, 8 GPUs)           | ~68 hours | 6-7 days   | +110% to +150%
Effective speedup vs native Transformers | 20-30x    | 3-4x       | -85%

The speedup was still real — 3-4x faster than native Transformers is meaningful when the alternative is 25 days. But the “68 hours” number was fiction. We had committed to a timeline based on fiction.

Where the Time Went

The bottleneck was not inference.

Once images were preprocessed and fed to the model, batches of 256 images ran inference at approximately 170 images/sec. That’s fast. If the entire pipeline ran at inference speed, 68 hours would have been accurate.

But vLLM has an internal image preprocessor. It takes raw image bytes, decodes them, resizes them, normalizes them, and converts them to tensors before the model ever sees them. This preprocessor ran at 4-7 images per second.

4-7 images/sec. Not 170. The preprocessing was 25-40x slower than inference.

We tried to bypass it. We pre-computed tensors externally and tried to feed them directly to vLLM. It rejected them. vLLM’s multimodal pipeline is opaque — it insists on preprocessing images itself, and there’s no public API to skip that step. We tried parallel image loading with multiple workers feeding the preprocessor. Marginal improvement. The preprocessor itself was the ceiling, not the I/O.
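One way to see the ceiling directly is to time the HuggingFace image processor that vLLM wraps internally. This is a rough proxy, assuming vLLM's preprocessing closely tracks the HF implementation; the paths and sample size are placeholders:

```python
# Rough timing proxy for the preprocessing stage: decode + resize +
# normalize + tensorize, per image. Paths and sample size are placeholders.
import glob
import time

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
paths = glob.glob("images/*.jpg")[:500]

start = time.perf_counter()
for path in paths:
    img = Image.open(path).convert("RGB")
    processor.image_processor(images=img, return_tensors="pt")
elapsed = time.perf_counter() - start

print(f"{len(paths) / elapsed:.1f} images/sec")
```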

The overall throughput, preprocessing and inference combined, landed at about 28 images/sec in aggregate, or roughly 0.25-0.32 documents/sec/GPU (documents average ~10 images each).
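That number is consistent with a serial pipeline: when two stages run back-to-back per item, their rates combine harmonically, and the slow stage sets the pace. A back-of-envelope check, assuming strictly serial stages:

```python
# Two serial stages: per-item times add, so rates combine harmonically.
pre = 5.0    # images/sec, preprocessing (midpoint of the observed 4-7)
inf = 170.0  # images/sec, inference

combined = 1 / (1 / pre + 1 / inf)
print(f"{combined:.1f} images/sec")  # ~4.9 -- preprocessing dominates
```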

Why the Benchmark Lied

Three factors compounded, and none of them show up in a notebook test:

Factor 1: Preprocessing cost is amortized differently at scale. On 1,000 documents (~10,000 images), the preprocessing overhead is a small fraction of total runtime — maybe 20-30 minutes. The benchmark runs for an hour total, preprocessing is invisible, and inference throughput dominates your measurement. On 1.3 million documents (~13 million images), preprocessing becomes the majority of runtime. The fixed cost per image that was negligible at 10K images becomes the binding constraint at 13M.

Factor 2: Memory pressure accumulates over hours. The first hour of a production run looks like the benchmark. The 20th hour doesn’t. GPU memory fragments. CUDA caches fill. Garbage collection pauses accumulate. On one VM (VM4), sustained throughput was 25% lower than on the other VMs — likely thermal throttling or resource contention from running 4 GPUs continuously at 85% utilization for days.

Factor 3: Documents are not uniform. Our benchmark sample of 1,000 documents was stratified, but the tails of the distribution matter at scale. Some documents have 3 images. Some have 40. A 40-image document takes over 10x longer to preprocess than a 3-image one, but each counts as one document in the throughput measurement. At 1,000 documents, the variance averages out. At 1.3 million, dense regions of many-image documents create sustained slowdowns that never appeared in testing.

The Broader Lesson

After this experience, I changed how I estimate processing time for large-scale pipelines.

Run a mini-production test, not a notebook benchmark. A thousand items tells you your peak throughput. Ten thousand items, running for several hours, tells you your sustained throughput. The difference can be 10x.
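In practice that test can be as simple as logging a rolling throughput and watching whether it decays. A sketch, where `load_documents` and `process_document` stand in for the real pipeline:

```python
# Sketch of a mini-production test: run for hours on real data and log a
# rolling throughput instead of a single number. `load_documents` and
# `process_document` are placeholders for your actual pipeline.
import time

def load_documents(n):          # placeholder: stream n real documents
    return range(n)

def process_document(doc):      # placeholder: the real VLM pipeline step
    time.sleep(0.001)

CHUNK = 1_000
rates = []
start = time.perf_counter()

for i, doc in enumerate(load_documents(10_000), 1):
    process_document(doc)
    if i % CHUNK == 0:
        now = time.perf_counter()
        rates.append(CHUNK / (now - start))
        print(f"chunk ending at doc {i}: {rates[-1]:.2f} docs/sec")
        start = now

# The gap between peak and final sustained rate is what a notebook hides.
print(f"peak {max(rates):.2f} vs last chunk {rates[-1]:.2f} docs/sec")
```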

Measure the whole pipeline, not the model. We measured inference speed and extrapolated. The bottleneck was preprocessing — a step we didn’t even think to benchmark because it was “just loading images.” At scale, “just loading images” was 25x slower than the model.

Add a 3x multiplier to any estimate based on small-batch testing. This isn’t scientific. It’s a scar. Every estimate I made from notebook benchmarks was dramatically optimistic. The 3x multiplier would have put us at 8.5 days — close to the actual 7. I’d rather under-promise.

Budget for heterogeneity. If your data has variable-size items (images per document, tokens per document, rows per group), your average-case throughput is not your production throughput. The heavy items dominate processing time, and they cluster — real data isn’t shuffled uniformly.
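A toy calculation shows how the clustering bites. The numbers are invented, but the shape matches what we saw: a benchmark slice drawn from the "light" region overestimates sustained throughput on the full, clustered dataset:

```python
# Toy model of heterogeneity: heavy documents cluster in real data, so an
# early slice overestimates sustained throughput. Numbers are invented.
pre_rate = 5.0                 # images/sec (the preprocessing-bound stage)

light = [8] * 900_000          # typical documents: ~8 images each
heavy = [40] * 100_000         # a dense region of 40-image documents
ordered = light + heavy        # real data arrives clustered, not shuffled

def docs_per_sec(batch):
    return pre_rate * len(batch) / sum(batch)

print(f"benchmark slice: {docs_per_sec(ordered[:1_000]):.3f} docs/sec")  # 0.625
print(f"full run:        {docs_per_sec(ordered):.3f} docs/sec")          # ~0.446
```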

The Timeline We Actually Hit

Step                                      | GPUs | Estimated     | Actual
Image→text (native Transformers, Set 1)  | 10   | "a few weeks" | ~25-30 days
Image→text (vLLM, Set 2)                 | 8    | 68 hours      | 6-7 days
Image→text condensation (LLM, Set 3)     | 6    |               | 3 days

The vLLM run was still 3-4x faster than native Transformers. That’s a real improvement. But “3-4x faster” and “68 hours” are very different promises.

The native Transformers run couldn’t have been replaced by vLLM entirely, by the way. We needed the 3584-dimensional vision embeddings for a downstream embedding-based classifier, and vLLM doesn’t expose internal model embeddings. It’s a closed inference engine — you get text output, nothing else. So we ran both: native Transformers for embeddings (slow but necessary) and vLLM for a second set of text descriptions (fast but opaque).
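For completeness, here is a hypothetical sketch of that embedding path, the part vLLM couldn't replace. The pooling strategy is illustrative, not necessarily what we shipped; the point is that hidden states are only reachable through the native model:

```python
# Hypothetical sketch of extracting a per-image embedding via native
# Transformers. The pooling strategy is illustrative; 3584 is
# Qwen2.5-VL-7B's hidden size.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)
inputs = processor(text=[prompt], images=[image],
                   return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Mean-pool the last-layer hidden states at the image-token positions:
# one 3584-dim vector per image.
mask = inputs["input_ids"][0] == model.config.image_token_id
embedding = out.hidden_states[-1][0][mask].mean(dim=0)
print(embedding.shape)  # torch.Size([3584])
```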

One More Thing

During the vLLM run, we had a mid-run observation table tracking the actual slowdown:

Metric                          | Initial Estimate | Actual (Sustained) | vs Native Transformers
Documents/sec/GPU               | ~1.72            | 0.25-0.32          | 3-4x faster
Total time (1.3M docs, 8 GPUs)  | ~68 hours        | 6-7 days           | 3-4x faster
Speedup claim                   | 20-30x           | 3-4x               |

That “20-30x speedup” from the benchmark shrank to 3-4x in production. Still worth it. But if I’d built a timeline around the 20-30x number, I would have missed the deadline by a week.

The next time someone shows you a benchmark on 1,000 items and extrapolates to millions, ask them: did you run it for 48 hours straight? Did you measure preprocessing separately from inference? Did you account for data heterogeneity?

If the answer is no, multiply their estimate by 3 and hope for the best.