I was translating 1.5 million French text descriptions to English using Qwen 2.5-7B-Instruct on 6 A100 GPUs. The task sounded simple. It was not.
This post is about everything that went wrong, every failure pattern I discovered, and the six-layer fallback system I built to get 1,410,640 clean translations out of 1,507,741 attempts. It took about 12-15 hours of wall-clock time for the main translation pass. It took weeks to clean up what came out the other end.
The Setup
The project: a large-scale multimodal pipeline - tabular data, French text descriptions, and 24 million associated images. My job was building the entire unstructured data pipeline — alone. Translation was the first step.
I used Qwen 2.5-7B-Instruct across 3 VMs (6 A100 GPUs total), batch size 256, with Flash Attention 2 on one VM and standard attention on the others. Deterministic settings: temperature 0, top_p 1.0, seed 42. Checkpoints every 1000 batches.
The optimized throughput was 16.71 texts/sec with Flash Attention 2, about 14 texts/sec without. Respectable. The translation itself was the easy part.
The Chinese Problem
Exactly 6,035 translations came back in Chinese.
Not mistranslated French. Not garbled English. Clean, fluent Chinese. The model’s multilingual training meant that when it encountered ambiguous or malformed French input, it sometimes defaulted to Chinese instead of English. There’s no warning, no error. The output just isn’t in the language you asked for.
My first fix: strengthen the system prompt. “CRITICAL: You MUST output ONLY in English using Latin alphabet. Never output Chinese, Japanese, Korean, or any non-Latin characters.” This helped but didn’t eliminate the problem.
Second fix: force the assistant’s hand. I added a prefilled assistant message — {"role": "assistant", "content": "Here is the English translation:\n\n"} — so the model was already “speaking English” before it started generating. This reduced Chinese output significantly.
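In chat-template terms, the prefill looks something like this (a minimal sketch; `build_messages` and the exact prompt wording are illustrative, not the production prompt):

```python
def build_messages(french_text: str) -> list[dict]:
    """Build a chat with a prefilled assistant turn so generation
    continues in English instead of starting from scratch."""
    return [
        {"role": "system", "content": (
            "You are a translator. CRITICAL: You MUST output ONLY in "
            "English using Latin alphabet."
        )},
        {"role": "user",
         "content": f"Translate this French text to English:\n\n{french_text}"},
        # Prefilled assistant turn: the model continues this message,
        # so it is already "speaking English" when generation starts.
        {"role": "assistant",
         "content": "Here is the English translation:\n\n"},
    ]

messages = build_messages("TRES TRES PROPRE")
```

With the transformers library, you would render this via `tokenizer.apply_chat_template(messages, continue_final_message=True)` so the assistant prefix stays open and the model's first generated tokens continue it.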
Third fix: automatic detection and retry. Any translation flagged as [CHINESE_DETECTED] got retried with the stronger prompt. If it failed again, it was marked [TRANSLATION_FAILED] for the fallback pipeline.
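The detection itself can be a simple Unicode-range check. A sketch of the idea (the 5% threshold and the exact blocks covered are my assumptions, not the author's exact rule):

```python
def contains_cjk(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose share of CJK characters exceeds threshold.

    Covers the main CJK Unified Ideographs blocks plus kana and
    Hangul, so Japanese and Korean leaks are caught too.
    """
    cjk_ranges = [
        (0x4E00, 0x9FFF),   # CJK Unified Ideographs
        (0x3400, 0x4DBF),   # CJK Extension A
        (0x3040, 0x30FF),   # Hiragana + Katakana
        (0xAC00, 0xD7AF),   # Hangul syllables
    ]
    if not text:
        return False
    cjk = sum(1 for ch in text
              if any(lo <= ord(ch) <= hi for lo, hi in cjk_ranges))
    return cjk / len(text) > threshold

def mark(translation: str) -> str:
    """Tag contaminated outputs for the retry queue."""
    return "[CHINESE_DETECTED]" if contains_cjk(translation) else translation
```

A ratio check rather than an any-character check avoids flagging translations that legitimately quote a single foreign character.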
Fourth fix: MADLAD-400-3B-MT, a dedicated translation model, as a fallback for all Qwen failures. MADLAD fixed 11,367 out of 11,374 failed translations. That’s a 99.94% recovery rate.
The remaining 7 texts were fixed manually.
The Artifacts Were Worse Than the Failures
The Chinese translations were obvious — easy to detect, easy to fix. The real nightmare was the thousands of translations that were almost correct but contaminated with artifacts. I catalogued over 20 distinct failure patterns.
The “struggler” problem. The French text “TRES TRES PROPRE” (meaning “very very clean”) was sometimes translated as “struggling to be extremely clean,” and in some cases the model output began with “struggl,” “struggler,” or — memorably — “struggler struggler struggler” repeated before the actual translation. This pattern appeared in enough translations to need its own regex category.
Chat format leaks. The model is trained on chat data. Sometimes the chat format leaked: “Assistant:” appeared at the start of 207,000 translations. Not a subtle bug. Over 13% of outputs started with a role prefix that shouldn’t be there.
Meta-commentary. The model would sometimes add editorial notes: “Note: The phrase…” or “It seems there was a typo…” or “A more natural translation would be…” appended to the end of otherwise correct translations. It wasn’t translating — it was teaching.
Thinking out loud. Some translations included the model’s internal deliberation: “which literally means…”, “is often used to imply…”, “A common English equivalent might be…” — the model’s reasoning process leaked into the output as text.
Prompt injection. In some cases, the actual system prompt appeared in the output: “user Translate this French text to English.” followed by the translation. The model reproduced its own instructions.
All of these patterns occurred at low individual rates. But across 1.5 million texts, a 0.1% artifact rate means 1,500 contaminated translations. And I found over 20 different patterns.
The Postprocessing Pipeline
I built a multi-pass cleanup system:
Pass 1: Regex. A Python script (text_postprocessing.py) with three categories — PREFIX patterns (remove “struggl*”, “Assistant:”, “Note:”, “Here is the translation:”, prompt leaks from the start), SUFFIX patterns (remove “Note:…”, lecturing notes from the end), and MID patterns (remove “struggling” and compound words like “strugglingrightarrow” anywhere in the text). Plus sentence-level filtering for meta-deliberation.
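A minimal sketch of the three categories (the pattern lists here are illustrative samples, not the full production set from text_postprocessing.py):

```python
import re

# Illustrative subsets of the three regex categories.
PREFIX_PATTERNS = [
    r"^(?:struggl\w*\s*)+",                  # "struggl", "struggler struggler ..."
    r"^Assistant:\s*",                       # chat-role leak
    r"^Here is the (?:English )?translation:\s*",
    r"^user Translate this French text to English\.?\s*",
]
SUFFIX_PATTERNS = [
    r"\s*Note:.*$",                          # trailing editorial notes
    r"\s*A more natural translation would be.*$",
]
MID_PATTERNS = [
    r"\bstruggling\w*\b",                    # "struggling", "strugglingrightarrow"
]

def clean(text: str) -> str:
    """Apply PREFIX, SUFFIX, then MID patterns, then collapse whitespace."""
    for p in PREFIX_PATTERNS:
        text = re.sub(p, "", text, flags=re.IGNORECASE)
    for p in SUFFIX_PATTERNS:
        text = re.sub(p, "", text, flags=re.IGNORECASE | re.DOTALL)
    for p in MID_PATTERNS:
        text = re.sub(p, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()
```

The DOTALL flag on the suffix patterns matters: a trailing "Note:" can span several lines, and without it the cut stops at the first newline.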
Pass 2: Language detection. After regex cleanup, I ran langdetect across all 1.5M translations. Found 9,988 texts still entirely in French. The regex cleaned the artifacts but didn’t catch texts that were never translated at all.
Pass 3: OpenAI GPT-4o-mini. Sent those 9,988 French texts to the OpenAI API with 20 parallel workers; 8,600 were re-translated before the token limit hit. I used a commercial API as a cleanup layer for an open-source model's failures — ironic but effective.
Pass 4: Google Translate fallback. 1,388 texts remained after hitting the OpenAI token limit. Google Translate handled those.
Pass 5: French word detection. A second scan found 20,115 texts that had French words mixed in — not enough to trigger langdetect as French, but enough to contain untranslated phrases. Most started with “user Translate…” — prompt leak variants that partially translated.
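A sketch of that second scan. The marker list below is a small illustrative sample and the two-hit threshold is my assumption; the real scan presumably used a larger lexicon:

```python
# Small sample of French function words and accented markers.
FRENCH_MARKERS = {
    "très", "avec", "dans", "pour", "être", "chez",
    "nous", "vous", "cette", "aussi", "mais", "état",
}

def french_word_count(text: str) -> int:
    """Count tokens that match known French marker words."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(1 for w in words if w in FRENCH_MARKERS)

def is_contaminated(text: str, min_hits: int = 2) -> bool:
    """Flag texts with enough French function words to suggest an
    untranslated phrase, even when langdetect still reports 'en'."""
    return french_word_count(text) >= min_hits
```

Requiring at least two hits keeps single borrowed words ("chez" in a restaurant name, say) from triggering a needless re-translation.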
Pass 6: Second Google Translate pass. Re-translated those ~20K contaminated texts.
Pass 7: Final merge. Combined everything into a single clean parquet file.
Final result: 1,410,640 clean English translations. About 3,900 texts still contained some French contamination — mixed FR/EN that no automated system could cleanly separate.
Batch Size 512 Will Betray You
A side lesson from the translation run. On VM3, I tried batch_size=512 to speed things up. GPU memory sat at 83-87%. Throughput was decent. Then at 70% progress — OOM crash. 154,685 translations failed. A third of that VM’s workload, gone.
The problem: GPU memory at 85% looks stable for the first hour. Over several hours of continuous processing, allocator fragmentation accumulates, and the peak memory spikes that fit comfortably at first start hitting the ceiling.
I dropped to batch_size=256 everywhere and added preventive CUDA cache clearing every 50 batches. GPU memory stabilized at 57-67%. Failure rate went from 33% to 0.3-0.5%.
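The loop structure, sketched with the GPU call noted in a comment (`translate`, `clear_cache`, and `checkpoint` are hypothetical callbacks standing in for the real pipeline functions):

```python
def run_batches(batches, translate, clear_cache, checkpoint,
                clear_every=50, checkpoint_every=1000):
    """Process batches with periodic cache clearing and checkpointing.

    clear_cache would wrap torch.cuda.empty_cache(), which releases
    cached allocator blocks back to the driver and limits the
    fragmentation buildup that kills multi-hour runs.
    """
    results = []
    for i, batch in enumerate(batches, start=1):
        results.extend(translate(batch))
        if i % clear_every == 0:
            clear_cache()        # e.g. torch.cuda.empty_cache()
        if i % checkpoint_every == 0:
            checkpoint(results)  # persist progress so a crash loses <1000 batches
    return results
```

Clearing every 50 batches is frequent enough to bound fragmentation but rare enough that the synchronization cost stays negligible against minutes of inference per interval.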
The irony: batch_size=256 with Flash Attention 2 gave 16.71 texts/sec, while batch_size=512 without FA2 gave 12.18 texts/sec before it crashed. The smaller, stabler configuration was also the faster one.
What I Learned
If you’re running LLMs at scale on production data, here’s what notebook experiments won’t teach you:
Every possible failure mode will happen. If there’s a 0.01% chance of Chinese output, you’ll get 150 Chinese translations in 1.5 million. If there’s a 0.001% chance of the prompt leaking into the output, you’ll see it. Scale turns edge cases into guarantees.
The QA pipeline is as complex as the inference pipeline. I spent more engineering time on postprocessing, detection, and fallback logic than on the actual translation. This is normal and nobody talks about it.
Open-source LLMs have personality. Qwen likes to teach. It wants to explain its translation choices, note typos, suggest alternatives. This is helpful in a chat interface and catastrophic in a batch pipeline. You need to actively suppress these behaviors in production.
Have fallback models ready before you start. MADLAD-400-3B-MT saved the project. If I’d discovered the Chinese problem without a fallback model already identified, I would have lost days researching alternatives while GPUs sat idle.
Batch size headroom matters more than batch size throughput. The fastest configuration that crashes at 70% is worse than a slower one that runs to completion. Leave 30-40% GPU memory headroom for multi-hour runs.
The translation step was supposed to be the easy part of the pipeline. It took three weeks to get right. The image processing — 12.8 million photos through VLMs — was still ahead of me. But that’s a different post.