The Setup
I was running a vision-language model across 10 A100 GPUs on 4 VMs, converting 12.8 million images into text descriptions. Each GPU was processing a chunk of roughly 165,000 properties. A single property has on average 9-10 images. The model looks at all of them together and produces a structured description.
The full run was expected to take 25-30 days. I was running it alone - the entire unstructured data pipeline was my responsibility, from data ingestion to model output. Long enough, and complex enough, that things would go wrong.
On December 20th, something went wrong.
Why These Bugs Are Hard to See Coming
Before getting into what happened: this post is not about obvious mistakes. The failure modes described here are well-documented in distributed systems engineering - they’re named, written up, and discussed on Stack Overflow with thousands of upvotes - precisely because they catch experienced developers regularly. They share one property: they only appear at full scale, after days of processing. You cannot reproduce them in a notebook test on 100 properties. You cannot unit test a 25-day run.
Our code had also grown from a prototype. The pipeline started as a script that worked on a single GPU over a few hours. Checkpointing was added as the scale grew. The gap between “has checkpointing” and “has production-grade checkpointing” only becomes visible when the job runs long enough to encounter conditions a short run never hits.
This is how most real pipelines are built. The bugs weren’t hiding in plain sight. They were hiding behind each other, at a scale where no amount of upfront testing would have surfaced them.
The Mystery
VM3-GPU0 was at 98% progress. Then, without warning, it dropped to 18%.
Not a crash with a stack trace. Not an out-of-memory error. Just: progress gone. The process was still running. The checkpoint file existed. But when the script read the checkpoint, it thought only 18% of the work was done.
The cause was never definitively identified - possible checkpoint file corruption, a disk issue, or a process crash and restart that didn’t preserve state. It didn’t matter. 80% of VM3-GPU0’s work appeared to be gone.
What happened next was worse than the original problem. This is the cascade: each fix revealed the next bug, and none of the three was visible until the one before it had been fixed.
Bug 1: The Checkpoint Backups That Weren’t
The script was designed to keep rotating backups - the current checkpoint plus two previous versions. If the main checkpoint corrupted, you’d have a fallback.
Except there was a filename bug. The backup files were being written to the wrong paths. The code thought it was saving checkpoint_backup_1.json and checkpoint_backup_2.json. It was saving them somewhere else entirely.
So when VM3-GPU0’s main checkpoint corrupted, there was no usable backup. The work from those 80% of properties that appeared done: unrecoverable.
Fix: correct the backup file paths. Add verification after each save - confirm the file exists, confirm its size is non-zero. Stop trusting that write operations succeed silently.
All machines were stopped to pull the updated code.
Bug 2: The Flag That Destroyed Everything
Once the checkpoint fix was deployed, three GPUs needed to be restarted: VM1-GPU0, VM1-GPU1, and VM2-GPU0. They were restarted without the --resume flag.
Without --resume, the script didn’t look for existing checkpoints. It started fresh. It overwrote the checkpoint file. It created a new, empty parquet file. It created a new, empty embeddings file.
Hundreds of thousands of already-processed properties, gone. Not corrupted. Not recoverable. Overwritten from scratch.
This one has a well-known cousin: Path.rename() in Python cannot move a file across filesystems - it raises OSError: Invalid cross-device link, and code that catches exceptions too broadly swallows it. The script appears to work. The file ends up nowhere. It has its own Stack Overflow thread with thousands of upvotes from developers who were caught by it. The common thread: operations that look like they succeed but don't, with no error surfaced. Silent failures are the expensive ones.
Fix: make resume automatic. The script now detects whether a checkpoint file exists for the current job. If it does, it resumes. No flag needed. The dangerous default was: start fresh. The new default: resume if possible, always.
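A minimal sketch of what that auto-detection can look like. The file name checkpoint.json and the position field are illustrative, not the pipeline's actual schema:

```python
import json
from pathlib import Path

def load_checkpoint(path: Path):
    """Return the parsed checkpoint if one exists and looks valid, else None."""
    if not path.exists() or path.stat().st_size == 0:
        return None
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError:
        return None  # corrupt checkpoint: caller can fall back to a backup

# Resume is the default; starting fresh requires the checkpoint to be absent.
checkpoint = load_checkpoint(Path("checkpoint.json"))
start_index = checkpoint["position"] if checkpoint else 0
```

The point of the design is that the dangerous path (start from zero) is only reachable when there is genuinely nothing to resume from.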
Bug 3: The Resume That Didn’t Restore
After adding auto-resume, the GPUs were restarted. The script correctly detected the checkpoints and resumed from the right position. Progress counters looked right.
But the output files were empty.
The resume logic read the checkpoint to know which properties were already processed. Then it created new, empty output files - parquet and embeddings - and started appending from the current position. All the results from before the restart: gone again.
The checkpoint tracked which properties had been processed. It didn’t track where the results were. Resume logic that restores position but not data is not resume logic.
Fix: on resume, load the existing parquet file and the existing embeddings file before continuing. Append to what’s there. Three lines of code. Three weeks too late.
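The shape of that fix, sketched with a JSON-lines file as a stand-in for the parquet and embeddings outputs (the function names are illustrative): load whatever results already exist, then append, never truncate.

```python
import json
from pathlib import Path

def load_existing_results(path: Path) -> list:
    """On resume, load prior results so new batches extend the output, not replace it."""
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text().splitlines() if line]

def append_results(path: Path, rows: list) -> None:
    """Open in append mode; a restart must never recreate the file from scratch."""
    with path.open("a") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

The bug was exactly the difference between opening the output in write mode on restart and opening it in append mode after loading what was already there.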
The Total
| What was lost | Amount |
|---|---|
| Properties | ~315,000 |
| Affected GPUs | 4 |
| Estimated compute lost | ~25 single-GPU days |
| Bugs involved | 3 |
| Bugs visible simultaneously | 1 |
The 25 GPU-days figure is the part that stings. But the structure of the failure is the part worth understanding: three bugs, each invisible until the previous one was fixed. Each fix created the conditions to see the next one.
After all three fixes were in place, the remaining unfinished properties were redistributed across all 8 available GPUs. Processing completed January 8th.
What the Pipeline Looked Like After
The checkpoint code is now longer than the inference code. These are the requirements it satisfies:
Write atomically. Never write directly to the checkpoint file. Write to a temp file, verify it, then rename it over the old checkpoint. A partial write to a temp file leaves the previous checkpoint intact.
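A sketch of the atomic-write pattern, assuming a JSON checkpoint; the temp file must live in the same directory as the checkpoint so the final rename stays on one filesystem:

```python
import json
import os
import tempfile
from pathlib import Path

def write_checkpoint_atomically(path: Path, state: dict) -> None:
    """Write to a temp file beside the checkpoint, then atomically swap it in."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)     # atomic on POSIX; a crash leaves the old checkpoint intact
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)        # clean up the orphaned temp file on failure
        raise
```

os.replace is the same-filesystem rename that makes this safe; using a temp directory on another device would reintroduce the cross-device problem described earlier.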
Verify after every save. After every checkpoint operation: does the file exist? Is its size reasonable? Does it parse without errors? If any check fails, alert before the next batch runs.
Keep rotating backups with verified paths. Current checkpoint plus two previous versions. Not verified by reading the code - verified by confirming the files actually appear on disk during a test run.
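One possible shape for the rotation, with the verification built in rather than assumed (the .bak1/.bak2 naming is illustrative):

```python
import shutil
from pathlib import Path

def rotate_backups(checkpoint: Path, keep: int = 2) -> None:
    """Shift backups before the checkpoint is next overwritten: ckpt -> .bak1 -> .bak2."""
    for i in range(keep, 1, -1):
        older = checkpoint.with_suffix(f".bak{i - 1}")
        if older.exists():
            shutil.copy2(older, checkpoint.with_suffix(f".bak{i}"))
    if checkpoint.exists():
        shutil.copy2(checkpoint, checkpoint.with_suffix(".bak1"))
        # Verify on disk, not in intent: this assert is what the original code lacked.
        assert checkpoint.with_suffix(".bak1").exists()
```

The original filename bug would have been caught the first time this ran, because the assertion checks the path that was actually written, not the path the code meant to write.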
Resume must restore data, not just position. If the pipeline produces parquet files and embeddings, resume must load those files before continuing. The checkpoint file tracks position. The output files hold the work. These are different things.
Auto-detect resume conditions. Never require a flag to resume. If a checkpoint exists and looks valid, resume. Make the safe behavior the default.
Log what was restored. On resume, log how many properties were in the loaded parquet file, how many embeddings were loaded, and what position the checkpoint reports. If these numbers are inconsistent, fail loudly before processing a single property.
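A sketch of that consistency check, under the assumption that each property yields one result row and one embedding, so all three counts should agree on a clean resume:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("resume")

def check_restored_state(n_results: int, n_embeddings: int, position: int) -> None:
    """Log what was restored; refuse to start if the numbers disagree."""
    log.info("restored %d results, %d embeddings, checkpoint position %d",
             n_results, n_embeddings, position)
    if not (n_results == n_embeddings == position):
        raise RuntimeError(
            f"inconsistent resume state: {n_results} results, "
            f"{n_embeddings} embeddings, checkpoint says {position}"
        )
```

Failing here costs seconds; failing after the first batch overwrites something costs days.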
The Lesson
In a long-running distributed pipeline, assume that every file operation can fail, assume that every restart loses state unless you explicitly prove otherwise, and assume that the thing you’re monitoring is not necessarily the thing that matters.
The cascade structure was the real difficulty here. These three bugs don’t appear together - they appear in sequence, each one masked by the one before it. That’s what makes distributed systems failures expensive to diagnose: you’re not solving one problem, you’re solving a series of problems that only become visible one at a time.
The checkpoint code that came out of this is bulletproof. It cost 25 GPU-days to get there. That’s an expensive lesson - but it’s also a common one, and the system that came out the other end reflects it.
March 2026. Montreal.