Four Ways the Outbox Bit Me

In the previous post I described an architecture where the Moments library doesn’t know that sync exists. Services record Mutation values into a one-method trait; one implementation writes to an outbox table, another does nothing. The library code is identical regardless of which backend it’s running under, and a third backend could be added without touching the library at all. The architecture is sound. I still believe in it.

The first six weeks of running it were a different story. Four bugs in particular are worth describing, because each one tells you something the diagrams don’t. Two are about the abstraction — assumptions the trait quietly makes about how the world works, and what breaks when the world doesn’t cooperate. Two are about the substrate — the boring queue-and-retry machinery underneath the trait, which has its own opinions about how things should be done. Clean architecture doesn’t relieve you of the substrate’s problems. It just gives you one place to deal with them.

1. The pull-delete loop

The first bug was the most embarrassing one, and the cleanest illustration of a hidden assumption in the design.

The symptom, as a user would see it: trash a photo on your phone via the Immich mobile app. Watch it disappear from the desktop a few seconds later when the next pull cycle arrives. Watch it reappear a few seconds after that. Watch it disappear again. The asset ping-pongs across the network, never settling, every thirty seconds, forever — or at least until you notice and start digging.

The mechanics: the pull-side handler for AssetDeleteV1 called MediaService::delete_permanently, which dutifully recorded an AssetDeleted mutation into the outbox, which the push manager dutifully tried to send back to the server, which had just told us the asset was gone. Server says delete → local deletes → outbox records delete → push tries to delete on server → server returns 404, push gives up. So far so harmless. But the next pull cycle then re-emits the original delete (because the server is still asserting “this asset has been deleted, here’s the event”), and we go around again.

The recorder-based design has a hidden assumption baked into it: mutations have a single source, and that source is the user sitting at the desktop. The whole point of recording mutations is to push them to the server. When the server starts producing mutations — and in a two-way sync, of course it does — the recorder happily reports those back to the server too, and the system feeds on its own output.

The fix (commit f097174) was a Library::delete_permanently_from_sync variant that takes the same code path except it skips the recorder. The _from_sync naming convention now appears wherever pull-side handlers mutate the library: there’s a _from_sync for trash, for restore, for the album mutations, and so on. Each one is a tiny duplicate of its normal counterpart, with the recorder call removed. It feels redundant, and it is, but the redundancy is the whole point: a _from_sync method is the code-level form of “don’t echo back what you were just told”, and the duplication makes the don’t-echo explicit at every call site.
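To make the convention concrete, here is a minimal sketch rather than the actual Moments code. The trait name MutationRecorder, the InMemoryOutbox stand-in, the integer ids, and the shape of Library are all assumptions made for illustration, and the sketch shares a private helper between the two methods instead of duplicating the body.

```rust
use std::collections::HashSet;

/// Hypothetical mutation type; the real enum has many more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Mutation {
    AssetDeleted { id: u64 },
}

/// Assumed shape of the one-method trait the library records into.
pub trait MutationRecorder {
    fn record(&mut self, mutation: Mutation);
}

/// Stand-in for the outbox-backed recorder: rows are kept in memory.
#[derive(Default)]
pub struct InMemoryOutbox {
    pub rows: Vec<Mutation>,
}

impl MutationRecorder for InMemoryOutbox {
    fn record(&mut self, mutation: Mutation) {
        self.rows.push(mutation);
    }
}

pub struct Library<R: MutationRecorder> {
    pub assets: HashSet<u64>,
    pub recorder: R,
}

impl<R: MutationRecorder> Library<R> {
    /// User-initiated delete: change local state, then record it for push.
    pub fn delete_permanently(&mut self, id: u64) {
        if self.remove_asset(id) {
            self.recorder.record(Mutation::AssetDeleted { id });
        }
    }

    /// Server-initiated delete: same local effect, nothing recorded,
    /// so the push side has nothing to echo back.
    pub fn delete_permanently_from_sync(&mut self, id: u64) {
        self.remove_asset(id);
    }

    /// Returns true if the asset existed and was removed.
    fn remove_asset(&mut self, id: u64) -> bool {
        self.assets.remove(&id)
    }
}
```

A pull-side handler that only ever calls the _from_sync method leaves the outbox empty, which is the property that breaks the loop.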

There were two alternatives I considered. One was a thread-local “I am inside the sync handler” flag that the recorder would check; that’s a kind of magic that hides the directional question rather than answering it. The other was a parameter on every service method indicating origin; that pushes the question into every signature instead of into a single naming convention. The _from_sync approach won because it makes the asymmetry visible exactly where it matters and nowhere else.

The architectural lesson, stated generally: a recorder-based design implicitly assumes mutations have a single source. The moment a second source appears, you have to mark it in the code, or you get an infinite loop with a polite retry interval.

2. Retry, backoff, and the death of multi-id rows

The second bug is one of substrate, not abstraction. It would have bitten any outbox-based design, regardless of how the recorder was shaped.

The first version of the outbox stored multi-id mutations as a single row. AssetTrashed { ids: [a, b, c] } produced one outbox row with a JSON-encoded array of ids. It was simpler. The mutation type kept its natural shape (“the user trashed three photos” is a single mental event, after all) and the database had two fewer rows to deal with.

It was also wrong, in three independent ways.

The first was partial progress. The push call for that row was, internally, a loop over the ids making per-asset API calls. If the server accepted three of five ids and then the connection dropped, there was no way to record the partial success. The next retry would re-send all five, and the server’s behaviour on the already-trashed ones was implementation-defined enough that I didn’t trust it.

The second was blast radius. If a single id in the batch was the cause of a permanent failure — say, the user trashed an asset, then permanently deleted it elsewhere, then triggered a sync — the batch would fail forever on that one bad asset. The other four perfectly good operations were dragged into the dead-letter queue alongside it.

The third was per-batch backoff. Exponential backoff on the batch row meant one bad asset could starve the entire trash pipeline behind it. Asset X fails three times, the row sleeps for ten minutes, all five operations wait. Asset X fails again, the row sleeps for an hour. Operations Y and Z, which would have succeeded immediately, are blocked on Asset X’s bad behaviour.

The fix (commit 77d9647) rewrote Mutation::to_outbox_rows() to produce one row per entity, and added per-row retry/backoff machinery: attempts and next_attempt_at columns, exponential backoff capped at one hour, a DeadLetter status after ten consecutive failures so a permanently bad row stops being retried. AssetTrashed { ids: [a, b, c] } now produces three rows. Each one is independently retryable, independently dead-letterable, independently backed off.
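A rough sketch of the per-row shape, under stated assumptions: the OutboxRow fields, the Status enum, the 30-second base delay, and the record_failure helper are invented for illustration. Only the one-hour cap, the ten-failure dead-letter threshold, and the one-row-per-entity fan-out come from the description above. The sketch stores a relative delay where the real schema has a next_attempt_at column.

```rust
use std::time::Duration;

/// Hypothetical subset of the mutation enum.
pub enum Mutation {
    AssetTrashed { ids: Vec<String> },
}

#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Pending,
    DeadLetter,
}

/// One outbox row per entity, each carrying its own retry state.
#[derive(Debug, Clone, PartialEq)]
pub struct OutboxRow {
    pub entity_id: String,
    pub op: &'static str,
    pub attempts: u32,
    /// Delay before the next attempt (stands in for next_attempt_at).
    pub retry_in: Duration,
    pub status: Status,
}

const MAX_ATTEMPTS: u32 = 10; // dead-letter threshold from the text
const BASE_DELAY_SECS: u64 = 30; // assumed; the post does not state the base
const MAX_DELAY_SECS: u64 = 3600; // one-hour cap from the text

impl Mutation {
    /// Fan a multi-id mutation out into independently retryable rows.
    pub fn to_outbox_rows(&self) -> Vec<OutboxRow> {
        match self {
            Mutation::AssetTrashed { ids } => ids
                .iter()
                .map(|id| OutboxRow {
                    entity_id: id.clone(),
                    op: "asset_trashed",
                    attempts: 0,
                    retry_in: Duration::ZERO,
                    status: Status::Pending,
                })
                .collect(),
        }
    }
}

/// Delay after the given number of failures: base * 2^(attempts - 1), capped.
pub fn backoff(attempts: u32) -> Duration {
    let exp = attempts.saturating_sub(1).min(20);
    let secs = BASE_DELAY_SECS.saturating_mul(1u64 << exp);
    Duration::from_secs(secs.min(MAX_DELAY_SECS))
}

impl OutboxRow {
    /// Record one failed push attempt for this row only.
    pub fn record_failure(&mut self) {
        self.attempts += 1;
        if self.attempts >= MAX_ATTEMPTS {
            self.status = Status::DeadLetter;
        } else {
            self.retry_in = backoff(self.attempts);
        }
    }
}
```

Because the retry state lives on the row, a failure on one id changes nothing about its siblings: they keep attempts at zero and remain eligible for the next push cycle.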

I want to make this lesson explicit because it took me three iterations to internalise: the unit of retry should be the unit of failure. If a single id can fail independently, each id needs its own row. The temptation to compact the outbox is real — fewer rows feels cleaner, the JSON arrays are tidier in the schema browser — but the cost shows up under exactly the conditions you can’t easily reproduce in development.

This isn’t a recorder bug. It’s a queue-design bug, and would have happened with a channel, a Kafka topic, or a hand-rolled WAL. But it’s still a bug the architecture made me responsible for, because the trait that the architecture promises is only as good as the queue underneath it.

3. Identity across the round-trip

The third bug was the one that taught me the difference between decoupling and ignoring.

MediaId and AlbumId are local UUIDs that Moments generates when an asset or album is first created on the desktop. Immich also generates UUIDs for assets and albums on its end, and those land in a column called external_id. The two are not the same. The local id is the row’s primary key forever; the external id is stamped onto the row as soon as the server’s id is known, which for a locally created asset is when the push succeeds.

So far so reasonable. Now consider the lifecycle of a freshly imported photo. The user drops a JPEG into the import folder. MediaService::import generates a fresh MediaId, writes a row, records an AssetImported mutation. The push manager picks up the outbox row, uploads the file, receives the Immich asset id from the response, stamps it into media.external_id. Beautiful. Then the next pull cycle, dutifully fetching everything the server knows about, streams an AssetV1 event for that same asset.

The original handler for AssetV1 did INSERT OR REPLACE INTO media (id, external_id, …). The id it inserted was a freshly-generated local UUID — because that’s how the pull handler always created assets. The pull handler had no way to know that this asset already existed locally under a different local id, because the only thing connecting them was the external_id column it was about to overwrite.

The result: a brand-new row, with a brand-new MediaId, replacing the original one. Every album-membership row pointing at the original MediaId now dangled. The asset appeared in the timeline grid but not in any of the albums the user had just added it to. Worse, the original row’s on-disk thumbnail and cached EXIF data were now associated with an id that no longer existed in media.

Commits d43859f and 1692881 fixed this by making MediaId and AlbumId round-trip stable. On inbound sync, every handler looks up the existing row by external_id first. If a row already exists with that external id, the handler reuses its MediaId and updates fields in place, never minting a new one. The local id becomes the durable identity; the external id is the bridge that lets inbound sync find it again.
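The lookup-first rule is easy to sketch against an in-memory table. This is an illustration under assumptions, not the code from those commits: MediaTable, its method names, and the counter standing in for UUID generation are all hypothetical, and a real implementation would do the lookup in SQL against an indexed external_id column.

```rust
use std::collections::HashMap;

/// Local primary key (a counter stands in for a generated UUID).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct MediaId(pub u64);

#[derive(Debug, Clone, PartialEq)]
pub struct MediaRow {
    pub id: MediaId,
    pub external_id: Option<String>,
    pub file_name: String,
}

#[derive(Default)]
pub struct MediaTable {
    rows: HashMap<MediaId, MediaRow>,
    next_id: u64,
}

impl MediaTable {
    fn mint_id(&mut self) -> MediaId {
        self.next_id += 1;
        MediaId(self.next_id)
    }

    /// Local import: fresh MediaId, no external id yet.
    pub fn import(&mut self, file_name: &str) -> MediaId {
        let id = self.mint_id();
        let row = MediaRow { id, external_id: None, file_name: file_name.to_string() };
        self.rows.insert(id, row);
        id
    }

    /// Push completion: stamp the server's id onto the existing row.
    pub fn stamp_external_id(&mut self, id: MediaId, external_id: &str) {
        if let Some(row) = self.rows.get_mut(&id) {
            row.external_id = Some(external_id.to_string());
        }
    }

    /// Inbound sync: look up by external id first, mint only on a miss.
    pub fn upsert_from_sync(&mut self, external_id: &str, file_name: &str) -> MediaId {
        let existing = self
            .rows
            .values()
            .find(|r| r.external_id.as_deref() == Some(external_id))
            .map(|r| r.id);
        match existing {
            Some(id) => {
                // Known asset: update fields in place, keep the local id.
                if let Some(row) = self.rows.get_mut(&id) {
                    row.file_name = file_name.to_string();
                }
                id
            }
            None => {
                // First sighting of a server-side asset: mint a local id.
                let id = self.mint_id();
                let row = MediaRow {
                    id,
                    external_id: Some(external_id.to_string()),
                    file_name: file_name.to_string(),
                };
                self.rows.insert(id, row);
                id
            }
        }
    }

    pub fn len(&self) -> usize {
        self.rows.len()
    }
}
```

Running import, then stamp_external_id, then upsert_from_sync with the same external id returns the original MediaId and leaves the table at one row, which is exactly the import → push → pull round-trip that the original handler got wrong.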

The general lesson — and this one took me by surprise — is that “an outbox decouples local mutations from remote sync” is technically true and architecturally meaningless if the identity of an entity doesn’t survive the round-trip. The decoupling is between the two flows of operations, not between the two namespaces of identifiers. Those namespaces still have to be reconciled somewhere, and the schema is the natural place. You need a column, or a constraint, or a lookup convention that lets inbound sync recognise “this thing already exists” rather than treating every server-side row as a first-time event.

In retrospect, the lesson is obvious. In practice, it costs you a weekend the first time, because the bug is invisible in any test where the asset is created locally or fetched from the server but never both. The minimum reproducer is a real round-trip, and you don’t usually write tests that span import → push → pull on the same asset.

4. Reset semantics, or: what does the server mean by “start over”

The fourth bug is the one that taught me sync protocols are not what they appear to be.

Immich’s sync protocol can send a SyncResetV1 message when the last checkpoint is more than thirty days stale. The semantics, in the spec, are roughly: “your understanding of my state is too old to be incrementally caught up. Treat what follows as a complete re-emission of everything I know about.”

The naive interpretation is “wipe the local cache and re-fetch everything.” That interpretation is catastrophic for an offline-first store. The user has locally-imported, not-yet-pushed assets sitting in their library. Those have no external_id because the server has never heard of them. A blind wipe would delete the user’s recent imports the moment the sync handshake decided their checkpoint was stale — and “stale” in this protocol can mean as little as a month of laptop-closed time.

The first version of the reset handler tried to be careful. It built a HashSet<MediaId> of every local asset at the start of the reset cycle, removed each id as the stream re-emitted it, and at end-of-stream deleted whatever was left over. The intent was to delete only “things the server should have known about” and no longer mentioned. It still had three latent bugs, and I’ll only name them briefly because the full catalogue is in the design doc.

First, namespace confusion. media.id is the local UUID; the stream’s entity_id is the Immich UUID. The set was being keyed on one and reduced by the other, so the set never actually shrank — every reset cycle wiped the entire library. Second, locally-imported rows (external_id IS NULL) shouldn’t have been candidates at all, but the set didn’t filter them out. Third, albums and asset_faces had no orphan tracking, so they leaked across resets entirely.
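The first two bugs can be illustrated with a small sketch. This is not what the later rewrite did; it only shows how distinct newtypes for the two namespaces (LocalId and ExternalId are hypothetical names, as are the helper functions) would have turned the key mismatch into a compile error, and how filtering on external_id keeps local-only rows out of the candidate set.

```rust
use std::collections::HashSet;

/// Local primary key namespace.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct LocalId(pub String);

/// Server-side (Immich) id namespace.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ExternalId(pub String);

pub struct MediaRow {
    pub id: LocalId,
    pub external_id: Option<ExternalId>,
}

/// Deletion candidates: only rows the server has heard of, keyed in the
/// server's namespace. Rows with no external id never enter the set.
pub fn reset_candidates(rows: &[MediaRow]) -> HashSet<ExternalId> {
    rows.iter().filter_map(|r| r.external_id.clone()).collect()
}

/// Remove every id the stream re-emitted; whatever remains is orphaned.
/// Passing a LocalId here would not type-check.
pub fn orphans(mut candidates: HashSet<ExternalId>, seen: &[ExternalId]) -> HashSet<ExternalId> {
    for id in seen {
        candidates.remove(id);
    }
    candidates
}
```

With three rows, one of them local-only, and a stream that re-emits just one external id, the orphan set contains exactly the one server-known row that was not re-emitted.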

Commits e54e037 and 3e24d9f rewrote the whole thing into a heartbeat-based reconciliation, which is the kind of state-machine design that’s interesting on its own merits. On SyncResetV1 the reset handler captures the current timestamp as a “checkpoint” and clears the per-entity-type ack cursors. Nothing gets deleted yet. Every entity-handler — AssetV1, AlbumV1, PersonV1, AssetFaceV1 — has an UPDATE table SET last_seen_at = now() WHERE id = ? baked into it. Any locally-driven server interaction (a push completion, a favorite round-trip) bumps the same column.

On SyncCompleteV1, four targeted sweeps run — on media, albums, people, and asset_faces — each filtering rows where last_seen_at < checkpoint. The media and albums sweeps add external_id IS NOT NULL as a guard, which is the single line of defence between a reset cycle and the user’s local-only imports. people and asset_faces don’t need that guard: nothing local creates them — they only exist once the server emits them. The whole scheme is crash-safe by accident: partial heartbeats persist across a process restart, so an interrupted reset cycle finishes correctly on the next pull rather than starting over.
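The reset, heartbeat, and sweep steps reduce to very little code. The following is a sketch under assumptions: rows live in a Vec instead of a database table, timestamps are plain integers, and the names ResetCycle, begin_reset, touch, and sweep are made up for the example. It models only the media sweep, including the external_id guard.

```rust
/// Hypothetical in-memory stand-in for a media row.
#[derive(Debug, Clone, PartialEq)]
pub struct MediaRow {
    pub id: u32,
    pub external_id: Option<String>,
    pub last_seen_at: u64,
}

/// State captured when SyncResetV1 arrives.
pub struct ResetCycle {
    pub checkpoint: u64,
}

/// SyncResetV1: remember when the cycle began. Nothing is deleted here.
pub fn begin_reset(now: u64) -> ResetCycle {
    ResetCycle { checkpoint: now }
}

/// Entity handler or push completion: bump the row's heartbeat.
pub fn touch(rows: &mut [MediaRow], id: u32, now: u64) {
    if let Some(row) = rows.iter_mut().find(|r| r.id == id) {
        row.last_seen_at = now;
    }
}

/// SyncCompleteV1: delete server-known rows that were not seen since the
/// checkpoint. Rows without an external id are always kept.
/// Returns the number of rows removed.
pub fn sweep(rows: &mut Vec<MediaRow>, cycle: &ResetCycle) -> usize {
    let before = rows.len();
    rows.retain(|r| r.external_id.is_none() || r.last_seen_at >= cycle.checkpoint);
    before - rows.len()
}
```

The retain predicate is the negation of the sweep condition in the text: a row is deleted only when it has an external id and its last_seen_at is older than the checkpoint. Since heartbeats are stored on the rows themselves, a sweep that runs after a restart sees the same state it would have seen without one.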

The lesson is broader than reset. “Reset” in a sync protocol is not “delete everything.” It’s “re-establish ground truth without losing what only you know about.” The set of things only-you-know-about is — for an offline-first store — every row without an external_id plus every locally-mutated field that hasn’t yet been pushed. If your reset handler doesn’t have a name for that set, your reset handler is going to delete it.

This is the bug that has the least to do with the recorder trait itself. It belongs to the broader category of “things you only discover once you have the architecture, the outbox, and a real round-trip running long enough to encounter a thirty-day boundary.” But it earned its place on this list because it follows the same shape as the others: the architecture didn’t cause it, the architecture didn’t prevent it, but the architecture is what gave us one well-defined place to fix it.

What the four have in common

The bugs span the two categories the intro set out — abstraction (1 and 3) and substrate (2 and 4) — and what they share, across both, is that none of them required changing the architecture. Bug 1 added a _from_sync variant alongside the user-facing method, not a new trait or a new flag on the recorder. Bug 2 rewrote to_outbox_rows() and added retry columns to the schema; the recorder trait was untouched. Bug 3 changed how inbound handlers look up existing rows; nothing above the schema cared. Bug 4 added a single column (last_seen_at) and four sweep queries; the trait, the enum, the outbox table layout, the library services — all unchanged.

That’s the dividend the architecture actually pays. The bugs are real, expensive, and embarrassing in their own ways. None of them propagated upward. The library still doesn’t know sync exists. It just describes what it changed, and lets the substrate — and the people maintaining it — pay the round-trip’s price.
