fix: ADR-0006 — clear redo when new work commits without a snapshot

/runda found silent data loss: with the non-fatal snapshot-failure policy, a committed mutation whose snapshot couldn't be staged left the redo stack stale (redo-clear was only a side effect of finalize), so a later redo silently discarded the new work. Same gap in batches.

- SnapshotStore::clear_redo() drops the redo stack + payloads
- snapshot_then / end_batch call it when committed user work has no staged snapshot; for disk-full it succeeds where a full backup couldn't (tiny index write + payload deletes)
- unit test + integration regression (forced staging failure)
- ADR-0006 implementation note records the fix + residual edge

1698 passed / 0 failed / 1 ignored; clippy clean.
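
For illustration only, the behaviour described above can be modelled in memory roughly as follows. This is a sketch under assumptions, not the code in `src/db.rs`: the field names, the `fail_staging` hook, the `String` state, and the free-function signatures are all hypothetical, and the real `SnapshotStore` works with on-disk payloads and an index rather than vectors.

```rust
// Hypothetical in-memory model of the snapshot store. The real store
// persists payloads and an index on disk; here everything is a Vec.
#[derive(Default)]
struct SnapshotStore {
    undo: Vec<String>,
    redo: Vec<String>,
    staged: Option<String>,
    fail_staging: bool, // assumed test hook: simulates e.g. a full disk
}

impl SnapshotStore {
    /// Stage a snapshot of the pre-mutation state. May fail.
    fn stage(&mut self, state: &str) -> Result<(), String> {
        if self.fail_staging {
            return Err("staging failed".to_string());
        }
        self.staged = Some(state.to_string());
        Ok(())
    }

    /// Promote the staged snapshot to an undo entry. Clearing redo here
    /// is the side effect that the original code relied on exclusively.
    fn finalize(&mut self) {
        if let Some(snapshot) = self.staged.take() {
            self.undo.push(snapshot);
            self.redo.clear();
        }
    }

    /// Explicitly drop the redo stack, independent of staging.
    fn clear_redo(&mut self) {
        self.redo.clear();
    }
}

/// Apply a mutation; a snapshot failure is logged, never fatal.
fn snapshot_then(store: &mut SnapshotStore, state: &mut String, new_value: &str) {
    let staged = match store.stage(state) {
        Ok(()) => true,
        Err(e) => {
            eprintln!("snapshot staging failed (non-fatal): {e}");
            false
        }
    };

    *state = new_value.to_string(); // the user's work commits regardless

    if staged {
        store.finalize();
    } else {
        // The fix: committed work without a snapshot still invalidates redo.
        store.clear_redo();
    }
}

/// Restore the previous state, moving the current one onto the redo stack.
fn undo(store: &mut SnapshotStore, state: &mut String) -> bool {
    match store.undo.pop() {
        Some(prev) => {
            store.redo.push(std::mem::replace(state, prev));
            true
        }
        None => false,
    }
}

/// Re-apply an undone state, if the redo stack still holds one.
fn redo(store: &mut SnapshotStore, state: &mut String) -> bool {
    match store.redo.pop() {
        Some(next) => {
            store.undo.push(std::mem::replace(state, next));
            true
        }
        None => false,
    }
}
```

In this model, committing a value, undoing it, and then committing new work with `fail_staging` set leaves the redo stack empty, so a later `redo` is a no-op and the new work survives. Without the `clear_redo` branch, the same `redo` would overwrite it.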
2026-05-24 21:10:44 +00:00
parent 5442cfc0b9
commit df6aa69155
4 changed files with 123 additions and 11 deletions
@@ -229,7 +229,22 @@ during implementation, user-confirmed where they extended the design:
 - **Snapshot-failure policy** (user-confirmed): staging / finalise /
  discard failures are **non-fatal** (logged) — the real persistence
  is the durable state and a backup hiccup must not fail the user's
-  work. Only *restore* failures surface (as `UndoFailed`).
+  work. Only *restore* failures surface (as `UndoFailed`). A `/runda`
+  review found that this policy left a **data-loss edge**: a committed
+  mutation whose snapshot could not be staged added no undo entry and
+  did not clear the redo stack (clearing was a side effect of
+  `finalize`), so a later `redo` could silently discard the new work.
+  Fixed: any committed user mutation (and any batch with ≥1 committed
+  mutation) now **clears the redo stack even when its snapshot could
+  not be staged**, via an explicit `SnapshotStore::clear_redo`
+  (`src/db.rs` `snapshot_then` / `end_batch`). For the realistic
+  failure (disk full), `clear_redo` — which deletes redo payloads and
+  rewrites a tiny index — succeeds even when a full backup couldn't.
+  **Residual edge** (accepted): if the *entire* `.snapshots/`
+  directory is unwritable (so `clear_redo` itself fails), a stale redo
+  can survive; but that state means the whole undo subsystem is
+  broken, which the user would already observe. Regression-tested in
+  `tests/undo_snapshots.rs::redo_is_cleared_when_new_work_commits_without_a_snapshot`.
 - **Batch** uses `BeginBatch`/`EndBatch` worker requests; `replay`
  wraps its loop so a multi-command replay is one undo step,
  finalised only if a mutation committed.