fix: ADR-0006 — clear redo when new work commits without a snapshot

/runda found silent data loss: with the non-fatal snapshot-failure
policy, a committed mutation whose snapshot couldn't be staged left
the redo stack stale (redo-clear was only a side effect of finalize),
so a later redo silently discarded the new work. Same gap in batches.

- SnapshotStore::clear_redo() drops the redo stack + payloads
- snapshot_then / end_batch call it when committed user work has no
  staged snapshot; for disk-full it succeeds where a full backup
  couldn't (tiny index write + payload deletes)
- unit test + integration regression (forced staging failure)
- ADR-0006 implementation note records the fix + residual edge

1698 passed / 0 failed / 1 ignored; clippy clean.
claude@clouddev1
2026-05-24 21:10:44 +00:00
parent 5442cfc0b9
commit df6aa69155
4 changed files with 123 additions and 11 deletions
+16 -1
@@ -229,7 +229,22 @@ during implementation, user-confirmed where they extended the design:
 - **Snapshot-failure policy** (user-confirmed): staging / finalise /
 discard failures are **non-fatal** (logged) — the real persistence
 is the durable state and a backup hiccup must not fail the user's
-work. Only *restore* failures surface (as `UndoFailed`).
+work. Only *restore* failures surface (as `UndoFailed`). A `/runda`
+review found that this policy left a **data-loss edge**: a committed
+mutation whose snapshot could not be staged added no undo entry and
+did not clear the redo stack (clearing was a side effect of
+`finalize`), so a later `redo` could silently discard the new work.
+Fixed: any committed user mutation (and any batch with ≥1 committed
+mutation) now **clears the redo stack even when its snapshot could
+not be staged**, via an explicit `SnapshotStore::clear_redo`
+(`src/db.rs` `snapshot_then` / `end_batch`). For the realistic
+failure (disk full), `clear_redo` — which deletes redo payloads and
+rewrites a tiny index — succeeds even when a full backup couldn't.
+**Residual edge** (accepted): if the *entire* `.snapshots/`
+directory is unwritable (so `clear_redo` itself fails), a stale redo
+can survive; but that state means the whole undo subsystem is
+broken, which the user would already observe. Regression-tested in
+`tests/undo_snapshots.rs::redo_is_cleared_when_new_work_commits_without_a_snapshot`.
 - **Batch** uses `BeginBatch`/`EndBatch` worker requests; `replay`
 wraps its loop so a multi-command replay is one undo step,
 finalised only if a mutation committed.