add ADR-0024: unified grammar tree execution plan (accepted)

Concrete specification for the direction in ADR-0023, landed during the
round-6 design pass. Resolves all four rounds of open design questions:
walker as single source of truth, scannerless terminal vocabulary (~8
building blocks), typed value slots with content validators, WalkContext
for schema-aware narrowing from day one, WalkOutcome multi-purpose return,
HintMode per-node, ranker as separate layer, static + dynamic sub-grammars,
aliases as Word annotations, IdentSource taxonomy, six-phase per-command
migration with chumsky and walker side-by-side during the transition.

Key shifts from ADR-0023's sketch:

- Lexer dissolves entirely. Walker operates on bytes directly.
  dsl/lexer.rs, dsl/keyword.rs go away in Phase F.
- Schema-aware parse from day one (not phased). Typed value slots reject
  mis-shaped input at parse time with localised wording. Completion
  narrows per column type.
- Sub-grammars: static (fn() -> Node) for composition; dynamic
  (fn(&WalkContext) -> Node) for schema-dependent expansion. No global
  named registry.
- Path-bearing commands: BarePath becomes a routine non-whitespace
  terminal. Paths with spaces require quoting via StringLit (UX
  simplification, aligns with standard CLI convention).
- 13-node taxonomy: Word, Punct, Ident, NumberLit, StringLit, BlobLit,
  Flag, BarePath, Choice, Seq, Optional, Repeated, DynamicSubgrammar.

Migration plan: Phase A (walker scaffolding + app-lifecycle commands),
Phase B (DDL without value literals), Phase C (create table), Phase D
(data commands with full schema awareness -- the design's central claim
landing), Phase E (replay), Phase F (delete chumsky + lexer + legacy
parser modules, simplify catalog). Estimated ~4 sessions total.

Also: rename ADR-0023 from 0023-proposed-unified-grammar-tree.md to
0023-unified-grammar-tree.md (git mv preserves history) and update its
status to reflect the direction-accepted-but-superseded-for-execution-detail
relationship with ADR-0024. Index updated.
2026-05-14 21:52:10 +00:00
parent 3b36bbb4d6
commit 74c3ec1edf
3 changed files with 719 additions and 14 deletions
@@ -0,0 +1,430 @@
+# ADR-0023: Unified declarative grammar tree (direction)
+
+## Status
+
+**Accepted in direction, superseded for execution detail by
+ADR-0024.** 2026-05-14.
+
+This ADR captures the architectural critique (the "10-place
+edit" scatter problem with the current parser shape) and the
+direction (a unified declarative grammar tree). The round-6
+design pass turned that direction into a concrete specification,
+which ships as ADR-0024. ADR-0024 makes some refinements
+beyond what's sketched here, notably the decisions to drop
+the lexer module entirely (scannerless walker) and to make
+narrowing schema-aware from day one rather than phasing it
+in. Read ADR-0024 for the executable plan; this ADR remains
+as the institutional memory of why the change is happening.
+
+The filename was renamed from `0023-proposed-unified-grammar-tree.md`
+to `0023-unified-grammar-tree.md` when the direction was
+accepted. History is preserved through the `git mv`.
+
+## Context
+
+### What hurt
+
+The round-5 session removed the (small, accidental) `q` alias
+for the `quit` command. Removing one keyword required edits in
+ten places:
+
+1. `define_keywords!` entry in `src/dsl/keyword.rs`
+2. Parser combinator branch in `src/dsl/parser.rs`
+3. `UsageEntry` row in `src/dsl/usage.rs::REGISTRY`
+4. Hardcoded keyword array in `usage.rs::every_command_has_a_registry_entry` test
+5. Hardcoded keyword array in `usage.rs::entry_keywords_alphabetised_returns_unique_sorted_commands` test
+6. `KEYS_AND_PLACEHOLDERS` declaration in `src/friendly/keys.rs`
+7. `parse.token.keyword.q` entry in `src/friendly/strings/en-US.yaml`
+8. `help.cli_banner` prose in `en-US.yaml`
+9. `help.in_app_body` prose in `en-US.yaml`
+10. Hardcoded keyword array in `completion.rs::empty_input_offers_app_command_entry_keywords` test
+
+Adding a brand-new app command in the same session required a
+similar number of touches across the same files, plus a
+`Command` enum extension, dispatch handler wiring, and a
+per-command usage template entry. The pattern: every new
+command or keyword incurs a cross-file scatter, typically 7
+files for a normal addition and 10 or more edit sites when
+tests and help text are also affected.
+
+### Why this happened
+
+The current architecture is the accretion of features added
+across separate ADRs, each locally sensible:
+
+- **ADR-0001** chose chumsky for parsing — a general-purpose
+  parser-combinator library oriented at programming-language
+  grammars with expression precedence, error recovery, and
+  ambiguous-grammar handling.
+- **ADR-0019** introduced the i18n catalog as a flat YAML +
+  Rust-side validator (`keys.rs`) for two-sided typo
+  protection.
+- **ADR-0020** added the lexer + `define_keywords!` macro
+  when the unified-token requirement bit. The macro
+  consolidated *keyword* definitions but didn't tackle the
+  broader command surface.
+- **ADR-0021** added a per-command `UsageEntry` registry so
+  parse errors could surface usage templates.
+- **ADR-0022** added completion by *introspecting* chumsky's
+  expected-token set at parse failure points, rather than by
+  consulting the grammar declaration directly.
+
+Each step solved its presenting problem. None of them
+restructured the grammar declaration to be the single source
+of truth for completion + highlighting + parse + help + i18n.
+
+Requirements for completion, highlighting, help, and i18n
+were all known from project start. A design pass at the
+beginning asking "what unified data structure carries all of
+this?" would not have landed on the current scattered shape.
+The trajectory reflects a process gap, not an inevitability.
+
+### What chumsky earned
+
+For the DSL we actually built — deterministic prefix-keyword
+commands with a small set of clauses — chumsky's general-purpose
+machinery is mostly unused:
+
+- We have no arbitrary expression grammar (no arithmetic
+  precedence, no function calls, no recursion).
+- We have no multi-error recovery requirement; we fail on the
+  first error and ask the user to fix it.
+- We have no ambiguous-grammar handling needs.
+
+Chumsky's `try_map` custom-error machinery is exercised (for
+"unknown type", "tables need at least one column", flag
+mutual-exclusion). These are pre-shape and post-shape
+validators that fit naturally into chumsky's combinator
+model. They could be expressed equally cleanly as per-node
+validator functions in a trie-based design — the placement
+matters less than the shape.
+
+### What a unified grammar would look like
+
+The proposed structure (sketched by the project owner during
+the round-5 design discussion):
+
+```js
+const commandGrammar = {
+  add: {
+    helpId: "add-command",
+    shortHelp: "Add structure to a table or relationship",
+    highlightType: "top-level-command",
+    cont: {
+      "1:n": {
+        highlightType: "sub-command",
+        cont: {
+          relationship: {
+            cont: {
+              from: {
+                cont: {
+                  // qualified column slot: <Table>.<Column>
+                  // bound to parent_table + parent_column
+                  // through the command's extractor.
+                  // …
+                  completionSource: "table-names",
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  },
+  import: {
+    helpId: "import-command",
+    cont: {
+      completionSource: "current-folder-zip-file-names",
+      // …
+    }
+  },
+  // …
+}
+```
+
+This declaration carries everything the current system spreads
+across many files: grammar shape (`cont`), completion sources
+(`completionSource`), highlight classes (`highlightType`),
+help references (`helpId`). The same declaration drives
+completion (walk to current node, list its children), syntax
+highlighting (each node's class), parse-error usage rendering
+(walk to failure point, list valid continuations), and AST
+construction (per-command extractor walks the matched path).
+
+The shape generalises to most SQL surface a teaching tool
+would expose. Where-clauses and similar reusable chunks can
+be named and registered separately, then referenced by ID
+from anywhere they're needed. True expression grammars
+(arithmetic, function calls, precedence climbing) — if
+they're ever needed — fit as opaque leaf nodes whose
+*structure* the trie validates, with the actual interpretation
+delegated to a downstream module or simply passed through to
+the SQL engine.
+
+## Proposed direction
+
+### Data model
+
+A single grammar registry, structurally similar to the
+sketch above, declared once in Rust. Per node:
+
+- `entry: &'static str` — the literal that selects this
+  branch (for keyword nodes) OR a typed slot descriptor (for
+  literal / identifier / completion-source nodes).
+- `cont: &'static [Node]` — child nodes representing valid
+  continuations from this point. Empty for terminal nodes.
+- `highlight: HighlightClass` — colour role for the input
+  pane and echo line. Inherits from parent if not specified.
+- `completion_source: Option<CompletionSource>` — for
+  identifier slots, the schema-cache key or static list
+  that drives Tab candidates and known-set validity checks.
+- `help_id: Option<&'static str>` — reference into the help
+  catalog (decoupled from grammar so wording changes don't
+  touch grammar).
+- `validator: Option<NodeValidator>` — per-position
+  validator function (e.g., "this identifier must be a
+  valid new name", or "this slot can occur at most once").
+- `extractor_role: Option<&'static str>` — names the role
+  this slot plays in the command's typed output (e.g.,
+  `"parent_table"`). Read by the command's extractor at
+  AST-construction time. Optional because the *positional*
+  shape of a command's tree is usually enough — the
+  extractor knows the command's structure and reads the
+  walked path in order.
+
+Per command (top-level node):
+
+- `ast_builder: fn(WalkedPath) -> Command` — walks the
+  matched path and produces the typed AST variant. Replaces
+  the per-command chumsky combinator's `.map(...)` closure.
+- `dispatch: fn(&mut App, Command) -> Vec<Action>` — the
+  dispatch handler. Replaces the per-command `match` arm in
+  `dispatch_input` / `dispatch_app_command`.
+- `help_id` — root help reference for the command family.
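A minimal sketch of how the per-node fields listed above might be declared as `const` data in Rust. The enum variants, the `<table>` slot descriptor, and the `effective_highlight` helper are illustrative assumptions, and `validator` and `extractor_role` are left out for brevity; ADR-0024 defines the node taxonomy that actually ships.

```rust
/// Colour role for a node (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HighlightClass {
    TopLevelCommand,
    SubCommand,
}

/// Where Tab candidates for an identifier slot come from (assumed variant).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompletionSource {
    TableNames,
}

/// One grammar node, mirroring a subset of the field list above.
#[derive(Debug)]
pub struct Node {
    pub entry: &'static str,
    pub cont: &'static [Node],
    /// `None` means "inherit from the parent".
    pub highlight: Option<HighlightClass>,
    pub completion_source: Option<CompletionSource>,
    pub help_id: Option<&'static str>,
}

/// Hypothetical declaration of `drop table <table>` as nested const data.
pub const DROP: Node = Node {
    entry: "drop",
    highlight: Some(HighlightClass::TopLevelCommand),
    completion_source: None,
    help_id: Some("drop-command"),
    cont: &[Node {
        entry: "table",
        highlight: Some(HighlightClass::SubCommand),
        completion_source: None,
        help_id: None,
        cont: &[Node {
            entry: "<table>",
            highlight: None, // inherits from the `table` node
            completion_source: Some(CompletionSource::TableNames),
            help_id: None,
            cont: &[],
        }],
    }],
};

/// Resolve a node's highlight class, falling back to the inherited one.
pub fn effective_highlight(node: &Node, inherited: HighlightClass) -> HighlightClass {
    node.highlight.unwrap_or(inherited)
}
```

Because the whole tree is `const`, tests and tooling can iterate it without any runtime registration step.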
+
+### Named sub-grammars
+
+For composable chunks (where-clauses, projection lists,
+qualified column references, value literals), the registry
+supports named sub-grammars:
+
+```rust
+register_subgrammar("where_clause", &[
+    // structure declaration …
+]);
+
+// Referenced from any command:
+SubgrammarRef("where_clause"),
+```
+
+The walker treats a `SubgrammarRef` as a transparent
+expansion at parse / completion / highlight time. The
+extractor reads the sub-grammar's matched path and applies
+the sub-grammar's own AST-fragment builder.
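To illustrate what "transparent expansion" could mean, here is a small sketch that flattens a node sequence by splicing in referenced sub-grammars. Everything in it is hypothetical: the three-variant `Node` enum, the `lookup_subgrammar` function standing in for the registry, and in particular the `where <column> = <value>` shape, which is invented for the example and is not the project's where-clause.

```rust
/// Simplified node kinds for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Node {
    Word(&'static str),
    Slot(&'static str),
    SubgrammarRef(&'static str),
}

/// Invented where-clause shape, used only to have something to expand.
const WHERE_CLAUSE: &[Node] = &[
    Node::Word("where"),
    Node::Slot("column"),
    Node::Word("="),
    Node::Slot("value"),
];

/// Stand-in for the named sub-grammar registry.
fn lookup_subgrammar(name: &str) -> Option<&'static [Node]> {
    match name {
        "where_clause" => Some(WHERE_CLAUSE),
        _ => None,
    }
}

/// Replace every `SubgrammarRef` with the nodes it names, recursively.
pub fn expand(seq: &[Node]) -> Result<Vec<Node>, String> {
    let mut out = Vec::new();
    for node in seq {
        match node {
            Node::SubgrammarRef(name) => {
                let body = lookup_subgrammar(name)
                    .ok_or_else(|| format!("unknown sub-grammar `{name}`"))?;
                out.extend(expand(body)?);
            }
            other => out.push(*other),
        }
    }
    Ok(out)
}
```

A walker that calls something like `expand` before matching never needs to know whether a node came from the command itself or from a shared chunk.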
+
+### Walker functions
+
+A single walker module exposes:
+
+- `complete(input, cursor) -> Vec<Candidate>` — walk the
+  trie alongside the typed prefix; at the cursor's
+  position, return the union of (a) literal children of the
+  current node, (b) candidates from the active node's
+  `completion_source`. Replaces `completion.rs`.
+- `highlight(input) -> Vec<StyledRun>` — walk producing the
+  highlight class per token range. Replaces the
+  ad-hoc lookups in `input_render.rs`.
+- `parse(input) -> Result<Command, ParseError>` — walk
+  consuming tokens, running per-node `validator`s, applying
+  the command's `ast_builder` at completion. Replaces
+  `dsl::parser::parse_command`.
+- `usage_at(input, position) -> UsageBlock` — walk to the
+  failure point, render the valid continuations as a
+  usage template. Replaces `usage::matched_entry`.
+
+All four operations read the same registry.
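As a rough illustration of the `complete` operation, the sketch below walks a toy keyword-only trie and returns the literal children that match the partial word being typed. It simplifies the signature above: the cursor is assumed to sit at the end of the input, candidates are plain strings, and there are no `completion_source` slots. The `GRAMMAR` contents are a small assumed subset of the real commands.

```rust
/// Keyword-only node for this sketch.
pub struct Node {
    pub entry: &'static str,
    pub cont: &'static [Node],
}

/// Toy registry (assumed subset of the command surface).
pub const GRAMMAR: &[Node] = &[
    Node {
        entry: "drop",
        cont: &[
            Node { entry: "column", cont: &[] },
            Node { entry: "relationship", cont: &[] },
            Node { entry: "table", cont: &[] },
        ],
    },
    Node { entry: "save", cont: &[Node { entry: "as", cont: &[] }] },
    Node { entry: "quit", cont: &[] },
];

/// Candidates for the word under the cursor (cursor assumed at end of input).
pub fn complete(input: &str) -> Vec<&'static str> {
    let mut words: Vec<&str> = input.split_whitespace().collect();
    // If the input ends in whitespace, the user is starting a fresh word.
    let partial = if input.ends_with(char::is_whitespace) {
        ""
    } else {
        words.pop().unwrap_or("")
    };

    // Walk the trie along the completed words.
    let mut level: &'static [Node] = GRAMMAR;
    for word in words {
        match level.iter().find(|n| n.entry == word) {
            Some(node) => level = node.cont,
            None => return Vec::new(), // off the grammar: nothing to offer
        }
    }

    // Offer the children of the current node that match the partial word.
    level
        .iter()
        .map(|n| n.entry)
        .filter(|e| e.starts_with(partial))
        .collect()
}
```

The same walk, stopped at the first non-matching word instead of the cursor, is the starting point for `usage_at`.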
+
+### Value-leaf parsers
+
+Literal types (number, string, date, bool) and identifier
+shape validators (new-name checks) remain as small standalone
+functions referenced by leaf nodes. They don't pretend to be
+parser combinators — they're predicate-plus-builder pairs.
+The chumsky machinery they currently use is shed.
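One way to read "predicate-plus-builder pairs" is as a struct of two plain function pointers. The sketch below is an assumption about shape, not the project's code: `LeafParser`, `Value`, and the integer and bool leaves are all hypothetical names.

```rust
/// Typed value produced by a leaf (assumed variants).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Value {
    Integer(i64),
    Bool(bool),
}

/// A leaf parser: a shape predicate paired with a value builder.
pub struct LeafParser {
    pub name: &'static str,
    pub accepts: fn(&str) -> bool,
    pub build: fn(&str) -> Option<Value>,
}

fn is_integer(s: &str) -> bool {
    s.parse::<i64>().is_ok()
}

fn build_integer(s: &str) -> Option<Value> {
    s.parse().ok().map(Value::Integer)
}

fn is_bool(s: &str) -> bool {
    matches!(s, "true" | "false")
}

fn build_bool(s: &str) -> Option<Value> {
    match s {
        "true" => Some(Value::Bool(true)),
        "false" => Some(Value::Bool(false)),
        _ => None,
    }
}

pub const INTEGER: LeafParser = LeafParser {
    name: "integer",
    accepts: is_integer,
    build: build_integer,
};

pub const BOOL: LeafParser = LeafParser {
    name: "bool",
    accepts: is_bool,
    build: build_bool,
};

/// Run the predicate first, then the builder, reporting the leaf's name on failure.
pub fn parse_leaf(leaf: &LeafParser, raw: &str) -> Result<Value, String> {
    if !(leaf.accepts)(raw) {
        return Err(format!("expected {}, got `{raw}`", leaf.name));
    }
    (leaf.build)(raw).ok_or_else(|| format!("could not build {} from `{raw}`", leaf.name))
}
```

Keeping the predicate separate from the builder lets completion and highlighting ask "does this look like an integer?" without constructing a value.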
+
+### i18n integration
+
+The catalog (en-US.yaml) stays. The `keys.rs`
+`KEYS_AND_PLACEHOLDERS` validator stays. The
+`parse.token.keyword.*` entries can collapse to a default
+formatter (every keyword renders as `` `{literal}` `` unless
+the catalog explicitly overrides for a specific keyword).
+Adding a normal keyword no longer requires a catalog entry
+unless its wording deviates from the default.
+
+The grammar tree references `help_id` strings, not catalog
+keys directly, so help wording lives in its own catalog
+section without touching grammar declarations.
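The default-formatter rule could be sketched as a lookup with a fallback. This assumes the catalog has already been loaded into a `HashMap` keyed by full catalog key; the `render_keyword` name and the sample override text are hypothetical.

```rust
use std::collections::HashMap;

/// Render a keyword for display in messages.
///
/// An explicit `parse.token.keyword.<literal>` catalog entry wins;
/// otherwise the keyword is shown as its backtick-quoted literal.
pub fn render_keyword(literal: &str, catalog: &HashMap<String, String>) -> String {
    let key = format!("parse.token.keyword.{literal}");
    match catalog.get(&key) {
        Some(text) => text.clone(),
        None => format!("`{literal}`"),
    }
}
```

With this shape, a newly added keyword renders sensibly with no catalog edit, and only keywords with unusual wording need an entry.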
+
+## Trade-offs
+
+### What we give up
+
+- **chumsky as the DSL parser.** The dependency would stay only
+  if used elsewhere; it isn't, so it is deleted once migration
+  completes. Its recovery and ambiguous-grammar features go
+  unused, so losing access to them costs nothing concrete.
+- **The existing per-command parser functions.** Each
+  command's combinator in `parser.rs` has always been a
+  separate function; in the new model the command's
+  `ast_builder` is colocated with its grammar declaration.
+  The colocation is a gain, but the cost is that every
+  command's current parser function gets rewritten.
+- **No automatic backtracking on alternative branches.**
+  The trie design is greedy (the first child node that
+  matches wins). For our deterministic grammar this is fine
+  — `drop column`, `drop relationship`, `drop table` are
+  disambiguated by their second keyword, so the walker
+  picks the right branch on the first token after `drop`.
+  Pathological grammars that require backtracking are out
+  of scope.
+
+### What we gain
+
+- **One block per command.** Adding a new command =
+  declare a top-level node with its `cont`, `ast_builder`,
+  `dispatch`, and `help_id`. No edits to a separate
+  registry, no edits to a separate catalog list, no edits
+  to a separate dispatch match, no edits to tests (which
+  iterate the registry).
+- **Adding a keyword = one node literal.** No
+  `define_keywords!` macro entry, no `parse.token.keyword.*`
+  catalog entry (default formatter handles it), no
+  `keys.rs` declaration (the same default handles it).
+- **Completion + highlight + parse + usage rendering all
+  come from one source.** Drift is structurally impossible
+  because they all walk the same tree.
+- **Aliases as a single annotation.** A keyword node
+  declares `aliases: &["q"]` and the walker accepts any
+  of them; no new variant, no new dispatch wiring.
+- **Tests focus on behaviour, not enumeration.** Tests
+  that previously asserted on hardcoded keyword lists
+  iterate the registry. Adding/removing a command leaves
+  test code untouched.
+- **Documentation discoverability.** The grammar
+  registry IS the spec. Reading `commandGrammar` tells you
+  every command, every option, every continuation.
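Two of the gains listed above, aliases as an annotation and tests that iterate the registry, can be sketched together. The `Keyword` struct, the `COMMANDS` contents, and both helper functions are assumptions made for the example.

```rust
use std::collections::HashSet;

/// A keyword node with optional alternative spellings.
pub struct Keyword {
    pub literal: &'static str,
    pub aliases: &'static [&'static str],
}

impl Keyword {
    /// True if `word` is the canonical literal or any declared alias.
    pub fn matches(&self, word: &str) -> bool {
        self.literal == word || self.aliases.iter().any(|a| *a == word)
    }
}

/// Toy registry (assumed contents).
pub const COMMANDS: &[Keyword] = &[
    Keyword { literal: "quit", aliases: &["q"] },
    Keyword { literal: "help", aliases: &[] },
    Keyword { literal: "save", aliases: &[] },
];

/// Map any accepted spelling to its canonical keyword.
pub fn resolve(word: &str) -> Option<&'static str> {
    COMMANDS.iter().find(|k| k.matches(word)).map(|k| k.literal)
}

/// Registry-driven invariant: no literal or alias is declared twice.
/// A test asserting this needs no edit when commands are added or removed.
pub fn spellings_are_unique() -> bool {
    let mut seen = HashSet::new();
    COMMANDS
        .iter()
        .flat_map(|k| std::iter::once(k.literal).chain(k.aliases.iter().copied()))
        .all(|s| seen.insert(s))
}
```

Removing the `q` alias in this model is a one-token edit to the `aliases` slice, with no variant, dispatch, or test changes.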
+
+### Migration cost
+
+Estimated at ~4 sessions:
+
+- Session 1: design the walker + registry data model in
+  detail; build a stub with one command migrated end-to-end
+  alongside the existing chumsky path.
+- Session 2: migrate the data-command family (create,
+  drop, add, rename, change, show, insert, update,
+  delete, replay). Tests at each step verify the
+  walker-driven parse produces the same `Command` as the
+  current chumsky parse.
+- Session 3: migrate the app-command family (quit, help,
+  rebuild, save / save as, new, load, export, import,
+  mode, messages). Drop the parallel chumsky path.
+- Session 4: clean up — remove dead modules
+  (`usage.rs::REGISTRY`, expected-set introspection in
+  `completion.rs`, the ad-hoc lookups in `input_render.rs`),
+  remove `keys.rs` entries that the default formatter now
+  covers, simplify the catalog.
+
+Steady-state cost after migration: a new command is one
+block. A new keyword is one node literal. A typo in either
+fails the test suite because tests iterate the registry.
+
+## Why not now
+
+The project has a non-trivial feature backlog (Query DSL,
+constraint management, V-series UX projects, A1 CI workflow,
+multi-line input, readline shortcuts, undo/snapshot, …).
+Doing this refactor now would freeze feature work for ~4
+sessions and would interact disruptively with any in-flight
+ADRs that touch grammar surfaces.
+
+The "scatter cost" is bearable for the near-term command
+count. Most commands are already in place; we're not
+adding ten new ones in the next few sessions. Each new
+command incurs ~7-file scatter; that's a modest
+recurring tax, not a crisis.
+
+The right moment to execute is when:
+
+- Feature backlog quiets down, OR
+- Cumulative scatter cost from new commands becomes
+  visibly painful, OR
+- The grammar needs to extend in ways the current shape
+  fights against (e.g., a real SQL parser landing in
+  advanced mode would benefit from this restructuring
+  more than from another bolt-on).
+
+Until then: leave it.
+
+## Migration plan (when executed)
+
+Per-command migration, not big-bang. The new walker runs
+alongside the chumsky path during the transition. Each
+command is migrated in sequence:
+
+1. Declare the command's grammar node in the new registry.
+2. Write its `ast_builder`. Verify it produces the same
+   `Command` variant as the chumsky parser for every
+   existing test input.
+3. Route the command's entry keyword to the walker. The
+   chumsky parser's branch is gated off for that keyword.
+4. Run the full test suite. If green, commit.
+5. Move to the next command.
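The per-command loop above can be pictured as a router plus an equivalence check. In this sketch both parsers are hand-written stand-ins (the real legacy path is chumsky, which is not used here), and `MIGRATED`, `parse`, and `parsers_agree` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Command {
    Quit,
    Help,
    DropTable(String),
}

/// Stand-in for the existing parser; handles every command.
fn legacy_parse(input: &str) -> Result<Command, String> {
    let words: Vec<&str> = input.split_whitespace().collect();
    match words.as_slice() {
        ["quit"] => Ok(Command::Quit),
        ["help"] => Ok(Command::Help),
        ["drop", "table", name] => Ok(Command::DropTable((*name).to_string())),
        _ => Err(format!("cannot parse `{input}`")),
    }
}

/// Stand-in for the new walker; only migrated commands are declared so far.
fn walker_parse(input: &str) -> Result<Command, String> {
    let words: Vec<&str> = input.split_whitespace().collect();
    match words.as_slice() {
        ["quit"] => Ok(Command::Quit),
        ["help"] => Ok(Command::Help),
        _ => Err(format!("cannot parse `{input}`")),
    }
}

/// Entry keywords already routed to the walker (step 3).
const MIGRATED: &[&str] = &["quit", "help"];

/// Route by entry keyword: migrated commands go to the walker, the rest stay legacy.
pub fn parse(input: &str) -> Result<Command, String> {
    let entry = input.split_whitespace().next().unwrap_or("");
    if MIGRATED.iter().any(|k| *k == entry) {
        walker_parse(input)
    } else {
        legacy_parse(input)
    }
}

/// Step 2: both parsers must produce the same result for every given input.
pub fn parsers_agree(inputs: &[&str]) -> bool {
    inputs.iter().all(|i| legacy_parse(i) == walker_parse(i))
}
```

Migrating the next command then means adding its entry keyword to `MIGRATED` only after `parsers_agree` holds for that command's existing test inputs.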
+
+When all commands are migrated:
+
+1. Delete the chumsky parser combinator module.
+2. Delete the expected-set introspection completion path.
+3. Delete the `UsageEntry` registry.
+4. Simplify `keys.rs` and the catalog per the default-formatter
+   rules.
+5. Delete the chumsky dependency.
+
+Tests cover behaviour throughout — every command has
+existing tests asserting both successful parses and error
+messages. The migration is safe because the test suite
+guards regressions at each step.
+
+## Open questions (resolve at design-pass time)
+
+- Should the registry be a const declaration or built at
+  runtime (e.g., from a static slice)? Const composes with
+  testing; runtime allows reuse across crates. The
+  registry as it stands works as `const`.
+- Should `extractor_role` annotations be mandatory or
+  positional-fallback? Mandatory is explicit; positional
+  is terser. Recommend positional with `extractor_role`
+  as an escape hatch.
+- How do we represent multi-keyword sequences (`save as`,
+  `on delete`)? As a nested `cont` chain, or as a
+  composite keyword? Recommend nested `cont` — it falls
+  out of the same data model and surfaces correctly in
+  completion (after `save`, candidate `as`; after `on`,
+  candidate `delete`).
+- How do we expose grammar to external tooling (LSP,
+  syntax highlighter for editor integration)? The
+  registry as a single `pub static` is trivially
+  introspectable; serialising it to JSON for external
+  consumption is mechanical.
+
+## References
+
+- ADR-0001 — Language and TUI framework (chumsky choice)
+- ADR-0019 — Friendly error layer and i18n catalog
+- ADR-0020 — Tokenization layer for the DSL parser
+- ADR-0021 — Parser-as-source-of-truth for H1a
+- ADR-0022 — Ambient typing assistance (I3 + I4 unified)
+- Round-5 session transcript — design discussion that
+  produced the trie sketch and the critique of the
+  current shape.