add ADR-0023 (proposed): unified declarative grammar tree

Captures the architectural critique surfaced during round-5 manual testing — that adding a keyword or command currently requires edits in 7-10 files across parser, completion, usage registry, catalog, and tests — and the proposed direction: a single declarative trie registry that drives parse, completion, highlight, and usage rendering from one source. Status: Proposed. Not yet accepted. Filename carries the `-proposed-` segment so status is visible at directory-listing time; rename to `0023-unified-grammar-tree.md` on acceptance. Estimated cost: ~4 sessions, per-command migration. Why not now: feature backlog and bearable scatter cost. Right moment to execute when backlog quiets or scatter cost becomes visibly painful.
2026-05-13 22:36:42 +00:00
parent a55b6a7a05
commit d6e138169f
2 changed files with 428 additions and 0 deletions
@@ -0,0 +1,427 @@
 # ADR-0023: Unified declarative grammar tree (proposed direction)
 ## Status
 **Proposed.**
 Not yet accepted. Captures a researched direction for a future
 refactor that supersedes the parts of ADR-0001 (chumsky as the
 DSL parser), ADR-0019 (separated catalog declaration), ADR-0020
 (lexer + keyword macro), ADR-0021 (per-command usage registry),
 and ADR-0022 (completion via expected-set introspection,
 highlighting via lexer) that this ADR identifies as accreted
 rather than designed.
 Filename carries the `-proposed-` segment so the status is
 visible at directory listing time; on acceptance, rename to
 `0023-unified-grammar-tree.md` via `git mv` (history
 preserved).
 ## Context
 ### What hurt
 The round-5 session removed the (small, accidental) `q` alias
 for the `quit` command. Removing one keyword required edits in
 ten places:
 1. `define_keywords!` entry in `src/dsl/keyword.rs`
 2. Parser combinator branch in `src/dsl/parser.rs`
 3. `UsageEntry` row in `src/dsl/usage.rs::REGISTRY`
 4. Hardcoded keyword array in `usage.rs::every_command_has_a_registry_entry` test
 5. Hardcoded keyword array in `usage.rs::entry_keywords_alphabetised_returns_unique_sorted_commands` test
 6. `KEYS_AND_PLACEHOLDERS` declaration in `src/friendly/keys.rs`
 7. `parse.token.keyword.q` entry in `src/friendly/strings/en-US.yaml`
 8. `help.cli_banner` prose in `en-US.yaml`
 9. `help.in_app_body` prose in `en-US.yaml`
 10. Hardcoded keyword array in `completion.rs::empty_input_offers_app_command_entry_keywords` test
 Adding a brand-new app command in the same session required a
 similar number of touches across the same files plus
 `Command` enum extension, dispatch handler wiring, and a
 per-command usage template entry. The pattern is: every new
 command or keyword incurs a cross-file scatter of typically 7
 files for a normal addition, 10+ when tests and help text
 catch the change.
 ### Why this happened
 The current architecture is the accretion of features added
 across separate ADRs, each locally sensible:
 - **ADR-0001** chose chumsky for parsing — a general-purpose
  parser-combinator library oriented at programming-language
  grammars with expression precedence, error recovery, and
  ambiguous-grammar handling.
 - **ADR-0019** introduced the i18n catalog as a flat YAML +
  Rust-side validator (`keys.rs`) for two-sided typo
  protection.
 - **ADR-0020** added the lexer + `define_keywords!` macro
  when the unified-token requirement bit. The macro
  consolidated *keyword* definitions but didn't tackle the
  broader command surface.
 - **ADR-0021** added a per-command `UsageEntry` registry so
  parse errors could surface usage templates.
 - **ADR-0022** added completion by *introspecting* chumsky's
  expected-token set at parse failure points, rather than by
  consulting the grammar declaration directly.
 Each step solved its presenting problem. None of them
 restructured the grammar declaration to be the single source
 of truth for completion + highlighting + parse + help + i18n.
 Requirements for completion, highlighting, help, and i18n
 were all known from project start. A design pass at the
 beginning asking "what unified data structure carries all of
 this?" would not have landed at the current scattered shape.
 The trajectory is a process critique, not an inevitability.
 ### What chumsky earned
 For the DSL we actually built — deterministic prefix-keyword
 commands with a small set of clauses — chumsky's general-purpose
 machinery is mostly unused:
 - We have no arbitrary expression grammar (no arithmetic
  precedence, no function calls, no recursion).
 - We have no multi-error recovery requirement; we fail on the
  first error and ask the user to fix it.
 - We have no ambiguous-grammar handling needs.
 Chumsky's `try_map` custom-error machinery is exercised (for
 "unknown type", "tables need at least one column", flag
 mutual-exclusion). These are pre-shape and post-shape
 validators that fit naturally into chumsky's combinator
 model. They could be expressed equally cleanly as per-node
 validator functions in a trie-based design — the placement
 matters less than the shape.
 ### What a unified grammar would look like
 The proposed structure (sketched by the project owner during
 the round-5 design discussion):
 ```js
 commandGrammar = {
  add: {
    helpId: "add-command",
    shortHelp: "Add structure to a table or relationship",
    highlightType: "top-level-command",
    cont: {
      "1:n": {
        highlightType: "sub-command",
        cont: {
          relationship: {
            cont: {
              from: {
                cont: {
                  // qualified column slot: <Table>.<Column>
                  // bound to parent_table + parent_column
                  // through the command's extractor.
                  ...
                  completionSource: "table-names",
                }
              }
            }
          }
        }
      }
    }
  },
  import: {
    helpId: "import-command",
    cont: {
      completionSource: "current-folder-zip-file-names",
      // …
    }
  },
  // …
 }
 ```
 This declaration carries everything the current system spreads
 across many files: grammar shape (`cont`), completion sources
 (`completionSource`), highlight classes (`highlightType`),
 help references (`helpId`). The same declaration drives
 completion (walk to current node, list its children), syntax
 highlighting (each node's class), parse-error usage rendering
 (walk to failure point, list valid continuations), and AST
 construction (per-command extractor walks the matched path).
 The shape generalises to most SQL surface a teaching tool
 would expose. Where-clauses and similar reusable chunks can
 be named and registered separately, then referenced by ID
 from anywhere they're needed. True expression grammars
 (arithmetic, function calls, precedence climbing) — if
 they're ever needed — fit as opaque leaf nodes whose
 *structure* the trie validates, with the actual interpretation
 delegated to a downstream module or simply passed through to
 the SQL engine.
 ## Proposed direction
 ### Data model
 A single grammar registry, structurally similar to the
 sketch above, declared once in Rust. Per node:
 - `entry: &'static str` — the literal that selects this
  branch (for keyword nodes) OR a typed slot descriptor (for
  literal / identifier / completion-source nodes).
 - `cont: &'static [Node]` — child nodes representing valid
  continuations from this point. Empty for terminal nodes.
 - `highlight: HighlightClass` — colour role for the input
  pane and echo line. Inherits from parent if not specified.
 - `completion_source: Option<CompletionSource>` — for
  identifier slots, the schema-cache key or static list
  that drives Tab candidates and known-set validity checks.
 - `help_id: Option<&'static str>` — reference into the help
  catalog (decoupled from grammar so wording changes don't
  touch grammar).
 - `validator: Option<NodeValidator>` — per-position
  validator function (e.g., "this identifier must be a
  valid new name", or "this slot can occur at most once").
 - `extractor_role: Option<&'static str>` — names the role
  this slot plays in the command's typed output (e.g.,
  `"parent_table"`). Read by the command's extractor at
  AST-construction time. Optional because the *positional*
  shape of a command's tree is usually enough — the
  extractor knows the command's structure and reads the
  walked path in order.
 Per command (top-level node):
 - `ast_builder: fn(WalkedPath) -> Command` — walks the
  matched path and produces the typed AST variant. Replaces
  the per-command chumsky combinator's `.map(...)` closure.
 - `dispatch: fn(&mut App, Command) -> Vec<Action>` — the
  dispatch handler. Replaces the per-command `match` arm in
  `dispatch_input` / `dispatch_app_command`.
 - `help_id` — root help reference for the command family.
 ### Named sub-grammars
 For composable chunks (where-clauses, projection lists,
 qualified column references, value literals), the registry
 supports named sub-grammars:
 ```rust
 register_subgrammar("where_clause", &[
    // structure declaration …
 ]);
 // Referenced from any command:
 SubgrammarRef("where_clause"),
 ```
 The walker treats a `SubgrammarRef` as a transparent
 expansion at parse / completion / highlight time. The
 extractor reads the sub-grammar's matched path and applies
 the sub-grammar's own AST-fragment builder.
 ### Walker functions
 A single walker module exposes:
 - `complete(input, cursor) -> Vec<Candidate>` — walk the
  trie alongside the typed prefix; at the cursor's
  position, return the union of (a) literal children of the
  current node, (b) candidates from the active node's
  `completion_source`. Replaces `completion.rs`.
 - `highlight(input) -> Vec<StyledRun>` — walk producing the
  highlight class per token range. Replaces the
  ad-hoc lookups in `input_render.rs`.
 - `parse(input) -> Result<Command, ParseError>` — walk
  consuming tokens, running per-node `validator`s, applying
  the command's `ast_builder` at completion. Replaces
  `dsl::parser::parse_command`.
 - `usage_at(input, position) -> UsageBlock` — walk to the
  failure point, render the valid continuations as a
  usage template. Replaces `usage::matched_entry`.
 All four operations read the same registry.
 ### Value-leaf parsers
 Literal types (number, string, date, bool) and identifier
 shape validators (new-name checks) remain as small standalone
 functions referenced by leaf nodes. They don't pretend to be
 parser combinators — they're predicate-plus-builder pairs.
 The chumsky machinery they currently use is shed.
 ### i18n integration
 The catalog (en-US.yaml) stays. The `keys.rs`
 `KEYS_AND_PLACEHOLDERS` validator stays. The
 `parse.token.keyword.*` entries can collapse to a default
 formatter (every keyword renders as `` `{literal}` `` unless
 the catalog explicitly overrides for a specific keyword).
 Adding a normal keyword no longer requires a catalog entry
 unless its wording deviates from the default.
 The grammar tree references `help_id` strings, not catalog
 keys directly, so help wording lives in its own catalog
 section without touching grammar declarations.
 ## Trade-offs
 ### What we give up
 - **chumsky as the DSL parser.** Library dependency stays
  if it's still used elsewhere (it isn't, currently). The
  recovery and ambiguous-grammar features go unused, so
  losing access to them costs nothing concrete.
 - **Single-file grammar entry per command.** The current
  per-command combinator in `parser.rs` was always a
  separate function; in the new model the command's
  `ast_builder` is colocated with its grammar declaration.
  This is a gain, not a loss, but it means every command's
  current parser function gets rewritten.
 - **No automatic backtracking on alternative branches.**
  The trie design is greedy (the first child node that
  matches wins). For our deterministic grammar this is fine
  — `drop column`, `drop relationship`, `drop table` are
  disambiguated by their second keyword, so the walker
  picks the right branch on the first token after `drop`.
  Pathological grammars that require backtracking are out
  of scope.
 ### What we gain
 - **One block per command.** Adding a new command =
  declare a top-level node with its `cont`, `ast_builder`,
  `dispatch`, and `help_id`. No edits to a separate
  registry, no edits to a separate catalog list, no edits
  to a separate dispatch match, no edits to tests (which
  iterate the registry).
 - **Adding a keyword = one node literal.** No
  `define_keywords!` macro entry, no `parse.token.keyword.*`
  catalog entry (default formatter handles it), no
  `keys.rs` declaration (the same default handles it).
 - **Completion + highlight + parse + usage rendering all
  come from one source.** Drift is structurally impossible
  because they all walk the same tree.
 - **Aliases as a single annotation.** A keyword node
  declares `aliases: &["q"]` and the walker accepts any
  of them; no new variant, no new dispatch wiring.
 - **Tests focus on behaviour, not enumeration.** Tests
  that previously asserted on hardcoded keyword lists
  iterate the registry. Adding/removing a command leaves
  test code untouched.
 - **Documentation discoverability.** The grammar
  registry IS the spec. Reading `commandGrammar` tells you
  every command, every option, every continuation.
 ### Migration cost
 Estimated at ~4 sessions:
 - Session 1: design the walker + registry data model in
  detail; build a stub with one command migrated end-to-end
  alongside the existing chumsky path.
 - Session 2: migrate the data-command family (create,
  drop, add, rename, change, show, insert, update,
  delete, replay). Tests at each step verify the
  walker-driven parse produces the same `Command` as the
  current chumsky parse.
 - Session 3: migrate the app-command family (quit, help,
  rebuild, save / save as, new, load, export, import,
  mode, messages). Drop the parallel chumsky path.
 - Session 4: clean up — remove dead modules
  (`usage.rs::REGISTRY`, expected-set introspection in
  `completion.rs`, the ad-hoc lookups in `input_render.rs`),
  remove `keys.rs` entries that the default formatter now
  covers, simplify the catalog.
 Steady-state cost after migration: a new command is one
 block. A new keyword is one node literal. A typo in either
 fails the test suite because tests iterate the registry.
 ## Why not now
 The project has a non-trivial feature backlog (Query DSL,
 constraint management, V-series UX projects, A1 CI workflow,
 multi-line input, readline shortcuts, undo/snapshot, …).
 Doing this refactor now would freeze feature work for ~4
 sessions and would interact disruptively with any in-flight
 ADRs that touch grammar surfaces.
 The "scatter cost" is bearable for the near-term command
 count. Most commands are already in place; we're not
 adding ten new ones in the next few sessions. Each new
 command incurs ~7-file scatter; that's a modest
 recurring tax, not a crisis.
 The right moment to execute is when:
 - Feature backlog quiets down, OR
 - Cumulative scatter cost from new commands becomes
  visibly painful, OR
 - The grammar needs to extend in ways the current shape
  fights against (e.g., a real SQL parser landing in
  advanced mode would benefit from this restructuring
  more than from another bolt-on).
 Until then: leave it.
 ## Migration plan (when executed)
 Per-command migration, not big-bang. The new walker runs
 alongside the chumsky path during the transition. Each
 command is migrated in sequence:
 1. Declare the command's grammar node in the new registry.
 2. Write its `ast_builder`. Verify it produces the same
   `Command` variant as the chumsky parser for every
   existing test input.
 3. Route the command's entry keyword to the walker. The
   chumsky parser's branch is gated off for that keyword.
 4. Run the full test suite. If green, commit.
 5. Move to the next command.
 When all commands are migrated:
 1. Delete the chumsky parser combinator module.
 2. Delete the expected-set introspection completion path.
 3. Delete the `UsageEntry` registry.
 4. Simplify `keys.rs` and the catalog per the default-formatter
   rules.
 5. Delete the chumsky dependency.
 Tests cover behaviour throughout — every command has
 existing tests asserting both successful parses and error
 messages. The migration is safe because the test suite
 guards regressions at each step.
 ## Open questions (resolve at design-pass time)
 - Should the registry be a const declaration or built at
  runtime (e.g., from a static slice)? Const composes with
  testing; runtime allows reuse across crates. The
  registry as it stands works as `const`.
 - Should `extractor_role` annotations be mandatory or
  positional-fallback? Mandatory is explicit; positional
  is terser. Recommend positional with `extractor_role`
  as escape hatch.
 - How do we represent multi-keyword sequences (`save as`,
  `on delete`)? As a nested `cont` chain, or as a
  composite keyword? Recommend nested `cont` — it falls
  out of the same data model and surfaces correctly in
  completion (after `save`, candidate `as`; after `on`,
  candidate `delete`).
 - How do we expose grammar to external tooling (LSP,
  syntax highlighter for editor integration)? The
  registry as a single `pub static` is trivially
  introspectable; serialising it to JSON for external
  consumption is mechanical.
 ## References
 - ADR-0001 — Language and TUI framework (chumsky choice)
 - ADR-0019 — Friendly error layer and i18n catalog
 - ADR-0020 — Tokenization layer for the DSL parser
 - ADR-0021 — Parser-as-source-of-truth for H1a
 - ADR-0022 — Ambient typing assistance (I3 + I4 unified)
 - Round-5 session transcript — design discussion that
  produced the trie sketch and the critique of the
  current shape.
@@ -28,3 +28,4 @@ This directory contains the project's ADRs, recorded per
 - [ADR-0020 — Tokenization layer for the DSL parser](0020-tokenization-layer-for-the-dsl-parser.md)
 - [ADR-0021 — Parser-as-source-of-truth for H1a (per-command usage in parse errors)](0021-parser-as-source-of-truth-for-h1a.md)
 - [ADR-0022 — Ambient typing assistance: colour, hint panel, completion (I3 + I4)](0022-ambient-typing-assistance.md)
 - [ADR-0023 — Unified declarative grammar tree](0023-proposed-unified-grammar-tree.md) — **Proposed** (researched direction, not yet accepted)