From d6e138169f58ed0c79a1cdef08b70ff507c538bf Mon Sep 17 00:00:00 2001 From: "claude@clouddev1" Date: Wed, 13 May 2026 22:36:42 +0000 Subject: [PATCH] add ADR-0023 (proposed): unified declarative grammar tree MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Captures the architectural critique surfaced during round-5 manual testing — that adding a keyword or command currently requires edits in 7-10 files across parser, completion, usage registry, catalog, and tests — and the proposed direction: a single declarative trie registry that drives parse, completion, highlight, and usage rendering from one source. Status: Proposed. Not yet accepted. Filename carries the `-proposed-` segment so status is visible at directory-listing time; rename to `0023-unified-grammar-tree.md` on acceptance. Estimated cost: ~4 sessions, per-command migration. Why not now: feature backlog and bearable scatter cost. Right moment to execute when backlog quiets or scatter cost becomes visibly painful. --- .../adr/0023-proposed-unified-grammar-tree.md | 427 ++++++++++++++++++ docs/adr/README.md | 1 + 2 files changed, 428 insertions(+) create mode 100644 docs/adr/0023-proposed-unified-grammar-tree.md diff --git a/docs/adr/0023-proposed-unified-grammar-tree.md b/docs/adr/0023-proposed-unified-grammar-tree.md new file mode 100644 index 0000000..5d2e825 --- /dev/null +++ b/docs/adr/0023-proposed-unified-grammar-tree.md @@ -0,0 +1,427 @@ +# ADR-0023: Unified declarative grammar tree (proposed direction) + +## Status + +**Proposed.** + +Not yet accepted. Captures a researched direction for a future +refactor that supersedes the parts of ADR-0001 (chumsky as the +DSL parser), ADR-0019 (separated catalog declaration), ADR-0020 +(lexer + keyword macro), ADR-0021 (per-command usage registry), +and ADR-0022 (completion via expected-set introspection, +highlighting via lexer) that this ADR identifies as accreted +rather than designed. + +Filename carries the `-proposed-` segment so the status is +visible at directory listing time; on acceptance, rename to +`0023-unified-grammar-tree.md` via `git mv` (history +preserved). + +## Context + +### What hurt + +The round-5 session removed the (small, accidental) `q` alias +for the `quit` command. Removing one keyword required edits in +ten places: + +1. `define_keywords!` entry in `src/dsl/keyword.rs` +2. Parser combinator branch in `src/dsl/parser.rs` +3. `UsageEntry` row in `src/dsl/usage.rs::REGISTRY` +4. Hardcoded keyword array in `usage.rs::every_command_has_a_registry_entry` test +5. Hardcoded keyword array in `usage.rs::entry_keywords_alphabetised_returns_unique_sorted_commands` test +6. `KEYS_AND_PLACEHOLDERS` declaration in `src/friendly/keys.rs` +7. `parse.token.keyword.q` entry in `src/friendly/strings/en-US.yaml` +8. `help.cli_banner` prose in `en-US.yaml` +9. `help.in_app_body` prose in `en-US.yaml` +10. Hardcoded keyword array in `completion.rs::empty_input_offers_app_command_entry_keywords` test + +Adding a brand-new app command in the same session required a +similar number of touches across the same files plus +`Command` enum extension, dispatch handler wiring, and a +per-command usage template entry. The pattern is: every new +command or keyword incurs a cross-file scatter of typically 7 +files for a normal addition, 10+ when tests and help text +catch the change. + +### Why this happened + +The current architecture is the accretion of features added +across separate ADRs, each locally sensible: + +- **ADR-0001** chose chumsky for parsing — a general-purpose + parser-combinator library oriented at programming-language + grammars with expression precedence, error recovery, and + ambiguous-grammar handling. +- **ADR-0019** introduced the i18n catalog as a flat YAML + + Rust-side validator (`keys.rs`) for two-sided typo + protection. +- **ADR-0020** added the lexer + `define_keywords!` macro + when the unified-token requirement bit. The macro + consolidated *keyword* definitions but didn't tackle the + broader command surface. +- **ADR-0021** added a per-command `UsageEntry` registry so + parse errors could surface usage templates. +- **ADR-0022** added completion by *introspecting* chumsky's + expected-token set at parse failure points, rather than by + consulting the grammar declaration directly. + +Each step solved its presenting problem. None of them +restructured the grammar declaration to be the single source +of truth for completion + highlighting + parse + help + i18n. + +Requirements for completion, highlighting, help, and i18n +were all known from project start. A design pass at the +beginning asking "what unified data structure carries all of +this?" would not have landed at the current scattered shape. +The trajectory is a process critique, not an inevitability. + +### What chumsky earned + +For the DSL we actually built — deterministic prefix-keyword +commands with a small set of clauses — chumsky's general-purpose +machinery is mostly unused: + +- We have no arbitrary expression grammar (no arithmetic + precedence, no function calls, no recursion). +- We have no multi-error recovery requirement; we fail on the + first error and ask the user to fix it. +- We have no ambiguous-grammar handling needs. + +Chumsky's `try_map` custom-error machinery is exercised (for +"unknown type", "tables need at least one column", flag +mutual-exclusion). These are pre-shape and post-shape +validators that fit naturally into chumsky's combinator +model. They could be expressed equally cleanly as per-node +validator functions in a trie-based design — the placement +matters less than the shape. + +### What a unified grammar would look like + +The proposed structure (sketched by the project owner during +the round-5 design discussion): + +```js +commandGrammar = { + add: { + helpId: "add-command", + shortHelp: "Add structure to a table or relationship", + highlightType: "top-level-command", + cont: { + "1:n": { + highlightType: "sub-command", + cont: { + relationship: { + cont: { + from: { + cont: { + // qualified column slot: . + // bound to parent_table + parent_column + // through the command's extractor. + ... + completionSource: "table-names", + } + } + } + } + } + } + } + }, + import: { + helpId: "import-command", + cont: { + completionSource: "current-folder-zip-file-names", + // … + } + }, + // … +} +``` + +This declaration carries everything the current system spreads +across many files: grammar shape (`cont`), completion sources +(`completionSource`), highlight classes (`highlightType`), +help references (`helpId`). The same declaration drives +completion (walk to current node, list its children), syntax +highlighting (each node's class), parse-error usage rendering +(walk to failure point, list valid continuations), and AST +construction (per-command extractor walks the matched path). + +The shape generalises to most SQL surface a teaching tool +would expose. Where-clauses and similar reusable chunks can +be named and registered separately, then referenced by ID +from anywhere they're needed. True expression grammars +(arithmetic, function calls, precedence climbing) — if +they're ever needed — fit as opaque leaf nodes whose +*structure* the trie validates, with the actual interpretation +delegated to a downstream module or simply passed through to +the SQL engine. + +## Proposed direction + +### Data model + +A single grammar registry, structurally similar to the +sketch above, declared once in Rust. Per node: + +- `entry: &'static str` — the literal that selects this + branch (for keyword nodes) OR a typed slot descriptor (for + literal / identifier / completion-source nodes). +- `cont: &'static [Node]` — child nodes representing valid + continuations from this point. Empty for terminal nodes. +- `highlight: HighlightClass` — colour role for the input + pane and echo line. Inherits from parent if not specified. +- `completion_source: Option` — for + identifier slots, the schema-cache key or static list + that drives Tab candidates and known-set validity checks. +- `help_id: Option<&'static str>` — reference into the help + catalog (decoupled from grammar so wording changes don't + touch grammar). +- `validator: Option` — per-position + validator function (e.g., "this identifier must be a + valid new name", or "this slot can occur at most once"). +- `extractor_role: Option<&'static str>` — names the role + this slot plays in the command's typed output (e.g., + `"parent_table"`). Read by the command's extractor at + AST-construction time. Optional because the *positional* + shape of a command's tree is usually enough — the + extractor knows the command's structure and reads the + walked path in order. + +Per command (top-level node): + +- `ast_builder: fn(WalkedPath) -> Command` — walks the + matched path and produces the typed AST variant. Replaces + the per-command chumsky combinator's `.map(...)` closure. +- `dispatch: fn(&mut App, Command) -> Vec` — the + dispatch handler. Replaces the per-command `match` arm in + `dispatch_input` / `dispatch_app_command`. +- `help_id` — root help reference for the command family. + +### Named sub-grammars + +For composable chunks (where-clauses, projection lists, +qualified column references, value literals), the registry +supports named sub-grammars: + +```rust +register_subgrammar("where_clause", &[ + // structure declaration … +]); + +// Referenced from any command: +SubgrammarRef("where_clause"), +``` + +The walker treats a `SubgrammarRef` as a transparent +expansion at parse / completion / highlight time. The +extractor reads the sub-grammar's matched path and applies +the sub-grammar's own AST-fragment builder. + +### Walker functions + +A single walker module exposes: + +- `complete(input, cursor) -> Vec` — walk the + trie alongside the typed prefix; at the cursor's + position, return the union of (a) literal children of the + current node, (b) candidates from the active node's + `completion_source`. Replaces `completion.rs`. +- `highlight(input) -> Vec` — walk producing the + highlight class per token range. Replaces the + ad-hoc lookups in `input_render.rs`. +- `parse(input) -> Result` — walk + consuming tokens, running per-node `validator`s, applying + the command's `ast_builder` at completion. Replaces + `dsl::parser::parse_command`. +- `usage_at(input, position) -> UsageBlock` — walk to the + failure point, render the valid continuations as a + usage template. Replaces `usage::matched_entry`. + +All four operations read the same registry. + +### Value-leaf parsers + +Literal types (number, string, date, bool) and identifier +shape validators (new-name checks) remain as small standalone +functions referenced by leaf nodes. They don't pretend to be +parser combinators — they're predicate-plus-builder pairs. +The chumsky machinery they currently use is shed. + +### i18n integration + +The catalog (en-US.yaml) stays. The `keys.rs` +`KEYS_AND_PLACEHOLDERS` validator stays. The +`parse.token.keyword.*` entries can collapse to a default +formatter (every keyword renders as `` `{literal}` `` unless +the catalog explicitly overrides for a specific keyword). +Adding a normal keyword no longer requires a catalog entry +unless its wording deviates from the default. + +The grammar tree references `help_id` strings, not catalog +keys directly, so help wording lives in its own catalog +section without touching grammar declarations. + +## Trade-offs + +### What we give up + +- **chumsky as the DSL parser.** Library dependency stays + if it's still used elsewhere (it isn't, currently). The + recovery and ambiguous-grammar features go unused, so + losing access to them costs nothing concrete. +- **Single-file grammar entry per command.** The current + per-command combinator in `parser.rs` was always a + separate function; in the new model the command's + `ast_builder` is colocated with its grammar declaration. + This is a gain, not a loss, but it means every command's + current parser function gets rewritten. +- **No automatic backtracking on alternative branches.** + The trie design is greedy (the first child node that + matches wins). For our deterministic grammar this is fine + — `drop column`, `drop relationship`, `drop table` are + disambiguated by their second keyword, so the walker + picks the right branch on the first token after `drop`. + Pathological grammars that require backtracking are out + of scope. + +### What we gain + +- **One block per command.** Adding a new command = + declare a top-level node with its `cont`, `ast_builder`, + `dispatch`, and `help_id`. No edits to a separate + registry, no edits to a separate catalog list, no edits + to a separate dispatch match, no edits to tests (which + iterate the registry). +- **Adding a keyword = one node literal.** No + `define_keywords!` macro entry, no `parse.token.keyword.*` + catalog entry (default formatter handles it), no + `keys.rs` declaration (the same default handles it). +- **Completion + highlight + parse + usage rendering all + come from one source.** Drift is structurally impossible + because they all walk the same tree. +- **Aliases as a single annotation.** A keyword node + declares `aliases: &["q"]` and the walker accepts any + of them; no new variant, no new dispatch wiring. +- **Tests focus on behaviour, not enumeration.** Tests + that previously asserted on hardcoded keyword lists + iterate the registry. Adding/removing a command leaves + test code untouched. +- **Documentation discoverability.** The grammar + registry IS the spec. Reading `commandGrammar` tells you + every command, every option, every continuation. + +### Migration cost + +Estimated at ~4 sessions: + +- Session 1: design the walker + registry data model in + detail; build a stub with one command migrated end-to-end + alongside the existing chumsky path. +- Session 2: migrate the data-command family (create, + drop, add, rename, change, show, insert, update, + delete, replay). Tests at each step verify the + walker-driven parse produces the same `Command` as the + current chumsky parse. +- Session 3: migrate the app-command family (quit, help, + rebuild, save / save as, new, load, export, import, + mode, messages). Drop the parallel chumsky path. +- Session 4: clean up — remove dead modules + (`usage.rs::REGISTRY`, expected-set introspection in + `completion.rs`, the ad-hoc lookups in `input_render.rs`), + remove `keys.rs` entries that the default formatter now + covers, simplify the catalog. + +Steady-state cost after migration: a new command is one +block. A new keyword is one node literal. A typo in either +fails the test suite because tests iterate the registry. + +## Why not now + +The project has a non-trivial feature backlog (Query DSL, +constraint management, V-series UX projects, A1 CI workflow, +multi-line input, readline shortcuts, undo/snapshot, …). +Doing this refactor now would freeze feature work for ~4 +sessions and would interact disruptively with any in-flight +ADRs that touch grammar surfaces. + +The "scatter cost" is bearable for the near-term command +count. Most commands are already in place; we're not +adding ten new ones in the next few sessions. Each new +command incurs ~7-file scatter; that's a modest +recurring tax, not a crisis. + +The right moment to execute is when: + +- Feature backlog quiets down, OR +- Cumulative scatter cost from new commands becomes + visibly painful, OR +- The grammar needs to extend in ways the current shape + fights against (e.g., a real SQL parser landing in + advanced mode would benefit from this restructuring + more than from another bolt-on). + +Until then: leave it. + +## Migration plan (when executed) + +Per-command migration, not big-bang. The new walker runs +alongside the chumsky path during the transition. Each +command is migrated in sequence: + +1. Declare the command's grammar node in the new registry. +2. Write its `ast_builder`. Verify it produces the same + `Command` variant as the chumsky parser for every + existing test input. +3. Route the command's entry keyword to the walker. The + chumsky parser's branch is gated off for that keyword. +4. Run the full test suite. If green, commit. +5. Move to the next command. + +When all commands are migrated: + +1. Delete the chumsky parser combinator module. +2. Delete the expected-set introspection completion path. +3. Delete the `UsageEntry` registry. +4. Simplify `keys.rs` and the catalog per the default-formatter + rules. +5. Delete the chumsky dependency. + +Tests cover behaviour throughout — every command has +existing tests asserting both successful parses and error +messages. The migration is safe because the test suite +guards regressions at each step. + +## Open questions (resolve at design-pass time) + +- Should the registry be a const declaration or built at + runtime (e.g., from a static slice)? Const composes with + testing; runtime allows reuse across crates. The + registry as it stands works as `const`. +- Should `extractor_role` annotations be mandatory or + positional-fallback? Mandatory is explicit; positional + is terser. Recommend positional with `extractor_role` + as escape hatch. +- How do we represent multi-keyword sequences (`save as`, + `on delete`)? As a nested `cont` chain, or as a + composite keyword? Recommend nested `cont` — it falls + out of the same data model and surfaces correctly in + completion (after `save`, candidate `as`; after `on`, + candidate `delete`). +- How do we expose grammar to external tooling (LSP, + syntax highlighter for editor integration)? The + registry as a single `pub static` is trivially + introspectable; serialising it to JSON for external + consumption is mechanical. + +## References + +- ADR-0001 — Language and TUI framework (chumsky choice) +- ADR-0019 — Friendly error layer and i18n catalog +- ADR-0020 — Tokenization layer for the DSL parser +- ADR-0021 — Parser-as-source-of-truth for H1a +- ADR-0022 — Ambient typing assistance (I3 + I4 unified) +- Round-5 session transcript — design discussion that + produced the trie sketch and the critique of the + current shape. diff --git a/docs/adr/README.md b/docs/adr/README.md index e75082d..4a410f8 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -28,3 +28,4 @@ This directory contains the project's ADRs, recorded per - [ADR-0020 — Tokenization layer for the DSL parser](0020-tokenization-layer-for-the-dsl-parser.md) - [ADR-0021 — Parser-as-source-of-truth for H1a (per-command usage in parse errors)](0021-parser-as-source-of-truth-for-h1a.md) - [ADR-0022 — Ambient typing assistance: colour, hint panel, completion (I3 + I4)](0022-ambient-typing-assistance.md) +- [ADR-0023 — Unified declarative grammar tree](0023-proposed-unified-grammar-tree.md) — **Proposed** (researched direction, not yet accepted)