Files
rdbms-playground/docs/adr/0023-proposed-unified-grammar-tree.md
T
claude@clouddev1 d6e138169f add ADR-0023 (proposed): unified declarative grammar tree
Captures the architectural critique surfaced during round-5
manual testing — that adding a keyword or command currently
requires edits in 7-10 files across parser, completion, usage
registry, catalog, and tests — and the proposed direction: a
single declarative trie registry that drives parse, completion,
highlight, and usage rendering from one source.

Status: Proposed. Not yet accepted. Filename carries the
`-proposed-` segment so status is visible at directory-listing
time; rename to `0023-unified-grammar-tree.md` on acceptance.

Estimated cost: ~4 sessions, per-command migration. Why not
now: feature backlog and bearable scatter cost. Right moment
to execute when backlog quiets or scatter cost becomes
visibly painful.
2026-05-13 22:36:42 +00:00

16 KiB

ADR-0023: Unified declarative grammar tree (proposed direction)

Status

Proposed.

Not yet accepted. Captures a researched direction for a future refactor that supersedes the parts of ADR-0001 (chumsky as the DSL parser), ADR-0019 (separated catalog declaration), ADR-0020 (lexer + keyword macro), ADR-0021 (per-command usage registry), and ADR-0022 (completion via expected-set introspection, highlighting via lexer) that this ADR identifies as accreted rather than designed.

Filename carries the -proposed- segment so the status is visible at directory listing time; on acceptance, rename to 0023-unified-grammar-tree.md via git mv (history preserved).

Context

What hurt

The round-5 session removed the (small, accidental) q alias for the quit command. Removing one keyword required edits in ten places:

  1. define_keywords! entry in src/dsl/keyword.rs
  2. Parser combinator branch in src/dsl/parser.rs
  3. UsageEntry row in src/dsl/usage.rs::REGISTRY
  4. Hardcoded keyword array in usage.rs::every_command_has_a_registry_entry test
  5. Hardcoded keyword array in usage.rs::entry_keywords_alphabetised_returns_unique_sorted_commands test
  6. KEYS_AND_PLACEHOLDERS declaration in src/friendly/keys.rs
  7. parse.token.keyword.q entry in src/friendly/strings/en-US.yaml
  8. help.cli_banner prose in en-US.yaml
  9. help.in_app_body prose in en-US.yaml
  10. Hardcoded keyword array in completion.rs::empty_input_offers_app_command_entry_keywords test

Adding a brand-new app command in the same session required a similar number of touches across the same files plus Command enum extension, dispatch handler wiring, and a per-command usage template entry. The pattern is: every new command or keyword incurs a cross-file scatter of typically 7 files for a normal addition, 10+ when tests and help text catch the change.

Why this happened

The current architecture is the accretion of features added across separate ADRs, each locally sensible:

  • ADR-0001 chose chumsky for parsing — a general-purpose parser-combinator library oriented at programming-language grammars with expression precedence, error recovery, and ambiguous-grammar handling.
  • ADR-0019 introduced the i18n catalog as a flat YAML + Rust-side validator (keys.rs) for two-sided typo protection.
  • ADR-0020 added the lexer + define_keywords! macro when the unified-token requirement bit. The macro consolidated keyword definitions but didn't tackle the broader command surface.
  • ADR-0021 added a per-command UsageEntry registry so parse errors could surface usage templates.
  • ADR-0022 added completion by introspecting chumsky's expected-token set at parse failure points, rather than by consulting the grammar declaration directly.

Each step solved its presenting problem. None of them restructured the grammar declaration to be the single source of truth for completion + highlighting + parse + help + i18n.

Requirements for completion, highlighting, help, and i18n were all known from project start. A design pass at the beginning asking "what unified data structure carries all of this?" would not have landed at the current scattered shape. The trajectory is a process critique, not an inevitability.

What chumsky earned

For the DSL we actually built — deterministic prefix-keyword commands with a small set of clauses — chumsky's general-purpose machinery is mostly unused:

  • We have no arbitrary expression grammar (no arithmetic precedence, no function calls, no recursion).
  • We have no multi-error recovery requirement; we fail on the first error and ask the user to fix it.
  • We have no ambiguous-grammar handling needs.

Chumsky's try_map custom-error machinery is exercised (for "unknown type", "tables need at least one column", flag mutual-exclusion). These are pre-shape and post-shape validators that fit naturally into chumsky's combinator model. They could be expressed equally cleanly as per-node validator functions in a trie-based design — the placement matters less than the shape.

What a unified grammar would look like

The proposed structure (sketched by the project owner during the round-5 design discussion):

commandGrammar = {
  add: {
    helpId: "add-command",
    shortHelp: "Add structure to a table or relationship",
    highlightType: "top-level-command",
    cont: {
      "1:n": {
        highlightType: "sub-command",
        cont: {
          relationship: {
            cont: {
              from: {
                cont: {
                  // qualified column slot: <Table>.<Column>
                  // bound to parent_table + parent_column
                  // through the command's extractor.
                  ...
                  completionSource: "table-names",
                }
              }
            }
          }
        }
      }
    }
  },
  import: {
    helpId: "import-command",
    cont: {
      completionSource: "current-folder-zip-file-names",
      // …
    }
  },
  // …
}

This declaration carries everything the current system spreads across many files: grammar shape (cont), completion sources (completionSource), highlight classes (highlightType), help references (helpId). The same declaration drives completion (walk to current node, list its children), syntax highlighting (each node's class), parse-error usage rendering (walk to failure point, list valid continuations), and AST construction (per-command extractor walks the matched path).

The shape generalises to most SQL surface a teaching tool would expose. Where-clauses and similar reusable chunks can be named and registered separately, then referenced by ID from anywhere they're needed. True expression grammars (arithmetic, function calls, precedence climbing) — if they're ever needed — fit as opaque leaf nodes whose structure the trie validates, with the actual interpretation delegated to a downstream module or simply passed through to the SQL engine.

Proposed direction

Data model

A single grammar registry, structurally similar to the sketch above, declared once in Rust. Per node:

  • entry: &'static str — the literal that selects this branch (for keyword nodes) OR a typed slot descriptor (for literal / identifier / completion-source nodes).
  • cont: &'static [Node] — child nodes representing valid continuations from this point. Empty for terminal nodes.
  • highlight: HighlightClass — colour role for the input pane and echo line. Inherits from parent if not specified.
  • completion_source: Option<CompletionSource> — for identifier slots, the schema-cache key or static list that drives Tab candidates and known-set validity checks.
  • help_id: Option<&'static str> — reference into the help catalog (decoupled from grammar so wording changes don't touch grammar).
  • validator: Option<NodeValidator> — per-position validator function (e.g., "this identifier must be a valid new name", or "this slot can occur at most once").
  • extractor_role: Option<&'static str> — names the role this slot plays in the command's typed output (e.g., "parent_table"). Read by the command's extractor at AST-construction time. Optional because the positional shape of a command's tree is usually enough — the extractor knows the command's structure and reads the walked path in order.

Per command (top-level node):

  • ast_builder: fn(WalkedPath) -> Command — walks the matched path and produces the typed AST variant. Replaces the per-command chumsky combinator's .map(...) closure.
  • dispatch: fn(&mut App, Command) -> Vec<Action> — the dispatch handler. Replaces the per-command match arm in dispatch_input / dispatch_app_command.
  • help_id — root help reference for the command family.

Named sub-grammars

For composable chunks (where-clauses, projection lists, qualified column references, value literals), the registry supports named sub-grammars:

register_subgrammar("where_clause", &[
    // structure declaration …
]);

// Referenced from any command:
SubgrammarRef("where_clause"),

The walker treats a SubgrammarRef as a transparent expansion at parse / completion / highlight time. The extractor reads the sub-grammar's matched path and applies the sub-grammar's own AST-fragment builder.

Walker functions

A single walker module exposes:

  • complete(input, cursor) -> Vec<Candidate> — walk the trie alongside the typed prefix; at the cursor's position, return the union of (a) literal children of the current node, (b) candidates from the active node's completion_source. Replaces completion.rs.
  • highlight(input) -> Vec<StyledRun> — walk producing the highlight class per token range. Replaces the ad-hoc lookups in input_render.rs.
  • parse(input) -> Result<Command, ParseError> — walk consuming tokens, running per-node validators, applying the command's ast_builder at completion. Replaces dsl::parser::parse_command.
  • usage_at(input, position) -> UsageBlock — walk to the failure point, render the valid continuations as a usage template. Replaces usage::matched_entry.

All four operations read the same registry.

Value-leaf parsers

Literal types (number, string, date, bool) and identifier shape validators (new-name checks) remain as small standalone functions referenced by leaf nodes. They don't pretend to be parser combinators — they're predicate-plus-builder pairs. The chumsky machinery they currently use is shed.

i18n integration

The catalog (en-US.yaml) stays. The keys.rs KEYS_AND_PLACEHOLDERS validator stays. The parse.token.keyword.* entries can collapse to a default formatter (every keyword renders as `{literal}` unless the catalog explicitly overrides for a specific keyword). Adding a normal keyword no longer requires a catalog entry unless its wording deviates from the default.

The grammar tree references help_id strings, not catalog keys directly, so help wording lives in its own catalog section without touching grammar declarations.

Trade-offs

What we give up

  • chumsky as the DSL parser. Library dependency stays if it's still used elsewhere (it isn't, currently). The recovery and ambiguous-grammar features go unused, so losing access to them costs nothing concrete.
  • Single-file grammar entry per command. The current per-command combinator in parser.rs was always a separate function; in the new model the command's ast_builder is colocated with its grammar declaration. This is a gain, not a loss, but it means every command's current parser function gets rewritten.
  • No automatic backtracking on alternative branches. The trie design is greedy (the first child node that matches wins). For our deterministic grammar this is fine — drop column, drop relationship, drop table are disambiguated by their second keyword, so the walker picks the right branch on the first token after drop. Pathological grammars that require backtracking are out of scope.

What we gain

  • One block per command. Adding a new command = declare a top-level node with its cont, ast_builder, dispatch, and help_id. No edits to a separate registry, no edits to a separate catalog list, no edits to a separate dispatch match, no edits to tests (which iterate the registry).
  • Adding a keyword = one node literal. No define_keywords! macro entry, no parse.token.keyword.* catalog entry (default formatter handles it), no keys.rs declaration (the same default handles it).
  • Completion + highlight + parse + usage rendering all come from one source. Drift is structurally impossible because they all walk the same tree.
  • Aliases as a single annotation. A keyword node declares aliases: &["q"] and the walker accepts any of them; no new variant, no new dispatch wiring.
  • Tests focus on behaviour, not enumeration. Tests that previously asserted on hardcoded keyword lists iterate the registry. Adding/removing a command leaves test code untouched.
  • Documentation discoverability. The grammar registry IS the spec. Reading commandGrammar tells you every command, every option, every continuation.

Migration cost

Estimated at ~4 sessions:

  • Session 1: design the walker + registry data model in detail; build a stub with one command migrated end-to-end alongside the existing chumsky path.
  • Session 2: migrate the data-command family (create, drop, add, rename, change, show, insert, update, delete, replay). Tests at each step verify the walker-driven parse produces the same Command as the current chumsky parse.
  • Session 3: migrate the app-command family (quit, help, rebuild, save / save as, new, load, export, import, mode, messages). Drop the parallel chumsky path.
  • Session 4: clean up — remove dead modules (usage.rs::REGISTRY, expected-set introspection in completion.rs, the ad-hoc lookups in input_render.rs), remove keys.rs entries that the default formatter now covers, simplify the catalog.

Steady-state cost after migration: a new command is one block. A new keyword is one node literal. A typo in either fails the test suite because tests iterate the registry.

Why not now

The project has a non-trivial feature backlog (Query DSL, constraint management, V-series UX projects, A1 CI workflow, multi-line input, readline shortcuts, undo/snapshot, …). Doing this refactor now would freeze feature work for ~4 sessions and would interact disruptively with any in-flight ADRs that touch grammar surfaces.

The "scatter cost" is bearable for the near-term command count. Most commands are already in place; we're not adding ten new ones in the next few sessions. Each new command incurs ~7-file scatter; that's a modest recurring tax, not a crisis.

The right moment to execute is when:

  • Feature backlog quiets down, OR
  • Cumulative scatter cost from new commands becomes visibly painful, OR
  • The grammar needs to extend in ways the current shape fights against (e.g., a real SQL parser landing in advanced mode would benefit from this restructuring more than from another bolt-on).

Until then: leave it.

Migration plan (when executed)

Per-command migration, not big-bang. The new walker runs alongside the chumsky path during the transition. Each command is migrated in sequence:

  1. Declare the command's grammar node in the new registry.
  2. Write its ast_builder. Verify it produces the same Command variant as the chumsky parser for every existing test input.
  3. Route the command's entry keyword to the walker. The chumsky parser's branch is gated off for that keyword.
  4. Run the full test suite. If green, commit.
  5. Move to the next command.

When all commands are migrated:

  1. Delete the chumsky parser combinator module.
  2. Delete the expected-set introspection completion path.
  3. Delete the UsageEntry registry.
  4. Simplify keys.rs and the catalog per the default-formatter rules.
  5. Delete the chumsky dependency.

Tests cover behaviour throughout — every command has existing tests asserting both successful parses and error messages. The migration is safe because the test suite guards regressions at each step.

Open questions (resolve at design-pass time)

  • Should the registry be a const declaration or built at runtime (e.g., from a static slice)? Const composes with testing; runtime allows reuse across crates. The registry as it stands works as const.
  • Should extractor_role annotations be mandatory or positional-fallback? Mandatory is explicit; positional is terser. Recommend positional with extractor_role as escape hatch.
  • How do we represent multi-keyword sequences (save as, on delete)? As a nested cont chain, or as a composite keyword? Recommend nested cont — it falls out of the same data model and surfaces correctly in completion (after save, candidate as; after on, candidate delete).
  • How do we expose grammar to external tooling (LSP, syntax highlighter for editor integration)? The registry as a single pub static is trivially introspectable; serialising it to JSON for external consumption is mechanical.

References

  • ADR-0001 — Language and TUI framework (chumsky choice)
  • ADR-0019 — Friendly error layer and i18n catalog
  • ADR-0020 — Tokenization layer for the DSL parser
  • ADR-0021 — Parser-as-source-of-truth for H1a
  • ADR-0022 — Ambient typing assistance (I3 + I4 unified)
  • Round-5 session transcript — design discussion that produced the trie sketch and the critique of the current shape.