diff --git a/docs/adr/0020-tokenization-layer-for-the-dsl-parser.md b/docs/adr/0020-tokenization-layer-for-the-dsl-parser.md new file mode 100644 index 0000000..fab1700 --- /dev/null +++ b/docs/adr/0020-tokenization-layer-for-the-dsl-parser.md @@ -0,0 +1,638 @@ +# ADR-0020: Tokenization layer for the DSL parser + +## Status + +Accepted. + +Amends ADR-0001 (language and TUI framework) by adding a +tokenization layer between the source string and the chumsky +grammar. The chumsky choice itself is unchanged; what changes +is the input to chumsky — `&[Token]` instead of `&str`. + +Foundation for ADR-0021 (parser-as-source-of-truth pedagogy / +H1a). Also the load-bearing piece for I3 (tab completion) and +I4 (syntax highlighting) once those land. + +## Context + +### What the user sees today + +Typing `create` at the prompt produces: + +``` +parse error: after `create`, expected `table` +``` + +Typing `frobulate Customers` produces: + +``` +parse error: expected `create`, found `frobulate` +``` + +Both are technically true and both are bad. The first hides +that `create` is the entry to a single command (`create +table …`) that the user almost certainly does not yet know +the shape of. The second points at one of ten possible +command-starting keywords and silently picks one. A user +hitting either error gets a sharper "what comes next?" answer +from a 1980s `lex`/`yacc` toolchain than from this app. + +### Two technical roots + +The bad UX traces to two parser-level decisions, both +correctable but not without architectural change: + +1. **`keyword_ci` emits `Rich::custom` errors instead of + structural ones.** Chumsky's `choice` combinator merges the + `expected` sets of its alternatives when they all fail — + that is exactly what produces "expected `data` or `table`" + instead of just "expected `table`". Custom errors don't + participate in that merge: only one wins, deterministically + the first. So our top-level `choice((create_table, drop_*, + add_*, …))` collapses to whichever branch reported its + custom error first, throwing the others away. + +2. **The parser has no concept of a "command" beyond its + chumsky combinator graph.** There is no place to attach a + per-command grammar template (the "what does `create + table` actually want?" answer). Adding one is possible but + awkward without an explicit handle on "the entry token + for this command". + +ADR-0021 addresses (2). This ADR addresses (1) by removing +the underlying cause: `keyword_ci`'s `try_map → Rich::custom` +shape only exists because the parser operates on raw +characters and has to hand-write keyword recognition. Over a +token stream, a keyword is just a token, and `just(...)` over +tokens aggregates naturally. + +### Bespoke machinery in `parser.rs` + +`parser.rs` carries ~180 lines of error-handling helpers +(`humanise()`, `consumed_context()`, `oxford_or()`, +`describe_pattern()`, `describe_char()`, `format_expected_found()`, +`first_custom_message()`, the `prefer-custom-over-structural` +selection in `into_parse_error()`). Most of it exists because +chumsky-over-`&str` produces character-level error patterns +that need humanising before a user can read them +(`RichPattern::Token('e')` → "`e`" is not what the user +needed to know). + +Over a token stream, the patterns are token kinds — `Keyword(Create)`, +`Identifier`, `Punct(Colon)` — which render directly via the +i18n catalog without per-character translation. Roughly half +of that machinery dissolves. + +### Honest history + +ADR-0001 chose Rust + chumsky for the DSL. It did not name +"no lexer" as a design choice — the no-lexer shape grew +incrementally inside `parser.rs` without an ADR. + +The known requirements at the time included **H1a** (friendly +error layer for parse errors), **I3** (tab completion), and +**I4** (syntax highlighting). All three are easier with a +token stream than without one — H1a needs aggregated expected +sets, I3 needs "what tokens are valid at cursor", I4 needs +token classification on potentially-invalid input. The +lexer-shape question should have been considered against +these requirements when the parser was built. It wasn't. + +The user has previously raised this — pointing out that +`lex`/`yacc` already handled this kind of error reporting +better, and asking why we weren't getting comparable +behaviour. That observation was correct on the merits and +should have triggered an ADR amendment then. ADR-0020 is that +amendment, late but in the right place. + +This is not a "we now realise" framing. It is a "we should +have decided this earlier and didn't" framing. The cost of +acting on it now is real but bounded; the cost of acting on +it after the query DSL or constraint-management commands have +landed would be substantially larger. We have no users; the +agility cost of refactoring is at its lowest. + +### What is and isn't this ADR + +This ADR is purely about the **input layer to the DSL +parser**. It does not specify per-command usage rendering, +catalog keys for parse-error wording, or the renderer +composition of caret + structural error + usage hint. Those +are ADR-0021's scope. The two ADRs share an implementation +session; the dependency runs one way (ADR-0021 needs +ADR-0020's tokens). + +This ADR also does not touch SQL parsing in advanced mode. +That path uses `sqlparser-rs` (per ADR-0001) and has its own +tokenization built in. A future ADR for the SQL subset (Q4) +will decide whether to share the DSL lexer's token model or +keep `sqlparser-rs`'s token surface as-is — that's not +prejudged here. + +## Decision + +### 1. Two-phase parse: lex → parse + +```rust +pub fn parse_command(input: &str) -> Result { + let tokens = lex(input)?; // Stage 1 + parse_tokens(&tokens, input) // Stage 2 +} +``` + +Stage 1 (`lex`) produces a span-tagged token stream. Stage 2 +(`parse_tokens`) is a chumsky parser whose input type is +`&[Token]` instead of `&str`. The two stages are separately +testable; the lexer has its own test surface that doesn't +exercise the parser. + +`parse_tokens` takes the original `&str` as a second +argument purely so the parser can consume bare path arguments +for the `replay` command directly from source (see §6). All +other parser logic operates over the token slice. + +### 2. Token model + +```rust +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct Token { + pub kind: TokenKind, + pub span: Span, // (start, end) byte offsets +} + +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum TokenKind { + Keyword(Keyword), // case-folded reserved word + Identifier(String), // case-preserving (ADR-0009) + Number(String), // raw text; numeric parse is a parser concern + StringLiteral(String), // unquoted, escapes processed + Punct(Punct), // : ( ) , = . - one char each + Flag(String), // --name (e.g. "--all-rows") + Error(LexError), // unrecognised char / unterminated string + // — a token kind, not a Result variant +} + +// Keyword is a closed set declared via a macro that is the +// single source of truth (§2a). Punct follows the same pattern. +``` + +The `Keyword` and `Punct` enums are not hand-declared. They +come out of macros described in §2a — one declaration site +that generates the enum, the lex-side string→variant +mapping, the variant→literal rendering, and the catalog-key +derivation in one place. + +Notes on the model: + +- **Keywords are an enum, not a string.** This makes + `kw(Keyword::Create)` a single-token chumsky match with + exact identity — fastest path, cleanest error patterns, + no string allocation in the hot path. +- **Type names stay as identifiers.** `int`, `text`, + `serial`, `varchar` all lex as `Identifier(_)`. The + parser-level `type_keyword()` continues to call + `Type::from_str` which produces the existing "unknown type + 'varchar' (expected one of: text, int, real, …)" + message. That custom-error path is correct and stays. The + closed-set Keyword enum is for the *grammar's* reserved + words — words whose presence determines which command is + being parsed. Type names are *content*. +- **Number is raw text, not parsed.** `Value::Number(String)` + per ADR-0014 stays string-backed; the lexer doesn't try to + validate or convert. `1`, `-3.14`, `1e10`, `1abc` all + produce candidates for the parser to decide on. (The + current parser rejects `1abc` already; that doesn't change.) +- **`StringLiteral` is post-escape.** The lexer processes + `''` → `'` per the existing string syntax. The original + span covers the surrounding quotes; the payload is the + unescaped content. (Same convention as the current + `string_literal()` parser.) +- **`Flag(String)` is `--name` exactly.** No further parsing. + The parser matches `Flag(s)` and checks `s == "all-rows"` + etc. against a small set. +- **`Error(_)` is a token kind, not a `Result` variant.** + Lex always succeeds in producing `Vec` — even on + unterminated strings or unrecognised characters, a + diagnostic Error token is emitted in place. Rationale: I4 + (syntax highlighting) needs to colour partial / invalid + input; treating lex errors as data instead of control flow + lets the highlighter walk a token stream uniformly. The + parser sees `Error(_)` tokens and raises a structural + error that names the underlying cause via the catalog. + +### 2a. Single source of truth for keywords and punctuation + +The `Keyword` enum, the lexer's reserved-word table, the +variant→literal rendering, and the catalog-key derivation +all come from one declaration. A `define_keywords!` +`macro_rules!` invocation in `src/dsl/keyword.rs`: + +```rust +macro_rules! define_keywords { + ( $( $variant:ident => $literal:literal ),+ $(,)? ) => { + #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] + pub enum Keyword { $( $variant ),+ } + + impl Keyword { + pub const ALL: &'static [(Keyword, &'static str)] = &[ + $( (Keyword::$variant, $literal) ),+ + ]; + + /// Lex-side mapping. Case-insensitive per ADR-0009. + pub fn from_word(s: &str) -> Option { + Self::ALL.iter() + .find(|(_, lit)| s.eq_ignore_ascii_case(lit)) + .map(|(kw, _)| *kw) + } + + /// Canonical lowercase literal for this variant. + pub fn as_str(self) -> &'static str { + Self::ALL.iter() + .find(|(kw, _)| *kw == self) + .map(|(_, lit)| *lit) + .unwrap() + } + + /// `parse.token.keyword.` — the catalog key + /// the renderer looks up for the expected-set + /// vocabulary (ADR-0021 §4). + pub fn catalog_token_key(self) -> String { + format!("parse.token.keyword.{}", self.as_str()) + } + } + }; +} + +define_keywords! { + Create => "create", + Drop => "drop", + Add => "add", + // ... one line per keyword ... +} +``` + +`Punct` follows the same pattern via `define_punct!` — +Colon, OpenParen, CloseParen, Comma, Equals, Dot, each +generated from one line of the invocation. + +Adding a new keyword is **one line** in the +`define_keywords!` invocation, plus one line in the +en-US YAML under `parse.token.keyword.` (the catalog +validator catches a missing entry at test time per ADR-0021 +§7). Everything else — every parser combinator that +references the keyword, every usage-registry entry, every +catalog-key lookup — is a *use* of the source of truth, not +a duplicate, and is compile-time-checked by the type system +(typos in `kw(Keyword::Creaate)` don't compile). + +### 3. Span carriage + +Every `Token` carries `(start, end)` byte offsets in the +original source string. The lexer is byte-exact for +multi-byte UTF-8 sequences (ASCII-only is the realistic +input but the lexer does not panic on Unicode in identifiers +or string literals). + +`ParseError::Invalid` continues to carry `position: usize`, +which is the `start` offset of the failing token (or the +end-of-input position when the failure is at EOF). Caret +rendering (in `app.rs`) is unchanged — same byte-position +contract. + +### 4. Lexer-vs-parser error split + +The lexer is responsible for: + +- **Tokenization shape errors**: unterminated string literal, + unrecognised character. Both surface as `TokenKind::Error(_)` + tokens with span coverage of the offending region. +- **Nothing else.** Specifically: type names, command + keywords, value validity, flag names, numeric overflow, + identifier collisions are all parser- or higher-layer + concerns. + +The parser is responsible for: + +- **Grammar-shape errors**: missing tokens, wrong tokens, + unexpected tokens, end-of-input mid-command. These come + out of chumsky's structural error machinery and aggregate + across `choice` naturally. +- **Content errors via `try_map`**: unknown type name + ("unknown type 'varchar'"), mutually-exclusive flags + ("`--force-conversion` and `--dont-convert` are mutually + exclusive"), "with pk needs at least one column", + referential-clause repeated. These keep their hand-written + messages — the `Rich::custom` path is correct for content, + it was only wrong for keyword-shape. + +The render layer (humanise) handles both uniformly via the +catalog (ADR-0021). + +### 5. The chumsky-over-tokens combinator surface + +Chumsky 0.13 parses arbitrary input types, not just `&str`. +The combinator surface for `Parser<'a, &'a [Token], …>` is +the same as for `&str`: `just`, `choice`, `then`, +`then_ignore`, `ignore_then`, `or_not`, `repeated`, +`separated_by`, `try_map`, `labelled`. The only differences: + +- `just(Token { kind: Keyword::Create, .. })` doesn't work + literally because spans differ. We define helpers + `kw(Keyword::Create)`, `punct(Punct::Colon)`, `ident()`, + `number()`, `string_lit()`, `flag(name)` that match by + kind and ignore span on the input side, returning either + `()` or the carried payload. +- `padded()` (which strips whitespace) is no longer needed — + the lexer skips whitespace already. This simplifies a lot + of combinator chains (`just(':').padded()` → `punct(Punct::Colon)`). +- `text::keyword(...)`, `any().filter(...)`, etc. — all the + character-level helpers — are gone from the parser. They + belong to the lexer if anywhere. + +The parser keeps producing `Rich` errors. The +`expected` set members are `RichPattern`; the +catalog's `parse.token.*` keys (ADR-0021 §4) translate them. + +### 6. The `replay` path-argument wart + +The current grammar admits bare paths: `replay history.log`, +`replay /tmp/seed.commands`. Bare paths contain `.` and `/`, +which are punctuation tokens. A naive token-based parser +would see `Keyword(Replay) Identifier("history") Punct(Dot) +Identifier("log")` and have to reassemble — annoyingly +context-dependent. + +Decision: keep the existing bare-path UX. The parser +special-cases the `replay` command at one point: after +matching `Keyword(Replay)`, it consumes its argument +**directly from the original source string** by reading from +the next token's start byte to end-of-input (or to the next +unambiguous terminator, of which today there is none in the +DSL). The lexer's job for the path argument is essentially +to identify *where the path starts*; the parser does the +rest by source-slicing. + +The quoted-path form (`replay 'my project/seed.commands'`) +still goes through the lexer's normal `StringLiteral` path +and is matched as `Keyword(Replay) StringLiteral(s)`. + +This special-casing is documented inline in the parser. It +costs ~10 lines of code and avoids a breaking change to the +DSL surface. No other DSL command takes a free-form +filesystem-path argument; if a future command does, it +either uses the same special-case or accepts only quoted +paths. + +### 7. Whitespace + +The lexer skips ASCII whitespace (` \t\r\n`) between tokens. +This honours ADR-0009's "whitespace is liberal" rule +unchanged — the user can put any amount of whitespace +between any two tokens, and the parser doesn't see it. + +The lexer does **not** track whitespace as tokens. I4 (syntax +highlighting) recovers whitespace as the gaps between token +spans, computed at render time. + +### 8. Comments + +The DSL has no comment syntax today and this ADR doesn't +introduce one. If a comment syntax is added later (likely +`--` line comments to match SQL, but that conflicts with +flag prefixes — this is a real design decision a future ADR +will need to settle), the lexer is the right layer to skip +them. Out of scope here. + +### 9. I3 (tab completion) hook + +This ADR commits to making the parser's expected-token-set +queryable at any point in the input. Concretely: + +- `parse_tokens(&[Token], &str)` returns + `(Result, ParseDiagnostics)` where + `ParseDiagnostics` carries the raw chumsky output — + including the merged `expected` set at the failure point. +- Truncating the token stream at the cursor and re-parsing + yields the expected-token-set at the cursor position. I3 + uses this for completion. +- The lexer is independently useful to I3: tokenizing the + in-progress input lets the completion logic see what + *partial* token the cursor is in (mid-keyword, mid-identifier, + mid-string). + +The full I3 surface (cursor positioning, completion menu, +disambiguation, completion of identifiers from schema) is +out of scope here. ADR-0020 commits only to the parser +contract. + +### 10. I4 (syntax highlighting) hook + +The token kinds are the natural classification for syntax +highlighting: + +- `Keyword(_)` → keyword colour. +- `Identifier(_)` → identifier colour. +- `Number(_)`, `StringLiteral(_)` → literal colours + (probably distinct from each other). +- `Punct(_)` → punctuation colour (or no special colouring). +- `Flag(_)` → flag colour. +- `Error(_)` → error highlight (red underline / squiggle). + +The lexer succeeds on partial / invalid input (§2). I4 does +not need a successful parse to render highlighting — only +successful tokenization. This is the load-bearing property +for "live" highlighting as the user types. + +The colour scheme, theme integration, render layer, and +performance budget are out of scope here. ADR-0020 commits +only to "the lexer always produces a token stream over which +highlighting can iterate". + +### 11. Recovery-based partial AST — explicitly deferred + +Chumsky supports parser combinators with recovery, producing +partial ASTs alongside errors. Useful for I3 ("the user has +typed `create table Foo with pk` and now I want to know +what the partial AST is so I can suggest column-spec +completions") but the design space is large (which recovery +strategies, what the partial AST shape is, how downstream +consumers handle it). + +This ADR keeps the parser non-recovering: a failed parse +returns `Err(ParseError)` and no partial AST. I3's ADR will +decide whether recovery is needed; if so, the change is +local to `parse_tokens` and doesn't ripple. + +### 12. Migration of existing parser tests + +The 50+ unit tests in `dsl/parser.rs::tests` are the spec of +the current grammar. The migration is mechanical: + +- Tests that call `parse_command(input)` keep doing so — + `parse_command` is the public boundary and its signature + doesn't change. +- Tests that assert on `ParseError::Invalid { message, .. }` + may need wording updates if the new error layer rewords + them, but anchor substrings ("unknown type", "specified + twice", "mutually exclusive", "varchar", "expected one + of") stay intact (those come from `try_map` content errors + that survive unchanged). +- Two existing tests assert on chumsky's structural wording: + `structural_error_for_show_data_without_arg` + ("after `show data`", "expected identifier", "found end of + input") and `structural_error_for_change_column_with_swapped_args` + ("after `change column Rich`", "expected `:`"). The new + rendering preserves the same shape — `after , + expected , found ` — so these tests should + port with at most minor adjustments. ADR-0021 specifies + the rendering precisely. + +A new test surface for the lexer itself: lex output for +representative inputs, lex error tokens for unterminated +strings + unknown chars, span correctness. + +## Out of scope + +Deliberately deferred to keep this ADR focused: + +1. **Per-command usage templates.** ADR-0021. +2. **`parse.usage.*` and `parse.token.*` catalog keys.** + ADR-0021. +3. **Error renderer composition** (caret + structural error + + usage hint). ADR-0021. +4. **I3 completion UI** + cursor logic + identifier + completion from schema. Future I3 ADR. +5. **I4 colour scheme** + theme integration. Future I4 ADR. +6. **Recovery-based partial AST.** §11 — re-opens with I3. +7. **Comment syntax.** §8. +8. **Sharing the lexer with `sqlparser-rs`** (advanced-mode + SQL). The two parsers stay separate today; a future SQL + subset ADR may revisit. + +## Consequences + +### Positive + +- **Aggregation works.** Top-level `choice((create_*, drop_*, + add_*, …))` failures emit "expected `create`, `drop`, + `add`, …" structurally. The user sees the family of + available commands rather than one branch's report. +- **The bespoke `humanise()` machinery shrinks.** Roughly + half the helpers in `parser.rs` (~80-100 lines) are no + longer needed because token-level error patterns render + directly via the catalog. Less code is less code to + maintain. +- **I3 / I4 inherit a clean foundation.** Their ADRs can + focus on UX/UI rather than re-litigating parser shape. +- **Lex errors and parse errors share a render path.** + Unterminated strings and missing keywords both surface + through the same catalog-driven layer. +- **The token stream is a natural API for future tooling**: + schema-aware highlighting, structural editing, command + history rendering with token-level colour. + +### Costs + +- **One-time migration of `dsl/parser.rs`.** Every combinator + rewrites against `&[Token]`. Estimated 600-900 lines of + parser.rs change including the lexer module; the lexer + itself is probably 200-300 lines plus 150-250 lines of + unit tests. +- **Catalog grows by one entry per keyword and per punct.** + The macro-driven `Keyword` / `Punct` enums (§2a) collapse + the enum + lex table + as-str + catalog-key derivation into + one declaration site, so adding a keyword is one line of + Rust + one line of YAML. The catalog validator enforces + completeness at test time (ADR-0021 §7). Compared to today + — where adding a keyword means a new `keyword_ci("...")` + call site (or several, if used in multiple commands) — the + per-keyword cost is comparable; what shifts is *where* the + edit happens. A unit test additionally asserts every + `Keyword::ALL` variant is referenced by some parser + combinator, catching dead enum entries. +- **Span model needs care for UTF-8.** Byte offsets are the + contract; the lexer must split tokens at UTF-8 boundaries. + Identifier/string tests should include at least one + multi-byte case. +- **`replay` special-case** in the parser (§6). One command, + ~10 lines, documented inline. Acceptable; not a precedent + for other commands. +- **Tests churn.** Two structural-error tests need new wording + (mechanical port). Existing content-error tests stay as-is. + +### Neutral + +- **chumsky stays.** No framework change, no new dependency. + The parser still expresses the grammar declaratively; only + the input atoms change. +- **`Command` AST is unchanged.** The parser produces the + same `Command` enum it does today; downstream code (runtime, + app, db) is untouched. +- **Public API of `dsl::parser` is unchanged.** + `parse_command(&str) → Result` keeps + its signature. ADR-0021 may extend the public surface (e.g. + expose `lex()` and `parse_tokens()` for I3/I4) but does so + additively. + +## Implementation notes + +These are sketch-level — implementation will produce more +detailed work, but they're enough that a session picking +this up has direction. + +### Order of operations + +1. **Lexer module.** New file `src/dsl/lexer.rs`. Token / + TokenKind / Keyword / Punct types. `lex(input: &str) → + Vec` (always succeeds; embeds `Error(_)` tokens + for shape errors). Unit tests for representative inputs, + span correctness, error-token cases, multi-byte UTF-8. +2. **Token-aware combinator helpers.** Probably in + `src/dsl/parser.rs` next to the existing combinators. + `kw(Keyword)`, `punct(Punct)`, `ident()`, `number()`, + `string_lit()`, `flag(&str)`, `eof()`. Each parses a + single token by kind and returns its payload (or `()`). +3. **Rewrite `command_parser()` and its sub-parsers.** One + sub-parser at a time; run the existing tests after each. + Aim for green-after-every-step rather than a big-bang + port. +4. **The `replay` source-slice special case** (§6). +5. **Strip `humanise()` machinery** that's no longer needed. + The render path in `into_parse_error()` simplifies. + ADR-0021 owns the new render shape; until it lands, keep + a minimal humaniser that produces the existing wording. +6. **Public API check.** `parse_command` signature unchanged. + Add `lex(&str) → Vec` and `parse_tokens(&[Token], + &str) → Result` as `pub` so I3/I4 + can hook in later (their ADRs will use these). + +### Things that interact subtly + +- **The `try_map` content errors** survive unchanged. They + fire on tokens (e.g. `try_map` on a parsed `Identifier(s)` + that's expected to be a type name) but their messages and + classification are identical to today. The catalog + vocabulary they use ("unknown type", "expected one of", + "mutually exclusive", "specified twice") stays. +- **The `1:n` cardinality** lexes as `Number("1")`, + `Punct(Colon)`, `Identifier("n"|"N")`. The parser composes + these into the relationship-cardinality assertion as + today; no special token kind for it. +- **Negative number literals.** Lexer emits `Punct(Minus)`? + No — there is no Minus in the punct set above because the + current grammar has no need for a unary `-` outside number + literals. Decision: the lexer recognises `-` as part of a + number literal when followed by an ASCII digit, producing a + single `Number("-5")` token. A bare `-` not followed by a + digit is a `Punct(Minus)` only if a `Minus` variant is + added — for now, treat as `Error(UnknownChar)`. This + matches the current grammar (which only accepts `-` as a + number sign). +- **The `--` flag prefix vs. a future `--` line comment.** + Today `--all-rows` etc. are flags. If a future ADR + introduces SQL-style `--` line comments in advanced mode + only, the lexer may need a mode parameter. ADR-0020 doesn't + pre-empt that. +- **The hard-coded `"running: "` prefix in `app.rs`** for + caret padding: unchanged. The parser still reports a + byte-position; the caret math is the same. diff --git a/docs/adr/0021-parser-as-source-of-truth-for-h1a.md b/docs/adr/0021-parser-as-source-of-truth-for-h1a.md new file mode 100644 index 0000000..244f7f4 --- /dev/null +++ b/docs/adr/0021-parser-as-source-of-truth-for-h1a.md @@ -0,0 +1,452 @@ +# ADR-0021: Parser-as-source-of-truth for H1a (per-command usage in parse errors) + +## Status + +Accepted. + +Builds on ADR-0020 (tokenization layer). Addresses H1a from +`requirements.md` — the parse-error pedagogy gap that +ADR-0019's friendly-error layer left untouched. + +Cross-references ADR-0019 (i18n catalog conventions; H1a's +output goes through the same catalog) and ADR-0009 (DSL +syntax conventions; usage templates render in the project's +documented surface form). + +## Context + +ADR-0019 dramatically improved engine-error wording. +Parse-error wording is now the visibly-weakest user surface — +the user-reported gap was concrete: typing `create` produces + +``` +parse error: after `create`, expected `table` +``` + +The error is *structurally* correct (chumsky has consumed +`create` and is now looking for the next required token) but +*pedagogically* silent. A learner who got this far typed +`create` because they'd been told that's how new tables are +made; what they need next is the shape of the command, not +a single missing-token pointer. + +Comparable observations apply across the whole DSL surface: + +- `add` → expected `column` or `1` (uninformative; user + needs the shape of `add column …` AND `add 1:n + relationship …`). +- `update Customers` → expected `set` (true; but `update`'s + full grammar with `set …`, `where …`, `--all-rows` is what + the user wants illustrated). +- `frobulate Customers` → expected one of `create`, `drop`, + `add`, `rename`, `change`, `show`, `insert`, `update`, + `delete`, `replay` (true after ADR-0020; the available-commands + list is now informative, but the no-prefix case wants its + own framing — "available commands" rather than "expected"). + +H1a's job is to close that gap by surfacing the **grammar** +of the command at the point of error, not just the next +token. + +### What ADR-0020 supplies + +ADR-0020 lands the lexer + parser-over-tokens architecture. +What that buys H1a: + +- **Aggregated `expected` sets at the failure point** (top-level + `choice` failures now list every command-starting keyword, + not just one). The user-visible "available commands" list + becomes correct without any work in this ADR. +- **Token-kind error patterns** (`RichPattern` instead + of `RichPattern`). Each pattern renders via a stable + catalog key — no per-character humanising. +- **A canonical entry-token for each command** (the first + `Keyword(_)` consumed). H1a keys per-command usage + templates off this token. + +### What this ADR adds on top + +- A registry of per-command **usage templates** (one + declaration per command). +- A renderer that composes the parse error with: caret + + structural error wording + matching usage template(s). +- New catalog keys under `parse.usage.*` (templates) and + `parse.token.*` (single-token rendering for expected-set + joins). +- A "no commands consumed" fallback that renders an + available-commands list under a different prefix + ("available commands:" rather than "expected:") for the + zero-prefix case. + +## Decision + +### 1. Per-command usage template registry + +Each command parser is paired with a `UsageEntry`: + +```rust +pub struct UsageEntry { + /// First keyword that distinguishes this command. Used + /// as the registry key. + pub entry: Keyword, + /// Catalog key for the grammar template body (under + /// `parse.usage.*`). One key per command. + pub catalog_key: &'static str, +} +``` + +The registry is a `&'static [UsageEntry]` declared in one +place (`src/dsl/usage.rs`). Lookup: given a consumed entry +keyword, return all entries whose `entry == keyword`. For +`Keyword::Add` the registry returns the `add column` and +`add 1:n relationship` entries; for `Keyword::Drop` it +returns `drop table`, `drop column`, `drop relationship`; +for unique-entry keywords (e.g. `Keyword::Create` today) it +returns one. + +The catalog key is what gets translated. Template bodies +live in `src/friendly/strings/en-US.yaml` under +`parse.usage.*`: + +```yaml +parse: + usage: + create_table: "create table with pk [:[, ...]]" + drop_table: "drop table " + add_column: "add column [to] [table] : ()" + add_relationship: | + add 1:n relationship [as ] + from .to .+ [on delete ] [on update ] + [--create-fk] + rename_column: "rename column [in] [table]
: to " + change_column: | + change column [in] [table]
: () + [--force-conversion | --dont-convert] + show_data: "show data
" + show_table: "show table
" + insert: "insert into
[([, ...])] [values] ([, ...])" + update: "update
set =[, ...] (where = | --all-rows)" + delete: "delete from
(where = | --all-rows)" + drop_column: "drop column [from] [table]
: " + drop_relationship: | + drop relationship + drop relationship from .to .+ replay: "replay | replay ''" +``` + +(Wording is illustrative; exact phrasing settled at +implementation time. The bracket convention `[...]` for +optional parts and angle-bracket `<...>` for placeholders +matches ADR-0009's documentation surface.) + +### 2. The renderer composes three blocks + +A parse error renders as: + +``` +running: + ^ ← caret (existing, unchanged) +parse error: +usage: + ← when multiple entries share the entry keyword +``` + +Block 1 (the echo + caret) is unchanged from today. + +Block 2 is the structural or content error. ADR-0020 +guarantees the structural error is now properly aggregated +("expected `data` or `table`" not "expected `table`"). The +content errors (unknown type, mutually-exclusive flags) are +unchanged in voice. + +Block 3 (usage:) is new. It is rendered if and only if **at +least one keyword token was consumed** before the parser +failed AND that keyword is a registered entry. If no keyword +was consumed (e.g., `frobulate Customers`, where `frobulate` +is an `Identifier`, not a `Keyword`), Block 3 is replaced +with the no-prefix fallback (§5). + +If multiple entries match (e.g., the `add` family), all are +listed under a single `usage:` prefix, one per line. + +### 3. Identifying the consumed entry keyword + +The parser surfaces, alongside the `ParseError`, the +**deepest successfully-consumed keyword token**. Mechanism: + +- `parse_tokens` returns `(Result, + ParseDiagnostics)` where `ParseDiagnostics` carries the + furthest position chumsky reached AND a snapshot of the + consumed prefix. +- The renderer walks the consumed prefix backward to find the + first `Keyword(_)` token. (Almost always the first token, + but a future grammar where a command starts with a + literal — none today — would still resolve correctly.) + +This logic lives in `src/dsl/usage.rs::matched_entry()` so +the registry and the lookup sit together. + +### 4. `parse.token.*` — single-token catalog vocabulary + +Chumsky's expected-set rendering needs a name for each token +kind. Today `humanise()` hand-codes these +(`describe_pattern` returns "`create`", "identifier", etc.). +ADR-0021 moves the vocabulary into the catalog: + +```yaml +parse: + token: + # Keywords — one entry per Keyword enum variant. + keyword.create: "`create`" + keyword.table: "`table`" + keyword.with: "`with`" + # ... one per Keyword variant ... + + # Punctuation. + punct.colon: "`:`" + punct.open_paren: "`(`" + punct.close_paren: "`)`" + punct.comma: "`,`" + punct.equals: "`=`" + punct.dot: "`.`" + + # Token-class labels. + identifier: "identifier" + number: "number" + string_literal: "string literal" + flag: "flag (--name)" + end_of_input: "end of input" + + # Lexer-error tokens. + error.unterminated_string: "unterminated string starting at column {column}" + error.unknown_char: "unrecognised character {found}" +``` + +Joining ("`a`, `b`, or `c`") stays in code (`oxford_or` from +the current humanise machinery, lifted intact). Wording of +each token is in the catalog. + +`parse.error` (existing wrapper key) stays. Its `{detail}` +placeholder is filled by: + +``` +{consumed_prefix} expected {oxford_or(expected)}, found {found_token} +``` + +— each piece sourced from the catalog, joined in code. + +`parse.caret` (existing) and `parse.empty` (existing) +unchanged. + +### 5. No-prefix fallback: "available commands" + +When the parser fails with **no keyword consumed**, the +"expected" set lists every top-level command-starting +keyword. That's correct but the framing should be +"available commands" rather than "expected". + +Renderer detects this case (consumed-keyword count == 0) and +substitutes Block 3 with: + +``` +available commands: create, drop, add, rename, change, + show, insert, update, delete, replay +``` + +via a new catalog key: + +```yaml +parse: + available_commands: "available commands: {commands}" +``` + +The list is the alphabetised set of `entry` keywords from +the usage registry, each rendered via its `parse.token.keyword.*` +catalog entry (so the strings are catalog-sourced, not +hard-coded). + +This case only fires when the user typed something the +parser couldn't classify as any known command keyword — the +"frobulate Customers" case. It's both rarer and more useful +than the with-prefix case: a user this lost benefits more +from the full menu than from a missing-token pointer. + +### 6. Anchor-phrase compliance (ADR-0019 §10) + +ADR-0019's anchor-phrase list contains nine substrings the +catalog commits to keeping stable. None are parse-error-specific, +so this ADR doesn't add to the list. The existing parser +test that asserts on "unknown type" and "expected one of" +substrings stays — those come from `Type::from_str`'s custom +error message which ADR-0020 §4 keeps unchanged. + +The current structural-error tests assert on substrings like +"after `show data`", "expected identifier", "found end of +input", "after `change column Rich`", "expected `:`". The +new render shape preserves all of these — the rendering +template is `{prefix} expected {set}, found {token}` and +the prefix / set / token come from the catalog with the same +wording. Tests should port unchanged or with at most minor +adjustments. + +### 7. Catalog validator covers the new keys + +ADR-0019 §8.6's `KEYS_AND_PLACEHOLDERS` validator extends +to cover: + +- Every `parse.usage.` key referenced from the + registry exists. +- Every `parse.token.keyword.` key for every + `Keyword` enum variant exists. +- Every `parse.token.punct.` key for every `Punct` + variant exists. +- The `parse.token.{identifier, number, string_literal, + flag, end_of_input}` keys exist. +- The `parse.token.error.*` keys exist for every + `LexErrorKind` variant. +- The `parse.available_commands` key exists. +- No format specifiers (already enforced). +- No engine vocabulary (already enforced). + +### 8. The `usage:` block respects the verbosity setting? + +No. The `messages (short|verbose)` setting (ADR-0019) +governs *engine-error* verbosity (whether to render the +hint block of a `FriendlyError`). Parse errors don't go +through `FriendlyError`; they have their own render path, +and the usage block is always shown. Rationale: a learner +toggling to `messages short` is signalling they recognise +the engine-error patterns and want less explanation around +those — they're not signalling that they want less +parse-help. Parse errors mean the user couldn't even +formulate a runnable command; that's exactly the moment to +maximise pedagogical surface, regardless of the +engine-error verbosity preference. + +If experience shows this is wrong, a future amendment can +gate the usage block on a separate setting. Doesn't need +to be designed now. + +## Out of scope + +1. **Tab completion (I3) and syntax highlighting (I4)** + themselves. ADR-0020 §9-10 commits to the parser + contract; ADR-0021 doesn't extend it. +2. **Schema-aware suggestions** ("did you mean `Customers`?" + when the user typed `Customrs`). Useful but a separate + feature; would land in I3 territory (completion + spell + check share a candidate list). +3. **Suggested fixes** ("change `crete` to `create`"). Same + bucket as schema-aware suggestions. +4. **Multi-error reporting.** Today and after this ADR, the + parser reports the first error and stops. Recovery-based + multi-error parsing is out of scope and re-opens with + I3's ADR (ADR-0020 §11). +5. **Persisting the verbosity setting** (which doesn't + affect parse errors anyway, per §8). ADR-0019 deferred it + to a future settings ADR. + +## Consequences + +### Positive + +- **Per-command usage at point of failure.** A learner who + types `create` sees the full `create table` grammar + instead of "expected `table`". The user-reported gap + closes. +- **Aggregated `available commands` for cold starts.** + `frobulate Customers` now lists the ten command-starting + keywords under a sensible framing. +- **Vocabulary lives in the catalog, not in code.** Renaming + a keyword's user-facing wording is one YAML edit. Adding + a new keyword adds two lines (registry + token-name key); + the validator catches both if forgotten. +- **The render path simplifies.** `humanise()` shrinks to a + small composer over catalog lookups — no per-character + description, no `RichPattern` walking, no + prefer-custom-over-structural switching (the latter + becomes "render the structural error and append the usage + template"). +- **Composes with ADR-0019's `FriendlyError`.** Engine + errors and parse errors are rendered through different + paths but both go through the catalog, so vocabulary + drift between them is impossible. + +### Costs + +- **A second registry to keep in sync** with the parser. The + validator (§7) catches missing usage entries / missing + token keys at test time, but adding a new command means + three steps (parser combinator, usage-registry entry, + catalog YAML edit). Mitigation: a unit test asserts every + command in the parser has a registry entry (catches + forgotten entries; matches the friendly-module pattern). +- **Catalog grows by ~30-40 entries** (one usage template + per command, one keyword name per `Keyword` variant, a + handful of token-class names, a handful of error names). + Each entry is one line of YAML; total catalog grows from + ~170 entries to ~210. Within budget. +- **Wording iteration** on the usage templates will probably + happen post-merge. This is normal for pedagogical text + and the catalog makes it cheap. + +### Neutral + +- **Public parser API is unchanged.** `parse_command(&str)` + signature stable. The new `lex` and `parse_tokens` + functions exposed by ADR-0020 are the I3/I4 hook; + ADR-0021 doesn't add to that surface. +- **`AppEvent` shape unchanged.** Parse errors continue to + flow through `dispatch_dsl`'s existing path (push echo, + push caret, push error). This ADR's render changes are + internal to that function plus the `t!()` calls inside it. + +## Implementation notes + +### Order of operations (within the joint ADR-0020 + ADR-0021 implementation session) + +1. Land ADR-0020 (lexer + parser refactor + minimal + humaniser). +2. Add `src/dsl/usage.rs` with the registry struct, the + static table, and `matched_entry()`. +3. Populate `parse.usage.*` and `parse.token.*` catalog + sections. +4. Extend `friendly::keys::KEYS_AND_PLACEHOLDERS` with the + new keys. +5. Rewrite `dispatch_dsl`'s error-render arm in `app.rs` to + compose the three blocks per §2 (or §5 fallback). +6. Add tests: + - Unit: every registered usage entry resolves through the + catalog. Every `Keyword` variant has a `parse.token.keyword.*` + entry. + - Integration (`tests/parse_error_pedagogy.rs`, new): + `create`, `add`, `update Customers`, `frobulate + Customers`, `create table` (no PK clause), `insert into + T` (no values), each producing the expected + three-block output. +7. Update or port the two existing structural-error tests + in `parser.rs::tests` to the new render shape. + +### Things that interact subtly + +- **The "deepest consumed keyword" mechanism** (§3) walks + the prefix once per parse failure. Cheap; no perf concern. + But it must not pick up keywords from inside content that + is itself part of a partial AST (e.g. an identifier the + user is typing that happens to be the first letters of a + keyword); since the lexer commits to identifier-vs-keyword + classification before the parser sees tokens, this isn't a + real risk. Documented inline. +- **Multiple usage entries per `add` / `drop`** are rendered + under one `usage:` prefix per §2. This is one of the + pedagogically-best parts of the change: the user gets the + full family rather than guessing which sibling they + wanted. +- **`replay`'s special-case parsing** (ADR-0020 §6) is + invisible to the usage layer. The user typing `replay` + with no path gets the `parse.usage.replay` template. +- **`messages` is an app-level command, not a DSL command**, + so it is not in the parser registry and doesn't appear in + `available commands:`. Same posture as `mode`, `help`, + `quit`. Documented in the registry's prelude. diff --git a/docs/adr/README.md b/docs/adr/README.md index 48cdcca..5b03a68 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -25,3 +25,5 @@ This directory contains the project's ADRs, recorded per - [ADR-0017 — Column type-change compatibility](0017-column-type-change-compatibility.md) - [ADR-0018 — Auto-fill contracts for `serial` and `shortid` columns](0018-auto-fill-contracts-for-serial-and-shortid.md) - [ADR-0019 — Friendly error layer (H1) and i18n message catalog](0019-friendly-error-layer-and-i18n.md) +- [ADR-0020 — Tokenization layer for the DSL parser](0020-tokenization-layer-for-the-dsl-parser.md) +- [ADR-0021 — Parser-as-source-of-truth for H1a (per-command usage in parse errors)](0021-parser-as-source-of-truth-for-h1a.md)