# ADR-0020: Tokenization layer for the DSL parser

## Status

Accepted. Amends ADR-0001 (language and TUI framework) by adding a tokenization layer between the source string and the chumsky grammar. The chumsky choice itself is unchanged; what changes is the input to chumsky: `&[Token]` instead of `&str`.

Foundation for ADR-0021 (parser-as-source-of-truth pedagogy / H1a). Also the load-bearing piece for I3 (tab completion) and I4 (syntax highlighting) once those land.

## Context

### What the user sees today

Typing `create` at the prompt produces:

```
parse error: after `create`, expected `table`
```

Typing `frobulate Customers` produces:

```
parse error: expected `create`, found `frobulate`
```

Both are technically true and both are bad. The first hides that `create` is the entry to a single command (`create table …`) that the user almost certainly does not yet know the shape of. The second points at one of ten possible command-starting keywords and silently picks one. A user hitting either error gets a sharper "what comes next?" answer from a 1980s `lex`/`yacc` toolchain than from this app.

### Two technical roots

The bad UX traces to two parser-level decisions, both correctable but not without architectural change:

1. **`keyword_ci` emits `Rich::custom` errors instead of structural ones.** Chumsky's `choice` combinator merges the `expected` sets of its alternatives when they all fail; that is exactly what produces "expected `data` or `table`" instead of just "expected `table`". Custom errors don't participate in that merge: only one wins, deterministically the first. So our top-level `choice((create_table, drop_*, add_*, …))` collapses to whichever branch reported its custom error first, throwing the others away.
2. **The parser has no concept of a "command" beyond its chumsky combinator graph.** There is no place to attach a per-command grammar template (the "what does `create table` actually want?" answer). Adding one is possible but awkward without an explicit handle on "the entry token for this command".

ADR-0021 addresses (2). This ADR addresses (1) by removing the underlying cause: `keyword_ci`'s `try_map → Rich::custom` shape only exists because the parser operates on raw characters and has to hand-write keyword recognition. Over a token stream, a keyword is just a token, and `just(...)` over tokens aggregates naturally.

### Bespoke machinery in `parser.rs`

`parser.rs` carries ~180 lines of error-handling helpers (`humanise()`, `consumed_context()`, `oxford_or()`, `describe_pattern()`, `describe_char()`, `format_expected_found()`, `first_custom_message()`, the `prefer-custom-over-structural` selection in `into_parse_error()`). Most of it exists because chumsky-over-`&str` produces character-level error patterns that need humanising before a user can read them (`RichPattern::Token('e')` → "`e`" is not what the user needed to know). Over a token stream, the patterns are token kinds (`Keyword(Create)`, `Identifier`, `Punct(Colon)`), which render directly via the i18n catalog without per-character translation. Roughly half of that machinery dissolves.

### Honest history

ADR-0001 chose Rust + chumsky for the DSL. It did not name "no lexer" as a design choice; the no-lexer shape grew incrementally inside `parser.rs` without an ADR.

The known requirements at the time included **H1a** (friendly error layer for parse errors), **I3** (tab completion), and **I4** (syntax highlighting). All three are easier with a token stream than without one: H1a needs aggregated expected sets, I3 needs "what tokens are valid at cursor", I4 needs token classification on potentially-invalid input. The lexer-shape question should have been considered against these requirements when the parser was built. It wasn't.

The user has previously raised this, pointing out that `lex`/`yacc` already handled this kind of error reporting better, and asking why we weren't getting comparable behaviour.
That observation was correct on the merits and should have triggered an ADR amendment then. ADR-0020 is that amendment, late but in the right place.

This is not a "we now realise" framing. It is a "we should have decided this earlier and didn't" framing. The cost of acting on it now is real but bounded; the cost of acting on it after the query DSL or constraint-management commands have landed would be substantially larger. We have no users; the agility cost of refactoring is at its lowest.

### What is and isn't this ADR

This ADR is purely about the **input layer to the DSL parser**. It does not specify per-command usage rendering, catalog keys for parse-error wording, or the renderer composition of caret + structural error + usage hint. Those are ADR-0021's scope. The two ADRs share an implementation session; the dependency runs one way (ADR-0021 needs ADR-0020's tokens).

This ADR also does not touch SQL parsing in advanced mode. That path uses `sqlparser-rs` (per ADR-0001) and has its own tokenization built in. A future ADR for the SQL subset (Q4) will decide whether to share the DSL lexer's token model or keep `sqlparser-rs`'s token surface as-is; that's not prejudged here.

## Decision

### 1. Two-phase parse: lex → parse

```rust
pub fn parse_command(input: &str) -> Result<Command, ParseError> {
    let tokens = lex(input);       // Stage 1: always succeeds (§2)
    parse_tokens(&tokens, input)   // Stage 2
}
```

Stage 1 (`lex`) produces a span-tagged token stream. Stage 2 (`parse_tokens`) is a chumsky parser whose input type is `&[Token]` instead of `&str`. The two stages are separately testable; the lexer has its own test surface that doesn't exercise the parser.

`parse_tokens` takes the original `&str` as a second argument purely so the parser can consume bare path arguments for the `replay` command directly from source (see §6). All other parser logic operates over the token slice.

### 2. Token model

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span, // (start, end) byte offsets
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Keyword(Keyword),      // case-folded reserved word
    Identifier(String),    // case-preserving (ADR-0009)
    Number(String),        // raw text; numeric parse is a parser concern
    StringLiteral(String), // unquoted, escapes processed
    Punct(Punct),          // : ( ) , = . (one char each)
    Flag(String),          // --name (e.g. "--all-rows")
    Error(LexError),       // unrecognised char / unterminated string
                           // (a token kind, not a Result variant)
}

// Keyword is a closed set declared via a macro that is the
// single source of truth (§2a). Punct follows the same pattern.
```

The `Keyword` and `Punct` enums are not hand-declared. They come out of macros described in §2a: one declaration site that generates the enum, the lex-side string→variant mapping, the variant→literal rendering, and the catalog-key derivation.

Notes on the model:

- **Keywords are an enum, not a string.** This makes `kw(Keyword::Create)` a single-token chumsky match with exact identity: fastest path, cleanest error patterns, no string allocation in the hot path.
- **Type names stay as identifiers.** `int`, `text`, `serial`, `varchar` all lex as `Identifier(_)`. The parser-level `type_keyword()` continues to call `Type::from_str` which produces the existing "unknown type 'varchar' (expected one of: text, int, real, …)" message. That custom-error path is correct and stays. The closed-set `Keyword` enum is for the *grammar's* reserved words, that is, words whose presence determines which command is being parsed. Type names are *content*.
- **Number is raw text, not parsed.** `Value::Number(String)` per ADR-0014 stays string-backed; the lexer doesn't try to validate or convert. `1`, `-3.14`, `1e10`, `1abc` all produce candidates for the parser to decide on. (The current parser rejects `1abc` already; that doesn't change.)
- **`StringLiteral` is post-escape.** The lexer processes `''` → `'` per the existing string syntax. The original span covers the surrounding quotes; the payload is the unescaped content. (Same convention as the current `string_literal()` parser.)
- **`Flag(String)` is `--name` exactly.** No further parsing. The parser matches `Flag(s)` and checks `s == "all-rows"` etc. against a small set.
- **`Error(_)` is a token kind, not a `Result` variant.** Lex always succeeds in producing `Vec<Token>`; even on unterminated strings or unrecognised characters, a diagnostic `Error` token is emitted in place. Rationale: I4 (syntax highlighting) needs to colour partial / invalid input; treating lex errors as data instead of control flow lets the highlighter walk a token stream uniformly. The parser sees `Error(_)` tokens and raises a structural error that names the underlying cause via the catalog.

### 2a. Single source of truth for keywords and punctuation

The `Keyword` enum, the lexer's reserved-word table, the variant→literal rendering, and the catalog-key derivation all come from one declaration. A `define_keywords!` `macro_rules!` invocation in `src/dsl/keyword.rs`:

```rust
macro_rules! define_keywords {
    ( $( $variant:ident => $literal:literal ),+ $(,)? ) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Keyword { $( $variant ),+ }

        impl Keyword {
            pub const ALL: &'static [(Keyword, &'static str)] = &[
                $( (Keyword::$variant, $literal) ),+
            ];

            /// Lex-side mapping. Case-insensitive per ADR-0009.
            pub fn from_word(s: &str) -> Option<Keyword> {
                Self::ALL.iter()
                    .find(|(_, lit)| s.eq_ignore_ascii_case(lit))
                    .map(|(kw, _)| *kw)
            }

            /// Canonical lowercase literal for this variant.
            pub fn as_str(self) -> &'static str {
                Self::ALL.iter()
                    .find(|(kw, _)| *kw == self)
                    .map(|(_, lit)| *lit)
                    .unwrap()
            }

            /// `parse.token.keyword.<literal>`: the catalog key
            /// the renderer looks up for the expected-set
            /// vocabulary (ADR-0021 §4).
            pub fn catalog_token_key(self) -> String {
                format!("parse.token.keyword.{}", self.as_str())
            }
        }
    };
}

define_keywords! {
    Create => "create",
    Drop   => "drop",
    Add    => "add",
    // ... one line per keyword ...
}
```

`Punct` follows the same pattern via `define_punct!`: Colon, OpenParen, CloseParen, Comma, Equals, Dot, each generated from one line of the invocation.

Adding a new keyword is **one line** in the `define_keywords!` invocation, plus one line in the en-US YAML under `parse.token.keyword.` (the catalog validator catches a missing entry at test time per ADR-0021 §7). Everything else (every parser combinator that references the keyword, every usage-registry entry, every catalog-key lookup) is a *use* of the source of truth, not a duplicate, and is compile-time-checked by the type system (typos in `kw(Keyword::Creaate)` don't compile).

### 3. Span carriage

Every `Token` carries `(start, end)` byte offsets in the original source string. The lexer is byte-exact for multi-byte UTF-8 sequences (ASCII-only is the realistic input but the lexer does not panic on Unicode in identifiers or string literals).

`ParseError::Invalid` continues to carry `position: usize`, which is the `start` offset of the failing token (or the end-of-input position when the failure is at EOF). Caret rendering (in `app.rs`) is unchanged: same byte-position contract.

### 4. Lexer-vs-parser error split

The lexer is responsible for:

- **Tokenization shape errors**: unterminated string literal, unrecognised character. Both surface as `TokenKind::Error(_)` tokens with span coverage of the offending region.
- **Nothing else.** Specifically: type names, command keywords, value validity, flag names, numeric overflow, identifier collisions are all parser- or higher-layer concerns.

The parser is responsible for:

- **Grammar-shape errors**: missing tokens, wrong tokens, unexpected tokens, end-of-input mid-command.
These come out of chumsky's structural error machinery and aggregate across `choice` naturally.
- **Content errors via `try_map`**: unknown type name ("unknown type 'varchar'"), mutually-exclusive flags ("`--force-conversion` and `--dont-convert` are mutually exclusive"), "with pk needs at least one column", referential-clause repeated. These keep their hand-written messages; the `Rich::custom` path is correct for content, it was only wrong for keyword-shape.

The render layer (humanise) handles both uniformly via the catalog (ADR-0021).

### 5. The chumsky-over-tokens combinator surface

Chumsky 0.13 parses arbitrary input types, not just `&str`. The combinator surface for `Parser<'a, &'a [Token], …>` is the same as for `&str`: `just`, `choice`, `then`, `then_ignore`, `ignore_then`, `or_not`, `repeated`, `separated_by`, `try_map`, `labelled`. The only differences:

- `just(Token { kind: Keyword::Create, .. })` doesn't work literally because spans differ. We define helpers `kw(Keyword::Create)`, `punct(Punct::Colon)`, `ident()`, `number()`, `string_lit()`, `flag(name)` that match by kind and ignore span on the input side, returning either `()` or the carried payload.
- `padded()` (which strips whitespace) is no longer needed because the lexer skips whitespace already. This simplifies a lot of combinator chains (`just(':').padded()` → `punct(Punct::Colon)`).
- `text::keyword(...)`, `any().filter(...)`, etc. (all the character-level helpers) are gone from the parser. They belong to the lexer if anywhere.

The parser keeps producing `Rich` errors. The `expected` set members are `RichPattern`; the catalog's `parse.token.*` keys (ADR-0021 §4) translate them.

### 6. The `replay` path-argument wart

The current grammar admits bare paths: `replay history.log`, `replay /tmp/seed.commands`. Bare paths contain `.` and `/`, which are punctuation tokens. A naive token-based parser would see `Keyword(Replay) Identifier("history") Punct(Dot) Identifier("log")` and have to reassemble, which is annoyingly context-dependent.

Decision: keep the existing bare-path UX. The parser special-cases the `replay` command at one point: after matching `Keyword(Replay)`, it consumes its argument **directly from the original source string** by reading from the next token's start byte to end-of-input (or to the next unambiguous terminator, of which today there is none in the DSL). The lexer's job for the path argument is essentially to identify *where the path starts*; the parser does the rest by source-slicing.

The quoted-path form (`replay 'my project/seed.commands'`) still goes through the lexer's normal `StringLiteral` path and is matched as `Keyword(Replay) StringLiteral(s)`.

This special-casing is documented inline in the parser. It costs ~10 lines of code and avoids a breaking change to the DSL surface. No other DSL command takes a free-form filesystem-path argument; if a future command does, it either uses the same special-case or accepts only quoted paths.

### 7. Whitespace

The lexer skips ASCII whitespace (` \t\r\n`) between tokens. This honours ADR-0009's "whitespace is liberal" rule unchanged: the user can put any amount of whitespace between any two tokens, and the parser doesn't see it.

The lexer does **not** track whitespace as tokens. I4 (syntax highlighting) recovers whitespace as the gaps between token spans, computed at render time.

### 8. Comments

The DSL has no comment syntax today and this ADR doesn't introduce one. If a comment syntax is added later (likely `--` line comments to match SQL, but that conflicts with flag prefixes; this is a real design decision a future ADR will need to settle), the lexer is the right layer to skip them. Out of scope here.

### 9. I3 (tab completion) hook

This ADR commits to making the parser's expected-token-set queryable at any point in the input.
Concretely:

- `parse_tokens(&[Token], &str)` returns `(Result<Command, ParseError>, ParseDiagnostics)` where `ParseDiagnostics` carries the raw chumsky output, including the merged `expected` set at the failure point.
- Truncating the token stream at the cursor and re-parsing yields the expected-token-set at the cursor position. I3 uses this for completion.
- The lexer is independently useful to I3: tokenizing the in-progress input lets the completion logic see what *partial* token the cursor is in (mid-keyword, mid-identifier, mid-string).

The full I3 surface (cursor positioning, completion menu, disambiguation, completion of identifiers from schema) is out of scope here. ADR-0020 commits only to the parser contract.

### 10. I4 (syntax highlighting) hook

The token kinds are the natural classification for syntax highlighting:

- `Keyword(_)` → keyword colour.
- `Identifier(_)` → identifier colour.
- `Number(_)`, `StringLiteral(_)` → literal colours (probably distinct from each other).
- `Punct(_)` → punctuation colour (or no special colouring).
- `Flag(_)` → flag colour.
- `Error(_)` → error highlight (red underline / squiggle).

The lexer succeeds on partial / invalid input (§2). I4 does not need a successful parse to render highlighting, only successful tokenization. This is the load-bearing property for "live" highlighting as the user types.

The colour scheme, theme integration, render layer, and performance budget are out of scope here. ADR-0020 commits only to "the lexer always produces a token stream over which highlighting can iterate".

### 11. Recovery-based partial AST: explicitly deferred

Chumsky supports parser combinators with recovery, producing partial ASTs alongside errors. Useful for I3 ("the user has typed `create table Foo with pk` and now I want to know what the partial AST is so I can suggest column-spec completions") but the design space is large (which recovery strategies, what the partial AST shape is, how downstream consumers handle it).

This ADR keeps the parser non-recovering: a failed parse returns `Err(ParseError)` and no partial AST. I3's ADR will decide whether recovery is needed; if so, the change is local to `parse_tokens` and doesn't ripple.

### 12. Migration of existing parser tests

The 50+ unit tests in `dsl/parser.rs::tests` are the spec of the current grammar. The migration is mechanical:

- Tests that call `parse_command(input)` keep doing so: `parse_command` is the public boundary and its signature doesn't change.
- Tests that assert on `ParseError::Invalid { message, .. }` may need wording updates if the new error layer rewords them, but anchor substrings ("unknown type", "specified twice", "mutually exclusive", "varchar", "expected one of") stay intact (those come from `try_map` content errors that survive unchanged).
- Two existing tests assert on chumsky's structural wording: `structural_error_for_show_data_without_arg` ("after `show data`", "expected identifier", "found end of input") and `structural_error_for_change_column_with_swapped_args` ("after `change column Rich`", "expected `:`"). The new rendering preserves the same shape (an "after …" context, then an "expected …" set, then a "found …" token), so these tests should port with at most minor adjustments. ADR-0021 specifies the rendering precisely.

A new test surface for the lexer itself: lex output for representative inputs, lex error tokens for unterminated strings + unknown chars, span correctness.

## Out of scope

Deliberately deferred to keep this ADR focused:

1. **Per-command usage templates.** ADR-0021.
2. **`parse.usage.*` and `parse.token.*` catalog keys.** ADR-0021.
3. **Error renderer composition** (caret + structural error + usage hint). ADR-0021.
4. **I3 completion UI** + cursor logic + identifier completion from schema. Future I3 ADR.
5. **I4 colour scheme** + theme integration. Future I4 ADR.
6. **Recovery-based partial AST.** §11; re-opens with I3.
7. **Comment syntax.** §8.
8. **Sharing the lexer with `sqlparser-rs`** (advanced-mode SQL).
The two parsers stay separate today; a future SQL subset ADR may revisit.

## Consequences

### Positive

- **Aggregation works.** Top-level `choice((create_*, drop_*, add_*, …))` failures emit "expected `create`, `drop`, `add`, …" structurally. The user sees the family of available commands rather than one branch's report.
- **The bespoke `humanise()` machinery shrinks.** Roughly half the helpers in `parser.rs` (~80-100 lines) are no longer needed because token-level error patterns render directly via the catalog. Less code is less code to maintain.
- **I3 / I4 inherit a clean foundation.** Their ADRs can focus on UX/UI rather than re-litigating parser shape.
- **Lex errors and parse errors share a render path.** Unterminated strings and missing keywords both surface through the same catalog-driven layer.
- **The token stream is a natural API for future tooling**: schema-aware highlighting, structural editing, command history rendering with token-level colour.

### Costs

- **One-time migration of `dsl/parser.rs`.** Every combinator rewrites against `&[Token]`. Estimated 600-900 lines of `parser.rs` change including the lexer module; the lexer itself is probably 200-300 lines plus 150-250 lines of unit tests.
- **Catalog grows by one entry per keyword and per punct.** The macro-driven `Keyword` / `Punct` enums (§2a) collapse the enum + lex table + as-str + catalog-key derivation into one declaration site, so adding a keyword is one line of Rust + one line of YAML. The catalog validator enforces completeness at test time (ADR-0021 §7). Compared to today, where adding a keyword means a new `keyword_ci("...")` call site (or several, if used in multiple commands), the per-keyword cost is comparable; what shifts is *where* the edit happens. A unit test additionally asserts every `Keyword::ALL` variant is referenced by some parser combinator, catching dead enum entries.
- **Span model needs care for UTF-8.** Byte offsets are the contract; the lexer must split tokens at UTF-8 boundaries. Identifier/string tests should include at least one multi-byte case.
- **`replay` special-case** in the parser (§6). One command, ~10 lines, documented inline. Acceptable; not a precedent for other commands.
- **Tests churn.** Two structural-error tests need new wording (mechanical port). Existing content-error tests stay as-is.

### Neutral

- **chumsky stays.** No framework change, no new dependency. The parser still expresses the grammar declaratively; only the input atoms change.
- **`Command` AST is unchanged.** The parser produces the same `Command` enum it does today; downstream code (runtime, app, db) is untouched.
- **Public API of `dsl::parser` is unchanged.** `parse_command(&str) → Result<Command, ParseError>` keeps its signature. ADR-0021 may extend the public surface (e.g. expose `lex()` and `parse_tokens()` for I3/I4) but does so additively.

## Implementation notes

These are sketch-level. Implementation will produce more detailed work, but they're enough that a session picking this up has direction.

### Order of operations

1. **Lexer module.** New file `src/dsl/lexer.rs`. Token / TokenKind / Keyword / Punct types. `lex(input: &str) → Vec<Token>` (always succeeds; embeds `Error(_)` tokens for shape errors). Unit tests for representative inputs, span correctness, error-token cases, multi-byte UTF-8.
2. **Token-aware combinator helpers.** Probably in `src/dsl/parser.rs` next to the existing combinators. `kw(Keyword)`, `punct(Punct)`, `ident()`, `number()`, `string_lit()`, `flag(&str)`, `eof()`. Each parses a single token by kind and returns its payload (or `()`).
3. **Rewrite `command_parser()` and its sub-parsers.** One sub-parser at a time; run the existing tests after each. Aim for green-after-every-step rather than a big-bang port.
4. **The `replay` source-slice special case** (§6).
5. **Strip `humanise()` machinery** that's no longer needed. The render path in `into_parse_error()` simplifies. ADR-0021 owns the new render shape; until it lands, keep a minimal humaniser that produces the existing wording.
6. **Public API check.** `parse_command` signature unchanged. Add `lex(&str) → Vec<Token>` and `parse_tokens(&[Token], &str) → Result<Command, ParseError>` as `pub` so I3/I4 can hook in later (their ADRs will use these).

### Things that interact subtly

- **The `try_map` content errors** survive unchanged. They fire on tokens (e.g. `try_map` on a parsed `Identifier(s)` that's expected to be a type name) but their messages and classification are identical to today. The catalog vocabulary they use ("unknown type", "expected one of", "mutually exclusive", "specified twice") stays.
- **The `1:n` cardinality** lexes as `Number("1")`, `Punct(Colon)`, `Identifier("n"|"N")`. The parser composes these into the relationship-cardinality assertion as today; no special token kind for it.
- **Negative number literals.** Lexer emits `Punct(Minus)`? No: there is no `Minus` in the punct set above because the current grammar has no need for a unary `-` outside number literals. Decision: the lexer recognises `-` as part of a number literal when followed by an ASCII digit, producing a single `Number("-5")` token. A bare `-` not followed by a digit is a `Punct(Minus)` only if a `Minus` variant is added; for now, treat as `Error(UnknownChar)`. This matches the current grammar (which only accepts `-` as a number sign).
- **The `--` flag prefix vs. a future `--` line comment.** Today `--all-rows` etc. are flags. If a future ADR introduces SQL-style `--` line comments in advanced mode only, the lexer may need a mode parameter. ADR-0020 doesn't pre-empt that.
- **The hard-coded `"running: "` prefix in `app.rs`** for caret padding: unchanged. The parser still reports a byte-position; the caret math is the same.
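
### Illustrative lexer sketch

The lexing rules discussed above (negative number literals, the `1:n` cardinality, `--` flags, `''` escapes, and `Error(_)` tokens for shape errors) can be sketched as a small hand-written lexer. This is a minimal illustration under stated assumptions, not the planned `src/dsl/lexer.rs`: `Keyword(&'static str)` and `Punct(char)` stand in for the macro-generated enums, `KEYWORDS` is a hypothetical four-word subset, `Flag` stores the name without its `--` prefix, and the rule that a number token runs through any following alphanumerics or dots (so `1abc` stays one candidate for the parser) is an assumption.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LexError { UnknownChar, UnterminatedString }

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Keyword(&'static str), // assumption: stand-in for the Keyword enum
    Identifier(String),
    Number(String),
    StringLiteral(String),
    Punct(char),           // assumption: stand-in for the Punct enum
    Flag(String),          // assumption: name without the leading "--"
    Error(LexError),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token { pub kind: TokenKind, pub span: (usize, usize) }

// Hypothetical subset of the reserved words, for illustration only.
const KEYWORDS: &[&str] = &["create", "drop", "add", "table"];

/// Byte length of the longest prefix of `s` whose chars all satisfy `pred`.
fn run_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.find(|c: char| !pred(c)).unwrap_or(s.len())
}

/// Lex a quoted string starting at `rest[0] == '\''`; `''` is an escaped quote.
fn lex_string(rest: &str) -> (TokenKind, usize) {
    let mut payload = String::new();
    let mut chars = rest.char_indices().skip(1).peekable();
    while let Some((off, ch)) = chars.next() {
        if ch != '\'' {
            payload.push(ch);
        } else if matches!(chars.peek(), Some((_, '\''))) {
            chars.next();
            payload.push('\'');
        } else {
            return (TokenKind::StringLiteral(payload), off + 1);
        }
    }
    // No closing quote: the error token covers the rest of the input.
    (TokenKind::Error(LexError::UnterminatedString), rest.len())
}

/// Always returns a token stream; shape errors become `Error(_)` tokens.
pub fn lex(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut i = 0;
    while let Some(c) = input[i..].chars().next() {
        let rest = &input[i..];
        let next = rest[c.len_utf8()..].chars().next();
        let (kind, len) = if matches!(c, ' ' | '\t' | '\r' | '\n') {
            i += 1; // whitespace is skipped, never emitted
            continue;
        } else if c == '\'' {
            lex_string(rest)
        } else if c.is_ascii_digit() || (c == '-' && next.map_or(false, |n| n.is_ascii_digit())) {
            // A '-' is a number sign only when directly followed by an ASCII digit.
            let len = 1 + run_len(&rest[1..], |ch| ch.is_alphanumeric() || ch == '.');
            (TokenKind::Number(rest[..len].to_string()), len)
        } else if rest.starts_with("--") && rest[2..].starts_with(|ch: char| ch.is_ascii_alphabetic()) {
            let len = 2 + run_len(&rest[2..], |ch| ch.is_ascii_alphanumeric() || ch == '-');
            (TokenKind::Flag(rest[2..len].to_string()), len)
        } else if c.is_alphabetic() || c == '_' {
            let len = run_len(rest, |ch| ch.is_alphanumeric() || ch == '_');
            let word = &rest[..len];
            match KEYWORDS.iter().copied().find(|kw| word.eq_ignore_ascii_case(kw)) {
                Some(kw) => (TokenKind::Keyword(kw), len),
                None => (TokenKind::Identifier(word.to_string()), len),
            }
        } else if matches!(c, ':' | '(' | ')' | ',' | '=' | '.') {
            (TokenKind::Punct(c), 1)
        } else {
            // Includes a bare '-': there is no Minus punct today.
            (TokenKind::Error(LexError::UnknownChar), c.len_utf8())
        };
        tokens.push(Token { kind, span: (i, i + len) });
        i += len;
    }
    tokens
}
```

With this sketch, `lex("1:n")` yields `Number("1")`, `Punct(':')`, `Identifier("n")`, and `lex("- 5")` begins with `Error(UnknownChar)` because the bare `-` is not directly followed by a digit. Spans are byte offsets, and every advance is by a whole character's UTF-8 length, so slicing never lands inside a multi-byte sequence.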