rdbms-playground/docs/adr/0020-tokenization-layer-for-the-dsl-parser.md
claude@clouddev1 0e6f767848 docs: ADR-0042 — continue H1a parse-error pedagogy on the grammar tree
ADR-0020/0021 specified a chumsky-based H1a; ADR-0024 replaced chumsky
with the scannerless walker, leaving both obsolete. Mark them superseded
(kept as institutional memory) and add ADR-0042, which restates H1a
against the architecture as built.

ADR-0042 records that H1a is substantially shipped already — per-command
usage block, available-commands fallback, source-derived ident slot
labels, curated parse.custom.* near-miss messages, and schema-aware
[ERR] diagnostics — and defines the remaining work: a verified
per-command near-miss matrix (the definition of done), friendlier
literal expectation labels that add role context while keeping the
exact literal visible, and advanced-mode SQL parse parity (RETURNING
scope, CROSS JOIN ON, INSERT…SELECT count), kept distinct from
ADR-0019 §OOS-2 engine-error sanitisation.

- docs/adr/0020,0021: superseded notes + README entries
- docs/adr/0042: new ADR
- docs/adr/README.md: index upkeep (ADR-0000 rule)
2026-06-03 14:05:09 +00:00


ADR-0020: Tokenization layer for the DSL parser

Status

Superseded by ADR-0024 (2026-05-14). Accepted then superseded without being implemented.

Superseding note (2026-06-03). This ADR was never built. It specifies a chumsky-over-tokens architecture — a separate lexer producing Vec<Token>, a define_keywords! macro, and chumsky grammar combinators consuming &[Token]. ADR-0024 (unified grammar tree) instead adopted a scannerless hand-rolled walker that operates directly on source bytes, and removed chumsky from the project entirely (it is no longer a dependency). The lexer, keyword.rs, and the token model described below do not exist.

What this ADR got right survives in ADR-0024: the expected-set aggregation it wanted (one branch's report no longer swallowing the others) is delivered by the walker's structural expected derivation, and the I3 (completion) / I4 (highlighting) hooks it anticipated are served by the same walker. Read ADR-0024 for the architecture as built; this ADR remains as institutional memory of the path not taken and the reasoning that led there.


Original status (historical): Accepted.

Amends ADR-0001 (language and TUI framework) by adding a tokenization layer between the source string and the chumsky grammar. The chumsky choice itself is unchanged; what changes is the input to chumsky — &[Token] instead of &str.

Foundation for ADR-0021 (parser-as-source-of-truth pedagogy / H1a). Also the load-bearing piece for I3 (tab completion) and I4 (syntax highlighting) once those land.

Context

What the user sees today

Typing create at the prompt produces:

parse error: after `create`, expected `table`

Typing frobulate Customers produces:

parse error: expected `create`, found `frobulate`

Both are technically true and both are bad. The first hides that create is the entry to a single command (create table …) that the user almost certainly does not yet know the shape of. The second points at one of ten possible command-starting keywords and silently picks one. A user hitting either error gets a sharper "what comes next?" answer from a 1980s lex/yacc toolchain than from this app.

Two technical roots

The bad UX traces to two parser-level decisions, both correctable but not without architectural change:

  1. keyword_ci emits Rich::custom errors instead of structural ones. Chumsky's choice combinator merges the expected sets of its alternatives when they all fail — that is exactly what produces "expected data or table" instead of just "expected table". Custom errors don't participate in that merge: only one wins, deterministically the first. So our top-level choice((create_table, drop_*, add_*, …)) collapses to whichever branch reported its custom error first, throwing the others away.

  2. The parser has no concept of a "command" beyond its chumsky combinator graph. There is no place to attach a per-command grammar template (the "what does create table actually want?" answer). Adding one is possible but awkward without an explicit handle on "the entry token for this command".

ADR-0021 addresses (2). This ADR addresses (1) by removing the underlying cause: keyword_ci's try_map → Rich::custom shape only exists because the parser operates on raw characters and has to hand-write keyword recognition. Over a token stream, a keyword is just a token, and just(...) over tokens aggregates naturally.
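The difference between the two error shapes can be illustrated with a toy model. This is a sketch of the behaviour described above, not chumsky's actual types or merge algorithm; `BranchError`, `merge_choice`, and `expected` are hypothetical names introduced only for illustration.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the error of one failed alternative (NOT chumsky's type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BranchError {
    /// Structural failure: "one of these tokens was expected here".
    Expected(BTreeSet<&'static str>),
    /// Opaque hand-written message, as a custom error would carry.
    Custom(String),
}

/// Merge the failures of every alternative of a choice.
/// Structural sets are unioned. As soon as a custom message is seen,
/// it wins and everything else is discarded.
pub fn merge_choice(branches: &[BranchError]) -> BranchError {
    let mut merged = BTreeSet::new();
    for branch in branches {
        match branch {
            BranchError::Custom(msg) => return BranchError::Custom(msg.clone()),
            BranchError::Expected(set) => merged.extend(set.iter().copied()),
        }
    }
    BranchError::Expected(merged)
}

/// Convenience constructor for a structural failure.
pub fn expected(words: &[&'static str]) -> BranchError {
    BranchError::Expected(words.iter().copied().collect())
}
```

With three structural branches (`create`, `drop`, `add`) the merged result names all three; with three custom branches only the first message survives, which is the collapse the user sees today.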

Bespoke machinery in parser.rs

parser.rs carries ~180 lines of error-handling helpers (humanise(), consumed_context(), oxford_or(), describe_pattern(), describe_char(), format_expected_found(), first_custom_message(), the prefer-custom-over-structural selection in into_parse_error()). Most of it exists because chumsky-over-&str produces character-level error patterns that need humanising before a user can read them (RichPattern::Token('e') → "e" is not what the user needed to know).

Over a token stream, the patterns are token kinds — Keyword(Create), Identifier, Punct(Colon) — which render directly via the i18n catalog without per-character translation. Roughly half of that machinery dissolves.

Honest history

ADR-0001 chose Rust + chumsky for the DSL. It did not name "no lexer" as a design choice — the no-lexer shape grew incrementally inside parser.rs without an ADR.

The known requirements at the time included H1a (friendly error layer for parse errors), I3 (tab completion), and I4 (syntax highlighting). All three are easier with a token stream than without one — H1a needs aggregated expected sets, I3 needs "what tokens are valid at cursor", I4 needs token classification on potentially-invalid input. The lexer-shape question should have been considered against these requirements when the parser was built. It wasn't.

The user has previously raised this — pointing out that lex/yacc already handled this kind of error reporting better, and asking why we weren't getting comparable behaviour. That observation was correct on the merits and should have triggered an ADR amendment then. ADR-0020 is that amendment, late but in the right place.

This is not a "we now realise" framing. It is a "we should have decided this earlier and didn't" framing. The cost of acting on it now is real but bounded; the cost of acting on it after the query DSL or constraint-management commands have landed would be substantially larger. We have no users; the agility cost of refactoring is at its lowest.

What is and isn't this ADR

This ADR is purely about the input layer to the DSL parser. It does not specify per-command usage rendering, catalog keys for parse-error wording, or the renderer composition of caret + structural error + usage hint. Those are ADR-0021's scope. The two ADRs share an implementation session; the dependency runs one way (ADR-0021 needs ADR-0020's tokens).

This ADR also does not touch SQL parsing in advanced mode. That path uses sqlparser-rs (per ADR-0001) and has its own tokenization built in. A future ADR for the SQL subset (Q4) will decide whether to share the DSL lexer's token model or keep sqlparser-rs's token surface as-is — that's not prejudged here.

Decision

1. Two-phase parse: lex → parse

pub fn parse_command(input: &str) -> Result<Command, ParseError> {
    let tokens = lex(input);           // Stage 1: infallible (§2); shape errors become Error(_) tokens
    parse_tokens(&tokens, input)       // Stage 2: chumsky over &[Token]
}

Stage 1 (lex) produces a span-tagged token stream. Stage 2 (parse_tokens) is a chumsky parser whose input type is &[Token] instead of &str. The two stages are separately testable; the lexer has its own test surface that doesn't exercise the parser.

parse_tokens takes the original &str as a second argument purely so the parser can consume bare path arguments for the replay command directly from source (see §6). All other parser logic operates over the token slice.

2. Token model

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,           // (start, end) byte offsets
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Keyword(Keyword),         // case-folded reserved word
    Identifier(String),       // case-preserving (ADR-0009)
    Number(String),           // raw text; numeric parse is a parser concern
    StringLiteral(String),    // unquoted, escapes processed
    Punct(Punct),             // : ( ) , = .   (one char each)
    Flag(String),             // --name (e.g. "--all-rows")
    Error(LexError),          // unrecognised char / unterminated string
                              //   — a token kind, not a Result variant
}

// Keyword is a closed set declared via a macro that is the
// single source of truth (§2a). Punct follows the same pattern.

The Keyword and Punct enums are not hand-declared. They come out of macros described in §2a — one declaration site that generates the enum, the lex-side string→variant mapping, the variant→literal rendering, and the catalog-key derivation in one place.

Notes on the model:

  • Keywords are an enum, not a string. This makes kw(Keyword::Create) a single-token chumsky match with exact identity — fastest path, cleanest error patterns, no string allocation in the hot path.
  • Type names stay as identifiers. int, text, serial, varchar all lex as Identifier(_). The parser-level type_keyword() continues to call Type::from_str which produces the existing "unknown type 'varchar' (expected one of: text, int, real, …)" message. That custom-error path is correct and stays. The closed-set Keyword enum is for the grammar's reserved words — words whose presence determines which command is being parsed. Type names are content.
  • Number is raw text, not parsed. Value::Number(String) per ADR-0014 stays string-backed; the lexer doesn't try to validate or convert. 1, -3.14, 1e10, 1abc all produce candidates for the parser to decide on. (The current parser rejects 1abc already; that doesn't change.)
  • StringLiteral is post-escape. The lexer processes the doubled-quote escape ('' → ') per the existing string syntax. The original span covers the surrounding quotes; the payload is the unescaped content. (Same convention as the current string_literal() parser.)
  • Flag(String) is --name exactly. No further parsing. The parser matches Flag(s) and checks s == "all-rows" etc. against a small set.
  • Error(_) is a token kind, not a Result variant. Lex always succeeds in producing Vec<Token> — even on unterminated strings or unrecognised characters, a diagnostic Error token is emitted in place. Rationale: I4 (syntax highlighting) needs to colour partial / invalid input; treating lex errors as data instead of control flow lets the highlighter walk a token stream uniformly. The parser sees Error(_) tokens and raises a structural error that names the underlying cause via the catalog.
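The "lex always succeeds" property can be sketched with a deliberately reduced token model. Everything here is an assumption made for illustration: `Kind` folds keywords and identifiers into one `Word` variant, numbers and flags are omitted, and the `''` string escape is assumed from the existing string syntax. It is not the proposed lexer.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Kind {
    Word(String),          // stands in for Keyword / Identifier
    StringLiteral(String), // payload is the unescaped content
    Punct(char),
    Error(String),         // a token kind, not a Result variant
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tok {
    pub kind: Kind,
    pub span: (usize, usize), // byte offsets into the source
}

/// Infallible lexer sketch: shape errors become `Kind::Error` tokens.
pub fn lex(input: &str) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut chars = input.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        if matches!(c, ' ' | '\t' | '\r' | '\n') {
            chars.next(); // whitespace is skipped, never emitted
        } else if c.is_alphabetic() || c == '_' {
            let mut end = start;
            while let Some(&(i, ch)) = chars.peek() {
                if !(ch.is_alphanumeric() || ch == '_') {
                    break;
                }
                end = i + ch.len_utf8();
                chars.next();
            }
            out.push(Tok { kind: Kind::Word(input[start..end].to_string()), span: (start, end) });
        } else if c == '\'' {
            chars.next(); // opening quote
            let mut text = String::new();
            let mut end = None;
            while let Some((i, ch)) = chars.next() {
                if ch != '\'' {
                    text.push(ch);
                } else if let Some(&(_, '\'')) = chars.peek() {
                    chars.next(); // '' inside a literal is one quote
                    text.push('\'');
                } else {
                    end = Some(i + 1); // closing quote is one byte
                    break;
                }
            }
            out.push(match end {
                Some(e) => Tok { kind: Kind::StringLiteral(text), span: (start, e) },
                None => Tok {
                    kind: Kind::Error("unterminated string".to_string()),
                    span: (start, input.len()),
                },
            });
        } else if ":(),=.".contains(c) {
            chars.next();
            out.push(Tok { kind: Kind::Punct(c), span: (start, start + 1) });
        } else {
            chars.next();
            out.push(Tok {
                kind: Kind::Error(format!("unrecognised character {c:?}")),
                span: (start, start + c.len_utf8()),
            });
        }
    }
    out
}
```

Calling `lex("show 'it''s")` yields a `Word` token followed by an `Error` token that spans the unterminated literal, so a highlighter can still walk the whole input.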

2a. Single source of truth for keywords and punctuation

The Keyword enum, the lexer's reserved-word table, the variant→literal rendering, and the catalog-key derivation all come from one declaration. A define_keywords! macro_rules! invocation in src/dsl/keyword.rs:

macro_rules! define_keywords {
    ( $( $variant:ident => $literal:literal ),+ $(,)? ) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Keyword { $( $variant ),+ }

        impl Keyword {
            pub const ALL: &'static [(Keyword, &'static str)] = &[
                $( (Keyword::$variant, $literal) ),+
            ];

            /// Lex-side mapping. Case-insensitive per ADR-0009.
            pub fn from_word(s: &str) -> Option<Keyword> {
                Self::ALL.iter()
                    .find(|(_, lit)| s.eq_ignore_ascii_case(lit))
                    .map(|(kw, _)| *kw)
            }

            /// Canonical lowercase literal for this variant.
            pub fn as_str(self) -> &'static str {
                Self::ALL.iter()
                    .find(|(kw, _)| *kw == self)
                    .map(|(_, lit)| *lit)
                    .unwrap()
            }

            /// `parse.token.keyword.<lit>` — the catalog key
            /// the renderer looks up for the expected-set
            /// vocabulary (ADR-0021 §4).
            pub fn catalog_token_key(self) -> String {
                format!("parse.token.keyword.{}", self.as_str())
            }
        }
    };
}

define_keywords! {
    Create => "create",
    Drop   => "drop",
    Add    => "add",
    // ... one line per keyword ...
}

Punct follows the same pattern via define_punct! — Colon, OpenParen, CloseParen, Comma, Equals, Dot, each generated from one line of the invocation.

Adding a new keyword is one line in the define_keywords! invocation, plus one line in the en-US YAML under parse.token.keyword.<lit> (the catalog validator catches a missing entry at test time per ADR-0021 §7). Everything else — every parser combinator that references the keyword, every usage-registry entry, every catalog-key lookup — is a use of the source of truth, not a duplicate, and is compile-time-checked by the type system (typos in kw(Keyword::Creaate) don't compile).
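A usage sketch of the declaration above, together with one way the catalog-completeness check might look. The macro body is a trimmed copy of the §2a macro so the block compiles on its own; `missing_catalog_keys` is a hypothetical helper, and the real validator is specified by ADR-0021, not here.

```rust
use std::collections::HashSet;

// Trimmed copy of the §2a macro so this sketch is self-contained.
macro_rules! define_keywords {
    ( $( $variant:ident => $literal:literal ),+ $(,)? ) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Keyword { $( $variant ),+ }

        impl Keyword {
            pub const ALL: &'static [(Keyword, &'static str)] =
                &[ $( (Keyword::$variant, $literal) ),+ ];

            /// Case-insensitive lookup of a reserved word.
            pub fn from_word(s: &str) -> Option<Keyword> {
                Self::ALL.iter()
                    .find(|(_, lit)| s.eq_ignore_ascii_case(lit))
                    .map(|(kw, _)| *kw)
            }

            /// Canonical lowercase literal for this variant.
            pub fn as_str(self) -> &'static str {
                Self::ALL.iter()
                    .find(|(kw, _)| *kw == self)
                    .map(|(_, lit)| *lit)
                    .expect("every variant is listed in ALL")
            }

            pub fn catalog_token_key(self) -> String {
                format!("parse.token.keyword.{}", self.as_str())
            }
        }
    };
}

define_keywords! {
    Create => "create",
    Drop   => "drop",
    Add    => "add",
}

/// Hypothetical completeness check: the catalog keys that the keyword
/// set requires but `catalog_keys` does not contain.
pub fn missing_catalog_keys(catalog_keys: &HashSet<String>) -> Vec<String> {
    Keyword::ALL.iter()
        .map(|(kw, _)| kw.catalog_token_key())
        .filter(|key| !catalog_keys.contains(key))
        .collect()
}
```

`Keyword::from_word("CREATE")` resolves to `Keyword::Create`, and a catalog that lacks the `add` entry makes `missing_catalog_keys` report `parse.token.keyword.add`, which is the kind of failure a test-time validator would turn into a test error.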

3. Span carriage

Every Token carries (start, end) byte offsets in the original source string. The lexer is byte-exact for multi-byte UTF-8 sequences (ASCII-only is the realistic input but the lexer does not panic on Unicode in identifiers or string literals).

ParseError::Invalid continues to carry position: usize, which is the start offset of the failing token (or the end-of-input position when the failure is at EOF). Caret rendering (in app.rs) is unchanged — same byte-position contract.
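The span contract can be captured in two small helpers, sketched here under the assumption that spans are plain `(start, end)` byte pairs. `span_is_well_formed` and `error_position` are hypothetical names, not functions from the codebase.

```rust
/// A span is usable only if it lies inside the source and both ends
/// fall on UTF-8 character boundaries (so slicing cannot panic).
pub fn span_is_well_formed(source: &str, (start, end): (usize, usize)) -> bool {
    start <= end
        && end <= source.len()
        && source.is_char_boundary(start)
        && source.is_char_boundary(end)
}

/// Byte position reported for a failure: the start of the failing token,
/// or the end of input when the failure is at EOF.
pub fn error_position(spans: &[(usize, usize)], failing: usize, source_len: usize) -> usize {
    spans.get(failing).map(|span| span.0).unwrap_or(source_len)
}
```

For the source `drop Zoë` (9 bytes, because `ë` is two bytes in UTF-8), the span `(5, 9)` is well formed while `(5, 8)` cuts the last character in half. A lexer test with one multi-byte identifier is enough to exercise this.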

4. Lexer-vs-parser error split

The lexer is responsible for:

  • Tokenization shape errors: unterminated string literal, unrecognised character. Both surface as TokenKind::Error(_) tokens with span coverage of the offending region.
  • Nothing else. Specifically: type names, command keywords, value validity, flag names, numeric overflow, identifier collisions are all parser- or higher-layer concerns.

The parser is responsible for:

  • Grammar-shape errors: missing tokens, wrong tokens, unexpected tokens, end-of-input mid-command. These come out of chumsky's structural error machinery and aggregate across choice naturally.
  • Content errors via try_map: unknown type name ("unknown type 'varchar'"), mutually-exclusive flags ("--force-conversion and --dont-convert are mutually exclusive"), "with pk needs at least one column", referential-clause repeated. These keep their hand-written messages — the Rich::custom path is correct for content, it was only wrong for keyword-shape.

The render layer (humanise) handles both uniformly via the catalog (ADR-0021).

5. The chumsky-over-tokens combinator surface

Chumsky 0.13 parses arbitrary input types, not just &str. The combinator surface for Parser<'a, &'a [Token], …> is the same as for &str: just, choice, then, then_ignore, ignore_then, or_not, repeated, separated_by, try_map, labelled. The only differences:

  • just(Token { kind: TokenKind::Keyword(Keyword::Create), .. }) doesn't work literally because spans differ. We define helpers kw(Keyword::Create), punct(Punct::Colon), ident(), number(), string_lit(), flag(name) that match by kind and ignore span on the input side, returning either () or the carried payload.
  • padded() (which strips whitespace) is no longer needed because the lexer already skips whitespace. This simplifies many combinator chains (just(':').padded() → punct(Punct::Colon)).
  • text::keyword(...), any().filter(...), etc. — all the character-level helpers — are gone from the parser. They belong to the lexer if anywhere.

The parser keeps producing Rich<Token> errors. The expected set members are RichPattern<Token>; the catalog's parse.token.* keys (ADR-0021 §4) translate them.

6. The replay path-argument wart

The current grammar admits bare paths: replay history.log, replay /tmp/seed.commands. Bare paths contain . and /, which are punctuation tokens. A naive token-based parser would see Keyword(Replay) Identifier("history") Punct(Dot) Identifier("log") and have to reassemble — annoyingly context-dependent.

Decision: keep the existing bare-path UX. The parser special-cases the replay command at one point: after matching Keyword(Replay), it consumes its argument directly from the original source string by reading from the next token's start byte to end-of-input (or to the next unambiguous terminator, of which today there is none in the DSL). The lexer's job for the path argument is essentially to identify where the path starts; the parser does the rest by source-slicing.

The quoted-path form (replay 'my project/seed.commands') still goes through the lexer's normal StringLiteral path and is matched as Keyword(Replay) StringLiteral(s).

This special-casing is documented inline in the parser. It costs ~10 lines of code and avoids a breaking change to the DSL surface. No other DSL command takes a free-form filesystem-path argument; if a future command does, it either uses the same special-case or accepts only quoted paths.
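A minimal sketch of the source-slicing step, assuming the parser already knows the start byte of the first token after `replay`. The function name `replay_path` and the decision to trim trailing whitespace are assumptions for illustration; the ADR only fixes "from the next token's start byte to end-of-input".

```rust
/// Return the bare path argument of `replay`, sliced straight from the
/// source. `next_token_start` is the start byte of the first token after
/// the keyword, or `None` when the keyword is the last token.
pub fn replay_path(source: &str, next_token_start: Option<usize>) -> Option<&str> {
    let start = next_token_start?;
    // `get` returns None instead of panicking on a bad offset.
    let path = source.get(start..)?.trim_end();
    if path.is_empty() {
        None
    } else {
        Some(path)
    }
}
```

For `replay /tmp/seed.commands` the next token starts at byte 7, so the helper returns `/tmp/seed.commands` without the parser ever reassembling `.` and `/` punctuation tokens.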

7. Whitespace

The lexer skips ASCII whitespace ( \t\r\n) between tokens. This honours ADR-0009's "whitespace is liberal" rule unchanged — the user can put any amount of whitespace between any two tokens, and the parser doesn't see it.

The lexer does not track whitespace as tokens. I4 (syntax highlighting) recovers whitespace as the gaps between token spans, computed at render time.

8. Comments

The DSL has no comment syntax today and this ADR doesn't introduce one. If a comment syntax is added later (likely -- line comments to match SQL, but that conflicts with flag prefixes — this is a real design decision a future ADR will need to settle), the lexer is the right layer to skip them. Out of scope here.

9. I3 (tab completion) hook

This ADR commits to making the parser's expected-token-set queryable at any point in the input. Concretely:

  • parse_tokens(&[Token], &str) returns (Result<Command, ParseError>, ParseDiagnostics) where ParseDiagnostics carries the raw chumsky output — including the merged expected set at the failure point.
  • Truncating the token stream at the cursor and re-parsing yields the expected-token-set at the cursor position. I3 uses this for completion.
  • The lexer is independently useful to I3: tokenizing the in-progress input lets the completion logic see what partial token the cursor is in (mid-keyword, mid-identifier, mid-string).

The full I3 surface (cursor positioning, completion menu, disambiguation, completion of identifiers from schema) is out of scope here. ADR-0020 commits only to the parser contract.
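The truncation step might look like the following sketch. `Tok` is a reduced stand-in for the token type and `split_at_cursor` is a hypothetical helper. A token that the cursor touches or sits inside is treated as still being typed, which is an assumption about the eventual I3 design.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tok {
    pub text: String,
    pub span: (usize, usize), // byte offsets
}

/// Split a token stream at a cursor byte offset.
/// Returns the tokens that end strictly before the cursor (the prefix to
/// re-parse for an expected set) and, if present, the token the cursor is
/// inside or directly after (the partial word being completed).
pub fn split_at_cursor(tokens: &[Tok], cursor: usize) -> (&[Tok], Option<&Tok>) {
    let complete = tokens.iter().take_while(|t| t.span.1 < cursor).count();
    let partial = tokens.get(complete).filter(|t| t.span.0 < cursor);
    (&tokens[..complete], partial)
}
```

With `create table` and the cursor at byte 9 (inside `table`), the complete prefix is `[create]` and the partial token is `table`. Re-parsing the prefix alone would then give the expected set at that position.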

10. I4 (syntax highlighting) hook

The token kinds are the natural classification for syntax highlighting:

  • Keyword(_) → keyword colour.
  • Identifier(_) → identifier colour.
  • Number(_), StringLiteral(_) → literal colours (probably distinct from each other).
  • Punct(_) → punctuation colour (or no special colouring).
  • Flag(_) → flag colour.
  • Error(_) → error highlight (red underline / squiggle).

The lexer succeeds on partial / invalid input (§2). I4 does not need a successful parse to render highlighting — only successful tokenization. This is the load-bearing property for "live" highlighting as the user types.

The colour scheme, theme integration, render layer, and performance budget are out of scope here. ADR-0020 commits only to "the lexer always produces a token stream over which highlighting can iterate".
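As an illustration of iterating a token stream for highlighting, the sketch below maps token kinds to style classes and fills the gaps between spans (the skipped whitespace, §7) as plain text. `Kind`, `Style`, `style_for`, and `highlight` are hypothetical names; the actual render layer is left to the I4 ADR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind { Keyword, Identifier, Number, StringLiteral, Punct, Flag, Error }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Style { Keyword, Identifier, NumberLiteral, StringLiteral, Punctuation, Flag, Error, Plain }

/// One style class per token kind.
pub fn style_for(kind: Kind) -> Style {
    match kind {
        Kind::Keyword => Style::Keyword,
        Kind::Identifier => Style::Identifier,
        Kind::Number => Style::NumberLiteral,
        Kind::StringLiteral => Style::StringLiteral,
        Kind::Punct => Style::Punctuation,
        Kind::Flag => Style::Flag,
        Kind::Error => Style::Error,
    }
}

/// Turn a span-tagged token stream into styled runs covering the whole
/// source. Assumes spans are sorted, non-overlapping, and byte-exact.
pub fn highlight<'a>(
    source: &'a str,
    tokens: &[(Kind, (usize, usize))],
) -> Vec<(&'a str, Style)> {
    let mut runs = Vec::new();
    let mut cursor = 0;
    for &(kind, (start, end)) in tokens {
        if start > cursor {
            // Gap between tokens: whitespace the lexer skipped.
            runs.push((&source[cursor..start], Style::Plain));
        }
        runs.push((&source[start..end], style_for(kind)));
        cursor = end;
    }
    if cursor < source.len() {
        runs.push((&source[cursor..], Style::Plain));
    }
    runs
}
```

Because `Error` is just another kind, invalid input such as `drop  @x` still produces a complete list of runs, with only the `@` marked as an error.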

11. Recovery-based partial AST — explicitly deferred

Chumsky supports parser combinators with recovery, producing partial ASTs alongside errors. Useful for I3 ("the user has typed create table Foo with pk and now I want to know what the partial AST is so I can suggest column-spec completions") but the design space is large (which recovery strategies, what the partial AST shape is, how downstream consumers handle it).

This ADR keeps the parser non-recovering: a failed parse returns Err(ParseError) and no partial AST. I3's ADR will decide whether recovery is needed; if so, the change is local to parse_tokens and doesn't ripple.

12. Migration of existing parser tests

The 50+ unit tests in dsl/parser.rs::tests are the spec of the current grammar. The migration is mechanical:

  • Tests that call parse_command(input) keep doing so — parse_command is the public boundary and its signature doesn't change.
  • Tests that assert on ParseError::Invalid { message, .. } may need wording updates if the new error layer rewords them, but anchor substrings ("unknown type", "specified twice", "mutually exclusive", "varchar", "expected one of") stay intact (those come from try_map content errors that survive unchanged).
  • Two existing tests assert on chumsky's structural wording: structural_error_for_show_data_without_arg ("after show data", "expected identifier", "found end of input") and structural_error_for_change_column_with_swapped_args ("after change column Rich", "expected :"). The new rendering preserves the same shape — after <prefix>, expected <set>, found <token> — so these tests should port with at most minor adjustments. ADR-0021 specifies the rendering precisely.

A new test surface for the lexer itself: lex output for representative inputs, lex error tokens for unterminated strings + unknown chars, span correctness.

Out of scope

Deliberately deferred to keep this ADR focused:

  1. Per-command usage templates. ADR-0021.
  2. parse.usage.* and parse.token.* catalog keys. ADR-0021.
  3. Error renderer composition (caret + structural error + usage hint). ADR-0021.
  4. I3 completion UI + cursor logic + identifier completion from schema. Future I3 ADR.
  5. I4 colour scheme + theme integration. Future I4 ADR.
  6. Recovery-based partial AST. §11 — re-opens with I3.
  7. Comment syntax. §8.
  8. Sharing the lexer with sqlparser-rs (advanced-mode SQL). The two parsers stay separate today; a future SQL subset ADR may revisit.

Consequences

Positive

  • Aggregation works. Top-level choice((create_*, drop_*, add_*, …)) failures emit "expected create, drop, add, …" structurally. The user sees the family of available commands rather than one branch's report.
  • The bespoke humanise() machinery shrinks. Roughly half the helpers in parser.rs (~80-100 lines) are no longer needed because token-level error patterns render directly via the catalog. Less code is less code to maintain.
  • I3 / I4 inherit a clean foundation. Their ADRs can focus on UX/UI rather than re-litigating parser shape.
  • Lex errors and parse errors share a render path. Unterminated strings and missing keywords both surface through the same catalog-driven layer.
  • The token stream is a natural API for future tooling: schema-aware highlighting, structural editing, command history rendering with token-level colour.

Costs

  • One-time migration of dsl/parser.rs. Every combinator rewrites against &[Token]. Estimated 600-900 lines of parser.rs change including the lexer module; the lexer itself is probably 200-300 lines plus 150-250 lines of unit tests.
  • Catalog grows by one entry per keyword and per punct. The macro-driven Keyword / Punct enums (§2a) collapse the enum + lex table + as-str + catalog-key derivation into one declaration site, so adding a keyword is one line of Rust + one line of YAML. The catalog validator enforces completeness at test time (ADR-0021 §7). Compared to today — where adding a keyword means a new keyword_ci("...") call site (or several, if used in multiple commands) — the per-keyword cost is comparable; what shifts is where the edit happens. A unit test additionally asserts every Keyword::ALL variant is referenced by some parser combinator, catching dead enum entries.
  • Span model needs care for UTF-8. Byte offsets are the contract; the lexer must split tokens at UTF-8 boundaries. Identifier/string tests should include at least one multi-byte case.
  • replay special-case in the parser (§6). One command, ~10 lines, documented inline. Acceptable; not a precedent for other commands.
  • Test churn. Two structural-error tests need new wording (mechanical port). Existing content-error tests stay as-is.

Neutral

  • chumsky stays. No framework change, no new dependency. The parser still expresses the grammar declaratively; only the input atoms change.
  • Command AST is unchanged. The parser produces the same Command enum it does today; downstream code (runtime, app, db) is untouched.
  • Public API of dsl::parser is unchanged. parse_command(&str) → Result<Command, ParseError> keeps its signature. ADR-0021 may extend the public surface (e.g. expose lex() and parse_tokens() for I3/I4) but does so additively.

Implementation notes

These notes are sketch-level: implementation will surface more detail, but they give a session picking this up enough direction.

Order of operations

  1. Lexer module. New file src/dsl/lexer.rs. Token / TokenKind / Keyword / Punct types. lex(input: &str) → Vec<Token> (always succeeds; embeds Error(_) tokens for shape errors). Unit tests for representative inputs, span correctness, error-token cases, multi-byte UTF-8.
  2. Token-aware combinator helpers. Probably in src/dsl/parser.rs next to the existing combinators. kw(Keyword), punct(Punct), ident(), number(), string_lit(), flag(&str), eof(). Each parses a single token by kind and returns its payload (or ()).
  3. Rewrite command_parser() and its sub-parsers. One sub-parser at a time; run the existing tests after each. Aim for green-after-every-step rather than a big-bang port.
  4. The replay source-slice special case (§6).
  5. Strip humanise() machinery that's no longer needed. The render path in into_parse_error() simplifies. ADR-0021 owns the new render shape; until it lands, keep a minimal humaniser that produces the existing wording.
  6. Public API check. parse_command signature unchanged. Add lex(&str) → Vec<Token> and parse_tokens(&[Token], &str) → Result<Command, ParseError> as pub so I3/I4 can hook in later (their ADRs will use these).

Things that interact subtly

  • The try_map content errors survive unchanged. They fire on tokens (e.g. try_map on a parsed Identifier(s) that's expected to be a type name) but their messages and classification are identical to today. The catalog vocabulary they use ("unknown type", "expected one of", "mutually exclusive", "specified twice") stays.
  • The 1:n cardinality lexes as Number("1"), Punct(Colon), Identifier("n"|"N"). The parser composes these into the relationship-cardinality assertion as today; no special token kind for it.
  • Negative number literals. The punct set above has no Minus, because the current grammar has no use for a unary - outside number literals. Decision: the lexer recognises - as part of a number literal when it is immediately followed by an ASCII digit, producing a single Number("-5") token. A bare - not followed by a digit is treated as Error(UnknownChar) for now; it would become Punct(Minus) only if a Minus variant is added later. This matches the current grammar (which only accepts - as a number sign).
  • The -- flag prefix vs. a future -- line comment. Today --all-rows etc. are flags. If a future ADR introduces SQL-style -- line comments in advanced mode only, the lexer may need a mode parameter. ADR-0020 doesn't pre-empt that.
  • The hard-coded "running: " prefix in app.rs for caret padding: unchanged. The parser still reports a byte-position; the caret math is the same.
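
The negative-number rule from the list above can be sketched as a single decision function. `MinusLex` and `lex_minus` are hypothetical names; the scan that takes letters, digits, and `.` as raw number text is an assumption that follows the "Number is raw text" note in §2, and the `--` case is only detected here, not lexed.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MinusLex<'a> {
    /// '-' directly followed by an ASCII digit: one raw Number token.
    Number(&'a str),
    /// "--": start of a flag, handled by the flag rule elsewhere.
    FlagStart,
    /// Bare '-' with nothing usable after it.
    UnknownChar,
}

/// Decide what a leading '-' means. Returns `None` if `rest` does not
/// start with '-'.
pub fn lex_minus(rest: &str) -> Option<MinusLex<'_>> {
    let after = rest.strip_prefix('-')?;
    if after.starts_with('-') {
        return Some(MinusLex::FlagStart);
    }
    if !after.starts_with(|c: char| c.is_ascii_digit()) {
        return Some(MinusLex::UnknownChar);
    }
    // Take raw text up to the first character that cannot belong to a
    // number candidate; validation stays a parser concern.
    let len = after
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '.'))
        .unwrap_or(after.len());
    Some(MinusLex::Number(&rest[..1 + len]))
}
```

`lex_minus("-5 rows")` yields `Number("-5")`, `lex_minus("- 5")` yields `UnknownChar`, and `lex_minus("--all-rows")` is routed to the flag rule, which keeps the sign handling and the flag prefix from colliding.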