ADR-0020 amends ADR-0001 with a two-phase parse: a lexer
producing a span-tagged token stream, then chumsky over
&[Token]. Single source of truth for keywords and punct via
a define_keywords!/define_punct! macro pattern. Parser
contract committed for I3 (queryable expected-token-set)
and I4 (lexer always succeeds, Error tokens for invalid
input). Includes an honest history note: the no-lexer shape
in dsl/parser.rs arose incrementally without ADR-level
deliberation against the known H1a/I3/I4 requirements; this
ADR corrects that.
ADR-0021 builds on ADR-0020 to close the H1a gap: a
per-command UsageEntry registry keyed off entry-keyword,
with parse errors rendered as caret + structural error +
matching usage template(s). Multi-entry families (add,
drop, show) render together. New catalog sections under
parse.usage.* (per-command grammar) and parse.token.*
(single-token vocabulary). Zero-prefix case ("frobulate
Customers") falls back to an "available commands:" framing.
Anchor-phrase compliance preserved.
ADR-0020: Tokenization layer for the DSL parser
Status
Accepted.
Amends ADR-0001 (language and TUI framework) by adding a
tokenization layer between the source string and the chumsky
grammar. The chumsky choice itself is unchanged; what changes
is the input to chumsky — &[Token] instead of &str.
Foundation for ADR-0021 (parser-as-source-of-truth pedagogy / H1a). Also the load-bearing piece for I3 (tab completion) and I4 (syntax highlighting) once those land.
Context
What the user sees today
Typing create at the prompt produces:
parse error: after `create`, expected `table`
Typing frobulate Customers produces:
parse error: expected `create`, found `frobulate`
Both are technically true and both are bad. The first hides
that create is the entry to a single command (create table …) that the user almost certainly does not yet know
the shape of. The second points at one of ten possible
command-starting keywords and silently picks one. A user
hitting either error gets a sharper "what comes next?" answer
from a 1980s lex/yacc toolchain than from this app.
Two technical roots
The bad UX traces to two parser-level decisions, both correctable but not without architectural change:
1. keyword_ci emits Rich::custom errors instead of structural ones. Chumsky's choice combinator merges the expected sets of its alternatives when they all fail; that is exactly what produces "expected data or table" instead of just "expected table". Custom errors don't participate in that merge: only one wins, deterministically the first. So our top-level choice((create_table, drop_*, add_*, …)) collapses to whichever branch reported its custom error first, throwing the others away.
2. The parser has no concept of a "command" beyond its chumsky combinator graph. There is no place to attach a per-command grammar template (the "what does create table actually want?" answer). Adding one is possible but awkward without an explicit handle on "the entry token for this command".
ADR-0021 addresses (2). This ADR addresses (1) by removing
the underlying cause: keyword_ci's try_map → Rich::custom
shape only exists because the parser operates on raw
characters and has to hand-write keyword recognition. Over a
token stream, a keyword is just a token, and just(...) over
tokens aggregates naturally.
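The difference between the two error shapes can be modelled in a few lines. The sketch below is a toy, not chumsky's actual error representation: the Failure type and the merge and structural functions are hypothetical names invented for illustration. It only shows why structural failures can be unioned across alternatives while a custom message cannot.

```rust
use std::collections::BTreeSet;

/// Toy model of one failed alternative inside a `choice`.
/// Hypothetical type; not chumsky's real error representation.
#[derive(Debug, Clone, PartialEq)]
pub enum Failure {
    /// Structural failure: the tokens that would have been accepted.
    Structural(BTreeSet<&'static str>),
    /// Hand-written message, opaque to any merge.
    Custom(String),
}

/// Combine the failures of all alternatives as the text describes:
/// structural expected-sets are unioned, but the first custom error
/// wins outright and everything else is discarded.
pub fn merge(failures: &[Failure]) -> Failure {
    let mut expected = BTreeSet::new();
    for failure in failures {
        match failure {
            Failure::Custom(msg) => return Failure::Custom(msg.clone()),
            Failure::Structural(set) => expected.extend(set.iter().copied()),
        }
    }
    Failure::Structural(expected)
}

/// Convenience constructor for a structural failure.
pub fn structural(tokens: &[&'static str]) -> Failure {
    Failure::Structural(tokens.iter().copied().collect())
}
```

With structural failures from every branch, merge yields the full family of entry keywords; with custom failures, only the first branch's message survives.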
Bespoke machinery in parser.rs
parser.rs carries ~180 lines of error-handling helpers
(humanise(), consumed_context(), oxford_or(),
describe_pattern(), describe_char(), format_expected_found(),
first_custom_message(), the prefer-custom-over-structural
selection in into_parse_error()). Most of it exists because
chumsky-over-&str produces character-level error patterns
that need humanising before a user can read them
(RichPattern::Token('e') → "e" is not what the user
needed to know).
Over a token stream, the patterns are token kinds — Keyword(Create),
Identifier, Punct(Colon) — which render directly via the
i18n catalog without per-character translation. Roughly half
of that machinery dissolves.
Honest history
ADR-0001 chose Rust + chumsky for the DSL. It did not name
"no lexer" as a design choice — the no-lexer shape grew
incrementally inside parser.rs without an ADR.
The known requirements at the time included H1a (friendly error layer for parse errors), I3 (tab completion), and I4 (syntax highlighting). All three are easier with a token stream than without one — H1a needs aggregated expected sets, I3 needs "what tokens are valid at cursor", I4 needs token classification on potentially-invalid input. The lexer-shape question should have been considered against these requirements when the parser was built. It wasn't.
The user has previously raised this — pointing out that
lex/yacc already handled this kind of error reporting
better, and asking why we weren't getting comparable
behaviour. That observation was correct on the merits and
should have triggered an ADR amendment then. ADR-0020 is that
amendment, late but in the right place.
This is not a "we now realise" framing. It is a "we should have decided this earlier and didn't" framing. The cost of acting on it now is real but bounded; the cost of acting on it after the query DSL or constraint-management commands have landed would be substantially larger. We have no users; the agility cost of refactoring is at its lowest.
What is and isn't this ADR
This ADR is purely about the input layer to the DSL parser. It does not specify per-command usage rendering, catalog keys for parse-error wording, or the renderer composition of caret + structural error + usage hint. Those are ADR-0021's scope. The two ADRs share an implementation session; the dependency runs one way (ADR-0021 needs ADR-0020's tokens).
This ADR also does not touch SQL parsing in advanced mode.
That path uses sqlparser-rs (per ADR-0001) and has its own
tokenization built in. A future ADR for the SQL subset (Q4)
will decide whether to share the DSL lexer's token model or
keep sqlparser-rs's token surface as-is — that's not
prejudged here.
Decision
1. Two-phase parse: lex → parse
pub fn parse_command(input: &str) -> Result<Command, ParseError> {
    // Stage 1: lexing never fails; shape errors arrive as Error tokens (§2).
    let tokens = lex(input);
    // Stage 2: chumsky over the token slice.
    parse_tokens(&tokens, input)
}
Stage 1 (lex) produces a span-tagged token stream. Stage 2
(parse_tokens) is a chumsky parser whose input type is
&[Token] instead of &str. The two stages are separately
testable; the lexer has its own test surface that doesn't
exercise the parser.
parse_tokens takes the original &str as a second
argument purely so the parser can consume bare path arguments
for the replay command directly from source (see §6). All
other parser logic operates over the token slice.
2. Token model
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span, // (start, end) byte offsets
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Keyword(Keyword),      // case-folded reserved word
    Identifier(String),    // case-preserving (ADR-0009)
    Number(String),        // raw text; numeric parse is a parser concern
    StringLiteral(String), // unquoted, escapes processed
    Punct(Punct),          // one of : ( ) , = .  (one char each)
    Flag(String),          // --name (e.g. "--all-rows")
    Error(LexError),       // unrecognised char / unterminated string;
                           // a token kind, not a Result variant
}

// Keyword is a closed set declared via a macro that is the
// single source of truth (§2a). Punct follows the same pattern.
The Keyword and Punct enums are not hand-declared. They
come out of macros described in §2a — one declaration site
that generates the enum, the lex-side string→variant
mapping, the variant→literal rendering, and the catalog-key
derivation in one place.
Notes on the model:
- Keywords are an enum, not a string. This makes kw(Keyword::Create) a single-token chumsky match with exact identity: fastest path, cleanest error patterns, no string allocation in the hot path.
- Type names stay as identifiers. int, text, serial, varchar all lex as Identifier(_). The parser-level type_keyword() continues to call Type::from_str, which produces the existing "unknown type 'varchar' (expected one of: text, int, real, …)" message. That custom-error path is correct and stays. The closed-set Keyword enum is for the grammar's reserved words, meaning words whose presence determines which command is being parsed. Type names are content.
- Number is raw text, not parsed. Value::Number(String) per ADR-0014 stays string-backed; the lexer doesn't try to validate or convert. 1, -3.14, 1e10, 1abc all produce candidates for the parser to decide on. (The current parser rejects 1abc already; that doesn't change.)
- StringLiteral is post-escape. The lexer processes '' → ' per the existing string syntax. The original span covers the surrounding quotes; the payload is the unescaped content. (Same convention as the current string_literal() parser.)
- Flag(String) is --name exactly. No further parsing. The parser matches Flag(s) and checks s == "all-rows" etc. against a small set.
- Error(_) is a token kind, not a Result variant. Lex always succeeds in producing Vec<Token>: even on unterminated strings or unrecognised characters, a diagnostic Error token is emitted in place. Rationale: I4 (syntax highlighting) needs to colour partial / invalid input; treating lex errors as data instead of control flow lets the highlighter walk a token stream uniformly. The parser sees Error(_) tokens and raises a structural error that names the underlying cause via the catalog.
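To make the "lex always succeeds" property concrete, here is a cut-down sketch of such a lexer. It is an illustration under assumptions, not the planned implementation: it knows only two keywords, handles only words and string literals, and lets every other character (digits, punctuation, flags) fall through to the Error arm purely to stay short. The LexError variants UnterminatedString and UnknownChar are assumed names, and the span is written as a plain tuple.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword { Create, Table }

// Assumed variant names; the ADR only says "unrecognised char /
// unterminated string".
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LexError { UnterminatedString, UnknownChar(char) }

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Keyword(Keyword),
    Identifier(String),
    StringLiteral(String),
    Error(LexError),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: (usize, usize), // (start, end) byte offsets
}

/// Always returns a token vector; malformed input becomes Error tokens.
pub fn lex(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        if matches!(c, ' ' | '\t' | '\r' | '\n') {
            chars.next(); // whitespace is skipped, never tokenized
            continue;
        }
        let kind = if c.is_alphabetic() || c == '_' {
            let mut word = String::new();
            while let Some(&(_, ch)) = chars.peek() {
                if !(ch.is_alphanumeric() || ch == '_') {
                    break;
                }
                word.push(ch);
                chars.next();
            }
            // Keywords are case-folded; identifiers keep their case.
            match word.to_ascii_lowercase().as_str() {
                "create" => TokenKind::Keyword(Keyword::Create),
                "table" => TokenKind::Keyword(Keyword::Table),
                _ => TokenKind::Identifier(word),
            }
        } else if c == '\'' {
            chars.next(); // opening quote
            let mut text = String::new();
            let mut closed = false;
            while let Some((_, ch)) = chars.next() {
                if ch != '\'' {
                    text.push(ch);
                } else if matches!(chars.peek(), Some(&(_, '\''))) {
                    chars.next(); // '' is an escaped quote
                    text.push('\'');
                } else {
                    closed = true;
                    break;
                }
            }
            if closed {
                TokenKind::StringLiteral(text)
            } else {
                TokenKind::Error(LexError::UnterminatedString)
            }
        } else {
            chars.next();
            TokenKind::Error(LexError::UnknownChar(c))
        };
        // The token ends where the next unread character begins.
        let end = chars.peek().map_or(input.len(), |&(i, _)| i);
        tokens.push(Token { kind, span: (start, end) });
    }
    tokens
}
```

An unterminated string such as 'it''s yields one Error token covering the rest of the input, so a highlighter can still walk the stream.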
2a. Single source of truth for keywords and punctuation
The Keyword enum, the lexer's reserved-word table, the
variant→literal rendering, and the catalog-key derivation
all come from one declaration: a define_keywords! macro
(written with macro_rules!) in src/dsl/keyword.rs:
macro_rules! define_keywords {
    ( $( $variant:ident => $literal:literal ),+ $(,)? ) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Keyword { $( $variant ),+ }

        impl Keyword {
            pub const ALL: &'static [(Keyword, &'static str)] = &[
                $( (Keyword::$variant, $literal) ),+
            ];

            /// Lex-side mapping. Case-insensitive per ADR-0009.
            pub fn from_word(s: &str) -> Option<Keyword> {
                Self::ALL.iter()
                    .find(|(_, lit)| s.eq_ignore_ascii_case(lit))
                    .map(|(kw, _)| *kw)
            }

            /// Canonical lowercase literal for this variant.
            pub fn as_str(self) -> &'static str {
                // Every variant appears in ALL by construction,
                // so the lookup cannot fail.
                Self::ALL.iter()
                    .find(|(kw, _)| *kw == self)
                    .map(|(_, lit)| *lit)
                    .unwrap()
            }

            /// `parse.token.keyword.<lit>`: the catalog key
            /// the renderer looks up for the expected-set
            /// vocabulary (ADR-0021 §4).
            pub fn catalog_token_key(self) -> String {
                format!("parse.token.keyword.{}", self.as_str())
            }
        }
    };
}

define_keywords! {
    Create => "create",
    Drop => "drop",
    Add => "add",
    // ... one line per keyword ...
}
Punct follows the same pattern via define_punct! —
Colon, OpenParen, CloseParen, Comma, Equals, Dot, each
generated from one line of the invocation.
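The ADR does not spell out define_punct!, so the following is only a guess at its shape, mirroring define_keywords! with a char literal per variant. The method names from_char and as_char are assumptions, and the catalog-key derivation is left out because the key shape for punctuation is not stated here.

```rust
macro_rules! define_punct {
    ( $( $variant:ident => $ch:literal ),+ $(,)? ) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Punct { $( $variant ),+ }

        impl Punct {
            pub const ALL: &'static [(Punct, char)] = &[
                $( (Punct::$variant, $ch) ),+
            ];

            /// Lex-side mapping from a source character (assumed name).
            pub fn from_char(c: char) -> Option<Punct> {
                Self::ALL.iter()
                    .find(|(_, ch)| *ch == c)
                    .map(|(p, _)| *p)
            }

            /// The literal character for this variant (assumed name).
            pub fn as_char(self) -> char {
                match self { $( Punct::$variant => $ch ),+ }
            }
        }
    };
}

// The six variants named in the text, one line each.
define_punct! {
    Colon => ':',
    OpenParen => '(',
    CloseParen => ')',
    Comma => ',',
    Equals => '=',
    Dot => '.',
}
```

A character outside the table, such as -, maps to None, which is where the lexer would fall back to an Error token.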
Adding a new keyword is one line in the
define_keywords! invocation, plus one line in the
en-US YAML under parse.token.keyword.<lit> (the catalog
validator catches a missing entry at test time per ADR-0021
§7). Everything else — every parser combinator that
references the keyword, every usage-registry entry, every
catalog-key lookup — is a use of the source of truth, not
a duplicate, and is compile-time-checked by the type system
(typos in kw(Keyword::Creaate) don't compile).
3. Span carriage
Every Token carries (start, end) byte offsets in the
original source string. The lexer is byte-exact for
multi-byte UTF-8 sequences (ASCII-only is the realistic
input but the lexer does not panic on Unicode in identifiers
or string literals).
ParseError::Invalid continues to carry position: usize,
which is the start offset of the failing token (or the
end-of-input position when the failure is at EOF). Caret
rendering (in app.rs) is unchanged — same byte-position
contract.
4. Lexer-vs-parser error split
The lexer is responsible for:
- Tokenization shape errors: unterminated string literal, unrecognised character. Both surface as TokenKind::Error(_) tokens with span coverage of the offending region.
- Nothing else. Specifically: type names, command keywords, value validity, flag names, numeric overflow, identifier collisions are all parser- or higher-layer concerns.
The parser is responsible for:
- Grammar-shape errors: missing tokens, wrong tokens, unexpected tokens, end-of-input mid-command. These come out of chumsky's structural error machinery and aggregate across choice naturally.
- Content errors via try_map: unknown type name ("unknown type 'varchar'"), mutually-exclusive flags ("--force-conversion and --dont-convert are mutually exclusive"), "with pk needs at least one column", referential-clause repeated. These keep their hand-written messages: the Rich::custom path is correct for content, it was only wrong for keyword-shape.
The render layer (humanise) handles both uniformly via the catalog (ADR-0021).
5. The chumsky-over-tokens combinator surface
Chumsky 0.13 parses arbitrary input types, not just &str.
The combinator surface for Parser<'a, &'a [Token], …> is
the same as for &str: just, choice, then,
then_ignore, ignore_then, or_not, repeated,
separated_by, try_map, labelled. The only differences:
- just(Token { kind: Keyword::Create, .. }) doesn't work literally because spans differ. We define helpers kw(Keyword::Create), punct(Punct::Colon), ident(), number(), string_lit(), flag(name) that match by kind and ignore span on the input side, returning either () or the carried payload.
- padded() (which strips whitespace) is no longer needed because the lexer skips whitespace already. This simplifies a lot of combinator chains (just(':').padded() → punct(Punct::Colon)).
- text::keyword(...), any().filter(...), etc. (all the character-level helpers) are gone from the parser. They belong to the lexer if anywhere.
The parser keeps producing Rich<Token> errors. The
expected set members are RichPattern<Token>; the
catalog's parse.token.* keys (ADR-0021 §4) translate them.
6. The replay path-argument wart
The current grammar admits bare paths: replay history.log,
replay /tmp/seed.commands. Bare paths contain . (a punctuation
token) and / (not in the token vocabulary at all, so it lexes
as an Error token). A naive token-based parser would see
Keyword(Replay) Identifier("history") Punct(Dot) Identifier("log")
and have to reassemble the path from pieces, which is annoyingly
context-dependent.
Decision: keep the existing bare-path UX. The parser
special-cases the replay command at one point: after
matching Keyword(Replay), it consumes its argument
directly from the original source string by reading from
the next token's start byte to end-of-input (or to the next
unambiguous terminator, of which today there is none in the
DSL). The lexer's job for the path argument is essentially
to identify where the path starts; the parser does the
rest by source-slicing.
The quoted-path form (replay 'my project/seed.commands')
still goes through the lexer's normal StringLiteral path
and is matched as Keyword(Replay) StringLiteral(s).
This special-casing is documented inline in the parser. It costs ~10 lines of code and avoids a breaking change to the DSL surface. No other DSL command takes a free-form filesystem-path argument; if a future command does, it either uses the same special-case or accepts only quoted paths.
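The source-slicing step is small enough to sketch. The helper below is hypothetical (replay_path is not a name from the codebase), and trimming trailing whitespace from the path is an assumption the ADR does not state. It takes the start byte of the token that follows replay and returns the rest of the line.

```rust
/// Hypothetical helper: slice the bare path argument out of the
/// original source, starting at the byte where the token after
/// `replay` begins. Returns None if there is no argument or the
/// offset is not a valid char boundary.
pub fn replay_path(source: &str, arg_start: usize) -> Option<&str> {
    // Assumption: trailing whitespace is not part of the path.
    let path = source.get(arg_start..)?.trim_end();
    if path.is_empty() {
        None
    } else {
        Some(path)
    }
}
```

Because the slice comes from the source string, it does not matter how the lexer chopped up the path: only the first token's start offset is used.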
7. Whitespace
The lexer skips ASCII whitespace ( \t\r\n) between tokens.
This honours ADR-0009's "whitespace is liberal" rule
unchanged — the user can put any amount of whitespace
between any two tokens, and the parser doesn't see it.
The lexer does not track whitespace as tokens. I4 (syntax highlighting) recovers whitespace as the gaps between token spans, computed at render time.
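Recovering the gaps is a single pass over the spans. A minimal sketch, assuming spans arrive sorted and non-overlapping as lexer output would; the function name gaps is made up for illustration.

```rust
/// Byte ranges of the source not covered by any token span,
/// in source order. `spans` must be sorted and non-overlapping.
pub fn gaps(spans: &[(usize, usize)], source_len: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut cursor = 0;
    for &(start, end) in spans {
        if start > cursor {
            out.push((cursor, start)); // untokenized bytes before this token
        }
        cursor = end;
    }
    if cursor < source_len {
        out.push((cursor, source_len)); // trailing whitespace
    }
    out
}
```

For the 18-byte input "create  table Foo " with spans (0,6), (8,13), (14,17), this returns (6,8), (13,14), (17,18).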
8. Comments
The DSL has no comment syntax today and this ADR doesn't
introduce one. If a comment syntax is added later (likely
-- line comments to match SQL, but that conflicts with
flag prefixes — this is a real design decision a future ADR
will need to settle), the lexer is the right layer to skip
them. Out of scope here.
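To see where the conflict would bite, it helps to look at how a leading - is classified today. The sketch below follows the rules given in the implementation notes (a - followed by an ASCII digit starts a number, --name is a flag, a bare - is an error) but everything else is assumed: the DashToken type, the lex_dash name, the flag-name character set, and the choice to store the flag name without its -- prefix.

```rust
/// Hypothetical result type for a token that starts with '-'.
#[derive(Debug, PartialEq, Eq)]
pub enum DashToken {
    Number(String),    // raw text including the sign, e.g. "-5"
    Flag(String),      // name without the leading "--" (assumption)
    UnknownChar(char), // a '-' that starts neither a number nor a flag
}

/// Classify the token at the start of `rest`, which must begin with '-'.
/// Returns the token and the number of bytes consumed.
pub fn lex_dash(rest: &str) -> Option<(DashToken, usize)> {
    let after = rest.strip_prefix('-')?;
    if let Some(name_src) = after.strip_prefix('-') {
        // Assumed flag-name alphabet: ASCII letters, digits and '-'.
        let len = name_src
            .bytes()
            .take_while(|b| b.is_ascii_alphanumeric() || *b == b'-')
            .count();
        if name_src.starts_with(|c: char| c.is_ascii_alphabetic()) {
            return Some((DashToken::Flag(name_src[..len].to_string()), 2 + len));
        }
    } else if after.starts_with(|c: char| c.is_ascii_digit()) {
        // Raw candidate text; the parser decides whether it is valid.
        let len = after
            .bytes()
            .take_while(|b| b.is_ascii_alphanumeric() || *b == b'.')
            .count();
        return Some((DashToken::Number(rest[..1 + len].to_string()), 1 + len));
    }
    Some((DashToken::UnknownChar('-'), 1))
}
```

Under these rules "--all-rows" is already claimed by the flag branch, which is why a -- line comment cannot simply be added without a further decision.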
9. I3 (tab completion) hook
This ADR commits to making the parser's expected-token-set queryable at any point in the input. Concretely:
- parse_tokens(&[Token], &str) returns (Result<Command, ParseError>, ParseDiagnostics), where ParseDiagnostics carries the raw chumsky output, including the merged expected set at the failure point.
- Truncating the token stream at the cursor and re-parsing yields the expected-token-set at the cursor position. I3 uses this for completion.
- The lexer is independently useful to I3: tokenizing the in-progress input lets the completion logic see what partial token the cursor is in (mid-keyword, mid-identifier, mid-string).
The full I3 surface (cursor positioning, completion menu, disambiguation, completion of identifiers from schema) is out of scope here. ADR-0020 commits only to the parser contract.
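The truncation step can be sketched without any parser at all. The helper below is hypothetical, with a simplified Token carrying only text and span; it splits a token stream into the tokens that are complete before the cursor and the one token, if any, that the cursor is still inside or touching. The rule that a token ending exactly at the cursor counts as in progress is an assumption.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub text: &'static str,
    pub span: (usize, usize), // (start, end) byte offsets
}

/// Split at a cursor byte offset: the complete tokens to re-parse,
/// plus the token the user is still typing (assumed to include a
/// token whose end coincides with the cursor).
pub fn split_at_cursor(tokens: &[Token], cursor: usize) -> (&[Token], Option<&Token>) {
    // Tokens that end strictly before the cursor are finished.
    let complete = tokens.iter().take_while(|t| t.span.1 < cursor).count();
    // The next token is partial only if it starts before the cursor.
    let partial = tokens.get(complete).filter(|t| t.span.0 < cursor);
    (&tokens[..complete], partial)
}
```

Re-parsing the first slice would give the expected set at the cursor; the partial token, when present, is the prefix to filter completions by.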
10. I4 (syntax highlighting) hook
The token kinds are the natural classification for syntax highlighting:
- Keyword(_) → keyword colour.
- Identifier(_) → identifier colour.
- Number(_), StringLiteral(_) → literal colours (probably distinct from each other).
- Punct(_) → punctuation colour (or no special colouring).
- Flag(_) → flag colour.
- Error(_) → error highlight (red underline / squiggle).
The lexer succeeds on partial / invalid input (§2). I4 does not need a successful parse to render highlighting — only successful tokenization. This is the load-bearing property for "live" highlighting as the user types.
The colour scheme, theme integration, render layer, and performance budget are out of scope here. ADR-0020 commits only to "the lexer always produces a token stream over which highlighting can iterate".
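As a rough picture of what "iterate over the token stream" means for a highlighter, the sketch below maps token kinds to highlight classes. The Highlight enum and both function names are hypothetical, and TokenKind is shown without payloads to keep the example short; the real palette is a decision for the I4 ADR.

```rust
/// Payload-free mirror of the ADR's TokenKind, enough to show the mapping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind { Keyword, Identifier, Number, StringLiteral, Punct, Flag, Error }

/// Hypothetical highlight classes; the real palette is an I4 decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Highlight { Keyword, Identifier, NumberLiteral, StringLiteral, Punctuation, Flag, Error }

/// One class per token kind, following the list above.
pub fn classify(kind: TokenKind) -> Highlight {
    match kind {
        TokenKind::Keyword => Highlight::Keyword,
        TokenKind::Identifier => Highlight::Identifier,
        TokenKind::Number => Highlight::NumberLiteral,
        TokenKind::StringLiteral => Highlight::StringLiteral,
        TokenKind::Punct => Highlight::Punctuation,
        TokenKind::Flag => Highlight::Flag,
        TokenKind::Error => Highlight::Error,
    }
}

/// Pair each token's byte span with its class. No parse is needed,
/// so this works on partial or invalid input.
pub fn highlight(tokens: &[(TokenKind, (usize, usize))]) -> Vec<((usize, usize), Highlight)> {
    tokens.iter().map(|&(kind, span)| (span, classify(kind))).collect()
}
```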
11. Recovery-based partial AST — explicitly deferred
Chumsky supports parser combinators with recovery, producing
partial ASTs alongside errors. Useful for I3 ("the user has
typed create table Foo with pk and now I want to know
what the partial AST is so I can suggest column-spec
completions") but the design space is large (which recovery
strategies, what the partial AST shape is, how downstream
consumers handle it).
This ADR keeps the parser non-recovering: a failed parse
returns Err(ParseError) and no partial AST. I3's ADR will
decide whether recovery is needed; if so, the change is
local to parse_tokens and doesn't ripple.
12. Migration of existing parser tests
The 50+ unit tests in dsl/parser.rs::tests are the spec of
the current grammar. The migration is mechanical:
- Tests that call parse_command(input) keep doing so: parse_command is the public boundary and its signature doesn't change.
- Tests that assert on ParseError::Invalid { message, .. } may need wording updates if the new error layer rewords them, but anchor substrings ("unknown type", "specified twice", "mutually exclusive", "varchar", "expected one of") stay intact (those come from try_map content errors that survive unchanged).
- Two existing tests assert on chumsky's structural wording: structural_error_for_show_data_without_arg ("after show data", "expected identifier", "found end of input") and structural_error_for_change_column_with_swapped_args ("after change column Rich", "expected:"). The new rendering preserves the same shape (after <prefix>, expected <set>, found <token>), so these tests should port with at most minor adjustments. ADR-0021 specifies the rendering precisely.
A new test surface for the lexer itself: lex output for representative inputs, lex error tokens for unterminated strings + unknown chars, span correctness.
Out of scope
Deliberately deferred to keep this ADR focused:
- Per-command usage templates. ADR-0021.
- parse.usage.* and parse.token.* catalog keys. ADR-0021.
- Error renderer composition (caret + structural error + usage hint). ADR-0021.
- I3 completion UI + cursor logic + identifier completion from schema. Future I3 ADR.
- I4 colour scheme + theme integration. Future I4 ADR.
- Recovery-based partial AST. §11 — re-opens with I3.
- Comment syntax. §8.
- Sharing the lexer with sqlparser-rs (advanced-mode SQL). The two parsers stay separate today; a future SQL subset ADR may revisit.
Consequences
Positive
- Aggregation works. Top-level choice((create_*, drop_*, add_*, …)) failures emit "expected create, drop, add, …" structurally. The user sees the family of available commands rather than one branch's report.
- The bespoke humanise() machinery shrinks. Roughly half the helpers in parser.rs (~80-100 lines) are no longer needed because token-level error patterns render directly via the catalog. Less code is less code to maintain.
- I3 / I4 inherit a clean foundation. Their ADRs can focus on UX/UI rather than re-litigating parser shape.
- Lex errors and parse errors share a render path. Unterminated strings and missing keywords both surface through the same catalog-driven layer.
- The token stream is a natural API for future tooling: schema-aware highlighting, structural editing, command history rendering with token-level colour.
Costs
- One-time migration of dsl/parser.rs. Every combinator rewrites against &[Token]. Estimated 600-900 lines of parser.rs change including the lexer module; the lexer itself is probably 200-300 lines plus 150-250 lines of unit tests.
- Catalog grows by one entry per keyword and per punct. The macro-driven Keyword / Punct enums (§2a) collapse the enum + lex table + as-str + catalog-key derivation into one declaration site, so adding a keyword is one line of Rust + one line of YAML. The catalog validator enforces completeness at test time (ADR-0021 §7). Compared to today, where adding a keyword means a new keyword_ci("...") call site (or several, if used in multiple commands), the per-keyword cost is comparable; what shifts is where the edit happens. A unit test additionally asserts every Keyword::ALL variant is referenced by some parser combinator, catching dead enum entries.
- Span model needs care for UTF-8. Byte offsets are the contract; the lexer must split tokens at UTF-8 boundaries. Identifier/string tests should include at least one multi-byte case.
- replay special-case in the parser (§6). One command, ~10 lines, documented inline. Acceptable; not a precedent for other commands.
- Tests churn. Two structural-error tests need new wording (mechanical port). Existing content-error tests stay as-is.
Neutral
- chumsky stays. No framework change, no new dependency. The parser still expresses the grammar declaratively; only the input atoms change.
- Command AST is unchanged. The parser produces the same Command enum it does today; downstream code (runtime, app, db) is untouched.
- Public API of dsl::parser is unchanged. parse_command(&str) → Result<Command, ParseError> keeps its signature. ADR-0021 may extend the public surface (e.g. expose lex() and parse_tokens() for I3/I4) but does so additively.
Implementation notes
These are sketch-level — implementation will produce more detailed work, but they're enough that a session picking this up has direction.
Order of operations
1. Lexer module. New file src/dsl/lexer.rs. Token / TokenKind / Keyword / Punct types. lex(input: &str) → Vec<Token> (always succeeds; embeds Error(_) tokens for shape errors). Unit tests for representative inputs, span correctness, error-token cases, multi-byte UTF-8.
2. Token-aware combinator helpers. Probably in src/dsl/parser.rs next to the existing combinators. kw(Keyword), punct(Punct), ident(), number(), string_lit(), flag(&str), eof(). Each parses a single token by kind and returns its payload (or ()).
3. Rewrite command_parser() and its sub-parsers. One sub-parser at a time; run the existing tests after each. Aim for green-after-every-step rather than a big-bang port.
4. The replay source-slice special case (§6).
5. Strip humanise() machinery that's no longer needed. The render path in into_parse_error() simplifies. ADR-0021 owns the new render shape; until it lands, keep a minimal humaniser that produces the existing wording.
6. Public API check. parse_command signature unchanged. Add lex(&str) → Vec<Token> and parse_tokens(&[Token], &str) → Result<Command, ParseError> as pub so I3/I4 can hook in later (their ADRs will use these).
Things that interact subtly
- The try_map content errors survive unchanged. They fire on tokens (e.g. try_map on a parsed Identifier(s) that's expected to be a type name) but their messages and classification are identical to today. The catalog vocabulary they use ("unknown type", "expected one of", "mutually exclusive", "specified twice") stays.
- The 1:n cardinality lexes as Number("1"), Punct(Colon), Identifier("n"|"N"). The parser composes these into the relationship-cardinality assertion as today; no special token kind for it.
- Negative number literals. Lexer emits Punct(Minus)? No: there is no Minus in the punct set above because the current grammar has no need for a unary - outside number literals. Decision: the lexer recognises - as part of a number literal when followed by an ASCII digit, producing a single Number("-5") token. A bare - not followed by a digit would be a Punct(Minus) only if a Minus variant were added; for now, treat it as Error(UnknownChar). This matches the current grammar (which only accepts - as a number sign).
- The -- flag prefix vs. a future -- line comment. Today --all-rows etc. are flags. If a future ADR introduces SQL-style -- line comments in advanced mode only, the lexer may need a mode parameter. ADR-0020 doesn't pre-empt that.
- The hard-coded "running: " prefix in app.rs for caret padding: unchanged. The parser still reports a byte-position; the caret math is the same.