# ADR-0020: Tokenization layer for the DSL parser

## Status

Accepted.

Amends ADR-0001 (language and TUI framework) by adding a
tokenization layer between the source string and the chumsky
grammar. The chumsky choice itself is unchanged; what changes
is the input to chumsky — `&[Token]` instead of `&str`.

Foundation for ADR-0021 (parser-as-source-of-truth pedagogy /
H1a). Also the load-bearing piece for I3 (tab completion) and
I4 (syntax highlighting) once those land.

## Context

### What the user sees today

Typing `create` at the prompt produces:

```
parse error: after `create`, expected `table`
```

Typing `frobulate Customers` produces:

```
parse error: expected `create`, found `frobulate`
```

Both are technically true and both are bad. The first hides
that `create` is the entry to a single command (`create
table …`) that the user almost certainly does not yet know
the shape of. The second points at one of ten possible
command-starting keywords and silently picks one. A user
hitting either error gets a sharper "what comes next?" answer
from a 1980s `lex`/`yacc` toolchain than from this app.

### Two technical roots

The bad UX traces to two parser-level decisions, both
correctable but not without architectural change:

1. **`keyword_ci` emits `Rich::custom` errors instead of
   structural ones.** Chumsky's `choice` combinator merges the
   `expected` sets of its alternatives when they all fail —
   that is exactly what produces "expected `data` or `table`"
   instead of just "expected `table`". Custom errors don't
   participate in that merge: only one wins, deterministically
   the first. So our top-level `choice((create_table, drop_*,
   add_*, …))` collapses to whichever branch reported its
   custom error first, throwing the others away.

2. **The parser has no concept of a "command" beyond its
   chumsky combinator graph.** There is no place to attach a
   per-command grammar template (the "what does `create
   table` actually want?" answer). Adding one is possible but
   awkward without an explicit handle on "the entry token
   for this command".

ADR-0021 addresses (2). This ADR addresses (1) by removing
the underlying cause: `keyword_ci`'s `try_map → Rich::custom`
shape only exists because the parser operates on raw
characters and has to hand-write keyword recognition. Over a
token stream, a keyword is just a token, and `just(...)` over
tokens aggregates naturally.

### Bespoke machinery in `parser.rs`

`parser.rs` carries ~180 lines of error-handling helpers
(`humanise()`, `consumed_context()`, `oxford_or()`,
`describe_pattern()`, `describe_char()`, `format_expected_found()`,
`first_custom_message()`, the `prefer-custom-over-structural`
selection in `into_parse_error()`). Most of it exists because
chumsky-over-`&str` produces character-level error patterns
that need humanising before a user can read them
(`RichPattern::Token('e')` → "`e`" is not what the user
needed to know).

Over a token stream, the patterns are token kinds — `Keyword(Create)`,
`Identifier`, `Punct(Colon)` — which render directly via the
i18n catalog without per-character translation. Roughly half
of that machinery dissolves.

### Honest history

ADR-0001 chose Rust + chumsky for the DSL. It did not name
"no lexer" as a design choice — the no-lexer shape grew
incrementally inside `parser.rs` without an ADR.

The known requirements at the time included **H1a** (friendly
error layer for parse errors), **I3** (tab completion), and
**I4** (syntax highlighting). All three are easier with a
token stream than without one — H1a needs aggregated expected
sets, I3 needs "what tokens are valid at cursor", I4 needs
token classification on potentially-invalid input. The
lexer-shape question should have been considered against
these requirements when the parser was built. It wasn't.

The user has previously raised this — pointing out that
`lex`/`yacc` already handled this kind of error reporting
better, and asking why we weren't getting comparable
behaviour. That observation was correct on the merits and
should have triggered an ADR amendment then. ADR-0020 is that
amendment, late but in the right place.

This is not a "we now realise" framing. It is a "we should
have decided this earlier and didn't" framing. The cost of
acting on it now is real but bounded; the cost of acting on
it after the query DSL or constraint-management commands have
landed would be substantially larger. We have no users; the
agility cost of refactoring is at its lowest.

### What is and isn't this ADR

This ADR is purely about the **input layer to the DSL
parser**. It does not specify per-command usage rendering,
catalog keys for parse-error wording, or the renderer
composition of caret + structural error + usage hint. Those
are ADR-0021's scope. The two ADRs share an implementation
session; the dependency runs one way (ADR-0021 needs
ADR-0020's tokens).

This ADR also does not touch SQL parsing in advanced mode.
That path uses `sqlparser-rs` (per ADR-0001) and has its own
tokenization built in. A future ADR for the SQL subset (Q4)
will decide whether to share the DSL lexer's token model or
keep `sqlparser-rs`'s token surface as-is — that's not
prejudged here.

## Decision

### 1. Two-phase parse: lex → parse

```rust
pub fn parse_command(input: &str) -> Result<Command, ParseError> {
    let tokens = lex(input)?;          // Stage 1
    parse_tokens(&tokens, input)       // Stage 2
}
```

Stage 1 (`lex`) produces a span-tagged token stream. Stage 2
(`parse_tokens`) is a chumsky parser whose input type is
`&[Token]` instead of `&str`. The two stages are separately
testable; the lexer has its own test surface that doesn't
exercise the parser.

`parse_tokens` takes the original `&str` as a second
argument purely so the parser can consume bare path arguments
for the `replay` command directly from source (see §6). All
other parser logic operates over the token slice.

### 2. Token model

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,           // (start, end) byte offsets
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Keyword(Keyword),         // case-folded reserved word
    Identifier(String),       // case-preserving (ADR-0009)
    Number(String),           // raw text; numeric parse is a parser concern
    StringLiteral(String),    // unquoted, escapes processed
    Punct(Punct),             // : ( ) , = . - one char each
    Flag(String),             // --name (e.g. "--all-rows")
    Error(LexError),          // unrecognised char / unterminated string
                              //   — a token kind, not a Result variant
}

// Keyword is a closed set declared via a macro that is the
// single source of truth (§2a). Punct follows the same pattern.
```

The `Keyword` and `Punct` enums are not hand-declared. They
come out of macros described in §2a — one declaration site
that generates the enum, the lex-side string→variant
mapping, the variant→literal rendering, and the catalog-key
derivation in one place.

Notes on the model:

- **Keywords are an enum, not a string.** This makes
  `kw(Keyword::Create)` a single-token chumsky match with
  exact identity — fastest path, cleanest error patterns,
  no string allocation in the hot path.
- **Type names stay as identifiers.** `int`, `text`,
  `serial`, `varchar` all lex as `Identifier(_)`. The
  parser-level `type_keyword()` continues to call
  `Type::from_str` which produces the existing "unknown type
  'varchar' (expected one of: text, int, real, …)"
  message. That custom-error path is correct and stays. The
  closed-set Keyword enum is for the *grammar's* reserved
  words — words whose presence determines which command is
  being parsed. Type names are *content*.
- **Number is raw text, not parsed.** `Value::Number(String)`
  per ADR-0014 stays string-backed; the lexer doesn't try to
  validate or convert. `1`, `-3.14`, `1e10`, `1abc` all
  produce candidates for the parser to decide on. (The
  current parser rejects `1abc` already; that doesn't change.)
- **`StringLiteral` is post-escape.** The lexer processes
  `''` → `'` per the existing string syntax. The original
  span covers the surrounding quotes; the payload is the
  unescaped content. (Same convention as the current
  `string_literal()` parser.)
- **`Flag(String)` is `--name` exactly.** No further parsing.
  The parser matches `Flag(s)` and checks `s == "all-rows"`
  etc. against a small set.
- **`Error(_)` is a token kind, not a `Result` variant.**
  Lex always succeeds in producing `Vec<Token>` — even on
  unterminated strings or unrecognised characters, a
  diagnostic Error token is emitted in place. Rationale: I4
  (syntax highlighting) needs to colour partial / invalid
  input; treating lex errors as data instead of control flow
  lets the highlighter walk a token stream uniformly. The
  parser sees `Error(_)` tokens and raises a structural
  error that names the underlying cause via the catalog.

### 2a. Single source of truth for keywords and punctuation

The `Keyword` enum, the lexer's reserved-word table, the
variant→literal rendering, and the catalog-key derivation
all come from one declaration. A `define_keywords!`
`macro_rules!` invocation in `src/dsl/keyword.rs`:

```rust
macro_rules! define_keywords {
    ( $( $variant:ident => $literal:literal ),+ $(,)? ) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Keyword { $( $variant ),+ }

        impl Keyword {
            pub const ALL: &'static [(Keyword, &'static str)] = &[
                $( (Keyword::$variant, $literal) ),+
            ];

            /// Lex-side mapping. Case-insensitive per ADR-0009.
            pub fn from_word(s: &str) -> Option<Keyword> {
                Self::ALL.iter()
                    .find(|(_, lit)| s.eq_ignore_ascii_case(lit))
                    .map(|(kw, _)| *kw)
            }

            /// Canonical lowercase literal for this variant.
            pub fn as_str(self) -> &'static str {
                Self::ALL.iter()
                    .find(|(kw, _)| *kw == self)
                    .map(|(_, lit)| *lit)
                    .unwrap()
            }

            /// `parse.token.keyword.<lit>` — the catalog key
            /// the renderer looks up for the expected-set
            /// vocabulary (ADR-0021 §4).
            pub fn catalog_token_key(self) -> String {
                format!("parse.token.keyword.{}", self.as_str())
            }
        }
    };
}

define_keywords! {
    Create => "create",
    Drop   => "drop",
    Add    => "add",
    // ... one line per keyword ...
}
```

`Punct` follows the same pattern via `define_punct!` —
Colon, OpenParen, CloseParen, Comma, Equals, Dot, each
generated from one line of the invocation.

Adding a new keyword is **one line** in the
`define_keywords!` invocation, plus one line in the
en-US YAML under `parse.token.keyword.<lit>` (the catalog
validator catches a missing entry at test time per ADR-0021
§7). Everything else — every parser combinator that
references the keyword, every usage-registry entry, every
catalog-key lookup — is a *use* of the source of truth, not
a duplicate, and is compile-time-checked by the type system
(typos in `kw(Keyword::Creaate)` don't compile).

### 3. Span carriage

Every `Token` carries `(start, end)` byte offsets in the
original source string. The lexer is byte-exact for
multi-byte UTF-8 sequences (ASCII-only is the realistic
input but the lexer does not panic on Unicode in identifiers
or string literals).

`ParseError::Invalid` continues to carry `position: usize`,
which is the `start` offset of the failing token (or the
end-of-input position when the failure is at EOF). Caret
rendering (in `app.rs`) is unchanged — same byte-position
contract.

### 4. Lexer-vs-parser error split

The lexer is responsible for:

- **Tokenization shape errors**: unterminated string literal,
  unrecognised character. Both surface as `TokenKind::Error(_)`
  tokens with span coverage of the offending region.
- **Nothing else.** Specifically: type names, command
  keywords, value validity, flag names, numeric overflow,
  identifier collisions are all parser- or higher-layer
  concerns.

The parser is responsible for:

- **Grammar-shape errors**: missing tokens, wrong tokens,
  unexpected tokens, end-of-input mid-command. These come
  out of chumsky's structural error machinery and aggregate
  across `choice` naturally.
- **Content errors via `try_map`**: unknown type name
  ("unknown type 'varchar'"), mutually-exclusive flags
  ("`--force-conversion` and `--dont-convert` are mutually
  exclusive"), "with pk needs at least one column",
  referential-clause repeated. These keep their hand-written
  messages — the `Rich::custom` path is correct for content,
  it was only wrong for keyword-shape.

The render layer (humanise) handles both uniformly via the
catalog (ADR-0021).

### 5. The chumsky-over-tokens combinator surface

Chumsky 0.13 parses arbitrary input types, not just `&str`.
The combinator surface for `Parser<'a, &'a [Token], …>` is
the same as for `&str`: `just`, `choice`, `then`,
`then_ignore`, `ignore_then`, `or_not`, `repeated`,
`separated_by`, `try_map`, `labelled`. The only differences:

- `just(Token { kind: Keyword::Create, .. })` doesn't work
  literally because spans differ. We define helpers
  `kw(Keyword::Create)`, `punct(Punct::Colon)`, `ident()`,
  `number()`, `string_lit()`, `flag(name)` that match by
  kind and ignore span on the input side, returning either
  `()` or the carried payload.
- `padded()` (which strips whitespace) is no longer needed —
  the lexer skips whitespace already. This simplifies a lot
  of combinator chains (`just(':').padded()` → `punct(Punct::Colon)`).
- `text::keyword(...)`, `any().filter(...)`, etc. — all the
  character-level helpers — are gone from the parser. They
  belong to the lexer if anywhere.

The parser keeps producing `Rich<Token>` errors. The
`expected` set members are `RichPattern<Token>`; the
catalog's `parse.token.*` keys (ADR-0021 §4) translate them.

### 6. The `replay` path-argument wart

The current grammar admits bare paths: `replay history.log`,
`replay /tmp/seed.commands`. Bare paths contain `.` and `/`,
which are punctuation tokens. A naive token-based parser
would see `Keyword(Replay) Identifier("history") Punct(Dot)
Identifier("log")` and have to reassemble — annoyingly
context-dependent.

Decision: keep the existing bare-path UX. The parser
special-cases the `replay` command at one point: after
matching `Keyword(Replay)`, it consumes its argument
**directly from the original source string** by reading from
the next token's start byte to end-of-input (or to the next
unambiguous terminator, of which today there is none in the
DSL). The lexer's job for the path argument is essentially
to identify *where the path starts*; the parser does the
rest by source-slicing.

The quoted-path form (`replay 'my project/seed.commands'`)
still goes through the lexer's normal `StringLiteral` path
and is matched as `Keyword(Replay) StringLiteral(s)`.

This special-casing is documented inline in the parser. It
costs ~10 lines of code and avoids a breaking change to the
DSL surface. No other DSL command takes a free-form
filesystem-path argument; if a future command does, it
either uses the same special-case or accepts only quoted
paths.

### 7. Whitespace

The lexer skips ASCII whitespace (` \t\r\n`) between tokens.
This honours ADR-0009's "whitespace is liberal" rule
unchanged — the user can put any amount of whitespace
between any two tokens, and the parser doesn't see it.

The lexer does **not** track whitespace as tokens. I4 (syntax
highlighting) recovers whitespace as the gaps between token
spans, computed at render time.

### 8. Comments

The DSL has no comment syntax today and this ADR doesn't
introduce one. If a comment syntax is added later (likely
`--` line comments to match SQL, but that conflicts with
flag prefixes — this is a real design decision a future ADR
will need to settle), the lexer is the right layer to skip
them. Out of scope here.

### 9. I3 (tab completion) hook

This ADR commits to making the parser's expected-token-set
queryable at any point in the input. Concretely:

- `parse_tokens(&[Token], &str)` returns
  `(Result<Command, ParseError>, ParseDiagnostics)` where
  `ParseDiagnostics` carries the raw chumsky output —
  including the merged `expected` set at the failure point.
- Truncating the token stream at the cursor and re-parsing
  yields the expected-token-set at the cursor position. I3
  uses this for completion.
- The lexer is independently useful to I3: tokenizing the
  in-progress input lets the completion logic see what
  *partial* token the cursor is in (mid-keyword, mid-identifier,
  mid-string).

The full I3 surface (cursor positioning, completion menu,
disambiguation, completion of identifiers from schema) is
out of scope here. ADR-0020 commits only to the parser
contract.

### 10. I4 (syntax highlighting) hook

The token kinds are the natural classification for syntax
highlighting:

- `Keyword(_)` → keyword colour.
- `Identifier(_)` → identifier colour.
- `Number(_)`, `StringLiteral(_)` → literal colours
  (probably distinct from each other).
- `Punct(_)` → punctuation colour (or no special colouring).
- `Flag(_)` → flag colour.
- `Error(_)` → error highlight (red underline / squiggle).

The lexer succeeds on partial / invalid input (§2). I4 does
not need a successful parse to render highlighting — only
successful tokenization. This is the load-bearing property
for "live" highlighting as the user types.

The colour scheme, theme integration, render layer, and
performance budget are out of scope here. ADR-0020 commits
only to "the lexer always produces a token stream over which
highlighting can iterate".

### 11. Recovery-based partial AST — explicitly deferred

Chumsky supports parser combinators with recovery, producing
partial ASTs alongside errors. Useful for I3 ("the user has
typed `create table Foo with pk` and now I want to know
what the partial AST is so I can suggest column-spec
completions") but the design space is large (which recovery
strategies, what the partial AST shape is, how downstream
consumers handle it).

This ADR keeps the parser non-recovering: a failed parse
returns `Err(ParseError)` and no partial AST. I3's ADR will
decide whether recovery is needed; if so, the change is
local to `parse_tokens` and doesn't ripple.

### 12. Migration of existing parser tests

The 50+ unit tests in `dsl/parser.rs::tests` are the spec of
the current grammar. The migration is mechanical:

- Tests that call `parse_command(input)` keep doing so —
  `parse_command` is the public boundary and its signature
  doesn't change.
- Tests that assert on `ParseError::Invalid { message, ..  }`
  may need wording updates if the new error layer rewords
  them, but anchor substrings ("unknown type", "specified
  twice", "mutually exclusive", "varchar", "expected one
  of") stay intact (those come from `try_map` content errors
  that survive unchanged).
- Two existing tests assert on chumsky's structural wording:
  `structural_error_for_show_data_without_arg`
  ("after `show data`", "expected identifier", "found end of
  input") and `structural_error_for_change_column_with_swapped_args`
  ("after `change column Rich`", "expected `:`"). The new
  rendering preserves the same shape — `after <prefix>,
  expected <set>, found <token>` — so these tests should
  port with at most minor adjustments. ADR-0021 specifies
  the rendering precisely.

A new test surface for the lexer itself: lex output for
representative inputs, lex error tokens for unterminated
strings + unknown chars, span correctness.

## Out of scope

Deliberately deferred to keep this ADR focused:

1. **Per-command usage templates.** ADR-0021.
2. **`parse.usage.*` and `parse.token.*` catalog keys.**
   ADR-0021.
3. **Error renderer composition** (caret + structural error
   + usage hint). ADR-0021.
4. **I3 completion UI** + cursor logic + identifier
   completion from schema. Future I3 ADR.
5. **I4 colour scheme** + theme integration. Future I4 ADR.
6. **Recovery-based partial AST.** §11 — re-opens with I3.
7. **Comment syntax.** §8.
8. **Sharing the lexer with `sqlparser-rs`** (advanced-mode
   SQL). The two parsers stay separate today; a future SQL
   subset ADR may revisit.

## Consequences

### Positive

- **Aggregation works.** Top-level `choice((create_*, drop_*,
  add_*, …))` failures emit "expected `create`, `drop`,
  `add`, …" structurally. The user sees the family of
  available commands rather than one branch's report.
- **The bespoke `humanise()` machinery shrinks.** Roughly
  half the helpers in `parser.rs` (~80-100 lines) are no
  longer needed because token-level error patterns render
  directly via the catalog. Less code is less code to
  maintain.
- **I3 / I4 inherit a clean foundation.** Their ADRs can
  focus on UX/UI rather than re-litigating parser shape.
- **Lex errors and parse errors share a render path.**
  Unterminated strings and missing keywords both surface
  through the same catalog-driven layer.
- **The token stream is a natural API for future tooling**:
  schema-aware highlighting, structural editing, command
  history rendering with token-level colour.

### Costs

- **One-time migration of `dsl/parser.rs`.** Every combinator
  rewrites against `&[Token]`. Estimated 600-900 lines of
  parser.rs change including the lexer module; the lexer
  itself is probably 200-300 lines plus 150-250 lines of
  unit tests.
- **Catalog grows by one entry per keyword and per punct.**
  The macro-driven `Keyword` / `Punct` enums (§2a) collapse
  the enum + lex table + as-str + catalog-key derivation into
  one declaration site, so adding a keyword is one line of
  Rust + one line of YAML. The catalog validator enforces
  completeness at test time (ADR-0021 §7). Compared to today
  — where adding a keyword means a new `keyword_ci("...")`
  call site (or several, if used in multiple commands) — the
  per-keyword cost is comparable; what shifts is *where* the
  edit happens. A unit test additionally asserts every
  `Keyword::ALL` variant is referenced by some parser
  combinator, catching dead enum entries.
- **Span model needs care for UTF-8.** Byte offsets are the
  contract; the lexer must split tokens at UTF-8 boundaries.
  Identifier/string tests should include at least one
  multi-byte case.
- **`replay` special-case** in the parser (§6). One command,
  ~10 lines, documented inline. Acceptable; not a precedent
  for other commands.
- **Tests churn.** Two structural-error tests need new wording
  (mechanical port). Existing content-error tests stay as-is.

### Neutral

- **chumsky stays.** No framework change, no new dependency.
  The parser still expresses the grammar declaratively; only
  the input atoms change.
- **`Command` AST is unchanged.** The parser produces the
  same `Command` enum it does today; downstream code (runtime,
  app, db) is untouched.
- **Public API of `dsl::parser` is unchanged.**
  `parse_command(&str) → Result<Command, ParseError>` keeps
  its signature. ADR-0021 may extend the public surface (e.g.
  expose `lex()` and `parse_tokens()` for I3/I4) but does so
  additively.

## Implementation notes

These are sketch-level — implementation will produce more
detailed work, but they're enough that a session picking
this up has direction.

### Order of operations

1. **Lexer module.** New file `src/dsl/lexer.rs`. Token /
   TokenKind / Keyword / Punct types. `lex(input: &str) →
   Vec<Token>` (always succeeds; embeds `Error(_)` tokens
   for shape errors). Unit tests for representative inputs,
   span correctness, error-token cases, multi-byte UTF-8.
2. **Token-aware combinator helpers.** Probably in
   `src/dsl/parser.rs` next to the existing combinators.
   `kw(Keyword)`, `punct(Punct)`, `ident()`, `number()`,
   `string_lit()`, `flag(&str)`, `eof()`. Each parses a
   single token by kind and returns its payload (or `()`).
3. **Rewrite `command_parser()` and its sub-parsers.** One
   sub-parser at a time; run the existing tests after each.
   Aim for green-after-every-step rather than a big-bang
   port.
4. **The `replay` source-slice special case** (§6).
5. **Strip `humanise()` machinery** that's no longer needed.
   The render path in `into_parse_error()` simplifies.
   ADR-0021 owns the new render shape; until it lands, keep
   a minimal humaniser that produces the existing wording.
6. **Public API check.** `parse_command` signature unchanged.
   Add `lex(&str) → Vec<Token>` and `parse_tokens(&[Token],
   &str) → Result<Command, ParseError>` as `pub` so I3/I4
   can hook in later (their ADRs will use these).

### Things that interact subtly

- **The `try_map` content errors** survive unchanged. They
  fire on tokens (e.g. `try_map` on a parsed `Identifier(s)`
  that's expected to be a type name) but their messages and
  classification are identical to today. The catalog
  vocabulary they use ("unknown type", "expected one of",
  "mutually exclusive", "specified twice") stays.
- **The `1:n` cardinality** lexes as `Number("1")`,
  `Punct(Colon)`, `Identifier("n"|"N")`. The parser composes
  these into the relationship-cardinality assertion as
  today; no special token kind for it.
- **Negative number literals.** Lexer emits `Punct(Minus)`?
  No — there is no Minus in the punct set above because the
  current grammar has no need for a unary `-` outside number
  literals. Decision: the lexer recognises `-` as part of a
  number literal when followed by an ASCII digit, producing a
  single `Number("-5")` token. A bare `-` not followed by a
  digit is a `Punct(Minus)` only if a `Minus` variant is
  added — for now, treat as `Error(UnknownChar)`. This
  matches the current grammar (which only accepts `-` as a
  number sign).
- **The `--` flag prefix vs. a future `--` line comment.**
  Today `--all-rows` etc. are flags. If a future ADR
  introduces SQL-style `--` line comments in advanced mode
  only, the lexer may need a mode parameter. ADR-0020 doesn't
  pre-empt that.
- **The hard-coded `"running: "` prefix in `app.rs`** for
  caret padding: unchanged. The parser still reports a
  byte-position; the caret math is the same.