# ADR-0020: Tokenization layer for the DSL parser

## Status

**Superseded by ADR-0024** (2026-05-14). Accepted then superseded
without being implemented.

> **Superseding note (2026-06-03).** This ADR was never built. It
> specifies a `chumsky`-over-tokens architecture: a separate lexer
> producing `Vec<Token>`, a `define_keywords!` macro, and chumsky
> grammar combinators consuming `&[Token]`. ADR-0024 (unified grammar
> tree) instead adopted a **scannerless hand-rolled walker** that
> operates directly on source bytes, and **removed chumsky from the
> project entirely** (it is no longer a dependency). The lexer,
> `keyword.rs`, and the token model described below do not exist.
>
> What this ADR got *right* survives in ADR-0024: the
> expected-set aggregation it wanted (one branch's report no longer
> swallowing the others) is delivered by the walker's structural
> `expected` derivation, and the I3 (completion) / I4 (highlighting)
> hooks it anticipated are served by the same walker. Read ADR-0024
> for the architecture as built; this ADR remains as institutional
> memory of the path not taken and the reasoning that led there.

---

*Original status (historical):* Accepted.

Amends ADR-0001 (language and TUI framework) by adding a
tokenization layer between the source string and the chumsky
grammar. The chumsky choice itself is unchanged; what changes
is the input to chumsky: `&[Token]` instead of `&str`.

Foundation for ADR-0021 (parser-as-source-of-truth pedagogy /
H1a). Also the load-bearing piece for I3 (tab completion) and
I4 (syntax highlighting) once those land.

## Context

### What the user sees today

Typing `create` at the prompt produces:

```
parse error: after `create`, expected `table`
```

Typing `frobulate Customers` produces:

```
parse error: expected `create`, found `frobulate`
```

Both are technically true and both are bad. The first hides
that `create` is the entry to a single command (`create
table …`) that the user almost certainly does not yet know
the shape of. The second points at one of ten possible
command-starting keywords and silently picks one. A user
hitting either error gets a sharper "what comes next?" answer
from a 1980s `lex`/`yacc` toolchain than from this app.

### Two technical roots

The bad UX traces to two parser-level decisions, both
correctable but not without architectural change:

1. **`keyword_ci` emits `Rich::custom` errors instead of
   structural ones.** Chumsky's `choice` combinator merges the
   `expected` sets of its alternatives when they all fail;
   that is exactly what produces "expected `data` or `table`"
   instead of just "expected `table`". Custom errors don't
   participate in that merge: only one wins, deterministically
   the first. So our top-level `choice((create_table, drop_*,
   add_*, …))` collapses to whichever branch reported its
   custom error first, throwing the others away.

2. **The parser has no concept of a "command" beyond its
   chumsky combinator graph.** There is no place to attach a
   per-command grammar template (the "what does `create
   table` actually want?" answer). Adding one is possible but
   awkward without an explicit handle on "the entry token
   for this command".

ADR-0021 addresses (2). This ADR addresses (1) by removing
the underlying cause: `keyword_ci`'s `try_map → Rich::custom`
shape only exists because the parser operates on raw
characters and has to hand-write keyword recognition. Over a
token stream, a keyword is just a token, and `just(...)` over
tokens aggregates naturally.

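The merge behaviour behind root (1) can be modelled without chumsky. The following is a minimal sketch, assuming a simplified `Failure` type (hypothetical, not chumsky's `Rich`): structural failures at the same position union their `expected` sets, while custom failures cannot be merged, so one is kept and the other is lost. `merge` and `expected` are illustrative names only.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for a parser failure (hypothetical; not chumsky's `Rich`).
#[derive(Debug, Clone, PartialEq)]
enum Failure {
    /// Structural: "expected one of these words at this position".
    Expected(BTreeSet<&'static str>),
    /// Custom: an opaque, pre-rendered message.
    Custom(String),
}

/// Combine the failures of two alternatives that failed at the same position.
fn merge(a: Failure, b: Failure) -> Failure {
    match (a, b) {
        // Structural failures union their expected sets.
        (Failure::Expected(mut x), Failure::Expected(y)) => {
            x.extend(y);
            Failure::Expected(x)
        }
        // A custom failure cannot be merged: the first one wins.
        (first @ Failure::Custom(_), _) => first,
        // Structural vs. custom: this sketch prefers the custom message.
        (_, second @ Failure::Custom(_)) => second,
    }
}

/// Convenience constructor for a structural failure.
fn expected(words: &[&'static str]) -> Failure {
    Failure::Expected(words.iter().copied().collect())
}
```

Feeding two structural failures through `merge` yields the union of their sets, which is the aggregation this ADR wants from keyword matching. Two custom failures yield only the first message.
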
### Bespoke machinery in `parser.rs`

`parser.rs` carries ~180 lines of error-handling helpers
(`humanise()`, `consumed_context()`, `oxford_or()`,
`describe_pattern()`, `describe_char()`, `format_expected_found()`,
`first_custom_message()`, the `prefer-custom-over-structural`
selection in `into_parse_error()`). Most of it exists because
chumsky-over-`&str` produces character-level error patterns
that need humanising before a user can read them
(`RichPattern::Token('e')` → "`e`" is not what the user
needed to know).

Over a token stream, the patterns are token kinds (`Keyword(Create)`,
`Identifier`, `Punct(Colon)`), which render directly via the
i18n catalog without per-character translation. Roughly half
of that machinery dissolves.

### Honest history

ADR-0001 chose Rust + chumsky for the DSL. It did not name
"no lexer" as a design choice; the no-lexer shape grew
incrementally inside `parser.rs` without an ADR.

The known requirements at the time included **H1a** (friendly
error layer for parse errors), **I3** (tab completion), and
**I4** (syntax highlighting). All three are easier with a
token stream than without one: H1a needs aggregated expected
sets, I3 needs "what tokens are valid at cursor", I4 needs
token classification on potentially-invalid input. The
lexer-shape question should have been considered against
these requirements when the parser was built. It wasn't.

The user has previously raised this, pointing out that
`lex`/`yacc` already handled this kind of error reporting
better, and asking why we weren't getting comparable
behaviour. That observation was correct on the merits and
should have triggered an ADR amendment then. ADR-0020 is that
amendment, late but in the right place.

This is not a "we now realise" framing. It is a "we should
have decided this earlier and didn't" framing. The cost of
acting on it now is real but bounded; the cost of acting on
it after the query DSL or constraint-management commands have
landed would be substantially larger. We have no users; the
agility cost of refactoring is at its lowest.

### What is and isn't this ADR

This ADR is purely about the **input layer to the DSL
parser**. It does not specify per-command usage rendering,
catalog keys for parse-error wording, or the renderer
composition of caret + structural error + usage hint. Those
are ADR-0021's scope. The two ADRs share an implementation
session; the dependency runs one way (ADR-0021 needs
ADR-0020's tokens).

This ADR also does not touch SQL parsing in advanced mode.
That path uses `sqlparser-rs` (per ADR-0001) and has its own
tokenization built in. A future ADR for the SQL subset (Q4)
will decide whether to share the DSL lexer's token model or
keep `sqlparser-rs`'s token surface as-is; that's not
prejudged here.

## Decision

### 1. Two-phase parse: lex → parse

```rust
pub fn parse_command(input: &str) -> Result<Command, ParseError> {
    let tokens = lex(input);        // Stage 1: always succeeds (§2)
    parse_tokens(&tokens, input)    // Stage 2
}
```

Stage 1 (`lex`) produces a span-tagged token stream. Stage 2
(`parse_tokens`) is a chumsky parser whose input type is
`&[Token]` instead of `&str`. The two stages are separately
testable; the lexer has its own test surface that doesn't
exercise the parser.

`parse_tokens` takes the original `&str` as a second
argument purely so the parser can consume bare path arguments
for the `replay` command directly from source (see §6). All
other parser logic operates over the token slice.

### 2. Token model

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: Span,         // (start, end) byte offsets
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Keyword(Keyword),       // case-folded reserved word
    Identifier(String),     // case-preserving (ADR-0009)
    Number(String),         // raw text; numeric parse is a parser concern
    StringLiteral(String),  // unquoted, escapes processed
    Punct(Punct),           // : ( ) , = .  (one char each)
    Flag(String),           // --name (e.g. "--all-rows")
    Error(LexError),        // unrecognised char / unterminated string;
                            // a token kind, not a Result variant
}

// Keyword is a closed set declared via a macro that is the
// single source of truth (§2a). Punct follows the same pattern.
```

The `Keyword` and `Punct` enums are not hand-declared. They
come out of macros described in §2a: one declaration site
that generates the enum, the lex-side string→variant
mapping, the variant→literal rendering, and the catalog-key
derivation in one place.

Notes on the model:

- **Keywords are an enum, not a string.** This makes
  `kw(Keyword::Create)` a single-token chumsky match with
  exact identity: fastest path, cleanest error patterns,
  no string allocation in the hot path.
- **Type names stay as identifiers.** `int`, `text`,
  `serial`, `varchar` all lex as `Identifier(_)`. The
  parser-level `type_keyword()` continues to call
  `Type::from_str` which produces the existing "unknown type
  'varchar' (expected one of: text, int, real, …)"
  message. That custom-error path is correct and stays. The
  closed-set Keyword enum is for the *grammar's* reserved
  words, the words whose presence determines which command is
  being parsed. Type names are *content*.
- **Number is raw text, not parsed.** `Value::Number(String)`
  per ADR-0014 stays string-backed; the lexer doesn't try to
  validate or convert. `1`, `-3.14`, `1e10`, `1abc` all
  produce candidates for the parser to decide on. (The
  current parser rejects `1abc` already; that doesn't change.)
- **`StringLiteral` is post-escape.** The lexer processes
  `''` → `'` per the existing string syntax. The original
  span covers the surrounding quotes; the payload is the
  unescaped content. (Same convention as the current
  `string_literal()` parser.)
- **`Flag(String)` is `--name` exactly.** No further parsing.
  The parser matches `Flag(s)` and checks `s == "all-rows"`
  etc. against a small set.
- **`Error(_)` is a token kind, not a `Result` variant.**
  Lex always succeeds in producing `Vec<Token>`; even on
  unterminated strings or unrecognised characters, a
  diagnostic Error token is emitted in place. Rationale: I4
  (syntax highlighting) needs to colour partial / invalid
  input; treating lex errors as data instead of control flow
  lets the highlighter walk a token stream uniformly. The
  parser sees `Error(_)` tokens and raises a structural
  error that names the underlying cause via the catalog.

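A minimal sketch of an always-succeeding lexer along these lines, under stated assumptions: keyword classification is omitted (`Word` stands in for both `Keyword` and `Identifier`), `Punct` carries a raw `char`, spans are plain `(usize, usize)` tuples, and the `LexError` variants are a guess at a plausible shape. None of this is the project's code; it only illustrates `Error` as a token kind, the `''` escape, and `-` as a number sign.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LexError {
    UnknownChar(char),
    UnterminatedString,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Word(String), // keyword or identifier; classification omitted here
    Number(String),
    StringLiteral(String),
    Punct(char),
    Flag(String),
    Error(LexError),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub kind: TokenKind,
    pub span: (usize, usize), // (start, end) byte offsets
}

/// Always returns a token stream; shape errors become `Error` tokens.
pub fn lex(input: &str) -> Vec<Token> {
    let chars: Vec<(usize, char)> = input.char_indices().collect();
    // Byte offset of the char at index `k`, or end of input.
    let offset = |k: usize| chars.get(k).map_or(input.len(), |&(b, _)| b);
    let peek = |k: usize| chars.get(k).map(|&(_, c)| c);
    let mut tokens = Vec::new();
    let mut i = 0;
    while let Some(c) = peek(i) {
        let start = offset(i);
        if matches!(c, ' ' | '\t' | '\r' | '\n') {
            i += 1;
            continue;
        }
        let next = peek(i + 1);
        i += 1; // consume `c`
        let kind = if c == '\'' {
            // Quoted string; `''` inside is an escaped single quote.
            let mut text = String::new();
            let mut closed = false;
            while let Some(ch) = peek(i) {
                i += 1;
                if ch != '\'' {
                    text.push(ch);
                } else if peek(i) == Some('\'') {
                    text.push('\'');
                    i += 1;
                } else {
                    closed = true;
                    break;
                }
            }
            if closed {
                TokenKind::StringLiteral(text)
            } else {
                TokenKind::Error(LexError::UnterminatedString)
            }
        } else if c == '-' && next == Some('-') {
            i += 1; // consume the second `-`
            let from = i;
            while peek(i).map_or(false, |ch| ch.is_alphanumeric() || ch == '-') {
                i += 1;
            }
            TokenKind::Flag(input[offset(from)..offset(i)].to_string())
        } else if c.is_ascii_digit() || (c == '-' && next.map_or(false, |n| n.is_ascii_digit())) {
            // Raw number candidate; `1abc` is kept for the parser to reject.
            while peek(i).map_or(false, |ch| ch.is_alphanumeric() || ch == '.') {
                i += 1;
            }
            TokenKind::Number(input[start..offset(i)].to_string())
        } else if c.is_alphabetic() || c == '_' {
            while peek(i).map_or(false, |ch| ch.is_alphanumeric() || ch == '_') {
                i += 1;
            }
            TokenKind::Word(input[start..offset(i)].to_string())
        } else if ":(),=.".contains(c) {
            TokenKind::Punct(c)
        } else {
            TokenKind::Error(LexError::UnknownChar(c))
        };
        tokens.push(Token { kind, span: (start, offset(i)) });
    }
    tokens
}
```

Because `lex` returns `Vec<Token>` rather than `Result`, a caller such as a highlighter can iterate the output of `lex("x 'abc")` without a separate error path; the unterminated string simply appears as one `Error` token covering the rest of the input.
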
### 2a. Single source of truth for keywords and punctuation

The `Keyword` enum, the lexer's reserved-word table, the
variant→literal rendering, and the catalog-key derivation
all come from one declaration. A `define_keywords!`
`macro_rules!` invocation in `src/dsl/keyword.rs`:

```rust
macro_rules! define_keywords {
    ( $( $variant:ident => $literal:literal ),+ $(,)? ) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        pub enum Keyword { $( $variant ),+ }

        impl Keyword {
            pub const ALL: &'static [(Keyword, &'static str)] = &[
                $( (Keyword::$variant, $literal) ),+
            ];

            /// Lex-side mapping. Case-insensitive per ADR-0009.
            pub fn from_word(s: &str) -> Option<Keyword> {
                Self::ALL.iter()
                    .find(|(_, lit)| s.eq_ignore_ascii_case(lit))
                    .map(|(kw, _)| *kw)
            }

            /// Canonical lowercase literal for this variant.
            pub fn as_str(self) -> &'static str {
                Self::ALL.iter()
                    .find(|(kw, _)| *kw == self)
                    .map(|(_, lit)| *lit)
                    // Cannot fail: the macro lists every variant in ALL.
                    .unwrap()
            }

            /// `parse.token.keyword.<lit>`: the catalog key
            /// the renderer looks up for the expected-set
            /// vocabulary (ADR-0021 §4).
            pub fn catalog_token_key(self) -> String {
                format!("parse.token.keyword.{}", self.as_str())
            }
        }
    };
}

define_keywords! {
    Create => "create",
    Drop => "drop",
    Add => "add",
    // ... one line per keyword ...
}
```

`Punct` follows the same pattern via `define_punct!`:
Colon, OpenParen, CloseParen, Comma, Equals, Dot, each
generated from one line of the invocation.

Adding a new keyword is **one line** in the
`define_keywords!` invocation, plus one line in the
en-US YAML under `parse.token.keyword.<lit>` (the catalog
validator catches a missing entry at test time per ADR-0021
§7). Everything else (every parser combinator that
references the keyword, every usage-registry entry, every
catalog-key lookup) is a *use* of the source of truth, not
a duplicate, and is compile-time-checked by the type system
(typos in `kw(Keyword::Creaate)` don't compile).

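The completeness check itself belongs to ADR-0021 §7 and is not specified here. As a rough sketch of the idea only, assuming a hand-written `ALL` table standing in for the macro output and a catalog modelled as a set of key strings (both hypothetical):

```rust
use std::collections::HashSet;

/// Hand-written stand-in for the literals `define_keywords!` would generate.
const ALL: &[&str] = &["create", "drop", "add"];

/// Same key shape as `Keyword::catalog_token_key` in the macro above.
fn catalog_token_key(literal: &str) -> String {
    format!("parse.token.keyword.{}", literal)
}

/// Keys the keyword table requires but the catalog does not contain.
fn missing_catalog_keys(catalog: &HashSet<String>) -> Vec<String> {
    ALL.iter()
        .map(|lit| catalog_token_key(lit))
        .filter(|key| !catalog.contains(key))
        .collect()
}
```

A test would then assert that `missing_catalog_keys` returns an empty list for the shipped catalog, so a keyword added without its YAML line fails at test time rather than at render time.
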
### 3. Span carriage

Every `Token` carries `(start, end)` byte offsets in the
original source string. The lexer is byte-exact for
multi-byte UTF-8 sequences (ASCII-only is the realistic
input but the lexer does not panic on Unicode in identifiers
or string literals).

`ParseError::Invalid` continues to carry `position: usize`,
which is the `start` offset of the failing token (or the
end-of-input position when the failure is at EOF). Caret
rendering (in `app.rs`) is unchanged: same byte-position
contract.

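How a byte `position` becomes a caret column is not spelled out in this ADR. One possible sketch, assuming one display column per `char` and a caller-supplied prefix width (the function name and both assumptions are illustrative, and this is not a description of what `app.rs` does):

```rust
/// Build the caret line for an error at byte offset `position` in `input`.
/// `prefix_width` is the width of whatever is printed before the echoed input.
fn caret_line(prefix_width: usize, input: &str, position: usize) -> String {
    // Clamp to the input, then back up to a char boundary so slicing cannot panic.
    let mut pos = position.min(input.len());
    while !input.is_char_boundary(pos) {
        pos -= 1;
    }
    // Count characters rather than bytes so multi-byte input still lines up.
    let columns = input[..pos].chars().count();
    format!("{}^", " ".repeat(prefix_width + columns))
}
```

For ASCII-only input the character count equals the byte offset, so this agrees with a pure byte-position caret; the two only diverge once multi-byte characters precede the error.
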
### 4. Lexer-vs-parser error split

The lexer is responsible for:

- **Tokenization shape errors**: unterminated string literal,
  unrecognised character. Both surface as `TokenKind::Error(_)`
  tokens with span coverage of the offending region.
- **Nothing else.** Specifically: type names, command
  keywords, value validity, flag names, numeric overflow,
  identifier collisions are all parser- or higher-layer
  concerns.

The parser is responsible for:

- **Grammar-shape errors**: missing tokens, wrong tokens,
  unexpected tokens, end-of-input mid-command. These come
  out of chumsky's structural error machinery and aggregate
  across `choice` naturally.
- **Content errors via `try_map`**: unknown type name
  ("unknown type 'varchar'"), mutually-exclusive flags
  ("`--force-conversion` and `--dont-convert` are mutually
  exclusive"), "with pk needs at least one column",
  referential-clause repeated. These keep their hand-written
  messages: the `Rich::custom` path is correct for content,
  it was only wrong for keyword-shape.

The render layer (humanise) handles both uniformly via the
catalog (ADR-0021).

### 5. The chumsky-over-tokens combinator surface

Chumsky 0.13 parses arbitrary input types, not just `&str`.
The combinator surface for `Parser<'a, &'a [Token], …>` is
the same as for `&str`: `just`, `choice`, `then`,
`then_ignore`, `ignore_then`, `or_not`, `repeated`,
`separated_by`, `try_map`, `labelled`. The only differences:

- `just(Token { kind: Keyword::Create, .. })` doesn't work
  literally because spans differ. We define helpers
  `kw(Keyword::Create)`, `punct(Punct::Colon)`, `ident()`,
  `number()`, `string_lit()`, `flag(name)` that match by
  kind and ignore span on the input side, returning either
  `()` or the carried payload.
- `padded()` (which strips whitespace) is no longer needed
  because the lexer skips whitespace already. This simplifies
  a lot of combinator chains (`just(':').padded()` →
  `punct(Punct::Colon)`).
- `text::keyword(...)`, `any().filter(...)`, etc. (all the
  character-level helpers) are gone from the parser. They
  belong to the lexer if anywhere.

The parser keeps producing `Rich<Token>` errors. The
`expected` set members are `RichPattern<Token>`; the
catalog's `parse.token.*` keys (ADR-0021 §4) translate them.

### 6. The `replay` path-argument wart

The current grammar admits bare paths: `replay history.log`,
`replay /tmp/seed.commands`. Bare paths contain `.`, which is
a punctuation token, and `/`, which is not in the punct set
at all. A naive token-based parser
would see `Keyword(Replay) Identifier("history") Punct(Dot)
Identifier("log")` and have to reassemble, which is annoyingly
context-dependent.

Decision: keep the existing bare-path UX. The parser
special-cases the `replay` command at one point: after
matching `Keyword(Replay)`, it consumes its argument
**directly from the original source string** by reading from
the next token's start byte to end-of-input (or to the next
unambiguous terminator, of which today there is none in the
DSL). The lexer's job for the path argument is essentially
to identify *where the path starts*; the parser does the
rest by source-slicing.

The quoted-path form (`replay 'my project/seed.commands'`)
still goes through the lexer's normal `StringLiteral` path
and is matched as `Keyword(Replay) StringLiteral(s)`.

This special-casing is documented inline in the parser. It
costs ~10 lines of code and avoids a breaking change to the
DSL surface. No other DSL command takes a free-form
filesystem-path argument; if a future command does, it
either uses the same special-case or accepts only quoted
paths.

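The source-slicing step might look roughly like this. `replay_path` is a hypothetical name, the caller is assumed to pass the start byte of the first token after `replay` (if any), and trimming trailing whitespace is an assumption this ADR does not state:

```rust
/// Slice the bare path argument of `replay` straight from the source.
/// `next_token_start` is the start byte of the token following the keyword,
/// or `None` when the input ends right after `replay`.
fn replay_path(source: &str, next_token_start: Option<usize>) -> Option<&str> {
    let start = next_token_start?;
    // `get` returns None instead of panicking on an out-of-range offset.
    let path = source.get(start..)?.trim_end();
    if path.is_empty() {
        None
    } else {
        Some(path)
    }
}
```

The point of the sketch is that the token kinds after the keyword are never inspected: whether `/` lexed as an error token or `.` as punctuation does not matter, only where the first of them starts.
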
### 7. Whitespace

The lexer skips ASCII whitespace (` \t\r\n`) between tokens.
This honours ADR-0009's "whitespace is liberal" rule
unchanged: the user can put any amount of whitespace
between any two tokens, and the parser doesn't see it.

The lexer does **not** track whitespace as tokens. I4 (syntax
highlighting) recovers whitespace as the gaps between token
spans, computed at render time.

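Recovering whitespace from span gaps is a small computation. A sketch, assuming spans are `(start, end)` byte pairs sorted by start and non-overlapping, as lexer output would be (the function name `gaps` is illustrative):

```rust
/// Return the byte ranges of `0..source_len` not covered by any token span.
/// Assumes `spans` is sorted by start offset and non-overlapping.
fn gaps(spans: &[(usize, usize)], source_len: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut cursor = 0;
    for &(start, end) in spans {
        if start > cursor {
            out.push((cursor, start)); // uncovered region before this token
        }
        cursor = end;
    }
    if cursor < source_len {
        out.push((cursor, source_len)); // trailing whitespace
    }
    out
}
```

A renderer can interleave these ranges with the token spans to reproduce the input byte for byte, which is why whitespace tokens are not needed.
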
### 8. Comments

The DSL has no comment syntax today and this ADR doesn't
introduce one. If a comment syntax is added later (likely
`--` line comments to match SQL, but that conflicts with
flag prefixes, a real design decision a future ADR
will need to settle), the lexer is the right layer to skip
them. Out of scope here.

### 9. I3 (tab completion) hook

This ADR commits to making the parser's expected-token-set
queryable at any point in the input. Concretely:

- `parse_tokens(&[Token], &str)` returns
  `(Result<Command, ParseError>, ParseDiagnostics)` where
  `ParseDiagnostics` carries the raw chumsky output,
  including the merged `expected` set at the failure point.
- Truncating the token stream at the cursor and re-parsing
  yields the expected-token-set at the cursor position. I3
  uses this for completion.
- The lexer is independently useful to I3: tokenizing the
  in-progress input lets the completion logic see what
  *partial* token the cursor is in (mid-keyword, mid-identifier,
  mid-string).

The full I3 surface (cursor positioning, completion menu,
disambiguation, completion of identifiers from schema) is
out of scope here. ADR-0020 commits only to the parser
contract.

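The cursor-side bookkeeping behind the second and third bullets could be sketched as follows. The function name and the rule that a token touching the cursor counts as partial are assumptions, not part of the parser contract above:

```rust
/// Split sorted token spans at a cursor byte offset.
/// Returns the number of tokens that end strictly before the cursor, plus the
/// span of the token the cursor is inside or immediately after, if any.
fn split_at_cursor(
    spans: &[(usize, usize)],
    cursor: usize,
) -> (usize, Option<(usize, usize)>) {
    // Tokens fully behind the cursor are safe to hand to the parser as-is.
    let complete = spans.iter().take_while(|&&(_, end)| end < cursor).count();
    // The next token is "partial" only if it started before the cursor.
    let partial = spans
        .get(complete)
        .copied()
        .filter(|&(start, _)| start < cursor);
    (complete, partial)
}
```

Under these assumptions, completion would re-parse the first `complete` tokens to obtain the expected set and use the text of the partial token, when present, as the prefix to filter candidates.
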
### 10. I4 (syntax highlighting) hook

The token kinds are the natural classification for syntax
highlighting:

- `Keyword(_)` → keyword colour.
- `Identifier(_)` → identifier colour.
- `Number(_)`, `StringLiteral(_)` → literal colours
  (probably distinct from each other).
- `Punct(_)` → punctuation colour (or no special colouring).
- `Flag(_)` → flag colour.
- `Error(_)` → error highlight (red underline / squiggle).

The lexer succeeds on partial / invalid input (§2). I4 does
not need a successful parse to render highlighting, only
successful tokenization. This is the load-bearing property
for "live" highlighting as the user types.

The colour scheme, theme integration, render layer, and
performance budget are out of scope here. ADR-0020 commits
only to "the lexer always produces a token stream over which
highlighting can iterate".

### 11. Recovery-based partial AST: explicitly deferred

Chumsky supports parser combinators with recovery, producing
partial ASTs alongside errors. Useful for I3 ("the user has
typed `create table Foo with pk` and now I want to know
what the partial AST is so I can suggest column-spec
completions") but the design space is large (which recovery
strategies, what the partial AST shape is, how downstream
consumers handle it).

This ADR keeps the parser non-recovering: a failed parse
returns `Err(ParseError)` and no partial AST. I3's ADR will
decide whether recovery is needed; if so, the change is
local to `parse_tokens` and doesn't ripple.

### 12. Migration of existing parser tests

The 50+ unit tests in `dsl/parser.rs::tests` are the spec of
the current grammar. The migration is mechanical:

- Tests that call `parse_command(input)` keep doing so;
  `parse_command` is the public boundary and its signature
  doesn't change.
- Tests that assert on `ParseError::Invalid { message, .. }`
  may need wording updates if the new error layer rewords
  them, but anchor substrings ("unknown type", "specified
  twice", "mutually exclusive", "varchar", "expected one
  of") stay intact (those come from `try_map` content errors
  that survive unchanged).
- Two existing tests assert on chumsky's structural wording:
  `structural_error_for_show_data_without_arg`
  ("after `show data`", "expected identifier", "found end of
  input") and `structural_error_for_change_column_with_swapped_args`
  ("after `change column Rich`", "expected `:`"). The new
  rendering preserves the same shape, `after <prefix>,
  expected <set>, found <token>`, so these tests should
  port with at most minor adjustments. ADR-0021 specifies
  the rendering precisely.

The lexer gets a new test surface of its own: lex output for
representative inputs, lex error tokens for unterminated
strings and unknown chars, and span correctness.

## Out of scope

Deliberately deferred to keep this ADR focused:

1. **Per-command usage templates.** ADR-0021.
2. **`parse.usage.*` and `parse.token.*` catalog keys.**
   ADR-0021.
3. **Error renderer composition** (caret + structural
   error + usage hint). ADR-0021.
4. **I3 completion UI** + cursor logic + identifier
   completion from schema. Future I3 ADR.
5. **I4 colour scheme** + theme integration. Future I4 ADR.
6. **Recovery-based partial AST.** §11; re-opens with I3.
7. **Comment syntax.** §8.
8. **Sharing the lexer with `sqlparser-rs`** (advanced-mode
   SQL). The two parsers stay separate today; a future SQL
   subset ADR may revisit.

## Consequences

### Positive

- **Aggregation works.** Top-level `choice((create_*, drop_*,
  add_*, …))` failures emit "expected `create`, `drop`,
  `add`, …" structurally. The user sees the family of
  available commands rather than one branch's report.
- **The bespoke `humanise()` machinery shrinks.** Roughly
  half the helpers in `parser.rs` (~80-100 lines) are no
  longer needed because token-level error patterns render
  directly via the catalog. Less code is less code to
  maintain.
- **I3 / I4 inherit a clean foundation.** Their ADRs can
  focus on UX/UI rather than re-litigating parser shape.
- **Lex errors and parse errors share a render path.**
  Unterminated strings and missing keywords both surface
  through the same catalog-driven layer.
- **The token stream is a natural API for future tooling**:
  schema-aware highlighting, structural editing, command
  history rendering with token-level colour.

### Costs

- **One-time migration of `dsl/parser.rs`.** Every combinator
  rewrites against `&[Token]`. Estimated 600-900 lines of
  `parser.rs` change including the lexer module; the lexer
  itself is probably 200-300 lines plus 150-250 lines of
  unit tests.
- **Catalog grows by one entry per keyword and per punct.**
  The macro-driven `Keyword` / `Punct` enums (§2a) collapse
  the enum + lex table + as-str + catalog-key derivation into
  one declaration site, so adding a keyword is one line of
  Rust + one line of YAML. The catalog validator enforces
  completeness at test time (ADR-0021 §7). Compared to today,
  where adding a keyword means a new `keyword_ci("...")`
  call site (or several, if used in multiple commands), the
  per-keyword cost is comparable; what shifts is *where* the
  edit happens. A unit test additionally asserts every
  `Keyword::ALL` variant is referenced by some parser
  combinator, catching dead enum entries.
- **Span model needs care for UTF-8.** Byte offsets are the
  contract; the lexer must split tokens at UTF-8 boundaries.
  Identifier/string tests should include at least one
  multi-byte case.
- **`replay` special-case** in the parser (§6). One command,
  ~10 lines, documented inline. Acceptable; not a precedent
  for other commands.
- **Test churn.** Two structural-error tests need new wording
  (mechanical port). Existing content-error tests stay as-is.

### Neutral

- **chumsky stays.** No framework change, no new dependency.
  The parser still expresses the grammar declaratively; only
  the input atoms change.
- **`Command` AST is unchanged.** The parser produces the
  same `Command` enum it does today; downstream code (runtime,
  app, db) is untouched.
- **Public API of `dsl::parser` is unchanged.**
  `parse_command(&str) → Result<Command, ParseError>` keeps
  its signature. ADR-0021 may extend the public surface (e.g.
  expose `lex()` and `parse_tokens()` for I3/I4) but does so
  additively.

## Implementation notes

These are sketch-level: implementation will produce more
detailed work, but they're enough that a session picking
this up has direction.

### Order of operations

1. **Lexer module.** New file `src/dsl/lexer.rs`. Token /
   TokenKind / Keyword / Punct types. `lex(input: &str) →
   Vec<Token>` (always succeeds; embeds `Error(_)` tokens
   for shape errors). Unit tests for representative inputs,
   span correctness, error-token cases, multi-byte UTF-8.
2. **Token-aware combinator helpers.** Probably in
   `src/dsl/parser.rs` next to the existing combinators.
   `kw(Keyword)`, `punct(Punct)`, `ident()`, `number()`,
   `string_lit()`, `flag(&str)`, `eof()`. Each parses a
   single token by kind and returns its payload (or `()`).
3. **Rewrite `command_parser()` and its sub-parsers.** One
   sub-parser at a time; run the existing tests after each.
   Aim for green-after-every-step rather than a big-bang
   port.
4. **The `replay` source-slice special case** (§6).
5. **Strip `humanise()` machinery** that's no longer needed.
   The render path in `into_parse_error()` simplifies.
   ADR-0021 owns the new render shape; until it lands, keep
   a minimal humaniser that produces the existing wording.
6. **Public API check.** `parse_command` signature unchanged.
   Add `lex(&str) → Vec<Token>` and `parse_tokens(&[Token],
   &str) → Result<Command, ParseError>` as `pub` so I3/I4
   can hook in later (their ADRs will use these).

### Things that interact subtly

- **The `try_map` content errors** survive unchanged. They
  fire on tokens (e.g. `try_map` on a parsed `Identifier(s)`
  that's expected to be a type name) but their messages and
  classification are identical to today. The catalog
  vocabulary they use ("unknown type", "expected one of",
  "mutually exclusive", "specified twice") stays.
- **The `1:n` cardinality** lexes as `Number("1")`,
  `Punct(Colon)`, `Identifier("n"|"N")`. The parser composes
  these into the relationship-cardinality assertion as
  today; no special token kind for it.
- **Negative number literals.** The lexer does not emit
  `Punct(Minus)`: there is no `Minus` in the punct set above
  because the current grammar has no need for a unary `-`
  outside number literals. Decision: the lexer recognises
  `-` as part of a number literal when followed by an ASCII
  digit, producing a single `Number("-5")` token. A bare `-`
  not followed by a digit is treated as `Error(UnknownChar)`
  for now; it would become `Punct(Minus)` only if a `Minus`
  variant is added. This matches the current grammar (which
  only accepts `-` as a number sign).
- **The `--` flag prefix vs. a future `--` line comment.**
  Today `--all-rows` etc. are flags. If a future ADR
  introduces SQL-style `--` line comments in advanced mode
  only, the lexer may need a mode parameter. ADR-0020 doesn't
  pre-empt that.
- **The hard-coded `"running: "` prefix in `app.rs`** for
  caret padding: unchanged. The parser still reports a
  byte-position; the caret math is the same.