ADR-0020 + ADR-0021: tokenization layer and parse-error pedagogy (H1a)

ADR-0020 amends ADR-0001 with a two-phase parse: a lexer
producing a span-tagged token stream, then chumsky over
&[Token]. Single source of truth for keywords and punct via
a define_keywords!/define_punct! macro pattern. Parser
contract committed for I3 (queryable expected-token-set)
and I4 (lexer always succeeds, Error tokens for invalid
input). Includes an honest history note: the no-lexer shape
in dsl/parser.rs arose incrementally without ADR-level
deliberation against the known H1a/I3/I4 requirements; this
ADR corrects that.

ADR-0021 builds on ADR-0020 to close the H1a gap: a
per-command UsageEntry registry keyed off entry-keyword,
with parse errors rendered as caret + structural error +
matching usage template(s). Multi-entry families (add,
drop, show) render together. New catalog sections under
parse.usage.* (per-command grammar) and parse.token.*
(single-token vocabulary). Zero-prefix case ("frobulate
Customers") falls back to an "available commands:" framing.
Anchor-phrase compliance preserved.
This commit is contained in:
claude@clouddev1
2026-05-10 08:43:20 +00:00
parent 47601f7c85
commit 857ee753f2
3 changed files with 1092 additions and 0 deletions
@@ -0,0 +1,638 @@
# ADR-0020: Tokenization layer for the DSL parser
## Status
Accepted.
Amends ADR-0001 (language and TUI framework) by adding a
tokenization layer between the source string and the chumsky
grammar. The chumsky choice itself is unchanged; what changes
is the input to chumsky — `&[Token]` instead of `&str`.
Foundation for ADR-0021 (parser-as-source-of-truth pedagogy /
H1a). Also the load-bearing piece for I3 (tab completion) and
I4 (syntax highlighting) once those land.
## Context
### What the user sees today
Typing `create` at the prompt produces:
```
parse error: after `create`, expected `table`
```
Typing `frobulate Customers` produces:
```
parse error: expected `create`, found `frobulate`
```
Both are technically true and both are bad. The first hides
that `create` is the entry to a single command (`create
table …`) that the user almost certainly does not yet know
the shape of. The second points at one of ten possible
command-starting keywords and silently picks one. A user
hitting either error gets a sharper "what comes next?" answer
from a 1980s `lex`/`yacc` toolchain than from this app.
### Two technical roots
The bad UX traces to two parser-level decisions, both
correctable but not without architectural change:
1. **`keyword_ci` emits `Rich::custom` errors instead of
structural ones.** Chumsky's `choice` combinator merges the
`expected` sets of its alternatives when they all fail —
that is exactly what produces "expected `data` or `table`"
instead of just "expected `table`". Custom errors don't
participate in that merge: only one wins, deterministically
the first. So our top-level `choice((create_table, drop_*,
add_*, …))` collapses to whichever branch reported its
custom error first, throwing the others away.
2. **The parser has no concept of a "command" beyond its
chumsky combinator graph.** There is no place to attach a
per-command grammar template (the "what does `create
table` actually want?" answer). Adding one is possible but
awkward without an explicit handle on "the entry token
for this command".
ADR-0021 addresses (2). This ADR addresses (1) by removing
the underlying cause: `keyword_ci`'s `try_map → Rich::custom`
shape only exists because the parser operates on raw
characters and has to hand-write keyword recognition. Over a
token stream, a keyword is just a token, and `just(...)` over
tokens aggregates naturally.
### Bespoke machinery in `parser.rs`
`parser.rs` carries ~180 lines of error-handling helpers
(`humanise()`, `consumed_context()`, `oxford_or()`,
`describe_pattern()`, `describe_char()`, `format_expected_found()`,
`first_custom_message()`, the `prefer-custom-over-structural`
selection in `into_parse_error()`). Most of it exists because
chumsky-over-`&str` produces character-level error patterns
that need humanising before a user can read them
(`RichPattern::Token('e')` → "`e`" is not what the user
needed to know).
Over a token stream, the patterns are token kinds — `Keyword(Create)`,
`Identifier`, `Punct(Colon)` — which render directly via the
i18n catalog without per-character translation. Roughly half
of that machinery dissolves.
### Honest history
ADR-0001 chose Rust + chumsky for the DSL. It did not name
"no lexer" as a design choice — the no-lexer shape grew
incrementally inside `parser.rs` without an ADR.
The known requirements at the time included **H1a** (friendly
error layer for parse errors), **I3** (tab completion), and
**I4** (syntax highlighting). All three are easier with a
token stream than without one — H1a needs aggregated expected
sets, I3 needs "what tokens are valid at cursor", I4 needs
token classification on potentially-invalid input. The
lexer-shape question should have been considered against
these requirements when the parser was built. It wasn't.
The user has previously raised this — pointing out that
`lex`/`yacc` already handled this kind of error reporting
better, and asking why we weren't getting comparable
behaviour. That observation was correct on the merits and
should have triggered an ADR amendment then. ADR-0020 is that
amendment, late but in the right place.
This is not a "we now realise" framing. It is a "we should
have decided this earlier and didn't" framing. The cost of
acting on it now is real but bounded; the cost of acting on
it after the query DSL or constraint-management commands have
landed would be substantially larger. We have no users; the
agility cost of refactoring is at its lowest.
### What is and isn't this ADR
This ADR is purely about the **input layer to the DSL
parser**. It does not specify per-command usage rendering,
catalog keys for parse-error wording, or the renderer
composition of caret + structural error + usage hint. Those
are ADR-0021's scope. The two ADRs share an implementation
session; the dependency runs one way (ADR-0021 needs
ADR-0020's tokens).
This ADR also does not touch SQL parsing in advanced mode.
That path uses `sqlparser-rs` (per ADR-0001) and has its own
tokenization built in. A future ADR for the SQL subset (Q4)
will decide whether to share the DSL lexer's token model or
keep `sqlparser-rs`'s token surface as-is — that's not
prejudged here.
## Decision
### 1. Two-phase parse: lex → parse
```rust
pub fn parse_command(input: &str) -> Result<Command, ParseError> {
let tokens = lex(input)?; // Stage 1
parse_tokens(&tokens, input) // Stage 2
}
```
Stage 1 (`lex`) produces a span-tagged token stream. Stage 2
(`parse_tokens`) is a chumsky parser whose input type is
`&[Token]` instead of `&str`. The two stages are separately
testable; the lexer has its own test surface that doesn't
exercise the parser.
`parse_tokens` takes the original `&str` as a second
argument purely so the parser can consume bare path arguments
for the `replay` command directly from source (see §6). All
other parser logic operates over the token slice.
### 2. Token model
```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
pub kind: TokenKind,
pub span: Span, // (start, end) byte offsets
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
Keyword(Keyword), // case-folded reserved word
Identifier(String), // case-preserving (ADR-0009)
Number(String), // raw text; numeric parse is a parser concern
StringLiteral(String), // unquoted, escapes processed
Punct(Punct), // : ( ) , = . - one char each
Flag(String), // --name (e.g. "--all-rows")
Error(LexError), // unrecognised char / unterminated string
// — a token kind, not a Result variant
}
// Keyword is a closed set declared via a macro that is the
// single source of truth (§2a). Punct follows the same pattern.
```
The `Keyword` and `Punct` enums are not hand-declared. They
come out of macros described in §2a — one declaration site
that generates the enum, the lex-side string→variant
mapping, the variant→literal rendering, and the catalog-key
derivation in one place.
Notes on the model:
- **Keywords are an enum, not a string.** This makes
`kw(Keyword::Create)` a single-token chumsky match with
exact identity — fastest path, cleanest error patterns,
no string allocation in the hot path.
- **Type names stay as identifiers.** `int`, `text`,
`serial`, `varchar` all lex as `Identifier(_)`. The
parser-level `type_keyword()` continues to call
`Type::from_str` which produces the existing "unknown type
'varchar' (expected one of: text, int, real, …)"
message. That custom-error path is correct and stays. The
closed-set Keyword enum is for the *grammar's* reserved
words — words whose presence determines which command is
being parsed. Type names are *content*.
- **Number is raw text, not parsed.** `Value::Number(String)`
per ADR-0014 stays string-backed; the lexer doesn't try to
validate or convert. `1`, `-3.14`, `1e10`, `1abc` all
produce candidates for the parser to decide on. (The
current parser rejects `1abc` already; that doesn't change.)
- **`StringLiteral` is post-escape.** The lexer processes
`''``'` per the existing string syntax. The original
span covers the surrounding quotes; the payload is the
unescaped content. (Same convention as the current
`string_literal()` parser.)
- **`Flag(String)` is `--name` exactly.** No further parsing.
The parser matches `Flag(s)` and checks `s == "all-rows"`
etc. against a small set.
- **`Error(_)` is a token kind, not a `Result` variant.**
Lex always succeeds in producing `Vec<Token>` — even on
unterminated strings or unrecognised characters, a
diagnostic Error token is emitted in place. Rationale: I4
(syntax highlighting) needs to colour partial / invalid
input; treating lex errors as data instead of control flow
lets the highlighter walk a token stream uniformly. The
parser sees `Error(_)` tokens and raises a structural
error that names the underlying cause via the catalog.
### 2a. Single source of truth for keywords and punctuation
The `Keyword` enum, the lexer's reserved-word table, the
variant→literal rendering, and the catalog-key derivation
all come from one declaration. A `define_keywords!`
`macro_rules!` invocation in `src/dsl/keyword.rs`:
```rust
macro_rules! define_keywords {
( $( $variant:ident => $literal:literal ),+ $(,)? ) => {
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Keyword { $( $variant ),+ }
impl Keyword {
pub const ALL: &'static [(Keyword, &'static str)] = &[
$( (Keyword::$variant, $literal) ),+
];
/// Lex-side mapping. Case-insensitive per ADR-0009.
pub fn from_word(s: &str) -> Option<Keyword> {
Self::ALL.iter()
.find(|(_, lit)| s.eq_ignore_ascii_case(lit))
.map(|(kw, _)| *kw)
}
/// Canonical lowercase literal for this variant.
pub fn as_str(self) -> &'static str {
Self::ALL.iter()
.find(|(kw, _)| *kw == self)
.map(|(_, lit)| *lit)
.unwrap()
}
/// `parse.token.keyword.<lit>` — the catalog key
/// the renderer looks up for the expected-set
/// vocabulary (ADR-0021 §4).
pub fn catalog_token_key(self) -> String {
format!("parse.token.keyword.{}", self.as_str())
}
}
};
}
define_keywords! {
Create => "create",
Drop => "drop",
Add => "add",
// ... one line per keyword ...
}
```
`Punct` follows the same pattern via `define_punct!`
Colon, OpenParen, CloseParen, Comma, Equals, Dot, each
generated from one line of the invocation.
Adding a new keyword is **one line** in the
`define_keywords!` invocation, plus one line in the
en-US YAML under `parse.token.keyword.<lit>` (the catalog
validator catches a missing entry at test time per ADR-0021
§7). Everything else — every parser combinator that
references the keyword, every usage-registry entry, every
catalog-key lookup — is a *use* of the source of truth, not
a duplicate, and is compile-time-checked by the type system
(typos in `kw(Keyword::Creaate)` don't compile).
### 3. Span carriage
Every `Token` carries `(start, end)` byte offsets in the
original source string. The lexer is byte-exact for
multi-byte UTF-8 sequences (ASCII-only is the realistic
input but the lexer does not panic on Unicode in identifiers
or string literals).
`ParseError::Invalid` continues to carry `position: usize`,
which is the `start` offset of the failing token (or the
end-of-input position when the failure is at EOF). Caret
rendering (in `app.rs`) is unchanged — same byte-position
contract.
### 4. Lexer-vs-parser error split
The lexer is responsible for:
- **Tokenization shape errors**: unterminated string literal,
unrecognised character. Both surface as `TokenKind::Error(_)`
tokens with span coverage of the offending region.
- **Nothing else.** Specifically: type names, command
keywords, value validity, flag names, numeric overflow,
identifier collisions are all parser- or higher-layer
concerns.
The parser is responsible for:
- **Grammar-shape errors**: missing tokens, wrong tokens,
unexpected tokens, end-of-input mid-command. These come
out of chumsky's structural error machinery and aggregate
across `choice` naturally.
- **Content errors via `try_map`**: unknown type name
("unknown type 'varchar'"), mutually-exclusive flags
("`--force-conversion` and `--dont-convert` are mutually
exclusive"), "with pk needs at least one column",
referential-clause repeated. These keep their hand-written
messages — the `Rich::custom` path is correct for content,
it was only wrong for keyword-shape.
The render layer (humanise) handles both uniformly via the
catalog (ADR-0021).
### 5. The chumsky-over-tokens combinator surface
Chumsky 0.13 parses arbitrary input types, not just `&str`.
The combinator surface for `Parser<'a, &'a [Token], …>` is
the same as for `&str`: `just`, `choice`, `then`,
`then_ignore`, `ignore_then`, `or_not`, `repeated`,
`separated_by`, `try_map`, `labelled`. The only differences:
- `just(Token { kind: Keyword::Create, .. })` doesn't work
literally because spans differ. We define helpers
`kw(Keyword::Create)`, `punct(Punct::Colon)`, `ident()`,
`number()`, `string_lit()`, `flag(name)` that match by
kind and ignore span on the input side, returning either
`()` or the carried payload.
- `padded()` (which strips whitespace) is no longer needed —
the lexer skips whitespace already. This simplifies a lot
of combinator chains (`just(':').padded()``punct(Punct::Colon)`).
- `text::keyword(...)`, `any().filter(...)`, etc. — all the
character-level helpers — are gone from the parser. They
belong to the lexer if anywhere.
The parser keeps producing `Rich<Token>` errors. The
`expected` set members are `RichPattern<Token>`; the
catalog's `parse.token.*` keys (ADR-0021 §4) translate them.
### 6. The `replay` path-argument wart
The current grammar admits bare paths: `replay history.log`,
`replay /tmp/seed.commands`. Bare paths contain `.` and `/`,
which are punctuation tokens. A naive token-based parser
would see `Keyword(Replay) Identifier("history") Punct(Dot)
Identifier("log")` and have to reassemble — annoyingly
context-dependent.
Decision: keep the existing bare-path UX. The parser
special-cases the `replay` command at one point: after
matching `Keyword(Replay)`, it consumes its argument
**directly from the original source string** by reading from
the next token's start byte to end-of-input (or to the next
unambiguous terminator, of which today there is none in the
DSL). The lexer's job for the path argument is essentially
to identify *where the path starts*; the parser does the
rest by source-slicing.
The quoted-path form (`replay 'my project/seed.commands'`)
still goes through the lexer's normal `StringLiteral` path
and is matched as `Keyword(Replay) StringLiteral(s)`.
This special-casing is documented inline in the parser. It
costs ~10 lines of code and avoids a breaking change to the
DSL surface. No other DSL command takes a free-form
filesystem-path argument; if a future command does, it
either uses the same special-case or accepts only quoted
paths.
### 7. Whitespace
The lexer skips ASCII whitespace (` \t\r\n`) between tokens.
This honours ADR-0009's "whitespace is liberal" rule
unchanged — the user can put any amount of whitespace
between any two tokens, and the parser doesn't see it.
The lexer does **not** track whitespace as tokens. I4 (syntax
highlighting) recovers whitespace as the gaps between token
spans, computed at render time.
### 8. Comments
The DSL has no comment syntax today and this ADR doesn't
introduce one. If a comment syntax is added later (likely
`--` line comments to match SQL, but that conflicts with
flag prefixes — this is a real design decision a future ADR
will need to settle), the lexer is the right layer to skip
them. Out of scope here.
### 9. I3 (tab completion) hook
This ADR commits to making the parser's expected-token-set
queryable at any point in the input. Concretely:
- `parse_tokens(&[Token], &str)` returns
`(Result<Command, ParseError>, ParseDiagnostics)` where
`ParseDiagnostics` carries the raw chumsky output —
including the merged `expected` set at the failure point.
- Truncating the token stream at the cursor and re-parsing
yields the expected-token-set at the cursor position. I3
uses this for completion.
- The lexer is independently useful to I3: tokenizing the
in-progress input lets the completion logic see what
*partial* token the cursor is in (mid-keyword, mid-identifier,
mid-string).
The full I3 surface (cursor positioning, completion menu,
disambiguation, completion of identifiers from schema) is
out of scope here. ADR-0020 commits only to the parser
contract.
### 10. I4 (syntax highlighting) hook
The token kinds are the natural classification for syntax
highlighting:
- `Keyword(_)` → keyword colour.
- `Identifier(_)` → identifier colour.
- `Number(_)`, `StringLiteral(_)` → literal colours
(probably distinct from each other).
- `Punct(_)` → punctuation colour (or no special colouring).
- `Flag(_)` → flag colour.
- `Error(_)` → error highlight (red underline / squiggle).
The lexer succeeds on partial / invalid input (§2). I4 does
not need a successful parse to render highlighting — only
successful tokenization. This is the load-bearing property
for "live" highlighting as the user types.
The colour scheme, theme integration, render layer, and
performance budget are out of scope here. ADR-0020 commits
only to "the lexer always produces a token stream over which
highlighting can iterate".
### 11. Recovery-based partial AST — explicitly deferred
Chumsky supports parser combinators with recovery, producing
partial ASTs alongside errors. Useful for I3 ("the user has
typed `create table Foo with pk` and now I want to know
what the partial AST is so I can suggest column-spec
completions") but the design space is large (which recovery
strategies, what the partial AST shape is, how downstream
consumers handle it).
This ADR keeps the parser non-recovering: a failed parse
returns `Err(ParseError)` and no partial AST. I3's ADR will
decide whether recovery is needed; if so, the change is
local to `parse_tokens` and doesn't ripple.
### 12. Migration of existing parser tests
The 50+ unit tests in `dsl/parser.rs::tests` are the spec of
the current grammar. The migration is mechanical:
- Tests that call `parse_command(input)` keep doing so —
`parse_command` is the public boundary and its signature
doesn't change.
- Tests that assert on `ParseError::Invalid { message, .. }`
may need wording updates if the new error layer rewords
them, but anchor substrings ("unknown type", "specified
twice", "mutually exclusive", "varchar", "expected one
of") stay intact (those come from `try_map` content errors
that survive unchanged).
- Two existing tests assert on chumsky's structural wording:
`structural_error_for_show_data_without_arg`
("after `show data`", "expected identifier", "found end of
input") and `structural_error_for_change_column_with_swapped_args`
("after `change column Rich`", "expected `:`"). The new
rendering preserves the same shape — `after <prefix>,
expected <set>, found <token>` — so these tests should
port with at most minor adjustments. ADR-0021 specifies
the rendering precisely.
A new test surface for the lexer itself: lex output for
representative inputs, lex error tokens for unterminated
strings + unknown chars, span correctness.
## Out of scope
Deliberately deferred to keep this ADR focused:
1. **Per-command usage templates.** ADR-0021.
2. **`parse.usage.*` and `parse.token.*` catalog keys.**
ADR-0021.
3. **Error renderer composition** (caret + structural error
+ usage hint). ADR-0021.
4. **I3 completion UI** + cursor logic + identifier
completion from schema. Future I3 ADR.
5. **I4 colour scheme** + theme integration. Future I4 ADR.
6. **Recovery-based partial AST.** §11 — re-opens with I3.
7. **Comment syntax.** §8.
8. **Sharing the lexer with `sqlparser-rs`** (advanced-mode
SQL). The two parsers stay separate today; a future SQL
subset ADR may revisit.
## Consequences
### Positive
- **Aggregation works.** Top-level `choice((create_*, drop_*,
add_*, …))` failures emit "expected `create`, `drop`,
`add`, …" structurally. The user sees the family of
available commands rather than one branch's report.
- **The bespoke `humanise()` machinery shrinks.** Roughly
half the helpers in `parser.rs` (~80-100 lines) are no
longer needed because token-level error patterns render
directly via the catalog. Less code is less code to
maintain.
- **I3 / I4 inherit a clean foundation.** Their ADRs can
focus on UX/UI rather than re-litigating parser shape.
- **Lex errors and parse errors share a render path.**
Unterminated strings and missing keywords both surface
through the same catalog-driven layer.
- **The token stream is a natural API for future tooling**:
schema-aware highlighting, structural editing, command
history rendering with token-level colour.
### Costs
- **One-time migration of `dsl/parser.rs`.** Every combinator
rewrites against `&[Token]`. Estimated 600-900 lines of
parser.rs change including the lexer module; the lexer
itself is probably 200-300 lines plus 150-250 lines of
unit tests.
- **Catalog grows by one entry per keyword and per punct.**
The macro-driven `Keyword` / `Punct` enums (§2a) collapse
the enum + lex table + as-str + catalog-key derivation into
one declaration site, so adding a keyword is one line of
Rust + one line of YAML. The catalog validator enforces
completeness at test time (ADR-0021 §7). Compared to today
— where adding a keyword means a new `keyword_ci("...")`
call site (or several, if used in multiple commands) — the
per-keyword cost is comparable; what shifts is *where* the
edit happens. A unit test additionally asserts every
`Keyword::ALL` variant is referenced by some parser
combinator, catching dead enum entries.
- **Span model needs care for UTF-8.** Byte offsets are the
contract; the lexer must split tokens at UTF-8 boundaries.
Identifier/string tests should include at least one
multi-byte case.
- **`replay` special-case** in the parser (§6). One command,
~10 lines, documented inline. Acceptable; not a precedent
for other commands.
- **Tests churn.** Two structural-error tests need new wording
(mechanical port). Existing content-error tests stay as-is.
### Neutral
- **chumsky stays.** No framework change, no new dependency.
The parser still expresses the grammar declaratively; only
the input atoms change.
- **`Command` AST is unchanged.** The parser produces the
same `Command` enum it does today; downstream code (runtime,
app, db) is untouched.
- **Public API of `dsl::parser` is unchanged.**
`parse_command(&str) → Result<Command, ParseError>` keeps
its signature. ADR-0021 may extend the public surface (e.g.
expose `lex()` and `parse_tokens()` for I3/I4) but does so
additively.
## Implementation notes
These are sketch-level — implementation will produce more
detailed work, but they're enough that a session picking
this up has direction.
### Order of operations
1. **Lexer module.** New file `src/dsl/lexer.rs`. Token /
TokenKind / Keyword / Punct types. `lex(input: &str) →
Vec<Token>` (always succeeds; embeds `Error(_)` tokens
for shape errors). Unit tests for representative inputs,
span correctness, error-token cases, multi-byte UTF-8.
2. **Token-aware combinator helpers.** Probably in
`src/dsl/parser.rs` next to the existing combinators.
`kw(Keyword)`, `punct(Punct)`, `ident()`, `number()`,
`string_lit()`, `flag(&str)`, `eof()`. Each parses a
single token by kind and returns its payload (or `()`).
3. **Rewrite `command_parser()` and its sub-parsers.** One
sub-parser at a time; run the existing tests after each.
Aim for green-after-every-step rather than a big-bang
port.
4. **The `replay` source-slice special case** (§6).
5. **Strip `humanise()` machinery** that's no longer needed.
The render path in `into_parse_error()` simplifies.
ADR-0021 owns the new render shape; until it lands, keep
a minimal humaniser that produces the existing wording.
6. **Public API check.** `parse_command` signature unchanged.
Add `lex(&str) → Vec<Token>` and `parse_tokens(&[Token],
&str) → Result<Command, ParseError>` as `pub` so I3/I4
can hook in later (their ADRs will use these).
### Things that interact subtly
- **The `try_map` content errors** survive unchanged. They
fire on tokens (e.g. `try_map` on a parsed `Identifier(s)`
that's expected to be a type name) but their messages and
classification are identical to today. The catalog
vocabulary they use ("unknown type", "expected one of",
"mutually exclusive", "specified twice") stays.
- **The `1:n` cardinality** lexes as `Number("1")`,
`Punct(Colon)`, `Identifier("n"|"N")`. The parser composes
these into the relationship-cardinality assertion as
today; no special token kind for it.
- **Negative number literals.** Lexer emits `Punct(Minus)`?
No — there is no Minus in the punct set above because the
current grammar has no need for a unary `-` outside number
literals. Decision: the lexer recognises `-` as part of a
number literal when followed by an ASCII digit, producing a
single `Number("-5")` token. A bare `-` not followed by a
digit is a `Punct(Minus)` only if a `Minus` variant is
added — for now, treat as `Error(UnknownChar)`. This
matches the current grammar (which only accepts `-` as a
number sign).
- **The `--` flag prefix vs. a future `--` line comment.**
Today `--all-rows` etc. are flags. If a future ADR
introduces SQL-style `--` line comments in advanced mode
only, the lexer may need a mode parameter. ADR-0020 doesn't
pre-empt that.
- **The hard-coded `"running: "` prefix in `app.rs`** for
caret padding: unchanged. The parser still reports a
byte-position; the caret math is the same.
@@ -0,0 +1,452 @@
# ADR-0021: Parser-as-source-of-truth for H1a (per-command usage in parse errors)
## Status
Accepted.
Builds on ADR-0020 (tokenization layer). Addresses H1a from
`requirements.md` — the parse-error pedagogy gap that
ADR-0019's friendly-error layer left untouched.
Cross-references ADR-0019 (i18n catalog conventions; H1a's
output goes through the same catalog) and ADR-0009 (DSL
syntax conventions; usage templates render in the project's
documented surface form).
## Context
ADR-0019 dramatically improved engine-error wording.
Parse-error wording is now the visibly-weakest user surface —
the user-reported gap was concrete: typing `create` produces
```
parse error: after `create`, expected `table`
```
The error is *structurally* correct (chumsky has consumed
`create` and is now looking for the next required token) but
*pedagogically* silent. A learner who got this far typed
`create` because they'd been told that's how new tables are
made; what they need next is the shape of the command, not
a single missing-token pointer.
Comparable observations apply across the whole DSL surface:
- `add` → expected `column` or `1` (uninformative; user
needs the shape of `add column …` AND `add 1:n
relationship …`).
- `update Customers` → expected `set` (true; but `update`'s
full grammar with `set …`, `where …`, `--all-rows` is what
the user wants illustrated).
- `frobulate Customers` → expected one of `create`, `drop`,
`add`, `rename`, `change`, `show`, `insert`, `update`,
`delete`, `replay` (true after ADR-0020; the available-commands
list is now informative, but the no-prefix case wants its
own framing — "available commands" rather than "expected").
H1a's job is to close that gap by surfacing the **grammar**
of the command at the point of error, not just the next
token.
### What ADR-0020 supplies
ADR-0020 lands the lexer + parser-over-tokens architecture.
What that buys H1a:
- **Aggregated `expected` sets at the failure point** (top-level
`choice` failures now list every command-starting keyword,
not just one). The user-visible "available commands" list
becomes correct without any work in this ADR.
- **Token-kind error patterns** (`RichPattern<Token>` instead
of `RichPattern<char>`). Each pattern renders via a stable
catalog key — no per-character humanising.
- **A canonical entry-token for each command** (the first
`Keyword(_)` consumed). H1a keys per-command usage
templates off this token.
### What this ADR adds on top
- A registry of per-command **usage templates** (one
declaration per command).
- A renderer that composes the parse error with: caret +
structural error wording + matching usage template(s).
- New catalog keys under `parse.usage.*` (templates) and
`parse.token.*` (single-token rendering for expected-set
joins).
- A "no commands consumed" fallback that renders an
available-commands list under a different prefix
("available commands:" rather than "expected:") for the
zero-prefix case.
## Decision
### 1. Per-command usage template registry
Each command parser is paired with a `UsageEntry`:
```rust
pub struct UsageEntry {
/// First keyword that distinguishes this command. Used
/// as the registry key.
pub entry: Keyword,
/// Catalog key for the grammar template body (under
/// `parse.usage.*`). One key per command.
pub catalog_key: &'static str,
}
```
The registry is a `&'static [UsageEntry]` declared in one
place (`src/dsl/usage.rs`). Lookup: given a consumed entry
keyword, return all entries whose `entry == keyword`. For
`Keyword::Add` the registry returns the `add column` and
`add 1:n relationship` entries; for `Keyword::Drop` it
returns `drop table`, `drop column`, `drop relationship`;
for unique-entry keywords (e.g. `Keyword::Create` today) it
returns one.
The catalog key is what gets translated. Template bodies
live in `src/friendly/strings/en-US.yaml` under
`parse.usage.*`:
```yaml
parse:
usage:
create_table: "create table <Name> with pk [<col>:<type>[, ...]]"
drop_table: "drop table <Name>"
add_column: "add column [to] [table] <Table>: <Name> (<Type>)"
add_relationship: |
add 1:n relationship [as <Name>]
from <Parent>.<col> to <Child>.<col>
[on delete <action>] [on update <action>]
[--create-fk]
rename_column: "rename column [in] [table] <Table>: <Old> to <New>"
change_column: |
change column [in] [table] <Table>: <Name> (<Type>)
[--force-conversion | --dont-convert]
show_data: "show data <Table>"
show_table: "show table <Table>"
insert: "insert into <Table> [(<col>[, ...])] [values] (<value>[, ...])"
update: "update <Table> set <col>=<value>[, ...] (where <col>=<value> | --all-rows)"
delete: "delete from <Table> (where <col>=<value> | --all-rows)"
drop_column: "drop column [from] [table] <Table>: <Name>"
drop_relationship: |
drop relationship <Name>
drop relationship from <Parent>.<col> to <Child>.<col>
replay: "replay <path> | replay '<path with spaces>'"
```
(Wording is illustrative; exact phrasing settled at
implementation time. The bracket convention `[...]` for
optional parts and angle-bracket `<...>` for placeholders
matches ADR-0009's documentation surface.)
### 2. The renderer composes three blocks
A parse error renders as:
```
running: <user input>
^ ← caret (existing, unchanged)
parse error: <structural-or-content message>
usage: <template1>
<template2> ← when multiple entries share the entry keyword
```
Block 1 (the echo + caret) is unchanged from today.
Block 2 is the structural or content error. ADR-0020
guarantees the structural error is now properly aggregated
("expected `data` or `table`" not "expected `table`"). The
content errors (unknown type, mutually-exclusive flags) are
unchanged in voice.
Block 3 (usage:) is new. It is rendered if and only if **at
least one keyword token was consumed** before the parser
failed AND that keyword is a registered entry. If no keyword
was consumed (e.g., `frobulate Customers`, where `frobulate`
is an `Identifier`, not a `Keyword`), Block 3 is replaced
with the no-prefix fallback (§5).
If multiple entries match (e.g., the `add` family), all are
listed under a single `usage:` prefix, one per line.
### 3. Identifying the consumed entry keyword
The parser surfaces, alongside the `ParseError`, the
**deepest successfully-consumed keyword token**. Mechanism:
- `parse_tokens` returns `(Result<Command, ParseError>,
ParseDiagnostics)` where `ParseDiagnostics` carries the
furthest position chumsky reached AND a snapshot of the
consumed prefix.
- The renderer walks the consumed prefix backward to find the
first `Keyword(_)` token. (Almost always the first token,
but a future grammar where a command starts with a
literal — none today — would still resolve correctly.)
This logic lives in `src/dsl/usage.rs::matched_entry()` so
the registry and the lookup sit together.
### 4. `parse.token.*` — single-token catalog vocabulary
Chumsky's expected-set rendering needs a name for each token
kind. Today `humanise()` hand-codes these
(`describe_pattern` returns "`create`", "identifier", etc.).
ADR-0021 moves the vocabulary into the catalog:
```yaml
parse:
token:
# Keywords — one entry per Keyword enum variant.
keyword.create: "`create`"
keyword.table: "`table`"
keyword.with: "`with`"
# ... one per Keyword variant ...
# Punctuation.
punct.colon: "`:`"
punct.open_paren: "`(`"
punct.close_paren: "`)`"
punct.comma: "`,`"
punct.equals: "`=`"
punct.dot: "`.`"
# Token-class labels.
identifier: "identifier"
number: "number"
string_literal: "string literal"
flag: "flag (--name)"
end_of_input: "end of input"
# Lexer-error tokens.
error.unterminated_string: "unterminated string starting at column {column}"
error.unknown_char: "unrecognised character {found}"
```
Joining ("`a`, `b`, or `c`") stays in code (`oxford_or` from
the current humanise machinery, lifted intact). Wording of
each token is in the catalog.
`parse.error` (existing wrapper key) stays. Its `{detail}`
placeholder is filled by:
```
{consumed_prefix} expected {oxford_or(expected)}, found {found_token}
```
— each piece sourced from the catalog, joined in code.
`parse.caret` (existing) and `parse.empty` (existing)
unchanged.
### 5. No-prefix fallback: "available commands"
When the parser fails with **no keyword consumed**, the
"expected" set lists every top-level command-starting
keyword. That's correct but the framing should be
"available commands" rather than "expected".
Renderer detects this case (consumed-keyword count == 0) and
substitutes Block 3 with:
```
available commands: create, drop, add, rename, change,
show, insert, update, delete, replay
```
via a new catalog key:
```yaml
parse:
available_commands: "available commands: {commands}"
```
The list is the alphabetised set of `entry` keywords from
the usage registry, each rendered via its `parse.token.keyword.*`
catalog entry (so the strings are catalog-sourced, not
hard-coded).
This case only fires when the user typed something the
parser couldn't classify as any known command keyword — the
"frobulate Customers" case. It's both rarer and more useful
than the with-prefix case: a user this lost benefits more
from the full menu than from a missing-token pointer.
### 6. Anchor-phrase compliance (ADR-0019 §10)
ADR-0019's anchor-phrase list contains nine substrings the
catalog commits to keeping stable. None are parse-error-specific,
so this ADR doesn't add to the list. The existing parser
test that asserts on "unknown type" and "expected one of"
substrings stays — those come from `Type::from_str`'s custom
error message which ADR-0020 §4 keeps unchanged.
The current structural-error tests assert on substrings like
"after `show data`", "expected identifier", "found end of
input", "after `change column Rich`", "expected `:`". The
new render shape preserves all of these — the rendering
template is `{prefix} expected {set}, found {token}` and
the prefix / set / token come from the catalog with the same
wording. Tests should port unchanged or with at most minor
adjustments.
### 7. Catalog validator covers the new keys
ADR-0019 §8.6's `KEYS_AND_PLACEHOLDERS` validator extends
to cover:
- Every `parse.usage.<command>` key referenced from the
registry exists.
- Every `parse.token.keyword.<variant>` key for every
`Keyword` enum variant exists.
- Every `parse.token.punct.<variant>` key for every `Punct`
variant exists.
- The `parse.token.{identifier, number, string_literal,
flag, end_of_input}` keys exist.
- The `parse.token.error.*` keys exist for every
`LexErrorKind` variant.
- The `parse.available_commands` key exists.
- No format specifiers (already enforced).
- No engine vocabulary (already enforced).
### 8. The `usage:` block respects the verbosity setting?
No. The `messages (short|verbose)` setting (ADR-0019)
governs *engine-error* verbosity (whether to render the
hint block of a `FriendlyError`). Parse errors don't go
through `FriendlyError`; they have their own render path,
and the usage block is always shown. Rationale: a learner
toggling to `messages short` is signalling they recognise
the engine-error patterns and want less explanation around
those — they're not signalling that they want less
parse-help. Parse errors mean the user couldn't even
formulate a runnable command; that's exactly the moment to
maximise pedagogical surface, regardless of the
engine-error verbosity preference.
If experience shows this is wrong, a future amendment can
gate the usage block on a separate setting. Doesn't need
to be designed now.
## Out of scope
1. **Tab completion (I3) and syntax highlighting (I4)**
themselves. ADR-0020 §9-10 commits to the parser
contract; ADR-0021 doesn't extend it.
2. **Schema-aware suggestions** ("did you mean `Customers`?"
when the user typed `Customrs`). Useful but a separate
feature; would land in I3 territory (completion + spell
check share a candidate list).
3. **Suggested fixes** ("change `crete` to `create`"). Same
bucket as schema-aware suggestions.
4. **Multi-error reporting.** Today and after this ADR, the
parser reports the first error and stops. Recovery-based
multi-error parsing is out of scope and re-opens with
I3's ADR (ADR-0020 §11).
5. **Persisting the verbosity setting** (which doesn't
affect parse errors anyway, per §8). ADR-0019 deferred it
to a future settings ADR.
## Consequences
### Positive
- **Per-command usage at point of failure.** A learner who
types `create` sees the full `create table` grammar
instead of "expected `table`". The user-reported gap
closes.
- **Aggregated `available commands` for cold starts.**
`frobulate Customers` now lists the ten command-starting
keywords under a sensible framing.
- **Vocabulary lives in the catalog, not in code.** Renaming
a keyword's user-facing wording is one YAML edit. Adding
a new keyword adds two lines (registry + token-name key);
the validator catches both if forgotten.
- **The render path simplifies.** `humanise()` shrinks to a
small composer over catalog lookups — no per-character
description, no `RichPattern` walking, no
prefer-custom-over-structural switching (the latter
becomes "render the structural error and append the usage
template").
- **Composes with ADR-0019's `FriendlyError`.** Engine
errors and parse errors are rendered through different
paths but both go through the catalog, so vocabulary
drift between them is impossible.
### Costs
- **A second registry to keep in sync** with the parser. The
validator (§7) catches missing usage entries / missing
token keys at test time, but adding a new command means
three steps (parser combinator, usage-registry entry,
catalog YAML edit). Mitigation: a unit test asserts every
command in the parser has a registry entry (catches
forgotten entries; matches the friendly-module pattern).
- **Catalog grows by ~30-40 entries** (one usage template
per command, one keyword name per `Keyword` variant, a
handful of token-class names, a handful of error names).
Each entry is one line of YAML; total catalog grows from
~170 entries to ~210. Within budget.
- **Wording iteration** on the usage templates will probably
happen post-merge. This is normal for pedagogical text
and the catalog makes it cheap.
### Neutral
- **Public parser API is unchanged.** `parse_command(&str)`
signature stable. The new `lex` and `parse_tokens`
functions exposed by ADR-0020 are the I3/I4 hook;
ADR-0021 doesn't add to that surface.
- **`AppEvent` shape unchanged.** Parse errors continue to
flow through `dispatch_dsl`'s existing path (push echo,
push caret, push error). This ADR's render changes are
internal to that function plus the `t!()` calls inside it.
## Implementation notes
### Order of operations (within the joint ADR-0020 + ADR-0021 implementation session)
1. Land ADR-0020 (lexer + parser refactor + minimal
humaniser).
2. Add `src/dsl/usage.rs` with the registry struct, the
static table, and `matched_entry()`.
3. Populate `parse.usage.*` and `parse.token.*` catalog
sections.
4. Extend `friendly::keys::KEYS_AND_PLACEHOLDERS` with the
new keys.
5. Rewrite `dispatch_dsl`'s error-render arm in `app.rs` to
compose the three blocks per §2 (or §5 fallback).
6. Add tests:
- Unit: every registered usage entry resolves through the
catalog. Every `Keyword` variant has a `parse.token.keyword.*`
entry.
- Integration (`tests/parse_error_pedagogy.rs`, new):
`create`, `add`, `update Customers`, `frobulate
Customers`, `create table` (no PK clause), `insert into
T` (no values), each producing the expected
three-block output.
7. Update or port the two existing structural-error tests
in `parser.rs::tests` to the new render shape.
### Things that interact subtly
- **The "deepest consumed keyword" mechanism** (§3) walks
the prefix once per parse failure. Cheap; no perf concern.
But it must not pick up keywords from inside content that
is itself part of a partial AST (e.g. an identifier the
user is typing that happens to be the first letters of a
keyword); since the lexer commits to identifier-vs-keyword
classification before the parser sees tokens, this isn't a
real risk. Documented inline.
- **Multiple usage entries per `add` / `drop`** are rendered
under one `usage:` prefix per §2. This is one of the
pedagogically-best parts of the change: the user gets the
full family rather than guessing which sibling they
wanted.
- **`replay`'s special-case parsing** (ADR-0020 §6) is
invisible to the usage layer. The user typing `replay`
with no path gets the `parse.usage.replay` template.
- **`messages` is an app-level command, not a DSL command**,
so it is not in the parser registry and doesn't appear in
`available commands:`. Same posture as `mode`, `help`,
`quit`. Documented in the registry's prelude.
+2
View File
@@ -25,3 +25,5 @@ This directory contains the project's ADRs, recorded per
- [ADR-0017 — Column type-change compatibility](0017-column-type-change-compatibility.md) - [ADR-0017 — Column type-change compatibility](0017-column-type-change-compatibility.md)
- [ADR-0018 — Auto-fill contracts for `serial` and `shortid` columns](0018-auto-fill-contracts-for-serial-and-shortid.md) - [ADR-0018 — Auto-fill contracts for `serial` and `shortid` columns](0018-auto-fill-contracts-for-serial-and-shortid.md)
- [ADR-0019 — Friendly error layer (H1) and i18n message catalog](0019-friendly-error-layer-and-i18n.md) - [ADR-0019 — Friendly error layer (H1) and i18n message catalog](0019-friendly-error-layer-and-i18n.md)
- [ADR-0020 — Tokenization layer for the DSL parser](0020-tokenization-layer-for-the-dsl-parser.md)
- [ADR-0021 — Parser-as-source-of-truth for H1a (per-command usage in parse errors)](0021-parser-as-source-of-truth-for-h1a.md)