ADR-0020 + ADR-0021: tokenization layer and parse-error pedagogy (H1a)
ADR-0020 amends ADR-0001 with a two-phase parse: a lexer
producing a span-tagged token stream, then chumsky over
&[Token]. Single source of truth for keywords and punct via
a define_keywords!/define_punct! macro pattern. Parser
contract committed for I3 (queryable expected-token-set)
and I4 (lexer always succeeds, Error tokens for invalid
input). Includes an honest history note: the no-lexer shape
in dsl/parser.rs arose incrementally without ADR-level
deliberation against the known H1a/I3/I4 requirements; this
ADR corrects that.
ADR-0021 builds on ADR-0020 to close the H1a gap: a
per-command UsageEntry registry keyed off entry-keyword,
with parse errors rendered as caret + structural error +
matching usage template(s). Multi-entry families (add,
drop, show) render together. New catalog sections under
parse.usage.* (per-command grammar) and parse.token.*
(single-token vocabulary). Zero-prefix case ("frobulate
Customers") falls back to an "available commands:" framing.
Anchor-phrase compliance preserved.
This commit is contained in:
@@ -0,0 +1,638 @@
|
|||||||
|
# ADR-0020: Tokenization layer for the DSL parser
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted.
|
||||||
|
|
||||||
|
Amends ADR-0001 (language and TUI framework) by adding a
|
||||||
|
tokenization layer between the source string and the chumsky
|
||||||
|
grammar. The chumsky choice itself is unchanged; what changes
|
||||||
|
is the input to chumsky — `&[Token]` instead of `&str`.
|
||||||
|
|
||||||
|
Foundation for ADR-0021 (parser-as-source-of-truth pedagogy /
|
||||||
|
H1a). Also the load-bearing piece for I3 (tab completion) and
|
||||||
|
I4 (syntax highlighting) once those land.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
### What the user sees today
|
||||||
|
|
||||||
|
Typing `create` at the prompt produces:
|
||||||
|
|
||||||
|
```
|
||||||
|
parse error: after `create`, expected `table`
|
||||||
|
```
|
||||||
|
|
||||||
|
Typing `frobulate Customers` produces:
|
||||||
|
|
||||||
|
```
|
||||||
|
parse error: expected `create`, found `frobulate`
|
||||||
|
```
|
||||||
|
|
||||||
|
Both are technically true and both are bad. The first hides
|
||||||
|
that `create` is the entry to a single command (`create
|
||||||
|
table …`) that the user almost certainly does not yet know
|
||||||
|
the shape of. The second points at one of ten possible
|
||||||
|
command-starting keywords and silently picks one. A user
|
||||||
|
hitting either error gets a sharper "what comes next?" answer
|
||||||
|
from a 1980s `lex`/`yacc` toolchain than from this app.
|
||||||
|
|
||||||
|
### Two technical roots
|
||||||
|
|
||||||
|
The bad UX traces to two parser-level decisions, both
|
||||||
|
correctable but not without architectural change:
|
||||||
|
|
||||||
|
1. **`keyword_ci` emits `Rich::custom` errors instead of
|
||||||
|
structural ones.** Chumsky's `choice` combinator merges the
|
||||||
|
`expected` sets of its alternatives when they all fail —
|
||||||
|
that is exactly what produces "expected `data` or `table`"
|
||||||
|
instead of just "expected `table`". Custom errors don't
|
||||||
|
participate in that merge: only one wins, deterministically
|
||||||
|
the first. So our top-level `choice((create_table, drop_*,
|
||||||
|
add_*, …))` collapses to whichever branch reported its
|
||||||
|
custom error first, throwing the others away.
|
||||||
|
|
||||||
|
2. **The parser has no concept of a "command" beyond its
|
||||||
|
chumsky combinator graph.** There is no place to attach a
|
||||||
|
per-command grammar template (the "what does `create
|
||||||
|
table` actually want?" answer). Adding one is possible but
|
||||||
|
awkward without an explicit handle on "the entry token
|
||||||
|
for this command".
|
||||||
|
|
||||||
|
ADR-0021 addresses (2). This ADR addresses (1) by removing
|
||||||
|
the underlying cause: `keyword_ci`'s `try_map → Rich::custom`
|
||||||
|
shape only exists because the parser operates on raw
|
||||||
|
characters and has to hand-write keyword recognition. Over a
|
||||||
|
token stream, a keyword is just a token, and `just(...)` over
|
||||||
|
tokens aggregates naturally.
|
||||||
|
|
||||||
|
### Bespoke machinery in `parser.rs`
|
||||||
|
|
||||||
|
`parser.rs` carries ~180 lines of error-handling helpers
|
||||||
|
(`humanise()`, `consumed_context()`, `oxford_or()`,
|
||||||
|
`describe_pattern()`, `describe_char()`, `format_expected_found()`,
|
||||||
|
`first_custom_message()`, the `prefer-custom-over-structural`
|
||||||
|
selection in `into_parse_error()`). Most of it exists because
|
||||||
|
chumsky-over-`&str` produces character-level error patterns
|
||||||
|
that need humanising before a user can read them
|
||||||
|
(`RichPattern::Token('e')` → "`e`" is not what the user
|
||||||
|
needed to know).
|
||||||
|
|
||||||
|
Over a token stream, the patterns are token kinds — `Keyword(Create)`,
|
||||||
|
`Identifier`, `Punct(Colon)` — which render directly via the
|
||||||
|
i18n catalog without per-character translation. Roughly half
|
||||||
|
of that machinery dissolves.
|
||||||
|
|
||||||
|
### Honest history
|
||||||
|
|
||||||
|
ADR-0001 chose Rust + chumsky for the DSL. It did not name
|
||||||
|
"no lexer" as a design choice — the no-lexer shape grew
|
||||||
|
incrementally inside `parser.rs` without an ADR.
|
||||||
|
|
||||||
|
The known requirements at the time included **H1a** (friendly
|
||||||
|
error layer for parse errors), **I3** (tab completion), and
|
||||||
|
**I4** (syntax highlighting). All three are easier with a
|
||||||
|
token stream than without one — H1a needs aggregated expected
|
||||||
|
sets, I3 needs "what tokens are valid at cursor", I4 needs
|
||||||
|
token classification on potentially-invalid input. The
|
||||||
|
lexer-shape question should have been considered against
|
||||||
|
these requirements when the parser was built. It wasn't.
|
||||||
|
|
||||||
|
The user has previously raised this — pointing out that
|
||||||
|
`lex`/`yacc` already handled this kind of error reporting
|
||||||
|
better, and asking why we weren't getting comparable
|
||||||
|
behaviour. That observation was correct on the merits and
|
||||||
|
should have triggered an ADR amendment then. ADR-0020 is that
|
||||||
|
amendment, late but in the right place.
|
||||||
|
|
||||||
|
This is not a "we now realise" framing. It is a "we should
|
||||||
|
have decided this earlier and didn't" framing. The cost of
|
||||||
|
acting on it now is real but bounded; the cost of acting on
|
||||||
|
it after the query DSL or constraint-management commands have
|
||||||
|
landed would be substantially larger. We have no users; the
|
||||||
|
agility cost of refactoring is at its lowest.
|
||||||
|
|
||||||
|
### What is and isn't this ADR
|
||||||
|
|
||||||
|
This ADR is purely about the **input layer to the DSL
|
||||||
|
parser**. It does not specify per-command usage rendering,
|
||||||
|
catalog keys for parse-error wording, or the renderer
|
||||||
|
composition of caret + structural error + usage hint. Those
|
||||||
|
are ADR-0021's scope. The two ADRs share an implementation
|
||||||
|
session; the dependency runs one way (ADR-0021 needs
|
||||||
|
ADR-0020's tokens).
|
||||||
|
|
||||||
|
This ADR also does not touch SQL parsing in advanced mode.
|
||||||
|
That path uses `sqlparser-rs` (per ADR-0001) and has its own
|
||||||
|
tokenization built in. A future ADR for the SQL subset (Q4)
|
||||||
|
will decide whether to share the DSL lexer's token model or
|
||||||
|
keep `sqlparser-rs`'s token surface as-is — that's not
|
||||||
|
prejudged here.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### 1. Two-phase parse: lex → parse
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub fn parse_command(input: &str) -> Result<Command, ParseError> {
|
||||||
|
let tokens = lex(input)?; // Stage 1
|
||||||
|
parse_tokens(&tokens, input) // Stage 2
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Stage 1 (`lex`) produces a span-tagged token stream. Stage 2
|
||||||
|
(`parse_tokens`) is a chumsky parser whose input type is
|
||||||
|
`&[Token]` instead of `&str`. The two stages are separately
|
||||||
|
testable; the lexer has its own test surface that doesn't
|
||||||
|
exercise the parser.
|
||||||
|
|
||||||
|
`parse_tokens` takes the original `&str` as a second
|
||||||
|
argument purely so the parser can consume bare path arguments
|
||||||
|
for the `replay` command directly from source (see §6). All
|
||||||
|
other parser logic operates over the token slice.
|
||||||
|
|
||||||
|
### 2. Token model
|
||||||
|
|
||||||
|
```rust
|
||||||
|
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||||
|
pub struct Token {
|
||||||
|
pub kind: TokenKind,
|
||||||
|
pub span: Span, // (start, end) byte offsets
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, PartialEq, Eq)]
|
||||||
|
pub enum TokenKind {
|
||||||
|
Keyword(Keyword), // case-folded reserved word
|
||||||
|
Identifier(String), // case-preserving (ADR-0009)
|
||||||
|
Number(String), // raw text; numeric parse is a parser concern
|
||||||
|
StringLiteral(String), // unquoted, escapes processed
|
||||||
|
Punct(Punct), // : ( ) , = . - one char each
|
||||||
|
Flag(String), // --name (e.g. "--all-rows")
|
||||||
|
Error(LexError), // unrecognised char / unterminated string
|
||||||
|
// — a token kind, not a Result variant
|
||||||
|
}
|
||||||
|
|
||||||
|
// Keyword is a closed set declared via a macro that is the
|
||||||
|
// single source of truth (§2a). Punct follows the same pattern.
|
||||||
|
```
|
||||||
|
|
||||||
|
The `Keyword` and `Punct` enums are not hand-declared. They
|
||||||
|
come out of macros described in §2a — one declaration site
|
||||||
|
that generates the enum, the lex-side string→variant
|
||||||
|
mapping, the variant→literal rendering, and the catalog-key
|
||||||
|
derivation in one place.
|
||||||
|
|
||||||
|
Notes on the model:
|
||||||
|
|
||||||
|
- **Keywords are an enum, not a string.** This makes
|
||||||
|
`kw(Keyword::Create)` a single-token chumsky match with
|
||||||
|
exact identity — fastest path, cleanest error patterns,
|
||||||
|
no string allocation in the hot path.
|
||||||
|
- **Type names stay as identifiers.** `int`, `text`,
|
||||||
|
`serial`, `varchar` all lex as `Identifier(_)`. The
|
||||||
|
parser-level `type_keyword()` continues to call
|
||||||
|
`Type::from_str` which produces the existing "unknown type
|
||||||
|
'varchar' (expected one of: text, int, real, …)"
|
||||||
|
message. That custom-error path is correct and stays. The
|
||||||
|
closed-set Keyword enum is for the *grammar's* reserved
|
||||||
|
words — words whose presence determines which command is
|
||||||
|
being parsed. Type names are *content*.
|
||||||
|
- **Number is raw text, not parsed.** `Value::Number(String)`
|
||||||
|
per ADR-0014 stays string-backed; the lexer doesn't try to
|
||||||
|
validate or convert. `1`, `-3.14`, `1e10`, `1abc` all
|
||||||
|
produce candidates for the parser to decide on. (The
|
||||||
|
current parser rejects `1abc` already; that doesn't change.)
|
||||||
|
- **`StringLiteral` is post-escape.** The lexer processes
|
||||||
|
`''` → `'` per the existing string syntax. The original
|
||||||
|
span covers the surrounding quotes; the payload is the
|
||||||
|
unescaped content. (Same convention as the current
|
||||||
|
`string_literal()` parser.)
|
||||||
|
- **`Flag(String)` is `--name` exactly.** No further parsing.
|
||||||
|
The parser matches `Flag(s)` and checks `s == "all-rows"`
|
||||||
|
etc. against a small set.
|
||||||
|
- **`Error(_)` is a token kind, not a `Result` variant.**
|
||||||
|
Lex always succeeds in producing `Vec<Token>` — even on
|
||||||
|
unterminated strings or unrecognised characters, a
|
||||||
|
diagnostic Error token is emitted in place. Rationale: I4
|
||||||
|
(syntax highlighting) needs to colour partial / invalid
|
||||||
|
input; treating lex errors as data instead of control flow
|
||||||
|
lets the highlighter walk a token stream uniformly. The
|
||||||
|
parser sees `Error(_)` tokens and raises a structural
|
||||||
|
error that names the underlying cause via the catalog.
|
||||||
|
|
||||||
|
### 2a. Single source of truth for keywords and punctuation
|
||||||
|
|
||||||
|
The `Keyword` enum, the lexer's reserved-word table, the
|
||||||
|
variant→literal rendering, and the catalog-key derivation
|
||||||
|
all come from one declaration. A `define_keywords!`
|
||||||
|
`macro_rules!` invocation in `src/dsl/keyword.rs`:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
macro_rules! define_keywords {
|
||||||
|
( $( $variant:ident => $literal:literal ),+ $(,)? ) => {
|
||||||
|
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
|
||||||
|
pub enum Keyword { $( $variant ),+ }
|
||||||
|
|
||||||
|
impl Keyword {
|
||||||
|
pub const ALL: &'static [(Keyword, &'static str)] = &[
|
||||||
|
$( (Keyword::$variant, $literal) ),+
|
||||||
|
];
|
||||||
|
|
||||||
|
/// Lex-side mapping. Case-insensitive per ADR-0009.
|
||||||
|
pub fn from_word(s: &str) -> Option<Keyword> {
|
||||||
|
Self::ALL.iter()
|
||||||
|
.find(|(_, lit)| s.eq_ignore_ascii_case(lit))
|
||||||
|
.map(|(kw, _)| *kw)
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Canonical lowercase literal for this variant.
|
||||||
|
pub fn as_str(self) -> &'static str {
|
||||||
|
Self::ALL.iter()
|
||||||
|
.find(|(kw, _)| *kw == self)
|
||||||
|
.map(|(_, lit)| *lit)
|
||||||
|
.unwrap()
|
||||||
|
}
|
||||||
|
|
||||||
|
/// `parse.token.keyword.<lit>` — the catalog key
|
||||||
|
/// the renderer looks up for the expected-set
|
||||||
|
/// vocabulary (ADR-0021 §4).
|
||||||
|
pub fn catalog_token_key(self) -> String {
|
||||||
|
format!("parse.token.keyword.{}", self.as_str())
|
||||||
|
}
|
||||||
|
}
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
define_keywords! {
|
||||||
|
Create => "create",
|
||||||
|
Drop => "drop",
|
||||||
|
Add => "add",
|
||||||
|
// ... one line per keyword ...
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`Punct` follows the same pattern via `define_punct!` —
|
||||||
|
Colon, OpenParen, CloseParen, Comma, Equals, Dot, each
|
||||||
|
generated from one line of the invocation.
|
||||||
|
|
||||||
|
Adding a new keyword is **one line** in the
|
||||||
|
`define_keywords!` invocation, plus one line in the
|
||||||
|
en-US YAML under `parse.token.keyword.<lit>` (the catalog
|
||||||
|
validator catches a missing entry at test time per ADR-0021
|
||||||
|
§7). Everything else — every parser combinator that
|
||||||
|
references the keyword, every usage-registry entry, every
|
||||||
|
catalog-key lookup — is a *use* of the source of truth, not
|
||||||
|
a duplicate, and is compile-time-checked by the type system
|
||||||
|
(typos in `kw(Keyword::Creaate)` don't compile).
|
||||||
|
|
||||||
|
### 3. Span carriage
|
||||||
|
|
||||||
|
Every `Token` carries `(start, end)` byte offsets in the
|
||||||
|
original source string. The lexer is byte-exact for
|
||||||
|
multi-byte UTF-8 sequences (ASCII-only is the realistic
|
||||||
|
input but the lexer does not panic on Unicode in identifiers
|
||||||
|
or string literals).
|
||||||
|
|
||||||
|
`ParseError::Invalid` continues to carry `position: usize`,
|
||||||
|
which is the `start` offset of the failing token (or the
|
||||||
|
end-of-input position when the failure is at EOF). Caret
|
||||||
|
rendering (in `app.rs`) is unchanged — same byte-position
|
||||||
|
contract.
|
||||||
|
|
||||||
|
### 4. Lexer-vs-parser error split
|
||||||
|
|
||||||
|
The lexer is responsible for:
|
||||||
|
|
||||||
|
- **Tokenization shape errors**: unterminated string literal,
|
||||||
|
unrecognised character. Both surface as `TokenKind::Error(_)`
|
||||||
|
tokens with span coverage of the offending region.
|
||||||
|
- **Nothing else.** Specifically: type names, command
|
||||||
|
keywords, value validity, flag names, numeric overflow,
|
||||||
|
identifier collisions are all parser- or higher-layer
|
||||||
|
concerns.
|
||||||
|
|
||||||
|
The parser is responsible for:
|
||||||
|
|
||||||
|
- **Grammar-shape errors**: missing tokens, wrong tokens,
|
||||||
|
unexpected tokens, end-of-input mid-command. These come
|
||||||
|
out of chumsky's structural error machinery and aggregate
|
||||||
|
across `choice` naturally.
|
||||||
|
- **Content errors via `try_map`**: unknown type name
|
||||||
|
("unknown type 'varchar'"), mutually-exclusive flags
|
||||||
|
("`--force-conversion` and `--dont-convert` are mutually
|
||||||
|
exclusive"), "with pk needs at least one column",
|
||||||
|
referential-clause repeated. These keep their hand-written
|
||||||
|
messages — the `Rich::custom` path is correct for content,
|
||||||
|
it was only wrong for keyword-shape.
|
||||||
|
|
||||||
|
The render layer (humanise) handles both uniformly via the
|
||||||
|
catalog (ADR-0021).
|
||||||
|
|
||||||
|
### 5. The chumsky-over-tokens combinator surface
|
||||||
|
|
||||||
|
Chumsky 0.13 parses arbitrary input types, not just `&str`.
|
||||||
|
The combinator surface for `Parser<'a, &'a [Token], …>` is
|
||||||
|
the same as for `&str`: `just`, `choice`, `then`,
|
||||||
|
`then_ignore`, `ignore_then`, `or_not`, `repeated`,
|
||||||
|
`separated_by`, `try_map`, `labelled`. The only differences:
|
||||||
|
|
||||||
|
- `just(Token { kind: Keyword::Create, .. })` doesn't work
|
||||||
|
literally because spans differ. We define helpers
|
||||||
|
`kw(Keyword::Create)`, `punct(Punct::Colon)`, `ident()`,
|
||||||
|
`number()`, `string_lit()`, `flag(name)` that match by
|
||||||
|
kind and ignore span on the input side, returning either
|
||||||
|
`()` or the carried payload.
|
||||||
|
- `padded()` (which strips whitespace) is no longer needed —
|
||||||
|
the lexer skips whitespace already. This simplifies a lot
|
||||||
|
of combinator chains (`just(':').padded()` → `punct(Punct::Colon)`).
|
||||||
|
- `text::keyword(...)`, `any().filter(...)`, etc. — all the
|
||||||
|
character-level helpers — are gone from the parser. They
|
||||||
|
belong to the lexer if anywhere.
|
||||||
|
|
||||||
|
The parser keeps producing `Rich<Token>` errors. The
|
||||||
|
`expected` set members are `RichPattern<Token>`; the
|
||||||
|
catalog's `parse.token.*` keys (ADR-0021 §4) translate them.
|
||||||
|
|
||||||
|
### 6. The `replay` path-argument wart
|
||||||
|
|
||||||
|
The current grammar admits bare paths: `replay history.log`,
|
||||||
|
`replay /tmp/seed.commands`. Bare paths contain `.` and `/`,
|
||||||
|
which are punctuation tokens. A naive token-based parser
|
||||||
|
would see `Keyword(Replay) Identifier("history") Punct(Dot)
|
||||||
|
Identifier("log")` and have to reassemble — annoyingly
|
||||||
|
context-dependent.
|
||||||
|
|
||||||
|
Decision: keep the existing bare-path UX. The parser
|
||||||
|
special-cases the `replay` command at one point: after
|
||||||
|
matching `Keyword(Replay)`, it consumes its argument
|
||||||
|
**directly from the original source string** by reading from
|
||||||
|
the next token's start byte to end-of-input (or to the next
|
||||||
|
unambiguous terminator, of which today there is none in the
|
||||||
|
DSL). The lexer's job for the path argument is essentially
|
||||||
|
to identify *where the path starts*; the parser does the
|
||||||
|
rest by source-slicing.
|
||||||
|
|
||||||
|
The quoted-path form (`replay 'my project/seed.commands'`)
|
||||||
|
still goes through the lexer's normal `StringLiteral` path
|
||||||
|
and is matched as `Keyword(Replay) StringLiteral(s)`.
|
||||||
|
|
||||||
|
This special-casing is documented inline in the parser. It
|
||||||
|
costs ~10 lines of code and avoids a breaking change to the
|
||||||
|
DSL surface. No other DSL command takes a free-form
|
||||||
|
filesystem-path argument; if a future command does, it
|
||||||
|
either uses the same special-case or accepts only quoted
|
||||||
|
paths.
|
||||||
|
|
||||||
|
### 7. Whitespace
|
||||||
|
|
||||||
|
The lexer skips ASCII whitespace (` \t\r\n`) between tokens.
|
||||||
|
This honours ADR-0009's "whitespace is liberal" rule
|
||||||
|
unchanged — the user can put any amount of whitespace
|
||||||
|
between any two tokens, and the parser doesn't see it.
|
||||||
|
|
||||||
|
The lexer does **not** track whitespace as tokens. I4 (syntax
|
||||||
|
highlighting) recovers whitespace as the gaps between token
|
||||||
|
spans, computed at render time.
|
||||||
|
|
||||||
|
### 8. Comments
|
||||||
|
|
||||||
|
The DSL has no comment syntax today and this ADR doesn't
|
||||||
|
introduce one. If a comment syntax is added later (likely
|
||||||
|
`--` line comments to match SQL, but that conflicts with
|
||||||
|
flag prefixes — this is a real design decision a future ADR
|
||||||
|
will need to settle), the lexer is the right layer to skip
|
||||||
|
them. Out of scope here.
|
||||||
|
|
||||||
|
### 9. I3 (tab completion) hook
|
||||||
|
|
||||||
|
This ADR commits to making the parser's expected-token-set
|
||||||
|
queryable at any point in the input. Concretely:
|
||||||
|
|
||||||
|
- `parse_tokens(&[Token], &str)` returns
|
||||||
|
`(Result<Command, ParseError>, ParseDiagnostics)` where
|
||||||
|
`ParseDiagnostics` carries the raw chumsky output —
|
||||||
|
including the merged `expected` set at the failure point.
|
||||||
|
- Truncating the token stream at the cursor and re-parsing
|
||||||
|
yields the expected-token-set at the cursor position. I3
|
||||||
|
uses this for completion.
|
||||||
|
- The lexer is independently useful to I3: tokenizing the
|
||||||
|
in-progress input lets the completion logic see what
|
||||||
|
*partial* token the cursor is in (mid-keyword, mid-identifier,
|
||||||
|
mid-string).
|
||||||
|
|
||||||
|
The full I3 surface (cursor positioning, completion menu,
|
||||||
|
disambiguation, completion of identifiers from schema) is
|
||||||
|
out of scope here. ADR-0020 commits only to the parser
|
||||||
|
contract.
|
||||||
|
|
||||||
|
### 10. I4 (syntax highlighting) hook
|
||||||
|
|
||||||
|
The token kinds are the natural classification for syntax
|
||||||
|
highlighting:
|
||||||
|
|
||||||
|
- `Keyword(_)` → keyword colour.
|
||||||
|
- `Identifier(_)` → identifier colour.
|
||||||
|
- `Number(_)`, `StringLiteral(_)` → literal colours
|
||||||
|
(probably distinct from each other).
|
||||||
|
- `Punct(_)` → punctuation colour (or no special colouring).
|
||||||
|
- `Flag(_)` → flag colour.
|
||||||
|
- `Error(_)` → error highlight (red underline / squiggle).
|
||||||
|
|
||||||
|
The lexer succeeds on partial / invalid input (§2). I4 does
|
||||||
|
not need a successful parse to render highlighting — only
|
||||||
|
successful tokenization. This is the load-bearing property
|
||||||
|
for "live" highlighting as the user types.
|
||||||
|
|
||||||
|
The colour scheme, theme integration, render layer, and
|
||||||
|
performance budget are out of scope here. ADR-0020 commits
|
||||||
|
only to "the lexer always produces a token stream over which
|
||||||
|
highlighting can iterate".
|
||||||
|
|
||||||
|
### 11. Recovery-based partial AST — explicitly deferred
|
||||||
|
|
||||||
|
Chumsky supports parser combinators with recovery, producing
|
||||||
|
partial ASTs alongside errors. Useful for I3 ("the user has
|
||||||
|
typed `create table Foo with pk` and now I want to know
|
||||||
|
what the partial AST is so I can suggest column-spec
|
||||||
|
completions") but the design space is large (which recovery
|
||||||
|
strategies, what the partial AST shape is, how downstream
|
||||||
|
consumers handle it).
|
||||||
|
|
||||||
|
This ADR keeps the parser non-recovering: a failed parse
|
||||||
|
returns `Err(ParseError)` and no partial AST. I3's ADR will
|
||||||
|
decide whether recovery is needed; if so, the change is
|
||||||
|
local to `parse_tokens` and doesn't ripple.
|
||||||
|
|
||||||
|
### 12. Migration of existing parser tests
|
||||||
|
|
||||||
|
The 50+ unit tests in `dsl/parser.rs::tests` are the spec of
|
||||||
|
the current grammar. The migration is mechanical:
|
||||||
|
|
||||||
|
- Tests that call `parse_command(input)` keep doing so —
|
||||||
|
`parse_command` is the public boundary and its signature
|
||||||
|
doesn't change.
|
||||||
|
- Tests that assert on `ParseError::Invalid { message, .. }`
|
||||||
|
may need wording updates if the new error layer rewords
|
||||||
|
them, but anchor substrings ("unknown type", "specified
|
||||||
|
twice", "mutually exclusive", "varchar", "expected one
|
||||||
|
of") stay intact (those come from `try_map` content errors
|
||||||
|
that survive unchanged).
|
||||||
|
- Two existing tests assert on chumsky's structural wording:
|
||||||
|
`structural_error_for_show_data_without_arg`
|
||||||
|
("after `show data`", "expected identifier", "found end of
|
||||||
|
input") and `structural_error_for_change_column_with_swapped_args`
|
||||||
|
("after `change column Rich`", "expected `:`"). The new
|
||||||
|
rendering preserves the same shape — `after <prefix>,
|
||||||
|
expected <set>, found <token>` — so these tests should
|
||||||
|
port with at most minor adjustments. ADR-0021 specifies
|
||||||
|
the rendering precisely.
|
||||||
|
|
||||||
|
A new test surface for the lexer itself: lex output for
|
||||||
|
representative inputs, lex error tokens for unterminated
|
||||||
|
strings + unknown chars, span correctness.
|
||||||
|
|
||||||
|
## Out of scope
|
||||||
|
|
||||||
|
Deliberately deferred to keep this ADR focused:
|
||||||
|
|
||||||
|
1. **Per-command usage templates.** ADR-0021.
|
||||||
|
2. **`parse.usage.*` and `parse.token.*` catalog keys.**
|
||||||
|
ADR-0021.
|
||||||
|
3. **Error renderer composition** (caret + structural error
|
||||||
|
+ usage hint). ADR-0021.
|
||||||
|
4. **I3 completion UI** + cursor logic + identifier
|
||||||
|
completion from schema. Future I3 ADR.
|
||||||
|
5. **I4 colour scheme** + theme integration. Future I4 ADR.
|
||||||
|
6. **Recovery-based partial AST.** §11 — re-opens with I3.
|
||||||
|
7. **Comment syntax.** §8.
|
||||||
|
8. **Sharing the lexer with `sqlparser-rs`** (advanced-mode
|
||||||
|
SQL). The two parsers stay separate today; a future SQL
|
||||||
|
subset ADR may revisit.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- **Aggregation works.** Top-level `choice((create_*, drop_*,
|
||||||
|
add_*, …))` failures emit "expected `create`, `drop`,
|
||||||
|
`add`, …" structurally. The user sees the family of
|
||||||
|
available commands rather than one branch's report.
|
||||||
|
- **The bespoke `humanise()` machinery shrinks.** Roughly
|
||||||
|
half the helpers in `parser.rs` (~80-100 lines) are no
|
||||||
|
longer needed because token-level error patterns render
|
||||||
|
directly via the catalog. Less code is less code to
|
||||||
|
maintain.
|
||||||
|
- **I3 / I4 inherit a clean foundation.** Their ADRs can
|
||||||
|
focus on UX/UI rather than re-litigating parser shape.
|
||||||
|
- **Lex errors and parse errors share a render path.**
|
||||||
|
Unterminated strings and missing keywords both surface
|
||||||
|
through the same catalog-driven layer.
|
||||||
|
- **The token stream is a natural API for future tooling**:
|
||||||
|
schema-aware highlighting, structural editing, command
|
||||||
|
history rendering with token-level colour.
|
||||||
|
|
||||||
|
### Costs
|
||||||
|
|
||||||
|
- **One-time migration of `dsl/parser.rs`.** Every combinator
|
||||||
|
rewrites against `&[Token]`. Estimated 600-900 lines of
|
||||||
|
parser.rs change including the lexer module; the lexer
|
||||||
|
itself is probably 200-300 lines plus 150-250 lines of
|
||||||
|
unit tests.
|
||||||
|
- **Catalog grows by one entry per keyword and per punct.**
|
||||||
|
The macro-driven `Keyword` / `Punct` enums (§2a) collapse
|
||||||
|
the enum + lex table + as-str + catalog-key derivation into
|
||||||
|
one declaration site, so adding a keyword is one line of
|
||||||
|
Rust + one line of YAML. The catalog validator enforces
|
||||||
|
completeness at test time (ADR-0021 §7). Compared to today
|
||||||
|
— where adding a keyword means a new `keyword_ci("...")`
|
||||||
|
call site (or several, if used in multiple commands) — the
|
||||||
|
per-keyword cost is comparable; what shifts is *where* the
|
||||||
|
edit happens. A unit test additionally asserts every
|
||||||
|
`Keyword::ALL` variant is referenced by some parser
|
||||||
|
combinator, catching dead enum entries.
|
||||||
|
- **Span model needs care for UTF-8.** Byte offsets are the
|
||||||
|
contract; the lexer must split tokens at UTF-8 boundaries.
|
||||||
|
Identifier/string tests should include at least one
|
||||||
|
multi-byte case.
|
||||||
|
- **`replay` special-case** in the parser (§6). One command,
|
||||||
|
~10 lines, documented inline. Acceptable; not a precedent
|
||||||
|
for other commands.
|
||||||
|
- **Tests churn.** Two structural-error tests need new wording
|
||||||
|
(mechanical port). Existing content-error tests stay as-is.
|
||||||
|
|
||||||
|
### Neutral
|
||||||
|
|
||||||
|
- **chumsky stays.** No framework change, no new dependency.
|
||||||
|
The parser still expresses the grammar declaratively; only
|
||||||
|
the input atoms change.
|
||||||
|
- **`Command` AST is unchanged.** The parser produces the
|
||||||
|
same `Command` enum it does today; downstream code (runtime,
|
||||||
|
app, db) is untouched.
|
||||||
|
- **Public API of `dsl::parser` is unchanged.**
|
||||||
|
`parse_command(&str) → Result<Command, ParseError>` keeps
|
||||||
|
its signature. ADR-0021 may extend the public surface (e.g.
|
||||||
|
expose `lex()` and `parse_tokens()` for I3/I4) but does so
|
||||||
|
additively.
|
||||||
|
|
||||||
|
## Implementation notes
|
||||||
|
|
||||||
|
These are sketch-level — implementation will produce more
|
||||||
|
detailed work, but they're enough that a session picking
|
||||||
|
this up has direction.
|
||||||
|
|
||||||
|
### Order of operations
|
||||||
|
|
||||||
|
1. **Lexer module.** New file `src/dsl/lexer.rs`. Token /
|
||||||
|
TokenKind / Keyword / Punct types. `lex(input: &str) →
|
||||||
|
Vec<Token>` (always succeeds; embeds `Error(_)` tokens
|
||||||
|
for shape errors). Unit tests for representative inputs,
|
||||||
|
span correctness, error-token cases, multi-byte UTF-8.
|
||||||
|
2. **Token-aware combinator helpers.** Probably in
|
||||||
|
`src/dsl/parser.rs` next to the existing combinators.
|
||||||
|
`kw(Keyword)`, `punct(Punct)`, `ident()`, `number()`,
|
||||||
|
`string_lit()`, `flag(&str)`, `eof()`. Each parses a
|
||||||
|
single token by kind and returns its payload (or `()`).
|
||||||
|
3. **Rewrite `command_parser()` and its sub-parsers.** One
|
||||||
|
sub-parser at a time; run the existing tests after each.
|
||||||
|
Aim for green-after-every-step rather than a big-bang
|
||||||
|
port.
|
||||||
|
4. **The `replay` source-slice special case** (§6).
|
||||||
|
5. **Strip `humanise()` machinery** that's no longer needed.
|
||||||
|
The render path in `into_parse_error()` simplifies.
|
||||||
|
ADR-0021 owns the new render shape; until it lands, keep
|
||||||
|
a minimal humaniser that produces the existing wording.
|
||||||
|
6. **Public API check.** `parse_command` signature unchanged.
|
||||||
|
Add `lex(&str) → Vec<Token>` and `parse_tokens(&[Token],
|
||||||
|
&str) → Result<Command, ParseError>` as `pub` so I3/I4
|
||||||
|
can hook in later (their ADRs will use these).
|
||||||
|
|
||||||
|
### Things that interact subtly
|
||||||
|
|
||||||
|
- **The `try_map` content errors** survive unchanged. They
|
||||||
|
fire on tokens (e.g. `try_map` on a parsed `Identifier(s)`
|
||||||
|
that's expected to be a type name) but their messages and
|
||||||
|
classification are identical to today. The catalog
|
||||||
|
vocabulary they use ("unknown type", "expected one of",
|
||||||
|
"mutually exclusive", "specified twice") stays.
|
||||||
|
- **The `1:n` cardinality** lexes as `Number("1")`,
|
||||||
|
`Punct(Colon)`, `Identifier("n"|"N")`. The parser composes
|
||||||
|
these into the relationship-cardinality assertion as
|
||||||
|
today; no special token kind for it.
|
||||||
|
- **Negative number literals.** Lexer emits `Punct(Minus)`?
|
||||||
|
No — there is no Minus in the punct set above because the
|
||||||
|
current grammar has no need for a unary `-` outside number
|
||||||
|
literals. Decision: the lexer recognises `-` as part of a
|
||||||
|
number literal when followed by an ASCII digit, producing a
|
||||||
|
single `Number("-5")` token. A bare `-` not followed by a
|
||||||
|
digit is a `Punct(Minus)` only if a `Minus` variant is
|
||||||
|
added — for now, treat as `Error(UnknownChar)`. This
|
||||||
|
matches the current grammar (which only accepts `-` as a
|
||||||
|
number sign).
|
||||||
|
- **The `--` flag prefix vs. a future `--` line comment.**
|
||||||
|
Today `--all-rows` etc. are flags. If a future ADR
|
||||||
|
introduces SQL-style `--` line comments in advanced mode
|
||||||
|
only, the lexer may need a mode parameter. ADR-0020 doesn't
|
||||||
|
pre-empt that.
|
||||||
|
- **The hard-coded `"running: "` prefix in `app.rs`** for
|
||||||
|
caret padding: unchanged. The parser still reports a
|
||||||
|
byte-position; the caret math is the same.
|
||||||
@@ -0,0 +1,452 @@
|
|||||||
|
# ADR-0021: Parser-as-source-of-truth for H1a (per-command usage in parse errors)
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted.
|
||||||
|
|
||||||
|
Builds on ADR-0020 (tokenization layer). Addresses H1a from
|
||||||
|
`requirements.md` — the parse-error pedagogy gap that
|
||||||
|
ADR-0019's friendly-error layer left untouched.
|
||||||
|
|
||||||
|
Cross-references ADR-0019 (i18n catalog conventions; H1a's
|
||||||
|
output goes through the same catalog) and ADR-0009 (DSL
|
||||||
|
syntax conventions; usage templates render in the project's
|
||||||
|
documented surface form).
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
ADR-0019 dramatically improved engine-error wording.
|
||||||
|
Parse-error wording is now the visibly-weakest user surface —
|
||||||
|
the user-reported gap was concrete: typing `create` produces
|
||||||
|
|
||||||
|
```
|
||||||
|
parse error: after `create`, expected `table`
|
||||||
|
```
|
||||||
|
|
||||||
|
The error is *structurally* correct (chumsky has consumed
|
||||||
|
`create` and is now looking for the next required token) but
|
||||||
|
*pedagogically* silent. A learner who got this far typed
|
||||||
|
`create` because they'd been told that's how new tables are
|
||||||
|
made; what they need next is the shape of the command, not
|
||||||
|
a single missing-token pointer.
|
||||||
|
|
||||||
|
Comparable observations apply across the whole DSL surface:
|
||||||
|
|
||||||
|
- `add` → expected `column` or `1` (uninformative; user
|
||||||
|
needs the shape of `add column …` AND `add 1:n
|
||||||
|
relationship …`).
|
||||||
|
- `update Customers` → expected `set` (true; but `update`'s
|
||||||
|
full grammar with `set …`, `where …`, `--all-rows` is what
|
||||||
|
the user wants illustrated).
|
||||||
|
- `frobulate Customers` → expected one of `create`, `drop`,
|
||||||
|
`add`, `rename`, `change`, `show`, `insert`, `update`,
|
||||||
|
`delete`, `replay` (true after ADR-0020; the available-commands
|
||||||
|
list is now informative, but the no-prefix case wants its
|
||||||
|
own framing — "available commands" rather than "expected").
|
||||||
|
|
||||||
|
H1a's job is to close that gap by surfacing the **grammar**
|
||||||
|
of the command at the point of error, not just the next
|
||||||
|
token.
|
||||||
|
|
||||||
|
### What ADR-0020 supplies
|
||||||
|
|
||||||
|
ADR-0020 lands the lexer + parser-over-tokens architecture.
|
||||||
|
What that buys H1a:
|
||||||
|
|
||||||
|
- **Aggregated `expected` sets at the failure point** (top-level
|
||||||
|
`choice` failures now list every command-starting keyword,
|
||||||
|
not just one). The user-visible "available commands" list
|
||||||
|
becomes correct without any work in this ADR.
|
||||||
|
- **Token-kind error patterns** (`RichPattern<Token>` instead
|
||||||
|
of `RichPattern<char>`). Each pattern renders via a stable
|
||||||
|
catalog key — no per-character humanising.
|
||||||
|
- **A canonical entry-token for each command** (the first
|
||||||
|
`Keyword(_)` consumed). H1a keys per-command usage
|
||||||
|
templates off this token.
|
||||||
|
|
||||||
|
### What this ADR adds on top
|
||||||
|
|
||||||
|
- A registry of per-command **usage templates** (one
|
||||||
|
declaration per command).
|
||||||
|
- A renderer that composes the parse error with: caret +
|
||||||
|
structural error wording + matching usage template(s).
|
||||||
|
- New catalog keys under `parse.usage.*` (templates) and
|
||||||
|
`parse.token.*` (single-token rendering for expected-set
|
||||||
|
joins).
|
||||||
|
- A "no commands consumed" fallback that renders an
|
||||||
|
available-commands list under a different prefix
|
||||||
|
("available commands:" rather than "expected:") for the
|
||||||
|
zero-prefix case.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### 1. Per-command usage template registry
|
||||||
|
|
||||||
|
Each command parser is paired with a `UsageEntry`:
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub struct UsageEntry {
|
||||||
|
/// First keyword that distinguishes this command. Used
|
||||||
|
/// as the registry key.
|
||||||
|
pub entry: Keyword,
|
||||||
|
/// Catalog key for the grammar template body (under
|
||||||
|
/// `parse.usage.*`). One key per command.
|
||||||
|
pub catalog_key: &'static str,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
The registry is a `&'static [UsageEntry]` declared in one
|
||||||
|
place (`src/dsl/usage.rs`). Lookup: given a consumed entry
|
||||||
|
keyword, return all entries whose `entry == keyword`. For
|
||||||
|
`Keyword::Add` the registry returns the `add column` and
|
||||||
|
`add 1:n relationship` entries; for `Keyword::Drop` it
|
||||||
|
returns `drop table`, `drop column`, `drop relationship`;
|
||||||
|
for unique-entry keywords (e.g. `Keyword::Create` today) it
|
||||||
|
returns one.
|
||||||
|
|
||||||
|
The catalog key is what gets translated. Template bodies
|
||||||
|
live in `src/friendly/strings/en-US.yaml` under
|
||||||
|
`parse.usage.*`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
parse:
|
||||||
|
usage:
|
||||||
|
create_table: "create table <Name> with pk [<col>:<type>[, ...]]"
|
||||||
|
drop_table: "drop table <Name>"
|
||||||
|
add_column: "add column [to] [table] <Table>: <Name> (<Type>)"
|
||||||
|
add_relationship: |
|
||||||
|
add 1:n relationship [as <Name>]
|
||||||
|
from <Parent>.<col> to <Child>.<col>
|
||||||
|
[on delete <action>] [on update <action>]
|
||||||
|
[--create-fk]
|
||||||
|
rename_column: "rename column [in] [table] <Table>: <Old> to <New>"
|
||||||
|
change_column: |
|
||||||
|
change column [in] [table] <Table>: <Name> (<Type>)
|
||||||
|
[--force-conversion | --dont-convert]
|
||||||
|
show_data: "show data <Table>"
|
||||||
|
show_table: "show table <Table>"
|
||||||
|
insert: "insert into <Table> [(<col>[, ...])] [values] (<value>[, ...])"
|
||||||
|
update: "update <Table> set <col>=<value>[, ...] (where <col>=<value> | --all-rows)"
|
||||||
|
delete: "delete from <Table> (where <col>=<value> | --all-rows)"
|
||||||
|
drop_column: "drop column [from] [table] <Table>: <Name>"
|
||||||
|
drop_relationship: |
|
||||||
|
drop relationship <Name>
|
||||||
|
drop relationship from <Parent>.<col> to <Child>.<col>
|
||||||
|
replay: "replay <path> | replay '<path with spaces>'"
|
||||||
|
```
|
||||||
|
|
||||||
|
(Wording is illustrative; exact phrasing settled at
|
||||||
|
implementation time. The bracket convention `[...]` for
|
||||||
|
optional parts and angle-bracket `<...>` for placeholders
|
||||||
|
matches ADR-0009's documentation surface.)
|
||||||
|
|
||||||
|
### 2. The renderer composes three blocks
|
||||||
|
|
||||||
|
A parse error renders as:
|
||||||
|
|
||||||
|
```
|
||||||
|
running: <user input>
|
||||||
|
^ ← caret (existing, unchanged)
|
||||||
|
parse error: <structural-or-content message>
|
||||||
|
usage: <template1>
|
||||||
|
<template2> ← when multiple entries share the entry keyword
|
||||||
|
```
|
||||||
|
|
||||||
|
Block 1 (the echo + caret) is unchanged from today.
|
||||||
|
|
||||||
|
Block 2 is the structural or content error. ADR-0020
|
||||||
|
guarantees the structural error is now properly aggregated
|
||||||
|
("expected `data` or `table`" not "expected `table`"). The
|
||||||
|
content errors (unknown type, mutually-exclusive flags) are
|
||||||
|
unchanged in voice.
|
||||||
|
|
||||||
|
Block 3 (usage:) is new. It is rendered if and only if **at
|
||||||
|
least one keyword token was consumed** before the parser
|
||||||
|
failed AND that keyword is a registered entry. If no keyword
|
||||||
|
was consumed (e.g., `frobulate Customers`, where `frobulate`
|
||||||
|
is an `Identifier`, not a `Keyword`), Block 3 is replaced
|
||||||
|
with the no-prefix fallback (§5).
|
||||||
|
|
||||||
|
If multiple entries match (e.g., the `add` family), all are
|
||||||
|
listed under a single `usage:` prefix, one per line.
|
||||||
|
|
||||||
|
### 3. Identifying the consumed entry keyword
|
||||||
|
|
||||||
|
The parser surfaces, alongside the `ParseError`, the
|
||||||
|
**deepest successfully-consumed keyword token**. Mechanism:
|
||||||
|
|
||||||
|
- `parse_tokens` returns `(Result<Command, ParseError>,
|
||||||
|
ParseDiagnostics)` where `ParseDiagnostics` carries the
|
||||||
|
furthest position chumsky reached AND a snapshot of the
|
||||||
|
consumed prefix.
|
||||||
|
- The renderer walks the consumed prefix backward to find the
|
||||||
|
first `Keyword(_)` token. (Almost always the first token,
|
||||||
|
but a future grammar where a command starts with a
|
||||||
|
literal — none today — would still resolve correctly.)
|
||||||
|
|
||||||
|
This logic lives in `src/dsl/usage.rs::matched_entry()` so
|
||||||
|
the registry and the lookup sit together.
|
||||||
|
|
||||||
|
### 4. `parse.token.*` — single-token catalog vocabulary
|
||||||
|
|
||||||
|
Chumsky's expected-set rendering needs a name for each token
|
||||||
|
kind. Today `humanise()` hand-codes these
|
||||||
|
(`describe_pattern` returns "`create`", "identifier", etc.).
|
||||||
|
ADR-0021 moves the vocabulary into the catalog:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
parse:
|
||||||
|
token:
|
||||||
|
# Keywords — one entry per Keyword enum variant.
|
||||||
|
keyword.create: "`create`"
|
||||||
|
keyword.table: "`table`"
|
||||||
|
keyword.with: "`with`"
|
||||||
|
# ... one per Keyword variant ...
|
||||||
|
|
||||||
|
# Punctuation.
|
||||||
|
punct.colon: "`:`"
|
||||||
|
punct.open_paren: "`(`"
|
||||||
|
punct.close_paren: "`)`"
|
||||||
|
punct.comma: "`,`"
|
||||||
|
punct.equals: "`=`"
|
||||||
|
punct.dot: "`.`"
|
||||||
|
|
||||||
|
# Token-class labels.
|
||||||
|
identifier: "identifier"
|
||||||
|
number: "number"
|
||||||
|
string_literal: "string literal"
|
||||||
|
flag: "flag (--name)"
|
||||||
|
end_of_input: "end of input"
|
||||||
|
|
||||||
|
# Lexer-error tokens.
|
||||||
|
error.unterminated_string: "unterminated string starting at column {column}"
|
||||||
|
error.unknown_char: "unrecognised character {found}"
|
||||||
|
```
|
||||||
|
|
||||||
|
Joining ("`a`, `b`, or `c`") stays in code (`oxford_or` from
|
||||||
|
the current humanise machinery, lifted intact). Wording of
|
||||||
|
each token is in the catalog.
|
||||||
|
|
||||||
|
`parse.error` (existing wrapper key) stays. Its `{detail}`
|
||||||
|
placeholder is filled by:
|
||||||
|
|
||||||
|
```
|
||||||
|
{consumed_prefix} expected {oxford_or(expected)}, found {found_token}
|
||||||
|
```
|
||||||
|
|
||||||
|
— each piece sourced from the catalog, joined in code.
|
||||||
|
|
||||||
|
`parse.caret` (existing) and `parse.empty` (existing)
|
||||||
|
unchanged.
|
||||||
|
|
||||||
|
### 5. No-prefix fallback: "available commands"
|
||||||
|
|
||||||
|
When the parser fails with **no keyword consumed**, the
|
||||||
|
"expected" set lists every top-level command-starting
|
||||||
|
keyword. That's correct but the framing should be
|
||||||
|
"available commands" rather than "expected".
|
||||||
|
|
||||||
|
Renderer detects this case (consumed-keyword count == 0) and
|
||||||
|
substitutes Block 3 with:
|
||||||
|
|
||||||
|
```
|
||||||
|
available commands: create, drop, add, rename, change,
|
||||||
|
show, insert, update, delete, replay
|
||||||
|
```
|
||||||
|
|
||||||
|
via a new catalog key:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
parse:
|
||||||
|
available_commands: "available commands: {commands}"
|
||||||
|
```
|
||||||
|
|
||||||
|
The list is the alphabetised set of `entry` keywords from
|
||||||
|
the usage registry, each rendered via its `parse.token.keyword.*`
|
||||||
|
catalog entry (so the strings are catalog-sourced, not
|
||||||
|
hard-coded).
|
||||||
|
|
||||||
|
This case only fires when the user typed something the
|
||||||
|
parser couldn't classify as any known command keyword — the
|
||||||
|
"frobulate Customers" case. It's both rarer and more useful
|
||||||
|
than the with-prefix case: a user this lost benefits more
|
||||||
|
from the full menu than from a missing-token pointer.
|
||||||
|
|
||||||
|
### 6. Anchor-phrase compliance (ADR-0019 §10)
|
||||||
|
|
||||||
|
ADR-0019's anchor-phrase list contains nine substrings the
|
||||||
|
catalog commits to keeping stable. None are parse-error-specific,
|
||||||
|
so this ADR doesn't add to the list. The existing parser
|
||||||
|
test that asserts on "unknown type" and "expected one of"
|
||||||
|
substrings stays — those come from `Type::from_str`'s custom
|
||||||
|
error message which ADR-0020 §4 keeps unchanged.
|
||||||
|
|
||||||
|
The current structural-error tests assert on substrings like
|
||||||
|
"after `show data`", "expected identifier", "found end of
|
||||||
|
input", "after `change column Rich`", "expected `:`". The
|
||||||
|
new render shape preserves all of these — the rendering
|
||||||
|
template is `{prefix} expected {set}, found {token}` and
|
||||||
|
the prefix / set / token come from the catalog with the same
|
||||||
|
wording. Tests should port unchanged or with at most minor
|
||||||
|
adjustments.
|
||||||
|
|
||||||
|
### 7. Catalog validator covers the new keys
|
||||||
|
|
||||||
|
ADR-0019 §8.6's `KEYS_AND_PLACEHOLDERS` validator extends
|
||||||
|
to cover:
|
||||||
|
|
||||||
|
- Every `parse.usage.<command>` key referenced from the
|
||||||
|
registry exists.
|
||||||
|
- Every `parse.token.keyword.<variant>` key for every
|
||||||
|
`Keyword` enum variant exists.
|
||||||
|
- Every `parse.token.punct.<variant>` key for every `Punct`
|
||||||
|
variant exists.
|
||||||
|
- The `parse.token.{identifier, number, string_literal,
|
||||||
|
flag, end_of_input}` keys exist.
|
||||||
|
- The `parse.token.error.*` keys exist for every
|
||||||
|
`LexErrorKind` variant.
|
||||||
|
- The `parse.available_commands` key exists.
|
||||||
|
- No format specifiers (already enforced).
|
||||||
|
- No engine vocabulary (already enforced).
|
||||||
|
|
||||||
|
### 8. The `usage:` block respects the verbosity setting?
|
||||||
|
|
||||||
|
No. The `messages (short|verbose)` setting (ADR-0019)
|
||||||
|
governs *engine-error* verbosity (whether to render the
|
||||||
|
hint block of a `FriendlyError`). Parse errors don't go
|
||||||
|
through `FriendlyError`; they have their own render path,
|
||||||
|
and the usage block is always shown. Rationale: a learner
|
||||||
|
toggling to `messages short` is signalling they recognise
|
||||||
|
the engine-error patterns and want less explanation around
|
||||||
|
those — they're not signalling that they want less
|
||||||
|
parse-help. Parse errors mean the user couldn't even
|
||||||
|
formulate a runnable command; that's exactly the moment to
|
||||||
|
maximise pedagogical surface, regardless of the
|
||||||
|
engine-error verbosity preference.
|
||||||
|
|
||||||
|
If experience shows this is wrong, a future amendment can
|
||||||
|
gate the usage block on a separate setting. Doesn't need
|
||||||
|
to be designed now.
|
||||||
|
|
||||||
|
## Out of scope
|
||||||
|
|
||||||
|
1. **Tab completion (I3) and syntax highlighting (I4)**
|
||||||
|
themselves. ADR-0020 §9-10 commits to the parser
|
||||||
|
contract; ADR-0021 doesn't extend it.
|
||||||
|
2. **Schema-aware suggestions** ("did you mean `Customers`?"
|
||||||
|
when the user typed `Customrs`). Useful but a separate
|
||||||
|
feature; would land in I3 territory (completion + spell
|
||||||
|
check share a candidate list).
|
||||||
|
3. **Suggested fixes** ("change `crete` to `create`"). Same
|
||||||
|
bucket as schema-aware suggestions.
|
||||||
|
4. **Multi-error reporting.** Today and after this ADR, the
|
||||||
|
parser reports the first error and stops. Recovery-based
|
||||||
|
multi-error parsing is out of scope and re-opens with
|
||||||
|
I3's ADR (ADR-0020 §11).
|
||||||
|
5. **Persisting the verbosity setting** (which doesn't
|
||||||
|
affect parse errors anyway, per §8). ADR-0019 deferred it
|
||||||
|
to a future settings ADR.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- **Per-command usage at point of failure.** A learner who
|
||||||
|
types `create` sees the full `create table` grammar
|
||||||
|
instead of "expected `table`". The user-reported gap
|
||||||
|
closes.
|
||||||
|
- **Aggregated `available commands` for cold starts.**
|
||||||
|
`frobulate Customers` now lists the ten command-starting
|
||||||
|
keywords under a sensible framing.
|
||||||
|
- **Vocabulary lives in the catalog, not in code.** Renaming
|
||||||
|
a keyword's user-facing wording is one YAML edit. Adding
|
||||||
|
a new keyword adds two lines (registry + token-name key);
|
||||||
|
the validator catches both if forgotten.
|
||||||
|
- **The render path simplifies.** `humanise()` shrinks to a
|
||||||
|
small composer over catalog lookups — no per-character
|
||||||
|
description, no `RichPattern` walking, no
|
||||||
|
prefer-custom-over-structural switching (the latter
|
||||||
|
becomes "render the structural error and append the usage
|
||||||
|
template").
|
||||||
|
- **Composes with ADR-0019's `FriendlyError`.** Engine
|
||||||
|
errors and parse errors are rendered through different
|
||||||
|
paths but both go through the catalog, so vocabulary
|
||||||
|
drift between them is impossible.
|
||||||
|
|
||||||
|
### Costs
|
||||||
|
|
||||||
|
- **A second registry to keep in sync** with the parser. The
|
||||||
|
validator (§7) catches missing usage entries / missing
|
||||||
|
token keys at test time, but adding a new command means
|
||||||
|
three steps (parser combinator, usage-registry entry,
|
||||||
|
catalog YAML edit). Mitigation: a unit test asserts every
|
||||||
|
command in the parser has a registry entry (catches
|
||||||
|
forgotten entries; matches the friendly-module pattern).
|
||||||
|
- **Catalog grows by ~30-40 entries** (one usage template
|
||||||
|
per command, one keyword name per `Keyword` variant, a
|
||||||
|
handful of token-class names, a handful of error names).
|
||||||
|
Each entry is one line of YAML; total catalog grows from
|
||||||
|
~170 entries to ~210. Within budget.
|
||||||
|
- **Wording iteration** on the usage templates will probably
|
||||||
|
happen post-merge. This is normal for pedagogical text
|
||||||
|
and the catalog makes it cheap.
|
||||||
|
|
||||||
|
### Neutral
|
||||||
|
|
||||||
|
- **Public parser API is unchanged.** `parse_command(&str)`
|
||||||
|
signature stable. The new `lex` and `parse_tokens`
|
||||||
|
functions exposed by ADR-0020 are the I3/I4 hook;
|
||||||
|
ADR-0021 doesn't add to that surface.
|
||||||
|
- **`AppEvent` shape unchanged.** Parse errors continue to
|
||||||
|
flow through `dispatch_dsl`'s existing path (push echo,
|
||||||
|
push caret, push error). This ADR's render changes are
|
||||||
|
internal to that function plus the `t!()` calls inside it.
|
||||||
|
|
||||||
|
## Implementation notes
|
||||||
|
|
||||||
|
### Order of operations (within the joint ADR-0020 + ADR-0021 implementation session)
|
||||||
|
|
||||||
|
1. Land ADR-0020 (lexer + parser refactor + minimal
|
||||||
|
humaniser).
|
||||||
|
2. Add `src/dsl/usage.rs` with the registry struct, the
|
||||||
|
static table, and `matched_entry()`.
|
||||||
|
3. Populate `parse.usage.*` and `parse.token.*` catalog
|
||||||
|
sections.
|
||||||
|
4. Extend `friendly::keys::KEYS_AND_PLACEHOLDERS` with the
|
||||||
|
new keys.
|
||||||
|
5. Rewrite `dispatch_dsl`'s error-render arm in `app.rs` to
|
||||||
|
compose the three blocks per §2 (or §5 fallback).
|
||||||
|
6. Add tests:
|
||||||
|
- Unit: every registered usage entry resolves through the
|
||||||
|
catalog. Every `Keyword` variant has a `parse.token.keyword.*`
|
||||||
|
entry.
|
||||||
|
- Integration (`tests/parse_error_pedagogy.rs`, new):
|
||||||
|
`create`, `add`, `update Customers`, `frobulate
|
||||||
|
Customers`, `create table` (no PK clause), `insert into
|
||||||
|
T` (no values), each producing the expected
|
||||||
|
three-block output.
|
||||||
|
7. Update or port the two existing structural-error tests
|
||||||
|
in `parser.rs::tests` to the new render shape.
|
||||||
|
|
||||||
|
### Things that interact subtly
|
||||||
|
|
||||||
|
- **The "deepest consumed keyword" mechanism** (§3) walks
|
||||||
|
the prefix once per parse failure. Cheap; no perf concern.
|
||||||
|
But it must not pick up keywords from inside content that
|
||||||
|
is itself part of a partial AST (e.g. an identifier the
|
||||||
|
user is typing that happens to be the first letters of a
|
||||||
|
keyword); since the lexer commits to identifier-vs-keyword
|
||||||
|
classification before the parser sees tokens, this isn't a
|
||||||
|
real risk. Documented inline.
|
||||||
|
- **Multiple usage entries per `add` / `drop`** are rendered
|
||||||
|
under one `usage:` prefix per §2. This is one of the
|
||||||
|
pedagogically-best parts of the change: the user gets the
|
||||||
|
full family rather than guessing which sibling they
|
||||||
|
wanted.
|
||||||
|
- **`replay`'s special-case parsing** (ADR-0020 §6) is
|
||||||
|
invisible to the usage layer. The user typing `replay`
|
||||||
|
with no path gets the `parse.usage.replay` template.
|
||||||
|
- **`messages` is an app-level command, not a DSL command**,
|
||||||
|
so it is not in the parser registry and doesn't appear in
|
||||||
|
`available commands:`. Same posture as `mode`, `help`,
|
||||||
|
`quit`. Documented in the registry's prelude.
|
||||||
@@ -25,3 +25,5 @@ This directory contains the project's ADRs, recorded per
|
|||||||
- [ADR-0017 — Column type-change compatibility](0017-column-type-change-compatibility.md)
|
- [ADR-0017 — Column type-change compatibility](0017-column-type-change-compatibility.md)
|
||||||
- [ADR-0018 — Auto-fill contracts for `serial` and `shortid` columns](0018-auto-fill-contracts-for-serial-and-shortid.md)
|
- [ADR-0018 — Auto-fill contracts for `serial` and `shortid` columns](0018-auto-fill-contracts-for-serial-and-shortid.md)
|
||||||
- [ADR-0019 — Friendly error layer (H1) and i18n message catalog](0019-friendly-error-layer-and-i18n.md)
|
- [ADR-0019 — Friendly error layer (H1) and i18n message catalog](0019-friendly-error-layer-and-i18n.md)
|
||||||
|
- [ADR-0020 — Tokenization layer for the DSL parser](0020-tokenization-layer-for-the-dsl-parser.md)
|
||||||
|
- [ADR-0021 — Parser-as-source-of-truth for H1a (per-command usage in parse errors)](0021-parser-as-source-of-truth-for-h1a.md)
|
||||||
|
|||||||
Reference in New Issue
Block a user