ADR-0020 + ADR-0021: tokenization layer and parse-error pedagogy (H1a)

ADR-0020 amends ADR-0001 with a two-phase parse: a lexer
producing a span-tagged token stream, then chumsky over
&[Token]. Single source of truth for keywords and punct via
a define_keywords!/define_punct! macro pattern. Parser
contract committed for I3 (queryable expected-token-set)
and I4 (lexer always succeeds, Error tokens for invalid
input). Includes an honest history note: the no-lexer shape
in dsl/parser.rs arose incrementally without ADR-level
deliberation against the known H1a/I3/I4 requirements; this
ADR corrects that.

ADR-0021 builds on ADR-0020 to close the H1a gap: a
per-command UsageEntry registry keyed off entry-keyword,
with parse errors rendered as caret + structural error +
matching usage template(s). Multi-entry families (add,
drop, show) render together. New catalog sections under
parse.usage.* (per-command grammar) and parse.token.*
(single-token vocabulary). Zero-prefix case ("frobulate
Customers") falls back to an "available commands:" framing.
Anchor-phrase compliance preserved.

@@ -0,0 +1,452 @@
# ADR-0021: Parser-as-source-of-truth for H1a (per-command usage in parse errors)

## Status

Accepted.

Builds on ADR-0020 (tokenization layer). Addresses H1a from
`requirements.md`: the parse-error pedagogy gap that
ADR-0019's friendly-error layer left untouched.

Cross-references ADR-0019 (i18n catalog conventions; H1a's
output goes through the same catalog) and ADR-0009 (DSL
syntax conventions; usage templates render in the project's
documented surface form).

## Context

ADR-0019 dramatically improved engine-error wording.
Parse-error wording is now the visibly weakest user surface.
The user-reported gap was concrete: typing `create` produces

```
parse error: after `create`, expected `table`
```

The error is *structurally* correct (chumsky has consumed
`create` and is now looking for the next required token) but
*pedagogically* silent. A learner who got this far typed
`create` because they'd been told that's how new tables are
made; what they need next is the shape of the command, not
a single missing-token pointer.

Comparable observations apply across the whole DSL surface:

- `add` → expected `column` or `1` (uninformative; user
  needs the shape of `add column …` AND
  `add 1:n relationship …`).
- `update Customers` → expected `set` (true; but `update`'s
  full grammar with `set …`, `where …`, `--all-rows` is what
  the user wants illustrated).
- `frobulate Customers` → expected one of `create`, `drop`,
  `add`, `rename`, `change`, `show`, `insert`, `update`,
  `delete`, `replay` (true after ADR-0020; the available-commands
  list is now informative, but the no-prefix case wants its
  own framing: "available commands" rather than "expected").

H1a's job is to close that gap by surfacing the **grammar**
of the command at the point of error, not just the next
token.

### What ADR-0020 supplies

ADR-0020 lands the lexer + parser-over-tokens architecture.
What that buys H1a:

- **Aggregated `expected` sets at the failure point** (top-level
  `choice` failures now list every command-starting keyword,
  not just one). The user-visible "available commands" list
  becomes correct without any work in this ADR.
- **Token-kind error patterns** (`RichPattern<Token>` instead
  of `RichPattern<char>`). Each pattern renders via a stable
  catalog key, with no per-character humanising.
- **A canonical entry-token for each command** (the first
  `Keyword(_)` consumed). H1a keys per-command usage
  templates off this token.

### What this ADR adds on top

- A registry of per-command **usage templates** (one
  declaration per command).
- A renderer that composes the parse error with: caret +
  structural error wording + matching usage template(s).
- New catalog keys under `parse.usage.*` (templates) and
  `parse.token.*` (single-token rendering for expected-set
  joins).
- A "no commands consumed" fallback that renders an
  available-commands list under a different prefix
  ("available commands:" rather than "expected:") for the
  zero-prefix case.

## Decision

### 1. Per-command usage template registry

Each command parser is paired with a `UsageEntry`:

```rust
pub struct UsageEntry {
    /// First keyword that distinguishes this command. Used
    /// as the registry key.
    pub entry: Keyword,
    /// Catalog key for the grammar template body (under
    /// `parse.usage.*`). One key per command.
    pub catalog_key: &'static str,
}
```

The registry is a `&'static [UsageEntry]` declared in one
place (`src/dsl/usage.rs`). Lookup: given a consumed entry
keyword, return all entries whose `entry == keyword`. For
`Keyword::Add` the registry returns the `add column` and
`add 1:n relationship` entries; for `Keyword::Drop` it
returns `drop table`, `drop column`, `drop relationship`;
for unique-entry keywords (e.g. `Keyword::Create` today) it
returns one.

The catalog key is what gets translated. Template bodies
live in `src/friendly/strings/en-US.yaml` under
`parse.usage.*`:

```yaml
parse:
  usage:
    create_table: "create table <Name> with pk [<col>:<type>[, ...]]"
    drop_table: "drop table <Name>"
    add_column: "add column [to] [table] <Table>: <Name> (<Type>)"
    add_relationship: |
      add 1:n relationship [as <Name>]
      from <Parent>.<col> to <Child>.<col>
      [on delete <action>] [on update <action>]
      [--create-fk]
    rename_column: "rename column [in] [table] <Table>: <Old> to <New>"
    change_column: |
      change column [in] [table] <Table>: <Name> (<Type>)
      [--force-conversion | --dont-convert]
    show_data: "show data <Table>"
    show_table: "show table <Table>"
    insert: "insert into <Table> [(<col>[, ...])] [values] (<value>[, ...])"
    update: "update <Table> set <col>=<value>[, ...] (where <col>=<value> | --all-rows)"
    delete: "delete from <Table> (where <col>=<value> | --all-rows)"
    drop_column: "drop column [from] [table] <Table>: <Name>"
    drop_relationship: |
      drop relationship <Name>
      drop relationship from <Parent>.<col> to <Child>.<col>
    replay: "replay <path> | replay '<path with spaces>'"
```

(Wording is illustrative; exact phrasing will be settled at
implementation time. The bracket convention `[...]` for
optional parts and angle-bracket `<...>` for placeholders
matches ADR-0009's documentation surface.)

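The registry and its lookup can be sketched as below. This is a minimal illustration, not the project's code: the three-variant `Keyword` enum stands in for ADR-0020's real enum, and the names `USAGE_REGISTRY` and `entries_for` are hypothetical.

```rust
/// Hypothetical stand-in for ADR-0020's `Keyword` enum (only three variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    Add,
    Create,
    Drop,
}

pub struct UsageEntry {
    /// First keyword that distinguishes this command (the registry key).
    pub entry: Keyword,
    /// Catalog key for the grammar template body.
    pub catalog_key: &'static str,
}

/// Assumed name for the single static table; several entries may share a keyword.
pub static USAGE_REGISTRY: &[UsageEntry] = &[
    UsageEntry { entry: Keyword::Create, catalog_key: "parse.usage.create_table" },
    UsageEntry { entry: Keyword::Add, catalog_key: "parse.usage.add_column" },
    UsageEntry { entry: Keyword::Add, catalog_key: "parse.usage.add_relationship" },
    UsageEntry { entry: Keyword::Drop, catalog_key: "parse.usage.drop_table" },
    UsageEntry { entry: Keyword::Drop, catalog_key: "parse.usage.drop_column" },
    UsageEntry { entry: Keyword::Drop, catalog_key: "parse.usage.drop_relationship" },
];

/// Returns every entry whose `entry` keyword matches, in declaration order.
pub fn entries_for(keyword: Keyword) -> Vec<&'static UsageEntry> {
    USAGE_REGISTRY.iter().filter(|e| e.entry == keyword).collect()
}
```

A linear scan is likely sufficient here because the table holds only a handful of entries and is consulted once per parse failure.
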
### 2. The renderer composes three blocks

A parse error renders as:

```
running: <user input>
         ^                 ← caret (existing, unchanged)
parse error: <structural-or-content message>
usage: <template1>
       <template2>         ← when multiple entries share the entry keyword
```

Block 1 (the echo + caret) is unchanged from today.

Block 2 is the structural or content error. ADR-0020
guarantees the structural error is now properly aggregated
("expected `data` or `table`" not "expected `table`"). The
content errors (unknown type, mutually-exclusive flags) are
unchanged in voice.

Block 3 (`usage:`) is new. It is rendered if and only if **at
least one keyword token was consumed** before the parser
failed AND that keyword is a registered entry. If no keyword
was consumed (e.g., `frobulate Customers`, where `frobulate`
is an `Identifier`, not a `Keyword`), Block 3 is replaced
with the no-prefix fallback (§5).

If multiple entries match (e.g., the `add` family), all are
listed under a single `usage:` prefix, one per line.

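The three-block composition can be sketched as a plain string builder. The function name `render_parse_error`, its parameters, and the assumption that the caret is offset by the width of the `running: ` echo prefix are all hypothetical; in the real renderer the strings would come from the catalog rather than from literals, and multi-line templates would need extra indentation handling that this sketch omits.

```rust
/// Hypothetical composer for the three blocks described above.
/// `caret_col` is the 0-based column of the failure within `input`.
/// Assumes single-line usage templates.
pub fn render_parse_error(
    input: &str,
    caret_col: usize,
    message: &str,
    usages: &[&str],
) -> String {
    const ECHO_PREFIX: &str = "running: ";
    const USAGE_PREFIX: &str = "usage: ";
    let mut out = String::new();

    // Block 1: echo of the input plus a caret under the failure column.
    out.push_str(ECHO_PREFIX);
    out.push_str(input);
    out.push('\n');
    out.push_str(&" ".repeat(ECHO_PREFIX.len() + caret_col));
    out.push_str("^\n");

    // Block 2: the structural or content error.
    out.push_str("parse error: ");
    out.push_str(message);
    out.push('\n');

    // Block 3: one usage template per line under a single `usage:` prefix.
    for (i, usage) in usages.iter().enumerate() {
        if i == 0 {
            out.push_str(USAGE_PREFIX);
        } else {
            out.push_str(&" ".repeat(USAGE_PREFIX.len()));
        }
        out.push_str(usage);
        out.push('\n');
    }
    out
}
```

Passing an empty `usages` slice simply omits Block 3, which is where the caller would substitute the no-prefix fallback instead.
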
### 3. Identifying the consumed entry keyword

The parser surfaces, alongside the `ParseError`, the
**consumed entry keyword**: the first keyword token it
successfully consumed before failing. Mechanism:

- `parse_tokens` returns
  `(Result<Command, ParseError>, ParseDiagnostics)`, where
  `ParseDiagnostics` carries the furthest position chumsky
  reached AND a snapshot of the consumed prefix.
- The renderer walks the consumed prefix from its start to
  find the first `Keyword(_)` token. (Almost always the first
  token, but a future grammar where a command starts with a
  literal, of which there are none today, would still resolve
  correctly.)

This logic lives in `src/dsl/usage.rs::matched_entry()` so
the registry and the lookup sit together.

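A minimal sketch of the prefix walk follows, under the reading that the entry keyword is the earliest `Keyword(_)` in the consumed prefix (the token the registry is keyed on in §1). The `Token` and `Keyword` enums are reduced stand-ins for ADR-0020's types, and `entry_keyword` is a hypothetical helper name, not the signature of `matched_entry()`.

```rust
/// Hypothetical stand-in for ADR-0020's `Keyword` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    Create,
    Table,
    Update,
}

/// Hypothetical stand-in for ADR-0020's token type (spans omitted).
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Keyword(Keyword),
    Identifier(String),
    Number(i64),
}

/// Returns the first keyword in the consumed prefix, if any.
/// `None` means no keyword was consumed (the zero-prefix case).
pub fn entry_keyword(consumed: &[Token]) -> Option<Keyword> {
    consumed.iter().find_map(|token| match token {
        Token::Keyword(k) => Some(*k),
        _ => None,
    })
}
```

For `create table` with no PK clause, the prefix holds two keywords and the helper still resolves to `Create`, which is the key the usage registry expects.
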
### 4. `parse.token.*`: single-token catalog vocabulary

Chumsky's expected-set rendering needs a name for each token
kind. Today `humanise()` hand-codes these
(`describe_pattern` returns "`create`", "identifier", etc.).
ADR-0021 moves the vocabulary into the catalog:

```yaml
parse:
  token:
    # Keywords: one entry per Keyword enum variant.
    keyword.create: "`create`"
    keyword.table: "`table`"
    keyword.with: "`with`"
    # ... one per Keyword variant ...

    # Punctuation.
    punct.colon: "`:`"
    punct.open_paren: "`(`"
    punct.close_paren: "`)`"
    punct.comma: "`,`"
    punct.equals: "`=`"
    punct.dot: "`.`"

    # Token-class labels.
    identifier: "identifier"
    number: "number"
    string_literal: "string literal"
    flag: "flag (--name)"
    end_of_input: "end of input"

    # Lexer-error tokens.
    error.unterminated_string: "unterminated string starting at column {column}"
    error.unknown_char: "unrecognised character {found}"
```

Joining ("`a`, `b`, or `c`") stays in code (`oxford_or` from
the current humanise machinery, lifted intact). Wording of
each token is in the catalog.

`parse.error` (existing wrapper key) stays. Its `{detail}`
placeholder is filled by:

```
{consumed_prefix} expected {oxford_or(expected)}, found {found_token}
```

Each piece is sourced from the catalog and joined in code.

`parse.caret` (existing) and `parse.empty` (existing) are
unchanged.

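The join and the `{detail}` composition can be sketched as follows. This `oxford_or` is an illustrative reimplementation written for this example, not the existing function being lifted; `parse_detail` is a hypothetical name, and the sketch assumes the consumed-prefix phrase arrives already punctuated (for example "after `show`,").

```rust
/// Joins already-rendered token names: "a", "a or b", "a, b, or c".
/// Illustrative reimplementation; the project's own `oxford_or` may differ.
pub fn oxford_or(items: &[&str]) -> String {
    match items {
        [] => String::new(),
        [only] => (*only).to_string(),
        [first, second] => format!("{first} or {second}"),
        [init @ .., last] => format!("{}, or {last}", init.join(", ")),
    }
}

/// Hypothetical composer for the `{detail}` placeholder of `parse.error`.
/// Each argument is assumed to be catalog-sourced text.
pub fn parse_detail(consumed_prefix: &str, expected: &[&str], found: &str) -> String {
    format!(
        "{consumed_prefix} expected {}, found {found}",
        oxford_or(expected)
    )
}
```

Keeping the join in code while the token names stay in the catalog means a translator changes wording without touching list punctuation. A real implementation would also need to decide how to render an empty consumed prefix, which this sketch does not handle.
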
### 5. No-prefix fallback: "available commands"

When the parser fails with **no keyword consumed**, the
"expected" set lists every top-level command-starting
keyword. That's correct but the framing should be
"available commands" rather than "expected".

The renderer detects this case (consumed-keyword count == 0)
and replaces Block 3 with:

```
available commands: add, change, create, delete, drop,
                    insert, rename, replay, show, update
```

via a new catalog key:

```yaml
parse:
  available_commands: "available commands: {commands}"
```

The list is the alphabetised set of `entry` keywords from
the usage registry, each rendered via its `parse.token.keyword.*`
catalog entry (so the strings are catalog-sourced, not
hard-coded).

This case only fires when the user typed something the
parser couldn't classify as any known command keyword: the
"frobulate Customers" case. It's both rarer and more useful
than the with-prefix case: a user this lost benefits more
from the full menu than from a missing-token pointer.

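Building the fallback line amounts to deduplicating and sorting the registry's entry keywords, then filling the `{commands}` placeholder. The sketch below assumes the caller has already rendered each entry keyword to its catalog label; `available_commands` is a hypothetical name, and the template is inlined here only to keep the example self-contained.

```rust
use std::collections::BTreeSet;

/// Hypothetical builder for the no-prefix fallback line.
/// `entry_labels` holds one rendered label per registry entry, so families
/// such as `add` or `drop` appear more than once and must be deduplicated.
pub fn available_commands(entry_labels: &[&str]) -> String {
    // A BTreeSet both removes duplicates and yields labels in sorted order.
    let unique: BTreeSet<&str> = entry_labels.iter().copied().collect();
    let commands = unique.into_iter().collect::<Vec<_>>().join(", ");
    // Stand-in for the catalog lookup of `parse.available_commands`.
    "available commands: {commands}".replace("{commands}", &commands)
}
```

Sorting by byte order is enough for the current ASCII keywords; a localised catalog might call for locale-aware collation instead.
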
### 6. Anchor-phrase compliance (ADR-0019 §10)

ADR-0019's anchor-phrase list contains nine substrings the
catalog commits to keeping stable. None are parse-error-specific,
so this ADR doesn't add to the list. The existing parser
test that asserts on "unknown type" and "expected one of"
substrings stays; those come from `Type::from_str`'s custom
error message, which ADR-0020 §4 keeps unchanged.

The current structural-error tests assert on substrings like
"after `show data`", "expected identifier", "found end of
input", "after `change column Rich`", "expected `:`". The
new render shape preserves all of these: the rendering
template is `{prefix} expected {set}, found {token}` and
the prefix / set / token come from the catalog with the same
wording. Tests should port unchanged or with at most minor
adjustments.

### 7. Catalog validator covers the new keys

ADR-0019 §8.6's `KEYS_AND_PLACEHOLDERS` validator extends
to cover:

- Every `parse.usage.<command>` key referenced from the
  registry exists.
- Every `parse.token.keyword.<variant>` key for every
  `Keyword` enum variant exists.
- Every `parse.token.punct.<variant>` key for every `Punct`
  variant exists.
- The `parse.token.{identifier, number, string_literal,
  flag, end_of_input}` keys exist.
- The `parse.token.error.*` keys exist for every
  `LexErrorKind` variant.
- The `parse.available_commands` key exists.
- No format specifiers (already enforced).
- No engine vocabulary (already enforced).

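The key-existence checks above reduce to a set difference between the keys the code requires and the keys the catalog defines. The sketch below shows that idea only; it does not reproduce the shape of `KEYS_AND_PLACEHOLDERS`, and both helper names are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical helper: derives the required `parse.token.keyword.*` keys
/// from the lowercase names of the `Keyword` variants.
pub fn keyword_token_keys(variant_names: &[&str]) -> Vec<String> {
    variant_names
        .iter()
        .map(|name| format!("parse.token.keyword.{name}"))
        .collect()
}

/// Returns the required keys that are absent from the catalog, in the order
/// they were required. An empty result means the catalog is complete.
pub fn missing_keys(required: &[String], catalog: &HashSet<String>) -> Vec<String> {
    required
        .iter()
        .filter(|key| !catalog.contains(*key))
        .cloned()
        .collect()
}
```

A test built on this would assert that `missing_keys` returns an empty vector, so that a forgotten catalog entry fails with the list of missing keys in the message.
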
### 8. Does the `usage:` block respect the verbosity setting?

No. The `messages (short|verbose)` setting (ADR-0019)
governs *engine-error* verbosity (whether to render the
hint block of a `FriendlyError`). Parse errors don't go
through `FriendlyError`; they have their own render path,
and the usage block is always shown. Rationale: a learner
toggling to `messages short` is signalling they recognise
the engine-error patterns and want less explanation around
those; they're not signalling that they want less
parse-help. Parse errors mean the user couldn't even
formulate a runnable command; that's exactly the moment to
maximise pedagogical surface, regardless of the
engine-error verbosity preference.

If experience shows this is wrong, a future amendment can
gate the usage block on a separate setting. That doesn't
need to be designed now.

## Out of scope

1. **Tab completion (I3) and syntax highlighting (I4)**
   themselves. ADR-0020 §9-10 commits to the parser
   contract; ADR-0021 doesn't extend it.
2. **Schema-aware suggestions** ("did you mean `Customers`?"
   when the user typed `Customrs`). Useful but a separate
   feature; would land in I3 territory (completion + spell
   check share a candidate list).
3. **Suggested fixes** ("change `crete` to `create`"). Same
   bucket as schema-aware suggestions.
4. **Multi-error reporting.** Today and after this ADR, the
   parser reports the first error and stops. Recovery-based
   multi-error parsing is out of scope and re-opens with
   I3's ADR (ADR-0020 §11).
5. **Persisting the verbosity setting** (which doesn't
   affect parse errors anyway, per §8). ADR-0019 deferred it
   to a future settings ADR.

## Consequences

### Positive

- **Per-command usage at point of failure.** A learner who
  types `create` sees the full `create table` grammar
  instead of "expected `table`". The user-reported gap
  closes.
- **Aggregated `available commands` for cold starts.**
  `frobulate Customers` now lists the ten command-starting
  keywords under a sensible framing.
- **Vocabulary lives in the catalog, not in code.** Renaming
  a keyword's user-facing wording is one YAML edit. Adding
  a new keyword adds two lines (registry + token-name key);
  the validator catches both if forgotten.
- **The render path simplifies.** `humanise()` shrinks to a
  small composer over catalog lookups: no per-character
  description, no `RichPattern` walking, no
  prefer-custom-over-structural switching (the latter
  becomes "render the structural error and append the usage
  template").
- **Composes with ADR-0019's `FriendlyError`.** Engine
  errors and parse errors are rendered through different
  paths but both go through the catalog, so vocabulary
  drift between them is impossible.

### Costs

- **A second registry to keep in sync** with the parser. The
  validator (§7) catches missing usage entries / missing
  token keys at test time, but adding a new command means
  three steps (parser combinator, usage-registry entry,
  catalog YAML edit). Mitigation: a unit test asserts every
  command in the parser has a registry entry (catches
  forgotten entries; matches the friendly-module pattern).
- **Catalog grows by ~30-40 entries** (one usage template
  per command, one keyword name per `Keyword` variant, a
  handful of token-class names, a handful of error names).
  Each entry is one line of YAML; total catalog grows from
  ~170 entries to ~210. Within budget.
- **Wording iteration** on the usage templates will probably
  happen post-merge. This is normal for pedagogical text
  and the catalog makes it cheap.

### Neutral

- **Public parser API is unchanged.** The `parse_command(&str)`
  signature is stable. The new `lex` and `parse_tokens`
  functions exposed by ADR-0020 are the I3/I4 hook;
  ADR-0021 doesn't add to that surface.
- **`AppEvent` shape unchanged.** Parse errors continue to
  flow through `dispatch_dsl`'s existing path (push echo,
  push caret, push error). This ADR's render changes are
  internal to that function plus the `t!()` calls inside it.

## Implementation notes

### Order of operations (within the joint ADR-0020 + ADR-0021 implementation session)

1. Land ADR-0020 (lexer + parser refactor + minimal
   humaniser).
2. Add `src/dsl/usage.rs` with the registry struct, the
   static table, and `matched_entry()`.
3. Populate `parse.usage.*` and `parse.token.*` catalog
   sections.
4. Extend `friendly::keys::KEYS_AND_PLACEHOLDERS` with the
   new keys.
5. Rewrite `dispatch_dsl`'s error-render arm in `app.rs` to
   compose the three blocks per §2 (or §5 fallback).
6. Add tests:
   - Unit: every registered usage entry resolves through the
     catalog. Every `Keyword` variant has a `parse.token.keyword.*`
     entry.
   - Integration (`tests/parse_error_pedagogy.rs`, new):
     `create`, `add`, `update Customers`,
     `frobulate Customers`, `create table` (no PK clause),
     `insert into T` (no values), each producing the expected
     three-block output.
7. Update or port the two existing structural-error tests
   in `parser.rs::tests` to the new render shape.

### Things that interact subtly

- **The consumed-entry-keyword mechanism** (§3) walks
  the prefix once per parse failure. Cheap; no perf concern.
  But it must not pick up keywords from inside content that
  is itself part of a partial AST (e.g. an identifier the
  user is typing that happens to be the first letters of a
  keyword); since the lexer commits to identifier-vs-keyword
  classification before the parser sees tokens, this isn't a
  real risk. Documented inline.
- **Multiple usage entries per `add` / `drop`** are rendered
  under one `usage:` prefix per §2. This is one of the
  pedagogically best parts of the change: the user gets the
  full family rather than guessing which sibling they
  wanted.
- **`replay`'s special-case parsing** (ADR-0020 §6) is
  invisible to the usage layer. The user typing `replay`
  with no path gets the `parse.usage.replay` template.
- **`messages` is an app-level command, not a DSL command**,
  so it is not in the parser registry and doesn't appear in
  `available commands:`. Same posture as `mode`, `help`,
  `quit`. Documented in the registry's prelude.