ADR-0020 + ADR-0021: tokenization layer and parse-error pedagogy (H1a)

ADR-0020 amends ADR-0001 with a two-phase parse: a lexer producing a span-tagged token stream, then chumsky over &[Token]. Single source of truth for keywords and punct via a define_keywords!/define_punct! macro pattern. Parser contract committed for I3 (queryable expected-token-set) and I4 (lexer always succeeds, Error tokens for invalid input). Includes an honest history note: the no-lexer shape in dsl/parser.rs arose incrementally without ADR-level deliberation against the known H1a/I3/I4 requirements; this ADR corrects that. ADR-0021 builds on ADR-0020 to close the H1a gap: a per-command UsageEntry registry keyed off entry-keyword, with parse errors rendered as caret + structural error + matching usage template(s). Multi-entry families (add, drop, show) render together. New catalog sections under parse.usage.* (per-command grammar) and parse.token.* (single-token vocabulary). Zero-prefix case ("frobulate Customers") falls back to an "available commands:" framing. Anchor-phrase compliance preserved.
2026-05-10 08:43:20 +00:00
parent 47601f7c85
commit 857ee753f2
3 changed files with 1092 additions and 0 deletions
@@ -0,0 +1,452 @@
+# ADR-0021: Parser-as-source-of-truth for H1a (per-command usage in parse errors)
+
+## Status
+
+Accepted.
+
+Builds on ADR-0020 (tokenization layer). Addresses H1a from
+`requirements.md` — the parse-error pedagogy gap that
+ADR-0019's friendly-error layer left untouched.
+
+Cross-references ADR-0019 (i18n catalog conventions; H1a's
+output goes through the same catalog) and ADR-0009 (DSL
+syntax conventions; usage templates render in the project's
+documented surface form).
+
+## Context
+
+ADR-0019 dramatically improved engine-error wording, which
+leaves parse-error wording as the visibly weakest user
+surface. The user-reported gap was concrete: typing `create`
+produces
+
+```
+parse error: after `create`, expected `table`
+```
+
+The error is *structurally* correct (chumsky has consumed
+`create` and is now looking for the next required token) but
+*pedagogically* silent. A learner who got this far typed
+`create` because they'd been told that's how new tables are
+made; what they need next is the shape of the command, not
+a single missing-token pointer.
+
+Comparable observations apply across the whole DSL surface:
+
+- `add`  → expected `column` or `1` (uninformative; user
+  needs the shape of `add column …` AND `add 1:n
+  relationship …`).
+- `update Customers`  → expected `set` (true; but `update`'s
+  full grammar with `set …`, `where …`, `--all-rows` is what
+  the user wants illustrated).
+- `frobulate Customers`  → expected one of `create`, `drop`,
+  `add`, `rename`, `change`, `show`, `insert`, `update`,
+  `delete`, `replay` (true after ADR-0020; the available-commands
+  list is now informative, but the no-prefix case wants its
+  own framing — "available commands" rather than "expected").
+
+H1a's job is to close that gap by surfacing the **grammar**
+of the command at the point of error, not just the next
+token.
+
+### What ADR-0020 supplies
+
+ADR-0020 lands the lexer + parser-over-tokens architecture.
+What that buys H1a:
+
+- **Aggregated `expected` sets at the failure point** (top-level
+  `choice` failures now list every command-starting keyword,
+  not just one). The user-visible "available commands" list
+  becomes correct without any work in this ADR.
+- **Token-kind error patterns** (`RichPattern<Token>` instead
+  of `RichPattern<char>`). Each pattern renders via a stable
+  catalog key — no per-character humanising.
+- **A canonical entry-token for each command** (the first
+  `Keyword(_)` consumed). H1a keys per-command usage
+  templates off this token.
+
+### What this ADR adds on top
+
+- A registry of per-command **usage templates** (one
+  declaration per command).
+- A renderer that composes the parse error with: caret +
+  structural error wording + matching usage template(s).
+- New catalog keys under `parse.usage.*` (templates) and
+  `parse.token.*` (single-token rendering for expected-set
+  joins).
+- A "no commands consumed" fallback that renders an
+  available-commands list under a different prefix
+  ("available commands:" rather than "expected:") for the
+  zero-prefix case.
+
+## Decision
+
+### 1. Per-command usage template registry
+
+Each command parser is paired with a `UsageEntry`:
+
+```rust
+pub struct UsageEntry {
+    /// First keyword that distinguishes this command. Used
+    /// as the registry key.
+    pub entry: Keyword,
+    /// Catalog key for the grammar template body (under
+    /// `parse.usage.*`). One key per command.
+    pub catalog_key: &'static str,
+}
+```
+
+The registry is a `&'static [UsageEntry]` declared in one
+place (`src/dsl/usage.rs`). Lookup: given a consumed entry
+keyword, return all entries whose `entry == keyword`. For
+`Keyword::Add` the registry returns the `add column` and
+`add 1:n relationship` entries; for `Keyword::Drop` it
+returns `drop table`, `drop column`, `drop relationship`;
+for unique-entry keywords (e.g. `Keyword::Create` today) it
+returns one.
+
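A minimal sketch of the lookup described above, assuming a trimmed-down `Keyword` enum with three variants and a hypothetical helper named `entries_for` (the ADR itself only names `matched_entry()`). The table contents are an illustrative excerpt, not the real registry:

```rust
/// Illustrative subset of the keyword enum (assumption: the real enum
/// from ADR-0020 has one variant per DSL keyword).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    Create,
    Add,
    Drop,
}

pub struct UsageEntry {
    /// First keyword that distinguishes this command.
    pub entry: Keyword,
    /// Catalog key for the grammar template body.
    pub catalog_key: &'static str,
}

/// Hypothetical excerpt of the static table in `src/dsl/usage.rs`.
pub static USAGE: &[UsageEntry] = &[
    UsageEntry { entry: Keyword::Create, catalog_key: "parse.usage.create_table" },
    UsageEntry { entry: Keyword::Add, catalog_key: "parse.usage.add_column" },
    UsageEntry { entry: Keyword::Add, catalog_key: "parse.usage.add_relationship" },
    UsageEntry { entry: Keyword::Drop, catalog_key: "parse.usage.drop_table" },
    UsageEntry { entry: Keyword::Drop, catalog_key: "parse.usage.drop_column" },
    UsageEntry { entry: Keyword::Drop, catalog_key: "parse.usage.drop_relationship" },
];

/// Returns every entry registered under `keyword`, in declaration order.
/// (`entries_for` is a hypothetical name used only in this sketch.)
pub fn entries_for(keyword: Keyword) -> impl Iterator<Item = &'static UsageEntry> {
    USAGE.iter().filter(move |e| e.entry == keyword)
}
```

Under this sketch, `entries_for(Keyword::Add)` yields the two `add` entries in the order they were declared, so declaration order in the table would decide the order of lines in the `usage:` block.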
+The catalog key is what gets translated. Template bodies
+live in `src/friendly/strings/en-US.yaml` under
+`parse.usage.*`:
+
+```yaml
+parse:
+  usage:
+    create_table: "create table <Name> with pk [<col>:<type>[, ...]]"
+    drop_table:   "drop table <Name>"
+    add_column:   "add column [to] [table] <Table>: <Name> (<Type>)"
+    add_relationship: |
+      add 1:n relationship [as <Name>]
+        from <Parent>.<col> to <Child>.<col>
+        [on delete <action>] [on update <action>]
+        [--create-fk]
+    rename_column: "rename column [in] [table] <Table>: <Old> to <New>"
+    change_column: |
+      change column [in] [table] <Table>: <Name> (<Type>)
+        [--force-conversion | --dont-convert]
+    show_data:    "show data <Table>"
+    show_table:   "show table <Table>"
+    insert:       "insert into <Table> [(<col>[, ...])] [values] (<value>[, ...])"
+    update:       "update <Table> set <col>=<value>[, ...] (where <col>=<value> | --all-rows)"
+    delete:       "delete from <Table> (where <col>=<value> | --all-rows)"
+    drop_column:  "drop column [from] [table] <Table>: <Name>"
+    drop_relationship: |
+      drop relationship <Name>
+      drop relationship from <Parent>.<col> to <Child>.<col>
+    replay:       "replay <path>  |  replay '<path with spaces>'"
+```
+
+(Wording is illustrative; exact phrasing settled at
+implementation time. The bracket convention `[...]` for
+optional parts and angle-bracket `<...>` for placeholders
+matches ADR-0009's documentation surface.)
+
+### 2. The renderer composes three blocks
+
+A parse error renders as:
+
+```
+running: <user input>
+        ^               ← caret (existing, unchanged)
+parse error: <structural-or-content message>
+usage: <template1>
+       <template2>      ← when multiple entries share the entry keyword
+```
+
+Block 1 (the echo + caret) is unchanged from today.
+
+Block 2 is the structural or content error. ADR-0020
+guarantees the structural error is now properly aggregated
+("expected `data` or `table`" not "expected `table`"). The
+content errors (unknown type, mutually-exclusive flags) are
+unchanged in voice.
+
+Block 3 (usage:) is new. It is rendered if and only if **at
+least one keyword token was consumed** before the parser
+failed AND that keyword is a registered entry. If no keyword
+was consumed (e.g., `frobulate Customers`, where `frobulate`
+is an `Identifier`, not a `Keyword`), Block 3 is replaced
+with the no-prefix fallback (§5).
+
+If multiple entries match (e.g., the `add` family), all are
+listed under a single `usage:` prefix, one per line.
+
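The three-block composition in this section could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name `render_parse_error`, the zero-based `caret_col` parameter, the caret alignment rule, and passing already-translated template strings rather than catalog keys. The real render path lives in `dispatch_dsl` and may differ:

```rust
/// Composes echo + caret, the error line, and the usage block into one
/// string. `caret_col` is assumed to be the zero-based column of the
/// failure within `input`; `usages` holds already-resolved template text.
pub fn render_parse_error(
    input: &str,
    caret_col: usize,
    message: &str,
    usages: &[&str],
) -> String {
    const ECHO: &str = "running: ";
    const USAGE: &str = "usage: ";
    let mut out = String::new();

    // Block 1: echo the input, then place the caret under the failure column.
    out.push_str(ECHO);
    out.push_str(input);
    out.push('\n');
    out.push_str(&" ".repeat(ECHO.len() + caret_col));
    out.push_str("^\n");

    // Block 2: structural or content error.
    out.push_str("parse error: ");
    out.push_str(message);
    out.push('\n');

    // Block 3: one `usage:` prefix; every later line is indented to match,
    // including continuation lines of multi-line templates.
    let indent = " ".repeat(USAGE.len());
    let mut first = true;
    for template in usages {
        for line in template.lines() {
            out.push_str(if first { USAGE } else { indent.as_str() });
            out.push_str(line);
            out.push('\n');
            first = false;
        }
    }
    out
}
```

Passing an empty `usages` slice leaves only Blocks 1 and 2. This sketch does not cover the no-prefix fallback, which would need its own branch.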
+### 3. Identifying the consumed entry keyword
+
+The parser surfaces, alongside the `ParseError`, the
+**deepest successfully-consumed keyword token**. Mechanism:
+
+- `parse_tokens` returns `(Result<Command, ParseError>,
+  ParseDiagnostics)` where `ParseDiagnostics` carries the
+  furthest position chumsky reached AND a snapshot of the
+  consumed prefix.
+- The renderer walks the consumed prefix backward to find the
+  first `Keyword(_)` token. (Almost always the first token,
+  but a future grammar where a command starts with a
+  literal — none today — would still resolve correctly.)
+
+This logic lives in `src/dsl/usage.rs::matched_entry()` so
+the registry and the lookup sit together.
+
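A minimal sketch of the entry-keyword lookup, under one loudly stated interpretation: it treats the entry keyword as the earliest `Keyword(_)` in the consumed prefix, which matches the "first `Keyword(_)` consumed" definition in the Context section. The `Token` and `Keyword` types below are simplified stand-ins for the ADR-0020 types, and `entry_keyword` is a hypothetical name:

```rust
/// Simplified stand-in for the ADR-0020 keyword enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    Update,
    Set,
}

/// Simplified stand-in for the ADR-0020 token type (no spans, two kinds).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Token {
    Keyword(Keyword),
    Identifier(String),
}

/// Returns the earliest keyword in the consumed prefix, if any.
/// Assumption: "entry keyword" means the first keyword the parser consumed.
pub fn entry_keyword(consumed: &[Token]) -> Option<Keyword> {
    consumed.iter().find_map(|t| match t {
        Token::Keyword(k) => Some(*k),
        _ => None,
    })
}
```

With the prefix `update Customers set`, this returns `Keyword::Update` rather than `Keyword::Set`, so the registry lookup would hit the `update` template. A prefix containing only identifiers returns `None`, which corresponds to the no-prefix case.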
+### 4. `parse.token.*` — single-token catalog vocabulary
+
+Chumsky's expected-set rendering needs a name for each token
+kind. Today `humanise()` hand-codes these
+(`describe_pattern` returns "`create`", "identifier", etc.).
+ADR-0021 moves the vocabulary into the catalog:
+
+```yaml
+parse:
+  token:
+    # Keywords — one entry per Keyword enum variant.
+    keyword.create: "`create`"
+    keyword.table:  "`table`"
+    keyword.with:   "`with`"
+    # ... one per Keyword variant ...
+
+    # Punctuation.
+    punct.colon:        "`:`"
+    punct.open_paren:   "`(`"
+    punct.close_paren:  "`)`"
+    punct.comma:        "`,`"
+    punct.equals:       "`=`"
+    punct.dot:          "`.`"
+
+    # Token-class labels.
+    identifier:     "identifier"
+    number:         "number"
+    string_literal: "string literal"
+    flag:           "flag (--name)"
+    end_of_input:   "end of input"
+
+    # Lexer-error tokens.
+    error.unterminated_string: "unterminated string starting at column {column}"
+    error.unknown_char:        "unrecognised character {found}"
+```
+
+Joining ("`a`, `b`, or `c`") stays in code (`oxford_or` from
+the current humanise machinery, lifted intact). Wording of
+each token is in the catalog.
+
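For illustration, a joiner with the behaviour quoted above might look like this. The name `oxford_or` comes from the ADR, but the body is a guess at its behaviour rather than the code being lifted, and the empty-input result is an assumption:

```rust
/// Joins already-rendered token labels as "a", "a or b", or "a, b, or c".
/// Assumption: an empty slice yields an empty string.
pub fn oxford_or(items: &[&str]) -> String {
    match items {
        [] => String::new(),
        [only] => (*only).to_string(),
        [a, b] => format!("{a} or {b}"),
        // Three or more: comma-separate all but the last, then ", or".
        [init @ .., last] => format!("{}, or {last}", init.join(", ")),
    }
}
```

The labels passed in would already carry their catalog wording (backticks included), so the joiner only supplies separators.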
+`parse.error` (existing wrapper key) stays. Its `{detail}`
+placeholder is filled by:
+
+```
+{consumed_prefix} expected {oxford_or(expected)}, found {found_token}
+```
+
+— each piece sourced from the catalog, joined in code.
+
+`parse.caret` (existing) and `parse.empty` (existing)
+unchanged.
+
+### 5. No-prefix fallback: "available commands"
+
+When the parser fails with **no keyword consumed**, the
+"expected" set lists every top-level command-starting
+keyword. That's correct but the framing should be
+"available commands" rather than "expected".
+
+The renderer detects this case (consumed-keyword count == 0)
+and replaces Block 3 with:
+
+```
+available commands: add, change, create, delete, drop,
+                    insert, rename, replay, show, update
+```
+
+via a new catalog key:
+
+```yaml
+parse:
+  available_commands: "available commands: {commands}"
+```
+
+The list is the alphabetised set of `entry` keywords from
+the usage registry, each rendered via its `parse.token.keyword.*`
+catalog entry (so the strings are catalog-sourced, not
+hard-coded).
+
+This case only fires when the user typed something the
+parser couldn't classify as any known command keyword — the
+"frobulate Customers" case. It's both rarer and more useful
+than the with-prefix case: a user this lost benefits more
+from the full menu than from a missing-token pointer.
+
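Building the fallback line from the registry could be sketched as below. This assumes the entry keywords have already been turned into display labels (the catalog lookup is left out), and `available_commands` is a hypothetical function name. Families such as `add` and `drop` appear several times in the registry, so the sketch deduplicates before sorting:

```rust
use std::collections::BTreeSet;

/// Formats the no-prefix fallback from the registry's entry labels.
/// A `BTreeSet` both removes duplicates and sorts alphabetically.
pub fn available_commands(entry_labels: &[&str]) -> String {
    let unique: BTreeSet<&str> = entry_labels.iter().copied().collect();
    let commands = unique.into_iter().collect::<Vec<_>>().join(", ");
    // Mirrors the catalog template "available commands: {commands}".
    format!("available commands: {commands}")
}
```

Line wrapping of a long list is not handled here and would be left to the caller.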
+### 6. Anchor-phrase compliance (ADR-0019 §10)
+
+ADR-0019's anchor-phrase list contains nine substrings the
+catalog commits to keeping stable. None are parse-error-specific,
+so this ADR doesn't add to the list. The existing parser
+test that asserts on "unknown type" and "expected one of"
+substrings stays — those come from `Type::from_str`'s custom
+error message which ADR-0020 §4 keeps unchanged.
+
+The current structural-error tests assert on substrings like
+"after `show data`", "expected identifier", "found end of
+input", "after `change column Rich`", "expected `:`". The
+new render shape preserves all of these — the rendering
+template is `{prefix} expected {set}, found {token}` and
+the prefix / set / token come from the catalog with the same
+wording. Tests should port unchanged or with at most minor
+adjustments.
+
+### 7. Catalog validator covers the new keys
+
+ADR-0019 §8.6's `KEYS_AND_PLACEHOLDERS` validator extends
+to cover:
+
+- Every `parse.usage.<command>` key referenced from the
+  registry exists.
+- Every `parse.token.keyword.<variant>` key for every
+  `Keyword` enum variant exists.
+- Every `parse.token.punct.<variant>` key for every `Punct`
+  variant exists.
+- The `parse.token.{identifier, number, string_literal,
+  flag, end_of_input}` keys exist.
+- The `parse.token.error.*` keys exist for every
+  `LexErrorKind` variant.
+- The `parse.available_commands` key exists.
+- No format specifiers (already enforced).
+- No engine vocabulary (already enforced).
+
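The existence checks listed above amount to "derive the required keys, then diff them against the catalog". A rough sketch follows, with hypothetical helper names `required_keys` and `missing_keys`. How `KEYS_AND_PLACEHOLDERS` is actually structured is not shown in this ADR, so this is not a description of it:

```rust
use std::collections::HashSet;

/// Derives the keys this ADR requires: the usage keys named by the
/// registry, one token key per keyword variant name, and the
/// available-commands key. (Punctuation, token-class, and lexer-error
/// keys are omitted to keep the sketch short.)
pub fn required_keys(keyword_names: &[&str], usage_keys: &[&str]) -> Vec<String> {
    let mut keys: Vec<String> = usage_keys.iter().map(|k| k.to_string()).collect();
    keys.extend(keyword_names.iter().map(|k| format!("parse.token.keyword.{k}")));
    keys.push("parse.available_commands".to_string());
    keys
}

/// Returns the required keys that the catalog does not define,
/// preserving the order of `required`.
pub fn missing_keys(required: &[String], catalog: &HashSet<String>) -> Vec<String> {
    required
        .iter()
        .filter(|k| !catalog.contains(*k))
        .cloned()
        .collect()
}
```

A test built on this would assert that `missing_keys` returns an empty vector for the shipped catalog, so a forgotten YAML entry fails at test time.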
+### 8. Does the `usage:` block respect the verbosity setting?
+
+No. The `messages (short|verbose)` setting (ADR-0019)
+governs *engine-error* verbosity (whether to render the
+hint block of a `FriendlyError`). Parse errors don't go
+through `FriendlyError`; they have their own render path,
+and the usage block is always shown. Rationale: a learner
+toggling to `messages short` is signalling they recognise
+the engine-error patterns and want less explanation around
+those — they're not signalling that they want less
+parse-help. Parse errors mean the user couldn't even
+formulate a runnable command; that's exactly the moment to
+maximise pedagogical surface, regardless of the
+engine-error verbosity preference.
+
+If experience shows this is wrong, a future amendment can
+gate the usage block on a separate setting. That doesn't
+need to be designed now.
+
+## Out of scope
+
+1. **Tab completion (I3) and syntax highlighting (I4)**
+   themselves. ADR-0020 §9-10 commits to the parser
+   contract; ADR-0021 doesn't extend it.
+2. **Schema-aware suggestions** ("did you mean `Customers`?"
+   when the user typed `Customrs`). Useful but a separate
+   feature; would land in I3 territory (completion + spell
+   check share a candidate list).
+3. **Suggested fixes** ("change `crete` to `create`"). Same
+   bucket as schema-aware suggestions.
+4. **Multi-error reporting.** Today and after this ADR, the
+   parser reports the first error and stops. Recovery-based
+   multi-error parsing is out of scope and re-opens with
+   I3's ADR (ADR-0020 §11).
+5. **Persisting the verbosity setting** (which doesn't
+   affect parse errors anyway, per §8). ADR-0019 deferred it
+   to a future settings ADR.
+
+## Consequences
+
+### Positive
+
+- **Per-command usage at point of failure.** A learner who
+  types `create` sees the full `create table` grammar
+  instead of "expected `table`". The user-reported gap
+  closes.
+- **Aggregated `available commands` for cold starts.**
+  `frobulate Customers` now lists the ten command-starting
+  keywords under a sensible framing.
+- **Vocabulary lives in the catalog, not in code.** Renaming
+  a keyword's user-facing wording is one YAML edit. Adding
+  a new keyword adds two lines (registry + token-name key);
+  the validator catches both if forgotten.
+- **The render path simplifies.** `humanise()` shrinks to a
+  small composer over catalog lookups — no per-character
+  description, no `RichPattern` walking, no
+  prefer-custom-over-structural switching (the latter
+  becomes "render the structural error and append the usage
+  template").
+- **Composes with ADR-0019's `FriendlyError`.** Engine
+  errors and parse errors are rendered through different
+  paths but both draw their wording from the catalog, so
+  shared vocabulary has one home and drift between them is
+  unlikely.
+
+### Costs
+
+- **A second registry to keep in sync** with the parser. The
+  validator (§7) catches missing usage entries / missing
+  token keys at test time, but adding a new command means
+  three steps (parser combinator, usage-registry entry,
+  catalog YAML edit). Mitigation: a unit test asserts every
+  command in the parser has a registry entry (catches
+  forgotten entries; matches the friendly-module pattern).
+- **Catalog grows by ~30-40 entries** (one usage template
+  per command, one keyword name per `Keyword` variant, a
+  handful of token-class names, a handful of error names).
+  Each entry is one line of YAML; total catalog grows from
+  ~170 entries to ~210. Within budget.
+- **Wording iteration** on the usage templates will probably
+  happen post-merge. This is normal for pedagogical text
+  and the catalog makes it cheap.
+
+### Neutral
+
+- **Public parser API is unchanged.** `parse_command(&str)`
+  signature stable. The new `lex` and `parse_tokens`
+  functions exposed by ADR-0020 are the I3/I4 hook;
+  ADR-0021 doesn't add to that surface.
+- **`AppEvent` shape unchanged.** Parse errors continue to
+  flow through `dispatch_dsl`'s existing path (push echo,
+  push caret, push error). This ADR's render changes are
+  internal to that function plus the `t!()` calls inside it.
+
+## Implementation notes
+
+### Order of operations (within the joint ADR-0020 + ADR-0021 implementation session)
+
+1. Land ADR-0020 (lexer + parser refactor + minimal
+   humaniser).
+2. Add `src/dsl/usage.rs` with the registry struct, the
+   static table, and `matched_entry()`.
+3. Populate `parse.usage.*` and `parse.token.*` catalog
+   sections.
+4. Extend `friendly::keys::KEYS_AND_PLACEHOLDERS` with the
+   new keys.
+5. Rewrite `dispatch_dsl`'s error-render arm in `app.rs` to
+   compose the three blocks per §2 (or §5 fallback).
+6. Add tests:
+   - Unit: every registered usage entry resolves through the
+     catalog. Every `Keyword` variant has a `parse.token.keyword.*`
+     entry.
+   - Integration (`tests/parse_error_pedagogy.rs`, new):
+     `create`, `add`, `update Customers`, `frobulate
+     Customers`, `create table` (no PK clause), `insert into
+     T` (no values), each producing the expected
+     three-block output.
+7. Update or port the two existing structural-error tests
+   in `parser.rs::tests` to the new render shape.
+
+### Things that interact subtly
+
+- **The "deepest consumed keyword" mechanism** (§3) walks
+  the prefix once per parse failure. Cheap; no perf concern.
+  It must not mistake content for a keyword (e.g. a
+  partially typed identifier whose first letters spell a
+  keyword). Because the lexer commits to
+  identifier-vs-keyword classification before the parser
+  sees tokens, this isn't a real risk. Documented inline.
+- **Multiple usage entries per `add` / `drop`** are rendered
+  under one `usage:` prefix per §2. This is one of the
+  pedagogically-best parts of the change: the user gets the
+  full family rather than guessing which sibling they
+  wanted.
+- **`replay`'s special-case parsing** (ADR-0020 §6) is
+  invisible to the usage layer. The user typing `replay`
+  with no path gets the `parse.usage.replay` template.
+- **`messages` is an app-level command, not a DSL command**,
+  so it is not in the parser registry and doesn't appear in
+  `available commands:`. Same posture as `mode`, `help`,
+  `quit`. Documented in the registry's prelude.