docs: ADR-0042 — continue H1a parse-error pedagogy on the grammar tree

ADR-0020/0021 specified a chumsky-based H1a; ADR-0024 replaced chumsky with the scannerless walker, leaving both obsolete. Mark them superseded (kept as institutional memory) and add ADR-0042, which restates H1a against the architecture as built. ADR-0042 records that H1a is substantially shipped already — per-command usage block, available-commands fallback, source-derived ident slot labels, curated parse.custom.* near-miss messages, and schema-aware [ERR] diagnostics — and defines the remaining work: a verified per-command near-miss matrix (the definition of done), friendlier literal expectation labels that add role context while keeping the exact literal visible, and advanced-mode SQL parse parity (RETURNING scope, CROSS JOIN ON, INSERT…SELECT count), kept distinct from ADR-0019 §OOS-2 engine-error sanitisation. - docs/adr/0020,0021: superseded notes + README entries - docs/adr/0042: new ADR - docs/adr/README.md: index upkeep (ADR-0000 rule)
2026-06-03 14:05:09 +00:00
parent a8d0138d8b
commit 0e6f767848
4 changed files with 355 additions and 4 deletions
@@ -0,0 +1,298 @@
+# ADR-0042: H1a parse-error pedagogy in the grammar-tree era
+
+## Status
+
+**Accepted** — 2026-06-03.
+
+Continues H1a (`requirements.md`) from **ADR-0021**, whose
+chumsky-based mechanism was superseded by **ADR-0024** (unified
+grammar tree). ADR-0021's *intent* — surface the grammar of the
+command at the point of error, not just the next token — is
+re-stated here against the architecture as actually built, with
+an inventory of what already ships and a definition of done for
+the remaining work.
+
+Cross-references ADR-0019 (friendly-error layer + i18n catalog
+conventions; H1a output shares the catalog), ADR-0022 (ambient
+typing assistance, which shares the walker's expected-set
+machinery), ADR-0024 (the grammar tree), and ADR-0009 (DSL
+surface conventions; usage templates render in the documented
+surface form).
+
+## Context
+
+### Why a new ADR rather than amending ADR-0021
+
+ADR-0021 specifies a `UsageEntry` registry in `src/dsl/usage.rs`,
+`parse.token.*` catalog keys, and a renderer over chumsky's
+`RichPattern<Token>` expected sets. None of that exists. ADR-0024
+removed chumsky from the project, deleted `usage.rs`, and folded
+usage information onto the grammar nodes themselves. Amending
+ADR-0021 in place would force every reader to mentally translate a
+dead mechanism; a fresh ADR records the live state directly.
+ADR-0020 and ADR-0021 keep their superseding notes and remain as
+institutional memory.
+
+### What H1a is
+
+When a learner types something near-correct, the error should
+*name the missing keyword or clause* and *show the shape of the
+command*, rather than point a caret at the unexpected character.
+The user-reported gap: typing `create` once produced
+`parse error: after \`create\`, expected \`table\`` — structurally
+true, pedagogically silent.
+
+### What already ships (the baseline — do not re-build)
+
+Verified against code on 2026-06-03. The grammar-tree migration
+delivered most of ADR-0021's intent through different machinery:
+
+1. **Per-command usage block.** Every `CommandNode` carries
+   `usage_ids: &'static [&'static str]`
+   (`src/dsl/grammar/mod.rs`). On any parse error the renderer
+   emits a `usage:` block listing every form of the matched
+   command family — 38 templates under `parse.usage.*`
+   (`src/friendly/strings/en-US.yaml:499-571`), resolved by
+   `grammar::usage_keys_for_input` and rendered by
+   `render_usage_block` (`src/app.rs:2560`).
+
+2. **Available-commands fallback.** When no command keyword was
+   consumed, the block becomes
+   `available commands: …` (`parse.available_commands`,
+   `en-US.yaml:493`; `app.rs:2593`).
+
+3. **Structural error names the consumed prefix and expected
+   set.** `format_walker_error` (`src/dsl/parser.rs:289`) renders
+   `after \`<consumed>\`, expected <set>, found <token|end of
+   input>`, distinguishing incomplete-at-EOF (`at_eof = true`,
+   more input would help) from a definite mid-input mismatch.
+
+4. **Friendly slot labels for identifiers.** `format_expectation`
+   (`src/dsl/parser.rs:262`) renders `Ident` slots by source —
+   "table name", "column name", "relationship name", "index
+   name", "type" — instead of a bare "identifier" (ADR-0022 stage
+   8c).
+
+5. **Curated custom messages** for high-value near-misses under
+   `parse.custom.*` (`en-US.yaml:443-478`): `create_table_needs_pk`,
+   `insert_form_a_missing_values` ("looks like Form A — add
+   `values (...)`"), `change_column_flags_exclusive`,
+   `bind_type_mismatch`, the redundant-constraint and
+   alter-add-primary-key cases, etc.
+
+6. **Schema-aware pre-flight diagnostics** that light the
+   `[ERR]` validity indicator *at typing time* (ADR-0027 /
+   ADR-0033 / ADR-0036): INSERT arity for Forms A/B/C, unknown
+   table/column, type mismatch, `= NULL`, NOT-NULL-missing, and —
+   on the advanced-SQL surface — `cte_arity_mismatch`,
+   `compound_arity_mismatch`, and `projection_alias_misplaced`
+   (`diagnostic.*`, `en-US.yaml:577-620`; walker logic in
+   `src/dsl/walker/mod.rs`).
+
+7. **Ambient "Next:" hints** and the **simple→advanced cross-mode
+   pointer** (ADR-0022 / `advanced_alternative_note`,
+   `src/input_render.rs`).
+
+So H1a is *substantially* delivered at the intent level. The
+handoff's two canonical examples already behave: `insert into T
+('Oli')` → custom Form-A message; `update T set x=1` → structural
+"expected `where` or `--all-rows`" + usage block.
+
+### What remains — the genuine gap
+
+The remaining work is **systematic verification plus targeted
+polish**, not a missing feature:
+
+- **No enumerated coverage guarantee.** Coverage is curated
+  case-by-case; nothing asserts that *every required slot in every
+  command* produces a pedagogically-sound near-miss message.
+- **Literal expectations render terse.** `Word`/`Literal`/`Punct`/
+  `Flag` slots come out as backticked literals (`` `where` ``,
+  `` `=` ``, `` `--all-rows` ``). Correct, but a learner is helped
+  more by a short prose gloss in select high-value positions.
+- **Advanced-mode SQL parse pedagogy is thinner** than the DSL
+  surface (RETURNING scope, CTE-arity diagnostic positioning,
+  `CROSS JOIN … ON`, INSERT…SELECT column-count). No other ADR or
+  open issue covers this (ADR-0019 §OOS-2 covers advanced-SQL
+  *engine-error sanitisation* — a different layer).
+
+## Decision
+
+### 1. Definition of done — a verified near-miss matrix
+
+H1a is "done" when there is a test matrix that, for **every
+command in the REGISTRY**, exercises its salient near-miss inputs
+and asserts the rendered output reads pedagogically. "Salient
+near-misses" per command means at minimum:
+
+- the bare entry keyword alone (`create`, `add`, `update`);
+- each required clause omitted (e.g. `update T set x=1` with no
+  filter rail; `insert into T (cols)` with no `values`);
+- a wrong token where a specific slot is expected (e.g. a number
+  where a table name belongs);
+- the zero-prefix / unknown-command case (available-commands
+  fallback).
+
+The matrix lives in the existing surfaces — `tests/typing_surface/`
+(snapshot-based, the standalone `typing_surface_matrix` binary) for
+the typing-time hint/validity view, and
+`tests/it/parse_error_pedagogy.rs` (the consolidated `it` binary)
+for the submit-time rendered three-block output. New integration
+tests go in `tests/it/` per the handoff-57 §3 layout rule — **not**
+as new top-level `tests/*.rs`.
+
+Work is **test-first**: add the matrix entry, observe the current
+rendering, and only then adjust wording/labels where it reads
+poorly. A near-miss whose current rendering is already good is
+locked by a snapshot, not rewritten.
+
+### 2. Friendlier literal expectation labels
+
+`format_expectation` gains, for high-value keyword/punct positions,
+an optional prose gloss while **always keeping the exact literal
+visible** — a learner must still see the precise token to type.
+The principle: a label may *add* role context, never *replace* the
+literal.
+
+Illustrative target (final wording settled per-case against the
+matrix, as is normal for pedagogical text):
+
+- `expected \`where\` or \`--all-rows\`` →
+  `expected a filter clause: \`where …\` or \`--all-rows\``
+- `expected \`values\`` (after a Form-A column list) →
+  already covered by `parse.custom.insert_form_a_missing_values`;
+  the matrix confirms it fires.
+
+Mechanism (illustrative, finalised at implementation time): a
+grammar `Word`/`Punct` node may carry an optional expectation-label
+key, mirroring how `Ident` slots derive a label from
+`IdentSource`. Absent an override, rendering is unchanged (the
+backticked literal). This keeps the change additive and per-slot —
+no blanket reword that would churn the anchor-phrase tests
+needlessly.
+
+New glosses are catalog-sourced (`parse.expect.*` or reuse of
+`parse.usage.*` fragments — chosen at implementation time) so
+wording stays in `en-US.yaml`, not in code, consistent with
+ADR-0019.
+
+### 3. Advanced-mode SQL parse pedagogy — in scope
+
+The same matrix discipline (§1) extends to the advanced-mode SQL
+surface. Two of the relevant arity diagnostics **already exist** and
+must not be re-built — `cte_arity_mismatch` and
+`compound_arity_mismatch` (`en-US.yaml:590-591`); for these the work
+is matrix coverage and, for CTE, auditing whether the diagnostic is
+*positioned at the CTE name* (easiest to fix) rather than the body.
+The genuine **absences** the pass adds are: `RETURNING` column
+scope, `CROSS JOIN` rejecting an `ON` clause, and INSERT…SELECT
+projection/target column-count (verified absent from the catalog
+2026-06-03). Each gets a matrix entry; fixes land as walker
+diagnostics or `parse.custom.*` messages following the existing
+patterns. The `:` one-shot escape (a simple-mode line run once in
+advanced mode) is part of the advanced surface and gets at least one
+near-miss matrix entry.
+
+This stays clear of ADR-0019 §OOS-2 (advanced-SQL *engine-error*
+sanitisation): §OOS-2 reworks errors raised by *executing* SQL;
+H1a here concerns errors raised while *parsing* it. If a near-miss
+turns out to be an engine error rather than a parse error, it is
+out of H1a scope and noted against §OOS-2 instead.
+
+### 4. Catalog and anchor-phrase discipline
+
+All new or reworded user-facing strings go through the i18n catalog
+(`en-US.yaml`) and the `KEYS_AND_PLACEHOLDERS` validator, per
+ADR-0019. No engine vocabulary in any string (CLAUDE.md).
+
+Two anchor styles constrain §2's glosses and both are preserved by
+its "literal always visible" rule:
+
+- The **substring assertions** in `src/dsl/parser.rs` tests
+  ("after `…`", "expected table name", "found end of input",
+  "unknown type", "expected one of").
+- The **substring assertions** in `tests/it/parse_error_pedagogy.rs`,
+  which check for backticked literals and usage fragments
+  (e.g. `` `column` ``, `` `1` ``, "create table", "with pk"). This
+  test is `.contains()`-based, not snapshot-based, so a §2 gloss
+  that dropped the bare literal would fail it — which is precisely
+  the regression §2's rule prevents.
+
+The snapshot-based `tests/typing_surface/` matrix will re-baseline
+on any §2 wording change (expected; reviewed via `cargo insta`),
+but the two substring suites above must stay green without edits to
+their assertions.
+
+## Out of scope
+
+1. **Advanced-SQL engine-error sanitisation** — ADR-0019 §OOS-2.
+2. **Tab completion (I3) and syntax highlighting (I4)** as
+   features — they share the walker but are separate ADRs.
+3. **Schema-aware "did you mean `Customers`?" spell-correction** —
+   ADR-0021's out-of-scope §2; belongs with I3.
+4. **Multi-error reporting.** The walker reports the first error
+   and stops; unchanged.
+5. **`messages`-style verbosity gating of the usage block.** Per
+   ADR-0021 §8 the usage block is always shown; parse errors are
+   exactly when pedagogical surface should be maximal. Unchanged.
+6. **Auto-generating usage/help text from the grammar.** ADR-0024
+   left help prose hand-curated; templates stay hand-written.
+
+## Consequences
+
+### Positive
+
+- H1a gains an explicit, enumerated definition of done instead of
+  an open-ended "systematic pass still pending".
+- The matrix becomes a regression lock: future grammar changes
+  that degrade a near-miss message fail a snapshot.
+- Literal-label glosses close the last terse-wording gap without a
+  blanket reword.
+- The advanced-SQL surface reaches parity with the DSL surface for
+  the audience that has switched to raw SQL.
+
+### Costs
+
+- Wording iteration across many near-miss cases — but cheap,
+  catalog-driven, and snapshot-guarded.
+- The §2 per-node label field is one more annotation a new command
+  may set (optional; default unchanged).
+- Snapshot volume grows; acceptable given the existing ~160-entry
+  typing-surface matrix.
+
+### Neutral
+
+- No public API change. `parse_command*` signatures, the
+  `ParseError` shape, and the three-block render path are all
+  unchanged; this ADR adds wording, labels, and tests within them.
+
+## Implementation notes
+
+Order of operations (test-first throughout):
+
+1. Enumerate the per-command near-miss matrix (§1) as failing/asserting
+   tests in `tests/typing_surface/` + `tests/it/parse_error_pedagogy.rs`.
+   Capture current rendering as the starting baseline.
+2. Triage: which entries read poorly? Only those get wording work.
+3. Add the optional expectation-label mechanism (§2) and apply it
+   to the high-value keyword/punct positions surfaced in triage.
+4. Advanced-SQL near-miss audit + fixes (§3), distinguishing parse
+   from engine errors as they arise.
+5. Catalog validator + anchor-phrase checks stay green (§4).
+6. Update `requirements.md` H1a with the matrix as the done-marker;
+   flip to `[x]` only when the matrix is complete and green.
+
+## See also
+
+- ADR-0021 — Parser-as-source-of-truth for H1a (mechanism
+  superseded; intent continued here).
+- ADR-0020 — Tokenization layer (superseded by the scannerless
+  walker).
+- ADR-0024 — Unified grammar tree (the architecture H1a is built
+  on).
+- ADR-0022 — Ambient typing assistance (shares the expected-set
+  machinery).
+- ADR-0019 — Friendly-error layer and i18n catalog (§OOS-2 is the
+  adjacent engine-error scope).
+- ADR-0009 — DSL command-syntax conventions (usage surface form).
+- `requirements.md` — H1a tracking entry.