# ADR-0032: The full SQL `SELECT` grammar

## Status

Accepted

## Context

ADR-0030 commissions advanced mode as a body of **SQL grammar inside the unified grammar tree** (ADR-0023/0024), phased. Phase 1 ("Foundations + first `SELECT`") shipped: a single-table `SELECT` with projection, `WHERE`, `ORDER BY`, and `LIMIT`, executed as validated SQL text through the existing data-table renderer. ADR-0031 authored the SQL **expression grammar** the Phase-1 `SELECT` consumed.

Phase 2 — "`SELECT` — full" — is the next slice. ADR-0030 §3 lists it: `JOIN`s, `GROUP BY` / `HAVING`, aggregates, subquery expressions, `UNION`/`INTERSECT`/`EXCEPT`, common table expressions, `LIMIT … OFFSET`, qualified column references. ADR-0030 §3 also says the full `SELECT` grammar "is each large enough to warrant their own focused ADR when implemented — the precedent is ADR-0026 for the `WHERE` grammar." This is that ADR.

The architecture is fixed (ADR-0030 §1, §4, §6, §8): one walker, grammar-as-text execution, ambient assistance for free. This ADR fixes the **shape** of the grammar — the productions, the recursion, the additive extensions to ADR-0031's expression fragment, and the few execution-path implications (worker-side column-origin lookup so result columns recover their playground type). It deliberately does **not** revisit ADR-0030's structural decisions; references in this ADR's text to ADR-0030 §X mean "that decision is the controlling one."

### What ADR-0030 and ADR-0031 already fix

- **No batch parser; SQL is grammar in the unified tree.** Subquery recursion is a `Node::Subgrammar(&NAMED)` reference, exactly as the expression ladder uses it (ADR-0031 §3).
- **No AST builder for the parts that execute as text.** `Command::Select { sql: String }` carries the validated source; the worker prepares and runs it (ADR-0030 §4/§6, ADR-0031 §2).
- **The `__rdbms_*` rejection** at every table-name slot (ADR-0030 §6) — re-applied to every Phase-2 table-source slot (`FROM`, `JOIN`, CTE-name).
- **No allowlist for function names** (ADR-0030 §13 OOS-3, ADR-0031 §6). Aggregates (`count`, `sum`, `avg`, `min`, `max`) parse through the generic `name_or_call` path — the grammar is structurally aggregate-blind, by design.
- **No quoted identifiers** (ADR-0031 §7 OOS-3) — unchanged.
- **`MAX_SUBGRAMMAR_DEPTH = 64`** (ADR-0026) is the shared recursion budget across DSL `Expr`, SQL expression, and (added here) SQL `SELECT` recursion. No new walker capability is introduced (§9).

### The boundary with ADR-0031

ADR-0031 §7 named two additive extensions deferred to this ADR:

- **OOS-1: subquery expressions** — `( SELECT … )` as a `primary`, `IN ( SELECT … )`, `EXISTS ( SELECT … )`. Their grammar is fixed in §6; they are additive `Choice` branches in `sql_expr.rs`, recursing into the named `SELECT` fragment authored here.
- **OOS-2: qualified column references** — `t.c` / `alias.c`. Their grammar is fixed in §5; they are an additive tail on `name_or_call` in `sql_expr.rs`.

`sql_expr.rs` was shaped to receive both branches without restructuring (ADR-0031 §7 promise). This ADR redeems that promise; the changes there are strictly additive.

## Decision

### 1. The top-level `SELECT` grammar

The full statement decomposes into a top-level *compound query* (set-operator chains around per-leg *core selects*), wrapped by an optional `WITH` prefix and trailing `ORDER BY` / `LIMIT`:

```
select_statement := [ with_clause ] compound_select

compound_select  := select_core ( set_op select_core )*
                    [ order_by_clause ]
                    [ limit_clause ]

set_op           := UNION [ ALL ] | INTERSECT | EXCEPT

select_core      := SELECT [ DISTINCT | ALL ] projection_list
                    [ from_clause ]
                    [ where_clause ]
                    [ group_by_clause ]
                    [ having_clause ]

with_clause      := WITH [ RECURSIVE ] cte_def ( ',' cte_def )*
cte_def          := identifier [ '(' column_name_list ')' ]
                    AS '(' compound_select ')'

projection_list  := projection_item ( ',' projection_item )*
projection_item  := '*'
                  | identifier '.' '*'
                  | sql_expr [ [ AS ] identifier ]

from_clause      := FROM table_source ( join_clause )*
table_source     := identifier [ [ AS ] identifier ]

join_clause      := [ INNER ] JOIN table_source ON sql_expr
                  | LEFT [ OUTER ] JOIN table_source ON sql_expr
                  | RIGHT [ OUTER ] JOIN table_source ON sql_expr
                  | FULL [ OUTER ] JOIN table_source ON sql_expr
                  | CROSS JOIN table_source

where_clause     := WHERE sql_expr
group_by_clause  := GROUP BY sql_expr ( ',' sql_expr )*
having_clause    := HAVING sql_expr
order_by_clause  := ORDER BY order_item ( ',' order_item )*
order_item       := sql_expr [ ASC | DESC ]
limit_clause     := LIMIT sql_expr [ OFFSET sql_expr ]
```

`sql_expr` is ADR-0031's `SQL_OR_EXPR`, extended additively per §5 and §6. `column_name_list` is `identifier ( ',' identifier )*`. The named `static Node`s exported by the new `src/dsl/grammar/sql_select.rs` are `SQL_SELECT_STATEMENT` (matching the full statement) and `SQL_SELECT_COMPOUND` (the embedded form, omitting the outer `WITH`; this is what subqueries recurse into — see §6, §9).

Notes on specific productions:

- **`FROM` stays optional.** Phase 1's autonomous decision §4.1 is upheld: `SELECT 1` and `SELECT upper('x')` continue to parse. With JOINs landing, the absence of a `FROM` simply means no `from_clause`/`join_clause` was matched; no extra shape is needed.
- **Bare-alias projection (`select a x`) is admitted.** Phase 1's autonomous decision §4.2 deliberately rejected it as structurally ambiguous. With Phase 2's grammar — `FROM` is the only word that can legitimately follow a projection list, and it is a keyword in the walker's expected-set — the ambiguity dissolves: an identifier following the last projection expression that is not `FROM`, `,`, `WHERE`, `GROUP`, `ORDER`, `LIMIT`, or a set-op keyword is a bare alias, and is so admitted. This lifts a small but visible Phase-1 limitation.
- **`SELECT [ DISTINCT | ALL ]`.** `ALL` is the default and is admitted for symmetry; `DISTINCT` is the meaningful case.
  They are mutually exclusive at this position (a `Choice`, not two `Optional`s).
- **`identifier '.' '*'`** lives only in `projection_item`, never in `sql_expr`. This is intentional: `t.*` is *projection syntax*, not an expression, and admitting it as an expression primary would let it appear in `WHERE` / `ORDER BY` / etc. where the engine would reject it and the engine-neutral error would be hard to phrase. The grammar simply refuses it structurally outside projection.
- **`UNION ALL` is a single set-op,** not `UNION` followed by an `ALL` modifier on the next leg. `set_op` is a `Choice` of the four atoms (with `UNION` and `UNION ALL` as separate branches); factoring `UNION [ ALL ]` is also valid but the explicit four branches keep the matched-path classes cleaner for highlighting.

### 2. JOIN flavours admitted

The grammar admits exactly the flavours the user picked:

- `INNER JOIN` / bare `JOIN`
- `LEFT [ OUTER ] JOIN`
- `RIGHT [ OUTER ] JOIN`
- `FULL [ OUTER ] JOIN`
- `CROSS JOIN`

The first four take a mandatory `ON sql_expr`; `CROSS JOIN` takes none. `OUTER` is the optional explicit modifier on `LEFT` / `RIGHT` / `FULL`.

**Explicitly out (§11):** `NATURAL JOIN`, `JOIN … USING (col)`, and comma-list `FROM t1, t2` (the legacy implicit cross join). The first two add grammar weight for limited teaching value; comma-FROM teaches habits we do not want to encourage — `CROSS JOIN` covers the same shape explicitly.

JOIN chains are admitted as a flat `( join_clause )*`. Standard SQL is left-associative; since the grammar builds no AST and the engine receives the source text verbatim (ADR-0030 §4), the engine resolves the associativity. The grammar's job ends at "the chain parses".

### 3. Set operators and compound queries

`UNION`, `UNION ALL`, `INTERSECT`, `EXCEPT` all admitted — ADR-0030 §3's full set. The compound shape (§1) is `select_core (set_op select_core)*`, flat.
Standard SQL gives `INTERSECT` higher precedence than `UNION` / `EXCEPT`; the engine resolves this — the grammar admits the chain as written. This mirrors §2's JOIN-chain decision.

A user who wants explicit grouping writes `(SELECT … INTERSECT SELECT …) UNION SELECT …`, which falls out of the subquery-`primary` branch (§6) — though for a top-level statement this requires an extra `SELECT` wrapping. In practice the engine's precedence is what learners encounter; calling it out in the `help sql` page (ADR-0030 Phase 6) is sufficient.

`ORDER BY` / `LIMIT` on a compound apply to the whole compound, not to a leg — fixed by the position of `order_by_clause` and `limit_clause` in §1's `compound_select`.

### 4. CTEs (`WITH` and `WITH RECURSIVE`)

The full `with_clause` per §1. Both forms admitted: non-recursive `WITH` for naming intermediate results, and `WITH RECURSIVE` for recursive queries (tree traversals, transitive closure, generated sequences). The `cte_def` body is a parenthesised `compound_select`, so the recursion is into `SQL_SELECT_COMPOUND` via `Subgrammar` — the same recursion mechanism subqueries use (§9).

**CTE-name collisions.** A CTE name shares the table-name namespace at the engine. Standard SQL: the CTE shadows a same-named base table within the statement. The grammar is agnostic — both are identifiers in a table-source slot — so the shadowing falls out of engine resolution. The `reject_internal_table` validator still rejects any `__rdbms_*` identifier in any table-source slot, **including** CTE-name slots and the `FROM`s inside CTE bodies. That is the right posture: the reserved namespace is reserved everywhere.

Recursive CTEs use the standard `cte_name AS ( base_case UNION [ALL] recursive_case )` shape — already admitted by §1's `compound_select` body. No grammar branch specific to recursion is needed; the `RECURSIVE` keyword is a hint to the engine, not a grammar gate.

### 5. Qualified column references

Additive extension to `sql_expr.rs` (ADR-0031 §7 OOS-2). `name_or_call`'s identifier prefix gains a `Choice` tail:

```
name_or_call := identifier ( '.' identifier
                           | '(' call_args? ')'
                           )?
```

The leading identifier is matched once (preserving ADR-0031 §1's factoring — no `Choice` branch begins with an identifier). The optional tail is *either* a qualified-reference suffix (`. identifier`) *or* a function-call argument list (`( … )`), not both. A bare identifier with no tail remains a plain column reference.

A function call with a qualified name — `schema.f(…)` — is not in scope (we have no schemas) and is structurally inadmissible by construction: there is no production that admits both a `.`-tail and a `(`-tail.

Completion for the qualified form: when the cursor is past `identifier '.'`, the completion source is "columns of the table or alias named by the leading identifier", resolved from the active `SchemaCache` (the same source the DSL completion uses, ADR-0030 §8). This is a small extension to the existing `IdentSource::Columns` machinery — when in scope, column completion is scoped to the named source.

### 6. Subquery expressions

Additive extensions to `sql_expr.rs` (ADR-0031 §7 OOS-1):

- **Scalar subquery as `primary`.** A `Choice` branch `'(' compound_select ')'`. The existing `'(' or_expr ')'` branch handles parenthesised expressions. Both start with `'('`, so per ADR-0031 §1's factoring principle, the `'('` is matched once and the inside is a `Choice` between `compound_select` and `or_expr`. The first inside token disambiguates: `SELECT` or `WITH` → subquery; anything else → expression. The two `Choice` branches have non-overlapping first-token sets, so the walker's expected-set at the ambiguity point merges naturally without `Optional`-first hazards.
- **`IN ( subquery )`.** The existing `predicate_tail`'s `IN '(' additive (',' additive)* ')'` branch gains a sibling `IN '(' compound_select ')'`.
  Same `'('` factoring as the scalar case: after `'('`, branch on `SELECT`/`WITH` (subquery) vs additive-first-token (literal list). `NOT IN` follows from the existing `[ NOT ]` factoring on the predicate tail.
- **`[ NOT ] EXISTS ( subquery )`.** Added as a `primary` `Choice` branch:

  ```
  primary := …
           | EXISTS '(' compound_select ')'
  ```

  The bare `EXISTS` form lives in `primary`; `NOT EXISTS` falls out of the existing `not_expr := NOT not_expr` tier above `primary` in the precedence ladder. This is structurally cleaner than putting `[ NOT ] EXISTS` inside `primary`: there is only one place `NOT` is admitted, and it composes uniformly.

All three branches recurse through `Subgrammar(&SQL_SELECT_COMPOUND)`.

Correlated subqueries fall out for free — a subquery's `sql_expr` reaches identifiers, which the engine resolves against outer scopes. The grammar imposes no correlation constraint; correlation is engine-side semantics.

### 7. `GROUP BY` and `HAVING`

`GROUP BY` takes a comma-separated list of `sql_expr`s. Standard SQL admits any expression as a grouping key (not just bare columns) — e.g. `GROUP BY date(created_at)`. The grammar admits this without special-casing.

`HAVING` is a single `sql_expr`. Its semantics is "boolean over grouped rows"; the grammar does not enforce that — the expression's typing is the engine's concern.

**Aggregate correctness is not grammar-checked.** Whether a projection's non-aggregated columns are valid given the `GROUP BY` keys is a semantic question. ADR-0030 §9 settled this: the grammar admits structurally, the engine rejects semantically, and the friendly-error layer renders engine-neutral wording (ADR-0019). A learner who writes `SELECT Name, COUNT(*) FROM t` sees an engine-neutral "Name must appear in a GROUP BY clause or be wrapped in an aggregate function"-style message, not a raw engine string and not a parse error. This is the project's honest limitation (ADR-0030 §7) and remains so.

### 8. `LIMIT` / `OFFSET` and `ORDER BY` extras

`LIMIT n [ OFFSET m ]` — the standard form. Both `n` and `m` are `sql_expr`s (in practice integer literals, but the grammar admits the general form so e.g. `LIMIT max(10, x) OFFSET 0` is structurally accepted; the engine constrains values).

The MySQL/SQLite legacy comma form `LIMIT m, n` is **out** (§11). Its argument order (offset first, then count) inverts the keyword form — a needless source of confusion.

`ORDER BY` already admits `sql_expr` items with optional `ASC` / `DESC` (Phase 1). With Phase 2:

- **Column-position references** (`ORDER BY 1, 3 DESC`) fall out for free — an integer literal is a valid `sql_expr`, and the engine interprets a bare positive integer in `ORDER BY` as a column position. The grammar does not distinguish the case; rendering interprets the position. Document in `help sql`.
- **Qualified refs** in `ORDER BY` (e.g. `ORDER BY t.c`) fall out of §5 — the grammar uses the same `sql_expr` body.

### 9. Recursion, the depth budget, and the walker

`SELECT` recurses into itself at four points:

- A subquery `primary` in `sql_expr` (§6).
- An `IN ( subquery )` predicate tail (§6).
- An `EXISTS ( subquery )` primary (§6).
- A CTE body (§4).

Every recursion is wired through `Node::Subgrammar(&SQL_SELECT_COMPOUND)` — the named `static` Node exported by `sql_select.rs`. The recursion is token-guarded in every case: a subquery `primary` is preceded by `'('`; an `IN ( subquery )` by `IN (`; an `EXISTS ( subquery )` by `EXISTS (`; a CTE body by `AS (`. There is no left recursion; the walker always makes progress.

`MAX_SUBGRAMMAR_DEPTH = 64` (ADR-0026, reused by ADR-0031) is **shared**: DSL `Expr` recursion, SQL expression recursion, and SQL `SELECT` recursion all increment the same `WalkContext::subgrammar_depth`. A worst-case learner query might be `SELECT … WHERE id IN (SELECT … WHERE id IN (SELECT …))` with each inner select carrying a few-deep expression — well below the cap.
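
The shared budget can be pictured as one counter that every recursion kind increments. The sketch below is illustrative only and is not the walker's code: `WalkContext::subgrammar_depth` and `MAX_SUBGRAMMAR_DEPTH` are the names used in this ADR, while `Recursion`, `DepthExceeded`, `enter`, and `leave` are hypothetical names, and the exact comparison against the cap is an assumption.

```rust
/// Cap named in ADR-0026; shared by every recursion kind.
const MAX_SUBGRAMMAR_DEPTH: usize = 64;

/// Hypothetical tag for the three recursion kinds that share the budget.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Recursion {
    DslExpr,
    SqlExpr,
    SqlSelect,
}

/// Hypothetical error produced when the cap is reached.
#[derive(Debug, PartialEq)]
struct DepthExceeded {
    at: Recursion,
}

#[derive(Default)]
struct WalkContext {
    subgrammar_depth: usize,
}

impl WalkContext {
    /// Enter one subgrammar level. All kinds bump the same counter,
    /// so nesting of different kinds is budgeted together.
    fn enter(&mut self, kind: Recursion) -> Result<(), DepthExceeded> {
        if self.subgrammar_depth >= MAX_SUBGRAMMAR_DEPTH {
            return Err(DepthExceeded { at: kind });
        }
        self.subgrammar_depth += 1;
        Ok(())
    }

    /// Leave one subgrammar level.
    fn leave(&mut self) {
        self.subgrammar_depth = self.subgrammar_depth.saturating_sub(1);
    }
}
```

Under these assumptions, two nested subqueries that each carry a three-deep expression consume 8 of the 64 levels, which matches the "well below the cap" estimate.
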
The cap remains purely a stack-overflow guard; **this ADR does not raise it**. If pathological-but-realistic learner queries reach 64 in practice, a focused ADR lifts it with measurements. Speculative raising would weaken the guard without evidence.

**No new walker capability is introduced.** `Subgrammar`, the depth counter, the cap, and the friendly depth-exceeded error all carry over from ADR-0026 unchanged — the same posture ADR-0031 took. This is a non-trivial property: Phase 2 is the biggest single grammar slice in the project, and it lands without changing the walker's contract.

### 10. Completion scope and the `WalkContext` extension

ADR-0030 §8 promises that "ambient assistance comes for free" because SQL is grammar in the unified tree. For Phase 1's single-table `SELECT` this was substantially true: the existing `WalkContext::current_table` mechanism (populated via the `writes_table: true` flag on the `FROM` table-name slot) gave `WHERE` and `ORDER BY` column-name completion against the right table at no incremental cost.

Phase 2 breaks the "free" claim. Multiple `FROM` tables via `JOIN`s, aliases, CTE-defined table sources, subqueries with their own `FROM` scope, qualified `t.c` references, projection aliases referenced in `ORDER BY` — every Phase-2 surface needs **scope information that `WalkContext` does not currently carry**. §9's "no new walker capability" claim holds for grammar recursion (`Subgrammar` and the depth cap suffice); for completion scope it is too strong, and is softened here to an honest split.

The current `WalkContext` carries one table at a time (`current_table: Option<…>` + `current_table_columns`), set by `writes_table: true` on a `Tables` identifier. DSL paths (`update T`, `delete from T`, `insert into T`) rely on this single-table contract and continue to work unchanged. Phase 2 adds layered accumulators alongside, not in place of.

#### 10.1. The from-scope accumulator

A new `WalkContext` field:

```
from_scope: Vec<TableBinding>

TableBinding {
    table:   String,
    alias:   Option<String>,
    columns: Vec<…>,
}
```

Populated incrementally as the walker descends through `from_clause` and each `join_clause` (§1). The first table-source slot pushes a binding; every subsequent `JOIN` pushes another. `Ident` slots whose `IdentSource` is `Columns` now resolve against the union of every binding's columns, with deduplication.

`current_table` / `current_table_columns` remain as derived helpers: when `from_scope.len() == 1`, they expose that single binding's data, preserving the contract every existing DSL path relies on. DSL `UPDATE` / `DELETE` / `INSERT` continue to push exactly one binding via the existing `writes_table: true` mechanism, unchanged.

#### 10.2. Scope-stack discipline at `Subgrammar` boundaries

Subqueries (§6) and CTE bodies (§4) introduce new lexical scopes. A column reference inside `SELECT … WHERE id IN (SELECT id FROM u)` resolves first against the inner `SELECT`'s `FROM` (`u`), and — for correlation — also against the outer scope.

`subgrammar_depth` is a counter; it suffices for §9's depth cap but not for scope. Phase 2 layers a stack on top. A new field:

```
from_scope_stack: Vec<ScopeFrame>

ScopeFrame {
    from_scope:         Vec<TableBinding>,
    cte_bindings:       Vec<CteBinding>,
    projection_aliases: Vec<String>,
}
```

The new walker node variant — `Node::ScopedSubgrammar(&Node)` — is what triggers a scope push. It is a sibling of the existing `Node::Subgrammar(&Node)`, with the same recursion semantics (reference-following, depth-counted) and one additional driver behaviour: on entry, push the current `ScopeFrame` onto `from_scope_stack` and start a fresh empty frame; on exit, pop back.

The existing `Node::Subgrammar` variant is unchanged — DSL `Expr` recursion (ADR-0026) and the `sql_expr.rs` precedence-ladder recursion (ADR-0031) keep using it and never push a scope.
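
The push/pop discipline can be sketched as follows. This is a minimal illustration under stated assumptions, not the walker's implementation: `ScopeFrame`, `TableBinding`, and `from_scope_stack` follow this section, but the `current` field, the `push_scope` / `pop_scope` / `column_candidates` methods, the `binding` helper, and the choice of plain `String` column names are hypothetical. The candidate ordering shown (current frame first) is one possible choice; the ADR leaves ordering unspecified.

```rust
/// Simplified stand-in for the `TableBinding` of §10.1.
#[derive(Debug, Clone, PartialEq)]
struct TableBinding {
    table: String,
    alias: Option<String>,
    columns: Vec<String>, // assumption: column names only
}

/// One lexical scope (CTE and alias accumulators omitted for brevity).
#[derive(Debug, Default)]
struct ScopeFrame {
    from_scope: Vec<TableBinding>,
}

#[derive(Debug, Default)]
struct WalkContext {
    current: ScopeFrame,               // hypothetical "active frame" field
    from_scope_stack: Vec<ScopeFrame>, // parked outer frames
}

impl WalkContext {
    /// On `ScopedSubgrammar` entry: park the current frame, start a fresh one.
    fn push_scope(&mut self) {
        let outer = std::mem::take(&mut self.current);
        self.from_scope_stack.push(outer);
    }

    /// On `ScopedSubgrammar` exit: discard the inner frame, restore the outer.
    fn pop_scope(&mut self) {
        if let Some(outer) = self.from_scope_stack.pop() {
            self.current = outer;
        }
    }

    /// Deduplicated column candidates: the current frame, then outer
    /// frames (innermost first) so correlated references are offered too.
    fn column_candidates(&self) -> Vec<String> {
        let mut out: Vec<String> = Vec::new();
        let frames = std::iter::once(&self.current).chain(self.from_scope_stack.iter().rev());
        for frame in frames {
            for binding in &frame.from_scope {
                for col in &binding.columns {
                    if !out.contains(col) {
                        out.push(col.clone());
                    }
                }
            }
        }
        out
    }
}

/// Hypothetical helper for building a binding without an alias.
fn binding(table: &str, columns: &[&str]) -> TableBinding {
    TableBinding {
        table: table.to_string(),
        alias: None,
        columns: columns.iter().map(|c| c.to_string()).collect(),
    }
}
```

For `SELECT … FROM t WHERE id IN (SELECT id FROM u)`, the inner frame holds only `u` while the subquery is being walked, yet the candidates still include `t`'s columns from the parked outer frame. After the pop, only `t`'s columns remain.
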
The grammar source spells the choice explicitly at each call site: subqueries in `sql_expr.rs` and CTE bodies in `sql_select.rs` reference the compound-SELECT through `Node::ScopedSubgrammar(&SQL_SELECT_COMPOUND)`; predicate-ladder recursion in `sql_expr.rs` continues to use `Node::Subgrammar(&SQL_OR_EXPR)`. Self-documenting, no flag bookkeeping, and the walker change is localised to one extra arm in the driver's `match` over `Node` variants.

Column-completion candidates inside a scope frame are the union of the current frame's `from_scope` and (for correlated refs) all outer frames; outer-frame columns are admitted as additional candidates so correlated references work. Ordering or visual differentiation between current-frame and outer-frame candidates is completion-tier polish and is not specified by this ADR — the current completion API (`candidates_at_cursor*`) returns a flat `Vec`, and adding a priority dimension is a separate concern. CTE bindings resolve the same way (outward-walking) — a CTE defined in an outer query is visible inside an inner subquery as a table source, unless the inner subquery defines a CTE of the same name and shadows it.

This is the one explicit walker-capability extension Phase 2 makes. It is scoped: one new node variant, no new walker entry point, no change to how Subgrammar bodies are entered structurally. The depth cap (§9) applies to both variants uniformly through the shared `subgrammar_depth` counter.

#### 10.3. CTE bindings

A frame-local accumulator carries CTE definitions visible in the current scope:

```
cte_bindings: Vec<CteBinding>

CteBinding {
    name:    String,
    columns: Vec<CteColumn>,
}

CteColumn {
    name:  Option<String>,  // None for unnamed computed projections
    type_: Option<…>,       // resolved playground type, if derivable
}
```

A CTE definition `cte_name [(col-list)] AS (compound_select)` produces a binding in two stages:

1. **Pre-body push** (so `WITH RECURSIVE` self-references resolve).
   When the walker reaches `AS` and is about to enter the body's `Node::ScopedSubgrammar(&SQL_SELECT_COMPOUND)`, it pushes a placeholder binding into the *outer* frame's `cte_bindings` with `columns = []` (an empty stand-in). The CTE name is now visible as a table source from inside the body.
2. **Body-finalised harvest** (when the body's scope frame completes). On `ScopedSubgrammar` exit, before popping the frame, the driver derives the body's projection-list output columns (rules below) and rewrites the placeholder binding in the outer frame.

**Output-column derivation rules.** Walking the body's projection items:

| Projection item | Derived CTE column(s) |
|---|---|
| `*` | Every column from the body frame's `from_scope`, in order, with their resolved types |
| `t.*` (qualified wildcard) | Every column from binding `t` in the body frame's `from_scope`, with their types |
| `col` (bare ref, resolves uniquely) | One column: name = `col`, type = the resolved column's playground type |
| `t.col` (qualified ref) | One column: name = `col`, type = `t`'s column's type |
| `expr AS alias` or bare `expr alias` | One column: name = `alias`, type = the underlying type if `expr` is a single column ref, else `None` |
| `expr` (computed, no alias) | One column: name = `None`, type = `None` — engine assigns an implementation-defined name |

For compound bodies (`UNION` / `INTERSECT` / `EXCEPT`) the columns come from the **first leg** per standard SQL. For recursive CTE bodies (`WITH RECURSIVE`) the same rule — the non-recursive leg dictates.

If a `(col-list)` was supplied on the CTE name, it **renames** the derived columns positionally and overrides their names; types are preserved from the derivation.
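
The derivation table and the positional rename can be sketched in a few lines. This is a simplified illustration, not the harvest code: `CteColumn` follows §10.3, but `ProjectionItem`, `derive_columns`, `apply_col_list`, `lookup`, and `typed` are hypothetical names, bindings are matched by table name only (aliases omitted), and playground types are modelled as plain strings. What a caller does with an arity mismatch is deliberately left outside the sketch.

```rust
#[derive(Debug, Clone, PartialEq)]
struct CteColumn {
    name: Option<String>,
    type_: Option<String>, // assumption: playground type as a plain string
}

#[derive(Debug, Clone)]
struct TableBinding {
    table: String,
    columns: Vec<(String, String)>, // (column name, playground type)
}

/// Hypothetical, simplified view of one projection item after the walk.
enum ProjectionItem {
    Star,
    QualifiedStar(String),
    Column { qualifier: Option<String>, name: String, alias: Option<String> },
    Computed { alias: Option<String> },
}

fn typed(c: &(String, String)) -> CteColumn {
    CteColumn { name: Some(c.0.clone()), type_: Some(c.1.clone()) }
}

/// Type of a column ref, or `None` when it is unknown or ambiguous.
fn lookup(from_scope: &[TableBinding], qualifier: Option<&str>, name: &str) -> Option<String> {
    let mut hits = from_scope
        .iter()
        .filter(|b| qualifier.map_or(true, |q| b.table == q))
        .flat_map(|b| b.columns.iter())
        .filter(|(n, _)| n == name);
    match (hits.next(), hits.next()) {
        (Some((_, ty)), None) => Some(ty.clone()),
        _ => None,
    }
}

/// Apply the derivation table row by row.
fn derive_columns(items: &[ProjectionItem], from_scope: &[TableBinding]) -> Vec<CteColumn> {
    let mut out = Vec::new();
    for item in items {
        match item {
            ProjectionItem::Star => {
                for b in from_scope {
                    out.extend(b.columns.iter().map(typed));
                }
            }
            ProjectionItem::QualifiedStar(q) => {
                for b in from_scope.iter().filter(|b| &b.table == q) {
                    out.extend(b.columns.iter().map(typed));
                }
            }
            ProjectionItem::Column { qualifier, name, alias } => {
                let type_ = lookup(from_scope, qualifier.as_deref(), name);
                let shown = alias.clone().unwrap_or_else(|| name.clone());
                out.push(CteColumn { name: Some(shown), type_ });
            }
            ProjectionItem::Computed { alias } => {
                out.push(CteColumn { name: alias.clone(), type_: None });
            }
        }
    }
    out
}

/// Positional rename by an optional `(col-list)`; types are preserved.
/// Returns `Err((declared, actual))` when the arities differ.
fn apply_col_list(
    derived: Vec<CteColumn>,
    col_list: Option<&[&str]>,
) -> Result<Vec<CteColumn>, (usize, usize)> {
    let Some(names) = col_list else { return Ok(derived) };
    if names.len() != derived.len() {
        return Err((names.len(), derived.len()));
    }
    Ok(derived
        .into_iter()
        .zip(names)
        .map(|(col, name)| CteColumn { name: Some(name.to_string()), type_: col.type_ })
        .collect())
}
```

With a body such as `SELECT id, orders.total AS amount, 1 + 1 FROM orders`, the sketch yields two typed, named columns and one unnamed, untyped slot. A `(a, b, c)` column list then names all three while keeping the first two types.
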
If the column-count of `(col-list)` disagrees with the body's projection arity, the grammar admits this and the engine surfaces the mismatch — `do_run_select`'s engine-neutral error layer carries the message (ADR-0030 §9, ADR-0019).

**Completion past `cte_alias.|`.** Where the derivation produced named columns (every form above except computed-no-alias), they complete with their names and (where typed) participate in §11's result-type resolution if the CTE's columns are projected upstream. Where the derivation produced an unnamed column slot, that slot is silently skipped from the qualified-prefix candidate list — the user typing `cte.|` past it sees only the nameable columns. The cure for "I want my expression to be referenceable from outside the CTE" is to add an alias, which is the same cure the engine itself enforces at execution time.

This is substantially better than the earlier "honest limitation" posture: the common `SELECT *` body is fully resolvable; explicit projections are resolvable; only un-aliased computed columns elude us, and the right learner response there is the same as the engine's right learner response — write an alias.

`cte_bindings` lives on the scope frame, so a CTE defined in an outer query is visible inside an inner subquery as a table source unless that subquery defines a CTE of the same name (which shadows it, per standard SQL).

#### 10.4. Projection-alias bindings

Standard SQL admits `ORDER BY` referencing a SELECT-list alias: `SELECT a + b AS total FROM t ORDER BY total`. A third frame-local accumulator:

```
projection_aliases: Vec<String>
```

Each `projection_item`'s optional alias (whether `AS x` or bare `x` — see §1) appends its name. `Ident` slots inside the trailing `ORDER BY`'s `sql_expr`s offer projection aliases as additional candidates alongside column names. This also settles the completion behaviour of the bare aliases admitted in §1.
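
A minimal sketch of the clause-sensitive candidate rule, assuming a flattened view of the frame: `Clause`, `ident_candidates`, and the `columns` field are hypothetical names for illustration, and only `projection_aliases` is taken from this section.

```rust
/// Hypothetical tag for the clause the cursor is in.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Clause {
    Where,
    GroupBy,
    Having,
    OrderBy,
}

/// Simplified frame: `columns` stands in for the flattened `from_scope`.
struct ScopeFrame {
    columns: Vec<String>,
    projection_aliases: Vec<String>,
}

/// Column candidates for an `Ident` slot. Projection aliases are
/// offered only inside `ORDER BY`, after the column names.
fn ident_candidates(frame: &ScopeFrame, clause: Clause) -> Vec<String> {
    let mut out = frame.columns.clone();
    if clause == Clause::OrderBy {
        for alias in &frame.projection_aliases {
            if !out.contains(alias) {
                out.push(alias.clone());
            }
        }
    }
    out
}
```

For `SELECT a + b AS total FROM t ORDER BY |`, this offers `a`, `b`, and `total`; at a `WHERE` cursor in the same statement it offers only `a` and `b`.
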
The accumulator is not consulted inside `WHERE`, `GROUP BY`, or `HAVING` — standard SQL forbids alias references there (aliases are not yet bound at evaluation time). The grammar admits them structurally regardless; the engine rejects; ADR-0019 renders the engine-neutral error.

#### 10.5. Qualified-prefix completion

§5 fixed the grammar for `t.c` references. The completion behaviour at qualified positions:

- At an `Ident` cursor with **no prefix**, candidates are the union of every `from_scope` binding's columns, plus `projection_aliases` when in `ORDER BY`, deduplicated. CTE-name candidates apply only in table-source slots, not column slots.
- At an `Ident` cursor immediately after `prefix '.'`, candidates are **scoped**: resolve `prefix` against the active `from_scope` (preferring alias matches over table matches, since aliases shadow), and offer that binding's columns alone. If `prefix` doesn't resolve to a binding, the candidate list is empty — the walker's expected-set still surfaces the syntactic alternatives (the user sees no column candidates but the structural error message reports the unresolved prefix).

The qualified-prefix narrowing is a small extension to the existing `IdentSource::Columns` handling: when the matched-path immediately preceding the `Ident` ends with `Ident '.'`, the completer is told the prefix and narrows accordingly. This is the only completion-source-level change; the rest is data flowing through the new accumulators.

#### 10.6. The projection-before-FROM problem

Standard SQL writes projection **before** `FROM`. A user typing `select col1, col2 from mytable` produces, mid-typing, a state where the projection list has been parsed but the `FROM` has not. At that point the column-name completer cannot scope to `mytable` — it does not know `mytable` is coming. Validation and highlighting face the same problem: `col1` and `col2` cannot be checked as belonging to `mytable` until the user types `from mytable`.
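
The ordering problem can be made concrete with a toy single-pass resolver. Everything here is hypothetical and far simpler than the walker: `Event` and `single_pass` are illustrative names, and the scope is reduced to a list of column names that grows when the `FROM` table is reached.

```rust
/// Hypothetical walk events, in source order.
enum Event<'a> {
    ProjectionIdent(&'a str),
    FromTable { columns: &'a [&'a str] },
}

/// Resolve each projection identifier against the scope known at the
/// moment it is seen. Returns (identifier, resolved?) pairs.
fn single_pass(events: &[Event]) -> Vec<(String, bool)> {
    let mut scope: Vec<&str> = Vec::new();
    let mut resolved = Vec::new();
    for ev in events {
        match ev {
            Event::ProjectionIdent(name) => {
                // The scope is still empty here: `FROM` has not been walked.
                resolved.push((name.to_string(), scope.contains(name)));
            }
            Event::FromTable { columns } => scope.extend(columns.iter().copied()),
        }
    }
    resolved
}
```

Feeding it the events for `select col1, col2 from mytable` marks both identifiers as unresolved, even though `mytable` has both columns, because the scope is filled only after the projection has been processed.
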
The debounced re-walk on every keystroke (ADR-0027) is **not** sufficient on its own to fix this in a single-pass walker, because by the time the FROM is parsed, the projection identifiers have already been resolved (left-to-right) against the only scope information available at that moment — the empty `from_scope`.

There is no fully satisfying single-pass answer. Phase 2's posture is therefore explicit:

1. **During-typing completion** of projection-list column names, when `from_scope` is empty (no `FROM` yet), uses the unioned `SchemaCache.columns` — every column known to the schema — as the candidate set. This is the same global fallback Phase 1 uses and remains the right behaviour: a noisier-but-useful completion is better than no completion.

2. **A post-walk fixup pass** re-evaluates projection-list column refs against the *final* `from_scope` after the walk completes. The walk records each projection `Ident`'s span and matched-path location; once the walk reaches end-of-input (or end-of-statement), the fixup walks the recorded list, looks up each identifier against the final `from_scope`, and:

   - **Rewrites the highlight class** on that terminal — downgrading "column" → "unknown identifier" when the identifier doesn't belong to any in-scope binding, upgrading "unknown identifier" → "column" when it does.
   - **Updates the diagnostic** for the validity indicator (ADR-0027) — a column-not-found ERROR either appears or disappears based on the post-walk scope.

   **Integration point.** The fixup runs as the **final stage of the walk itself**, after all grammar nodes have been processed but before `WalkResult` is returned to the caller. It mutates the walker's accumulated highlight runs and diagnostics vector in place, so the consumer (the renderer, the validity indicator) sees a single coherent snapshot.
   This keeps the walker the single source of truth for what reaches the renderer — the fixup is conceptually part of "what the walker produces", not a separate post-processing layer. The same convention applies to the §11.6 SQL-expression predicate warnings, which also run as a final walk stage.

3. The fixup runs on every debounced re-walk (ADR-0027 already triggers the full walk per keystroke), so the user observes: typing `col1, col2 from mytable`, the `col1` / `col2` initially highlight as generic identifiers (with a soft warning if not found anywhere in the schema); the moment `mytable` is typed, the highlight snaps to the column class if `col1` / `col2` belong to `mytable`, or to the unknown-identifier diagnostic if they don't — within one debounce cycle.

The fixup pass does not re-parse; it only re-resolves identifiers against the final `from_scope`.

`ORDER BY` alias resolution needs no fixup. Projection precedes `ORDER BY` in walk order, so `projection_aliases` is fully populated by the time the walker reaches an `ORDER BY` `Ident`; the alias-as-column-candidate is resolved in the single forward pass.

This is the answer to the user's "I think this may be automatic" intuition: the debounced re-walk is automatic; the post-walk fixup pass is the new infrastructure that makes the re-walk produce *correct* results. Without it, projection-list column refs would forever validate against the global column set even after the `FROM` is typed.

#### 10.7. The honest split

§9 still holds for **grammar recursion**: `Subgrammar` and the depth cap are reused unchanged. For **completion scope**, this section introduces:

- New `WalkContext` fields: `from_scope`, `from_scope_stack`, `cte_bindings`, `projection_aliases`.
- Scope push/pop discipline at `SQL_SELECT_COMPOUND` boundaries, driven by the `Node::ScopedSubgrammar` variant (§10.2), so DSL Subgrammars are unaffected.
- A qualified-prefix narrowing in the `IdentSource::Columns` completion path.
- A post-walk fixup pass for projection-list identifier highlighting and validity (§10.6).

These are real walker-contract extensions. They are scoped: one new node variant (§10.2), no new walk-driver entry points, no changes to how Subgrammar bodies are entered structurally. The existing DSL paths are unaffected — their grammars never push a SELECT scope, never define a CTE, never carry projection aliases — and the single-table `current_table` / `current_table_columns` view is preserved as a derived helper.

§9's claim is therefore restated honestly: **grammar recursion needs no new walker capability; completion scope needs the additions above.**

### 11. Diagnostics for Phase-2 validation cases

ADR-0027 fixes the warning-vs-error guideline verbatim:

> **ERROR** — the input is *known* to fail. Either it does not parse (incomplete, or a mismatched / invalid token), or it parses but names something that does not exist (an unknown table or column).
>
> **WARNING** — the input is valid and *will* run, but is very likely not what a knowledgeable user wants: a type-mismatched comparison, or `= NULL` (both from ADR-0026 §7). Amendment 1 adds a third trigger — `LIKE` against a numeric column.
>
> The split is *certainty of failure* versus *likely misleading*.

This section walks the Phase-2 surface case-by-case, classifies each against that guideline, and identifies the diagnostic machinery additions needed. It also flags a Phase-1 carry-over gap (§11.6) that Phase 2 closes.

#### 11.1. Existing diagnostics, briefly

Two post-walk passes today (`src/dsl/walker/mod.rs`):

- **Schema-existence pass** (ERROR). Walks the `MatchedPath`, checks every `IdentSource::Tables` / `IdentSource::Columns` ident against `SchemaCache`. Emits `diagnostic.unknown_table` and `diagnostic.unknown_column`. Today this assumes a single `current_table` for column resolution.
- **Expression predicate-warnings pass** (WARNING). Walks the parsed DSL `Expr` AST emitted by `expr.rs`'s builder.
Emits `diagnostic.eq_null`, `diagnostic.type_mismatch`, `diagnostic.like_numeric`. Runs only on WHERE expressions in the DSL. Phase 2 extends both, and §11.6 fills a SQL-side gap. #### 11.2. Phase-2 new ERROR cases Every case below is "known to fail on the engine" — the engine would surface a message the friendly-error layer would translate (ADR-0019). Surfacing them as pre-flight ERROR diagnostics gives the learner the answer one debounce cycle faster, with the walker as the single source of truth. - **Unknown table in any `FROM`/`JOIN` slot.** The existing schema-existence pass extends from "the one `current_table`" to walking every `from_scope` binding's `table` and emitting `diagnostic.unknown_table` per unresolved name. CTE-name slots in the active `cte_bindings` are valid table sources and exempt from this check. - **Unknown CTE-as-table.** A table-source slot whose name is not in `SchemaCache.tables` *and* not in the active `cte_bindings` chain emits `diagnostic.unknown_table` (same catalog key — from the learner's perspective the engine message is the same; the slot is a "table that doesn't exist", whether they meant a CTE or a base table). - **Unknown table or alias in a qualified column reference** (`t.c` where `t` doesn't resolve in the active `from_scope`). New catalog key `diagnostic.unknown_qualifier` `{qualifier}`. - **Unknown column in a qualified reference** (`t.c` where `t` resolves but `c` is not a column of that binding). Reuses `diagnostic.unknown_column` with the column name in context. - **Ambiguous unqualified column reference** — a column name used unqualified that exists in two or more `from_scope` bindings. The engine raises "ambiguous column name"; we surface it as ERROR with a new catalog key `diagnostic.ambiguous_column` `{column}, {qualifiers}` so the learner sees which two tables the name appeared in. 
- **Reference to a projection alias in `WHERE` / `GROUP BY` / `HAVING`.** Standard SQL forbids it (aliases are not bound at evaluation time). The grammar admits the identifier structurally; a new diagnostic pass emits ERROR with a new catalog key `diagnostic.projection_alias_misplaced` `{alias}, {clause}`. - **CTE column-list arity mismatch.** When `cte_name (col1, col2, …) AS (compound_select)` declares N columns and the body's projection (§10.3) derives M columns with N ≠ M, the CTE harvest pass (§10.3 stage 2) emits ERROR with a new catalog key `diagnostic.cte_arity_mismatch` `{cte}, {declared}, {actual}`. - **Compound-query column-count mismatch.** When a `UNION` / `INTERSECT` / `EXCEPT` chain has legs whose projection arities differ, the engine errors at execution. Phase 2 catches it pre-flight: each leg's derived arity (the same derivation the CTE harvest uses) is compared as the compound is assembled. ERROR with a new catalog key `diagnostic.compound_arity_mismatch` `{op}, {left_n}, {right_n}`. - **Internal-table reference in any new table-source slot.** Already a parse-time rejection via `reject_internal_table` (§1, §4) — surfaces as a parse error, not a post-walk diagnostic. Listed here for completeness: the catalog key `select.internal_table` authored in Phase 1 covers every Phase-2 slot too. #### 11.3. Phase-2 new WARNING cases The existing WARNING set (`= NULL`, type-mismatched comparison, `LIKE`-on-numeric) is the right set. Phase-2 adds **no new WARNING categories** — every Phase-2-specific case falls into ERROR (§11.2) or engine-rejected (§11.4). Considered and rejected as WARNINGs: - **CTE name shadowing a base table.** Standard SQL behaviour; often intentional (the canonical "filter to a subset, then query as if it were the base table" pattern). No diagnostic. - **Correlated reference without explicit qualification.** Correlation is implicit in standard SQL; per the user guideline a knowledgeable user does want this. 
The walker validates the reference silently against the outer-frame scope; no warning, no diagnostic. - **Unused CTE.** A CTE defined in `WITH` but never referenced. The engine ignores it; many learners write CTEs as intermediate scratch space. Not a warning. #### 11.4. Engine-rejected (no diagnostic) These fail on the engine and surface via ADR-0019's friendly-error layer at execution time. The walker does not attempt pre-flight detection because: - **Non-aggregated columns in projection with `GROUP BY`** — detecting requires knowing which function names are aggregates; ADR-0030 §13 OOS-3 / ADR-0031 §6 keep us allowlist-free. - **Aggregate function in `WHERE`** — same reason. - **Scalar subquery returning multiple rows** — semantic, not syntactic; requires execution. - **Recursive CTE without a `UNION`** — requires inspection of the body's compound shape against the recursive contract; doable in principle, deferred as engine territory. - **Duplicate CTE names within the same `WITH`** — checkable in principle (walking `cte_bindings` for duplicates), but the engine catches it cleanly. Phase 2 does not pre-flight it; could be added later if its absence proves confusing. - **Type-mismatched JOIN ON predicates** — the existing expression type-mismatch warning (extended per §11.6) handles the explicit-literal case; arbitrary-expression cases require type inference and stay engine-side. #### 11.5. Catalog additions Phase 2 adds the following message-catalog keys (ADR-0019). Every key is engine-neutral by construction. 
Parse-time-detectable (post-walk diagnostic passes):

| Key                                     | Slots                         |
|-----------------------------------------|-------------------------------|
| `diagnostic.unknown_qualifier`          | `{qualifier}`                 |
| `diagnostic.ambiguous_column`           | `{column}, {qualifiers}`      |
| `diagnostic.projection_alias_misplaced` | `{alias}, {clause}`           |
| `diagnostic.cte_arity_mismatch`         | `{cte}, {declared}, {actual}` |
| `diagnostic.compound_arity_mismatch`    | `{op}, {left_n}, {right_n}`   |

Engine-error translations (friendly-error layer; reached on execution failure):

| Key                                    | Engine cause |
|----------------------------------------|--------------|
| `engine.no_such_table`                 | `no such table: ` (post-execution path) |
| `engine.no_such_column`                | `no such column: ` (post-execution path) |
| `engine.ambiguous_column`              | `ambiguous column name: ` |
| `engine.aggregate_misuse`              | `misuse of aggregate function ()` |
| `engine.group_by_required`             | `column must appear in the GROUP BY clause or be used in an aggregate function` (or equivalent) |
| `engine.compound_arity_mismatch`       | `SELECTs to the left and right of UNION do not have the same number of result columns` (or equivalent) |
| `engine.scalar_subquery_too_many_rows` | scalar subquery cardinality violation |
| `engine.recursive_cte_malformed`       | recursive CTE shape errors |

The parse-time keys and the engine keys are intentionally separate even when they describe the same situation (`engine.ambiguous_column` mirrors `diagnostic.ambiguous_column`): the parse-time message can include the learner's typed text and span; the engine-time message catches what the parser missed and routes through the friendly-error layer with whatever context the engine yielded.

Three pre-existing parse-time keys are reused unchanged for Phase-2 slots: `diagnostic.unknown_table`, `diagnostic.unknown_column`, and the Phase-1 `select.internal_table`.

#### 11.6. The Phase-1 SQL-expression predicate-warning gap

ADR-0027 Amendment 1's `LIKE`-on-numeric warning, and ADR-0026 §7's `= NULL` and type-mismatch warnings, are emitted by a pass that walks the **DSL** `Expr` AST. Phase 1's `sql_expr.rs` deliberately builds **no AST** (ADR-0031 §2). The consequence is a Phase-1 carry-over gap: **SQL `WHERE` expressions today emit none of these warnings**. For example, `select * from t where name like 5` parses, the engine runs it, and the learner gets the engine's verdict, not the friendly pre-flight nudge ADR-0027 Amendment 1 promised.

Phase 2 closes this. The predicate-warnings pass gains a **MatchedPath-walking variant** that runs over the SQL expression nodes and identifies the predicate shapes structurally (a `LIKE` predicate-tail with a column-ref left operand; a `=`/`!=` predicate-tail with a `NULL` literal operand; a comparison predicate-tail with a column-literal operand pair of mismatched types). It does not need an `Expr` AST because the matched-path terminals carry both the byte spans (for the diagnostic) and the node-name labels (for shape identification). The same catalog keys (`diagnostic.eq_null`, `diagnostic.type_mismatch`, `diagnostic.like_numeric`) apply unchanged; only the pass implementation is new.

The MatchedPath-walking pass runs over **every** Phase-2 `sql_expr` slot (`WHERE`, `HAVING`, `ON`, `CASE` branches, projection items, `ORDER BY` items), so warnings surface uniformly across the SQL surface rather than just `WHERE`. This is a strict improvement over Phase 1's behaviour, where even Phase-1 SELECT WHERE expressions got no predicate warnings.

Type-resolution for the MatchedPath-walking pass: a column ref's type comes from §10's `from_scope` (or, for `t.c`, the specific binding); a literal's type comes from its lexical class.
When the column ref doesn't resolve (the schema-existence ERROR pass will already have flagged it), the warning pass skips the predicate; there is no point compounding diagnostics on an already-broken reference.

#### 11.7. Mechanism summary

Three diagnostic passes by end of Phase 2, all running as final stages of the walk (per §10.6's integration-point convention):

1. **Schema-existence ERROR pass**: extended from single `current_table` to walking every `from_scope` binding and the active `cte_bindings`. Adds the qualified-reference and ambiguity checks (§11.2).
2. **Arity-check ERROR pass** (new): runs at CTE-body and compound-query frame-exits (the same `ScopedSubgrammar` exit hook §10.3 uses), comparing declared vs derived column counts.
3. **Predicate-warnings pass**: extended with a MatchedPath-walking variant for `sql_expr` (§11.6) covering `= NULL`, type mismatch, and `LIKE`-on-numeric across every SQL expression slot, in addition to the existing DSL `Expr` AST variant for DSL expressions.

Per the integration-point convention (§10.6), each pass mutates the walker's accumulated highlight runs and diagnostics in place; the consumer sees a single coherent snapshot. The projection-list fixup of §10.6 is conceptually part of pass (1): it is the same "re-resolve identifier against final scope" operation, applied to the small subset of identifiers whose scope wasn't fully known at first-pass walk time.

### 12. Result-column type resolution

Phase 1's untyped `column_types` vector is partially lifted: where a projection item is structurally a single column reference, the worker resolves it back to the source column's playground type (ADR-0005) and populates that slot in `DataResult.column_types`. Everything else stays `None`. This addresses Phase-1 autonomous decision §4.5 (bool SELECT results render as `0`/`1`): a bare `bool` column now renders as `true` / `false` again, alignment recovers, and the `show data` rendering path is reached for the common case.
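To make the effect of a populated slot concrete, here is a minimal sketch of a type-aware cell formatter. Everything in it is an assumption made for illustration: `PlaygroundType` models only two of the ten ADR-0005 types, and `render_cell` with its raw `i64` cell is a hypothetical stand-in, not the project's data-table renderer.

```rust
/// Hypothetical subset of the playground type vocabulary (ADR-0005).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PlaygroundType {
    Bool,
    Int,
}

/// Formats one integer-valued cell.
/// `ty` is the slot from `column_types` for this result column:
/// `Some(Bool)` recovers the `true` / `false` rendering, while `None`
/// (a typeless slot) falls back to printing the raw stored value.
pub fn render_cell(raw: i64, ty: Option<PlaygroundType>) -> String {
    match ty {
        Some(PlaygroundType::Bool) => (raw != 0).to_string(),
        Some(PlaygroundType::Int) | None => raw.to_string(),
    }
}
```

With this shape, `render_cell(1, Some(PlaygroundType::Bool))` yields `true`, while the same stored value in a typeless slot yields `1`, which is the Phase-1 behaviour that §4.5 recorded.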
**Resolution rule.** A projection item is "structurally a single column reference" when, after stripping an optional `[ AS ] alias`, its expression is one of: - An unqualified identifier (`Name`) that resolves uniquely to a single column across the FROM tables; - A qualified reference (`t.c` / `alias.c`) that resolves unambiguously through the FROM aliases. Anything else — function calls, arithmetic, `CASE`, literals, subquery expressions, the `*` and `t.*` wildcards — keeps `column_types[i] = None`. When resolution is ambiguous (unqualified column name appears in two FROM tables) the grammar admits it (engine resolves or errors); the type-resolver returns `None` and the renderer falls back to neutral alignment. **Implementation seam.** The strongly preferred mechanism is **engine-side column-origin lookup**: after preparing the statement, query the prepared statement for each result column's underlying table and column. The engine knows authoritatively which result columns are direct references and which are expressions; for direct references it returns the source table+column, for expressions it returns nothing. This avoids re-parsing the source or adding structured projection-item data to the `MatchedPath` — the grammar tier is not involved at all, which preserves ADR-0031 §2's "no AST" decision and stays on the right side of ADR-0030's "one source of truth" rule. The Phase-2 implementer verifies that the rusqlite version pinned in `Cargo.toml` exposes this metadata (the SQLite C API calls are `sqlite3_column_table_name` / `sqlite3_column_origin_name` — they have been stable for two decades; rusqlite either exposes them directly or via the underlying `*mut sqlite3_stmt` handle). If exposure turns out to be awkward, the fallback is a small post-parse walk over the projection-item subtrees in the `MatchedPath` — strictly worse because it duplicates a slice of parsing, but available. 
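The resolution rule above distinguishes three outcomes for a column name: it resolves uniquely, it is ambiguous across FROM bindings, or it does not resolve at all. The following is a minimal sketch of that classification, not the project's implementation. `Binding`, `Resolution`, and `resolve_column` are hypothetical names, and the comparison is a plain case-sensitive string match, which is a simplification.

```rust
/// Hypothetical FROM binding: the name a query refers to it by
/// (the alias if present, otherwise the table name) and its columns.
pub struct Binding {
    pub name: String,
    pub columns: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// Exactly one binding owns the column; carries that binding's name.
    Unique(String),
    /// Two or more bindings own the column; carries every owner's name.
    Ambiguous(Vec<String>),
    /// No binding owns the column, or the qualifier names no binding.
    NotFound,
}

/// Resolves `column` against the FROM scope.
/// `qualifier` is `Some("t")` for `t.c` and `None` for a bare `c`.
pub fn resolve_column(scope: &[Binding], qualifier: Option<&str>, column: &str) -> Resolution {
    let owners: Vec<String> = scope
        .iter()
        // A qualifier restricts the search to the binding it names.
        .filter(|b| qualifier.map_or(true, |q| b.name == q))
        // Keep only bindings that actually have the column.
        .filter(|b| b.columns.iter().any(|c| c == column))
        .map(|b| b.name.clone())
        .collect();
    match owners.len() {
        0 => Resolution::NotFound,
        1 => Resolution::Unique(owners.into_iter().next().unwrap()),
        _ => Resolution::Ambiguous(owners),
    }
}
```

Under this sketch only `Resolution::Unique` would lead to a recovered type; the `Ambiguous` and `NotFound` outcomes both leave the slot at `None`, matching the rule that the type-resolver returns `None` when resolution is not unambiguous.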
The resolution pass adds one method on `Database` (something like `resolve_select_column_types`) called from `do_run_select` before the `DataResult` is shipped. It takes the prepared statement and the active `SchemaCache`, and returns one optional playground type per result column (a `Vec` of `Option`s). The renderer needs no change: `None` slots already render as typeless.

This is the only execution-path change Phase 2 makes; everything else routes through Phase 1's grammar-as-text execution.

### 13. Out of scope

- **OOS-1. Derived tables in `FROM`**: `FROM (SELECT …) [AS] alias`. The same shapes are reachable via CTEs (§4), which Phase 2 ships. Derived tables in `FROM` are not authored here.
- **OOS-2. `NATURAL JOIN` and `JOIN … USING (col)`.** Both are convenience forms. NATURAL is widely considered a footgun; USING is cleaner but adds grammar weight without lifting any expressive ceiling. Out.
- **OOS-3. Comma-list `FROM t1, t2` (implicit cross join).** Out. `CROSS JOIN` covers the same shape explicitly.
- **OOS-4. `LIMIT m, n` (the legacy comma form).** Out (§8).
- **OOS-5. Window functions** (`OVER (…)`, `PARTITION BY`, window-frame syntax). A meaningful learning topic, but a large surface of its own and out of ADR-0030's commissioned set.
- **OOS-6. `LATERAL` joins.** Not commissioned by ADR-0030.
- **OOS-7. `VALUES (…)` as a row source.** Not commissioned.
- **OOS-8. A function/aggregate allowlist.** ADR-0030 §13 OOS-3 / ADR-0031 §7 OOS-4 still apply: aggregate names parse generically through `name_or_call`.
- **OOS-9. Quoted identifiers** (`"column name"`). Tracked as ADR-0031 §7 OOS-3, still tracked.
- **OOS-10. Engine-checked aggregate correctness at parse time.** The grammar admits structurally; engine rejects semantically; ADR-0019 surfaces the engine's verdict in engine-neutral wording (§7).
- **OOS-11. Result-column type resolution beyond bare column refs.** Computed columns (`a + b`, `upper(name)`, `CASE …`) stay typeless (§12).
- **OOS-12. The `help sql` page and parse-error usage entries** for the Phase-2 surface. The grammar carries the `help_id`s authored in this phase, but the page content and the rich per-command usage messages are Phase 6 (ADR-0030 §10) and ADR-0021. Phase 2 leaves the same `help_id: None` shape Phase 1 used for `select`.

## Consequences

- A new grammar file, `src/dsl/grammar/sql_select.rs`, parallel to `sql_expr.rs`, exporting `pub static SQL_SELECT_STATEMENT: Node` and `pub static SQL_SELECT_COMPOUND: Node`. The Phase-1 `data::SELECT` `CommandNode` is rebuilt against `SQL_SELECT_STATEMENT` (its body becomes a `Subgrammar` reference); the `CommandNode` itself stays.
- **Phase-1 SQL `SELECT` grammar nodes migrate.** The Phase-1 static nodes that live in `src/dsl/grammar/data.rs` for the single-table SELECT (the projection, FROM, WHERE, ORDER-BY, LIMIT sub-trees) move into `sql_select.rs` as the starting point for the §1 productions; the file leaves only the `CommandNode` shell behind. The seven Phase-1 SQL `SELECT` integration tests are part of the safety net for this migration; they must continue to pass under the rebuilt grammar, in addition to the new Phase-2 integration tests authored in step 4 of the implementation notes.
- **Hint-panel prose** for the new clauses (JOIN flavours, ON, GROUP BY, HAVING, UNION / INTERSECT / EXCEPT, WITH, OFFSET, the qualified-prefix and CTE-prefix completion states) is authored at the structural level alongside each grammar node in step 1: a one-liner per slot, enough to drive the hint panel. Richer per-clause teaching prose and the `help sql` reference page remain ADR-0030 Phase 6 work (§13 OOS-12).
- **Walker cost is expected to stay proportional to source length.** The new accumulators are `O(bindings + aliases)` per frame; the scope stack is bounded by `MAX_SUBGRAMMAR_DEPTH = 64` (§9); the §10.6 post-walk fixup pass touches one entry per projection-list `Ident` (a small set).
  Each debounced keystroke (ADR-0027) walks once, fixes up once, and emits a single coherent output. No new pathological case is introduced; if a learner-realistic query produces a noticeable typing-time stall, measure first and revisit the recursion budget or the accumulator structure on evidence.
- `sql_expr.rs` gains three additive `Choice` branches and one additive tail on `name_or_call` (§5, §6). The existing tiers and the depth-cap discipline are unchanged. The Phase-1 tests continue to exercise the existing branches as they stand.
- **No new walker capability for grammar recursion** (§9). `Subgrammar`, the depth counter, the cap, and the friendly depth error are all reused unchanged, the same posture ADR-0031 took.
- `Command::Select { sql: String }` is unchanged. The validated source SQL is simply larger; the worker still routes it through `Database::run_select` and `do_run_select` (Phase 1 path).
- The worker gains a post-prepare type-resolution helper that populates `column_types` for direct-reference projection items (§12) via the engine's column-origin metadata. **`Cargo.toml` adds `column_metadata` to `rusqlite`'s feature list** (alongside the existing `bundled`); this pulls in the SQLite `SQLITE_ENABLE_COLUMN_METADATA` compile flag and exposes `RawStatement::column_table_name` / `column_origin_name` / `column_database_name` on the prepared statement. Verified against the project's pinned rusqlite 0.39.0. This is the only Phase-2 execution-path change.
- **Three diagnostic passes** (§11.7): schema-existence (extended), CTE/compound arity-check (new), and predicate warnings (extended with a MatchedPath-walking variant for `sql_expr`, §11.6). All run as final walk stages and mutate the walker's accumulated output in place. This closes the Phase-1 carry-over gap where SQL `WHERE` expressions emitted no `LIKE`-on-numeric / type-mismatch / `= NULL` warnings.
- **Catalog additions** (§11.5): five new `diagnostic.*` keys for parse-time-detectable cases and eight new `engine.*` keys for friendly-error layer translations of engine messages.
- The walker's `WalkContext` gains the completion-scope accumulators of §10: a `from_scope_stack: Vec<ScopeFrame>` whose top frame is the active `from_scope` / `cte_bindings` / `projection_aliases`. A **new node variant `Node::ScopedSubgrammar(&Node)`** (§10.2) is the trigger for push/pop; existing `Node::Subgrammar` is unchanged so DSL `Expr` and `sql_expr` recursion are unaffected. A post-walk fixup pass re-resolves projection-list identifier highlighting and validity once the final `from_scope` is known (§10.6). CTE output columns are derived from the body's projection list at body-frame exit, populating the binding back into the outer frame (§10.3), so `SELECT *` and explicit-projection CTE bodies both yield real column completion past `cte_alias.|`. This **softens §9's "no new walker capability" claim** for completion scope; grammar recursion still needs nothing new.
- `__rdbms_*` rejection extends to **every** table-source slot introduced by Phase 2: the `FROM` table, each `JOIN`'s table, each CTE name, and the `FROM` table inside any CTE body (§4, §6). The `reject_internal_table` validator is reused.
- Completion gains: SQL keywords for joins / set ops / `WITH` / `GROUP` / `HAVING` / `OFFSET` (all walker-derived, no bespoke code); column completion scoped to a qualified prefix `t.` resolves through the active `SchemaCache` (§5).
- Phase-1 autonomous decisions §4.1 and §4.3–§4.4 stand (optional `FROM`, `help_id: None`, walker-mode defaults). §4.2 is lifted (bare-alias projection admitted, §1). §4.5 is partially lifted (bare bool column refs recover their type via §12).
- `requirements.md`'s `Q1` / `Q2` advance further; `Q4` was already ticked by ADR-0030 and ADR-0031.

## Implementation notes

A build order, each step guarded by the test suite.
The phases within Phase 2 mirror the ADR-0030 / ADR-0031 staging — grammar first, execution-path change last. **Detailed plan: `docs/plans/20260520-adr-0032-phase-2.md`.** The notes below are the outline; the plan refines them into seven sub-phases (2a–2g) with per-gate exit criteria, a cross-cut verification matrix that explicitly tests every "X comes for free" claim from ADR-0030/0031/0032 (the kind of implicit claim that produced the Phase-1 SQL-expression predicate-warning gap §11.6 closes), and a final phase-exit verification report template. Implementers work through the plan; the ADR remains the decisions. 1. **The `sql_select.rs` grammar fragment.** Author the stratified tiers of §1 as named `static` `Node`s, recursion via `Subgrammar`. Export `SQL_SELECT_STATEMENT` and `SQL_SELECT_COMPOUND`. The existing `data::SELECT` `CommandNode` is rebuilt against `SQL_SELECT_STATEMENT`. 2. **Unit tests** against the fragment directly (the `expr.rs` / `sql_expr.rs` test pattern): JOIN flavours, GROUP BY / HAVING, qualified refs, every set-op, recursive and non-recursive CTEs, `LIMIT … OFFSET`, `DISTINCT`, `t.*` projection, the bare-alias projection, plus the keyword-case-insensitivity check. 3. **`sql_expr.rs` additive extensions** (§5, §6): the qualified-ref tail on `name_or_call`; the scalar-subquery `primary` branch; the `IN (subquery)` predicate-tail branch; the `EXISTS (subquery)` `primary` branch. Unit tests for each. 4. **Integration tests** (the `tests/` Tier-3 path, building on Phase 1's SQL `SELECT` tests): each JOIN flavour returns the expected rows; GROUP BY / HAVING aggregates over real data; `UNION` / `INTERSECT` / `EXCEPT` between two SELECTs; a non-recursive CTE; a recursive CTE (a small tree traversal or generated-sequence example); a scalar subquery in `WHERE`; `IN (SELECT …)`; `EXISTS (…)`; qualified refs resolving correctly. 5. **The `WalkContext` scope accumulators** (§10). 
Add the `ScopeFrame` type (`from_scope` / `cte_bindings` / `projection_aliases`) and the `from_scope_stack`; add the `Node::ScopedSubgrammar(&Node)` variant alongside the existing `Node::Subgrammar`; teach the driver to push/pop a fresh frame on `ScopedSubgrammar` entry/exit; rewrite every reference to `&SQL_SELECT_COMPOUND` from outside its own definition to use the new variant (subqueries in `sql_expr.rs`, CTE bodies in `sql_select.rs`); teach `from_clause` / `join_clause` to populate the frame's `from_scope`; teach `with_clause` to push placeholder CTE bindings before the body and harvest derived output columns on body-exit per §10.3; teach `projection_item` to append to `projection_aliases`. Keep `current_table` / `current_table_columns` as derived helpers (top frame's single-binding view) so the DSL paths stay green. 6. **Qualified-prefix completion** (§10.5). When the matched-path immediately preceding an `IdentSource::Columns` slot ends with `Ident '.'`, narrow candidates to the named binding's columns. Unit tests: `select t.` Tab offers `t`'s columns; an unresolved prefix returns an empty list. 7. **Post-walk fixup pass** (§10.6). Collect projection-list `Ident` terminals during the walk; after the walk, re-resolve each against the final `from_scope`, rewriting the highlight class and validity diagnostic. Tests: typing `select col1 from t` lights `col1` correctly once `t` is typed; typing `select bogus from t` produces a column-not-found diagnostic. 8. **Diagnostic passes** (§11). Extend the schema-existence ERROR pass to walk every `from_scope` binding plus `cte_bindings`; add the qualified-reference and ambiguity checks (§11.2). Add the new arity-check ERROR pass at the CTE-body and compound-query frame-exit hooks (§11.7 case 2). Extend the predicate-warnings pass with a MatchedPath-walking variant covering every Phase-2 `sql_expr` slot (§11.6) — closes the Phase-1 carry-over gap. 
Author the five new `diagnostic.*` catalog keys and the eight new `engine.*` translation keys (§11.5). Tests: one positive and one negative case per new ERROR key; predicate warnings firing on `select * from t where col like 5` (the Phase-1 gap closure); arity-mismatch ERRORs on a CTE and on a `UNION`. 9. **Result-column type resolution** (§12). Add `"column_metadata"` to rusqlite's feature list in `Cargo.toml`. The worker's `do_run_select` calls the new resolver — `RawStatement::column_table_name` / `column_origin_name` per result column — before constructing the `DataResult`. Tests: a single-column SELECT recovers the playground type (covering each of the ten types, the pedagogically important one being `bool` → `true` / `false`); a SELECT with a computed projection keeps it typeless; a SELECT through a CTE recovers the underlying column's type if the engine's column-origin metadata follows through the CTE (verified, not assumed). 10. **Highlighting / completion / hint** spot-checks via the typing-surface matrix (ADR-0022 / ADR-0030 §8): a SELECT with a JOIN highlights the JOIN keywords; Tab past `select t.` offers columns of `t`; column completion inside a `WHERE` after `from a join b on …` offers both `a`'s and `b`'s columns; column completion inside a correlated subquery sees the outer scope; the `[ERR]` indicator fires on a malformed subquery; an out-of-subset construct (e.g. `OVER (…)`) produces an engine-neutral parse error. 11. **`reject_internal_table`** spot-checks against every new table-source slot: a `FROM __rdbms_columns` parse-rejects; a `WITH __rdbms_x AS (…)` parse-rejects; a `FROM` inside a CTE body referencing `__rdbms_*` parse-rejects. Later phases continue ADR-0030's plan unchanged — Phase 3 (DML), Phase 4 (DDL), Phase 5 (DSL → SQL echo), Phase 6 (polish). 
ADR-0030 §13 OOS items (window functions, `LATERAL`, function allowlist, quoted identifiers) remain tracked separately and are authored if and when they are taken up; they are not implicit follow-ups of Phase 2. ## Amendment 1 — Empirical scope of column-origin metadata (2026-05-20) §12 was written conservatively: it constrained type recovery to projection items "structurally a single column reference" and listed "subquery expressions" alongside arithmetic and `CASE` as cases that stay `None`. The implementation plan's Open Question 1 (`docs/plans/20260520-adr-0032-phase-2.md`) captured the matching uncertainty about CTEs and scalar subqueries, leaving the test in sub-phase 2f to "assert the actual behaviour (not the wished-for behaviour)". A throwaway probe against the pinned bundled SQLite (run 2026-05-20, with `rusqlite` 0.39.0 + `column_metadata`) settles the question. Across twenty representative query shapes, the engine's `sqlite3_column_table_name` / `sqlite3_column_origin_name` metadata follows through: - direct bare column refs (the baseline); - `AS alias` projections (the alias remaps the output name but the origin pair stays the source `(table, column)`); - table-alias qualified refs (`u.name` → `(users, name)`); - non-recursive CTEs, including `SELECT *` bodies, bare-ref bodies, qualified-ref bodies, and `(col-list)`-renamed bodies (the rename remaps the output name; origin stays the underlying column); - CTE chains (a CTE that selects from a prior CTE — origin traces back to the base table); - derived tables in `FROM (SELECT …) AS sub` (out-of-scope for Phase 2 per §13 OOS-1, but useful to note: if ever admitted, type recovery comes for free); - scalar subqueries used as a projection primary (`SELECT (SELECT name FROM users WHERE id = 1)` — origin is preserved whether the subquery has an outer alias or not); - `UNION` / `UNION ALL` / `INTERSECT` / `EXCEPT` compound queries (result columns carry the first leg's origin); - multi-table `JOIN` projections 
(per-column origin per leg); - `IN (SELECT …)` subqueries in `WHERE` (the inner subquery does not affect the outer projection's origin). The metadata returns `None` for exactly two structural classes: - **Computed projections** — function calls, arithmetic expressions, string concatenation, `CASE` expressions, literals, the `*` and `t.*` wildcards. Expected; pedagogically obvious; no surprise for the learner. - **Recursive CTE result columns** (`WITH RECURSIVE r(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM r WHERE n < 5) SELECT n FROM r`). The recursion materialises through an internal temporary table that has no base-column origin to point at. This is the one structural surprise — a recursive-CTE result column is typeless even when it is structurally a bare name reference, because the engine cannot trace the column back past the recursion. ### What §12's resolution rule becomes The original §12 rule classifies projection items structurally (unqualified ident / qualified ref → recover; everything else → None). The empirical finding makes that classification redundant and slightly wrong: it misses scalar subqueries and CTE-routed refs that the engine does carry through, and it would have needed extending for `(col-list)`-renamed CTEs. The amended posture: **trust the engine's column-origin metadata verbatim**. For each result column, call `column_table_name(i)` / `column_origin_name(i)`. If both return `Some`, look the pair up in the active `SchemaCache` and use the playground type. If either is `None`, the slot stays `None` and the renderer falls back to neutral alignment. No structural classification of the projection item is needed; the grammar tier stays uninvolved (preserving ADR-0031 §2's "no AST" decision and ADR-0030's "one source of truth" rule, both as before). The "structurally a single column reference" definition in §12's **Resolution rule** is superseded by the engine-driven rule above. 
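The amended rule can be sketched without touching the engine at all, by treating the per-column origin pairs as plain input. This is an illustrative sketch under stated assumptions, not the worker's actual helper: `PlaygroundType` models only three of the ten types, and `Origin`, `SchemaTypes`, and `resolve_column_types` are hypothetical names. In the real path the pairs would come from the prepared statement's column-origin metadata and the lookup would go through `SchemaCache`.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the playground type vocabulary (ADR-0005).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PlaygroundType {
    Int,
    Text,
    Bool,
}

/// Origin of one result column as the engine reports it:
/// (table name, column name). Either half may be absent.
pub type Origin = (Option<String>, Option<String>);

/// Hypothetical schema view: (table, column) -> playground type.
pub type SchemaTypes = HashMap<(String, String), PlaygroundType>;

/// Produces one slot per result column. A slot is `Some` only when both
/// halves of the origin pair are present and the pair is known to the
/// schema; every other case stays `None` (typeless).
pub fn resolve_column_types(
    origins: &[Origin],
    schema: &SchemaTypes,
) -> Vec<Option<PlaygroundType>> {
    origins
        .iter()
        .map(|(table, column)| match (table, column) {
            (Some(t), Some(c)) => schema.get(&(t.clone(), c.clone())).copied(),
            // Computed projections and recursive-CTE columns land here.
            _ => None,
        })
        .collect()
}
```

Note that the function never inspects the projection item's shape: a computed column and a recursive-CTE column are indistinguishable to it, since both simply arrive with a missing origin.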
The §12 **Implementation seam** is unchanged in approach (engine-side column-origin lookup is still the mechanism), but the speculative fallback paragraph ("If exposure turns out to be awkward, the fallback is a small post-parse walk over the projection-item subtrees in the `MatchedPath`") is moot — the exposure works, and the engine's metadata is broader than a grammar-side walk could be without re-implementing SQLite's query-planner traceback. The fallback path is removed. ### Effect on the Phase-2 plan's sub-phase 2f The 2f exit gate's "CTE pass-through" row should be asserted positive (recovers `Some(text)`). The "Subquery result" row, which the plan left as "assert whichever behaviour the engine exhibits", should be asserted positive as well. A new explicit 2f test row covers the named limitation: a recursive CTE result column must produce `column_types[0] = None` and the renderer must fall back to neutral alignment without panicking. The catalog and grammar-side work in 2a–2e is unaffected by this amendment. Only 2f's test list and the worker's `resolve_select_column_types` helper change shape (the helper becomes simpler — no structural classification, just a direct metadata lookup per result column). This amendment narrows the honest limitation in §12 from "computed / non-direct projection items" to "computed projections and recursive CTE result columns" — a tighter, factually verified carve-out. ## Amendment 2 — §10.6 fixup-pass mechanism (2026-05-20) §10.6's prescription for the post-walk fixup is written in terms of "rewriting the highlight class" on projection-list `Ident` terminals — downgrading "column" → "unknown identifier" when an ident doesn't belong to the eventual `from_scope`, or upgrading the reverse direction once a `FROM` is typed. 
The implementation chose a different mechanism that achieves the identical user-visible effect; this Amendment records the choice so a reader of §10.6 doesn't go looking for a literal `per_byte_class` rewrite step that does not exist.

### Mechanism actually used

Two pieces, both already in the codebase by the end of sub-phase 2d:

1. **Two-pass schema-existence diagnostic.** The 2d rewrite of `schema_existence_diagnostics` (`src/dsl/walker/mod.rs`) runs a pre-pass over the matched path that collects every `IdentSource::Tables` / `cte_name` / `table_alias` ident into a single binding vec, regardless of where in the path it sits. The main pass then resolves each `sql_expr_ident` against the **complete** binding set. A projection ident that resolves under the eventual FROM scope produces no diagnostic; one that doesn't produces an `unknown_column` diagnostic on its own span.
2. **Diagnostic-overlay renderer.** `src/input_render.rs` reads the walker's diagnostic list at every keystroke and overlays each diagnostic's span with the appropriate colour (Error red for unknown-column, Warning for type-mismatch / `LIKE`-on-numeric / etc.). The overlay sits on top of the walker's `per_byte_class` (which keeps all idents at `HighlightClass::Identifier`).

Combined, the two yield the §10.6 user-visible behaviour. After typing `select bogus_col`, the diagnostic is emitted and the overlay paints the ident red as soon as a `FROM` appears that shows the column doesn't exist. After typing `select real_col`, no diagnostic is emitted and the ident stays Identifier-coloured. Both outcomes settle within one debounce cycle.

### Why this is equivalent

§10.6's stated goal is correctness of the end-of-walk classification — "rewriting the highlight class" is one implementation strategy for that goal. The `HighlightClass` enum in the codebase has only one identifier slot (`Identifier`); the Error tint comes from diagnostic overlay, not from a separate `Column` vs `UnknownIdentifier` class.
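
The two-pass shape described under "Mechanism actually used" can be sketched roughly as follows. Everything here is a simplified, hypothetical model: `PathItem`, `Diagnostic`, and the `HashMap` schema are illustration-only types, aliases and CTE names are left out, and the early return for an empty binding set is this sketch's reading of "as soon as a `FROM` appears", not a documented rule. Only the function name and the `unknown_column` code come from the text.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical, flattened view of the matched path.
#[derive(Debug, Clone, PartialEq)]
enum PathItem {
    /// A table-source ident (FROM / JOIN slot); aliases and CTEs omitted.
    TableBinding(String),
    /// Stands in for a `sql_expr_ident`, with its byte span in the input.
    ExprIdent { name: String, span: Range<usize> },
}

#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: &'static str,
    span: Range<usize>,
}

/// Sketch of the two-pass diagnostic: collect every binding first, then
/// resolve each expression ident against the complete binding set.
fn schema_existence_diagnostics(
    path: &[PathItem],
    schema: &HashMap<&str, Vec<&str>>,
) -> Vec<Diagnostic> {
    // Pre-pass: gather bindings regardless of where they sit in the path,
    // so a projection ident can see a FROM that comes after it.
    let bindings: Vec<&str> = path
        .iter()
        .filter_map(|item| match item {
            PathItem::TableBinding(table) => Some(table.as_str()),
            _ => None,
        })
        .collect();

    // Assumption: with no table in scope yet, stay silent.
    if bindings.is_empty() {
        return Vec::new();
    }

    // Main pass: an ident unknown to every bound table gets a diagnostic
    // on its own span; the renderer's overlay would tint that span.
    path.iter()
        .filter_map(|item| match item {
            PathItem::ExprIdent { name, span } => {
                let known = bindings.iter().any(|table| {
                    schema
                        .get(*table)
                        .map_or(false, |cols| cols.contains(&name.as_str()))
                });
                (!known).then(|| Diagnostic {
                    code: "unknown_column",
                    span: span.clone(),
                })
            }
            _ => None,
        })
        .collect()
}
```

For `select bogus_col, name from users` against a `users(id, name)` table, the sketch yields a single `unknown_column` diagnostic over `bogus_col` (bytes 7..16), even though the `users` binding appears after both projection idents in the path.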
The two-pass diagnostic pass is the "post-walk fixup" that §10.6 calls for — it just runs inside the diagnostic emitter rather than as a separate rewrite step. The integration point (§10.6's "final stage of the walk itself") still holds: `schema_existence_diagnostics` runs after the walk's main work, mutating the walker's accumulated diagnostic vector in place. Consumers see a single coherent snapshot.

### Completion mid-typing

§10.6's second user-visible promise — "during-typing completion of projection-list column names uses the global fallback" — is preserved as a posture, but improved at the edges in sub-phase 2e by a look-ahead probe in `src/completion.rs`. When the leading walk produces no `from_scope` (the projection-before-FROM state) **and** the full input does have a FROM after the cursor, a second walk on the full input populates the binding set, and column candidates narrow to that scope. The fallback to global `SchemaCache.columns` remains the path when the full input doesn't parse cleanly (e.g., the user deleted `*` and is mid-edit). This is a strict improvement: the realistic "edit an existing query" workflow now narrows correctly.

### What §10.6's prescription becomes

The "rewrite the highlight class" wording is superseded by: **the post-walk diagnostic pass re-resolves projection idents against the complete scope and emits / withholds the unknown-column diagnostic accordingly; the renderer's diagnostic-overlay path achieves the visual change**. No new `HighlightClass` variant is required.

§10.6's other prescriptions stand verbatim — the integration point (final walk stage, in-place mutation of walker accumulators), the per-keystroke re-walk (ADR-0027's debounced cadence), and the ORDER BY no-fixup-needed clarification.

## See also

- ADR-0005 — the ten-type vocabulary §10 resolves back to.
- ADR-0016 — the data-table renderer SELECT results reuse.
- ADR-0019 — the friendly-error layer engine-side rejections route through (§7).
- ADR-0021 — per-command parse-error usage; the Phase-2 surface inherits the framework, Phase 6 polishes per-clause messages (§11 OOS-12).
- ADR-0022 — ambient typing assistance. §5/§6/§8 inherit its keyword-completion / highlighting / hint mechanisms for free, but §10 extends its `IdentSource::Columns` / `SchemaCache` / `WalkContext` infrastructure with the scope accumulators, qualified-prefix narrowing, and the post-walk fixup pass that Phase 2 needs.
- ADR-0023 / ADR-0024 — the unified grammar tree Phase 2 extends.
- ADR-0026 — the `WHERE` grammar's `Subgrammar` node, depth counter, and `MAX_SUBGRAMMAR_DEPTH = 64` cap, all reused unchanged (§9).
- ADR-0027 — the validity indicator, free for the Phase-2 surface; §1 (ERROR/WARNING guideline) is the source quoted verbatim in §11; Amendment 1 (`LIKE`-on-numeric WARNING) is the one that the SQL-expression predicate-warnings gap of §11.6 closes for the SQL surface.
- ADR-0028 — the styled `OutputLine` mechanism the renderer uses; not directly touched by Phase 2.
- ADR-0030 — the parent ADR; §3 commissions this phase, §4/§6 fix execution-as-text, §7 fixes engine neutrality, §11 fixes history / replay, §13 fixes the long-running OOS list.
- ADR-0031 — the SQL expression grammar this ADR extends additively (§5, §6); §7 named the two extensions implemented here.
- `docs/simple-mode-limitations.md` — the DSL limits advanced mode lifts; Phase 2 lifts the JOIN, subquery, set-op, CTE, and grouping limits.