rdbms-playground/docs/adr/0032-sql-select-grammar.md
claude@clouddev1 ee0dafd86b docs: ADR-0032 Amendment 2 + §10.6 regression tests
Amendment 2 records the §10.6 fixup-pass mechanism choice. §10.6
prescribes "rewriting the highlight class" on projection-list
idents at end-of-walk; the actual implementation uses a different
mechanism that achieves the identical user-visible behavior:

1. 2d's two-pass schema-existence diagnostic collects every FROM
   binding from the matched path first, then resolves projection
   idents against the complete scope. The post-walk re-resolve
   §10.6 calls for, just embedded in the diagnostic emitter.

2. input_render.rs's diagnostic-overlay path colors each
   diagnostic span Error/Warning, achieving the visual change
   §10.6 describes without needing a new HighlightClass variant.

The completion-mid-typing piece is improved by the §10.5
look-ahead probe (sub-phase 2e earlier).

Four new regression tests in `projection_before_from_tests` pin
the behavior so a future refactor can't silently regress it:
correct ident resolves silently, unknown ident flags via
diagnostic on its span, multi-projection only flags unknowns,
projection-without-FROM is silent.

ADR index entry updated to reference Amendment 2.

Test totals: 1424 → 1428 passing (+4). Clippy clean.
2026-05-20 21:19:57 +00:00

ADR-0032: The full SQL SELECT grammar

Status

Accepted

Context

ADR-0030 commissions advanced mode as a body of SQL grammar inside the unified grammar tree (ADR-0023/0024), phased. Phase 1 ("Foundations + first SELECT") shipped: a single-table SELECT with projection, WHERE, ORDER BY, and LIMIT, executed as validated SQL text through the existing data-table renderer. ADR-0031 authored the SQL expression grammar the Phase-1 SELECT consumed.

Phase 2 — "SELECT — full" — is the next slice. ADR-0030 §3 lists it: JOINs, GROUP BY / HAVING, aggregates, subquery expressions, UNION/INTERSECT/EXCEPT, common table expressions, LIMIT … OFFSET, qualified column references. ADR-0030 §3 also says the full SELECT grammar "is each large enough to warrant their own focused ADR when implemented — the precedent is ADR-0026 for the WHERE grammar." This is that ADR.

The architecture is fixed (ADR-0030 §1, §4, §6, §8): one walker, grammar-as-text execution, ambient assistance for free. This ADR fixes the shape of the grammar — the productions, the recursion, the additive extensions to ADR-0031's expression fragment, and the few execution-path implications (worker-side column-origin lookup so result columns recover their playground type). It deliberately does not revisit ADR-0030's structural decisions; references in this ADR's text to ADR-0030 §X mean "that decision is the controlling one."

What ADR-0030 and ADR-0031 already fix

  • No batch parser; SQL is grammar in the unified tree. Subquery recursion is a Node::Subgrammar(&NAMED) reference, exactly as the expression ladder uses it (ADR-0031 §3).
  • No AST builder for the parts that execute as text. Command::Select { sql: String } carries the validated source; the worker prepares and runs it (ADR-0030 §4/§6, ADR-0031 §2).
  • The __rdbms_* rejection at every table-name slot (ADR-0030 §6) — re-applied to every Phase-2 table-source slot (FROM, JOIN, CTE-name).
  • No allowlist for function names (ADR-0030 §13 OOS-3, ADR-0031 §6). Aggregates (count, sum, avg, min, max) parse through the generic name_or_call path — the grammar is structurally aggregate-blind, by design.
  • No quoted identifiers (ADR-0031 §7 OOS-3) — unchanged.
  • MAX_SUBGRAMMAR_DEPTH = 64 (ADR-0026) is the shared recursion budget across DSL Expr, SQL expression, and (added here) SQL SELECT recursion. No new walker capability is introduced (§9).

The boundary with ADR-0031

ADR-0031 §7 named two additive extensions deferred to this ADR:

  • OOS-1: subquery expressions, namely ( SELECT … ) as a primary, IN ( SELECT … ), and EXISTS ( SELECT … ). Their grammar is fixed in §6; they are additive Choice branches in sql_expr.rs, recursing into the named SELECT fragment authored here.
  • OOS-2: qualified column references, namely t.c / alias.c. Their grammar is fixed in §5; they are an additive tail on name_or_call in sql_expr.rs.

sql_expr.rs was shaped to receive both branches without restructuring (ADR-0031 §7 promise). This ADR redeems that promise; the changes there are strictly additive.

Decision

1. The top-level SELECT grammar

The full statement decomposes into a top-level compound query (set-operator chains around per-leg core selects), wrapped by an optional WITH prefix and trailing ORDER BY / LIMIT:

select_statement   := [ with_clause ] compound_select
compound_select    := select_core ( set_op select_core )*
                      [ order_by_clause ]
                      [ limit_clause ]
set_op             := UNION [ ALL ] | INTERSECT | EXCEPT
select_core        := SELECT [ DISTINCT | ALL ]
                      projection_list
                      [ from_clause ]
                      [ where_clause ]
                      [ group_by_clause ]
                      [ having_clause ]
with_clause        := WITH [ RECURSIVE ] cte_def
                      ( ',' cte_def )*
cte_def            := identifier [ '(' column_name_list ')' ]
                      AS '(' compound_select ')'
projection_list    := projection_item ( ',' projection_item )*
projection_item    := '*'
                    | identifier '.' '*'
                    | sql_expr [ [ AS ] identifier ]
from_clause        := FROM table_source ( join_clause )*
table_source       := identifier [ [ AS ] identifier ]
join_clause        := [ INNER ] JOIN table_source ON sql_expr
                    | LEFT  [ OUTER ] JOIN table_source ON sql_expr
                    | RIGHT [ OUTER ] JOIN table_source ON sql_expr
                    | FULL  [ OUTER ] JOIN table_source ON sql_expr
                    | CROSS JOIN table_source
where_clause       := WHERE sql_expr
group_by_clause    := GROUP BY sql_expr ( ',' sql_expr )*
having_clause      := HAVING sql_expr
order_by_clause    := ORDER BY order_item ( ',' order_item )*
order_item         := sql_expr [ ASC | DESC ]
limit_clause       := LIMIT sql_expr [ OFFSET sql_expr ]

sql_expr is ADR-0031's SQL_OR_EXPR, extended additively per §5 and §6. column_name_list is identifier (, identifier)*.

The named static Node exported by the new src/dsl/grammar/sql_select.rs is SQL_SELECT_STATEMENT (matching the full statement) and SQL_SELECT_COMPOUND (the embedded form, omitting the outer WITH; this is what subqueries recurse into — see §6, §9).

Notes on specific productions:

  • FROM stays optional. Phase 1's autonomous decision §4.1 is upheld: SELECT 1 and SELECT upper('x') continue to parse. With JOINs landing, the absence of a FROM simply means no from_clause/join_clause was matched; no extra shape is needed.
  • Bare-alias projection (select a x) is admitted. Phase 1's autonomous decision §4.2 deliberately rejected it as structurally ambiguous. With Phase 2's grammar the ambiguity dissolves: only a closed set of tokens can legitimately follow a projection expression, and each of them is a keyword or punctuation in the walker's expected-set. An identifier following the last projection expression that is not FROM, ,, WHERE, GROUP, ORDER, LIMIT, or a set-op keyword is therefore a bare alias, and is admitted as such. This lifts a small but visible Phase-1 limitation.
  • SELECT [ DISTINCT | ALL ]. ALL is the default and is admitted for symmetry; DISTINCT is the meaningful case. They are mutually exclusive at this position (a Choice, not two Optionals).
  • identifier '.' '*' lives only in projection_item, never in sql_expr. This is intentional: t.* is projection syntax, not an expression, and admitting it as an expression primary would let it appear in WHERE / ORDER BY / etc. where the engine would reject it and the engine-neutral error would be hard to phrase. The grammar simply refuses it structurally outside projection.
  • UNION ALL is a single set-op, not UNION followed by an ALL modifier on the next leg. set_op is a Choice of the four atoms (with UNION and UNION ALL as separate branches); factoring UNION [ ALL ] is also valid but the explicit four branches keep the matched-path classes cleaner for highlighting.

2. JOIN flavours admitted

The grammar admits exactly the flavours the user picked:

  • INNER JOIN / bare JOIN
  • LEFT [ OUTER ] JOIN
  • RIGHT [ OUTER ] JOIN
  • FULL [ OUTER ] JOIN
  • CROSS JOIN

The first four take a mandatory ON sql_expr; CROSS JOIN takes none. OUTER is the optional explicit modifier on LEFT / RIGHT / FULL.

Explicitly out (§11): NATURAL JOIN, JOIN … USING (col), and comma-list FROM t1, t2 (the legacy implicit cross join). The first two add grammar weight for limited teaching value; comma-FROM teaches habits we do not want to encourage — CROSS JOIN covers the same shape explicitly.

JOIN chains are admitted as a flat ( join_clause )*. Standard SQL is left-associative; since the grammar builds no AST and the engine receives the source text verbatim (ADR-0030 §4), the engine resolves the associativity. The grammar's job ends at "the chain parses".

3. Set operators and compound queries

UNION, UNION ALL, INTERSECT, EXCEPT all admitted — ADR-0030 §3's full set.

The compound shape (§1) is select_core (set_op select_core)*, flat. Standard SQL gives INTERSECT higher precedence than UNION / EXCEPT; the engine resolves this — the grammar admits the chain as written. This mirrors §2's JOIN-chain decision. A user who wants explicit grouping writes (SELECT … INTERSECT SELECT …) UNION SELECT …, which falls out of the subquery-primary branch (§6) — though for a top-level statement this requires an extra SELECT wrapping. In practice the engine's precedence is what learners encounter; calling it out in the help sql page (ADR-0030 Phase 6) is sufficient.

ORDER BY / LIMIT on a compound apply to the whole compound, not to a leg — fixed by the position of order_by_clause and limit_clause in §1's compound_select.

4. CTEs (WITH and WITH RECURSIVE)

The full with_clause per §1. Both forms admitted: non-recursive WITH for naming intermediate results, and WITH RECURSIVE for recursive queries (tree traversals, transitive closure, generated sequences).

The cte_def body is a parenthesised compound_select, so the recursion is into SQL_SELECT_COMPOUND via Subgrammar — the same recursion mechanism subqueries use (§9).

CTE-name collisions. A CTE name shares the table-name namespace at the engine. Standard SQL: the CTE shadows a same-named base table within the statement. The grammar is agnostic — both are identifiers in a table-source slot — so the shadowing falls out of engine resolution. The reject_internal_table validator still rejects any __rdbms_* identifier in any table-source slot, including CTE-name slots and the FROMs inside CTE bodies. That is the right posture: the reserved namespace is reserved everywhere.

Recursive CTEs use the standard cte_name AS ( base_case UNION [ALL] recursive_case ) shape — already admitted by §1's compound_select body. No grammar branch specific to recursion is needed; the RECURSIVE keyword is a hint to the engine, not a grammar gate.

5. Qualified column references

Additive extension to sql_expr.rs (ADR-0031 §7 OOS-2). name_or_call's identifier prefix gains a Choice tail:

name_or_call    := identifier
                   ( '.' identifier
                   | '(' call_args? ')'
                   )?

The leading identifier is matched once (preserving ADR-0031 §1's factoring — no Choice branch begins with an identifier). The optional tail is either a qualified-reference suffix (. identifier) or a function-call argument list (( … )), not both. A bare identifier with no tail remains a plain column reference.

A function call with a qualified name — schema.f(…) — is not in scope (we have no schemas) and is structurally inadmissible by construction: there is no production that admits both a .-tail and a (-tail.
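
The factoring above can be illustrated with a small hand-written matcher. This is a minimal sketch, not the project's walker: the `Tok` and `NameOrCall` types and the `parse_name_or_call` function are hypothetical names invented for the example, and call arguments are left out to keep it short. It shows the leading identifier being matched once and the tail being either a `.`-suffix or a `(`-suffix, never both.

```rust
/// Tokens for this sketch only (hypothetical; the real token type is not shown in this ADR).
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Ident(String),
    Dot,
    LParen,
    RParen,
}

/// Hypothetical result shape, for illustration.
#[derive(Debug, PartialEq)]
enum NameOrCall {
    Column(String),
    Qualified { qualifier: String, column: String },
    Call { name: String },
}

/// Matches `identifier ( '.' identifier | '(' ')' )?` at the start of `toks`
/// and returns the parsed shape plus the number of tokens consumed.
fn parse_name_or_call(toks: &[Tok]) -> Option<(NameOrCall, usize)> {
    // The leading identifier is matched exactly once.
    let name = match toks.first() {
        Some(Tok::Ident(n)) => n.clone(),
        _ => return None,
    };
    match toks.get(1) {
        // Qualified-reference tail: '.' identifier
        Some(Tok::Dot) => match toks.get(2) {
            Some(Tok::Ident(col)) => Some((
                NameOrCall::Qualified { qualifier: name, column: col.clone() },
                3,
            )),
            _ => None, // a dangling '.' does not parse
        },
        // Call tail: '(' ')' (arguments omitted in this sketch)
        Some(Tok::LParen) => match toks.get(2) {
            Some(Tok::RParen) => Some((NameOrCall::Call { name }, 3)),
            _ => None,
        },
        // No tail: a plain column reference
        _ => Some((NameOrCall::Column(name), 1)),
    }
}
```

Fed the tokens of `schema.f()`, the matcher stops after `schema.f` and leaves the `(` unconsumed, which is the "structurally inadmissible by construction" property in miniature.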

Completion for the qualified form: when the cursor is past identifier '.', the completion source is "columns of the table or alias named by the leading identifier", resolved from the active SchemaCache (the same source the DSL completion uses, ADR-0030 §8). This is a small extension to the existing IdentSource::Columns machinery: when a qualifying prefix is present, column completion is narrowed to the named source.

6. Subquery expressions

Additive extensions to sql_expr.rs (ADR-0031 §7 OOS-1):

  • Scalar subquery as primary. A Choice branch '(' compound_select ')'. The existing '(' or_expr ')' branch handles parenthesised expressions. Both start with '(', so per ADR-0031 §1's factoring principle, the '(' is matched once and the inside is a Choice between compound_select and or_expr. The first inside token disambiguates: SELECT or WITH → subquery; anything else → expression. The two Choice branches have non-overlapping first-token sets, so the walker's expected-set at the ambiguity point merges naturally without Optional-first hazards.

  • IN ( subquery ). The existing predicate_tail's IN '(' additive (',' additive)* ')' branch gains a sibling IN '(' compound_select ')'. Same '(' factoring as the scalar case: after '(', branch on SELECT/WITH (subquery) vs additive-first-token (literal list). NOT IN follows from the existing [ NOT ] factoring on the predicate tail.

  • [ NOT ] EXISTS ( subquery ). Added as a primary Choice branch:

    primary := … | EXISTS '(' compound_select ')'
    

    The bare EXISTS form lives in primary; NOT EXISTS falls out of the existing not_expr := NOT not_expr tier above primary in the precedence ladder. This is structurally cleaner than putting [ NOT ] EXISTS inside primary: there is only one place NOT is admitted, and it composes uniformly.

All three branches recurse through Subgrammar(&SQL_SELECT_COMPOUND). Correlated subqueries fall out for free — a subquery's sql_expr reaches identifiers, which the engine resolves against outer scopes. The grammar imposes no correlation constraint; correlation is engine-side semantics.
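
The first-token disambiguation after a shared '(' can be sketched as below. This is an illustration under stated assumptions, not the walker's actual mechanism: `ParenBody` and `classify_after_lparen` are hypothetical names, and case-insensitive keyword matching is assumed for the example.

```rust
/// Which branch applies once a '(' has been consumed (hypothetical type).
#[derive(Debug, PartialEq)]
enum ParenBody {
    /// Would recurse into the compound-SELECT fragment.
    Subquery,
    /// Would recurse into the expression ladder (or a literal list after IN).
    Expression,
}

/// Looks at the first token inside the parentheses. SELECT or WITH selects
/// the subquery branch; any other token selects the expression branch.
/// The two first-token sets do not overlap, so one token of look-ahead suffices.
fn classify_after_lparen(first_inner_token: &str) -> ParenBody {
    let upper = first_inner_token.to_ascii_uppercase();
    if upper == "SELECT" || upper == "WITH" {
        ParenBody::Subquery
    } else {
        ParenBody::Expression
    }
}
```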

7. GROUP BY and HAVING

GROUP BY takes a comma-separated list of sql_exprs. Standard SQL admits any expression as a grouping key (not just bare columns) — e.g. GROUP BY date(created_at). The grammar admits this without special-casing.

HAVING is a single sql_expr. Its semantics is "boolean over grouped rows"; the grammar does not enforce that — the expression's typing is the engine's concern.

Aggregate correctness is not grammar-checked. Whether a projection's non-aggregated columns are valid given the GROUP BY keys is a semantic question. ADR-0030 §9 settled this: the grammar admits structurally, the engine rejects semantically, and the friendly-error layer renders engine-neutral wording (ADR-0019). A learner who writes SELECT Name, COUNT(*) FROM t sees an engine-neutral "Name must appear in a GROUP BY clause or be wrapped in an aggregate function"-style message, not a raw engine string and not a parse error. This is the project's honest limitation (ADR-0030 §7) and remains so.

8. LIMIT / OFFSET and ORDER BY extras

LIMIT n [ OFFSET m ] — the standard form. Both n and m are sql_exprs (in practice integer literals, but the grammar admits the general form so e.g. LIMIT max(10, x) OFFSET 0 is structurally accepted; the engine constrains values).

The MySQL/SQLite legacy comma form LIMIT m, n is out (§11). Its argument order (offset first, then count) inverts the keyword form — a needless source of confusion.

ORDER BY already admits sql_expr items with optional ASC / DESC (Phase 1). With Phase 2:

  • Column-position references (ORDER BY 1, 3 DESC) fall out for free — an integer literal is a valid sql_expr, and the engine interprets a bare positive integer in ORDER BY as a column position. The grammar does not distinguish the case; rendering interprets the position. Document in help sql.
  • Qualified refs in ORDER BY (e.g. ORDER BY t.c) fall out of §5 — the grammar uses the same sql_expr body.

9. Recursion, the depth budget, and the walker

SELECT recurses into itself at four points:

  • A subquery primary in sql_expr (§6).
  • An IN ( subquery ) predicate tail (§6).
  • An EXISTS ( subquery ) primary (§6).
  • A CTE body (§4).

Every recursion is wired through Node::Subgrammar(&SQL_SELECT_COMPOUND) — the named static Node exported by sql_select.rs. The recursion is token-guarded in every case: a subquery primary is preceded by '('; an IN ( subquery ) by IN (; an EXISTS ( subquery ) by EXISTS (; a CTE body by AS (. There is no left recursion; the walker always makes progress.

MAX_SUBGRAMMAR_DEPTH = 64 (ADR-0026, reused by ADR-0031) is shared: DSL Expr recursion, SQL expression recursion, and SQL SELECT recursion all increment the same WalkContext::subgrammar_depth. A worst-case learner query might be SELECT … WHERE id IN (SELECT … WHERE id IN (SELECT …)) with each inner select carrying a few-deep expression — well below the cap. The cap remains purely a stack-overflow guard; this ADR does not raise it. If pathological-but-realistic learner queries reach 64 in practice, a focused ADR lifts it with measurements. Speculative raising would weaken the guard without evidence.
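
A shared depth counter of this kind might look like the following minimal sketch. Only `MAX_SUBGRAMMAR_DEPTH` and `subgrammar_depth` are names from this ADR; `enter_subgrammar`, `DepthExceeded`, and `nest` are hypothetical, and the exact boundary (whether the 64th or 65th entry fails) is an assumption of the example rather than a statement about the real walker.

```rust
const MAX_SUBGRAMMAR_DEPTH: usize = 64;

#[derive(Default)]
struct WalkContext {
    /// One counter shared by every kind of subgrammar recursion.
    subgrammar_depth: usize,
}

/// Hypothetical error type standing in for the friendly depth-exceeded error.
#[derive(Debug, PartialEq)]
struct DepthExceeded;

impl WalkContext {
    /// Runs `body` one level deeper and restores the counter afterwards,
    /// whether `body` succeeded or failed.
    fn enter_subgrammar<T>(
        &mut self,
        body: impl FnOnce(&mut WalkContext) -> Result<T, DepthExceeded>,
    ) -> Result<T, DepthExceeded> {
        if self.subgrammar_depth >= MAX_SUBGRAMMAR_DEPTH {
            return Err(DepthExceeded);
        }
        self.subgrammar_depth += 1;
        let result = body(self);
        self.subgrammar_depth -= 1;
        result
    }
}

/// Nests `levels` subgrammar entries, as a stack of nested subqueries would.
fn nest(ctx: &mut WalkContext, levels: usize) -> Result<(), DepthExceeded> {
    if levels == 0 {
        return Ok(());
    }
    ctx.enter_subgrammar(|inner| nest(inner, levels - 1))
}
```

Because the counter is restored on the way out, sibling subqueries at the same nesting level do not add up; only true nesting consumes the budget.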

No new walker capability is introduced. Subgrammar, the depth counter, the cap, and the friendly depth-exceeded error all carry over from ADR-0026 unchanged — the same posture ADR-0031 took. This is a non-trivial property: Phase 2 is the biggest single grammar slice in the project, and it lands without changing the walker's contract.

10. Completion scope and the WalkContext extension

ADR-0030 §8 promises that "ambient assistance comes for free" because SQL is grammar in the unified tree. For Phase 1's single-table SELECT this was substantially true: the existing WalkContext::current_table mechanism (populated via the writes_table: true flag on the FROM table-name slot) gave WHERE and ORDER BY column-name completion against the right table at no incremental cost.

Phase 2 breaks the "free" claim. Multiple FROM tables via JOINs, aliases, CTE-defined table sources, subqueries with their own FROM scope, qualified t.c references, projection aliases referenced in ORDER BY — every Phase-2 surface needs scope information that WalkContext does not currently carry. §9's "no new walker capability" claim holds for grammar recursion (Subgrammar and the depth cap suffice); for completion scope it is too strong, and is softened here to an honest split.

The current WalkContext carries one table at a time (current_table: Option<String> + current_table_columns), set by writes_table: true on a Tables identifier. DSL paths (update T, delete from T, insert into T) rely on this single-table contract and continue to work unchanged. Phase 2 adds layered accumulators alongside, not in place of.

10.1. The from-scope accumulator

A new WalkContext field:

from_scope: Vec<TableBinding>
TableBinding { table: String, alias: Option<String>,
                columns: Vec<TableColumn> }

Populated incrementally as the walker descends through from_clause and each join_clause (§1). The first table-source slot pushes a binding; every subsequent JOIN pushes another. Ident slots whose IdentSource is Columns now resolve against the union of every binding's columns, with deduplication.

current_table / current_table_columns remain as derived helpers: when from_scope.len() == 1, they expose that single binding's data, preserving the contract every existing DSL path relies on. DSL UPDATE / DELETE / INSERT continue to push exactly one binding via the existing writes_table: true mechanism, unchanged.
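
As a rough illustration of the accumulator and its derived helper, the sketch below uses the field names from this section but simplifies `TableColumn` to a name only. The `column_candidates` method and the `binding` helper are hypothetical, and returning `None` from `current_table` when more than one binding is in scope is an assumption; this ADR only fixes the single-binding case.

```rust
#[derive(Debug, Clone, PartialEq)]
struct TableColumn {
    name: String,
}

#[derive(Debug, Clone)]
struct TableBinding {
    table: String,
    alias: Option<String>,
    columns: Vec<TableColumn>,
}

#[derive(Default)]
struct WalkContext {
    from_scope: Vec<TableBinding>,
}

impl WalkContext {
    /// Derived helper: exposes the table only when exactly one binding exists.
    fn current_table(&self) -> Option<&str> {
        match self.from_scope.as_slice() {
            [only] => Some(only.table.as_str()),
            _ => None, // assumption: undefined for zero or several bindings
        }
    }

    /// Union of every binding's column names, deduplicated, first occurrence kept.
    fn column_candidates(&self) -> Vec<String> {
        let mut out: Vec<String> = Vec::new();
        for binding in &self.from_scope {
            for col in &binding.columns {
                if !out.contains(&col.name) {
                    out.push(col.name.clone());
                }
            }
        }
        out
    }
}

/// Convenience constructor for the example (hypothetical).
fn binding(table: &str, alias: Option<&str>, cols: &[&str]) -> TableBinding {
    TableBinding {
        table: table.to_string(),
        alias: alias.map(str::to_string),
        columns: cols.iter().map(|c| TableColumn { name: c.to_string() }).collect(),
    }
}
```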

10.2. Scope-stack discipline at Subgrammar boundaries

Subqueries (§6) and CTE bodies (§4) introduce new lexical scopes. A column reference inside SELECT … WHERE id IN (SELECT id FROM u) resolves first against the inner SELECT's FROM (u), and — for correlation — also against the outer scope. subgrammar_depth is a counter; it suffices for §9's depth cap but not for scope.

Phase 2 layers a stack on top. A new field:

from_scope_stack: Vec<ScopeFrame>
ScopeFrame {
    from_scope: Vec<TableBinding>,
    cte_bindings: Vec<CteBinding>,
    projection_aliases: Vec<String>,
}

The new walker node variant, Node::ScopedSubgrammar(&Node), is what triggers a scope push. It is a sibling of the existing Node::Subgrammar(&Node), with the same recursion semantics (reference-following, depth-counted) and one additional driver behaviour: on entry, push the current ScopeFrame onto from_scope_stack and start a fresh empty frame; on exit, pop back. The existing Node::Subgrammar variant is unchanged: DSL Expr recursion (ADR-0026) and the sql_expr.rs precedence-ladder recursion (ADR-0031) keep using it and never push a scope.

The grammar source spells the choice explicitly at each call site: subqueries in sql_expr.rs and CTE bodies in sql_select.rs reference the compound-SELECT through Node::ScopedSubgrammar(&SQL_SELECT_COMPOUND); predicate-ladder recursion in sql_expr.rs continues to use Node::Subgrammar(&SQL_OR_EXPR). Self-documenting, no flag bookkeeping, and the walker change is localised to one extra arm in the driver's match over Node variants.

Column-completion candidates inside a scope frame are the union of the current frame's from_scope and (for correlated refs) all outer frames; outer-frame columns are admitted as additional candidates so correlated references work. Ordering or visual differentiation between current-frame and outer-frame candidates is completion-tier polish and is not specified by this ADR — the current completion API (candidates_at_cursor*) returns a flat Vec, and adding a priority dimension is a separate concern. CTE bindings resolve the same way (outward-walking) — a CTE defined in an outer query is visible inside an inner subquery as a table source, unless the inner subquery defines a CTE of the same name and shadows it.

This is the one explicit walker-capability extension Phase 2 makes. It is scoped: one new node variant, no new walker entry point, no change to how Subgrammar bodies are entered structurally. The depth cap (§9) applies to both variants uniformly through the shared subgrammar_depth counter.
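
The push/pop discipline and the outward-walking column lookup can be sketched as follows. This is a simplified model, not the driver itself: `ScopeFrame` is reduced to its `from_scope` (with bindings flattened to a table name and column names), and `current`, `enter_scoped`, `exit_scoped`, and `column_candidates` are hypothetical names. The candidate order shown (current frame first, then outer frames innermost-first) is an arbitrary choice for the example, since this ADR leaves ordering unspecified.

```rust
#[derive(Debug, Clone, Default)]
struct ScopeFrame {
    /// Simplified stand-in for Vec<TableBinding>: (table, column names).
    from_scope: Vec<(String, Vec<String>)>,
}

#[derive(Default)]
struct WalkContext {
    /// The frame currently being filled (hypothetical field name).
    current: ScopeFrame,
    /// Saved outer frames, outermost first.
    from_scope_stack: Vec<ScopeFrame>,
}

impl WalkContext {
    /// On ScopedSubgrammar entry: save the current frame, start an empty one.
    fn enter_scoped(&mut self) {
        let outer = std::mem::take(&mut self.current);
        self.from_scope_stack.push(outer);
    }

    /// On ScopedSubgrammar exit: discard the inner frame, restore the outer.
    fn exit_scoped(&mut self) {
        if let Some(outer) = self.from_scope_stack.pop() {
            self.current = outer;
        }
    }

    /// Columns of the current frame plus all outer frames (for correlation),
    /// deduplicated by name.
    fn column_candidates(&self) -> Vec<String> {
        let mut out: Vec<String> = Vec::new();
        let frames = std::iter::once(&self.current).chain(self.from_scope_stack.iter().rev());
        for frame in frames {
            for (_, columns) in &frame.from_scope {
                for col in columns {
                    if !out.contains(col) {
                        out.push(col.clone());
                    }
                }
            }
        }
        out
    }
}
```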

10.3. CTE bindings

A frame-local accumulator carries CTE definitions visible in the current scope:

cte_bindings: Vec<CteBinding>
CteBinding {
    name: String,
    columns: Vec<CteColumn>,
}
CteColumn {
    name: Option<String>,            // None for unnamed
                                     //   computed projections
    type_: Option<Type>,             // resolved playground type
                                     //   if derivable
}

A CTE definition cte_name [(col-list)] AS (compound_select) produces a binding in two stages:

  1. Pre-body push (so WITH RECURSIVE self-references resolve). When the walker reaches AS and is about to enter the body's Node::ScopedSubgrammar(&SQL_SELECT_COMPOUND), it pushes a placeholder binding into the outer frame's cte_bindings with columns = [] (an empty stand-in). The CTE name is now visible as a table source from inside the body.
  2. Body-finalised harvest (when the body's scope frame completes). On ScopedSubgrammar exit, before popping the frame, the driver derives the body's projection-list output columns (rules below) and rewrites the placeholder binding in the outer frame.

Output-column derivation rules. Walking the body's projection items:

| Projection item | Derived CTE column(s) |
| --- | --- |
| `*` | Every column from the body frame's from_scope, in order, with their resolved types |
| `t.*` (qualified wildcard) | Every column from binding t in the body frame's from_scope, with their types |
| `col` (bare ref, resolves uniquely) | One column: name = col, type = the resolved column's playground type |
| `t.col` (qualified ref) | One column: name = col, type = t's column's type |
| `expr AS alias` or bare `expr alias` | One column: name = alias, type = the underlying type if expr is a single column ref, else None |
| `expr` (computed, no alias) | One column: name = None, type = None; the engine assigns an implementation-defined name |

For compound bodies (UNION / INTERSECT / EXCEPT) the columns come from the first leg per standard SQL. For recursive CTE bodies (WITH RECURSIVE) the same rule — the non-recursive leg dictates.

If a (col-list) was supplied on the CTE name, it renames the derived columns positionally and overrides their names; types are preserved from the derivation. If the column-count of (col-list) disagrees with the body's projection arity, the grammar admits this and the engine surfaces the mismatch — do_run_select's engine-neutral error layer carries the message (ADR-0030 §9, ADR-0019).
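
The derivation rules and the positional (col-list) rename can be sketched as below. This is an illustrative model only: `CteColumn` follows the shape given earlier in this section, while `Type`'s variants, `Binding`, `ColumnRef`, `Projection`, `lookup`, and `derive_columns` are hypothetical. Treating an ambiguous or unresolved bare reference as untyped is an assumption of the sketch; compound bodies and the placeholder push for WITH RECURSIVE are not modelled.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Type { Int, Text }

#[derive(Debug, Clone, PartialEq)]
struct CteColumn { name: Option<String>, type_: Option<Type> }

/// One FROM binding of the CTE body, reduced to a name and typed columns.
struct Binding { name: String, columns: Vec<(String, Type)> }

/// A column reference: optional qualifier plus column name.
type ColumnRef = (Option<String>, String);

enum Projection {
    Star(Option<String>), // `*` or `t.*`
    Column(ColumnRef),    // `col` or `t.col`
    Aliased { alias: String, single_ref: Option<ColumnRef> },
    Computed,             // expression without an alias
}

/// Resolves a reference to its type; None if missing or ambiguous.
fn lookup(scope: &[Binding], col: &ColumnRef) -> Option<Type> {
    let (qualifier, name) = col;
    let mut hits = scope
        .iter()
        .filter(|b| qualifier.as_deref().map_or(true, |q| b.name == q))
        .flat_map(|b| b.columns.iter())
        .filter(|(n, _)| n == name)
        .map(|(_, ty)| ty.clone());
    let first = hits.next()?;
    if hits.next().is_some() { None } else { Some(first) }
}

fn derive_columns(
    items: &[Projection],
    scope: &[Binding],
    col_list: Option<&[&str]>,
) -> Vec<CteColumn> {
    let mut out = Vec::new();
    for item in items {
        match item {
            Projection::Star(qualifier) => {
                for b in scope {
                    if qualifier.as_deref().map_or(true, |q| b.name == q) {
                        for (name, ty) in &b.columns {
                            out.push(CteColumn { name: Some(name.clone()), type_: Some(ty.clone()) });
                        }
                    }
                }
            }
            Projection::Column(col) => out.push(CteColumn {
                name: Some(col.1.clone()),
                type_: lookup(scope, col),
            }),
            Projection::Aliased { alias, single_ref } => out.push(CteColumn {
                name: Some(alias.clone()),
                type_: single_ref.as_ref().and_then(|col| lookup(scope, col)),
            }),
            Projection::Computed => out.push(CteColumn { name: None, type_: None }),
        }
    }
    // A (col-list) renames positionally; derived types are kept as they are.
    if let Some(names) = col_list {
        for (col, new_name) in out.iter_mut().zip(names.iter()) {
            col.name = Some(new_name.to_string());
        }
    }
    out
}
```

In this sketch a (col-list) shorter or longer than the projection simply renames as many columns as it can, which mirrors the text's position that an arity mismatch is left for the engine to report.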

Completion past cte_alias.|. Where the derivation produced named columns (every form above except computed-no-alias), they complete with their names and (where typed) participate in §11's result-type resolution if the CTE's columns are projected upstream. Where the derivation produced an unnamed column slot, that slot is silently skipped from the qualified-prefix candidate list — the user typing cte.| past it sees only the nameable columns. The cure for "I want my expression to be referenceable from outside the CTE" is to add an alias, which is the same cure the engine itself enforces at execution time.

This is substantially better than the earlier "honest limitation" posture: the common SELECT * body is fully resolvable; explicit projections are resolvable; only un-aliased computed columns elude us, and the right learner response there is the same as the engine's right learner response — write an alias.

cte_bindings lives on the scope frame, so a CTE defined in an outer query is visible inside an inner subquery as a table source unless that subquery defines a CTE of the same name (which shadows it, per standard SQL).

10.4. Projection-alias bindings

Standard SQL admits ORDER BY referencing a SELECT-list alias: SELECT a + b AS total FROM t ORDER BY total. A third frame-local accumulator:

projection_aliases: Vec<String>

Each projection_item's optional alias (whether AS x or bare x — see §1) appends its name. Ident slots inside the trailing ORDER BY's sql_exprs offer projection aliases as additional candidates alongside column names. This addresses §1's bare-alias admission's completion behaviour at the same time.

The accumulator is not consulted inside WHERE, GROUP BY, or HAVING — standard SQL forbids alias references there (aliases are not yet bound at evaluation time). The grammar admits them structurally regardless; the engine rejects; ADR-0019 renders the engine-neutral error.

10.5. Qualified-prefix completion

§5 fixed the grammar for t.c references. The completion behaviour at qualified positions:

  • At an Ident cursor with no prefix, candidates are the union of every from_scope binding's columns, plus projection_aliases when in ORDER BY, deduplicated. CTE-name candidates apply only in table-source slots, not column slots.
  • At an Ident cursor immediately after prefix '.', candidates are scoped: resolve prefix against the active from_scope (preferring alias matches over table matches, since aliases shadow), and offer that binding's columns alone. If prefix doesn't resolve to a binding, the candidate list is empty — the walker's expected-set still surfaces the syntactic alternatives (the user sees no column candidates but the structural error message reports the unresolved prefix).

The qualified-prefix narrowing is a small extension to the existing IdentSource::Columns handling: when the matched-path immediately preceding the Ident ends with Ident '.', the completer is told the prefix and narrows accordingly. This is the only completion-source-level change; the rest is data flowing through the new accumulators.
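
The two cursor cases above can be sketched as one function. This is a minimal illustration, not the completer's real API: `resolve_prefix`, `push_unique`, and `column_candidates` are hypothetical names, `TableBinding` is reduced to column names, and exact (case-sensitive) name matching is assumed.

```rust
struct TableBinding {
    table: String,
    alias: Option<String>,
    columns: Vec<String>,
}

/// Resolves a qualifier against the scope, trying aliases before table names.
fn resolve_prefix<'a>(from_scope: &'a [TableBinding], prefix: &str) -> Option<&'a TableBinding> {
    from_scope
        .iter()
        .find(|b| b.alias.as_deref() == Some(prefix))
        .or_else(|| from_scope.iter().find(|b| b.table == prefix))
}

fn push_unique(out: &mut Vec<String>, name: &str) {
    if !out.iter().any(|n| n == name) {
        out.push(name.to_string());
    }
}

/// With a prefix: only the resolved binding's columns (empty if unresolved).
/// Without a prefix: the union of all bindings' columns, plus projection
/// aliases when the cursor is inside ORDER BY.
fn column_candidates(
    from_scope: &[TableBinding],
    prefix: Option<&str>,
    projection_aliases: &[String],
    in_order_by: bool,
) -> Vec<String> {
    let mut out = Vec::new();
    match prefix {
        Some(p) => {
            if let Some(binding) = resolve_prefix(from_scope, p) {
                for col in &binding.columns {
                    push_unique(&mut out, col);
                }
            }
        }
        None => {
            for binding in from_scope {
                for col in &binding.columns {
                    push_unique(&mut out, col);
                }
            }
            if in_order_by {
                for alias in projection_aliases {
                    push_unique(&mut out, alias);
                }
            }
        }
    }
    out
}
```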

10.6. The projection-before-FROM problem

Standard SQL writes projection before FROM. A user typing select col1, col2 from mytable produces, mid-typing, a state where the projection list has been parsed but the FROM has not. At that point the column-name completer cannot scope to mytable — it does not know mytable is coming. Validation and highlighting face the same problem: col1 and col2 cannot be checked as belonging to mytable until the user types from mytable. The debounced re-walk on every keystroke (ADR-0027) is not sufficient on its own to fix this in a single-pass walker, because by the time the FROM is parsed, the projection identifiers have already been resolved (left-to-right) against the only scope information available at that moment — the empty from_scope.

There is no fully satisfying single-pass answer. Phase 2's posture is therefore explicit:

  1. During-typing completion of projection-list column names, when from_scope is empty (no FROM yet), uses the unioned SchemaCache.columns — every column known to the schema — as the candidate set. This is the same global fallback Phase 1 uses and remains the right behaviour: a noisier-but-useful completion is better than no completion.

  2. A post-walk fixup pass re-evaluates projection-list column refs against the final from_scope after the walk completes. The walk records each projection Ident's span and matched-path location; once the walk reaches end-of-input (or end-of-statement), the fixup walks the recorded list, looks up each identifier against the final from_scope, and:

    • Rewrites the highlight class on that terminal — downgrading "column" → "unknown identifier" when the identifier doesn't belong to any in-scope binding, upgrading "unknown identifier" → "column" when it does.
    • Updates the diagnostic for the validity indicator (ADR-0027) — a column-not-found ERROR either appears or disappears based on the post-walk scope.

    Integration point. The fixup runs as the final stage of the walk itself, after all grammar nodes have been processed but before WalkResult is returned to the caller. It mutates the walker's accumulated highlight runs and diagnostics vector in place, so the consumer (the renderer, the validity indicator) sees a single coherent snapshot. This keeps the walker the single source of truth for what reaches the renderer — the fixup is conceptually part of "what the walker produces", not a separate post-processing layer. The same convention applies to the §11.6 SQL-expression predicate warnings, which also run as a final walk stage.

  3. The fixup runs on every debounced re-walk (ADR-0027 already triggers the full walk per keystroke), so the user observes: typing col1, col2 from mytable, the col1 / col2 initially highlight as generic identifiers (with a soft warning if not found anywhere in the schema); the moment mytable is typed, the highlight snaps to the column class if col1 / col2 belong to mytable, or to the unknown-identifier diagnostic if they don't — within one debounce cycle.

The fixup pass does not re-parse; it only re-resolves identifiers against the final from_scope.
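
The fixup as this section describes it can be sketched as a loop over the recorded projection identifiers. This is an illustration of the prescribed behaviour, not of the shipped code (the commit message at the top records that the implementation reached the same visible result through the diagnostic pass instead). `HighlightClass`'s variants, `ProjectionIdent`, `Diagnostic`, and `fixup_projection` are hypothetical names, and the projection-without-FROM case is not modelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum HighlightClass {
    Column,
    UnknownIdentifier,
}

type Span = (usize, usize);

#[derive(Debug, PartialEq)]
struct Diagnostic {
    span: Span,
    message: String,
}

/// What the walk recorded for one projection-list identifier.
struct ProjectionIdent {
    name: String,
    span: Span,
    class: HighlightClass,
}

/// Re-resolves each recorded identifier against the final scope's columns.
/// No re-parse happens; only the class and the diagnostics are rewritten.
fn fixup_projection(
    idents: &mut [ProjectionIdent],
    final_columns: &[&str],
    diagnostics: &mut Vec<Diagnostic>,
) {
    for ident in idents.iter_mut() {
        let known = final_columns.contains(&ident.name.as_str());
        ident.class = if known {
            HighlightClass::Column
        } else {
            HighlightClass::UnknownIdentifier
        };
        // Drop any stale diagnostic on this span, then re-add one if needed.
        // (A real pass would filter by diagnostic kind as well as span.)
        let span = ident.span;
        diagnostics.retain(|d| d.span != span);
        if !known {
            diagnostics.push(Diagnostic {
                span,
                message: format!("unknown column `{}`", ident.name),
            });
        }
    }
}
```

Running the function again with a different final scope flips classes and diagnostics in both directions, which is the "appears or disappears" behaviour described in item 2.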

ORDER BY alias resolution needs no fixup. Projection precedes ORDER BY in walk order, so projection_aliases is fully populated by the time the walker reaches an ORDER BY Ident; the alias-as-column-candidate is resolved in the single forward pass.

This is the answer to the user's "I think this may be automatic" intuition: the debounced re-walk is automatic; the post-walk fixup pass is the new infrastructure that makes the re-walk produce correct results. Without it, projection-list column refs would forever validate against the global column set even after the FROM is typed.

10.7. The honest split

§9 still holds for grammar recursion: Subgrammar and the depth cap are reused unchanged. For completion scope, this section introduces:

  • New WalkContext fields: from_scope, from_scope_stack, cte_bindings, projection_aliases.
  • Scope push/pop discipline at SQL_SELECT_COMPOUND boundaries, driven by the new Node::ScopedSubgrammar variant (§10.2) so that DSL Subgrammars are unaffected.
  • A qualified-prefix narrowing in the IdentSource::Columns completion path.
  • A post-walk fixup pass for projection-list identifier highlighting and validity (§10.6).

These are real walker-contract extensions. They are scoped: one new node variant (Node::ScopedSubgrammar, §10.2), no new walk-driver entry points, no changes to how Subgrammar bodies are entered structurally. The existing DSL paths are unaffected (their grammars never push a SELECT scope, never define a CTE, never carry projection aliases), and the single-table current_table / current_table_columns view is preserved as a derived helper.

§9's claim is therefore restated honestly: grammar recursion needs no new walker capability; completion scope needs the additions above.
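The scope push/pop discipline and the derived single-table view can be sketched roughly as below. This is a minimal illustration, not the walker's code: the field names come from this section, but the `Binding` shape, the method names (`push_scope`, `pop_scope`, `active`, `active_mut`), and the rule that `current_table` is `Some` only when the top frame holds exactly one binding are assumptions made for the example.

```rust
use std::collections::HashMap;

/// One FROM/JOIN binding (hypothetical shape): the name it is visible
/// under (alias if present, else the table name) and its columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Binding {
    pub visible_name: String,
    pub columns: Vec<String>,
}

/// Per-SELECT accumulators. Field names follow the ADR; types are assumed.
#[derive(Debug, Default)]
pub struct ScopeFrame {
    pub from_scope: Vec<Binding>,
    pub cte_bindings: HashMap<String, Vec<String>>,
    pub projection_aliases: Vec<String>,
}

#[derive(Debug, Default)]
pub struct WalkContext {
    pub from_scope_stack: Vec<ScopeFrame>,
}

impl WalkContext {
    /// Called on entry to a scoped SELECT body: start with a fresh frame.
    pub fn push_scope(&mut self) {
        self.from_scope_stack.push(ScopeFrame::default());
    }

    /// Called on exit: the finished frame is handed back to the caller.
    pub fn pop_scope(&mut self) -> Option<ScopeFrame> {
        self.from_scope_stack.pop()
    }

    pub fn active(&self) -> Option<&ScopeFrame> {
        self.from_scope_stack.last()
    }

    pub fn active_mut(&mut self) -> Option<&mut ScopeFrame> {
        self.from_scope_stack.last_mut()
    }

    /// Derived helper for the DSL paths: the single-binding view of the
    /// top frame (assumed here to be None for zero or several bindings).
    pub fn current_table(&self) -> Option<&str> {
        match self.active()?.from_scope.as_slice() {
            [only] => Some(only.visible_name.as_str()),
            _ => None,
        }
    }
}
```

Because DSL grammars never push a second frame, a helper like `current_table` keeps answering from the one frame they populate, which is the sense in which the existing paths are unaffected.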

11. Diagnostics for Phase-2 validation cases

ADR-0027 fixes the warning-vs-error guideline verbatim:

ERROR — the input is known to fail. Either it does not parse (incomplete, or a mismatched / invalid token), or it parses but names something that does not exist (an unknown table or column).

WARNING — the input is valid and will run, but is very likely not what a knowledgeable user wants: a type-mismatched comparison, or = NULL (both from ADR-0026 §7). Amendment 1 adds a third trigger — LIKE against a numeric column.

The split is certainty of failure versus likely misleading.

This section walks the Phase-2 surface case-by-case, classifies each against that guideline, and identifies the diagnostic machinery additions needed. It also flags a Phase-1 carry-over gap (§11.6) that Phase 2 closes.

11.1. Existing diagnostics, briefly

Two post-walk passes today (src/dsl/walker/mod.rs):

  • Schema-existence pass (ERROR). Walks the MatchedPath, checks every IdentSource::Tables / IdentSource::Columns ident against SchemaCache. Emits diagnostic.unknown_table and diagnostic.unknown_column. Today this assumes a single current_table for column resolution.
  • Expression predicate-warnings pass (WARNING). Walks the parsed DSL Expr AST emitted by expr.rs's builder. Emits diagnostic.eq_null, diagnostic.type_mismatch, diagnostic.like_numeric. Runs only on WHERE expressions in the DSL.

Phase 2 extends both, and §11.6 fills a SQL-side gap.

11.2. Phase-2 new ERROR cases

Every case below is "known to fail on the engine" — the engine would surface a message the friendly-error layer would translate (ADR-0019). Surfacing them as pre-flight ERROR diagnostics gives the learner the answer one debounce cycle faster, with the walker as the single source of truth.

  • Unknown table in any FROM/JOIN slot. The existing schema-existence pass extends from "the one current_table" to walking every from_scope binding's table and emitting diagnostic.unknown_table per unresolved name. CTE-name slots in the active cte_bindings are valid table sources and exempt from this check.
  • Unknown CTE-as-table. A table-source slot whose name is not in SchemaCache.tables and not in the active cte_bindings chain emits diagnostic.unknown_table (same catalog key — from the learner's perspective the engine message is the same; the slot is a "table that doesn't exist", whether they meant a CTE or a base table).
  • Unknown table or alias in a qualified column reference (t.c where t doesn't resolve in the active from_scope). New catalog key diagnostic.unknown_qualifier {qualifier}.
  • Unknown column in a qualified reference (t.c where t resolves but c is not a column of that binding). Reuses diagnostic.unknown_column with the column name in context.
  • Ambiguous unqualified column reference: a column name used unqualified that exists in two or more from_scope bindings. The engine raises "ambiguous column name"; we surface it as ERROR with a new catalog key diagnostic.ambiguous_column {column}, {qualifiers} so the learner sees every binding the name appeared in.
  • Reference to a projection alias in WHERE / GROUP BY / HAVING. Standard SQL forbids it (aliases are not bound at evaluation time). The grammar admits the identifier structurally; a new diagnostic pass emits ERROR with a new catalog key diagnostic.projection_alias_misplaced {alias}, {clause}.
  • CTE column-list arity mismatch. When cte_name (col1, col2, …) AS (compound_select) declares N columns and the body's projection (§10.3) derives M columns with N ≠ M, the CTE harvest pass (§10.3 stage 2) emits ERROR with a new catalog key diagnostic.cte_arity_mismatch {cte}, {declared}, {actual}.
  • Compound-query column-count mismatch. When a UNION / INTERSECT / EXCEPT chain has legs whose projection arities differ, the engine errors at execution. Phase 2 catches it pre-flight: each leg's derived arity (the same derivation the CTE harvest uses) is compared as the compound is assembled. ERROR with a new catalog key diagnostic.compound_arity_mismatch {op}, {left_n}, {right_n}.
  • Internal-table reference in any new table-source slot. Already a parse-time rejection via reject_internal_table (§1, §4) — surfaces as a parse error, not a post-walk diagnostic. Listed here for completeness: the catalog key select.internal_table authored in Phase 1 covers every Phase-2 slot too.
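The qualified-reference and ambiguity cases above reduce to one small decision over the active from_scope. The sketch below is illustrative only: `Binding`, `ColumnDiag`, and `check_column_ref` are hypothetical names, and exact string equality stands in for whatever identifier comparison the walker really uses.

```rust
/// One FROM/JOIN binding (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Binding {
    pub visible_name: String,
    pub columns: Vec<String>,
}

/// Outcome of checking one column reference; the variants mirror the
/// catalog keys named in this section.
#[derive(Debug, PartialEq)]
pub enum ColumnDiag {
    Ok,
    UnknownQualifier { qualifier: String },
    UnknownColumn { column: String },
    AmbiguousColumn { column: String, qualifiers: Vec<String> },
}

pub fn check_column_ref(
    scope: &[Binding],
    qualifier: Option<&str>,
    column: &str,
) -> ColumnDiag {
    // Exact match is a simplification made for this sketch.
    let has = |b: &Binding| b.columns.iter().any(|c| c == column);
    match qualifier {
        // t.c: the qualifier must resolve first, then the column within it.
        Some(q) => match scope.iter().find(|b| b.visible_name == q) {
            None => ColumnDiag::UnknownQualifier { qualifier: q.to_string() },
            Some(b) if has(b) => ColumnDiag::Ok,
            Some(_) => ColumnDiag::UnknownColumn { column: column.to_string() },
        },
        // Bare c: count how many bindings own a column of that name.
        None => {
            let owners: Vec<String> = scope
                .iter()
                .filter(|&b| has(b))
                .map(|b| b.visible_name.clone())
                .collect();
            match owners.len() {
                0 => ColumnDiag::UnknownColumn { column: column.to_string() },
                1 => ColumnDiag::Ok,
                _ => ColumnDiag::AmbiguousColumn {
                    column: column.to_string(),
                    qualifiers: owners,
                },
            }
        }
    }
}
```

Returning the full owner list for the ambiguous case is what lets the {qualifiers} slot name every binding involved rather than only the first two.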

11.3. Phase-2 new WARNING cases

The existing WARNING set (= NULL, type-mismatched comparison, LIKE-on-numeric) is the right set. Phase-2 adds no new WARNING categories — every Phase-2-specific case falls into ERROR (§11.2) or engine-rejected (§11.4).

Considered and rejected as WARNINGs:

  • CTE name shadowing a base table. Standard SQL behaviour; often intentional (the canonical "filter to a subset, then query as if it were the base table" pattern). No diagnostic.
  • Correlated reference without explicit qualification. Correlation is implicit in standard SQL; per the user guideline a knowledgeable user does want this. The walker validates the reference silently against the outer-frame scope; no warning, no diagnostic.
  • Unused CTE. A CTE defined in WITH but never referenced. The engine ignores it; many learners write CTEs as intermediate scratch space. Not a warning.

11.4. Engine-rejected (no diagnostic)

These fail on the engine and surface via ADR-0019's friendly-error layer at execution time. The walker does not attempt pre-flight detection because:

  • Non-aggregated columns in projection with GROUP BY — detecting requires knowing which function names are aggregates; ADR-0030 §13 OOS-3 / ADR-0031 §6 keep us allowlist-free.
  • Aggregate function in WHERE — same reason.
  • Scalar subquery returning multiple rows — semantic, not syntactic; requires execution.
  • Recursive CTE without a UNION — requires inspection of the body's compound shape against the recursive contract; doable in principle, deferred as engine territory.
  • Duplicate CTE names within the same WITH — checkable in principle (walking cte_bindings for duplicates), but the engine catches it cleanly. Phase 2 does not pre-flight it; could be added later if its absence proves confusing.
  • Type-mismatched JOIN ON predicates — the existing expression type-mismatch warning (extended per §11.6) handles the explicit-literal case; arbitrary-expression cases require type inference and stay engine-side.

11.5. Catalog additions

Phase 2 adds the following message-catalog keys (ADR-0019). Every key is engine-neutral by construction.

Parse-time-detectable (post-walk diagnostic passes):

| Key | Slots |
| --- | --- |
| diagnostic.unknown_qualifier | {qualifier} |
| diagnostic.ambiguous_column | {column}, {qualifiers} |
| diagnostic.projection_alias_misplaced | {alias}, {clause} |
| diagnostic.cte_arity_mismatch | {cte}, {declared}, {actual} |
| diagnostic.compound_arity_mismatch | {op}, {left_n}, {right_n} |

Engine-error translations (friendly-error layer; reached on execution failure):

| Key | Engine cause |
| --- | --- |
| engine.no_such_table | no such table: <name> (post-execution path) |
| engine.no_such_column | no such column: <name> (post-execution path) |
| engine.ambiguous_column | ambiguous column name: <name> |
| engine.aggregate_misuse | misuse of aggregate function <name>() |
| engine.group_by_required | column must appear in the GROUP BY clause or be used in an aggregate function (or equivalent) |
| engine.compound_arity_mismatch | SELECTs to the left and right of UNION do not have the same number of result columns (or equivalent) |
| engine.scalar_subquery_too_many_rows | scalar subquery cardinality violation |
| engine.recursive_cte_malformed | recursive CTE shape errors |

The parse-time keys and the engine keys are intentionally separate even when they describe the same situation (engine.ambiguous_column mirrors diagnostic.ambiguous_column) — the parse-time message can include the learner's typed text and span; the engine-time message catches what the parser missed and routes through the friendly-error layer with whatever context the engine yielded.

Three pre-existing parse-time keys are reused unchanged for Phase-2 slots: diagnostic.unknown_table, diagnostic.unknown_column, and the Phase-1 select.internal_table.

11.6. The Phase-1 SQL-expression predicate-warning gap

ADR-0027 Amendment 1's LIKE-on-numeric warning, and ADR-0026 §7's = NULL and type-mismatch warnings, are emitted by a pass that walks the DSL Expr AST. Phase 1's sql_expr.rs deliberately builds no AST (ADR-0031 §2). The consequence is a Phase-1 carry-over gap: SQL WHERE expressions today emit none of these warnings. For example, select * from t where name like 5 parses, the engine runs it, and the learner gets the engine's verdict, not the friendly pre-flight nudge ADR-0027 Amendment 1 promised.

Phase 2 closes this. The predicate-warnings pass gains a MatchedPath-walking variant that runs over the SQL expression nodes and identifies the predicate shapes structurally (a LIKE predicate-tail with a column-ref left operand; a =/!= predicate-tail with a NULL literal operand; a comparison predicate-tail with a column-literal operand pair of mismatched types). It does not need an Expr AST because the matched-path terminals carry both the byte spans (for the diagnostic) and the node-name labels (for shape identification). The same catalog keys (diagnostic.eq_null, diagnostic.type_mismatch, diagnostic.like_numeric) apply unchanged; only the pass implementation is new.

The MatchedPath-walking pass runs over every Phase-2 sql_expr slot — WHERE, HAVING, ON, CASE branches, projection items, ORDER BY items — so warnings surface uniformly across the SQL surface rather than just WHERE. This is a strict improvement over Phase 1's behaviour, where even Phase-1 SELECT WHERE expressions got no predicate warnings.

Type-resolution for the MatchedPath-walking pass: a column ref's type comes from §10's from_scope (or, for t.c, the specific binding); a literal's type comes from its lexical class. When the column ref doesn't resolve (the schema-existence ERROR pass will already have flagged it), the warning pass skips the predicate; there is no point compounding diagnostics on an already-broken reference.
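The classification this pass performs can be sketched as a pure function over an already-identified predicate shape. Everything here is a simplification for illustration: `ValueKind`, `Literal`, `Op`, and `predicate_warning` are hypothetical names, the two-way text/numeric split stands in for the real playground types, and how LIKE against a non-numeric column should be classified is deliberately left out.

```rust
/// Coarse type class of a resolved column (hypothetical; the real pass
/// would use the playground types from the from_scope binding).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ValueKind {
    Text,
    Numeric,
}

/// Lexical class of the literal operand.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Literal {
    Null,
    Text,
    Number,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op {
    Eq,
    NotEq,
    Lt,
    Gt,
    Like,
}

/// Returns the catalog key of the warning to emit, if any.
/// `column` is None when the column ref did not resolve.
pub fn predicate_warning(
    column: Option<ValueKind>,
    op: Op,
    rhs: Literal,
) -> Option<&'static str> {
    // Unresolved reference: the schema-existence pass already flagged it.
    let col = column?;
    match (op, rhs) {
        (Op::Eq | Op::NotEq, Literal::Null) => Some("diagnostic.eq_null"),
        (Op::Like, _) if col == ValueKind::Numeric => Some("diagnostic.like_numeric"),
        // Other LIKE shapes are left unclassified in this sketch.
        (Op::Like, _) => None,
        (_, Literal::Text) if col == ValueKind::Numeric => Some("diagnostic.type_mismatch"),
        (_, Literal::Number) if col == ValueKind::Text => Some("diagnostic.type_mismatch"),
        _ => None,
    }
}
```

The early return on an unresolved column is the "skip the predicate" rule from the paragraph above; the rest is a lookup that needs only the operator label and the operands' classes, which is why no Expr AST is required.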

11.7. Mechanism summary

Three diagnostic passes by end of Phase 2, all running as final stages of the walk (per §10.6's integration-point convention):

  1. Schema-existence ERROR pass — extended from single current_table to walking every from_scope binding and the active cte_bindings. Adds the qualified-reference and ambiguity checks (§11.2).
  2. Arity-check ERROR pass (new) — runs at CTE-body and compound-query frame-exits (the same ScopedSubgrammar exit hook §10.3 uses), comparing declared vs derived column counts.
  3. Predicate-warnings pass — extended with a MatchedPath-walking variant for sql_expr (§11.6) covering = NULL, type mismatch, and LIKE-on-numeric across every SQL expression slot, in addition to the existing DSL Expr AST variant for DSL expressions.

Per the integration-point convention (§10.6), each pass mutates the walker's accumulated highlight runs and diagnostics in place; the consumer sees a single coherent snapshot.

The projection-list fixup of §10.6 is conceptually part of pass (1) — it is the same "re-resolve identifier against final scope" operation, applied to the small subset of identifiers whose scope wasn't fully known at first-pass walk time.
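Pass (2) is the simplest of the three to picture. A rough sketch follows, assuming each leg's derived arity is already available as a number; `ArityDiag`, `check_cte_arity`, and `check_compound_arity` are hypothetical names, and comparing every later leg against the first is one plausible reading of "compared as the compound is assembled".

```rust
/// Arity diagnostics; fields mirror the catalog slots of §11.5.
#[derive(Debug, PartialEq)]
pub enum ArityDiag {
    CteArityMismatch { cte: String, declared: usize, actual: usize },
    CompoundArityMismatch { op: String, left_n: usize, right_n: usize },
}

/// `declared` is None when the CTE has no explicit column list, in
/// which case there is nothing to compare.
pub fn check_cte_arity(
    cte: &str,
    declared: Option<usize>,
    actual: usize,
) -> Option<ArityDiag> {
    match declared {
        Some(n) if n != actual => Some(ArityDiag::CteArityMismatch {
            cte: cte.to_string(),
            declared: n,
            actual,
        }),
        _ => None,
    }
}

/// `first_n` is the first leg's arity; `rest` pairs each set operator
/// with the arity of the leg to its right. Reports the first mismatch.
pub fn check_compound_arity(
    first_n: usize,
    rest: &[(&str, usize)],
) -> Option<ArityDiag> {
    rest.iter()
        .find(|(_, n)| *n != first_n)
        .map(|(op, n)| ArityDiag::CompoundArityMismatch {
            op: (*op).to_string(),
            left_n: first_n,
            right_n: *n,
        })
}
```

Stopping at the first mismatching leg keeps the output to one diagnostic per compound, which matches the single {op}, {left_n}, {right_n} slot set.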

12. Result-column type resolution

Phase 1's column_types: Vec<None> is partially lifted: where a projection item is structurally a single column reference, the worker resolves it back to the source column's playground type (ADR-0005) and populates that slot in DataResult.column_types. Everything else stays None.

This addresses Phase-1 autonomous decision §4.5 (bool SELECT results render as 0/1): a bare bool column now renders as true / false again, alignment recovers, and the show data rendering path is reached for the common case.

Resolution rule. A projection item is "structurally a single column reference" when, after stripping an optional [ AS ] alias, its expression is one of:

  • An unqualified identifier (Name) that resolves uniquely to a single column across the FROM tables;
  • A qualified reference (t.c / alias.c) that resolves unambiguously through the FROM aliases.

Anything else — function calls, arithmetic, CASE, literals, subquery expressions, the * and t.* wildcards — keeps column_types[i] = None. When resolution is ambiguous (unqualified column name appears in two FROM tables) the grammar admits it (engine resolves or errors); the type-resolver returns None and the renderer falls back to neutral alignment.

Implementation seam. The strongly preferred mechanism is engine-side column-origin lookup: after preparing the statement, query the prepared statement for each result column's underlying table and column. The engine knows authoritatively which result columns are direct references and which are expressions; for direct references it returns the source table+column, for expressions it returns nothing. This avoids re-parsing the source or adding structured projection-item data to the MatchedPath — the grammar tier is not involved at all, which preserves ADR-0031 §2's "no AST" decision and stays on the right side of ADR-0030's "one source of truth" rule.

The Phase-2 implementer verifies that the rusqlite version pinned in Cargo.toml exposes this metadata (the SQLite C API calls are sqlite3_column_table_name / sqlite3_column_origin_name — they have been stable for two decades; rusqlite either exposes them directly or via the underlying *mut sqlite3_stmt handle). If exposure turns out to be awkward, the fallback is a small post-parse walk over the projection-item subtrees in the MatchedPath — strictly worse because it duplicates a slice of parsing, but available.

The resolution pass adds one method on Database (something like resolve_select_column_types) called from do_run_select before the DataResult is shipped. It takes the prepared statement and the active SchemaCache, and returns Vec<Option<Type>>. The renderer needs no change — None slots already render as typeless.

This is the only execution-path change Phase 2 makes; everything else routes through Phase 1's grammar-as-text execution.

13. Out of scope

  • OOS-1. Derived tables in FROM: FROM (SELECT …) [AS] alias. The same shapes are reachable via CTEs (§4), which Phase 2 ships. Derived tables in FROM are not authored here.
  • OOS-2. NATURAL JOIN and JOIN … USING (col). Both are convenience forms. NATURAL is widely considered a footgun; USING is cleaner but adds grammar weight without lifting any expressive ceiling. Out.
  • OOS-3. Comma-list FROM t1, t2 (implicit cross join). Out. CROSS JOIN covers the same shape explicitly.
  • OOS-4. LIMIT m, n (the legacy comma form). Out (§8).
  • OOS-5. Window functions (OVER (…), PARTITION BY, window-frame syntax). A meaningful learning topic, but a large surface of its own and out of ADR-0030's commissioned set.
  • OOS-6. LATERAL joins. Not commissioned by ADR-0030.
  • OOS-7. VALUES (…) as a row source. Not commissioned.
  • OOS-8. A function/aggregate allowlist — ADR-0030 §13 OOS-3 / ADR-0031 §7 OOS-4 still apply: aggregate names parse generically through name_or_call.
  • OOS-9. Quoted identifiers ("column name"). Tracked as ADR-0031 §7 OOS-3, still tracked.
  • OOS-10. Engine-checked aggregate correctness at parse time. The grammar admits structurally; engine rejects semantically; ADR-0019 surfaces the engine's verdict in engine-neutral wording (§7).
  • OOS-11. Result-column type resolution beyond bare column refs. Computed columns (a + b, upper(name), CASE …) stay typeless (§12).
  • OOS-12. The help sql page and parse-error usage entries for the Phase-2 surface. The grammar carries the help_ids authored in this phase, but the page content and the rich per-command usage messages are Phase 6 (ADR-0030 §10) and ADR-0021. Phase 2 leaves the same help_id: None shape Phase 1 used for select.

Consequences

  • A new grammar file, src/dsl/grammar/sql_select.rs, parallel to sql_expr.rs, exporting pub static SQL_SELECT_STATEMENT: Node and pub static SQL_SELECT_COMPOUND: Node. The Phase-1 data::SELECT CommandNode is rebuilt against SQL_SELECT_STATEMENT (its body becomes a Subgrammar reference); the CommandNode itself stays.
  • Phase-1 SQL SELECT grammar nodes migrate. The Phase-1 static nodes that live in src/dsl/grammar/data.rs for the single-table SELECT (the projection, FROM, WHERE, ORDER-BY, LIMIT sub-trees) move into sql_select.rs as the starting-point for the §1 productions; the file leaves only the CommandNode shell behind. The seven Phase-1 SQL SELECT integration tests are part of the safety net for this migration — they must continue to pass under the rebuilt grammar, in addition to the new Phase-2 integration tests authored in step 4 of the implementation notes.
  • Hint-panel prose for the new clauses (JOIN flavours, ON, GROUP BY, HAVING, UNION / INTERSECT / EXCEPT, WITH, OFFSET, the qualified-prefix and CTE-prefix completion states) is authored at the structural level alongside each grammar node in step 1: a one-liner per slot, enough to drive the hint panel. Richer per-clause teaching prose and the help sql reference page remain ADR-0030 Phase 6 work (§13 OOS-12).
  • Walker cost is expected to stay proportional to source length. The new accumulators are O(bindings + aliases) per frame; the scope stack is bounded by MAX_SUBGRAMMAR_DEPTH = 64 (§9); the §10.6 post-walk fixup pass touches one entry per projection-list Ident (a small set). Each debounced keystroke (ADR-0027) walks once, fixes up once, and emits a single coherent output. No new pathological case is introduced — if a learner-realistic query produces a noticeable typing-time stall, measure first and revisit the recursion budget or the accumulator structure on evidence.
  • sql_expr.rs gains three additive Choice branches and one additive tail on name_or_call (§5, §6). The existing tiers and the depth-cap discipline are unchanged. The Phase-1 tests continue to exercise the existing branches as they stand.
  • No new walker capability for grammar recursion (§9). Subgrammar, the depth counter, the cap, and the friendly depth error are all reused unchanged, the same posture ADR-0031 took.
  • Command::Select { sql: String } is unchanged. The validated source SQL is simply larger; the worker still routes it through Database::run_select and do_run_select (Phase 1 path).
  • The worker gains a post-prepare type-resolution helper that populates column_types for direct-reference projection items (§12) via the engine's column-origin metadata. Cargo.toml gains column_metadata to rusqlite's feature list (alongside the existing bundled); this pulls in the SQLite SQLITE_ENABLE_COLUMN_METADATA compile flag and exposes RawStatement::column_table_name / column_origin_name / column_database_name on the prepared statement. Verified against the project's pinned rusqlite 0.39.0. This is the only Phase-2 execution-path change.
  • Three diagnostic passes (§11.7) — schema-existence (extended), CTE/compound arity-check (new), and predicate warnings (extended with a MatchedPath-walking variant for sql_expr — §11.6). All run as final walk stages and mutate the walker's accumulated output in place. Closes the Phase-1 carry-over gap where SQL WHERE expressions emitted no LIKE-on-numeric / type-mismatch / = NULL warnings.
  • Catalog additions (§11.5) — five new diagnostic.* keys for parse-time-detectable cases and eight new engine.* keys for friendly-error layer translations of engine messages.
  • The walker's WalkContext gains the completion-scope accumulators of §10: a from_scope_stack: Vec<ScopeFrame> whose top frame is the active from_scope / cte_bindings / projection_aliases. A new node variant Node::ScopedSubgrammar(&Node) (§10.2) is the trigger for push/pop; existing Node::Subgrammar is unchanged so DSL Expr and sql_expr recursion are unaffected. A post-walk fixup pass re-resolves projection-list identifier highlighting and validity once the final from_scope is known (§10.6). CTE output columns are derived from the body's projection list at body-frame exit, populating the binding back into the outer frame (§10.3), so SELECT * and explicit-projection CTE bodies both yield real column completion past cte_alias.|. This softens §9's "no new walker capability" claim for completion scope; grammar recursion still needs nothing new.
  • __rdbms_* rejection extends to every table-source slot introduced by Phase 2: the FROM table, each JOIN's table, each CTE name, and the FROM table inside any CTE body (§4, §6). The reject_internal_table validator is reused.
  • Completion gains: SQL keywords for joins / set ops / WITH / GROUP / HAVING / OFFSET (all walker-derived, no bespoke code); column completion scoped to a qualified prefix t. resolves through the active SchemaCache (§5).
  • Phase-1 autonomous decisions §4.1 and §4.3–§4.4 stand (optional FROM, help_id: None, walker-mode defaults). §4.2 is lifted (bare-alias projection admitted, §1). §4.5 is partially lifted (bare bool column refs recover their type via §12).
  • requirements.md's Q1 / Q2 advance further; Q4 was already ticked by ADR-0030 and ADR-0031.

Implementation notes

A build order, each step guarded by the test suite. The phases within Phase 2 mirror the ADR-0030 / ADR-0031 staging — grammar first, execution-path change last.

Detailed plan: docs/plans/20260520-adr-0032-phase-2.md. The notes below are the outline; the plan refines them into seven sub-phases (2a–2g) with per-gate exit criteria, a cross-cut verification matrix that explicitly tests every "X comes for free" claim from ADR-0030/0031/0032 (the kind of implicit claim that produced the Phase-1 SQL-expression predicate-warning gap §11.6 closes), and a final phase-exit verification report template. Implementers work through the plan; the ADR remains the decisions.

  1. The sql_select.rs grammar fragment. Author the stratified tiers of §1 as named static Nodes, recursion via Subgrammar. Export SQL_SELECT_STATEMENT and SQL_SELECT_COMPOUND. The existing data::SELECT CommandNode is rebuilt against SQL_SELECT_STATEMENT.
  2. Unit tests against the fragment directly (the expr.rs / sql_expr.rs test pattern): JOIN flavours, GROUP BY / HAVING, qualified refs, every set-op, recursive and non-recursive CTEs, LIMIT … OFFSET, DISTINCT, t.* projection, the bare-alias projection, plus the keyword-case-insensitivity check.
  3. sql_expr.rs additive extensions (§5, §6): the qualified-ref tail on name_or_call; the scalar-subquery primary branch; the IN (subquery) predicate-tail branch; the EXISTS (subquery) primary branch. Unit tests for each.
  4. Integration tests (the tests/ Tier-3 path, building on Phase 1's SQL SELECT tests): each JOIN flavour returns the expected rows; GROUP BY / HAVING aggregates over real data; UNION / INTERSECT / EXCEPT between two SELECTs; a non-recursive CTE; a recursive CTE (a small tree traversal or generated-sequence example); a scalar subquery in WHERE; IN (SELECT …); EXISTS (…); qualified refs resolving correctly.
  5. The WalkContext scope accumulators (§10). Add the ScopeFrame type (from_scope / cte_bindings / projection_aliases) and the from_scope_stack; add the Node::ScopedSubgrammar(&Node) variant alongside the existing Node::Subgrammar; teach the driver to push/pop a fresh frame on ScopedSubgrammar entry/exit; rewrite every reference to &SQL_SELECT_COMPOUND from outside its own definition to use the new variant (subqueries in sql_expr.rs, CTE bodies in sql_select.rs); teach from_clause / join_clause to populate the frame's from_scope; teach with_clause to push placeholder CTE bindings before the body and harvest derived output columns on body-exit per §10.3; teach projection_item to append to projection_aliases. Keep current_table / current_table_columns as derived helpers (top frame's single-binding view) so the DSL paths stay green.
  6. Qualified-prefix completion (§10.5). When the matched-path immediately preceding an IdentSource::Columns slot ends with Ident '.', narrow candidates to the named binding's columns. Unit tests: select t. Tab offers t's columns; an unresolved prefix returns an empty list.
  7. Post-walk fixup pass (§10.6). Collect projection-list Ident terminals during the walk; after the walk, re-resolve each against the final from_scope, rewriting the highlight class and validity diagnostic. Tests: typing select col1 from t lights col1 correctly once t is typed; typing select bogus from t produces a column-not-found diagnostic.
  8. Diagnostic passes (§11). Extend the schema-existence ERROR pass to walk every from_scope binding plus cte_bindings; add the qualified-reference and ambiguity checks (§11.2). Add the new arity-check ERROR pass at the CTE-body and compound-query frame-exit hooks (§11.7 case 2). Extend the predicate-warnings pass with a MatchedPath-walking variant covering every Phase-2 sql_expr slot (§11.6) — closes the Phase-1 carry-over gap. Author the five new diagnostic.* catalog keys and the eight new engine.* translation keys (§11.5). Tests: one positive and one negative case per new ERROR key; predicate warnings firing on select * from t where col like 5 (the Phase-1 gap closure); arity-mismatch ERRORs on a CTE and on a UNION.
  9. Result-column type resolution (§12). Add "column_metadata" to rusqlite's feature list in Cargo.toml. The worker's do_run_select calls the new resolver (RawStatement::column_table_name / column_origin_name per result column) before constructing the DataResult. Tests: a single-column SELECT recovers the playground type (covering each of the ten types, the pedagogically important one being bool → true / false); a SELECT with a computed projection keeps it typeless; a SELECT through a CTE recovers the underlying column's type if the engine's column-origin metadata follows through the CTE (verified, not assumed).
  10. Highlighting / completion / hint spot-checks via the typing-surface matrix (ADR-0022 / ADR-0030 §8): a SELECT with a JOIN highlights the JOIN keywords; Tab past select t. offers columns of t; column completion inside a WHERE after from a join b on … offers both a's and b's columns; column completion inside a correlated subquery sees the outer scope; the [ERR] indicator fires on a malformed subquery; an out-of-subset construct (e.g. OVER (…)) produces an engine-neutral parse error.
  11. reject_internal_table spot-checks against every new table-source slot: a FROM __rdbms_columns parse-rejects; a WITH __rdbms_x AS (…) parse-rejects; a FROM inside a CTE body referencing __rdbms_* parse-rejects.
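Step 6's narrowing is small enough to sketch in isolation. The version below is an assumption-laden illustration: `Binding` and `column_candidates` are hypothetical names, the qualifier is matched by exact equality, and the unqualified case simply offers every binding's columns without de-duplication.

```rust
/// One FROM/JOIN binding (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Binding {
    pub visible_name: String,
    pub columns: Vec<String>,
}

/// Column candidates for a completion request.
/// `qualifier` is Some("t") when the slot is preceded by `t.`;
/// `typed` is the partial column name typed so far (possibly empty).
pub fn column_candidates(
    scope: &[Binding],
    qualifier: Option<&str>,
    typed: &str,
) -> Vec<String> {
    scope
        .iter()
        // With a qualifier, keep only the named binding; an unresolved
        // qualifier therefore yields an empty list.
        .filter(|b| qualifier.map_or(true, |q| b.visible_name == q))
        .flat_map(|b| b.columns.iter())
        .filter(|c| c.starts_with(typed))
        .cloned()
        .collect()
}
```

For a scope binding t with columns id and name, `column_candidates(&scope, Some("t"), "")` would offer both columns, which is the `select t.` Tab case the step's unit tests describe.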

Later phases continue ADR-0030's plan unchanged — Phase 3 (DML), Phase 4 (DDL), Phase 5 (DSL → SQL echo), Phase 6 (polish). ADR-0030 §13 OOS items (window functions, LATERAL, function allowlist, quoted identifiers) remain tracked separately and are authored if and when they are taken up; they are not implicit follow-ups of Phase 2.

Amendment 1 — Empirical scope of column-origin metadata (2026-05-20)

§12 was written conservatively: it constrained type recovery to projection items "structurally a single column reference" and listed "subquery expressions" alongside arithmetic and CASE as cases that stay None. The implementation plan's Open Question 1 (docs/plans/20260520-adr-0032-phase-2.md) captured the matching uncertainty about CTEs and scalar subqueries, leaving the test in sub-phase 2f to "assert the actual behaviour (not the wished-for behaviour)".

A throwaway probe against the pinned bundled SQLite (run 2026-05-20, with rusqlite 0.39.0 + column_metadata) settles the question. Across twenty representative query shapes, the engine's sqlite3_column_table_name / sqlite3_column_origin_name metadata follows through:

  • direct bare column refs (the baseline);
  • AS alias projections (the alias remaps the output name but the origin pair stays the source (table, column));
  • table-alias qualified refs (u.name resolves to the origin pair (users, name));
  • non-recursive CTEs, including SELECT * bodies, bare-ref bodies, qualified-ref bodies, and (col-list)-renamed bodies (the rename remaps the output name; origin stays the underlying column);
  • CTE chains (a CTE that selects from a prior CTE — origin traces back to the base table);
  • derived tables in FROM (SELECT …) AS sub (out-of-scope for Phase 2 per §13 OOS-1, but useful to note: if ever admitted, type recovery comes for free);
  • scalar subqueries used as a projection primary (SELECT (SELECT name FROM users WHERE id = 1) — origin is preserved whether the subquery has an outer alias or not);
  • UNION / UNION ALL / INTERSECT / EXCEPT compound queries (result columns carry the first leg's origin);
  • multi-table JOIN projections (per-column origin per leg);
  • IN (SELECT …) subqueries in WHERE (the inner subquery does not affect the outer projection's origin).

The metadata returns None for exactly two structural classes:

  • Computed projections — function calls, arithmetic expressions, string concatenation, CASE expressions, literals, the * and t.* wildcards. Expected; pedagogically obvious; no surprise for the learner.
  • Recursive CTE result columns (WITH RECURSIVE r(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM r WHERE n < 5) SELECT n FROM r). The recursion materialises through an internal temporary table that has no base-column origin to point at. This is the one structural surprise — a recursive-CTE result column is typeless even when it is structurally a bare name reference, because the engine cannot trace the column back past the recursion.

What §12's resolution rule becomes

The original §12 rule classifies projection items structurally (unqualified ident / qualified ref → recover; everything else → None). The empirical finding makes that classification redundant and slightly wrong: it misses scalar subqueries and CTE-routed refs that the engine does carry through, and it would have needed extending for (col-list)-renamed CTEs.

The amended posture: trust the engine's column-origin metadata verbatim. For each result column, call column_table_name(i) / column_origin_name(i). If both return Some, look the pair up in the active SchemaCache and use the playground type. If either is None, the slot stays None and the renderer falls back to neutral alignment. No structural classification of the projection item is needed; the grammar tier stays uninvolved (preserving ADR-0031 §2's "no AST" decision and ADR-0030's "one source of truth" rule, both as before).
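The amended rule amounts to a lookup per result column. The sketch below shows only that lookup and deliberately stays away from rusqlite: it takes the origin pairs as plain input, and `PlaygroundType`, `SchemaTypes`, and this function signature are stand-ins invented for the example (the helper described in §12 takes the prepared statement and the SchemaCache instead).

```rust
use std::collections::HashMap;

/// Stand-in for the playground type enum (the real one has ten types).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PlaygroundType {
    Bool,
    Int,
    Text,
}

/// Stand-in for the schema cache: (table, column) -> playground type.
pub type SchemaTypes = HashMap<(String, String), PlaygroundType>;

/// `origins[i]` is the (table, column) origin pair the engine reported
/// for result column i; either half may be missing.
pub fn resolve_select_column_types(
    origins: &[(Option<&str>, Option<&str>)],
    schema: &SchemaTypes,
) -> Vec<Option<PlaygroundType>> {
    origins
        .iter()
        .map(|origin| match origin {
            // Both halves present: trust the engine and look the pair up.
            (Some(table), Some(column)) => schema
                .get(&(table.to_string(), column.to_string()))
                .copied(),
            // Computed projection, recursive-CTE column, and so on.
            _ => None,
        })
        .collect()
}
```

Note that a pair the schema does not know also falls out as None here, so the renderer's neutral-alignment fallback covers that case without a separate error path.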

The "structurally a single column reference" definition in §12's Resolution rule is superseded by the engine-driven rule above. The §12 Implementation seam is unchanged in approach (engine-side column-origin lookup is still the mechanism), but the speculative fallback paragraph ("If exposure turns out to be awkward, the fallback is a small post-parse walk over the projection-item subtrees in the MatchedPath") is moot — the exposure works, and the engine's metadata is broader than a grammar-side walk could be without re-implementing SQLite's query-planner traceback. The fallback path is removed.

Effect on the Phase-2 plan's sub-phase 2f

The 2f exit gate's "CTE pass-through" row should be asserted positive (recovers Some(text)). The "Subquery result" row, which the plan left as "assert whichever behaviour the engine exhibits", should be asserted positive as well. A new explicit 2f test row covers the named limitation: a recursive CTE result column must produce column_types[0] = None and the renderer must fall back to neutral alignment without panicking.

The catalog and grammar-side work in 2a–2e is unaffected by this amendment. Only 2f's test list and the worker's resolve_select_column_types helper change shape (the helper becomes simpler: no structural classification, just a direct metadata lookup per result column).

This amendment narrows the honest limitation in §12 from "computed / non-direct projection items" to "computed projections and recursive CTE result columns" — a tighter, factually verified carve-out.

Amendment 2 — §10.6 fixup-pass mechanism (2026-05-20)

§10.6's prescription for the post-walk fixup is written in terms of "rewriting the highlight class" on projection-list Ident terminals — downgrading "column" → "unknown identifier" when an ident doesn't belong to the eventual from_scope, or upgrading the reverse direction once a FROM is typed. The implementation chose a different mechanism that achieves the identical user-visible effect; this Amendment records the choice so a reader of §10.6 doesn't go looking for a literal per_byte_class rewrite step that does not exist.

Mechanism actually used

Two pieces, both already in the codebase by the end of sub-phase 2d:

  1. Two-pass schema-existence diagnostic. The 2d rewrite of schema_existence_diagnostics (src/dsl/walker/mod.rs) runs a pre-pass over the matched path that collects every IdentSource::Tables / cte_name / table_alias ident into a single binding vec, regardless of where in the path it sits. The main pass then resolves each sql_expr_ident against the complete binding set. A projection ident that resolves under the eventual FROM scope produces no diagnostic; one that doesn't produces an unknown_column diagnostic on its own span.

  2. Diagnostic-overlay renderer. src/input_render.rs reads the walker's diagnostic list at every keystroke and overlays each diagnostic's span with the appropriate colour (Error red for unknown-column, Warning for type-mismatch / LIKE-on-numeric / etc.). The overlay sits on top of the walker's per_byte_class (which keeps all idents at HighlightClass::Identifier).
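The two-pass shape of the first piece can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in src/dsl/walker/mod.rs: PathItem, Diagnostic, and schema_existence_sketch are hypothetical names, and each binding is assumed to carry its column list directly rather than going through the schema cache.

```rust
/// Hypothetical, flattened view of a matched path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PathItem {
    /// A table, CTE name, or table alias, with the columns it exposes.
    Binding { columns: Vec<String> },
    /// A column identifier inside an expression, with its byte span.
    ExprIdent { name: String, span: (usize, usize) },
}

#[derive(Debug, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: &'static str,
    pub span: (usize, usize),
}

pub fn schema_existence_sketch(path: &[PathItem]) -> Vec<Diagnostic> {
    // Pre-pass: collect every binding, wherever it sits in the path, so that
    // a projection ident typed before FROM sees the complete scope.
    let mut saw_binding = false;
    let mut known: Vec<&str> = Vec::new();
    for item in path {
        if let PathItem::Binding { columns } = item {
            saw_binding = true;
            known.extend(columns.iter().map(String::as_str));
        }
    }

    // With no FROM binding at all there is nothing to resolve against,
    // so the projection stays silent.
    if !saw_binding {
        return Vec::new();
    }

    // Main pass: resolve each expression ident against the complete set.
    let mut diagnostics = Vec::new();
    for item in path {
        if let PathItem::ExprIdent { name, span } = item {
            if !known.contains(&name.as_str()) {
                diagnostics.push(Diagnostic {
                    code: "unknown_column",
                    span: *span,
                });
            }
        }
    }
    diagnostics
}
```

Because the pre-pass ignores position, an ident that appears before its FROM clause is resolved exactly like one that appears after it.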

Combined, the two yield the §10.6 user-visible behaviour within one debounce cycle. When the user types select bogus_col, the diagnostic is emitted and the overlay paints the ident red as soon as a FROM appears that shows the column doesn't exist. When the user types select real_col, no diagnostic is emitted and the ident stays Identifier-coloured.

Why this is equivalent

§10.6's stated goal is correctness of the end-of-walk classification; "rewriting the highlight class" is one implementation strategy for that goal. The HighlightClass enum in the codebase has only one identifier slot (Identifier); the Error tint comes from the diagnostic overlay, not from a separate Column vs UnknownIdentifier class. The two-pass diagnostic is the "post-walk fixup" that §10.6 calls for; it simply runs inside the diagnostic emitter rather than as a separate rewrite step. The integration point (§10.6's "final stage of the walk itself") still holds: schema_existence_diagnostics runs after the walk's main work, mutating the walker's accumulated diagnostic vector in place. Consumers see a single coherent snapshot.
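A small sketch may make the equivalence concrete: the base classification is left untouched and the severity tint is layered on top at render time. The types below (a reduced HighlightClass, Severity, Paint, Diagnostic) and the overlay function are hypothetical simplifications, not the definitions in src/input_render.rs.

```rust
/// Reduced stand-in for the walker's highlight classes. Note the single
/// identifier slot: there is no separate "unknown identifier" variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HighlightClass {
    Keyword,
    Identifier,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
}

/// What the renderer finally paints for one byte of input.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Paint {
    Base(HighlightClass),
    Overlay(Severity),
}

pub struct Diagnostic {
    pub severity: Severity,
    pub start: usize,
    pub end: usize,
}

/// Layer diagnostic spans over the per-byte classes without mutating them.
pub fn overlay(per_byte_class: &[HighlightClass], diagnostics: &[Diagnostic]) -> Vec<Paint> {
    let mut painted: Vec<Paint> = per_byte_class.iter().copied().map(Paint::Base).collect();
    for diagnostic in diagnostics {
        // Clamp the span so a stale diagnostic can never index out of bounds.
        let end = diagnostic.end.min(painted.len());
        let start = diagnostic.start.min(end);
        for slot in &mut painted[start..end] {
            *slot = Paint::Overlay(diagnostic.severity);
        }
    }
    painted
}
```

In this shape the walker's output never needs a rewrite step: removing the diagnostic on the next walk is enough to restore the plain Identifier colour.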

Completion mid-typing

§10.6's second user-visible promise — "during-typing completion of projection-list column names uses the global fallback" — is preserved as a posture, but improved at the edges in sub-phase 2e by a look-ahead probe in src/completion.rs. When the leading walk produces no from_scope (the projection-before-FROM state) and the full input does have a FROM after the cursor, a second walk on the full input populates the binding set, and column candidates narrow to that scope. The fallback to global SchemaCache.columns remains the path when the full input doesn't parse cleanly (e.g., the user deleted * and is mid-edit). This is a strict improvement: the realistic "edit an existing query" workflow now narrows correctly.
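The probe order described above (leading walk, then full-input walk, then global fallback) can be sketched as below. This is an illustration only: column_candidates_sketch is a hypothetical name, and the walk parameter is an assumed stand-in for the real walker, returning the columns in scope or None when no from_scope could be built.

```rust
/// Pick completion candidates for a column position.
///
/// `walk` is a hypothetical stand-in for the walker: it returns the columns
/// in scope for the given text, or None when the text yields no FROM scope
/// (including when it does not parse cleanly).
pub fn column_candidates_sketch<W>(
    input: &str,
    cursor: usize,
    walk: W,
    global_columns: &[String],
) -> Vec<String>
where
    W: Fn(&str) -> Option<Vec<String>>,
{
    // Leading walk: only the text before the cursor. Fall back to the full
    // input if the cursor is not on a valid boundary.
    let leading = input.get(..cursor).unwrap_or(input);
    if let Some(scope) = walk(leading) {
        return scope;
    }

    // Look-ahead probe: a FROM after the cursor may still supply the scope.
    if let Some(scope) = walk(input) {
        return scope;
    }

    // Neither walk produced a scope: offer every known column.
    global_columns.to_vec()
}
```

With a toy walk that recognises a single table, an input of "select  from users" with the cursor after "select " narrows to that table's columns, while a bare "select " falls through to the global list.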

What §10.6's prescription becomes

The "rewrite the highlight class" wording is superseded by: the post-walk diagnostic pass re-resolves projection idents against the complete scope and emits / withholds the unknown-column diagnostic accordingly; the renderer's diagnostic-overlay path achieves the visual change. No new HighlightClass variant is required.

§10.6's other prescriptions stand verbatim — the integration point (final walk stage, in-place mutation of walker accumulators), the per-keystroke re-walk (ADR-0027's debounced cadence), and the ORDER BY no-fixup-needed clarification.

See also

  • ADR-0005 — the ten-type vocabulary §10 resolves back to.
  • ADR-0016 — the data-table renderer SELECT results reuse.
  • ADR-0019 — the friendly-error layer engine-side rejections route through (§7).
  • ADR-0021 — per-command parse-error usage; the Phase-2 surface inherits the framework, Phase 6 polishes per-clause messages (§11 OOS-12).
  • ADR-0022 — ambient typing assistance. §5/§6/§8 inherit its keyword-completion / highlighting / hint mechanisms for free, but §10 extends its IdentSource::Columns / SchemaCache / WalkContext infrastructure with the scope accumulators, qualified-prefix narrowing, and the post-walk fixup pass that Phase 2 needs.
  • ADR-0023 / ADR-0024 — the unified grammar tree Phase 2 extends.
  • ADR-0026 — the WHERE grammar's Subgrammar node, depth counter, and MAX_SUBGRAMMAR_DEPTH = 64 cap, all reused unchanged (§9).
  • ADR-0027 — the validity indicator, free for the Phase-2 surface; §1 (ERROR/WARNING guideline) is the source quoted verbatim in §11; Amendment 1 (LIKE-on-numeric WARNING) is the one that the SQL-expression predicate-warnings gap of §11.6 closes for the SQL surface.
  • ADR-0028 — the styled OutputLine mechanism the renderer uses; not directly touched by Phase 2.
  • ADR-0030 — the parent ADR; §3 commissions this phase, §4/§6 fix execution-as-text, §7 fixes engine neutrality, §11 fixes history / replay, §13 fixes the long-running OOS list.
  • ADR-0031 — the SQL expression grammar this ADR extends additively (§5, §6); §7 named the two extensions implemented here.
  • docs/simple-mode-limitations.md — the DSL limits advanced mode lifts; Phase 2 lifts the JOIN, subquery, set-op, CTE, and grouping limits.