# ADR-0032: The full SQL `SELECT` grammar

## Status

Accepted

## Context

ADR-0030 commissions advanced mode as a body of **SQL grammar
inside the unified grammar tree** (ADR-0023/0024), phased. Phase 1
("Foundations + first `SELECT`") shipped: a single-table `SELECT`
with projection, `WHERE`, `ORDER BY`, and `LIMIT`, executed as
validated SQL text through the existing data-table renderer.
ADR-0031 authored the SQL **expression grammar** the Phase-1
`SELECT` consumed.

Phase 2 — "`SELECT` — full" — is the next slice. ADR-0030 §3 lists
it: `JOIN`s, `GROUP BY` / `HAVING`, aggregates, subquery
expressions, `UNION`/`INTERSECT`/`EXCEPT`, common table
expressions, `LIMIT … OFFSET`, qualified column references.
ADR-0030 §3 also says the full `SELECT` grammar "is each large
enough to warrant their own focused ADR when implemented — the
precedent is ADR-0026 for the `WHERE` grammar." This is that ADR.

The architecture is fixed (ADR-0030 §1, §4, §6, §8): one walker,
grammar-as-text execution, ambient assistance for free. This ADR
fixes the **shape** of the grammar — the productions, the
recursion, the additive extensions to ADR-0031's expression
fragment, and the few execution-path implications (worker-side
column-origin lookup so result columns recover their playground
type). It deliberately does **not** revisit ADR-0030's structural
decisions; references in this ADR's text to ADR-0030 §X mean
"that decision is the controlling one."

### What ADR-0030 and ADR-0031 already fix

- **No batch parser; SQL is grammar in the unified tree.**
  Subquery recursion is a `Node::Subgrammar(&NAMED)` reference,
  exactly as the expression ladder uses it (ADR-0031 §3).
- **No AST builder for the parts that execute as text.**
  `Command::Select { sql: String }` carries the validated source;
  the worker prepares and runs it (ADR-0030 §4/§6, ADR-0031 §2).
- **The `__rdbms_*` rejection** at every table-name slot
  (ADR-0030 §6) — re-applied to every Phase-2 table-source slot
  (`FROM`, `JOIN`, CTE-name).
- **No allowlist for function names** (ADR-0030 §13 OOS-3,
  ADR-0031 §6). Aggregates (`count`, `sum`, `avg`, `min`, `max`)
  parse through the generic `name_or_call` path — the grammar is
  structurally aggregate-blind, by design.
- **No quoted identifiers** (ADR-0031 §7 OOS-3) — unchanged.
- **`MAX_SUBGRAMMAR_DEPTH = 64`** (ADR-0026) is the shared
  recursion budget across DSL `Expr`, SQL expression, and (added
  here) SQL `SELECT` recursion. No new walker capability is
  introduced (§9).

### The boundary with ADR-0031

ADR-0031 §7 named two additive extensions deferred to this ADR:

- **OOS-1: subquery expressions** — `( SELECT … )` as a `primary`,
  `IN ( SELECT … )`, `EXISTS ( SELECT … )`. Their grammar is fixed
  in §6; they are additive `Choice` branches in `sql_expr.rs`,
  recursing into the named `SELECT` fragment authored here.
- **OOS-2: qualified column references** — `t.c` / `alias.c`.
  Their grammar is fixed in §5; they are an additive tail on
  `name_or_call` in `sql_expr.rs`.

`sql_expr.rs` was shaped to receive both branches without
restructuring (ADR-0031 §7 promise). This ADR redeems that
promise; the changes there are strictly additive.

## Decision

### 1. The top-level `SELECT` grammar

The full statement decomposes into a top-level *compound query*
(set-operator chains around per-leg *core selects*), wrapped by
an optional `WITH` prefix and trailing `ORDER BY` / `LIMIT`:

```
select_statement   := [ with_clause ] compound_select
compound_select    := select_core ( set_op select_core )*
                      [ order_by_clause ]
                      [ limit_clause ]
set_op             := UNION [ ALL ] | INTERSECT | EXCEPT
select_core        := SELECT [ DISTINCT | ALL ]
                      projection_list
                      [ from_clause ]
                      [ where_clause ]
                      [ group_by_clause ]
                      [ having_clause ]
with_clause        := WITH [ RECURSIVE ] cte_def
                      ( ',' cte_def )*
cte_def            := identifier [ '(' column_name_list ')' ]
                      AS '(' compound_select ')'
projection_list    := projection_item ( ',' projection_item )*
projection_item    := '*'
                    | identifier '.' '*'
                    | sql_expr [ [ AS ] identifier ]
from_clause        := FROM table_source ( join_clause )*
table_source       := identifier [ [ AS ] identifier ]
join_clause        := [ INNER ] JOIN table_source ON sql_expr
                    | LEFT  [ OUTER ] JOIN table_source ON sql_expr
                    | RIGHT [ OUTER ] JOIN table_source ON sql_expr
                    | FULL  [ OUTER ] JOIN table_source ON sql_expr
                    | CROSS JOIN table_source
where_clause       := WHERE sql_expr
group_by_clause    := GROUP BY sql_expr ( ',' sql_expr )*
having_clause      := HAVING sql_expr
order_by_clause    := ORDER BY order_item ( ',' order_item )*
order_item         := sql_expr [ ASC | DESC ]
limit_clause       := LIMIT sql_expr [ OFFSET sql_expr ]
```

`sql_expr` is ADR-0031's `SQL_OR_EXPR`, extended additively per
§5 and §6. `column_name_list` is `identifier ( ',' identifier )*`.

The named `static` Nodes exported by the new
`src/dsl/grammar/sql_select.rs` are `SQL_SELECT_STATEMENT`
(matching the full statement) and `SQL_SELECT_COMPOUND` (the
embedded form, omitting the outer `WITH`, which is what subqueries
recurse into; see §6, §9).

Notes on specific productions:

- **`FROM` stays optional.** Phase 1's autonomous decision §4.1
  is upheld: `SELECT 1` and `SELECT upper('x')` continue to
  parse. With JOINs landing, the absence of a `FROM` simply
  means no `from_clause`/`join_clause` was matched; no extra
  shape is needed.
- **Bare-alias projection (`select a x`) is admitted.** Phase 1's
  autonomous decision §4.2 deliberately rejected it as
  structurally ambiguous. With Phase 2's grammar the ambiguity
  dissolves: every word that can legitimately follow a projection
  item (`FROM` first among them) is a keyword in the walker's
  expected-set. An identifier following a projection expression
  that is not `FROM`, `WHERE`, `GROUP`, `ORDER`, `LIMIT`, or a
  set-op keyword is therefore a bare alias, and is so admitted.
  This lifts a small but visible Phase-1 limitation.
- **`SELECT [ DISTINCT | ALL ]`.** `ALL` is the default and is
  admitted for symmetry; `DISTINCT` is the meaningful case. They
  are mutually exclusive at this position (a `Choice`, not two
  `Optional`s).
- **`identifier '.' '*'`** lives only in `projection_item`, never
  in `sql_expr`. This is intentional: `t.*` is *projection
  syntax*, not an expression, and admitting it as an expression
  primary would let it appear in `WHERE` / `ORDER BY` / etc.
  where the engine would reject it and the engine-neutral error
  would be hard to phrase. The grammar simply refuses it
  structurally outside projection.
- **`UNION ALL` is a single set-op,** not `UNION` followed by an
  `ALL` modifier on the next leg. `set_op` is a `Choice` of the
  four atoms (with `UNION` and `UNION ALL` as separate branches);
  factoring `UNION [ ALL ]` is also valid but the explicit four
  branches keep the matched-path classes cleaner for
  highlighting.
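
The bare-alias rule in the notes above can be sketched as a single predicate. This is a minimal illustration, not the walker's implementation: `PROJECTION_FOLLOWERS` and `is_bare_alias` are hypothetical names, the keyword list is copied from the note, and case-insensitive keyword matching is an assumption. The `,` separator is punctuation rather than an identifier, so it never reaches this check.

```rust
/// Keywords that close a projection item instead of aliasing it.
/// Copied from the note above for illustration; this is an assumed,
/// hard-coded stand-in for whatever the walker's expected-set provides.
const PROJECTION_FOLLOWERS: [&str; 8] = [
    "FROM", "WHERE", "GROUP", "ORDER", "LIMIT", "UNION", "INTERSECT", "EXCEPT",
];

/// Hypothetical helper: true when an identifier-shaped word that follows a
/// projection expression should be read as a bare alias.
fn is_bare_alias(word: &str) -> bool {
    !PROJECTION_FOLLOWERS
        .iter()
        .any(|kw| kw.eq_ignore_ascii_case(word))
}
```

Under these assumptions, `select a x from t` reads `x` as an alias, while `select a from t` reads `from` as the start of the `from_clause`.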

### 2. JOIN flavours admitted

The grammar admits exactly the flavours the user picked:

- `INNER JOIN` / bare `JOIN`
- `LEFT [ OUTER ] JOIN`
- `RIGHT [ OUTER ] JOIN`
- `FULL [ OUTER ] JOIN`
- `CROSS JOIN`

The first four take a mandatory `ON sql_expr`; `CROSS JOIN`
takes none. `OUTER` is the optional explicit modifier on
`LEFT` / `RIGHT` / `FULL`.

**Explicitly out (§11):** `NATURAL JOIN`, `JOIN … USING (col)`,
and comma-list `FROM t1, t2` (the legacy implicit cross join).
The first two add grammar weight for limited teaching value;
comma-FROM teaches habits we do not want to encourage —
`CROSS JOIN` covers the same shape explicitly.

JOIN chains are admitted as a flat `( join_clause )*`. Standard
SQL is left-associative; since the grammar builds no AST and the
engine receives the source text verbatim (ADR-0030 §4), the
engine resolves the associativity. The grammar's job ends at "the
chain parses".
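
As a rough sketch of the admitted flavours, the keyword prefix of a `join_clause` can be recognised as below. `JoinKind` and `parse_join_prefix` are hypothetical names introduced for illustration, and the input is assumed to be a slice of already upper-cased keyword tokens; the real grammar expresses this as `Choice` branches rather than hand-written matching.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
    Cross,
}

impl JoinKind {
    /// Every flavour except CROSS JOIN takes a mandatory `ON sql_expr`.
    fn requires_on(self) -> bool {
        self != JoinKind::Cross
    }
}

/// Recognises a join prefix and returns the flavour plus the number of
/// tokens consumed, up to and including `JOIN`.
fn parse_join_prefix(tokens: &[&str]) -> Option<(JoinKind, usize)> {
    let (kind, outer_allowed) = match *tokens.first()? {
        "JOIN" => return Some((JoinKind::Inner, 1)), // bare JOIN is INNER
        "INNER" => (JoinKind::Inner, false),
        "CROSS" => (JoinKind::Cross, false),
        "LEFT" => (JoinKind::Left, true),
        "RIGHT" => (JoinKind::Right, true),
        "FULL" => (JoinKind::Full, true),
        _ => return None, // NATURAL and friends are not admitted
    };
    let mut used = 1;
    // OUTER is optional, and only after LEFT / RIGHT / FULL.
    if outer_allowed && tokens.get(used) == Some(&"OUTER") {
        used += 1;
    }
    if tokens.get(used) == Some(&"JOIN") {
        Some((kind, used + 1))
    } else {
        None
    }
}
```

`NATURAL JOIN` fails at the first token, and `INNER OUTER JOIN` fails because `OUTER` is only accepted after the three outer flavours.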

### 3. Set operators and compound queries

`UNION`, `UNION ALL`, `INTERSECT`, `EXCEPT` all admitted —
ADR-0030 §3's full set.

The compound shape (§1) is `select_core (set_op select_core)*`,
flat. Standard SQL gives `INTERSECT` higher precedence than
`UNION` / `EXCEPT`; the engine resolves this — the grammar admits
the chain as written. This mirrors §2's JOIN-chain decision.
A user who wants explicit grouping writes
`(SELECT … INTERSECT SELECT …) UNION SELECT …`, which falls out
of the subquery-`primary` branch (§6) — though for a top-level
statement this requires an extra `SELECT` wrapping. In practice
the engine's precedence is what learners encounter; calling it
out in the `help sql` page (ADR-0030 Phase 6) is sufficient.

`ORDER BY` / `LIMIT` on a compound apply to the whole compound,
not to a leg — fixed by the position of `order_by_clause` and
`limit_clause` in §1's `compound_select`.

### 4. CTEs (`WITH` and `WITH RECURSIVE`)

The full `with_clause` per §1. Both forms admitted: non-recursive
`WITH` for naming intermediate results, and `WITH RECURSIVE` for
recursive queries (tree traversals, transitive closure,
generated sequences).

The `cte_def` body is a parenthesised `compound_select`, so the
recursion is into `SQL_SELECT_COMPOUND` via `Subgrammar` — the
same recursion mechanism subqueries use (§9).

**CTE-name collisions.** A CTE name shares the table-name
namespace at the engine. Standard SQL: the CTE shadows a
same-named base table within the statement. The grammar is
agnostic — both are identifiers in a table-source slot — so the
shadowing falls out of engine resolution. The
`reject_internal_table` validator still rejects any `__rdbms_*`
identifier in any table-source slot, **including** CTE-name
slots and the `FROM`s inside CTE bodies. That is the right
posture: the reserved namespace is reserved everywhere.

Recursive CTEs use the standard `cte_name AS ( base_case UNION
[ALL] recursive_case )` shape — already admitted by §1's
`compound_select` body. No grammar branch specific to recursion
is needed; the `RECURSIVE` keyword is a hint to the engine, not
a grammar gate.

### 5. Qualified column references

Additive extension to `sql_expr.rs` (ADR-0031 §7 OOS-2).
`name_or_call`'s identifier prefix gains a `Choice` tail:

```
name_or_call    := identifier
                   ( '.' identifier
                   | '(' call_args? ')'
                   )?
```

The leading identifier is matched once (preserving ADR-0031 §1's
factoring — no `Choice` branch begins with an identifier). The
optional tail is *either* a qualified-reference suffix
(`. identifier`) *or* a function-call argument list (`( … )`),
not both. A bare identifier with no tail remains a plain column
reference.

A function call with a qualified name — `schema.f(…)` — is not in
scope (we have no schemas) and is structurally inadmissible by
construction: there is no production that admits both a `.`-tail
and a `(`-tail.
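
The either-or tail can be illustrated with a small classifier over pre-split tokens. This is a sketch under assumptions: `NameOrCall`, `is_identifier`, and `classify_name_or_call` are hypothetical names, the identifier rule (ASCII letter or `_`, then alphanumerics or `_`) is assumed, and a call is only recognised up to its opening `(`.

```rust
#[derive(Debug, PartialEq, Eq)]
enum NameOrCall {
    Column(String),
    Qualified { source: String, column: String },
    Call { name: String },
}

/// Assumed identifier shape: ASCII letter or '_' followed by
/// ASCII alphanumerics or '_'.
fn is_identifier(tok: &str) -> bool {
    let mut chars = tok.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Matches the leading identifier once, then at most one tail.
/// Returns the shape and how many tokens it covers (for a call, only
/// the name and the opening parenthesis).
fn classify_name_or_call(tokens: &[&str]) -> Option<(NameOrCall, usize)> {
    let head = *tokens.first()?;
    if !is_identifier(head) {
        return None;
    }
    let name = head.to_string();
    match (tokens.get(1).copied(), tokens.get(2).copied()) {
        (Some("."), Some(col)) if is_identifier(col) => Some((
            NameOrCall::Qualified { source: name, column: col.to_string() },
            3,
        )),
        (Some("."), _) => None, // dangling '.' is incomplete input
        (Some("("), _) => Some((NameOrCall::Call { name }, 2)),
        _ => Some((NameOrCall::Column(name), 1)),
    }
}
```

For `schema . f (` the classifier consumes three tokens as a qualified reference and leaves the `(` unconsumed, which mirrors the point that no production admits both tails.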

Completion for the qualified form: when the cursor is past
`identifier '.'`, the completion source is "columns of the table
or alias named by the leading identifier", resolved from the
active `SchemaCache` (the same source the DSL completion uses,
ADR-0030 §8). This is a small extension to the existing
`IdentSource::Columns` machinery — when in scope, column
completion is scoped to the named source.

### 6. Subquery expressions

Additive extensions to `sql_expr.rs` (ADR-0031 §7 OOS-1):

- **Scalar subquery as `primary`.** A `Choice` branch
  `'(' compound_select ')'`. The existing `'(' or_expr ')'`
  branch handles parenthesised expressions. Both start with
  `'('`, so per ADR-0031 §1's factoring principle, the `'('` is
  matched once and the inside is a `Choice` between
  `compound_select` and `or_expr`. The first inside token
  disambiguates: `SELECT` or `WITH` → subquery; anything else →
  expression. The two `Choice` branches have non-overlapping
  first-token sets, so the walker's expected-set at the
  ambiguity point merges naturally without `Optional`-first
  hazards.

- **`IN ( subquery )`.** The existing `predicate_tail`'s
  `IN '(' additive (',' additive)* ')'` branch gains a sibling
  `IN '(' compound_select ')'`. Same `'('` factoring as the
  scalar case: after `'('`, branch on `SELECT`/`WITH` (subquery)
  vs additive-first-token (literal list). `NOT IN` follows from
  the existing `[ NOT ]` factoring on the predicate tail.

- **`[ NOT ] EXISTS ( subquery )`.** Added as a `primary`
  `Choice` branch:

  ```
  primary := … | EXISTS '(' compound_select ')'
  ```

  The bare `EXISTS` form lives in `primary`; `NOT EXISTS` falls
  out of the existing `not_expr := NOT not_expr` tier above
  `primary` in the precedence ladder. This is structurally
  cleaner than putting `[ NOT ] EXISTS` inside `primary`: there
  is only one place `NOT` is admitted, and it composes uniformly.

All three branches recurse through `Subgrammar(&SQL_SELECT_COMPOUND)`.
Correlated subqueries fall out for free — a subquery's
`sql_expr` reaches identifiers, which the engine resolves
against outer scopes. The grammar imposes no correlation
constraint; correlation is engine-side semantics.

### 7. `GROUP BY` and `HAVING`

`GROUP BY` takes a comma-separated list of `sql_expr`s.
Standard SQL admits any expression as a grouping key (not just
bare columns) — e.g. `GROUP BY date(created_at)`. The grammar
admits this without special-casing.

`HAVING` is a single `sql_expr`. Its semantics is "boolean over
grouped rows"; the grammar does not enforce that — the
expression's typing is the engine's concern.

**Aggregate correctness is not grammar-checked.** Whether a
projection's non-aggregated columns are valid given the
`GROUP BY` keys is a semantic question. ADR-0030 §9 settled this:
the grammar admits structurally, the engine rejects semantically,
and the friendly-error layer renders engine-neutral wording
(ADR-0019). A learner who writes `SELECT Name, COUNT(*) FROM t`
sees an engine-neutral "Name must appear in a GROUP BY clause or
be wrapped in an aggregate function"-style message, not a raw
engine string and not a parse error. This is the project's
honest limitation (ADR-0030 §7) and remains so.

### 8. `LIMIT` / `OFFSET` and `ORDER BY` extras

`LIMIT n [ OFFSET m ]` — the standard form. Both `n` and `m` are
`sql_expr`s (in practice integer literals, but the grammar
admits the general form so e.g. `LIMIT max(10, x) OFFSET 0` is
structurally accepted; the engine constrains values).

The MySQL/SQLite legacy comma form `LIMIT m, n` is **out** (§11).
Its argument order (offset first, then count) inverts the
keyword form — a needless source of confusion.

`ORDER BY` already admits `sql_expr` items with optional
`ASC` / `DESC` (Phase 1). With Phase 2:

- **Column-position references** (`ORDER BY 1, 3 DESC`) fall out
  for free — an integer literal is a valid `sql_expr`, and the
  engine interprets a bare positive integer in `ORDER BY` as a
  column position. The grammar does not distinguish the case;
  rendering interprets the position. Document in `help sql`.
- **Qualified refs** in `ORDER BY` (e.g. `ORDER BY t.c`) fall
  out of §5 — the grammar uses the same `sql_expr` body.

### 9. Recursion, the depth budget, and the walker

`SELECT` recurses into itself at four points:

- A subquery `primary` in `sql_expr` (§6).
- An `IN ( subquery )` predicate tail (§6).
- An `EXISTS ( subquery )` primary (§6).
- A CTE body (§4).

Every recursion is wired through
`Node::Subgrammar(&SQL_SELECT_COMPOUND)` — the named `static` Node
exported by `sql_select.rs`. The recursion is token-guarded in
every case: a subquery `primary` is preceded by `'('`; an
`IN ( subquery )` by `IN (`; an `EXISTS ( subquery )` by
`EXISTS (`; a CTE body by `AS (`. There is no left recursion;
the walker always makes progress.

`MAX_SUBGRAMMAR_DEPTH = 64` (ADR-0026, reused by ADR-0031) is
**shared**: DSL `Expr` recursion, SQL expression recursion, and
SQL `SELECT` recursion all increment the same
`WalkContext::subgrammar_depth`. A worst-case learner query
might be `SELECT … WHERE id IN (SELECT … WHERE id IN (SELECT …))`
with each inner select carrying a few-deep expression — well
below the cap. The cap remains purely a stack-overflow guard;
**this ADR does not raise it**. If pathological-but-realistic
learner queries reach 64 in practice, a focused ADR lifts it
with measurements. Speculative raising would weaken the guard
without evidence.
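
A minimal sketch of the shared budget follows. Only `MAX_SUBGRAMMAR_DEPTH` and the `subgrammar_depth` field come from the text above; the `usize` type, the method names, the `DepthExceeded` error, and the exact boundary (64 nested entries allowed, the 65th rejected) are assumptions made for illustration.

```rust
const MAX_SUBGRAMMAR_DEPTH: usize = 64;

#[derive(Debug, Default)]
struct WalkContext {
    /// One counter for DSL `Expr`, SQL expression, and SELECT recursion.
    subgrammar_depth: usize,
}

#[derive(Debug, PartialEq, Eq)]
struct DepthExceeded;

impl WalkContext {
    /// Called on entry to any subgrammar reference, whatever its origin.
    fn enter_subgrammar(&mut self) -> Result<(), DepthExceeded> {
        if self.subgrammar_depth >= MAX_SUBGRAMMAR_DEPTH {
            return Err(DepthExceeded);
        }
        self.subgrammar_depth += 1;
        Ok(())
    }

    /// Called on exit; the budget is returned to the shared pool.
    fn exit_subgrammar(&mut self) {
        self.subgrammar_depth = self.subgrammar_depth.saturating_sub(1);
    }
}
```

Because the counter is shared, an expression nested ten deep inside a subquery nested three deep spends thirteen units of the same budget.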

**No new walker capability is introduced for grammar recursion.**
`Subgrammar`, the depth counter, the cap, and the friendly
depth-exceeded error all carry over from ADR-0026 unchanged, the
same posture ADR-0031 took. This is a non-trivial property:
Phase 2 is the biggest single grammar slice in the project, and
its recursion lands without changing the walker's contract.
Completion scope is a separate matter and does extend the walker
(§10).

### 10. Completion scope and the `WalkContext` extension

ADR-0030 §8 promises that "ambient assistance comes for free"
because SQL is grammar in the unified tree. For Phase 1's
single-table `SELECT` this was substantially true: the existing
`WalkContext::current_table` mechanism (populated via the
`writes_table: true` flag on the `FROM` table-name slot) gave
`WHERE` and `ORDER BY` column-name completion against the right
table at no incremental cost.

Phase 2 breaks the "free" claim. Multiple `FROM` tables via
`JOIN`s, aliases, CTE-defined table sources, subqueries with their
own `FROM` scope, qualified `t.c` references, projection aliases
referenced in `ORDER BY` — every Phase-2 surface needs **scope
information that `WalkContext` does not currently carry**. §9's
"no new walker capability" claim holds for grammar recursion
(`Subgrammar` and the depth cap suffice); for completion scope it
is too strong, and is softened here to an honest split.

The current `WalkContext` carries one table at a time
(`current_table: Option<String>` + `current_table_columns`), set
by `writes_table: true` on a `Tables` identifier. DSL paths
(`update T`, `delete from T`, `insert into T`) rely on this
single-table contract and continue to work unchanged. Phase 2
adds layered accumulators alongside that contract, not in place
of it.

#### 10.1. The from-scope accumulator

A new `WalkContext` field:

```
from_scope: Vec<TableBinding>
TableBinding { table: String, alias: Option<String>,
                columns: Vec<TableColumn> }
```

Populated incrementally as the walker descends through
`from_clause` and each `join_clause` (§1). The first table-source
slot pushes a binding; every subsequent `JOIN` pushes another.
`Ident` slots whose `IdentSource` is `Columns` now resolve against
the union of every binding's columns, with deduplication.

`current_table` / `current_table_columns` remain as derived
helpers: when `from_scope.len() == 1`, they expose that single
binding's data, preserving the contract every existing DSL path
relies on. DSL `UPDATE` / `DELETE` / `INSERT` continue to push
exactly one binding via the existing `writes_table: true`
mechanism, unchanged.
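
The accumulator and its derived single-table view might look like the sketch below. The `TableBinding` fields follow the definition above; `TableColumn` is reduced to a name, and `push_binding`, `current_table`, `column_candidates`, and the `binding` constructor are hypothetical. Returning `None` from `current_table` when the scope holds zero or several bindings is an assumption, since the text only fixes the one-binding case.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct TableColumn {
    name: String,
}

#[derive(Debug, Clone)]
struct TableBinding {
    table: String,
    alias: Option<String>,
    columns: Vec<TableColumn>,
}

#[derive(Debug, Default)]
struct WalkContext {
    from_scope: Vec<TableBinding>,
}

impl WalkContext {
    /// FROM pushes the first binding; every JOIN pushes another.
    fn push_binding(&mut self, binding: TableBinding) {
        self.from_scope.push(binding);
    }

    /// Derived helper: the legacy single-table view, exposed only when
    /// exactly one binding is in scope (assumed `None` otherwise).
    fn current_table(&self) -> Option<&str> {
        match self.from_scope.as_slice() {
            [only] => Some(only.table.as_str()),
            _ => None,
        }
    }

    /// Union of every binding's column names, first occurrence wins.
    fn column_candidates(&self) -> Vec<&str> {
        let mut seen: Vec<&str> = Vec::new();
        for binding in &self.from_scope {
            for col in &binding.columns {
                if !seen.contains(&col.name.as_str()) {
                    seen.push(col.name.as_str());
                }
            }
        }
        seen
    }
}

/// Hypothetical constructor used only to keep examples short.
fn binding(table: &str, alias: Option<&str>, cols: &[&str]) -> TableBinding {
    TableBinding {
        table: table.to_string(),
        alias: alias.map(str::to_string),
        columns: cols.iter().map(|c| TableColumn { name: c.to_string() }).collect(),
    }
}
```

With one binding pushed, `current_table()` behaves like the Phase-1 field; after a `JOIN` pushes a second, only `column_candidates()` remains meaningful.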

#### 10.2. Scope-stack discipline at `Subgrammar` boundaries

Subqueries (§6) and CTE bodies (§4) introduce new lexical scopes.
A column reference inside `SELECT … WHERE id IN (SELECT id FROM
u)` resolves first against the inner `SELECT`'s `FROM` (`u`), and
— for correlation — also against the outer scope.
`subgrammar_depth` is a counter; it suffices for §9's depth cap
but not for scope.

Phase 2 layers a stack on top. A new field:

```
from_scope_stack: Vec<ScopeFrame>
ScopeFrame {
    from_scope: Vec<TableBinding>,
    cte_bindings: Vec<CteBinding>,
    projection_aliases: Vec<String>,
}
```

The new walker node variant — `Node::ScopedSubgrammar(&Node)` —
is what triggers a scope push. It is a sibling of the existing
`Node::Subgrammar(&Node)`, with the same recursion semantics
(reference-following, depth-counted) and one additional driver
behaviour: on entry, push the current `ScopeFrame` onto
`from_scope_stack` and start a fresh empty frame; on exit, pop
back. The existing `Node::Subgrammar` variant is unchanged — DSL
`Expr` recursion (ADR-0026) and the `sql_expr.rs` precedence-
ladder recursion (ADR-0031) keep using it and never push a scope.

The grammar source spells the choice explicitly at each call
site: subqueries in `sql_expr.rs` and CTE bodies in
`sql_select.rs` reference the compound-SELECT through
`Node::ScopedSubgrammar(&SQL_SELECT_COMPOUND)`; predicate-ladder
recursion in `sql_expr.rs` continues to use
`Node::Subgrammar(&SQL_OR_EXPR)`. Self-documenting, no flag
bookkeeping, and the walker change is localised to one extra arm
in the driver's `match` over `Node` variants.

Column-completion candidates inside a scope frame are the union
of the current frame's `from_scope` and (for correlated refs)
all outer frames; outer-frame columns are admitted as additional
candidates so correlated references work. Ordering or visual
differentiation between current-frame and outer-frame candidates
is completion-tier polish and is not specified by this ADR — the
current completion API (`candidates_at_cursor*`) returns a flat
`Vec`, and adding a priority dimension is a separate concern.
CTE bindings resolve the same way (outward-walking) — a CTE
defined in an outer query is visible inside an inner subquery as
a table source, unless the inner subquery defines a CTE of the
same name and shadows it.

This is the one explicit walker-capability extension Phase 2
makes. It is scoped: one new node variant, no new walker entry
point, no change to how Subgrammar bodies are entered
structurally. The depth cap (§9) applies to both variants
uniformly through the shared `subgrammar_depth` counter.
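
The push/pop discipline can be sketched as follows. To stay short, `ScopeFrame` here carries only a flat list of column names as a stand-in for the full frame; `enter_scoped`, `exit_scoped`, and `column_candidates` are hypothetical names, and listing current-frame columns before outer-frame ones is an arbitrary choice, since candidate ordering is left unspecified above.

```rust
/// Stand-in frame: column names only, instead of the full
/// `from_scope` / `cte_bindings` / `projection_aliases` triple.
#[derive(Debug, Default)]
struct ScopeFrame {
    columns: Vec<String>,
}

#[derive(Debug, Default)]
struct WalkContext {
    current: ScopeFrame,
    from_scope_stack: Vec<ScopeFrame>,
}

impl WalkContext {
    /// Entry to a scoped subgrammar: park the current frame and start
    /// a fresh, empty one.
    fn enter_scoped(&mut self) {
        let outer = std::mem::take(&mut self.current);
        self.from_scope_stack.push(outer);
    }

    /// Exit: discard the inner frame and restore the parked one.
    fn exit_scoped(&mut self) {
        if let Some(outer) = self.from_scope_stack.pop() {
            self.current = outer;
        }
    }

    /// Current frame first, then outer frames walking outward, so that
    /// correlated references find their columns.
    fn column_candidates(&self) -> Vec<&str> {
        let mut out: Vec<&str> = Vec::new();
        let frames = std::iter::once(&self.current).chain(self.from_scope_stack.iter().rev());
        for frame in frames {
            for col in &frame.columns {
                if !out.contains(&col.as_str()) {
                    out.push(col.as_str());
                }
            }
        }
        out
    }
}
```

Plain `Subgrammar` recursion would call neither method, which is how the DSL and precedence-ladder paths stay unaffected.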

#### 10.3. CTE bindings

A frame-local accumulator carries CTE definitions visible in the
current scope:

```
cte_bindings: Vec<CteBinding>
CteBinding {
    name: String,
    columns: Vec<CteColumn>,
}
CteColumn {
    name: Option<String>,            // None for unnamed
                                     //   computed projections
    type_: Option<Type>,             // resolved playground type
                                     //   if derivable
}
```

A CTE definition `cte_name [(col-list)] AS (compound_select)`
produces a binding in two stages:

1. **Pre-body push** (so `WITH RECURSIVE` self-references resolve).
   When the walker reaches `AS` and is about to enter the body's
   `Node::ScopedSubgrammar(&SQL_SELECT_COMPOUND)`, it pushes a
   placeholder binding into the *outer* frame's `cte_bindings`
   with `columns = []` (an empty stand-in). The CTE name is now
   visible as a table source from inside the body.
2. **Body-finalised harvest** (when the body's scope frame
   completes). On `ScopedSubgrammar` exit, before popping the
   frame, the driver derives the body's projection-list output
   columns (rules below) and rewrites the placeholder binding in
   the outer frame.

**Output-column derivation rules.** Walking the body's
projection items:

| Projection item                       | Derived CTE column(s)                                                                  |
|---------------------------------------|----------------------------------------------------------------------------------------|
| `*`                                   | Every column from the body frame's `from_scope`, in order, with their resolved types   |
| `t.*` (qualified wildcard)            | Every column from binding `t` in the body frame's `from_scope`, with their types       |
| `col` (bare ref, resolves uniquely)   | One column: name = `col`, type = the resolved column's playground type                 |
| `t.col` (qualified ref)               | One column: name = `col`, type = `t`'s column's type                                   |
| `expr AS alias` or bare `expr alias`  | One column: name = `alias`, type = the underlying type if `expr` is a single column ref, else `None` |
| `expr` (computed, no alias)           | One column: name = `None`, type = `None` — engine assigns an implementation-defined name |

For compound bodies (`UNION` / `INTERSECT` / `EXCEPT`) the columns
come from the **first leg** per standard SQL. For recursive CTE
bodies (`WITH RECURSIVE`) the same rule — the non-recursive leg
dictates.

If a `(col-list)` was supplied on the CTE name, it **renames** the
derived columns positionally and overrides their names; types are
preserved from the derivation. If the column-count of `(col-list)`
disagrees with the body's projection arity, the grammar admits
this and the engine surfaces the mismatch — `do_run_select`'s
engine-neutral error layer carries the message (ADR-0030 §9,
ADR-0019).
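
The derivation table and the `(col-list)` rename can be sketched together. Everything below is illustrative: `Type` is a two-variant stand-in for the playground type, `ProjectionItem` is a simplified model of what the walker would harvest, and the helper names are hypothetical. For a compound body the caller is assumed to pass the first leg's items.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum Type { Int, Text } // stand-in for the playground type enum

#[derive(Debug, Clone, PartialEq, Eq)]
struct CteColumn { name: Option<String>, type_: Option<Type> }

struct TableBinding { table: String, alias: Option<String>, columns: Vec<(String, Type)> }

/// Simplified projection shapes covering the rows of the table above.
enum ProjectionItem {
    Star,
    QualifiedStar(String),
    /// A single column ref, optionally qualified and optionally aliased.
    Column { qualifier: Option<String>, name: String, alias: Option<String> },
    /// Any other expression, optionally aliased.
    Computed { alias: Option<String> },
}

fn matches_prefix(b: &TableBinding, prefix: &str) -> bool {
    b.alias.as_deref() == Some(prefix) || b.table == prefix
}

fn typed(col: &(String, Type)) -> CteColumn {
    CteColumn { name: Some(col.0.clone()), type_: Some(col.1.clone()) }
}

/// Type of `name` if it resolves to exactly one column in scope.
fn lookup_type(scope: &[TableBinding], qualifier: Option<&str>, name: &str) -> Option<Type> {
    let mut hits = scope
        .iter()
        .filter(|b| qualifier.map_or(true, |q| matches_prefix(b, q)))
        .flat_map(|b| b.columns.iter())
        .filter(|(n, _)| n.as_str() == name);
    match (hits.next(), hits.next()) {
        (Some((_, ty)), None) => Some(ty.clone()),
        _ => None, // unresolved or ambiguous
    }
}

fn derive_cte_columns(
    items: &[ProjectionItem],
    scope: &[TableBinding],
    col_list: Option<&[&str]>,
) -> Vec<CteColumn> {
    let mut out = Vec::new();
    for item in items {
        match item {
            ProjectionItem::Star => {
                out.extend(scope.iter().flat_map(|b| b.columns.iter()).map(typed));
            }
            ProjectionItem::QualifiedStar(prefix) => {
                if let Some(b) = scope.iter().find(|b| matches_prefix(b, prefix)) {
                    out.extend(b.columns.iter().map(typed));
                }
            }
            ProjectionItem::Column { qualifier, name, alias } => out.push(CteColumn {
                name: Some(alias.clone().unwrap_or_else(|| name.clone())),
                type_: lookup_type(scope, qualifier.as_deref(), name),
            }),
            ProjectionItem::Computed { alias } => {
                out.push(CteColumn { name: alias.clone(), type_: None });
            }
        }
    }
    // A supplied (col-list) renames positionally and keeps derived types.
    // An arity mismatch is left alone here; the engine reports it.
    if let Some(names) = col_list {
        for (col, new_name) in out.iter_mut().zip(names) {
            col.name = Some((*new_name).to_string());
        }
    }
    out
}
```

The un-aliased computed case yields a column with neither name nor type, which is the slot that qualified-prefix completion later skips.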

**Completion past `cte_alias.|`.** Where the derivation produced
named columns (every form above except computed-no-alias), they
complete with their names and (where typed) participate in §11's
result-type resolution if the CTE's columns are projected
upstream. Where the derivation produced an unnamed column slot,
that slot is silently skipped from the qualified-prefix candidate
list — the user typing `cte.|` past it sees only the nameable
columns. The cure for "I want my expression to be referenceable
from outside the CTE" is to add an alias, which is the same cure
the engine itself enforces at execution time.

This is substantially better than the earlier "honest limitation"
posture: the common `SELECT *` body is fully resolvable; explicit
projections are resolvable; only un-aliased computed columns
elude us, and the right learner response there is the same as
the engine's right learner response — write an alias.

`cte_bindings` lives on the scope frame, so a CTE defined in an
outer query is visible inside an inner subquery as a table source
unless that subquery defines a CTE of the same name (which
shadows it, per standard SQL).

#### 10.4. Projection-alias bindings

Standard SQL admits `ORDER BY` referencing a SELECT-list alias:
`SELECT a + b AS total FROM t ORDER BY total`. A third frame-local
accumulator:

```
projection_aliases: Vec<String>
```

Each `projection_item`'s optional alias (whether `AS x` or bare
`x`; see §1) appends its name. `Ident` slots inside the trailing
`ORDER BY`'s `sql_expr`s offer projection aliases as additional
candidates alongside column names. This also settles how the bare
aliases admitted in §1 behave in completion.

The accumulator is not consulted inside `WHERE`, `GROUP BY`, or
`HAVING` — standard SQL forbids alias references there
(aliases are not yet bound at evaluation time). The grammar
admits them structurally regardless; the engine rejects; ADR-0019
renders the engine-neutral error.
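
The clause-dependent lookup can be sketched in a few lines. `Clause` and `ident_candidates` are hypothetical names; the real walker would know the clause from its matched path rather than from an explicit parameter.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Clause {
    Where,
    GroupBy,
    Having,
    OrderBy,
}

/// Column names are always offered; projection aliases are added
/// only when completing inside ORDER BY.
fn ident_candidates<'a>(
    columns: &'a [String],
    projection_aliases: &'a [String],
    clause: Clause,
) -> Vec<&'a str> {
    let mut out: Vec<&str> = columns.iter().map(String::as_str).collect();
    if clause == Clause::OrderBy {
        for alias in projection_aliases {
            if !out.contains(&alias.as_str()) {
                out.push(alias.as_str());
            }
        }
    }
    out
}
```

For `SELECT a + b AS total FROM t ORDER BY |` this offers `total` next to `a` and `b`, while the same cursor inside `WHERE` offers only the columns.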

#### 10.5. Qualified-prefix completion

§5 fixed the grammar for `t.c` references. The completion
behaviour at qualified positions:

- At an `Ident` cursor with **no prefix**, candidates are the
  union of every `from_scope` binding's columns, plus
  `projection_aliases` when in `ORDER BY`, deduplicated. CTE-name
  candidates apply only in table-source slots, not column slots.
- At an `Ident` cursor immediately after `prefix '.'`, candidates
  are **scoped**: resolve `prefix` against the active `from_scope`
  (preferring alias matches over table matches, since aliases
  shadow), and offer that binding's columns alone. If `prefix`
  doesn't resolve to a binding, the candidate list is empty — the
  walker's expected-set still surfaces the syntactic alternatives
  (the user sees no column candidates but the structural error
  message reports the unresolved prefix).

The qualified-prefix narrowing is a small extension to the
existing `IdentSource::Columns` handling: when the matched-path
immediately preceding the `Ident` ends with `Ident '.'`, the
completer is told the prefix and narrows accordingly. This is the
only completion-source-level change; the rest is data flowing
through the new accumulators.
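
A minimal sketch of the qualified-prefix narrowing, assuming exact (case-sensitive) name matching and a `TableBinding` whose columns are reduced to names. `resolve_prefix` and `qualified_candidates` are hypothetical helpers.

```rust
struct TableBinding {
    table: String,
    alias: Option<String>,
    columns: Vec<String>,
}

/// Alias matches win over table-name matches, since aliases shadow.
fn resolve_prefix<'a>(from_scope: &'a [TableBinding], prefix: &str) -> Option<&'a TableBinding> {
    from_scope
        .iter()
        .find(|b| b.alias.as_deref() == Some(prefix))
        .or_else(|| from_scope.iter().find(|b| b.table == prefix))
}

/// Candidates after `prefix '.'`: that binding's columns alone, or
/// nothing when the prefix does not resolve.
fn qualified_candidates<'a>(from_scope: &'a [TableBinding], prefix: &str) -> Vec<&'a str> {
    resolve_prefix(from_scope, prefix)
        .map(|b| b.columns.iter().map(String::as_str).collect())
        .unwrap_or_default()
}
```

An unresolved prefix produces an empty candidate list here; reporting the unresolved prefix is left to the diagnostic path.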

#### 10.6. The projection-before-FROM problem

Standard SQL writes projection **before** `FROM`. A user typing
`select col1, col2 from mytable` produces, mid-typing, a state
where the projection list has been parsed but the `FROM` has not.
At that point the column-name completer cannot scope to
`mytable` — it does not know `mytable` is coming. Validation and
highlighting face the same problem: `col1` and `col2` cannot be
checked as belonging to `mytable` until the user types `from
mytable`. The debounced re-walk on every keystroke (ADR-0027) is
**not** sufficient on its own to fix this in a single-pass walker,
because by the time the FROM is parsed, the projection
identifiers have already been resolved (left-to-right) against
the only scope information available at that moment — the empty
`from_scope`.

There is no fully satisfying single-pass answer. Phase 2's
posture is therefore explicit:

1. **During-typing completion** of projection-list column names,
   when `from_scope` is empty (no `FROM` yet), uses the unioned
   `SchemaCache.columns` — every column known to the schema —
   as the candidate set. This is the same global fallback Phase 1
   uses and remains the right behaviour: a noisier-but-useful
   completion is better than no completion.
2. **A post-walk fixup pass** re-evaluates projection-list column
   refs against the *final* `from_scope` after the walk
   completes. The walk records each projection `Ident`'s
   span and matched-path location; once the walk reaches end-of-
   input (or end-of-statement), the fixup walks the recorded
   list, looks up each identifier against the final `from_scope`,
   and:
   - **Rewrites the highlight class** on that terminal —
     downgrading "column" → "unknown identifier" when the
     identifier doesn't belong to any in-scope binding,
     upgrading "unknown identifier" → "column" when it does.
   - **Updates the diagnostic** for the validity indicator
     (ADR-0027) — a column-not-found ERROR either appears or
     disappears based on the post-walk scope.

   **Integration point.** The fixup runs as the **final stage of
   the walk itself**, after all grammar nodes have been processed
   but before `WalkResult` is returned to the caller. It mutates
   the walker's accumulated highlight runs and diagnostics vector
   in place, so the consumer (the renderer, the validity
   indicator) sees a single coherent snapshot. This keeps the
   walker the single source of truth for what reaches the
   renderer — the fixup is conceptually part of "what the walker
   produces", not a separate post-processing layer. The same
   convention applies to the §11.6 SQL-expression predicate
   warnings, which also run as a final walk stage.
3. The fixup runs on every debounced re-walk (ADR-0027 already
   triggers the full walk per keystroke). While the user types
   `col1, col2 from mytable`, `col1` / `col2` initially highlight
   as generic identifiers (with a soft warning if not found
   anywhere in the schema). Once `mytable` is typed, within one
   debounce cycle the highlight snaps to the column class if
   `col1` / `col2` belong to `mytable`, or to the
   unknown-identifier diagnostic if they don't.

The fixup pass does not re-parse; it only re-resolves
identifiers against the final `from_scope`.
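
The fixup can be sketched as a pass over the recorded projection identifiers. The types and names below are hypothetical, and two simplifications are assumed: the final `from_scope` is flattened to a list of column names, and diagnostics are returned as a fresh vector rather than mutated in place as the integration point above describes. The sketch also assumes it is only invoked once a `FROM` has been seen.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HighlightClass {
    Column,
    UnknownIdentifier,
}

/// One projection identifier recorded during the forward walk.
struct ProjectionIdent {
    name: String,
    span: (usize, usize),
    class: HighlightClass,
}

struct Diagnostic {
    span: (usize, usize),
    message: String,
}

/// Re-resolves each recorded identifier against the final scope.
/// No re-parse happens; only classes and diagnostics change.
fn fixup_projection(
    idents: &mut [ProjectionIdent],
    final_scope_columns: &[&str],
) -> Vec<Diagnostic> {
    let mut diagnostics = Vec::new();
    for ident in idents.iter_mut() {
        if final_scope_columns.contains(&ident.name.as_str()) {
            ident.class = HighlightClass::Column;
        } else {
            ident.class = HighlightClass::UnknownIdentifier;
            diagnostics.push(Diagnostic {
                span: ident.span,
                message: format!("unknown column `{}`", ident.name),
            });
        }
    }
    diagnostics
}
```

Because the pass rewrites every recorded identifier on each run, an earlier column-not-found error disappears as soon as the final scope contains the name.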

`ORDER BY` alias resolution needs no fixup. Projection precedes
`ORDER BY` in walk order, so `projection_aliases` is fully
populated by the time the walker reaches an `ORDER BY` `Ident`;
the alias-as-column-candidate is resolved in the single forward
pass.

This is the answer to the user's "I think this may be automatic"
intuition: the debounced re-walk is automatic; the
post-walk fixup pass is the new infrastructure that makes the
re-walk produce *correct* results. Without it, projection-list
column refs would forever validate against the global column set
even after the `FROM` is typed.

#### 10.7. The honest split

§9 still holds for **grammar recursion**: `Subgrammar` and the
depth cap are reused unchanged. For **completion scope**, this
section introduces:

- New `WalkContext` fields: `from_scope`, `from_scope_stack`,
  `cte_bindings`, `projection_aliases`.
- Scope push/pop discipline at `SQL_SELECT_COMPOUND` `Subgrammar`
  boundaries — driven by a marker on the Subgrammar target so DSL
  Subgrammars are unaffected.
- A qualified-prefix narrowing in the `IdentSource::Columns`
  completion path.
- A post-walk fixup pass for projection-list identifier
  highlighting and validity (§10.6).

These are real walker-contract extensions. They are scoped: one
new node variant (`Node::ScopedSubgrammar`, §10.2), no new
walk-driver entry points, no changes to how plain `Subgrammar`
bodies are entered structurally. The existing DSL
paths are unaffected — their grammars never push a SELECT scope,
never define a CTE, never carry projection aliases — and the
single-table `current_table` / `current_table_columns` view is
preserved as a derived helper.

§9's claim is therefore restated honestly: **grammar recursion
needs no new walker capability; completion scope needs the
additions above.**
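
The scope additions listed above can be pictured as a stack of frames. The sketch below is illustrative only: the field names follow this section, but the method names (`push_frame`, `pop_frame`, `bind_table`, `columns_of`) are assumptions, and the innermost-first lookup is one plausible way to let a correlated subquery see its outer scope.

```rust
use std::collections::HashMap;

/// One SELECT frame's completion scope (field names follow §10).
#[derive(Debug, Default)]
struct ScopeFrame {
    from_scope: HashMap<String, Vec<String>>,   // binding -> columns
    cte_bindings: HashMap<String, Vec<String>>, // CTE name -> output columns
    projection_aliases: Vec<String>,
}

#[derive(Debug, Default)]
struct WalkContext {
    from_scope_stack: Vec<ScopeFrame>,
}

impl WalkContext {
    /// Called on scoped-subgrammar entry (hypothetical method name).
    fn push_frame(&mut self) {
        self.from_scope_stack.push(ScopeFrame::default());
    }

    /// Called on scoped-subgrammar exit; the popped frame can be harvested.
    fn pop_frame(&mut self) -> Option<ScopeFrame> {
        self.from_scope_stack.pop()
    }

    /// Record a FROM/JOIN binding in the active (top) frame.
    fn bind_table(&mut self, binding: &str, columns: &[&str]) {
        if let Some(top) = self.from_scope_stack.last_mut() {
            let cols = columns.iter().map(|c| c.to_string()).collect();
            top.from_scope.insert(binding.to_string(), cols);
        }
    }

    /// Innermost-first lookup, so inner frames can reach outer bindings.
    fn columns_of(&self, binding: &str) -> Option<&[String]> {
        self.from_scope_stack
            .iter()
            .rev()
            .find_map(|frame| frame.from_scope.get(binding))
            .map(Vec::as_slice)
    }

    /// Derived single-table view: defined only when the top frame
    /// holds exactly one binding.
    fn current_table(&self) -> Option<&str> {
        let top = self.from_scope_stack.last()?;
        if top.from_scope.len() == 1 {
            top.from_scope.keys().next().map(String::as_str)
        } else {
            None
        }
    }
}
```

DSL grammars never call `push_frame` in this picture, which is why their single-table behaviour is untouched.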

### 11. Diagnostics for Phase-2 validation cases

ADR-0027 fixes the warning-vs-error guideline verbatim:

> **ERROR** — the input is *known* to fail. Either it does not
> parse (incomplete, or a mismatched / invalid token), or it
> parses but names something that does not exist (an unknown
> table or column).
>
> **WARNING** — the input is valid and *will* run, but is very
> likely not what a knowledgeable user wants: a type-mismatched
> comparison, or `= NULL` (both from ADR-0026 §7). Amendment 1
> adds a third trigger — `LIKE` against a numeric column.
>
> The split is *certainty of failure* versus *likely misleading*.

This section walks the Phase-2 surface case-by-case, classifies
each against that guideline, and identifies the diagnostic
machinery additions needed. It also flags a Phase-1 carry-over
gap (§11.6) that Phase 2 closes.

#### 11.1. Existing diagnostics, briefly

Two post-walk passes today (`src/dsl/walker/mod.rs`):

- **Schema-existence pass** (ERROR). Walks the `MatchedPath`,
  checks every `IdentSource::Tables` / `IdentSource::Columns`
  ident against `SchemaCache`. Emits `diagnostic.unknown_table`
  and `diagnostic.unknown_column`. Today this assumes a single
  `current_table` for column resolution.
- **Expression predicate-warnings pass** (WARNING). Walks the
  parsed DSL `Expr` AST emitted by `expr.rs`'s builder.
  Emits `diagnostic.eq_null`, `diagnostic.type_mismatch`,
  `diagnostic.like_numeric`. Runs only on WHERE expressions in
  the DSL.

Phase 2 extends both, and §11.6 fills a SQL-side gap.

#### 11.2. Phase-2 new ERROR cases

Every case below is "known to fail on the engine" — the engine
would surface a message the friendly-error layer would translate
(ADR-0019). Surfacing them as pre-flight ERROR diagnostics gives
the learner the answer one debounce cycle faster, with the
walker as the single source of truth.

- **Unknown table in any `FROM`/`JOIN` slot.** The existing
  schema-existence pass extends from "the one
  `current_table`" to walking every `from_scope` binding's
  `table` and emitting `diagnostic.unknown_table` per
  unresolved name. CTE-name slots in the active
  `cte_bindings` are valid table sources and exempt from
  this check.
- **Unknown CTE-as-table.** A table-source slot whose name is
  not in `SchemaCache.tables` *and* not in the active
  `cte_bindings` chain emits `diagnostic.unknown_table` (same
  catalog key — from the learner's perspective the engine
  message is the same; the slot is a "table that doesn't
  exist", whether they meant a CTE or a base table).
- **Unknown table or alias in a qualified column reference**
  (`t.c` where `t` doesn't resolve in the active
  `from_scope`). New catalog key
  `diagnostic.unknown_qualifier` `{qualifier}`.
- **Unknown column in a qualified reference** (`t.c` where `t`
  resolves but `c` is not a column of that binding). Reuses
  `diagnostic.unknown_column` with the column name in context.
- **Ambiguous unqualified column reference** — a column name
  used unqualified that exists in two or more `from_scope`
  bindings. The engine raises "ambiguous column name"; we
  surface it as ERROR with a new catalog key
  `diagnostic.ambiguous_column` `{column}, {qualifiers}` so
  the learner sees every binding the name appeared in.
- **Reference to a projection alias in `WHERE` / `GROUP BY` /
  `HAVING`.** Standard SQL forbids it (aliases are not bound
  at evaluation time). The grammar admits the identifier
  structurally; a new diagnostic pass emits ERROR with a new
  catalog key `diagnostic.projection_alias_misplaced`
  `{alias}, {clause}`.
- **CTE column-list arity mismatch.** When `cte_name (col1,
  col2, …) AS (compound_select)` declares N columns and the
  body's projection (§10.3) derives M columns with N ≠ M, the
  CTE harvest pass (§10.3 stage 2) emits ERROR with a new
  catalog key `diagnostic.cte_arity_mismatch` `{cte},
  {declared}, {actual}`.
- **Compound-query column-count mismatch.** When a `UNION` /
  `INTERSECT` / `EXCEPT` chain has legs whose projection
  arities differ, the engine errors at execution. Phase 2
  catches it pre-flight: each leg's derived arity (the same
  derivation the CTE harvest uses) is compared as the
  compound is assembled. ERROR with a new catalog key
  `diagnostic.compound_arity_mismatch` `{op}, {left_n},
  {right_n}`.
- **Internal-table reference in any new table-source slot.**
  Already a parse-time rejection via
  `reject_internal_table` (§1, §4) — surfaces as a parse
  error, not a post-walk diagnostic. Listed here for
  completeness: the catalog key `select.internal_table`
  authored in Phase 1 covers every Phase-2 slot too.
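
The three column-reference cases in the list above (unknown qualifier, unknown column, ambiguous unqualified name) can be sketched as one resolution function. This is a simplified illustration: `RefError` and `resolve_column` are hypothetical names, the scope is a plain slice rather than the real `from_scope`, and names are compared exactly rather than case-insensitively.

```rust
/// Outcome classes mirroring the catalog keys of §11.2 (hypothetical type).
#[derive(Debug, PartialEq)]
enum RefError {
    UnknownQualifier { qualifier: String },
    UnknownColumn { column: String },
    AmbiguousColumn { column: String, qualifiers: Vec<String> },
}

/// Resolve `qualifier.column` or a bare `column` against the bindings in
/// scope. On success, returns the binding that owns the column.
fn resolve_column(
    from_scope: &[(&str, Vec<&str>)],
    qualifier: Option<&str>,
    column: &str,
) -> Result<String, RefError> {
    match qualifier {
        Some(q) => {
            // `t.c`: the qualifier must name a binding in scope...
            let (_, cols) = from_scope
                .iter()
                .find(|(binding, _)| *binding == q)
                .ok_or_else(|| RefError::UnknownQualifier { qualifier: q.to_string() })?;
            // ...and the column must belong to that binding.
            if cols.contains(&column) {
                Ok(q.to_string())
            } else {
                Err(RefError::UnknownColumn { column: column.to_string() })
            }
        }
        None => {
            // Bare `c`: collect every binding that has a column of this name.
            let owners: Vec<String> = from_scope
                .iter()
                .filter(|(_, cols)| cols.contains(&column))
                .map(|(binding, _)| binding.to_string())
                .collect();
            match owners.len() {
                0 => Err(RefError::UnknownColumn { column: column.to_string() }),
                1 => Ok(owners.into_iter().next().unwrap()),
                _ => Err(RefError::AmbiguousColumn {
                    column: column.to_string(),
                    qualifiers: owners,
                }),
            }
        }
    }
}
```

With bindings `a(id, name)` and `b(id, total)`, a bare `id` yields `AmbiguousColumn` listing both `a` and `b`, while `b.name` yields `UnknownColumn`.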

#### 11.3. Phase-2 new WARNING cases

The existing WARNING set (`= NULL`, type-mismatched
comparison, `LIKE`-on-numeric) is the right set. Phase-2 adds
**no new WARNING categories** — every Phase-2-specific case
falls into ERROR (§11.2) or engine-rejected (§11.4).

Considered and rejected as WARNINGs:

- **CTE name shadowing a base table.** Standard SQL behaviour;
  often intentional (the canonical "filter to a subset, then
  query as if it were the base table" pattern). No diagnostic.
- **Correlated reference without explicit qualification.**
  Correlation is implicit in standard SQL; per the user
  guideline a knowledgeable user does want this. The walker
  validates the reference silently against the outer-frame
  scope; no warning, no diagnostic.
- **Unused CTE.** A CTE defined in `WITH` but never referenced.
  The engine ignores it; many learners write CTEs as
  intermediate scratch space. Not a warning.

#### 11.4. Engine-rejected (no diagnostic)

These fail on the engine and surface via ADR-0019's
friendly-error layer at execution time. The walker does not
attempt pre-flight detection; the reason is given with each
case:
- **Non-aggregated columns in projection with `GROUP BY`** —
  detecting requires knowing which function names are
  aggregates; ADR-0030 §13 OOS-3 / ADR-0031 §6 keep us
  allowlist-free.
- **Aggregate function in `WHERE`** — same reason.
- **Scalar subquery returning multiple rows** — semantic, not
  syntactic; requires execution.
- **Recursive CTE without a `UNION`** — requires inspection of
  the body's compound shape against the recursive contract;
  doable in principle, deferred as engine territory.
- **Duplicate CTE names within the same `WITH`** — checkable
  in principle (walking `cte_bindings` for duplicates), but
  the engine catches it cleanly. Phase 2 does not pre-flight
  it; could be added later if its absence proves confusing.
- **Type-mismatched JOIN ON predicates** — the existing
  expression type-mismatch warning (extended per §11.6)
  handles the explicit-literal case; arbitrary-expression
  cases require type inference and stay engine-side.

#### 11.5. Catalog additions

Phase 2 adds the following message-catalog keys (ADR-0019).
Every key is engine-neutral by construction.

Parse-time-detectable (post-walk diagnostic passes):

| Key                                    | Slots                                            |
|----------------------------------------|--------------------------------------------------|
| `diagnostic.unknown_qualifier`         | `{qualifier}`                                    |
| `diagnostic.ambiguous_column`          | `{column}, {qualifiers}`                         |
| `diagnostic.projection_alias_misplaced`| `{alias}, {clause}`                              |
| `diagnostic.cte_arity_mismatch`        | `{cte}, {declared}, {actual}`                    |
| `diagnostic.compound_arity_mismatch`   | `{op}, {left_n}, {right_n}`                      |

Engine-error translations (friendly-error layer; reached on
execution failure):

| Key                                    | Engine cause                                     |
|----------------------------------------|--------------------------------------------------|
| `engine.no_such_table`                 | `no such table: <name>` (post-execution path)    |
| `engine.no_such_column`                | `no such column: <name>` (post-execution path)   |
| `engine.ambiguous_column`              | `ambiguous column name: <name>`                  |
| `engine.aggregate_misuse`              | `misuse of aggregate function <name>()`          |
| `engine.group_by_required`             | `column must appear in the GROUP BY clause or be used in an aggregate function` (or equivalent) |
| `engine.compound_arity_mismatch`       | `SELECTs to the left and right of UNION do not have the same number of result columns` (or equivalent) |
| `engine.scalar_subquery_too_many_rows` | scalar subquery cardinality violation            |
| `engine.recursive_cte_malformed`       | recursive CTE shape errors                       |

The parse-time keys and the engine keys are intentionally
separate even when they describe the same situation
(`engine.ambiguous_column` mirrors
`diagnostic.ambiguous_column`) — the parse-time message can
include the learner's typed text and span; the engine-time
message catches what the parser missed and routes through the
friendly-error layer with whatever context the engine yielded.

Three pre-existing parse-time keys are reused unchanged for
Phase-2 slots: `diagnostic.unknown_table`,
`diagnostic.unknown_column`, and the Phase-1
`select.internal_table`.
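
As a rough illustration of how the `{slot}` names in the tables above might be filled, the sketch below substitutes named slots into a template string. The `render` function and the message wording are hypothetical; the real catalog format and lookup are governed by ADR-0019 and are not reproduced here.

```rust
/// Fill `{slot}` placeholders in a catalog template (hypothetical helper).
/// Slots with no matching placeholder are ignored; unknown placeholders
/// are left in the output untouched.
fn render(template: &str, slots: &[(&str, &str)]) -> String {
    let mut out = template.to_string();
    for (name, value) in slots {
        // `{{{name}}}` expands to a literal `{`, the slot name, and `}`.
        out = out.replace(&format!("{{{name}}}"), value);
    }
    out
}
```

For example, rendering an invented template `Column '{column}' is ambiguous; it exists in: {qualifiers}.` with `column = id` and `qualifiers = a, b` gives `Column 'id' is ambiguous; it exists in: a, b.`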

#### 11.6. The Phase-1 SQL-expression predicate-warning gap

ADR-0027 Amendment 1's `LIKE`-on-numeric warning, and ADR-0026
§7's `= NULL` and type-mismatch warnings, are emitted by a pass
that walks the **DSL** `Expr` AST. Phase 1's `sql_expr.rs`
deliberately builds **no AST** (ADR-0031 §2). The consequence
is a Phase-1 carry-over gap: **SQL `WHERE` expressions today
emit none of these warnings** — `select * from t where name
like 5` parses, the engine runs it, and the learner gets the
engine's verdict, not the friendly pre-flight nudge ADR-0027
Amendment 1 promised.

Phase 2 closes this. The predicate-warnings pass gains a
**MatchedPath-walking variant** that runs over the SQL
expression nodes and identifies the predicate shapes
structurally (a `LIKE` predicate-tail with a column-ref left
operand; a `=`/`!=` predicate-tail with a `NULL` literal
operand; a comparison predicate-tail with a column-literal
operand pair of mismatched types). It does not need an `Expr`
AST because the matched-path terminals carry both the byte spans
(for the diagnostic) and the node-name labels (for shape
identification). The same catalog keys (`diagnostic.eq_null`,
`diagnostic.type_mismatch`, `diagnostic.like_numeric`) apply
unchanged; only the pass implementation is new.

The MatchedPath-walking pass runs over **every** Phase-2
`sql_expr` slot — `WHERE`, `HAVING`, `ON`, `CASE` branches,
projection items, `ORDER BY` items — so warnings surface
uniformly across the SQL surface rather than just `WHERE`. This
is a strict improvement over Phase 1's behaviour, where even
Phase-1 SELECT WHERE expressions got no predicate warnings.

Type-resolution for the MatchedPath-walking pass: a column ref's
type comes from §10's `from_scope` (or, for `t.c`, the specific
binding); a literal's type comes from its lexical class. When
the column ref doesn't resolve (the schema-existence ERROR pass
will already have flagged it), the warning pass skips the
predicate — no point compounding diagnostics on an already-
broken reference.
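
A minimal sketch of the shape-based classification described above, assuming the predicate's operands and operator have already been read off the matched-path labels. `Ty`, `Operand`, `Op`, `Predicate`, and `predicate_warning` are hypothetical stand-ins; only the three catalog keys come from this ADR.

```rust
/// Stand-in for playground types (the real set is larger).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int,
    Real,
    Text,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Operand {
    Column(Option<Ty>), // `None` means the column ref did not resolve
    Literal(Ty),        // type taken from the literal's lexical class
    Null,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    NotEq,
    Lt,
    Like,
}

struct Predicate {
    left: Operand,
    op: Op,
    right: Operand,
    span: (usize, usize), // byte span used for the diagnostic
}

/// Return the warning key and span for a predicate, if any applies.
fn predicate_warning(p: &Predicate) -> Option<(&'static str, (usize, usize))> {
    use Operand::{Column, Literal, Null};
    let key = match (p.left, p.op, p.right) {
        // Unresolved column: the schema-existence ERROR pass already flagged it.
        (Column(None), _, _) | (_, _, Column(None)) => return None,
        // `= NULL` / `!= NULL` on either side.
        (_, Op::Eq | Op::NotEq, Null) | (Null, Op::Eq | Op::NotEq, _) => "diagnostic.eq_null",
        // `LIKE` with a numeric column as the left operand.
        (Column(Some(Ty::Int | Ty::Real)), Op::Like, _) => "diagnostic.like_numeric",
        // Column compared against a literal of a different type.
        (Column(Some(a)), Op::Eq | Op::NotEq | Op::Lt, Literal(b))
        | (Literal(b), Op::Eq | Op::NotEq | Op::Lt, Column(Some(a)))
            if a != b =>
        {
            "diagnostic.type_mismatch"
        }
        _ => return None,
    };
    Some((key, p.span))
}
```

The first match arm is what implements the "skip already-broken references" rule; everything after it assumes resolved types.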

#### 11.7. Mechanism summary

Three diagnostic passes by end of Phase 2, all running as final
stages of the walk (per §10.6's integration-point convention):

1. **Schema-existence ERROR pass** — extended from single
   `current_table` to walking every `from_scope` binding and
   the active `cte_bindings`. Adds the qualified-reference
   and ambiguity checks (§11.2).
2. **Arity-check ERROR pass** (new) — runs at CTE-body and
   compound-query frame-exits (the same `ScopedSubgrammar`
   exit hook §10.3 uses), comparing declared vs derived
   column counts.
3. **Predicate-warnings pass** — extended with a
   MatchedPath-walking variant for `sql_expr` (§11.6) covering
   `= NULL`, type mismatch, and `LIKE`-on-numeric across every
   SQL expression slot, in addition to the existing DSL `Expr`
   AST variant for DSL expressions.

Per the integration-point convention (§10.6), each pass
mutates the walker's accumulated highlight runs and diagnostics
in place; the consumer sees a single coherent snapshot.

The projection-list fixup of §10.6 is conceptually part of pass
(1) — it is the same "re-resolve identifier against final
scope" operation, applied to the small subset of identifiers
whose scope wasn't fully known at first-pass walk time.
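
The arity-check pass (pass 2) reduces to two small comparisons once each leg's column count has been derived. The sketch below assumes those counts are already available; `ArityError`, `check_cte_arity`, and `check_compound_arity` are hypothetical names, and the error fields mirror the `{cte}, {declared}, {actual}` and `{op}, {left_n}, {right_n}` slots of §11.5.

```rust
#[derive(Debug, PartialEq)]
enum ArityError {
    Cte { cte: String, declared: usize, actual: usize },
    Compound { op: String, left_n: usize, right_n: usize },
}

/// `cte_name (c1, c2, ...) AS (body)`: the declared column count must
/// equal the count derived from the body's projection. A CTE with no
/// column list (`declared == None`) has nothing to check.
fn check_cte_arity(cte: &str, declared: Option<usize>, actual: usize) -> Option<ArityError> {
    match declared {
        Some(n) if n != actual => Some(ArityError::Cte {
            cte: cte.to_string(),
            declared: n,
            actual,
        }),
        _ => None,
    }
}

/// A compound chain: the first leg's arity, then `(operator, arity)` for
/// each following leg. Reports the first leg that disagrees.
fn check_compound_arity(first: usize, rest: &[(&str, usize)]) -> Option<ArityError> {
    for &(op, right_n) in rest {
        if right_n != first {
            return Some(ArityError::Compound {
                op: op.to_string(),
                left_n: first,
                right_n,
            });
        }
    }
    None
}
```

Deriving the counts themselves (for instance expanding `*` against the schema) is the harder part and is left out of this sketch.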

### 12. Result-column type resolution

Phase 1's all-`None` `column_types` vector is partially lifted: where a
projection item is structurally a single column reference, the
worker resolves it back to the source column's playground type
(ADR-0005) and populates that slot in `DataResult.column_types`.
Everything else stays `None`.

This addresses Phase-1 autonomous decision §4.5 (bool SELECT
results render as `0`/`1`): a bare `bool` column now renders as
`true` / `false` again, alignment recovers, and the `show data`
rendering path is reached for the common case.

**Resolution rule.** A projection item is "structurally a single
column reference" when, after stripping an optional `[ AS ]
alias`, its expression is one of:

- An unqualified identifier (`Name`) that resolves uniquely to a
  single column across the FROM tables;
- A qualified reference (`t.c` / `alias.c`) that resolves
  unambiguously through the FROM aliases.

Anything else — function calls, arithmetic, `CASE`, literals,
subquery expressions, the `*` and `t.*` wildcards — keeps
`column_types[i] = None`. When resolution is ambiguous
(unqualified column name appears in two FROM tables) the
grammar admits it (engine resolves or errors); the type-resolver
returns `None` and the renderer falls back to neutral alignment.

**Implementation seam.** The strongly preferred mechanism is
**engine-side column-origin lookup**: after preparing the
statement, query the prepared statement for each result column's
underlying table and column. The engine knows authoritatively
which result columns are direct references and which are
expressions; for direct references it returns the source
table+column, for expressions it returns nothing. This avoids
re-parsing the source or adding structured projection-item data
to the `MatchedPath` — the grammar tier is not involved at all,
which preserves ADR-0031 §2's "no AST" decision and stays on the
right side of ADR-0030's "one source of truth" rule.

The Phase-2 implementer verifies that the rusqlite version
pinned in `Cargo.toml` exposes this metadata (the SQLite C API
calls are `sqlite3_column_table_name` /
`sqlite3_column_origin_name` — they have been stable for two
decades; rusqlite either exposes them directly or via the
underlying `*mut sqlite3_stmt` handle). If exposure turns out
to be awkward, the fallback is a small post-parse walk over the
projection-item subtrees in the `MatchedPath` — strictly worse
because it duplicates a slice of parsing, but available.

The resolution pass adds one method on `Database` (something
like `resolve_select_column_types`) called from `do_run_select`
before the `DataResult` is shipped. It takes the prepared
statement and the active `SchemaCache`, and returns
`Vec<Option<Type>>`. The renderer needs no change — `None`
slots already render as typeless.

This is the only execution-path change Phase 2 makes; everything
else routes through Phase 1's grammar-as-text execution.

### 13. Out of scope

- **OOS-1. Derived tables in `FROM`** — `FROM (SELECT …) [AS]
  alias`. The same shapes are reachable via CTEs (§4), which
  Phase 2 ships. Derived tables in `FROM` are not authored here.
- **OOS-2. `NATURAL JOIN` and `JOIN … USING (col)`.** Both are
  convenience forms. NATURAL is widely considered a footgun;
  USING is cleaner but adds grammar weight without lifting any
  expressive ceiling. Out.
- **OOS-3. Comma-list `FROM t1, t2` (implicit cross join).** Out.
  `CROSS JOIN` covers the same shape explicitly.
- **OOS-4. `LIMIT m, n` (the legacy comma form).** Out (§8).
- **OOS-5. Window functions** (`OVER (…)`, `PARTITION BY`,
  window-frame syntax). A meaningful learning topic, but a large
  surface of its own and out of ADR-0030's commissioned set.
- **OOS-6. `LATERAL` joins.** Not commissioned by ADR-0030.
- **OOS-7. `VALUES (…)` as a row source.** Not commissioned.
- **OOS-8. A function/aggregate allowlist** — ADR-0030 §13
  OOS-3 / ADR-0031 §7 OOS-4 still apply: aggregate names parse
  generically through `name_or_call`.
- **OOS-9. Quoted identifiers** (`"column name"`). Still tracked
  as ADR-0031 §7 OOS-3.
- **OOS-10. Engine-checked aggregate correctness at parse
  time.** The grammar admits structurally; engine rejects
  semantically; ADR-0019 surfaces the engine's verdict in
  engine-neutral wording (§7).
- **OOS-11. Result-column type resolution beyond bare column
  refs.** Computed columns (`a + b`, `upper(name)`, `CASE …`)
  stay typeless (§12).
- **OOS-12. The `help sql` page and parse-error usage entries**
  for the Phase-2 surface. The grammar carries the `help_id`s
  authored in this phase, but the page content and the rich
  per-command usage messages are Phase 6 (ADR-0030 §10) and
  ADR-0021. Phase 2 leaves the same `help_id: None` shape Phase
  1 used for `select`.

## Consequences

- A new grammar file, `src/dsl/grammar/sql_select.rs`, parallel
  to `sql_expr.rs`, exporting `pub static SQL_SELECT_STATEMENT:
  Node` and `pub static SQL_SELECT_COMPOUND: Node`. The Phase-1
  `data::SELECT` `CommandNode` is rebuilt against
  `SQL_SELECT_STATEMENT` (its body becomes a `Subgrammar`
  reference); the `CommandNode` itself stays.
- **Phase-1 SQL `SELECT` grammar nodes migrate.** The Phase-1
  static nodes that live in `src/dsl/grammar/data.rs` for the
  single-table SELECT (the projection, FROM, WHERE, ORDER-BY,
  LIMIT sub-trees) move into `sql_select.rs` as the
  starting point for the §1 productions; `data.rs` keeps only
  the `CommandNode` shell. The seven Phase-1 SQL `SELECT`
  integration tests are part of the safety net for this
  migration — they must continue to pass under the rebuilt
  grammar, in addition to the new Phase-2 integration tests
  authored in step 4 of the implementation notes.
- **Hint-panel prose** for the new clauses (JOIN flavours, ON,
  GROUP BY, HAVING, UNION / INTERSECT / EXCEPT, WITH, OFFSET, the
  qualified-prefix and CTE-prefix completion states) is
  authored at the structural level alongside each grammar node
  in step 1 — a one-liner per slot, enough to drive the hint
  panel. Richer per-clause teaching prose and the `help sql`
  reference page remain ADR-0030 Phase 6 work (§13 OOS-12).
- **Walker cost is expected to stay proportional to source
  length.** The new accumulators are `O(bindings + aliases)`
  per frame; the scope stack is bounded by `MAX_SUBGRAMMAR_DEPTH
  = 64` (§9); the §10.6 post-walk fixup pass touches one entry
  per projection-list `Ident` (a small set). Each debounced
  keystroke (ADR-0027) walks once, fixes up once, and emits a
  single coherent output. No new pathological case is
  introduced — if a learner-realistic query produces a
  noticeable typing-time stall, measure first and revisit the
  recursion budget or the accumulator structure on evidence.
- `sql_expr.rs` gains three additive `Choice` branches and one
  additive tail on `name_or_call` (§5, §6). The existing tiers
  and the depth-cap discipline are unchanged. The Phase-1 tests
  continue to exercise the existing branches as they stand.
- **No new walker capability for grammar recursion** (§9, as
  restated in §10.7). `Subgrammar`, the depth
  counter, the cap, and the friendly depth error are all reused
  unchanged — the same posture ADR-0031 took.
- `Command::Select { sql: String }` is unchanged. The validated
  source SQL is simply larger; the worker still routes it through
  `Database::run_select` and `do_run_select` (Phase 1 path).
- The worker gains a post-prepare type-resolution helper that
  populates `column_types` for direct-reference projection items
  (§12) via the engine's column-origin metadata. **`Cargo.toml`
  adds `column_metadata` to `rusqlite`'s feature list**
  (alongside the existing `bundled`); this pulls in the SQLite
  `SQLITE_ENABLE_COLUMN_METADATA` compile flag and exposes
  `RawStatement::column_table_name` /
  `column_origin_name` / `column_database_name` on the prepared
  statement. Verified against the project's pinned rusqlite
  0.39.0. This is the only Phase-2 execution-path change.
- **Three diagnostic passes** (§11.7) — schema-existence
  (extended), CTE/compound arity-check (new), and predicate
  warnings (extended with a MatchedPath-walking variant for
  `sql_expr` — §11.6). All run as final walk stages and
  mutate the walker's accumulated output in place. Closes the
  Phase-1 carry-over gap where SQL `WHERE` expressions emitted
  no `LIKE`-on-numeric / type-mismatch / `= NULL` warnings.
- **Catalog additions** (§11.5) — five new `diagnostic.*` keys
  for parse-time-detectable cases and eight new `engine.*`
  keys for friendly-error layer translations of engine
  messages.
- The walker's `WalkContext` gains the completion-scope
  accumulators of §10 — a `from_scope_stack: Vec<ScopeFrame>`
  whose top frame is the active `from_scope` / `cte_bindings` /
  `projection_aliases`. A **new node variant `Node::Scoped
  Subgrammar(&Node)`** (§10.2) is the trigger for push/pop;
  existing `Node::Subgrammar` is unchanged so DSL `Expr` and
  `sql_expr` recursion are unaffected. A post-walk fixup pass
  re-resolves projection-list identifier highlighting and
  validity once the final `from_scope` is known (§10.6). CTE
  output columns are derived from the body's projection list at
  body-frame exit, populating the binding back into the outer
  frame (§10.3) — so `SELECT *` and explicit-projection CTE
  bodies both yield real column completion past `cte_alias.|`.
  This **softens §9's "no new walker capability" claim** for
  completion scope; grammar recursion still needs nothing new.
- `__rdbms_*` rejection extends to **every** table-source slot
  introduced by Phase 2: the `FROM` table, each `JOIN`'s table,
  each CTE name, and the `FROM` table inside any CTE body
  (§4, §6). The `reject_internal_table` validator is reused.
- Completion gains: SQL keywords for joins / set ops / `WITH` /
  `GROUP` / `HAVING` / `OFFSET` (all walker-derived, no
  bespoke code); column completion scoped to a qualified prefix
  `t.` resolves through the active `SchemaCache` (§5).
- Phase-1 autonomous decisions §4.1 and §4.3–§4.4 stand (optional
  `FROM`, `help_id: None`, walker-mode defaults). §4.2 is lifted
  (bare-alias projection admitted, §1). §4.5 is partially lifted
  (bare bool column refs recover their type via §12).
- `requirements.md`'s `Q1` / `Q2` advance further; `Q4` was
  already ticked by ADR-0030 and ADR-0031.

## Implementation notes

A build order, each step guarded by the test suite. The phases
within Phase 2 mirror the ADR-0030 / ADR-0031 staging — grammar
first, execution-path change last.

**Detailed plan: `docs/plans/20260520-adr-0032-phase-2.md`.**
The notes below are the outline; the plan refines them into
seven sub-phases (2a–2g) with per-gate exit criteria, a
cross-cut verification matrix that explicitly tests every
"X comes for free" claim from ADR-0030/0031/0032 (the kind of
implicit claim that produced the Phase-1 SQL-expression
predicate-warning gap §11.6 closes), and a final phase-exit
verification report template. Implementers work through the
plan; the ADR remains the decisions.

1. **The `sql_select.rs` grammar fragment.** Author the
   stratified tiers of §1 as named `static` `Node`s, recursion
   via `Subgrammar`. Export `SQL_SELECT_STATEMENT` and
   `SQL_SELECT_COMPOUND`. The existing `data::SELECT`
   `CommandNode` is rebuilt against `SQL_SELECT_STATEMENT`.
2. **Unit tests** against the fragment directly (the
   `expr.rs` / `sql_expr.rs` test pattern): JOIN flavours,
   GROUP BY / HAVING, qualified refs, every set-op, recursive
   and non-recursive CTEs, `LIMIT … OFFSET`, `DISTINCT`,
   `t.*` projection, the bare-alias projection, plus the
   keyword-case-insensitivity check.
3. **`sql_expr.rs` additive extensions** (§5, §6): the
   qualified-ref tail on `name_or_call`; the scalar-subquery
   `primary` branch; the `IN (subquery)` predicate-tail branch;
   the `EXISTS (subquery)` `primary` branch. Unit tests for each.
4. **Integration tests** (the `tests/` Tier-3 path, building on
   Phase 1's SQL `SELECT` tests): each JOIN flavour returns the
   expected rows; GROUP BY / HAVING aggregates over real data;
   `UNION` / `INTERSECT` / `EXCEPT` between two SELECTs; a
   non-recursive CTE; a recursive CTE (a small tree traversal
   or generated-sequence example); a scalar subquery in
   `WHERE`; `IN (SELECT …)`; `EXISTS (…)`; qualified refs
   resolving correctly.
5. **The `WalkContext` scope accumulators** (§10). Add the
   `ScopeFrame` type (`from_scope` / `cte_bindings` /
   `projection_aliases`) and the `from_scope_stack`; add the
   `Node::ScopedSubgrammar(&Node)` variant alongside the
   existing `Node::Subgrammar`; teach the driver to push/pop a
   fresh frame on `ScopedSubgrammar` entry/exit; rewrite every
   reference to `&SQL_SELECT_COMPOUND` from outside its own
   definition to use the new variant (subqueries in
   `sql_expr.rs`, CTE bodies in `sql_select.rs`); teach
   `from_clause` / `join_clause` to populate the frame's
   `from_scope`; teach `with_clause` to push placeholder CTE
   bindings before the body and harvest derived output columns
   on body-exit per §10.3; teach `projection_item` to append to
   `projection_aliases`. Keep `current_table` /
   `current_table_columns` as derived helpers (top frame's
   single-binding view) so the DSL paths stay green.
6. **Qualified-prefix completion** (§10.5). When the
   matched-path immediately preceding an `IdentSource::Columns`
   slot ends with `Ident '.'`, narrow candidates to the named
   binding's columns. Unit tests: `select t.` Tab offers
   `t`'s columns; an unresolved prefix returns an empty list.
7. **Post-walk fixup pass** (§10.6). Collect projection-list
   `Ident` terminals during the walk; after the walk, re-resolve
   each against the final `from_scope`, rewriting the highlight
   class and validity diagnostic. Tests: typing `select col1
   from t` lights `col1` correctly once `t` is typed; typing
   `select bogus from t` produces a column-not-found diagnostic.
8. **Diagnostic passes** (§11). Extend the schema-existence
   ERROR pass to walk every `from_scope` binding plus
   `cte_bindings`; add the qualified-reference and ambiguity
   checks (§11.2). Add the new arity-check ERROR pass at the
   CTE-body and compound-query frame-exit hooks (§11.7 pass
   2). Extend the predicate-warnings pass with a
   MatchedPath-walking variant covering every Phase-2
   `sql_expr` slot (§11.6) — closes the Phase-1 carry-over
   gap. Author the five new `diagnostic.*` catalog keys and
   the eight new `engine.*` translation keys (§11.5).
   Tests: one positive and one negative case per new ERROR
   key; predicate warnings firing on `select * from t where
   col like 5` (the Phase-1 gap closure); arity-mismatch
   ERRORs on a CTE and on a `UNION`.
9. **Result-column type resolution** (§12). Add
   `"column_metadata"` to rusqlite's feature list in
   `Cargo.toml`. The worker's `do_run_select` calls the new
   resolver — `RawStatement::column_table_name` /
   `column_origin_name` per result column — before constructing
   the `DataResult`. Tests: a single-column SELECT recovers the
   playground type (covering each of the ten types, the
   pedagogically important one being `bool` → `true` / `false`);
   a SELECT with a computed projection keeps it typeless; a
   SELECT through a CTE recovers the underlying column's type
   if the engine's column-origin metadata follows through the
   CTE (verified, not assumed).
10. **Highlighting / completion / hint** spot-checks via the
    typing-surface matrix (ADR-0022 / ADR-0030 §8): a SELECT
    with a JOIN highlights the JOIN keywords; Tab past
    `select t.` offers columns of `t`; column completion
    inside a `WHERE` after `from a join b on …` offers both
    `a`'s and `b`'s columns; column completion inside a
    correlated subquery sees the outer scope; the `[ERR]`
    indicator fires on a malformed subquery; an out-of-subset
    construct (e.g. `OVER (…)`) produces an engine-neutral
    parse error.
11. **`reject_internal_table`** spot-checks against every new
    table-source slot: a `FROM __rdbms_columns` parse-rejects;
    a `WITH __rdbms_x AS (…)` parse-rejects; a `FROM` inside a
    CTE body referencing `__rdbms_*` parse-rejects.
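
Step 6's qualified-prefix narrowing can be sketched over a simplified view of the matched path. The `Terminal` enum and `column_candidates` function are hypothetical; the real check runs inside the `IdentSource::Columns` completion path and reads the walker's own terminals.

```rust
/// Simplified matched-path terminals (hypothetical; the walker's are richer).
#[derive(Debug, Clone, PartialEq)]
enum Terminal {
    Ident(String),
    Dot,
    Other,
}

/// Candidates for a column slot, given the terminals matched before it.
/// If the path ends with `Ident '.'`, only that binding's columns are
/// offered; an unresolved prefix yields an empty list.
fn column_candidates(preceding: &[Terminal], from_scope: &[(&str, Vec<&str>)]) -> Vec<String> {
    match preceding {
        [.., Terminal::Ident(qualifier), Terminal::Dot] => from_scope
            .iter()
            .find(|(binding, _)| binding.eq_ignore_ascii_case(qualifier))
            .map(|(_, cols)| cols.iter().map(|c| c.to_string()).collect::<Vec<String>>())
            .unwrap_or_default(),
        // No qualified prefix: offer the columns of every binding in scope.
        _ => from_scope
            .iter()
            .flat_map(|(_, cols)| cols.iter().map(|c| c.to_string()))
            .collect(),
    }
}
```

This matches the two unit tests named in step 6: `select t.` offers only `t`'s columns, and an unknown prefix offers nothing.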

Later phases continue ADR-0030's plan unchanged — Phase 3 (DML),
Phase 4 (DDL), Phase 5 (DSL → SQL echo), Phase 6 (polish).
ADR-0030 §13 OOS items (window functions, `LATERAL`, function
allowlist, quoted identifiers) remain tracked separately and are
authored if and when they are taken up; they are not implicit
follow-ups of Phase 2.

## Amendment 1 — Empirical scope of column-origin metadata (2026-05-20)

§12 was written conservatively: it constrained type recovery to
projection items "structurally a single column reference" and
listed "subquery expressions" alongside arithmetic and `CASE` as
cases that stay `None`. The implementation plan's Open Question 1
(`docs/plans/20260520-adr-0032-phase-2.md`) captured the matching
uncertainty about CTEs and scalar subqueries, leaving the test in
sub-phase 2f to "assert the actual behaviour (not the wished-for
behaviour)".

A throwaway probe against the pinned bundled SQLite (run
2026-05-20, with `rusqlite` 0.39.0 + `column_metadata`) settles
the question. Across twenty representative query shapes, the
engine's `sqlite3_column_table_name` / `sqlite3_column_origin_name`
metadata follows through:

- direct bare column refs (the baseline);
- `AS alias` projections (the alias remaps the output name but
  the origin pair stays the source `(table, column)`);
- table-alias qualified refs (`u.name` → `(users, name)`);
- non-recursive CTEs, including `SELECT *` bodies, bare-ref
  bodies, qualified-ref bodies, and `(col-list)`-renamed
  bodies (the rename remaps the output name; origin stays the
  underlying column);
- CTE chains (a CTE that selects from a prior CTE — origin
  traces back to the base table);
- derived tables in `FROM (SELECT …) AS sub` (out-of-scope for
  Phase 2 per §13 OOS-1, but useful to note: if ever admitted,
  type recovery comes for free);
- scalar subqueries used as a projection primary (`SELECT
  (SELECT name FROM users WHERE id = 1)` — origin is preserved
  whether the subquery has an outer alias or not);
- `UNION` / `UNION ALL` / `INTERSECT` / `EXCEPT` compound
  queries (result columns carry the first leg's origin);
- multi-table `JOIN` projections (per-column origin per leg);
- `IN (SELECT …)` subqueries in `WHERE` (the inner subquery
  does not affect the outer projection's origin).

The metadata returns `None` for exactly two structural classes:

- **Computed projections** — function calls, arithmetic
  expressions, string concatenation, `CASE` expressions,
  literals, the `*` and `t.*` wildcards. Expected; pedagogically
  obvious; no surprise for the learner.
- **Recursive CTE result columns** (`WITH RECURSIVE r(n) AS
  (SELECT 1 UNION ALL SELECT n + 1 FROM r WHERE n < 5) SELECT n
  FROM r`). The recursion materialises through an internal
  temporary table that has no base-column origin to point at.
  This is the one structural surprise — a recursive-CTE result
  column is typeless even when it is structurally a bare name
  reference, because the engine cannot trace the column back
  past the recursion.

### What §12's resolution rule becomes

The original §12 rule classifies projection items structurally
(unqualified ident / qualified ref → recover; everything else →
None). The empirical finding makes that classification redundant
and slightly wrong: it misses scalar subqueries and CTE-routed
refs that the engine does carry through, and it would have
needed extending for `(col-list)`-renamed CTEs.

The amended posture: **trust the engine's column-origin metadata
verbatim**. For each result column, call
`column_table_name(i)` / `column_origin_name(i)`. If both return
`Some`, look the pair up in the active `SchemaCache` and use the
playground type. If either is `None`, the slot stays `None` and
the renderer falls back to neutral alignment. No structural
classification of the projection item is needed; the grammar tier
stays uninvolved (preserving ADR-0031 §2's "no AST" decision and
ADR-0030's "one source of truth" rule, both as before).
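A minimal sketch of the engine-driven rule, assuming the worker has
already read the two origin names for each result column. The types
`SketchSchemaCache`, `ColumnOrigin`, and the two-variant
`PlaygroundType`, as well as the function name and signature, are
illustrative stand-ins and not the codebase's actual definitions.

```rust
use std::collections::HashMap;

/// Illustrative subset of the playground type vocabulary (assumption:
/// the real enum has more variants and may be named differently).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PlaygroundType {
    Integer,
    Text,
}

/// Hypothetical stand-in for the schema cache: maps a
/// (table, column) pair to its playground type.
struct SketchSchemaCache {
    columns: HashMap<(String, String), PlaygroundType>,
}

/// What the engine reports for one result column. Either name may be
/// absent (computed projections, recursive-CTE columns).
struct ColumnOrigin<'a> {
    table: Option<&'a str>,
    column: Option<&'a str>,
}

/// One lookup per result column, with no structural classification of
/// the projection item. A missing origin name, or a pair the cache does
/// not know, leaves the slot as `None`.
fn resolve_select_column_types_sketch(
    origins: &[ColumnOrigin<'_>],
    cache: &SketchSchemaCache,
) -> Vec<Option<PlaygroundType>> {
    origins
        .iter()
        .map(|origin| {
            // `?` short-circuits the closure to `None` if either name is absent.
            let (table, column) = (origin.table?, origin.column?);
            cache
                .columns
                .get(&(table.to_owned(), column.to_owned()))
                .copied()
        })
        .collect()
}
```

Under this sketch a recursive-CTE result column, which arrives with
both origin names absent, resolves to `None` by the same path as a
computed projection.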

The "structurally a single column reference" definition in §12's
**Resolution rule** is superseded by the engine-driven rule
above. The §12 **Implementation seam** is unchanged in approach
(engine-side column-origin lookup is still the mechanism), but
the speculative fallback paragraph ("If exposure turns out to be
awkward, the fallback is a small post-parse walk over the
projection-item subtrees in the `MatchedPath`") is moot — the
exposure works, and the engine's metadata is broader than a
grammar-side walk could be without re-implementing SQLite's
query-planner traceback. The fallback path is removed.

### Effect on the Phase-2 plan's sub-phase 2f

The 2f exit gate's "CTE pass-through" row should be asserted
positive (recovers `Some(text)`). The "Subquery result" row,
which the plan left as "assert whichever behaviour the engine
exhibits", should be asserted positive as well. A new explicit
2f test row covers the named limitation: a recursive CTE result
column must produce `column_types[0] = None` and the renderer
must fall back to neutral alignment without panicking.

The catalog and grammar-side work in 2a–2e is unaffected by this
amendment. Only 2f's test list and the worker's
`resolve_select_column_types` helper change shape (the helper
becomes simpler — no structural classification, just a direct
metadata lookup per result column).

This amendment narrows the honest limitation in §12 from
"computed / non-direct projection items" to "computed projections
and recursive CTE result columns" — a tighter, factually
verified carve-out.

## Amendment 2 — §10.6 fixup-pass mechanism (2026-05-20)

§10.6's prescription for the post-walk fixup is written in
terms of "rewriting the highlight class" on projection-list
`Ident` terminals — downgrading "column" → "unknown identifier"
when an ident doesn't belong to the eventual `from_scope`, or
upgrading the reverse direction once a `FROM` is typed. The
implementation chose a different mechanism that achieves the
identical user-visible effect; this Amendment records the
choice so a reader of §10.6 doesn't go looking for a literal
`per_byte_class` rewrite step that does not exist.

### Mechanism actually used

Two pieces, both already in the codebase by the end of
sub-phase 2d:

1. **Two-pass schema-existence diagnostic.** The 2d rewrite of
   `schema_existence_diagnostics` (`src/dsl/walker/mod.rs`)
   runs a pre-pass over the matched path that collects every
   `IdentSource::Tables` / `cte_name` / `table_alias` ident
   into a single binding vec, regardless of where in the path
   it sits. The main pass then resolves each `sql_expr_ident`
   against the **complete** binding set. A projection ident
   that resolves under the eventual FROM scope produces no
   diagnostic; one that doesn't produces an
   `unknown_column` diagnostic on its own span.

2. **Diagnostic-overlay renderer.** `src/input_render.rs`
   reads the walker's diagnostic list at every keystroke and
   overlays each diagnostic's span with the appropriate
   colour (Error red for unknown-column, Warning for
   type-mismatch / `LIKE`-on-numeric / etc.). The overlay
   sits on top of the walker's `per_byte_class` (which keeps
   all idents at `HighlightClass::Identifier`).

Combined, the two yield the §10.6 user-visible behaviour
within one debounce cycle. Typing `select bogus_col` emits
the diagnostic, and the overlay paints the ident red, as soon
as a FROM appears that shows the column doesn't exist. Typing
`select real_col` emits no diagnostic, and the ident stays
Identifier-coloured.
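The two-pass shape can be sketched as follows. This is a simplified
illustration, not the body of `schema_existence_diagnostics`: the
`PathIdent` and `SketchDiagnostic` types, the flat `schema` map, and
the early return when no table is bound are all assumptions made for
the example.

```rust
use std::collections::HashMap;

/// Hypothetical, flattened view of the identifiers on a matched path.
enum PathIdent<'a> {
    /// A table named in a table-source slot (FROM / JOIN).
    Table { name: &'a str },
    /// A column-position identifier with its byte span in the input.
    Column { name: &'a str, span: (usize, usize) },
}

#[derive(Debug, PartialEq)]
struct SketchDiagnostic {
    code: &'static str,
    span: (usize, usize),
}

fn schema_existence_sketch(
    path: &[PathIdent<'_>],
    schema: &HashMap<&str, Vec<&str>>,
) -> Vec<SketchDiagnostic> {
    // Pre-pass: collect every table binding, wherever it sits on the path.
    let bindings: Vec<&str> = path
        .iter()
        .filter_map(|ident| match ident {
            PathIdent::Table { name } => Some(*name),
            PathIdent::Column { .. } => None,
        })
        .collect();

    // Assumption: with no FROM typed yet there is no scope to resolve
    // against, so nothing is flagged.
    if bindings.is_empty() {
        return Vec::new();
    }

    // Main pass: resolve each column ident against the complete binding set.
    path.iter()
        .filter_map(|ident| match ident {
            PathIdent::Column { name, span } => {
                let known = bindings.iter().any(|table| {
                    schema
                        .get(*table)
                        .map_or(false, |cols| cols.iter().any(|c| c == name))
                });
                (!known).then(|| SketchDiagnostic {
                    code: "unknown_column",
                    span: *span,
                })
            }
            PathIdent::Table { .. } => None,
        })
        .collect()
}
```

Because the pre-pass runs over the whole path first, a projection
ident that appears before its `FROM` is still checked against the
table that follows it.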

### Why this is equivalent

§10.6's stated goal is correctness of the end-of-walk
classification — "rewriting the highlight class" is one
implementation strategy for that goal. The `HighlightClass`
enum in the codebase has only one identifier slot
(`Identifier`); the Error tint comes from the diagnostic
overlay, not from a separate `Column` vs `UnknownIdentifier`
class. The two-pass schema-existence diagnostic is the
"post-walk fixup" that §10.6 calls for; it just runs inside
the diagnostic emitter rather than as a separate rewrite
step. The integration
point (§10.6's "final stage of the walk itself") still
holds: `schema_existence_diagnostics` runs after the walk's
main work, mutating the walker's accumulated diagnostic
vector in place. Consumers see a single coherent snapshot.
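To illustrate why no extra `HighlightClass` variant is needed, here is
a hedged sketch of an overlay step. Only the `Identifier` variant is
taken from the text above; `Keyword`, `Plain`, `Severity`, `Paint`, and
the `overlay` function are hypothetical names, and the real renderer
in `src/input_render.rs` may be organised quite differently.

```rust
use std::ops::Range;

/// Base classification from the walk. Only `Identifier` is confirmed by
/// the ADR text; the other variants are placeholders.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HighlightClass {
    Keyword,
    Identifier,
    Plain,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
}

/// What ends up painted for one byte: the base class, or a diagnostic tint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Paint {
    Base(HighlightClass),
    Overlay(Severity),
}

/// Start from the per-byte classes and paint each diagnostic span on top.
/// The base classes themselves are never rewritten.
fn overlay(
    per_byte_class: &[HighlightClass],
    diagnostics: &[(Severity, Range<usize>)],
) -> Vec<Paint> {
    let mut out: Vec<Paint> = per_byte_class.iter().copied().map(Paint::Base).collect();
    for (severity, span) in diagnostics {
        // Clamp the span so a stale diagnostic cannot index out of bounds.
        let end = span.end.min(out.len());
        let start = span.start.min(end);
        for slot in &mut out[start..end] {
            *slot = Paint::Overlay(*severity);
        }
    }
    out
}
```

With an empty diagnostic list the output is the base classification
unchanged, which matches the `select real_col` case described earlier.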

### Completion mid-typing

§10.6's second user-visible promise — "during-typing
completion of projection-list column names uses the global
fallback" — is preserved as a posture, but improved at the
edges in sub-phase 2e by a look-ahead probe in
`src/completion.rs`. When the leading walk produces no
`from_scope` (the projection-before-FROM state) **and** the
full input does have a FROM after the cursor, a second walk
on the full input populates the binding set, and column
candidates narrow to that scope. The fallback to global
`SchemaCache.columns` remains the path when the full input
doesn't parse cleanly (e.g., the user deleted `*` and is
mid-edit). This is a strict improvement: the realistic
"edit an existing query" workflow now narrows correctly.
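The look-ahead decision can be sketched like this. The
`walk_from_scope` function is a toy stand-in for a real walk (it only
looks for the word after `from`), and `SketchSchema` and
`column_candidates` are hypothetical names; the sketch shows the
order of the fallbacks, not the code in `src/completion.rs`.

```rust
use std::collections::BTreeMap;

/// Hypothetical schema view: table name to column names.
type SketchSchema = BTreeMap<&'static str, Vec<&'static str>>;

/// Toy stand-in for a walk: returns the table named after the first
/// `from` keyword, or `None` if there is none.
fn walk_from_scope(input: &str) -> Option<String> {
    let mut words = input.split_whitespace();
    while let Some(word) = words.next() {
        if word.eq_ignore_ascii_case("from") {
            return words.next().map(|t| t.trim_end_matches(';').to_string());
        }
    }
    None
}

fn column_candidates(input: &str, cursor: usize, schema: &SketchSchema) -> Vec<&'static str> {
    // Leading walk: only the text before the cursor.
    let leading = input.get(..cursor).unwrap_or(input);
    let scope = walk_from_scope(leading)
        // Look-ahead probe: if the leading walk found no scope, walk the
        // full input in case a FROM sits after the cursor.
        .or_else(|| walk_from_scope(input));

    match scope.and_then(|table| schema.get(table.as_str())) {
        // Narrow to the columns of the table in scope.
        Some(columns) => columns.clone(),
        // Global fallback: every column of every table (also used here
        // when the scoped table is unknown, which is an assumption).
        None => schema.values().flatten().copied().collect(),
    }
}
```

For the input `select  from users` with the cursor just after
`select `, the leading walk finds no scope, the probe finds `users`,
and the candidates narrow to that table's columns.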

### What §10.6's prescription becomes

The "rewrite the highlight class" wording is superseded by:
**the post-walk diagnostic pass re-resolves projection
idents against the complete scope and emits / withholds the
unknown-column diagnostic accordingly; the renderer's
diagnostic-overlay path achieves the visual change**. No
new `HighlightClass` variant is required.

§10.6's other prescriptions stand verbatim — the integration
point (final walk stage, in-place mutation of walker
accumulators), the per-keystroke re-walk (ADR-0027's
debounced cadence), and the ORDER BY no-fixup-needed
clarification.

## See also

- ADR-0005 — the ten-type vocabulary §10 resolves back to.
- ADR-0016 — the data-table renderer SELECT results reuse.
- ADR-0019 — the friendly-error layer engine-side rejections
  route through (§7).
- ADR-0021 — per-command parse-error usage; the Phase-2 surface
  inherits the framework, Phase 6 polishes per-clause messages
  (§11 OOS-12).
- ADR-0022 — ambient typing assistance. §5/§6/§8 inherit its
  keyword-completion / highlighting / hint mechanisms for free,
  but §10 extends its `IdentSource::Columns` / `SchemaCache` /
  `WalkContext` infrastructure with the scope accumulators,
  qualified-prefix narrowing, and the post-walk fixup pass that
  Phase 2 needs.
- ADR-0023 / ADR-0024 — the unified grammar tree Phase 2 extends.
- ADR-0026 — the `WHERE` grammar's `Subgrammar` node, depth
  counter, and `MAX_SUBGRAMMAR_DEPTH = 64` cap, all reused
  unchanged (§9).
- ADR-0027 — the validity indicator, free for the Phase-2
  surface; §1 (ERROR/WARNING guideline) is the source quoted
  verbatim in §11; Amendment 1 (`LIKE`-on-numeric WARNING) is
  the warning whose absence from SQL expressions is the
  predicate-warnings gap that §11.6 closes.
- ADR-0028 — the styled `OutputLine` mechanism the renderer
  uses; not directly touched by Phase 2.
- ADR-0030 — the parent ADR; §3 commissions this phase, §4/§6
  fix execution-as-text, §7 fixes engine neutrality, §11 fixes
  history / replay, §13 fixes the long-running OOS list.
- ADR-0031 — the SQL expression grammar this ADR extends
  additively (§5, §6); §7 named the two extensions implemented
  here.
- `docs/simple-mode-limitations.md` — the DSL limits advanced
  mode lifts; Phase 2 lifts the JOIN, subquery, set-op, CTE,
  and grouping limits.