rdbms-playground/docs/adr/0032-sql-select-grammar.md
claude@clouddev1 ee0dafd86b docs: ADR-0032 Amendment 2 + §10.6 regression tests
Amendment 2 records the §10.6 fixup-pass mechanism choice. §10.6
prescribes "rewriting the highlight class" on projection-list
idents at end-of-walk; the actual implementation uses a different
mechanism that achieves the identical user-visible behavior:

1. 2d's two-pass schema-existence diagnostic collects every FROM
   binding from the matched path first, then resolves projection
   idents against the complete scope. The post-walk re-resolve
   §10.6 calls for, just embedded in the diagnostic emitter.

2. input_render.rs's diagnostic-overlay path colors each
   diagnostic span Error/Warning, achieving the visual change
   §10.6 describes without needing a new HighlightClass variant.

The completion-mid-typing piece is improved by the §10.5
look-ahead probe (sub-phase 2e earlier).

Four new regression tests in `projection_before_from_tests` pin
the behavior so a future refactor can't silently regress it:
correct ident resolves silently, unknown ident flags via
diagnostic on its span, multi-projection only flags unknowns,
projection-without-FROM is silent.

ADR index entry updated to reference Amendment 2.

Test totals: 1424 → 1428 passing (+4). Clippy clean.
2026-05-20 21:19:57 +00:00

# ADR-0032: The full SQL `SELECT` grammar
## Status
Accepted
## Context
ADR-0030 commissions advanced mode as a body of **SQL grammar
inside the unified grammar tree** (ADR-0023/0024), phased. Phase 1
("Foundations + first `SELECT`") shipped: a single-table `SELECT`
with projection, `WHERE`, `ORDER BY`, and `LIMIT`, executed as
validated SQL text through the existing data-table renderer.
ADR-0031 authored the SQL **expression grammar** the Phase-1
`SELECT` consumed.
Phase 2 — "`SELECT` — full" — is the next slice. ADR-0030 §3 lists
it: `JOIN`s, `GROUP BY` / `HAVING`, aggregates, subquery
expressions, `UNION`/`INTERSECT`/`EXCEPT`, common table
expressions, `LIMIT … OFFSET`, qualified column references.
ADR-0030 §3 also says the full `SELECT` grammar "is each large
enough to warrant their own focused ADR when implemented — the
precedent is ADR-0026 for the `WHERE` grammar." This is that ADR.
The architecture is fixed (ADR-0030 §1, §4, §6, §8): one walker,
grammar-as-text execution, ambient assistance for free. This ADR
fixes the **shape** of the grammar — the productions, the
recursion, the additive extensions to ADR-0031's expression
fragment, and the few execution-path implications (worker-side
column-origin lookup so result columns recover their playground
type). It deliberately does **not** revisit ADR-0030's structural
decisions; references in this ADR's text to ADR-0030 §X mean
"that decision is the controlling one."
### What ADR-0030 and ADR-0031 already fix
- **No batch parser; SQL is grammar in the unified tree.**
Subquery recursion is a `Node::Subgrammar(&NAMED)` reference,
exactly as the expression ladder uses it (ADR-0031 §3).
- **No AST builder for the parts that execute as text.**
`Command::Select { sql: String }` carries the validated source;
the worker prepares and runs it (ADR-0030 §4/§6, ADR-0031 §2).
- **The `__rdbms_*` rejection** at every table-name slot
(ADR-0030 §6) — re-applied to every Phase-2 table-source slot
(`FROM`, `JOIN`, CTE-name).
- **No allowlist for function names** (ADR-0030 §13 OOS-3,
ADR-0031 §6). Aggregates (`count`, `sum`, `avg`, `min`, `max`)
parse through the generic `name_or_call` path — the grammar is
structurally aggregate-blind, by design.
- **No quoted identifiers** (ADR-0031 §7 OOS-3) — unchanged.
- **`MAX_SUBGRAMMAR_DEPTH = 64`** (ADR-0026) is the shared
recursion budget across DSL `Expr`, SQL expression, and (added
here) SQL `SELECT` recursion. No new walker capability is
introduced (§9).
### The boundary with ADR-0031
ADR-0031 §7 named two additive extensions deferred to this ADR:
- **OOS-1: subquery expressions** — `( SELECT … )` as a `primary`,
`IN ( SELECT … )`, `EXISTS ( SELECT … )`. Their grammar is fixed
in §6; they are additive `Choice` branches in `sql_expr.rs`,
recursing into the named `SELECT` fragment authored here.
- **OOS-2: qualified column references** — `t.c` / `alias.c`.
Their grammar is fixed in §5; they are an additive tail on
`name_or_call` in `sql_expr.rs`.
`sql_expr.rs` was shaped to receive both branches without
restructuring (ADR-0031 §7 promise). This ADR redeems that
promise; the changes there are strictly additive.
## Decision
### 1. The top-level `SELECT` grammar
The full statement decomposes into a top-level *compound query*
(set-operator chains around per-leg *core selects*), wrapped by
an optional `WITH` prefix and trailing `ORDER BY` / `LIMIT`:
```
select_statement := [ with_clause ] compound_select
compound_select  := select_core ( set_op select_core )*
                    [ order_by_clause ]
                    [ limit_clause ]
set_op           := UNION [ ALL ] | INTERSECT | EXCEPT
select_core      := SELECT [ DISTINCT | ALL ]
                    projection_list
                    [ from_clause ]
                    [ where_clause ]
                    [ group_by_clause ]
                    [ having_clause ]
with_clause      := WITH [ RECURSIVE ] cte_def
                    ( ',' cte_def )*
cte_def          := identifier [ '(' column_name_list ')' ]
                    AS '(' compound_select ')'
projection_list  := projection_item ( ',' projection_item )*
projection_item  := '*'
                  | identifier '.' '*'
                  | sql_expr [ [ AS ] identifier ]
from_clause      := FROM table_source ( join_clause )*
table_source     := identifier [ [ AS ] identifier ]
join_clause      := [ INNER ] JOIN table_source ON sql_expr
                  | LEFT [ OUTER ] JOIN table_source ON sql_expr
                  | RIGHT [ OUTER ] JOIN table_source ON sql_expr
                  | FULL [ OUTER ] JOIN table_source ON sql_expr
                  | CROSS JOIN table_source
where_clause     := WHERE sql_expr
group_by_clause  := GROUP BY sql_expr ( ',' sql_expr )*
having_clause    := HAVING sql_expr
order_by_clause  := ORDER BY order_item ( ',' order_item )*
order_item       := sql_expr [ ASC | DESC ]
limit_clause     := LIMIT sql_expr [ OFFSET sql_expr ]
```
`sql_expr` is ADR-0031's `SQL_OR_EXPR`, extended additively per
§5 and §6. `column_name_list` is `identifier ( ',' identifier )*`.
The named `static` Nodes exported by the new
`src/dsl/grammar/sql_select.rs` are `SQL_SELECT_STATEMENT`
(matching the full statement) and `SQL_SELECT_COMPOUND` (the
embedded form, omitting the outer `WITH`; this is what subqueries
recurse into; see §6, §9).
Notes on specific productions:
- **`FROM` stays optional.** Phase 1's autonomous decision §4.1
is upheld: `SELECT 1` and `SELECT upper('x')` continue to
parse. With JOINs landing, the absence of a `FROM` simply
means no `from_clause`/`join_clause` was matched; no extra
shape is needed.
- **Bare-alias projection (`select a x`) is admitted.** Phase 1's
autonomous decision §4.2 deliberately rejected it as
structurally ambiguous. With Phase 2's grammar — `FROM` is
the only word that can legitimately follow a projection list,
and it is a keyword in the walker's expected-set — the
ambiguity dissolves: an identifier following the last
projection expression that is not `FROM`, `,`, `WHERE`,
`GROUP`, `ORDER`, `LIMIT`, or a set-op keyword is a bare
alias, and is so admitted. This lifts a small but visible
Phase-1 limitation.
- **`SELECT [ DISTINCT | ALL ]`.** `ALL` is the default and is
admitted for symmetry; `DISTINCT` is the meaningful case. They
are mutually exclusive at this position (a `Choice`, not two
`Optional`s).
- **`identifier '.' '*'`** lives only in `projection_item`, never
in `sql_expr`. This is intentional: `t.*` is *projection
syntax*, not an expression, and admitting it as an expression
primary would let it appear in `WHERE` / `ORDER BY` / etc.
where the engine would reject it and the engine-neutral error
would be hard to phrase. The grammar simply refuses it
structurally outside projection.
- **`UNION ALL` is a single set-op,** not `UNION` followed by an
`ALL` modifier on the next leg. `set_op` is a `Choice` of the
four atoms (with `UNION` and `UNION ALL` as separate branches);
factoring `UNION [ ALL ]` is also valid but the explicit four
branches keep the matched-path classes cleaner for
highlighting.
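The bare-alias rule in the notes above can be sketched as a follow-set check. This is a minimal illustration, not the walker's mechanism: `PROJECTION_FOLLOW` and `is_bare_alias` are hypothetical names, the follow-set is copied from the prose, and case-insensitive keyword matching is an assumption.

```rust
/// Tokens that, per the bare-alias note, end a projection expression
/// instead of naming an alias. Copied from the prose; hypothetical constant.
const PROJECTION_FOLLOW: [&str; 9] = [
    "FROM", ",", "WHERE", "GROUP", "ORDER", "LIMIT",
    "UNION", "INTERSECT", "EXCEPT",
];

/// Returns true when `token`, seen right after a projection expression,
/// reads as a bare alias. Assumes an explicit `AS` was already handled by
/// the aliased branch before this check is reached.
fn is_bare_alias(token: &str) -> bool {
    // Identifier shape: a letter or underscore, then letters/digits/underscores.
    let is_identifier = token
        .chars()
        .next()
        .map_or(false, |c| c.is_ascii_alphabetic() || c == '_')
        && token.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    // Keyword comparison is case-insensitive (an assumption of this sketch).
    is_identifier
        && !PROJECTION_FOLLOW
            .iter()
            .any(|kw| kw.eq_ignore_ascii_case(token))
}
```

Under these assumptions, `select a x from t` treats `x` as an alias, while `select a from t` treats `from` as the start of the next clause.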
### 2. JOIN flavours admitted
The grammar admits exactly the flavours the user picked:
- `INNER JOIN` / bare `JOIN`
- `LEFT [ OUTER ] JOIN`
- `RIGHT [ OUTER ] JOIN`
- `FULL [ OUTER ] JOIN`
- `CROSS JOIN`
The first four take a mandatory `ON sql_expr`; `CROSS JOIN`
takes none. `OUTER` is the optional explicit modifier on
`LEFT` / `RIGHT` / `FULL`.
**Explicitly out (§11):** `NATURAL JOIN`, `JOIN … USING (col)`,
and comma-list `FROM t1, t2` (the legacy implicit cross join).
The first two add grammar weight for limited teaching value;
comma-FROM teaches habits we do not want to encourage —
`CROSS JOIN` covers the same shape explicitly.
JOIN chains are admitted as a flat `( join_clause )*`. Standard
SQL is left-associative; since the grammar builds no AST and the
engine receives the source text verbatim (ADR-0030 §4), the
engine resolves the associativity. The grammar's job ends at "the
chain parses".
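As a rough illustration of the admitted flavours, the sketch below classifies the leading keywords of a `join_clause` and reports whether an `ON` must follow. `JoinKind`, `requires_on`, and `classify_join` are hypothetical names; the real grammar expresses this as `Choice` branches in the tree, not as a function over token slices.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
    Cross,
}

impl JoinKind {
    /// Every flavour except CROSS JOIN takes a mandatory `ON sql_expr`.
    fn requires_on(self) -> bool {
        self != JoinKind::Cross
    }
}

/// Classifies the keyword prefix of a join clause and returns the flavour
/// plus the number of keyword tokens consumed. NATURAL JOIN and other
/// unlisted prefixes return None.
fn classify_join(tokens: &[&str]) -> Option<(JoinKind, usize)> {
    // Upper-case at most three tokens so matching is case-insensitive.
    let upper: Vec<String> = tokens
        .iter()
        .take(3)
        .map(|t| t.to_ascii_uppercase())
        .collect();
    let words: Vec<&str> = upper.iter().map(String::as_str).collect();
    match words.as_slice() {
        ["JOIN", ..] => Some((JoinKind::Inner, 1)),
        ["INNER", "JOIN", ..] => Some((JoinKind::Inner, 2)),
        ["CROSS", "JOIN", ..] => Some((JoinKind::Cross, 2)),
        ["LEFT", "JOIN", ..] => Some((JoinKind::Left, 2)),
        ["LEFT", "OUTER", "JOIN", ..] => Some((JoinKind::Left, 3)),
        ["RIGHT", "JOIN", ..] => Some((JoinKind::Right, 2)),
        ["RIGHT", "OUTER", "JOIN", ..] => Some((JoinKind::Right, 3)),
        ["FULL", "JOIN", ..] => Some((JoinKind::Full, 2)),
        ["FULL", "OUTER", "JOIN", ..] => Some((JoinKind::Full, 3)),
        _ => None,
    }
}
```

With this sketch, `classify_join(&["left", "outer", "join", "t"])` yields the `Left` flavour with three keywords consumed, and a `natural join` prefix is not recognised at all.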
### 3. Set operators and compound queries
`UNION`, `UNION ALL`, `INTERSECT`, `EXCEPT` all admitted —
ADR-0030 §3's full set.
The compound shape (§1) is `select_core (set_op select_core)*`,
flat. Standard SQL gives `INTERSECT` higher precedence than
`UNION` / `EXCEPT`; the engine resolves this — the grammar admits
the chain as written. This mirrors §2's JOIN-chain decision.
A user who wants explicit grouping writes
`(SELECT … INTERSECT SELECT …) UNION SELECT …`, which falls out
of the subquery-`primary` branch (§6) — though for a top-level
statement this requires an extra `SELECT` wrapping. In practice
the engine's precedence is what learners encounter; calling it
out in the `help sql` page (ADR-0030 Phase 6) is sufficient.
`ORDER BY` / `LIMIT` on a compound apply to the whole compound,
not to a leg — fixed by the position of `order_by_clause` and
`limit_clause` in §1's `compound_select`.
### 4. CTEs (`WITH` and `WITH RECURSIVE`)
The full `with_clause` per §1. Both forms admitted: non-recursive
`WITH` for naming intermediate results, and `WITH RECURSIVE` for
recursive queries (tree traversals, transitive closure,
generated sequences).
The `cte_def` body is a parenthesised `compound_select`, so the
recursion is into `SQL_SELECT_COMPOUND` via `Subgrammar` — the
same recursion mechanism subqueries use (§9).
**CTE-name collisions.** A CTE name shares the table-name
namespace at the engine. Standard SQL: the CTE shadows a
same-named base table within the statement. The grammar is
agnostic — both are identifiers in a table-source slot — so the
shadowing falls out of engine resolution. The
`reject_internal_table` validator still rejects any `__rdbms_*`
identifier in any table-source slot, **including** CTE-name
slots and the `FROM`s inside CTE bodies. That is the right
posture: the reserved namespace is reserved everywhere.
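A minimal sketch of the reserved-namespace check follows. The validator name `reject_internal_table` comes from this ADR, but the signature, the error wording, and the case-insensitive comparison are assumptions made for illustration.

```rust
/// Prefix of the reserved internal namespace (from ADR-0030 §6).
const RESERVED_PREFIX: &str = "__rdbms_";

/// Rejects any identifier in the reserved namespace, regardless of which
/// table-source slot it appears in (FROM, JOIN, or a CTE name).
/// Signature and message are hypothetical.
fn reject_internal_table(ident: &str) -> Result<(), String> {
    // Case-insensitive comparison is an assumption of this sketch.
    if ident.to_ascii_lowercase().starts_with(RESERVED_PREFIX) {
        Err(format!(
            "`{}` is in the reserved `{}*` namespace",
            ident, RESERVED_PREFIX
        ))
    } else {
        Ok(())
    }
}
```

Because the check looks only at the identifier, the same function can serve every table-source slot, including CTE names.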
Recursive CTEs use the standard `cte_name AS ( base_case UNION
[ALL] recursive_case )` shape — already admitted by §1's
`compound_select` body. No grammar branch specific to recursion
is needed; the `RECURSIVE` keyword is a hint to the engine, not
a grammar gate.
### 5. Qualified column references
Additive extension to `sql_expr.rs` (ADR-0031 §7 OOS-2).
`name_or_call`'s identifier prefix gains a `Choice` tail:
```
name_or_call := identifier
                ( '.' identifier
                | '(' call_args? ')'
                )?
```
The leading identifier is matched once (preserving ADR-0031 §1's
factoring — no `Choice` branch begins with an identifier). The
optional tail is *either* a qualified-reference suffix
(`. identifier`) *or* a function-call argument list (`( … )`),
not both. A bare identifier with no tail remains a plain column
reference.
A function call with a qualified name — `schema.f(…)` — is not in
scope (we have no schemas) and is structurally inadmissible by
construction: there is no production that admits both a `.`-tail
and a `(`-tail.
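The factoring above (one leading identifier, then at most one tail) can be sketched over a pre-tokenised slice. `NameOrCall`, `classify_name_or_call`, and `is_identifier` are hypothetical names for illustration; the actual grammar is a `Choice` tail in `sql_expr.rs`, and argument parsing is omitted here.

```rust
#[derive(Debug, PartialEq, Eq)]
enum NameOrCall<'a> {
    Column(&'a str),
    Qualified { source: &'a str, column: &'a str },
    Call { name: &'a str },
}

/// Identifier shape: a letter or underscore, then letters/digits/underscores.
fn is_identifier(tok: &str) -> bool {
    let mut chars = tok.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Matches the leading identifier once, then branches on the optional tail.
fn classify_name_or_call<'a>(tokens: &[&'a str]) -> Option<NameOrCall<'a>> {
    let (&name, rest) = tokens.split_first()?;
    if !is_identifier(name) {
        return None;
    }
    match rest {
        // Qualified reference: `. identifier`.
        [".", column, ..] if is_identifier(column) => Some(NameOrCall::Qualified {
            source: name,
            column: *column,
        }),
        // A dot not followed by an identifier (e.g. `t.*`) is not an expression.
        [".", ..] => None,
        // Function call: `( ... )`; the argument list is not parsed here.
        ["(", ..] => Some(NameOrCall::Call { name }),
        // No tail: a plain column reference.
        _ => Some(NameOrCall::Column(name)),
    }
}
```

Note that `t.*` is refused here, matching §1's decision to keep the qualified wildcard in `projection_item` only.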
Completion for the qualified form: when the cursor is past
`identifier '.'`, the completion source is "columns of the table
or alias named by the leading identifier", resolved from the
active `SchemaCache` (the same source the DSL completion uses,
ADR-0030 §8). This is a small extension to the existing
`IdentSource::Columns` machinery — when in scope, column
completion is scoped to the named source.
### 6. Subquery expressions
Additive extensions to `sql_expr.rs` (ADR-0031 §7 OOS-1):
- **Scalar subquery as `primary`.** A `Choice` branch
`'(' compound_select ')'`. The existing `'(' or_expr ')'`
branch handles parenthesised expressions. Both start with
`'('`, so per ADR-0031 §1's factoring principle, the `'('` is
matched once and the inside is a `Choice` between
`compound_select` and `or_expr`. The first inside token
disambiguates: `SELECT` or `WITH` → subquery; anything else →
expression. The two `Choice` branches have non-overlapping
first-token sets, so the walker's expected-set at the
ambiguity point merges naturally without `Optional`-first
hazards.
- **`IN ( subquery )`.** The existing `predicate_tail`'s
`IN '(' additive (',' additive)* ')'` branch gains a sibling
`IN '(' compound_select ')'`. Same `'('` factoring as the
scalar case: after `'('`, branch on `SELECT`/`WITH` (subquery)
vs additive-first-token (literal list). `NOT IN` follows from
the existing `[ NOT ]` factoring on the predicate tail.
- **`[ NOT ] EXISTS ( subquery )`.** Added as a `primary`
`Choice` branch:
```
primary := … | EXISTS '(' compound_select ')'
```
The bare `EXISTS` form lives in `primary`; `NOT EXISTS` falls
out of the existing `not_expr := NOT not_expr` tier above
`primary` in the precedence ladder. This is structurally
cleaner than putting `[ NOT ] EXISTS` inside `primary`: there
is only one place `NOT` is admitted, and it composes uniformly.
All three branches recurse through `Subgrammar(&SQL_SELECT_COMPOUND)`.
Correlated subqueries fall out for free — a subquery's
`sql_expr` reaches identifiers, which the engine resolves
against outer scopes. The grammar imposes no correlation
constraint; correlation is engine-side semantics.
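The first-token disambiguation after a shared `'('` can be pictured with a small sketch. `ParenBody` and `classify_paren_body` are hypothetical names; in the grammar itself this is a `Choice` whose branches have disjoint first-token sets, not an explicit function.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ParenBody {
    /// The inside is a compound select (scalar subquery or IN-subquery).
    Subquery,
    /// The inside is an ordinary expression or a literal list.
    Expression,
}

/// Decides which branch applies from the first token after `(`.
/// Per §6, `SELECT` or `WITH` selects the subquery branch; anything else
/// falls through to the expression branch.
fn classify_paren_body(first_inside: &str) -> ParenBody {
    if first_inside.eq_ignore_ascii_case("SELECT")
        || first_inside.eq_ignore_ascii_case("WITH")
    {
        ParenBody::Subquery
    } else {
        ParenBody::Expression
    }
}
```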
### 7. `GROUP BY` and `HAVING`
`GROUP BY` takes a comma-separated list of `sql_expr`s.
Standard SQL admits any expression as a grouping key (not just
bare columns) — e.g. `GROUP BY date(created_at)`. The grammar
admits this without special-casing.
`HAVING` is a single `sql_expr`. Its semantics is "boolean over
grouped rows"; the grammar does not enforce that — the
expression's typing is the engine's concern.
**Aggregate correctness is not grammar-checked.** Whether a
projection's non-aggregated columns are valid given the
`GROUP BY` keys is a semantic question. ADR-0030 §9 settled this:
the grammar admits structurally, the engine rejects semantically,
and the friendly-error layer renders engine-neutral wording
(ADR-0019). A learner who writes `SELECT Name, COUNT(*) FROM t`
sees an engine-neutral "Name must appear in a GROUP BY clause or
be wrapped in an aggregate function"-style message, not a raw
engine string and not a parse error. This is the project's
honest limitation (ADR-0030 §7) and remains so.
### 8. `LIMIT` / `OFFSET` and `ORDER BY` extras
`LIMIT n [ OFFSET m ]` — the standard form. Both `n` and `m` are
`sql_expr`s (in practice integer literals, but the grammar
admits the general form so e.g. `LIMIT max(10, x) OFFSET 0` is
structurally accepted; the engine constrains values).
The MySQL/SQLite legacy comma form `LIMIT m, n` is **out** (§11).
Its argument order (offset first, then count) inverts the
keyword form — a needless source of confusion.
`ORDER BY` already admits `sql_expr` items with optional
`ASC` / `DESC` (Phase 1). With Phase 2:
- **Column-position references** (`ORDER BY 1, 3 DESC`) fall out
for free — an integer literal is a valid `sql_expr`, and the
engine interprets a bare positive integer in `ORDER BY` as a
column position. The grammar does not distinguish the case;
rendering interprets the position. Document in `help sql`.
- **Qualified refs** in `ORDER BY` (e.g. `ORDER BY t.c`) fall
out of §5 — the grammar uses the same `sql_expr` body.
### 9. Recursion, the depth budget, and the walker
`SELECT` recurses into itself at four points:
- A subquery `primary` in `sql_expr` (§6).
- An `IN ( subquery )` predicate tail (§6).
- An `EXISTS ( subquery )` primary (§6).
- A CTE body (§4).
Every recursion is wired through
`Node::Subgrammar(&SQL_SELECT_COMPOUND)` — the named `static` Node
exported by `sql_select.rs`. The recursion is token-guarded in
every case: a subquery `primary` is preceded by `'('`; an
`IN ( subquery )` by `IN (`; an `EXISTS ( subquery )` by
`EXISTS (`; a CTE body by `AS (`. There is no left recursion;
the walker always makes progress.
`MAX_SUBGRAMMAR_DEPTH = 64` (ADR-0026, reused by ADR-0031) is
**shared**: DSL `Expr` recursion, SQL expression recursion, and
SQL `SELECT` recursion all increment the same
`WalkContext::subgrammar_depth`. A worst-case learner query
might be `SELECT … WHERE id IN (SELECT … WHERE id IN (SELECT …))`
with each inner select carrying a few-deep expression — well
below the cap. The cap remains purely a stack-overflow guard;
**this ADR does not raise it**. If pathological-but-realistic
learner queries reach 64 in practice, a focused ADR lifts it
with measurements. Speculative raising would weaken the guard
without evidence.
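The shared budget can be sketched as one counter that every recursion kind increments. The constant and the `subgrammar_depth` field are named in this ADR; the method names, the error text, and the exact boundary behaviour at 64 are assumptions.

```rust
/// Shared recursion budget (ADR-0026).
const MAX_SUBGRAMMAR_DEPTH: usize = 64;

#[derive(Default)]
struct WalkContext {
    /// Incremented by DSL `Expr`, SQL expression, and SQL `SELECT` recursion alike.
    subgrammar_depth: usize,
}

impl WalkContext {
    /// Called on entry to any subgrammar reference. Refuses to descend past
    /// the cap so the walker fails with a friendly error, not a stack overflow.
    fn enter_subgrammar(&mut self) -> Result<(), String> {
        if self.subgrammar_depth >= MAX_SUBGRAMMAR_DEPTH {
            return Err(format!(
                "query is nested more than {} levels deep",
                MAX_SUBGRAMMAR_DEPTH
            ));
        }
        self.subgrammar_depth += 1;
        Ok(())
    }

    /// Called on exit from a subgrammar reference.
    fn exit_subgrammar(&mut self) {
        self.subgrammar_depth = self.subgrammar_depth.saturating_sub(1);
    }
}
```

Because the counter is shared, a subquery nested inside a deep expression and an expression nested inside a deep subquery draw on the same budget.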
**No new walker capability is introduced.** `Subgrammar`, the
depth counter, the cap, and the friendly depth-exceeded error
all carry over from ADR-0026 unchanged — the same posture
ADR-0031 took. This is a non-trivial property: Phase 2 is the
biggest single grammar slice in the project, and it lands
without changing the walker's contract.
### 10. Completion scope and the `WalkContext` extension
ADR-0030 §8 promises that "ambient assistance comes for free"
because SQL is grammar in the unified tree. For Phase 1's
single-table `SELECT` this was substantially true: the existing
`WalkContext::current_table` mechanism (populated via the
`writes_table: true` flag on the `FROM` table-name slot) gave
`WHERE` and `ORDER BY` column-name completion against the right
table at no incremental cost.
Phase 2 breaks the "free" claim. Multiple `FROM` tables via
`JOIN`s, aliases, CTE-defined table sources, subqueries with their
own `FROM` scope, qualified `t.c` references, projection aliases
referenced in `ORDER BY` — every Phase-2 surface needs **scope
information that `WalkContext` does not currently carry**. §9's
"no new walker capability" claim holds for grammar recursion
(`Subgrammar` and the depth cap suffice); for completion scope it
is too strong, and is softened here to an honest split.
The current `WalkContext` carries one table at a time
(`current_table: Option<String>` + `current_table_columns`), set
by `writes_table: true` on a `Tables` identifier. DSL paths
(`update T`, `delete from T`, `insert into T`) rely on this
single-table contract and continue to work unchanged. Phase 2
adds layered accumulators alongside, not in place of.
#### 10.1. The from-scope accumulator
A new `WalkContext` field:
```
from_scope: Vec<TableBinding>

TableBinding { table: String, alias: Option<String>,
               columns: Vec<TableColumn> }
```
Populated incrementally as the walker descends through
`from_clause` and each `join_clause` (§1). The first table-source
slot pushes a binding; every subsequent `JOIN` pushes another.
`Ident` slots whose `IdentSource` is `Columns` now resolve against
the union of every binding's columns, with deduplication.
`current_table` / `current_table_columns` remain as derived
helpers: when `from_scope.len() == 1`, they expose that single
binding's data, preserving the contract every existing DSL path
relies on. DSL `UPDATE` / `DELETE` / `INSERT` continue to push
exactly one binding via the existing `writes_table: true`
mechanism, unchanged.
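A minimal sketch of the accumulator and its derived single-table helper, under simplifying assumptions: `TableColumn` is reduced to a name, and `push_binding`, `column_candidates`, and `current_table` are hypothetical method names standing in for the real `WalkContext` API.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct TableColumn {
    name: String,
}

#[derive(Debug, Clone)]
struct TableBinding {
    table: String,
    alias: Option<String>,
    columns: Vec<TableColumn>,
}

#[derive(Default)]
struct WalkContext {
    from_scope: Vec<TableBinding>,
}

impl WalkContext {
    /// Called once for the FROM table-source and once per JOIN.
    fn push_binding(&mut self, table: &str, alias: Option<&str>, columns: &[&str]) {
        self.from_scope.push(TableBinding {
            table: table.to_string(),
            alias: alias.map(str::to_string),
            columns: columns
                .iter()
                .map(|c| TableColumn { name: c.to_string() })
                .collect(),
        });
    }

    /// Union of every binding's column names, deduplicated, in first-seen order.
    fn column_candidates(&self) -> Vec<String> {
        let mut seen: Vec<String> = Vec::new();
        for binding in &self.from_scope {
            for col in &binding.columns {
                if !seen.contains(&col.name) {
                    seen.push(col.name.clone());
                }
            }
        }
        seen
    }

    /// Derived helper: exposes the table only when exactly one binding exists.
    fn current_table(&self) -> Option<&str> {
        match self.from_scope.as_slice() {
            [only] => Some(only.table.as_str()),
            _ => None,
        }
    }
}
```

With one binding the helper behaves like the Phase-1 `current_table`; after a second `JOIN` binding it yields `None` and callers fall back to the unioned candidates.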
#### 10.2. Scope-stack discipline at `Subgrammar` boundaries
Subqueries (§6) and CTE bodies (§4) introduce new lexical scopes.
A column reference inside `SELECT … WHERE id IN (SELECT id FROM
u)` resolves first against the inner `SELECT`'s `FROM` (`u`), and
— for correlation — also against the outer scope.
`subgrammar_depth` is a counter; it suffices for §9's depth cap
but not for scope.
Phase 2 layers a stack on top. A new field:
```
from_scope_stack: Vec<ScopeFrame>

ScopeFrame {
    from_scope:         Vec<TableBinding>,
    cte_bindings:       Vec<CteBinding>,
    projection_aliases: Vec<String>,
}
```
The new walker node variant — `Node::ScopedSubgrammar(&Node)` —
is what triggers a scope push. It is a sibling of the existing
`Node::Subgrammar(&Node)`, with the same recursion semantics
(reference-following, depth-counted) and one additional driver
behaviour: on entry, push the current `ScopeFrame` onto
`from_scope_stack` and start a fresh empty frame; on exit, pop
back. The existing `Node::Subgrammar` variant is unchanged — DSL
`Expr` recursion (ADR-0026) and the `sql_expr.rs` precedence-
ladder recursion (ADR-0031) keep using it and never push a scope.
The grammar source spells the choice explicitly at each call
site: subqueries in `sql_expr.rs` and CTE bodies in
`sql_select.rs` reference the compound-SELECT through
`Node::ScopedSubgrammar(&SQL_SELECT_COMPOUND)`; predicate-ladder
recursion in `sql_expr.rs` continues to use
`Node::Subgrammar(&SQL_OR_EXPR)`. Self-documenting, no flag
bookkeeping, and the walker change is localised to one extra arm
in the driver's `match` over `Node` variants.
Column-completion candidates inside a scope frame are the union
of the current frame's `from_scope` columns and the columns of
every outer frame; the outer-frame columns are what make
correlated references completable. Ordering or visual
differentiation between current-frame and outer-frame candidates
is completion-tier polish and is not specified by this ADR: the
current completion API (`candidates_at_cursor*`) returns a flat
`Vec`, and adding a priority dimension is a separate concern.
CTE bindings resolve the same way (outward-walking) — a CTE
defined in an outer query is visible inside an inner subquery as
a table source, unless the inner subquery defines a CTE of the
same name and shadows it.
This is the one explicit walker-capability extension Phase 2
makes. It is scoped: one new node variant, no new walker entry
point, no change to how Subgrammar bodies are entered
structurally. The depth cap (§9) applies to both variants
uniformly through the shared `subgrammar_depth` counter.
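The push/pop discipline and the outward-walking lookup can be sketched as follows. `ScopeFrame` is reduced to a flat column list, and `current`, `enter_scoped_subgrammar`, `exit_scoped_subgrammar`, and `column_candidates` are hypothetical names; only `from_scope_stack` is taken from the text.

```rust
#[derive(Debug, Default)]
struct ScopeFrame {
    /// Simplified stand-in for the frame's `from_scope` columns.
    columns: Vec<String>,
}

#[derive(Debug, Default)]
struct WalkContext {
    current: ScopeFrame,
    from_scope_stack: Vec<ScopeFrame>,
}

impl WalkContext {
    /// On entry to a scoped subgrammar: park the current frame, start fresh.
    fn enter_scoped_subgrammar(&mut self) {
        let outer = std::mem::take(&mut self.current);
        self.from_scope_stack.push(outer);
    }

    /// On exit: discard the inner frame and restore the parked one.
    fn exit_scoped_subgrammar(&mut self) {
        if let Some(outer) = self.from_scope_stack.pop() {
            self.current = outer;
        }
    }

    /// Current frame first, then outer frames from innermost to outermost,
    /// so correlated references remain completable. Deduplicated.
    fn column_candidates(&self) -> Vec<&str> {
        let mut out: Vec<&str> = Vec::new();
        let frames = std::iter::once(&self.current).chain(self.from_scope_stack.iter().rev());
        for frame in frames {
            for col in &frame.columns {
                if !out.contains(&col.as_str()) {
                    out.push(col.as_str());
                }
            }
        }
        out
    }
}
```

The flat return value mirrors the text's point that ordering between current-frame and outer-frame candidates is left unspecified; here the current frame simply comes first.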
#### 10.3. CTE bindings
A frame-local accumulator carries CTE definitions visible in the
current scope:
```
cte_bindings: Vec<CteBinding>

CteBinding {
    name:    String,
    columns: Vec<CteColumn>,
}

CteColumn {
    name:  Option<String>,  // None for unnamed computed projections
    type_: Option<Type>,    // resolved playground type, if derivable
}
```
A CTE definition `cte_name [(col-list)] AS (compound_select)`
produces a binding in two stages:
1. **Pre-body push** (so `WITH RECURSIVE` self-references resolve).
When the walker reaches `AS` and is about to enter the body's
`Node::ScopedSubgrammar(&SQL_SELECT_COMPOUND)`, it pushes a
placeholder binding into the *outer* frame's `cte_bindings`
with `columns = []` (an empty stand-in). The CTE name is now
visible as a table source from inside the body.
2. **Body-finalised harvest** (when the body's scope frame
completes). On `ScopedSubgrammar` exit, before popping the
frame, the driver derives the body's projection-list output
columns (rules below) and rewrites the placeholder binding in
the outer frame.
**Output-column derivation rules.** Walking the body's
projection items:
| Projection item | Derived CTE column(s) |
|---------------------------------------|----------------------------------------------------------------------------------------|
| `*` | Every column from the body frame's `from_scope`, in order, with their resolved types |
| `t.*` (qualified wildcard) | Every column from binding `t` in the body frame's `from_scope`, with their types |
| `col` (bare ref, resolves uniquely) | One column: name = `col`, type = the resolved column's playground type |
| `t.col` (qualified ref) | One column: name = `col`, type = `t`'s column's type |
| `expr AS alias` or bare `expr alias` | One column: name = `alias`, type = the underlying type if `expr` is a single column ref, else `None` |
| `expr` (computed, no alias) | One column: name = `None`, type = `None` — engine assigns an implementation-defined name |
For compound bodies (`UNION` / `INTERSECT` / `EXCEPT`) the columns
come from the **first leg** per standard SQL. For recursive CTE
bodies (`WITH RECURSIVE`) the same rule — the non-recursive leg
dictates.
If a `(col-list)` was supplied on the CTE name, it **renames** the
derived columns positionally and overrides their names; types are
preserved from the derivation. If the column-count of `(col-list)`
disagrees with the body's projection arity, the grammar admits
this and the engine surfaces the mismatch — `do_run_select`'s
engine-neutral error layer carries the message (ADR-0030 §9,
ADR-0019).
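The derivation table and the `(col-list)` renaming can be sketched for a single leg. Everything here is a simplification: types are plain strings, a binding is identified by one name, and `ProjectionItem`, `typed`, and `derive_cte_columns` are hypothetical names, not the driver's API.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct CteColumn {
    name: Option<String>,
    type_: Option<String>, // playground type, modelled as a string here
}

struct TableBinding {
    name: String,                   // table name or alias
    columns: Vec<(String, String)>, // (column name, type)
}

enum ProjectionItem<'a> {
    Star,
    QualifiedStar(&'a str),
    Column { source: Option<&'a str>, name: &'a str, alias: Option<&'a str> },
    Expr { alias: Option<&'a str> },
}

fn typed(col: &(String, String)) -> CteColumn {
    CteColumn { name: Some(col.0.clone()), type_: Some(col.1.clone()) }
}

fn derive_cte_columns(
    items: &[ProjectionItem<'_>],
    from_scope: &[TableBinding],
    col_list: Option<&[&str]>,
) -> Vec<CteColumn> {
    let mut out = Vec::new();
    for item in items {
        match item {
            ProjectionItem::Star => {
                out.extend(from_scope.iter().flat_map(|b| b.columns.iter().map(typed)));
            }
            ProjectionItem::QualifiedStar(t) => {
                let matching = from_scope.iter().filter(|b| b.name == *t);
                out.extend(matching.flat_map(|b| b.columns.iter().map(typed)));
            }
            ProjectionItem::Column { source, name, alias } => {
                // The type is kept only when the reference resolves uniquely.
                let mut hits = from_scope
                    .iter()
                    .filter(|b| source.map_or(true, |s| b.name == s))
                    .flat_map(|b| b.columns.iter())
                    .filter(|col| col.0 == *name);
                let type_ = match (hits.next(), hits.next()) {
                    (Some(col), None) => Some(col.1.clone()),
                    _ => None,
                };
                out.push(CteColumn { name: Some(alias.unwrap_or(*name).to_string()), type_ });
            }
            ProjectionItem::Expr { alias } => {
                // Computed expression: named only if aliased, never typed.
                out.push(CteColumn { name: alias.map(str::to_string), type_: None });
            }
        }
    }
    // A supplied (col-list) renames positionally; types are preserved.
    if let Some(names) = col_list {
        for (col, new_name) in out.iter_mut().zip(names) {
            col.name = Some(new_name.to_string());
        }
    }
    out
}
```

An arity mismatch between `col_list` and the projection is deliberately not reported here, in line with the text leaving that to the engine.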
**Completion past `cte_alias.|`.** Where the derivation produced
named columns (every form above except computed-no-alias), they
complete with their names and (where typed) participate in §11's
result-type resolution if the CTE's columns are projected
upstream. Where the derivation produced an unnamed column slot,
that slot is silently skipped from the qualified-prefix candidate
list — the user typing `cte.|` past it sees only the nameable
columns. The cure for "I want my expression to be referenceable
from outside the CTE" is to add an alias, which is the same cure
the engine itself enforces at execution time.
This is substantially better than the earlier "honest limitation"
posture: the common `SELECT *` body is fully resolvable; explicit
projections are resolvable; only un-aliased computed columns
elude us, and the right learner response there is the same as
the engine's right learner response — write an alias.
`cte_bindings` lives on the scope frame, so a CTE defined in an
outer query is visible inside an inner subquery as a table source
unless that subquery defines a CTE of the same name (which
shadows it, per standard SQL).
#### 10.4. Projection-alias bindings
Standard SQL admits `ORDER BY` referencing a SELECT-list alias:
`SELECT a + b AS total FROM t ORDER BY total`. A third frame-local
accumulator:
```
projection_aliases: Vec<String>
```
Each `projection_item`'s optional alias (whether `AS x` or bare
`x` — see §1) appends its name. `Ident` slots inside the trailing
`ORDER BY`'s `sql_expr`s offer projection aliases as additional
candidates alongside column names. This addresses §1's bare-alias
admission's completion behaviour at the same time.
The accumulator is not consulted inside `WHERE`, `GROUP BY`, or
`HAVING` — standard SQL forbids alias references there
(aliases are not yet bound at evaluation time). The grammar
admits them structurally regardless; the engine rejects; ADR-0019
renders the engine-neutral error.
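The clause-dependent candidate rule can be sketched in a few lines. `Clause` and `ident_candidates` are hypothetical names; the sketch only shows that projection aliases join the candidate set in `ORDER BY` and nowhere else.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Clause {
    Where,
    GroupBy,
    Having,
    OrderBy,
}

/// Column names are always offered; projection aliases are added only
/// when completing inside ORDER BY. The result is deduplicated.
fn ident_candidates(
    clause: Clause,
    columns: &[&str],
    projection_aliases: &[&str],
) -> Vec<String> {
    let aliases: &[&str] = if clause == Clause::OrderBy {
        projection_aliases
    } else {
        &[]
    };
    let mut out: Vec<String> = Vec::new();
    for name in columns.iter().chain(aliases.iter()) {
        if !out.iter().any(|seen| seen.as_str() == *name) {
            out.push((*name).to_string());
        }
    }
    out
}
```

For `SELECT a + b AS total FROM t ORDER BY |`, this sketch would offer `a`, `b`, and `total`; in `WHERE` it would offer only `a` and `b`.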
#### 10.5. Qualified-prefix completion
§5 fixed the grammar for `t.c` references. The completion
behaviour at qualified positions:
- At an `Ident` cursor with **no prefix**, candidates are the
union of every `from_scope` binding's columns, plus
`projection_aliases` when in `ORDER BY`, deduplicated. CTE-name
candidates apply only in table-source slots, not column slots.
- At an `Ident` cursor immediately after `prefix '.'`, candidates
are **scoped**: resolve `prefix` against the active `from_scope`
(preferring alias matches over table matches, since aliases
shadow), and offer that binding's columns alone. If `prefix`
doesn't resolve to a binding, the candidate list is empty — the
walker's expected-set still surfaces the syntactic alternatives
(the user sees no column candidates but the structural error
message reports the unresolved prefix).
The qualified-prefix narrowing is a small extension to the
existing `IdentSource::Columns` handling: when the matched-path
immediately preceding the `Ident` ends with `Ident '.'`, the
completer is told the prefix and narrows accordingly. This is the
only completion-source-level change; the rest is data flowing
through the new accumulators.
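The alias-first prefix resolution described above can be sketched as a lookup over the active bindings. `columns_for_prefix` is a hypothetical helper, `TableBinding` is simplified to string columns, and exact (case-sensitive) name matching is an assumption.

```rust
struct TableBinding {
    table: String,
    alias: Option<String>,
    columns: Vec<String>,
}

/// Resolves `prefix` against the active bindings, preferring an alias
/// match over a table-name match, and returns that binding's columns.
/// An unresolved prefix yields an empty candidate list.
fn columns_for_prefix<'a>(from_scope: &'a [TableBinding], prefix: &str) -> &'a [String] {
    let by_alias = from_scope
        .iter()
        .find(|b| b.alias.as_deref() == Some(prefix));
    let by_table = || from_scope.iter().find(|b| b.table == prefix);
    match by_alias.or_else(by_table) {
        Some(binding) => binding.columns.as_slice(),
        None => &[],
    }
}
```

Trying aliases across all bindings before any table name is what lets an alias shadow a same-named table that appears elsewhere in the `FROM`.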
#### 10.6. The projection-before-FROM problem
Standard SQL writes projection **before** `FROM`. A user typing
`select col1, col2 from mytable` produces, mid-typing, a state
where the projection list has been parsed but the `FROM` has not.
At that point the column-name completer cannot scope to
`mytable` — it does not know `mytable` is coming. Validation and
highlighting face the same problem: `col1` and `col2` cannot be
checked as belonging to `mytable` until the user types `from
mytable`. The debounced re-walk on every keystroke (ADR-0027) is
**not** sufficient on its own to fix this in a single-pass walker,
because by the time the FROM is parsed, the projection
identifiers have already been resolved (left-to-right) against
the only scope information available at that moment — the empty
`from_scope`.
There is no fully satisfying single-pass answer. Phase 2's
posture is therefore explicit:
1. **During-typing completion** of projection-list column names,
when `from_scope` is empty (no `FROM` yet), uses the unioned
`SchemaCache.columns` — every column known to the schema —
as the candidate set. This is the same global fallback Phase 1
uses and remains the right behaviour: a noisier-but-useful
completion is better than no completion.
2. **A post-walk fixup pass** re-evaluates projection-list column
refs against the *final* `from_scope` after the walk
completes. The walk records each projection `Ident`'s
span and matched-path location; once the walk reaches end-of-
input (or end-of-statement), the fixup walks the recorded
list, looks up each identifier against the final `from_scope`,
and:
- **Rewrites the highlight class** on that terminal —
downgrading "column" → "unknown identifier" when the
identifier doesn't belong to any in-scope binding,
upgrading "unknown identifier" → "column" when it does.
- **Updates the diagnostic** for the validity indicator
(ADR-0027) — a column-not-found ERROR either appears or
disappears based on the post-walk scope.
**Integration point.** The fixup runs as the **final stage of
the walk itself**, after all grammar nodes have been processed
but before `WalkResult` is returned to the caller. It mutates
the walker's accumulated highlight runs and diagnostics vector
in place, so the consumer (the renderer, the validity
indicator) sees a single coherent snapshot. This keeps the
walker the single source of truth for what reaches the
renderer — the fixup is conceptually part of "what the walker
produces", not a separate post-processing layer. The same
convention applies to the §11.6 SQL-expression predicate
warnings, which also run as a final walk stage.
3. The fixup runs on every debounced re-walk (ADR-0027 already
triggers the full walk per keystroke). While the user types
`select col1, col2 from mytable`, `col1` and `col2` initially
highlight as generic identifiers (with a soft warning if they
are not found anywhere in the schema). Once `mytable` is typed,
the highlight snaps to the column class if `col1` / `col2`
belong to `mytable`, or to the unknown-identifier diagnostic if
they don't, all within one debounce cycle.
The fixup pass does not re-parse; it only re-resolves
identifiers against the final `from_scope`.
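A sketch of the fixup as this section specifies it: recorded projection identifiers are re-resolved against the final scope, and their highlight class and diagnostics are updated in place. All type and function names are hypothetical, and the shipped implementation may use a different mechanism to reach the same visible behaviour.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HighlightClass {
    Column,
    UnknownIdentifier,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Diagnostic {
    span: (usize, usize),
    message: String,
}

struct ProjectionIdent {
    name: String,
    span: (usize, usize),
    class: HighlightClass,
}

struct WalkResult {
    projection_idents: Vec<ProjectionIdent>,
    diagnostics: Vec<Diagnostic>,
}

/// Final walk stage: no re-parse, only re-resolution of recorded idents
/// against the columns of the final `from_scope`.
fn fixup_projection(result: &mut WalkResult, final_scope_columns: &[&str]) {
    for ident in &mut result.projection_idents {
        let known = final_scope_columns
            .iter()
            .any(|c| *c == ident.name.as_str());
        let span = ident.span;
        // Drop any diagnostic previously attached to this span
        // (simplification: a real pass would filter by diagnostic kind).
        result.diagnostics.retain(|d| d.span != span);
        if known {
            ident.class = HighlightClass::Column;
        } else {
            ident.class = HighlightClass::UnknownIdentifier;
            result.diagnostics.push(Diagnostic {
                span,
                message: format!("unknown column `{}`", ident.name),
            });
        }
    }
}
```

Running the function again with a different final scope both adds and removes diagnostics, which is the "appears or disappears" behaviour described for the validity indicator.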
`ORDER BY` alias resolution needs no fixup. Projection precedes
`ORDER BY` in walk order, so `projection_aliases` is fully
populated by the time the walker reaches an `ORDER BY` `Ident`;
the alias-as-column-candidate is resolved in the single forward
pass.
This is the answer to the user's "I think this may be automatic"
intuition: the debounced re-walk is automatic; the
post-walk fixup pass is the new infrastructure that makes the
re-walk produce *correct* results. Without it, projection-list
column refs would forever validate against the global column set
even after the `FROM` is typed.
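A minimal sketch of the re-resolve step described above, under stated assumptions: the type and function names (`PendingIdent`, `fixup_projection`, the two-variant `HighlightClass`) are hypothetical, `from_scope` is modelled as a plain map from binding name to column names, and the empty-scope early return is an assumption about projection-without-`FROM`, not something this section specifies.

```rust
use std::collections::HashMap;

/// Highlight classes relevant to the fixup (hypothetical, illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HighlightClass {
    Column,
    UnknownIdentifier,
}

/// A projection-list identifier recorded during the walk.
#[derive(Debug, Clone)]
pub struct PendingIdent {
    pub name: String,
    pub span: (usize, usize), // byte range in the source text
}

/// A diagnostic: catalog key plus the span it applies to.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub key: &'static str,
    pub span: (usize, usize),
}

/// Re-resolve every recorded identifier against the final scope.
/// `from_scope` maps binding name -> column names (a simplification).
pub fn fixup_projection(
    pending: &[PendingIdent],
    from_scope: &HashMap<String, Vec<String>>,
) -> (Vec<((usize, usize), HighlightClass)>, Vec<Diagnostic>) {
    let mut runs = Vec::new();
    let mut diags = Vec::new();
    // Assumption: with no FROM bindings there is nothing to resolve
    // against, so the first-pass output is left untouched.
    if from_scope.is_empty() {
        return (runs, diags);
    }
    for ident in pending {
        let found = from_scope
            .values()
            .any(|cols| cols.iter().any(|c| c == &ident.name));
        if found {
            runs.push((ident.span, HighlightClass::Column));
        } else {
            runs.push((ident.span, HighlightClass::UnknownIdentifier));
            diags.push(Diagnostic {
                key: "diagnostic.unknown_column",
                span: ident.span,
            });
        }
    }
    (runs, diags)
}
```

The function only looks names up; it never re-parses, which matches the constraint stated for the fixup pass.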
#### 10.7. The honest split
§9 still holds for **grammar recursion**: `Subgrammar` and the
depth cap are reused unchanged. For **completion scope**, this
section introduces:
- New `WalkContext` fields: `from_scope`, `from_scope_stack`,
`cte_bindings`, `projection_aliases`.
- Scope push/pop discipline at `SQL_SELECT_COMPOUND` `Subgrammar`
boundaries — driven by a marker on the Subgrammar target so DSL
Subgrammars are unaffected.
- A qualified-prefix narrowing in the `IdentSource::Columns`
completion path.
- A post-walk fixup pass for projection-list identifier
highlighting and validity (§10.6).
These are real walker-contract extensions. They are scoped: no
new node kinds, no new walk-driver entry points, no changes to
how Subgrammar bodies are entered structurally. The existing DSL
paths are unaffected — their grammars never push a SELECT scope,
never define a CTE, never carry projection aliases — and the
single-table `current_table` / `current_table_columns` view is
preserved as a derived helper.
§9's claim is therefore restated honestly: **grammar recursion
needs no new walker capability; completion scope needs the
additions above.**
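The scope push/pop discipline can be pictured with a small stack type. This is an illustrative sketch only: `Binding`, `ScopeStack`, and the method names are hypothetical, and the real `WalkContext` fields may be shaped differently. The one behaviour it tries to show is that a lookup searches the innermost frame first and then outer frames, which is what lets a correlated subquery see the enclosing query's bindings.

```rust
/// One table binding visible inside a SELECT frame (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Binding {
    pub name: String, // the alias if present, otherwise the table name
    pub columns: Vec<String>,
}

/// Per-frame accumulators, mirroring the field names listed above.
#[derive(Debug, Default)]
pub struct ScopeFrame {
    pub from_scope: Vec<Binding>,
    pub cte_bindings: Vec<Binding>,
    pub projection_aliases: Vec<String>,
}

#[derive(Debug, Default)]
pub struct ScopeStack {
    frames: Vec<ScopeFrame>,
}

impl ScopeStack {
    /// Entering a scoped SELECT body pushes a fresh, empty frame.
    pub fn push(&mut self) {
        self.frames.push(ScopeFrame::default());
    }

    /// Leaving the body pops it; the caller may harvest the frame.
    pub fn pop(&mut self) -> Option<ScopeFrame> {
        self.frames.pop()
    }

    pub fn top_mut(&mut self) -> Option<&mut ScopeFrame> {
        self.frames.last_mut()
    }

    pub fn depth(&self) -> usize {
        self.frames.len()
    }

    /// Innermost frame first, then outer frames (correlated references).
    pub fn resolve_binding(&self, name: &str) -> Option<&Binding> {
        self.frames
            .iter()
            .rev()
            .flat_map(|frame| frame.from_scope.iter())
            .find(|b| b.name == name)
    }
}
```

Because DSL grammars never push a frame in this model, their single-table view is simply whatever sits in the one bottom frame.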
### 11. Diagnostics for Phase-2 validation cases
ADR-0027 fixes the warning-vs-error guideline verbatim:
> **ERROR** — the input is *known* to fail. Either it does not
> parse (incomplete, or a mismatched / invalid token), or it
> parses but names something that does not exist (an unknown
> table or column).
>
> **WARNING** — the input is valid and *will* run, but is very
> likely not what a knowledgeable user wants: a type-mismatched
> comparison, or `= NULL` (both from ADR-0026 §7). Amendment 1
> adds a third trigger — `LIKE` against a numeric column.
>
> The split is *certainty of failure* versus *likely misleading*.
This section walks the Phase-2 surface case-by-case, classifies
each against that guideline, and identifies the diagnostic
machinery additions needed. It also flags a Phase-1 carry-over
gap (§11.6) that Phase 2 closes.
#### 11.1. Existing diagnostics, briefly
Two post-walk passes today (`src/dsl/walker/mod.rs`):
- **Schema-existence pass** (ERROR). Walks the `MatchedPath`,
checks every `IdentSource::Tables` / `IdentSource::Columns`
ident against `SchemaCache`. Emits `diagnostic.unknown_table`
and `diagnostic.unknown_column`. Today this assumes a single
`current_table` for column resolution.
- **Expression predicate-warnings pass** (WARNING). Walks the
parsed DSL `Expr` AST emitted by `expr.rs`'s builder.
Emits `diagnostic.eq_null`, `diagnostic.type_mismatch`,
`diagnostic.like_numeric`. Runs only on WHERE expressions in
the DSL.
Phase 2 extends both, and §11.6 fills a SQL-side gap.
#### 11.2. Phase-2 new ERROR cases
Every case below is "known to fail on the engine" — the engine
would surface a message the friendly-error layer would translate
(ADR-0019). Surfacing them as pre-flight ERROR diagnostics gives
the learner the answer one debounce cycle faster, with the
walker as the single source of truth.
- **Unknown table in any `FROM`/`JOIN` slot.** The existing
schema-existence pass extends from "the one
`current_table`" to walking every `from_scope` binding's
`table` and emitting `diagnostic.unknown_table` per
unresolved name. CTE-name slots in the active
`cte_bindings` are valid table sources and exempt from
this check.
- **Unknown CTE-as-table.** A table-source slot whose name is
not in `SchemaCache.tables` *and* not in the active
`cte_bindings` chain emits `diagnostic.unknown_table` (same
catalog key — from the learner's perspective the engine
message is the same; the slot is a "table that doesn't
exist", whether they meant a CTE or a base table).
- **Unknown table or alias in a qualified column reference**
(`t.c` where `t` doesn't resolve in the active
`from_scope`). New catalog key
`diagnostic.unknown_qualifier` `{qualifier}`.
- **Unknown column in a qualified reference** (`t.c` where `t`
resolves but `c` is not a column of that binding). Reuses
`diagnostic.unknown_column` with the column name in context.
- **Ambiguous unqualified column reference** — a column name
used unqualified that exists in two or more `from_scope`
bindings. The engine raises "ambiguous column name"; we
surface it as ERROR with a new catalog key
`diagnostic.ambiguous_column` `{column}, {qualifiers}` so
the learner sees which tables the name appeared in.
- **Reference to a projection alias in `WHERE` / `GROUP BY` /
`HAVING`.** Standard SQL forbids it (aliases are not bound
at evaluation time). The grammar admits the identifier
structurally; a new diagnostic pass emits ERROR with a new
catalog key `diagnostic.projection_alias_misplaced`
`{alias}, {clause}`.
- **CTE column-list arity mismatch.** When `cte_name (col1,
col2, …) AS (compound_select)` declares N columns and the
body's projection (§10.3) derives M columns with N ≠ M, the
CTE harvest pass (§10.3 stage 2) emits ERROR with a new
catalog key `diagnostic.cte_arity_mismatch` `{cte},
{declared}, {actual}`.
- **Compound-query column-count mismatch.** When a `UNION` /
`INTERSECT` / `EXCEPT` chain has legs whose projection
arities differ, the engine errors at execution. Phase 2
catches it pre-flight: each leg's derived arity (the same
derivation the CTE harvest uses) is compared as the
compound is assembled. ERROR with a new catalog key
`diagnostic.compound_arity_mismatch` `{op}, {left_n},
{right_n}`.
- **Internal-table reference in any new table-source slot.**
Already a parse-time rejection via
`reject_internal_table` (§1, §4) — surfaces as a parse
error, not a post-walk diagnostic. Listed here for
completeness: the catalog key `select.internal_table`
authored in Phase 1 covers every Phase-2 slot too.
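The qualified-reference and ambiguity cases in the list above reduce to one classification over the active bindings. The sketch below is an assumption-laden illustration, not the walker's code: `Binding`, `ColumnCheck`, and `check_column` are hypothetical names, and name comparison is exact-match for simplicity.

```rust
/// One `from_scope` binding (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Binding {
    pub name: String, // alias or table name
    pub columns: Vec<String>,
}

/// Outcome of checking one column reference. The comments name the
/// catalog key each outcome would map to.
#[derive(Debug, PartialEq, Eq)]
pub enum ColumnCheck {
    Ok,
    UnknownQualifier,       // diagnostic.unknown_qualifier
    UnknownColumn,          // diagnostic.unknown_column
    Ambiguous(Vec<String>), // diagnostic.ambiguous_column, with qualifiers
}

pub fn check_column(scope: &[Binding], qualifier: Option<&str>, column: &str) -> ColumnCheck {
    match qualifier {
        // `t.c`: the qualifier must resolve, then the column must belong to it.
        Some(q) => match scope.iter().find(|b| b.name == q) {
            None => ColumnCheck::UnknownQualifier,
            Some(b) if b.columns.iter().any(|c| c == column) => ColumnCheck::Ok,
            Some(_) => ColumnCheck::UnknownColumn,
        },
        // Bare `c`: count how many bindings own a column of that name.
        None => {
            let owners: Vec<String> = scope
                .iter()
                .filter(|b| b.columns.iter().any(|c| c == column))
                .map(|b| b.name.clone())
                .collect();
            match owners.len() {
                0 => ColumnCheck::UnknownColumn,
                1 => ColumnCheck::Ok,
                _ => ColumnCheck::Ambiguous(owners),
            }
        }
    }
}
```

For example, with bindings `a(id, name)` and `b(id)`, a bare `id` classifies as ambiguous across `a` and `b`, while `a.id` resolves cleanly.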
#### 11.3. Phase-2 new WARNING cases
The existing WARNING set (`= NULL`, type-mismatched
comparison, `LIKE`-on-numeric) is the right set. Phase-2 adds
**no new WARNING categories** — every Phase-2-specific case
falls into ERROR (§11.2) or engine-rejected (§11.4).
Considered and rejected as WARNINGs:
- **CTE name shadowing a base table.** Standard SQL behaviour;
often intentional (the canonical "filter to a subset, then
query as if it were the base table" pattern). No diagnostic.
- **Correlated reference without explicit qualification.**
Correlation is implicit in standard SQL; per the user
guideline a knowledgeable user does want this. The walker
validates the reference silently against the outer-frame
scope; no warning, no diagnostic.
- **Unused CTE.** A CTE defined in `WITH` but never referenced.
The engine ignores it; many learners write CTEs as
intermediate scratch space. Not a warning.
#### 11.4. Engine-rejected (no diagnostic)
These fail on the engine and surface via ADR-0019's
friendly-error layer at execution time. The walker does not
attempt pre-flight detection because:
- **Non-aggregated columns in projection with `GROUP BY`** —
detecting requires knowing which function names are
aggregates; ADR-0030 §13 OOS-3 / ADR-0031 §6 keep us
allowlist-free.
- **Aggregate function in `WHERE`** — same reason.
- **Scalar subquery returning multiple rows** — semantic, not
syntactic; requires execution.
- **Recursive CTE without a `UNION`** — requires inspection of
the body's compound shape against the recursive contract;
doable in principle, deferred as engine territory.
- **Duplicate CTE names within the same `WITH`** — checkable
in principle (walking `cte_bindings` for duplicates), but
the engine catches it cleanly. Phase 2 does not pre-flight
it; could be added later if its absence proves confusing.
- **Type-mismatched JOIN ON predicates** — the existing
expression type-mismatch warning (extended per §11.6)
handles the explicit-literal case; arbitrary-expression
cases require type inference and stay engine-side.
#### 11.5. Catalog additions
Phase 2 adds the following message-catalog keys (ADR-0019).
Every key is engine-neutral by construction.
Parse-time-detectable (post-walk diagnostic passes):
| Key | Slots |
|----------------------------------------|--------------------------------------------------|
| `diagnostic.unknown_qualifier` | `{qualifier}` |
| `diagnostic.ambiguous_column` | `{column}, {qualifiers}` |
| `diagnostic.projection_alias_misplaced`| `{alias}, {clause}` |
| `diagnostic.cte_arity_mismatch` | `{cte}, {declared}, {actual}` |
| `diagnostic.compound_arity_mismatch` | `{op}, {left_n}, {right_n}` |
Engine-error translations (friendly-error layer; reached on
execution failure):
| Key | Engine cause |
|----------------------------------------|--------------------------------------------------|
| `engine.no_such_table` | `no such table: <name>` (post-execution path) |
| `engine.no_such_column` | `no such column: <name>` (post-execution path) |
| `engine.ambiguous_column` | `ambiguous column name: <name>` |
| `engine.aggregate_misuse` | `misuse of aggregate function <name>()` |
| `engine.group_by_required` | `column must appear in the GROUP BY clause or be used in an aggregate function` (or equivalent) |
| `engine.compound_arity_mismatch` | `SELECTs to the left and right of UNION do not have the same number of result columns` (or equivalent) |
| `engine.scalar_subquery_too_many_rows` | scalar subquery cardinality violation |
| `engine.recursive_cte_malformed` | recursive CTE shape errors |
The parse-time keys and the engine keys are intentionally
separate even when they describe the same situation
(`engine.ambiguous_column` mirrors
`diagnostic.ambiguous_column`) — the parse-time message can
include the learner's typed text and span; the engine-time
message catches what the parser missed and routes through the
friendly-error layer with whatever context the engine yielded.
Three pre-existing parse-time keys are reused unchanged for
Phase-2 slots: `diagnostic.unknown_table`,
`diagnostic.unknown_column`, and the Phase-1
`select.internal_table`.
#### 11.6. The Phase-1 SQL-expression predicate-warning gap
ADR-0027 Amendment 1's `LIKE`-on-numeric warning, and ADR-0026
§7's `= NULL` and type-mismatch warnings, are emitted by a pass
that walks the **DSL** `Expr` AST. Phase 1's `sql_expr.rs`
deliberately builds **no AST** (ADR-0031 §2). The consequence
is a Phase-1 carry-over gap: **SQL `WHERE` expressions today
emit none of these warnings** — `select * from t where name
like 5` parses, the engine runs it, and the learner gets the
engine's verdict, not the friendly pre-flight nudge ADR-0027
Amendment 1 promised.
Phase 2 closes this. The predicate-warnings pass gains a
**MatchedPath-walking variant** that runs over the SQL
expression nodes and identifies the predicate shapes
structurally (a `LIKE` predicate-tail with a column-ref left
operand; a `=`/`!=` predicate-tail with a `NULL` literal
operand; a comparison predicate-tail with a column-literal
operand pair of mismatched types). It does not need an `Expr`
AST because the matched-path terminals carry both the byte spans
(for the diagnostic) and the node-name labels (for shape
identification). The same catalog keys (`diagnostic.eq_null`,
`diagnostic.type_mismatch`, `diagnostic.like_numeric`) apply
unchanged; only the pass implementation is new.
The MatchedPath-walking pass runs over **every** Phase-2
`sql_expr` slot — `WHERE`, `HAVING`, `ON`, `CASE` branches,
projection items, `ORDER BY` items — so warnings surface
uniformly across the SQL surface rather than just `WHERE`. This
is a strict improvement over Phase 1's behaviour, where even
Phase-1 SELECT WHERE expressions got no predicate warnings.
Type-resolution for the MatchedPath-walking pass: a column ref's
type comes from §10's `from_scope` (or, for `t.c`, the specific
binding); a literal's type comes from its lexical class. When
the column ref doesn't resolve (the schema-existence ERROR pass
will already have flagged it), the warning pass skips the
predicate — no point compounding diagnostics on an already-
broken reference.
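One way to picture the structural check is as a small decision function over the resolved column type, the operator, and the literal's lexical class. Everything here is a hedged sketch: `ValueClass` collapses the playground's types into a few coarse classes purely for illustration, and `Op` and `predicate_warning` are hypothetical names. Only the three catalog keys come from the text above.

```rust
/// Coarse value classes (an illustrative simplification of the real types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValueClass {
    Numeric,
    Text,
    Bool,
    Null,
}

/// Predicate-tail operators the pass would recognise by node label.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Eq,
    NotEq,
    Like,
    Lt,
    Gt,
}

/// Returns the catalog key of the warning to emit, if any.
/// `column` is `None` when the column reference did not resolve.
pub fn predicate_warning(
    column: Option<ValueClass>,
    op: Op,
    literal: ValueClass,
) -> Option<&'static str> {
    // An unresolved column was already flagged by the ERROR pass: skip.
    let column = column?;
    match (op, literal) {
        (Op::Eq | Op::NotEq, ValueClass::Null) => Some("diagnostic.eq_null"),
        (Op::Like, _) if column == ValueClass::Numeric => Some("diagnostic.like_numeric"),
        (Op::Like, _) => None,
        (_, lit) if lit != ValueClass::Null && lit != column => {
            Some("diagnostic.type_mismatch")
        }
        _ => None,
    }
}
```

The early return on an unresolved column is what keeps the warning pass from stacking a second diagnostic on a reference the schema-existence pass has already rejected.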
#### 11.7. Mechanism summary
Three diagnostic passes by end of Phase 2, all running as final
stages of the walk (per §10.6's integration-point convention):
1. **Schema-existence ERROR pass** — extended from single
`current_table` to walking every `from_scope` binding and
the active `cte_bindings`. Adds the qualified-reference
and ambiguity checks (§11.2).
2. **Arity-check ERROR pass** (new) — runs at CTE-body and
compound-query frame-exits (the same `ScopedSubgrammar`
exit hook §10.3 uses), comparing declared vs derived
column counts.
3. **Predicate-warnings pass** — extended with a
MatchedPath-walking variant for `sql_expr` (§11.6) covering
`= NULL`, type mismatch, and `LIKE`-on-numeric across every
SQL expression slot, in addition to the existing DSL `Expr`
AST variant for DSL expressions.
Per the integration-point convention (§10.6), each pass
mutates the walker's accumulated highlight runs and diagnostics
in place; the consumer sees a single coherent snapshot.
The projection-list fixup of §10.6 is conceptually part of pass
(1) — it is the same "re-resolve identifier against final
scope" operation, applied to the small subset of identifiers
whose scope wasn't fully known at first-pass walk time.
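The arity-check pass (2) is the simplest of the three to sketch. The functions below are hypothetical illustrations that take already-derived column counts as input; deriving the count of a `SELECT *` leg needs the schema and is deliberately left out. Each returns the mismatching pair that would fill the catalog slots.

```rust
/// CTE check: `declared` is the length of the optional `(col1, col2, ...)`
/// list, `derived` the arity of the body's projection.
/// Returns `(declared, actual)` for `diagnostic.cte_arity_mismatch`.
pub fn check_cte_arity(declared: Option<usize>, derived: usize) -> Option<(usize, usize)> {
    match declared {
        Some(n) if n != derived => Some((n, derived)),
        _ => None, // no column list, or the counts agree
    }
}

/// Compound check: compares each leg with the one before it and reports
/// the first disagreement as `(left_n, right_n)` for
/// `diagnostic.compound_arity_mismatch`.
pub fn check_compound_arity(leg_arities: &[usize]) -> Option<(usize, usize)> {
    leg_arities
        .windows(2)
        .find(|pair| pair[0] != pair[1])
        .map(|pair| (pair[0], pair[1]))
}
```

Comparing adjacent legs is one reading of "compared as the compound is assembled"; comparing every leg against the first would report the same queries as mismatched, only with different slot values in some cases.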
### 12. Result-column type resolution
Phase 1's all-`None` `column_types` vector is partially lifted: where a
projection item is structurally a single column reference, the
worker resolves it back to the source column's playground type
(ADR-0005) and populates that slot in `DataResult.column_types`.
Everything else stays `None`.
This addresses Phase-1 autonomous decision §4.5 (bool SELECT
results render as `0`/`1`): a bare `bool` column now renders as
`true` / `false` again, alignment recovers, and the `show data`
rendering path is reached for the common case.
**Resolution rule.** A projection item is "structurally a single
column reference" when, after stripping an optional `[ AS ]
alias`, its expression is one of:
- An unqualified identifier (`Name`) that resolves uniquely to a
single column across the FROM tables;
- A qualified reference (`t.c` / `alias.c`) that resolves
unambiguously through the FROM aliases.
Anything else — function calls, arithmetic, `CASE`, literals,
subquery expressions, the `*` and `t.*` wildcards — keeps
`column_types[i] = None`. When resolution is ambiguous
(unqualified column name appears in two FROM tables) the
grammar admits it (engine resolves or errors); the type-resolver
returns `None` and the renderer falls back to neutral alignment.
**Implementation seam.** The strongly preferred mechanism is
**engine-side column-origin lookup**: after preparing the
statement, query the prepared statement for each result column's
underlying table and column. The engine knows authoritatively
which result columns are direct references and which are
expressions; for direct references it returns the source
table+column, for expressions it returns nothing. This avoids
re-parsing the source or adding structured projection-item data
to the `MatchedPath` — the grammar tier is not involved at all,
which preserves ADR-0031 §2's "no AST" decision and stays on the
right side of ADR-0030's "one source of truth" rule.
The Phase-2 implementer verifies that the rusqlite version
pinned in `Cargo.toml` exposes this metadata (the SQLite C API
calls are `sqlite3_column_table_name` /
`sqlite3_column_origin_name` — they have been stable for two
decades; rusqlite either exposes them directly or via the
underlying `*mut sqlite3_stmt` handle). If exposure turns out
to be awkward, the fallback is a small post-parse walk over the
projection-item subtrees in the `MatchedPath` — strictly worse
because it duplicates a slice of parsing, but available.
The resolution pass adds one method on `Database` (something
like `resolve_select_column_types`) called from `do_run_select`
before the `DataResult` is shipped. It takes the prepared
statement and the active `SchemaCache`, and returns
`Vec<Option<Type>>`. The renderer needs no change — `None`
slots already render as typeless.
This is the only execution-path change Phase 2 makes; everything
else routes through Phase 1's grammar-as-text execution.
### 13. Out of scope
- **OOS-1. Derived tables in `FROM`** — `FROM (SELECT …) [AS]
alias`. The same shapes are reachable via CTEs (§4), which
Phase 2 ships. Derived tables in `FROM` are not authored here.
- **OOS-2. `NATURAL JOIN` and `JOIN … USING (col)`.** Both are
convenience forms. NATURAL is widely considered a footgun;
USING is cleaner but adds grammar weight without lifting any
expressive ceiling. Out.
- **OOS-3. Comma-list `FROM t1, t2` (implicit cross join).** Out.
`CROSS JOIN` covers the same shape explicitly.
- **OOS-4. `LIMIT m, n` (the legacy comma form).** Out (§8).
- **OOS-5. Window functions** (`OVER (…)`, `PARTITION BY`,
window-frame syntax). A meaningful learning topic, but a large
surface of its own and out of ADR-0030's commissioned set.
- **OOS-6. `LATERAL` joins.** Not commissioned by ADR-0030.
- **OOS-7. `VALUES (…)` as a row source.** Not commissioned.
- **OOS-8. A function/aggregate allowlist** — ADR-0030 §13
OOS-3 / ADR-0031 §7 OOS-4 still apply: aggregate names parse
generically through `name_or_call`.
- **OOS-9. Quoted identifiers** (`"column name"`). Tracked as
ADR-0031 §7 OOS-3, still tracked.
- **OOS-10. Engine-checked aggregate correctness at parse
time.** The grammar admits structurally; engine rejects
semantically; ADR-0019 surfaces the engine's verdict in
engine-neutral wording (§7).
- **OOS-11. Result-column type resolution beyond bare column
refs.** Computed columns (`a + b`, `upper(name)`, `CASE …`)
stay typeless (§12).
- **OOS-12. The `help sql` page and parse-error usage entries**
for the Phase-2 surface. The grammar carries the `help_id`s
authored in this phase, but the page content and the rich
per-command usage messages are Phase 6 (ADR-0030 §10) and
ADR-0021. Phase 2 leaves the same `help_id: None` shape Phase
1 used for `select`.
## Consequences
- A new grammar file, `src/dsl/grammar/sql_select.rs`, parallel
to `sql_expr.rs`, exporting `pub static SQL_SELECT_STATEMENT:
Node` and `pub static SQL_SELECT_COMPOUND: Node`. The Phase-1
`data::SELECT` `CommandNode` is rebuilt against
`SQL_SELECT_STATEMENT` (its body becomes a `Subgrammar`
reference); the `CommandNode` itself stays.
- **Phase-1 SQL `SELECT` grammar nodes migrate.** The Phase-1
static nodes that live in `src/dsl/grammar/data.rs` for the
single-table SELECT (the projection, FROM, WHERE, ORDER-BY,
LIMIT sub-trees) move into `sql_select.rs` as the
starting point for the §1 productions; `data.rs` keeps only
the `CommandNode` shell. The seven Phase-1 SQL `SELECT`
integration tests are part of the safety net for this
migration — they must continue to pass under the rebuilt
grammar, in addition to the new Phase-2 integration tests
authored in step 4 of the implementation notes.
- **Hint-panel prose** for the new clauses (JOIN flavours, ON,
GROUP BY, HAVING, UNION / INTERSECT / EXCEPT, WITH, OFFSET, the
qualified-prefix and CTE-prefix completion states) is
authored at the structural level alongside each grammar node
in step 1 — a one-liner per slot, enough to drive the hint
panel. Richer per-clause teaching prose and the `help sql`
reference page remain ADR-0030 Phase 6 work (§13 OOS-12).
- **Walker cost is expected to stay proportional to source
length.** The new accumulators are `O(bindings + aliases)`
per frame; the scope stack is bounded by `MAX_SUBGRAMMAR_DEPTH
= 64` (§9); the §10.6 post-walk fixup pass touches one entry
per projection-list `Ident` (a small set). Each debounced
keystroke (ADR-0027) walks once, fixes up once, and emits a
single coherent output. No new pathological case is
introduced — if a learner-realistic query produces a
noticeable typing-time stall, measure first and revisit the
recursion budget or the accumulator structure on evidence.
- `sql_expr.rs` gains three additive `Choice` branches and one
additive tail on `name_or_call` (§5, §6). The existing tiers
and the depth-cap discipline are unchanged. The Phase-1 tests
continue to exercise the existing branches as they stand.
- **No new walker capability** (§9). `Subgrammar`, the depth
counter, the cap, and the friendly depth error are all reused
unchanged — the same posture ADR-0031 took.
- `Command::Select { sql: String }` is unchanged. The validated
source SQL is simply larger; the worker still routes it through
`Database::run_select` and `do_run_select` (Phase 1 path).
- The worker gains a post-prepare type-resolution helper that
populates `column_types` for direct-reference projection items
(§12) via the engine's column-origin metadata. **`Cargo.toml`
adds `column_metadata` to `rusqlite`'s feature list**
(alongside the existing `bundled`); this pulls in the SQLite
`SQLITE_ENABLE_COLUMN_METADATA` compile flag and exposes
`RawStatement::column_table_name` /
`column_origin_name` / `column_database_name` on the prepared
statement. Verified against the project's pinned rusqlite
0.39.0. This is the only Phase-2 execution-path change.
- **Three diagnostic passes** (§11.7) — schema-existence
(extended), CTE/compound arity-check (new), and predicate
warnings (extended with a MatchedPath-walking variant for
`sql_expr` — §11.6). All run as final walk stages and
mutate the walker's accumulated output in place. Closes the
Phase-1 carry-over gap where SQL `WHERE` expressions emitted
no `LIKE`-on-numeric / type-mismatch / `= NULL` warnings.
- **Catalog additions** (§11.5) — five new `diagnostic.*` keys
for parse-time-detectable cases and eight new `engine.*`
keys for friendly-error layer translations of engine
messages.
- The walker's `WalkContext` gains the completion-scope
accumulators of §10 — a `from_scope_stack: Vec<ScopeFrame>`
whose top frame is the active `from_scope` / `cte_bindings` /
`projection_aliases`. A **new node variant
`Node::ScopedSubgrammar(&Node)`** (§10.2) is the trigger for push/pop;
existing `Node::Subgrammar` is unchanged so DSL `Expr` and
`sql_expr` recursion are unaffected. A post-walk fixup pass
re-resolves projection-list identifier highlighting and
validity once the final `from_scope` is known (§10.6). CTE
output columns are derived from the body's projection list at
body-frame exit, populating the binding back into the outer
frame (§10.3) — so `SELECT *` and explicit-projection CTE
bodies both yield real column completion past `cte_alias.|`.
This **softens §9's "no new walker capability" claim** for
completion scope; grammar recursion still needs nothing new.
- `__rdbms_*` rejection extends to **every** table-source slot
introduced by Phase 2: the `FROM` table, each `JOIN`'s table,
each CTE name, and the `FROM` table inside any CTE body
(§4, §6). The `reject_internal_table` validator is reused.
- Completion gains: SQL keywords for joins / set ops / `WITH` /
`GROUP` / `HAVING` / `OFFSET` (all walker-derived, no
bespoke code); column completion scoped to a qualified prefix
`t.` resolves through the active `SchemaCache` (§5).
- Phase-1 autonomous decisions §4.1 and §4.3–§4.4 stand (optional
`FROM`, `help_id: None`, walker-mode defaults). §4.2 is lifted
(bare-alias projection admitted, §1). §4.5 is partially lifted
(bare bool column refs recover their type via §12).
- `requirements.md`'s `Q1` / `Q2` advance further; `Q4` was
already ticked by ADR-0030 and ADR-0031.
## Implementation notes
A build order, each step guarded by the test suite. The phases
within Phase 2 mirror the ADR-0030 / ADR-0031 staging — grammar
first, execution-path change last.
**Detailed plan: `docs/plans/20260520-adr-0032-phase-2.md`.**
The notes below are the outline; the plan refines them into
seven sub-phases (2a–2g) with per-gate exit criteria, a
cross-cut verification matrix that explicitly tests every
"X comes for free" claim from ADR-0030/0031/0032 (the kind of
implicit claim that produced the Phase-1 SQL-expression
predicate-warning gap §11.6 closes), and a final phase-exit
verification report template. Implementers work through the
plan; the ADR remains the decisions.
1. **The `sql_select.rs` grammar fragment.** Author the
stratified tiers of §1 as named `static` `Node`s, recursion
via `Subgrammar`. Export `SQL_SELECT_STATEMENT` and
`SQL_SELECT_COMPOUND`. The existing `data::SELECT`
`CommandNode` is rebuilt against `SQL_SELECT_STATEMENT`.
2. **Unit tests** against the fragment directly (the
`expr.rs` / `sql_expr.rs` test pattern): JOIN flavours,
GROUP BY / HAVING, qualified refs, every set-op, recursive
and non-recursive CTEs, `LIMIT … OFFSET`, `DISTINCT`,
`t.*` projection, the bare-alias projection, plus the
keyword-case-insensitivity check.
3. **`sql_expr.rs` additive extensions** (§5, §6): the
qualified-ref tail on `name_or_call`; the scalar-subquery
`primary` branch; the `IN (subquery)` predicate-tail branch;
the `EXISTS (subquery)` `primary` branch. Unit tests for each.
4. **Integration tests** (the `tests/` Tier-3 path, building on
Phase 1's SQL `SELECT` tests): each JOIN flavour returns the
expected rows; GROUP BY / HAVING aggregates over real data;
`UNION` / `INTERSECT` / `EXCEPT` between two SELECTs; a
non-recursive CTE; a recursive CTE (a small tree traversal
or generated-sequence example); a scalar subquery in
`WHERE`; `IN (SELECT …)`; `EXISTS (…)`; qualified refs
resolving correctly.
5. **The `WalkContext` scope accumulators** (§10). Add the
`ScopeFrame` type (`from_scope` / `cte_bindings` /
`projection_aliases`) and the `from_scope_stack`; add the
`Node::ScopedSubgrammar(&Node)` variant alongside the
existing `Node::Subgrammar`; teach the driver to push/pop a
fresh frame on `ScopedSubgrammar` entry/exit; rewrite every
reference to `&SQL_SELECT_COMPOUND` from outside its own
definition to use the new variant (subqueries in
`sql_expr.rs`, CTE bodies in `sql_select.rs`); teach
`from_clause` / `join_clause` to populate the frame's
`from_scope`; teach `with_clause` to push placeholder CTE
bindings before the body and harvest derived output columns
on body-exit per §10.3; teach `projection_item` to append to
`projection_aliases`. Keep `current_table` /
`current_table_columns` as derived helpers (top frame's
single-binding view) so the DSL paths stay green.
6. **Qualified-prefix completion** (§10.5). When the
matched-path immediately preceding an `IdentSource::Columns`
slot ends with `Ident '.'`, narrow candidates to the named
binding's columns. Unit tests: `select t.` Tab offers
`t`'s columns; an unresolved prefix returns an empty list.
7. **Post-walk fixup pass** (§10.6). Collect projection-list
`Ident` terminals during the walk; after the walk, re-resolve
each against the final `from_scope`, rewriting the highlight
class and validity diagnostic. Tests: typing `select col1
from t` lights `col1` correctly once `t` is typed; typing
`select bogus from t` produces a column-not-found diagnostic.
8. **Diagnostic passes** (§11). Extend the schema-existence
ERROR pass to walk every `from_scope` binding plus
`cte_bindings`; add the qualified-reference and ambiguity
checks (§11.2). Add the new arity-check ERROR pass at the
CTE-body and compound-query frame-exit hooks (§11.7 pass
2). Extend the predicate-warnings pass with a
MatchedPath-walking variant covering every Phase-2
`sql_expr` slot (§11.6) — closes the Phase-1 carry-over
gap. Author the five new `diagnostic.*` catalog keys and
the eight new `engine.*` translation keys (§11.5).
Tests: one positive and one negative case per new ERROR
key; predicate warnings firing on `select * from t where
col like 5` (the Phase-1 gap closure); arity-mismatch
ERRORs on a CTE and on a `UNION`.
9. **Result-column type resolution** (§12). Add
`"column_metadata"` to rusqlite's feature list in
`Cargo.toml`. The worker's `do_run_select` calls the new
resolver — `RawStatement::column_table_name` /
`column_origin_name` per result column — before constructing
the `DataResult`. Tests: a single-column SELECT recovers the
playground type (covering each of the ten types, the
pedagogically important one being `bool` → `true` / `false`);
a SELECT with a computed projection keeps it typeless; a
SELECT through a CTE recovers the underlying column's type
if the engine's column-origin metadata follows through the
CTE (verified, not assumed).
10. **Highlighting / completion / hint** spot-checks via the
typing-surface matrix (ADR-0022 / ADR-0030 §8): a SELECT
with a JOIN highlights the JOIN keywords; Tab past
`select t.` offers columns of `t`; column completion
inside a `WHERE` after `from a join b on …` offers both
`a`'s and `b`'s columns; column completion inside a
correlated subquery sees the outer scope; the `[ERR]`
indicator fires on a malformed subquery; an out-of-subset
construct (e.g. `OVER (…)`) produces an engine-neutral
parse error.
11. **`reject_internal_table`** spot-checks against every new
table-source slot: a `FROM __rdbms_columns` parse-rejects;
a `WITH __rdbms_x AS (…)` parse-rejects; a `FROM` inside a
CTE body referencing `__rdbms_*` parse-rejects.
Later phases continue ADR-0030's plan unchanged — Phase 3 (DML),
Phase 4 (DDL), Phase 5 (DSL → SQL echo), Phase 6 (polish).
ADR-0030 §13 OOS items (window functions, `LATERAL`, function
allowlist, quoted identifiers) remain tracked separately and are
authored if and when they are taken up; they are not implicit
follow-ups of Phase 2.
## Amendment 1 — Empirical scope of column-origin metadata (2026-05-20)
§12 was written conservatively: it constrained type recovery to
projection items "structurally a single column reference" and
listed "subquery expressions" alongside arithmetic and `CASE` as
cases that stay `None`. The implementation plan's Open Question 1
(`docs/plans/20260520-adr-0032-phase-2.md`) captured the matching
uncertainty about CTEs and scalar subqueries, leaving the test in
sub-phase 2f to "assert the actual behaviour (not the wished-for
behaviour)".
A throwaway probe against the pinned bundled SQLite (run
2026-05-20, with `rusqlite` 0.39.0 + `column_metadata`) settles
the question. Across twenty representative query shapes, the
engine's `sqlite3_column_table_name` / `sqlite3_column_origin_name`
metadata follows through:
- direct bare column refs (the baseline);
- `AS alias` projections (the alias remaps the output name but
the origin pair stays the source `(table, column)`);
- table-alias qualified refs (`u.name` → `(users, name)`);
- non-recursive CTEs, including `SELECT *` bodies, bare-ref
bodies, qualified-ref bodies, and `(col-list)`-renamed
bodies (the rename remaps the output name; origin stays the
underlying column);
- CTE chains (a CTE that selects from a prior CTE — origin
traces back to the base table);
- derived tables in `FROM (SELECT …) AS sub` (out-of-scope for
Phase 2 per §13 OOS-1, but useful to note: if ever admitted,
type recovery comes for free);
- scalar subqueries used as a projection primary (`SELECT
(SELECT name FROM users WHERE id = 1)` — origin is preserved
whether the subquery has an outer alias or not);
- `UNION` / `UNION ALL` / `INTERSECT` / `EXCEPT` compound
queries (result columns carry the first leg's origin);
- multi-table `JOIN` projections (per-column origin per leg);
- `IN (SELECT …)` subqueries in `WHERE` (the inner subquery
does not affect the outer projection's origin).
The metadata returns `None` for exactly two structural classes:
- **Computed projections** — function calls, arithmetic
expressions, string concatenation, `CASE` expressions,
literals, the `*` and `t.*` wildcards. Expected; pedagogically
obvious; no surprise for the learner.
- **Recursive CTE result columns** (`WITH RECURSIVE r(n) AS
(SELECT 1 UNION ALL SELECT n + 1 FROM r WHERE n < 5) SELECT n
FROM r`). The recursion materialises through an internal
temporary table that has no base-column origin to point at.
This is the one structural surprise — a recursive-CTE result
column is typeless even when it is structurally a bare name
reference, because the engine cannot trace the column back
past the recursion.
### What §12's resolution rule becomes
The original §12 rule classifies projection items structurally
(unqualified ident / qualified ref → recover; everything else →
None). The empirical finding makes that classification redundant
and slightly wrong: it misses scalar subqueries and CTE-routed
refs that the engine does carry through, and it would have
needed extending for `(col-list)`-renamed CTEs.
The amended posture: **trust the engine's column-origin metadata
verbatim**. For each result column, call
`column_table_name(i)` / `column_origin_name(i)`. If both return
`Some`, look the pair up in the active `SchemaCache` and use the
playground type. If either is `None`, the slot stays `None` and
the renderer falls back to neutral alignment. No structural
classification of the projection item is needed; the grammar tier
stays uninvolved (preserving ADR-0031 §2's "no AST" decision and
ADR-0030's "one source of truth" rule, both as before).
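
As a minimal sketch of the engine-driven rule, assuming the
origin pairs have already been read off the prepared statement,
the lookup could look like the following. The `PlaygroundType`
subset, the `SchemaTypes` map standing in for `SchemaCache`, the
`Origin` alias, and the helper's signature are illustrative
assumptions, not the worker's actual code.

```rust
use std::collections::HashMap;

/// Hypothetical two-variant subset of the playground type vocabulary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PlaygroundType {
    Int,
    Text,
}

/// Hypothetical stand-in for the active `SchemaCache`: maps a
/// `(table, column)` origin pair to its playground type.
pub type SchemaTypes = HashMap<(String, String), PlaygroundType>;

/// One result column's origin as reported by the engine, i.e. the
/// pair `(column_table_name(i), column_origin_name(i))`.
pub type Origin = (Option<String>, Option<String>);

/// Resolve one type slot per result column. A slot is `Some` only
/// when both halves of the origin pair are present and the pair is
/// known to the schema; no structural classification is involved.
pub fn resolve_select_column_types(
    origins: &[Origin],
    schema: &SchemaTypes,
) -> Vec<Option<PlaygroundType>> {
    origins
        .iter()
        .map(|(table, column)| -> Option<PlaygroundType> {
            let table = table.as_ref()?;
            let column = column.as_ref()?;
            schema.get(&(table.clone(), column.clone())).copied()
        })
        .collect()
}
```

In this sketch a computed projection or a recursive-CTE result
column arrives as `(None, None)` and simply yields a `None` slot,
which is the case the renderer treats as neutral alignment.
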
The "structurally a single column reference" definition in §12's
**Resolution rule** is superseded by the engine-driven rule
above. The §12 **Implementation seam** is unchanged in approach
(engine-side column-origin lookup is still the mechanism), but
the speculative fallback paragraph ("If exposure turns out to be
awkward, the fallback is a small post-parse walk over the
projection-item subtrees in the `MatchedPath`") is moot — the
exposure works, and the engine's metadata is broader than a
grammar-side walk could be without re-implementing SQLite's
query-planner traceback. The fallback path is removed.
### Effect on the Phase-2 plan's sub-phase 2f
The 2f exit gate's "CTE pass-through" row should be asserted
positive (recovers `Some(text)`). The "Subquery result" row,
which the plan left as "assert whichever behaviour the engine
exhibits", should be asserted positive as well. A new explicit
2f test row covers the named limitation: a recursive CTE result
column must produce `column_types[0] = None` and the renderer
must fall back to neutral alignment without panicking.
The catalog and grammar-side work in 2a–2e is unaffected by this
amendment. Only 2f's test list and the worker's
`resolve_select_column_types` helper change shape (the helper
becomes simpler — no structural classification, just a direct
metadata lookup per result column).
This amendment narrows the honest limitation in §12 from
"computed / non-direct projection items" to "computed projections
and recursive CTE result columns" — a tighter, factually
verified carve-out.
## Amendment 2 — §10.6 fixup-pass mechanism (2026-05-20)
§10.6's prescription for the post-walk fixup is written in
terms of "rewriting the highlight class" on projection-list
`Ident` terminals — downgrading "column" → "unknown identifier"
when an ident doesn't belong to the eventual `from_scope`, or
upgrading the reverse direction once a `FROM` is typed. The
implementation chose a different mechanism that achieves the
identical user-visible effect; this Amendment records the
choice so a reader of §10.6 doesn't go looking for a literal
`per_byte_class` rewrite step that does not exist.
### Mechanism actually used
Two pieces, both already in the codebase by the end of
sub-phase 2d:
1. **Two-pass schema-existence diagnostic.** The 2d rewrite of
`schema_existence_diagnostics` (`src/dsl/walker/mod.rs`)
runs a pre-pass over the matched path that collects every
`IdentSource::Tables` / `cte_name` / `table_alias` ident
into a single binding vec, regardless of where in the path
it sits. The main pass then resolves each `sql_expr_ident`
against the **complete** binding set. A projection ident
that resolves under the eventual FROM scope produces no
diagnostic; one that doesn't produces an
`unknown_column` diagnostic on its own span.
2. **Diagnostic-overlay renderer.** `src/input_render.rs`
reads the walker's diagnostic list at every keystroke and
overlays each diagnostic's span with the appropriate
colour (Error red for unknown-column, Warning for
type-mismatch / `LIKE`-on-numeric / etc.). The overlay
sits on top of the walker's `per_byte_class` (which keeps
all idents at `HighlightClass::Identifier`).
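
The two-pass shape described in item 1 can be sketched roughly
as below. This is a simplified model, not the code in
`src/dsl/walker/mod.rs`: the `PathItem` and `Diagnostic` types,
the flat column lists, and the "no bindings means stay silent"
shortcut are assumptions made for illustration.

```rust
/// Byte span of a token in the input line.
pub type Span = (usize, usize);

/// Hypothetical, simplified view of one matched-path item.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PathItem {
    /// A table name, CTE name, or table alias, with the columns
    /// visible through it.
    Binding { name: String, columns: Vec<String> },
    /// An unqualified column identifier inside an expression.
    ExprIdent { name: String, span: Span },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    pub code: &'static str,
    pub span: Span,
}

pub fn schema_existence_diagnostics(path: &[PathItem]) -> Vec<Diagnostic> {
    // Pre-pass: collect every binding, wherever it sits in the path,
    // so a FROM that follows the projection is still seen.
    let bindings: Vec<&[String]> = path
        .iter()
        .filter_map(|item| match item {
            PathItem::Binding { columns, .. } => Some(columns.as_slice()),
            PathItem::ExprIdent { .. } => None,
        })
        .collect();

    // Assumption: with no scope at all (projection without FROM)
    // there is nothing to resolve against, so nothing is flagged.
    if bindings.is_empty() {
        return Vec::new();
    }

    // Main pass: resolve each ident against the complete binding set.
    path.iter()
        .filter_map(|item| match item {
            PathItem::ExprIdent { name, span } => {
                let known = bindings
                    .iter()
                    .any(|cols| cols.iter().any(|c| c == name));
                if known {
                    None
                } else {
                    Some(Diagnostic { code: "unknown_column", span: *span })
                }
            }
            PathItem::Binding { .. } => None,
        })
        .collect()
}
```

Because the bindings are gathered before any ident is resolved,
the order of projection and FROM in the path does not matter in
this sketch.
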
Combined, the two yield the §10.6 user-visible behaviour. After
typing `select bogus_col`, the diagnostic is emitted and the
overlay paints the ident red as soon as a FROM appears that
shows the column doesn't exist; after typing `select real_col`,
no diagnostic is emitted and the ident stays Identifier-coloured.
Both outcomes settle within one debounce cycle.
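
A rough sketch of the overlay step follows, assuming a
per-byte class vector and a list of diagnostic spans. Only
`HighlightClass::Identifier`, `per_byte_class`, and the
Error / Warning colouring come from the text above; the other
variants, `Paint`, `DiagnosticSpan`, and `overlay` are
hypothetical names, and the real `src/input_render.rs` may be
organised quite differently.

```rust
use std::ops::Range;

/// Hypothetical subset of the walker's highlight classes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HighlightClass {
    Keyword,
    Identifier,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
}

/// What the renderer finally paints for one byte.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Paint {
    Class(HighlightClass),
    Diagnostic(Severity),
}

/// Hypothetical diagnostic shape: a severity plus a byte range.
pub struct DiagnosticSpan {
    pub severity: Severity,
    pub span: Range<usize>,
}

/// Start from the walker's classes, then paint each diagnostic
/// span on top. The walker's own classes are left untouched.
pub fn overlay(
    per_byte_class: &[HighlightClass],
    diagnostics: &[DiagnosticSpan],
) -> Vec<Paint> {
    let mut painted: Vec<Paint> =
        per_byte_class.iter().copied().map(Paint::Class).collect();
    for d in diagnostics {
        // Clamp the span so a stale diagnostic cannot index out of range.
        let end = d.span.end.min(painted.len());
        let start = d.span.start.min(end);
        for slot in &mut painted[start..end] {
            *slot = Paint::Diagnostic(d.severity);
        }
    }
    painted
}
```

The point of the design is visible here: the ident keeps its
`Identifier` class throughout, and the red tint exists only in
the overlay result.
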
### Why this is equivalent
§10.6's stated goal is correctness of the end-of-walk
classification — "rewriting the highlight class" is one
implementation strategy for that goal. The HighlightClass
enum in the codebase has only one identifier slot
(`Identifier`); the Error tint comes from diagnostic overlay,
not from a separate `Column` vs `UnknownIdentifier` class.
The two-pass diagnostic is the "post-walk fixup" that
§10.6 calls for; it just runs inside the diagnostic emitter
rather than as a separate rewrite step. The integration
point (§10.6's "final stage of the walk itself") still
holds: `schema_existence_diagnostics` runs after the walk's
main work, mutating the walker's accumulated diagnostic
vector in place. Consumers see a single coherent snapshot.
### Completion mid-typing
§10.6's second user-visible promise — "during-typing
completion of projection-list column names uses the global
fallback" — is preserved as a posture, but improved at the
edges in sub-phase 2e by a look-ahead probe in
`src/completion.rs`. When the leading walk produces no
`from_scope` (the projection-before-FROM state) **and** the
full input does have a FROM after the cursor, a second walk
on the full input populates the binding set, and column
candidates narrow to that scope. The fallback to global
`SchemaCache.columns` remains the path when the full input
doesn't parse cleanly (e.g., the user deleted `*` and is
mid-edit). This is a strict improvement: the realistic
"edit an existing query" workflow now narrows correctly.
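
The decision order of the look-ahead probe can be sketched as
follows. The function name, its parameters, and the `walk_scope`
callback (which stands in for a walk that returns the scoped
column names when a `from_scope` is produced) are assumptions
for illustration; the real logic lives in `src/completion.rs`.

```rust
/// Pick column candidates for completion.
///
/// * `leading` is the input up to the cursor.
/// * `full` is the whole input line.
/// * `walk_scope` is a hypothetical callback: it returns
///   `Some(columns)` when walking the given text yields a
///   `from_scope`, and `None` otherwise (including when the text
///   does not parse cleanly).
/// * `global_columns` stands in for `SchemaCache.columns`.
pub fn column_candidates<F>(
    leading: &str,
    full: &str,
    walk_scope: F,
    global_columns: &[String],
) -> Vec<String>
where
    F: Fn(&str) -> Option<Vec<String>>,
{
    // Normal case: the text before the cursor already has a scope.
    if let Some(scope) = walk_scope(leading) {
        return scope;
    }
    // Look-ahead probe: a second walk, on the full input, is only
    // worth trying when there is text after the cursor.
    if full.len() > leading.len() {
        if let Some(scope) = walk_scope(full) {
            return scope;
        }
    }
    // Fallback: every column the schema knows about.
    global_columns.to_vec()
}
```

With the cursor after `select na` in `select na from users`, the
first walk finds no scope, the probe on the full line does, and
the candidates narrow to that table's columns.
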
### What §10.6's prescription becomes
The "rewrite the highlight class" wording is superseded by:
**the post-walk diagnostic pass re-resolves projection
idents against the complete scope and emits / withholds the
unknown-column diagnostic accordingly; the renderer's
diagnostic-overlay path achieves the visual change**. No
new `HighlightClass` variant is required.
§10.6's other prescriptions stand verbatim — the integration
point (final walk stage, in-place mutation of walker
accumulators), the per-keystroke re-walk (ADR-0027's
debounced cadence), and the ORDER BY no-fixup-needed
clarification.
## See also
- ADR-0005 — the ten-type vocabulary §10 resolves back to.
- ADR-0016 — the data-table renderer SELECT results reuse.
- ADR-0019 — the friendly-error layer engine-side rejections
route through (§7).
- ADR-0021 — per-command parse-error usage; the Phase-2 surface
inherits the framework, Phase 6 polishes per-clause messages
(§11 OOS-12).
- ADR-0022 — ambient typing assistance. §5/§6/§8 inherit its
keyword-completion / highlighting / hint mechanisms for free,
but §10 extends its `IdentSource::Columns` / `SchemaCache` /
`WalkContext` infrastructure with the scope accumulators,
qualified-prefix narrowing, and the post-walk fixup pass that
Phase 2 needs.
- ADR-0023 / ADR-0024 — the unified grammar tree Phase 2 extends.
- ADR-0026 — the `WHERE` grammar's `Subgrammar` node, depth
counter, and `MAX_SUBGRAMMAR_DEPTH = 64` cap, all reused
unchanged (§9).
- ADR-0027 — the validity indicator, free for the Phase-2
surface; §1 (ERROR/WARNING guideline) is the source quoted
  verbatim in §11; Amendment 1 (`LIKE`-on-numeric WARNING) is
  the one that closing the SQL-expression predicate-warnings gap
  of §11.6 brings to the SQL surface.
- ADR-0028 — the styled `OutputLine` mechanism the renderer
uses; not directly touched by Phase 2.
- ADR-0030 — the parent ADR; §3 commissions this phase, §4/§6
fix execution-as-text, §7 fixes engine neutrality, §11 fixes
history / replay, §13 fixes the long-running OOS list.
- ADR-0031 — the SQL expression grammar this ADR extends
additively (§5, §6); §7 named the two extensions implemented
here.
- `docs/simple-mode-limitations.md` — the DSL limits advanced
mode lifts; Phase 2 lifts the JOIN, subquery, set-op, CTE,
and grouping limits.