rdbms-playground/docs/adr/0031-sql-expression-grammar.md
claude@clouddev1 6d8c9eea36 feat: curated SQL function list — Tab completion (#15) + typing-time typo hint (#16)
Add src/dsl/sql_functions.rs (KNOWN_SQL_FUNCTIONS) as the shared source
of truth at sql_expr_ident slots:

- #15: offer the functions as Tab candidates under a new
  CandidateKind::Function + ninth Theme colour tok_function (blue,
  distinct from keyword/identifier/type).
- #16: restore the column-typo flag the #6 fix had dropped wholesale —
  invalid_ident_at_cursor now bails only when the partial prefix-matches
  a known function, else falls through to the schema-column check.

A column named like a function (e.g. `count`) is deduped (column wins).
`cast` is excluded — CAST(x AS type) is not a plain-call shape.
The no-validation-allowlist posture stands: the list drives completion +
the typo hint only, never parse-time acceptance.

Docs: ADR-0022 Amendment 6, ADR-0031 status note, README index,
requirements I3/I4 + refreshed test baseline.
2026-05-31 11:49:10 +00:00

# ADR-0031: The SQL expression grammar
## Status
Accepted
## Context
ADR-0030 made advanced mode a body of **SQL grammar inside the
unified grammar tree** (ADR-0023/0024) rather than a separate
batch parser. It deferred two large grammar slices to their own
focused ADRs (ADR-0030 §3): the **full `SELECT` grammar** and the
**SQL expression grammar**. This ADR fixes the second.
The SQL expression grammar is the fragment that fills every
expression slot in advanced-mode SQL — ADR-0030 §3 names them:
`WHERE`, `HAVING`, `CHECK`, `SELECT` projections, and `DEFAULT`.
ADR-0030 §3 describes it as "the superset of ADR-0026's `WHERE`
grammar" — adding arithmetic, function calls, `CASE`, and
(eventually) subquery expressions on top of the comparison /
`LIKE` / `IN` / `BETWEEN` / `IS NULL` predicate set that ADR-0026
already authored for the DSL.
It is the first concrete piece of ADR-0030's phased plan: ADR-0030
Phase 1 ("Foundations + first `SELECT`") opens with "Author the
core SQL **expression grammar** — the ADR-0026 superset — as its
own ADR." This is that ADR.
### What ADR-0026 already established
ADR-0026 authored a recursive `WHERE` expression for the DSL. The
machinery this ADR builds on is all in place:
- **`Node::Subgrammar(&'static Node)`** — a reference-following
node that lets a named `static` grammar fragment appear inside
its own subtree, so a recursive grammar can be expressed even
though `Seq`/`Choice` embed children by value and cannot close
a cycle.
- **A stratified grammar** — one named `static` `Node` per
precedence tier — which removes left recursion (every recursion
is guarded by a token) and encodes precedence in the layering.
- **`WalkContext::subgrammar_depth`** and
`MAX_SUBGRAMMAR_DEPTH = 64` — a stack-overflow guard that turns
pathologically nested input into a friendly error.
- **The factored `predicate_tail`** — the shared operand prefix
matched once; the infix `NOT` factored as an explicit
`NOT negatable` branch; no `Choice` branch starting with an
`Optional` (an `Optional`-first `Seq` "commits" and discards
sibling branches' expected sets).
This ADR reuses every one of those. The new grammar is larger,
but it is the same *kind* of grammar, walked by the same walker.
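The inherited machinery can be illustrated with a minimal sketch. Only `Node::Subgrammar(&'static Node)`, `Seq`, and `Choice` are named in this ADR; the `Token` variant, the payload types, the toy `group` grammar, and the `walk` function below are assumptions made for illustration, and this toy walker backtracks freely, which the real walker does not.

```rust
// Hypothetical, simplified node type. `Seq` and `Choice` hold their children
// by value (here: in a static slice), so only `Subgrammar` can close a cycle.
#[derive(Debug)]
enum Node {
    Token(&'static str),
    Seq(&'static [Node]),
    Choice(&'static [Node]),
    Subgrammar(&'static Node),
}

// Toy grammar:  group := "(" group ")" | "x"
// The recursion is guarded by the "(" token, so it is not left-recursive.
static PAREN_BRANCH: [Node; 3] = [
    Node::Token("("),
    Node::Subgrammar(&GROUP), // reference back to the named static
    Node::Token(")"),
];
static GROUP_BRANCHES: [Node; 2] = [Node::Seq(&PAREN_BRANCH), Node::Token("x")];
static GROUP: Node = Node::Choice(&GROUP_BRANCHES);

/// Returns the position after a successful match, or `None` on failure.
fn walk(node: &Node, toks: &[&str], pos: usize) -> Option<usize> {
    match node {
        Node::Token(t) => (toks.get(pos).copied() == Some(*t)).then(|| pos + 1),
        Node::Seq(children) => children.iter().try_fold(pos, |p, c| walk(c, toks, p)),
        Node::Choice(branches) => branches.iter().find_map(|b| walk(b, toks, pos)),
        Node::Subgrammar(target) => walk(target, toks, pos),
    }
}
```

Walking `( ( x ) )` against `GROUP` consumes all five tokens, while `( x` fails at the missing `)`.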
### Why this is not just "extend `expr.rs`"
The DSL's `WHERE` grammar (`src/dsl/grammar/expr.rs`) is bound by
ADR-0026's deliberate teaching limits, recorded in
`docs/simple-mode-limitations.md`: operands are a column or a
literal — *no* arithmetic, *no* string concatenation, *no* scalar
functions, *no* subqueries. Those limits are a feature of simple
mode, not an accident; the DSL `WHERE` grammar must keep them.
Advanced mode is the surface that lifts them (ADR-0030 §4). So
the SQL expression grammar cannot be the DSL grammar with a few
nodes added — it has a different operand set (a full scalar
expression, not column-or-literal) and a different relationship
to its consumers (see Decision §2). It is a parallel fragment.
Keeping it parallel also keeps simple mode's 1240-test surface
untouched: nothing in `expr.rs` changes.
## Decision
### 1. One unified expression ladder
ADR-0026's DSL grammar stratifies into a *boolean* layer
(`or`/`and`/`not`/`bool_primary`) sitting above a *predicate*
layer, because the DSL deliberately forbids a boolean
sub-expression as a comparison operand — `(a > b) = (c > d)`
cannot be written.
Standard SQL draws no such line: a boolean *is* a value, `AND` /
`OR` / `NOT` and the comparison operators are simply operators at
their own precedence tiers, and a parenthesised group is a whole
expression regardless of whether it reads as "boolean" or
"scalar". The SQL expression grammar therefore is a **single
precedence ladder**, loosest tier to tightest:
```
expr           := or_expr
or_expr        := and_expr ( OR and_expr )*
and_expr       := not_expr ( AND not_expr )*
not_expr       := NOT not_expr | predicate
predicate      := additive predicate_tail?
predicate_tail := cmp_op additive
                | [ NOT ] LIKE additive
                | [ NOT ] BETWEEN additive AND additive
                | [ NOT ] IN ( additive ( , additive )* )
                | IS [ NOT ] NULL
cmp_op         := = | <> | != | < | <= | > | >=
additive       := multiplicative ( ( + | - | || ) multiplicative )*
multiplicative := unary ( ( * | / | % ) unary )*
unary          := ( - | + ) unary | primary
primary        := literal
                | ( or_expr )
                | case_expr
                | name_or_call
name_or_call   := identifier [ '(' call_args? ')' ]
call_args      := '*' | [ DISTINCT ] or_expr ( , or_expr )*
case_expr      := CASE [ or_expr ]
                  ( WHEN or_expr THEN or_expr )+
                  [ ELSE or_expr ]
                  END
literal        := number | string | TRUE | FALSE | NULL
```
Precedence, loosest first: `OR`, `AND`, `NOT`, the comparison /
predicate tier, additive (`+ - ||`), multiplicative (`* / %`),
unary sign, primary. This is standard SQL operator precedence
restricted to the teaching-relevant operators.
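To make the layering concrete, here is a small recursive-descent sketch of part of the ladder (`OR`, `AND`, `NOT`, comparison, additive, multiplicative, unary, parenthesised primary). It is not the project's walker: the `Parser` type, the whitespace tokenisation, and the `render` helper are assumptions, and it prints a fully parenthesised string purely to show which operator binds tighter. The real fragment builds no tree at all (§2).

```rust
struct Parser<'a> {
    toks: Vec<&'a str>,
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<&'a str> {
        self.toks.get(self.pos).copied()
    }
    /// Consume the next token if it is one of `options` (case-insensitive).
    fn eat(&mut self, options: &[&str]) -> Option<&'a str> {
        let tok = self.peek()?;
        if options.iter().any(|o| o.eq_ignore_ascii_case(tok)) {
            self.pos += 1;
            Some(tok)
        } else {
            None
        }
    }
    /// One left-associative tier:  next ( op next )*
    fn tier(&mut self, ops: &[&str], next: fn(&mut Self) -> Option<String>) -> Option<String> {
        let mut lhs = next(self)?;
        while let Some(op) = self.eat(ops) {
            let rhs = next(self)?;
            lhs = format!("({lhs} {op} {rhs})");
        }
        Some(lhs)
    }
    fn or_expr(&mut self) -> Option<String> { self.tier(&["OR"], Self::and_expr) }
    fn and_expr(&mut self) -> Option<String> { self.tier(&["AND"], Self::not_expr) }
    fn not_expr(&mut self) -> Option<String> {
        if self.eat(&["NOT"]).is_some() {
            let inner = self.not_expr()?;
            Some(format!("(NOT {inner})"))
        } else {
            self.predicate()
        }
    }
    /// Only the `cmp_op additive` tail is sketched here.
    fn predicate(&mut self) -> Option<String> {
        let lhs = self.additive()?;
        match self.eat(&["=", "<>", "!=", "<", "<=", ">", ">="]) {
            Some(op) => {
                let rhs = self.additive()?;
                Some(format!("({lhs} {op} {rhs})"))
            }
            None => Some(lhs),
        }
    }
    fn additive(&mut self) -> Option<String> { self.tier(&["+", "-", "||"], Self::multiplicative) }
    fn multiplicative(&mut self) -> Option<String> { self.tier(&["*", "/", "%"], Self::unary) }
    fn unary(&mut self) -> Option<String> {
        if let Some(sign) = self.eat(&["-", "+"]) {
            let inner = self.unary()?;
            Some(format!("({sign}{inner})"))
        } else {
            self.primary()
        }
    }
    /// Simplified primary: a parenthesised group or a bare word/number.
    fn primary(&mut self) -> Option<String> {
        if self.eat(&["("]).is_some() {
            let inner = self.or_expr()?;
            self.eat(&[")"])?;
            return Some(inner);
        }
        let tok = self.peek()?;
        let is_word = tok.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
        let is_keyword = ["OR", "AND", "NOT"].iter().any(|k| k.eq_ignore_ascii_case(tok));
        if is_word && !is_keyword {
            self.pos += 1;
            Some(tok.to_string())
        } else {
            None
        }
    }
}

/// Parse whitespace-separated tokens; `None` means "not in the sketched subset".
fn render(input: &str) -> Option<String> {
    let mut p = Parser { toks: input.split_whitespace().collect(), pos: 0 };
    let out = p.or_expr()?;
    (p.pos == p.toks.len()).then(|| out)
}
```

With this sketch, `render("a + b * c")` groups the multiplication first, and `render("NOT a = b OR c")` applies `NOT` to the comparison before `OR` joins the two sides.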
Notes on specific productions:
- **`name_or_call` is factored, not a `Choice`.** A function call
(`upper(Name)`) and a column reference (`Name`) share an
identifier prefix. Splitting them into two `Choice` branches
would let the function-call branch *commit* on the identifier
and then fail at the missing `(`, discarding the column-ref
branch (the ADR-0026 "no `Optional`-first branch" hazard, in
reverse). Instead the identifier is matched once and the
`( call_args )` group is an `Optional` tail: present → a call,
absent → a column reference. The grammar need not decide which
— see §2 — it only validates that one of the two shapes holds.
- **`call_args` handles `*` and `DISTINCT`.** `count(*)` is the
one place `*` is an argument; `count(distinct col)` the one
place `DISTINCT` leads an argument list. (The projection-level
`select *` is *not* an expression — it belongs to the `SELECT`
grammar, ADR-0030 / Phase 1, not here.) The grammar admits
function calls structurally; it does not know which names are
aggregates — that distinction is the engine's, and matters
only once `GROUP BY` lands (ADR-0030 Phase 2).
- **`case_expr` covers both forms** — searched `CASE WHEN … END`
and simple `CASE <operand> WHEN … END`. Every sub-part is an
`or_expr` for uniformity (SQL allows any expression in each
slot); `END` closes it.
- **`||` is string concatenation**, standard SQL, at the additive
tier. It lifts `simple-mode-limitations.md`'s "no string
concatenation".
- **`%` is modulo.** It is not in ISO SQL (which spells it
`MOD(a, b)`), but it is near-universal across mainstream
engines and is what a learner expects. ADR-0030's "pedagogy
wins ties" admits it; `MOD` also remains reachable through the
generic `name_or_call` path.
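The `name_or_call` factoring can be sketched as follows. This is an illustration under assumptions, not the fragment itself: tokens are pre-split strings, `Shape` is a hypothetical label used only for the demonstration (the real grammar reports no such distinction), and call arguments are simplified to `*` or a comma-separated list of identifiers instead of full `or_expr` operands.

```rust
#[derive(Debug, PartialEq)]
enum Shape {
    ColumnRef,
    Call,
}

fn is_identifier(tok: &str) -> bool {
    let mut chars = tok.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// name_or_call := identifier [ '(' call_args? ')' ]
/// Returns the matched shape and the position after it.
fn name_or_call(toks: &[&str], pos: usize) -> Option<(Shape, usize)> {
    // The identifier is matched exactly once, whichever shape follows.
    if !is_identifier(toks.get(pos).copied()?) {
        return None;
    }
    let mut p = pos + 1;
    // Optional tail absent: a plain column reference.
    if toks.get(p).copied() != Some("(") {
        return Some((Shape::ColumnRef, p));
    }
    p += 1;
    if toks.get(p).copied() == Some("*") {
        p += 1; // the count(*) form
    } else if toks.get(p).copied() != Some(")") {
        // Simplified argument list: identifier ( ',' identifier )*
        loop {
            if !is_identifier(toks.get(p).copied()?) {
                return None;
            }
            p += 1;
            if toks.get(p).copied() == Some(",") {
                p += 1;
            } else {
                break;
            }
        }
    }
    // Once '(' has been seen, the closing ')' is required.
    (toks.get(p).copied() == Some(")")).then(|| (Shape::Call, p + 1))
}
```

Because the `(` tail is optional and comes after the shared identifier, `Name = 1` matches as a column reference that stops before `=`, and `upper ( Name )` matches as a call, with no branch ever discarded on the identifier alone.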
### 2. The fragment validates; it builds no AST
ADR-0026's `WHERE` grammar carries an AST-fragment builder
(`build_expr`) that folds the matched terminals into a recursive
`Expr`, because its consumers — `update` / `delete` / `show data`
— are typed `Command`s whose executor compiles that `Expr` to
parameterised SQL.
**The SQL expression grammar deliberately builds no AST.** This
follows directly from ADR-0030 §4 and §6:
- `WHERE` / `HAVING` / `SELECT` projections live inside a
`SELECT` or a DML statement, and ADR-0030 §4 executes those
"as the validated SQL itself … they change no schema, so
modelling them as a typed `Command` buys nothing." There is no
`Expr` to compile — the engine parses the SQL.
- `CHECK` and `DEFAULT` live inside advanced-mode DDL. ADR-0030
§11 stores their expressions in `project.yaml` "as SQL the user
could re-enter" — text, not a structured tree. ADR-0030 §4 is
explicit that these expressions are "**not** lowered into the
DSL's deliberately-limited `Expr`."
So no consumer of this grammar wants an `Expr`. The fragment's
entire job is the other three walker outputs:
1. **Accept or reject** — the input either is or is not a
well-formed in-subset SQL expression.
2. **The flat `MatchedPath`** of matched terminals — which is
what drives syntax highlighting, completion, the expected-set,
and the hint panel (§5).
3. **A source span.** A consumer that needs the expression *as
text* (the `SELECT` builder assembling `Command::Select`'s
SQL; a future `CHECK` builder) recovers it by slicing the
original source between the first and last matched terminal's
byte offsets. The terminals already carry `span` for
highlighting; nothing new is needed on the matched path.
This is a real simplification over ADR-0026 — no `build_expr`
analogue, no second structural pass, no expression AST type — and
it is the correct shape for a grammar whose consumers run SQL
rather than compile it. The grammar tier still owns validation,
highlighting, completion, and the no-left-recursion guarantee;
it simply has no tree to hand back.
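A minimal sketch of the span-slicing recovery described in point 3, assuming each matched terminal carries a byte-offset `span`. The `MatchedTerminal` type and the `terminals` field are hypothetical stand-ins; only `MatchedPath` and `span` are named in this ADR.

```rust
use std::ops::Range;

// Hypothetical shapes for illustration only.
struct MatchedTerminal {
    span: Range<usize>, // byte offsets into the original source
}

struct MatchedPath {
    terminals: Vec<MatchedTerminal>,
}

/// Recover the expression as text by slicing the source between the first
/// matched terminal's start and the last matched terminal's end.
/// Returns `None` for an empty path or offsets that do not fit the source.
fn expression_text<'s>(source: &'s str, path: &MatchedPath) -> Option<&'s str> {
    let first = path.terminals.first()?;
    let last = path.terminals.last()?;
    source.get(first.span.start..last.span.end)
}
```

For the source `select Price * 2 from Items`, a path whose terminals cover `Price`, `*`, and `2` (bytes 7..12, 13..14, 15..16) yields the slice `Price * 2`, whitespace included, exactly as the user typed it.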
**Consequence for the `SELECT` builder (ADR-0030 / Phase 1).**
A command `ast_builder` today receives only `&MatchedPath`. The
`SELECT` builder needs the original source to populate
`Command::Select`'s validated SQL text. The builder signature
gains a `source: &str` parameter — a mechanical sweep across the
~21 existing `CommandNode` builders (most ignore it), of the same
category as ADR-0030's noted `match Command` sweep. It is called
out here because it is a direct consequence of the no-AST
decision; the change itself belongs to the Phase 1 SELECT work,
governed by ADR-0030.
### 3. Recursion, and the depth cap
The grammar's recursion points are all **token-guarded** — each
consumes at least one token before recursing, so the greedy
top-down walker always makes progress:
- `not_expr := NOT not_expr` — after `NOT`.
- `primary := ( or_expr )` — after `(`.
- `unary := ( - | + ) unary` — after a sign.
- `call_args` operands — after the call's `(`.
- `case_expr` sub-parts — after `CASE` / `WHEN` / `THEN` /
`ELSE`.
- `IN ( … )` operands — after `IN (`.
Every recursion is wired through `Node::Subgrammar(&NAMED)`
referencing a named `static` tier, exactly as in `expr.rs`. The
walker counts active `Subgrammar` frames in
`WalkContext::subgrammar_depth`; this grammar reuses ADR-0026's
`MAX_SUBGRAMMAR_DEPTH = 64` cap and its friendly
"expression nested too deeply" error — no new walker capability
is required. The ladder descends a few `Subgrammar` frames per
nesting level, so the effective hand-written nesting limit is
comfortably past anything a learner types; the cap is purely a
stack-overflow guard.
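The depth guard can be sketched on a toy grammar, `group := "(" group ")" | "x"`. `WalkContext::subgrammar_depth`, the cap of 64, and the error wording come from this ADR; the `walk_group` function and its byte-level matching are assumptions. The sketch counts one frame per parenthesis, whereas the real ladder spends a few frames per nesting level.

```rust
const MAX_SUBGRAMMAR_DEPTH: usize = 64;

struct WalkContext {
    subgrammar_depth: usize,
}

/// Walks `group := "(" group ")" | "x"` and returns the position after it.
fn walk_group(input: &[u8], pos: usize, ctx: &mut WalkContext) -> Result<usize, String> {
    match input.get(pos) {
        Some(b'x') => Ok(pos + 1),
        Some(b'(') => {
            // The "(" has been consumed, so this recursion is token-guarded.
            if ctx.subgrammar_depth >= MAX_SUBGRAMMAR_DEPTH {
                return Err("expression nested too deeply".to_string());
            }
            ctx.subgrammar_depth += 1;
            let inner = walk_group(input, pos + 1, ctx);
            ctx.subgrammar_depth -= 1; // restore the counter on success and on failure
            let after = inner?;
            match input.get(after) {
                Some(b')') => Ok(after + 1),
                _ => Err(format!("expected ')' at byte {after}")),
            }
        }
        _ => Err(format!("expected '(' or 'x' at byte {pos}")),
    }
}
```

In this sketch 64 nested parentheses are accepted and the 65th produces the friendly error instead of deepening the call stack further.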
### 4. A separate fragment, parallel to the DSL grammar
The SQL expression grammar is authored in a new file,
`src/dsl/grammar/sql_expr.rs`, parallel to `expr.rs` (which keeps
the DSL `WHERE` grammar). They are deliberately *not* merged:
- **Different operand sets.** The DSL operand is a column or a
literal; the SQL operand is a full scalar expression.
- **Different output.** `expr.rs` builds an `Expr`;
`sql_expr.rs` builds nothing (§2).
- **Mode isolation.** Simple mode must never gain arithmetic or
functions — the limits in `simple-mode-limitations.md` are a
teaching feature. A shared fragment risks leaking the SQL
surface into the DSL grammar.
- **Regression containment.** `expr.rs` is exercised by a large
share of the 1240-test suite. A parallel file changes none of
it.
The predicate-tail shapes (`cmp_op` / `LIKE` / `BETWEEN` / `IN` /
`IS NULL`) look structurally identical between the two grammars,
but each branch's operand sub-node differs (column-or-literal vs
`additive`), so the `static` nodes cannot literally be shared.
The *design* is shared — `sql_expr.rs` follows `expr.rs`'s
factoring (operand prefix matched once, infix `NOT` as an
explicit branch, no `Optional`-first branch) — and that is the
reuse that matters.
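The shared factoring can be sketched as a small recogniser. Everything here is illustrative: tokens are pre-split strings, and `operand` is a one-token stand-in (it does not even exclude keywords) for what is a column-or-literal in `expr.rs` and `additive` in `sql_expr.rs`. The point is the shape: the left operand is matched once, and the infix `NOT` is its own branch in front of the negatable forms.

```rust
fn kw(toks: &[&str], pos: usize, word: &str) -> bool {
    toks.get(pos).map_or(false, |t| t.eq_ignore_ascii_case(word))
}

/// Stand-in operand: a single word, number, or quoted string token.
fn operand(toks: &[&str], pos: usize) -> Option<usize> {
    let t = toks.get(pos)?;
    t.chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '\'')
        .then(|| pos + 1)
}

/// The forms that may follow an infix NOT: LIKE, BETWEEN, IN.
fn negatable(toks: &[&str], pos: usize) -> Option<usize> {
    if kw(toks, pos, "LIKE") {
        operand(toks, pos + 1)
    } else if kw(toks, pos, "BETWEEN") {
        let p = operand(toks, pos + 1)?;
        if !kw(toks, p, "AND") {
            return None;
        }
        operand(toks, p + 1)
    } else if kw(toks, pos, "IN") && kw(toks, pos + 1, "(") {
        let mut p = operand(toks, pos + 2)?;
        while kw(toks, p, ",") {
            p = operand(toks, p + 1)?;
        }
        kw(toks, p, ")").then(|| p + 1)
    } else {
        None
    }
}

/// Every branch starts with a concrete token; none starts with an optional.
fn predicate_tail(toks: &[&str], pos: usize) -> Option<usize> {
    const CMP: [&str; 7] = ["=", "<>", "!=", "<", "<=", ">", ">="];
    if CMP.iter().any(|op| kw(toks, pos, op)) {
        operand(toks, pos + 1)
    } else if kw(toks, pos, "NOT") {
        negatable(toks, pos + 1) // explicit `NOT negatable` branch
    } else if kw(toks, pos, "IS") {
        let p = if kw(toks, pos + 1, "NOT") { pos + 2 } else { pos + 1 };
        kw(toks, p, "NULL").then(|| p + 1)
    } else {
        negatable(toks, pos)
    }
}

/// predicate := operand predicate_tail?   (operand prefix matched once)
fn predicate(toks: &[&str]) -> bool {
    match operand(toks, 0) {
        Some(p) if p == toks.len() => true,
        Some(p) => predicate_tail(toks, p) == Some(toks.len()),
        None => false,
    }
}
```

Under these assumptions `a NOT IN ( 1 , 2 )` and `a IS NOT NULL` are accepted, while `a NOT = b` is rejected because `=` is not one of the negatable forms.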
### 5. Ambient assistance comes for free
Because the fragment is grammar in the unified tree, the walker
gives it — with no expression-specific assistance code — the same
ambient assistance every DSL command gets (ADR-0030 §8,
ADR-0022):
- **Syntax highlighting** of SQL keywords, identifiers, literals,
and operators, from the per-byte highlight classes the walk
records.
- **Tab completion** of SQL keywords (`and`, `or`, `like`,
`between`, `case`, `when`, …) and of column names — the
`name_or_call` identifier slot uses `IdentSource::Columns`, so
it completes against the statement's table(s) from the same
`SchemaCache` the DSL uses. Function names are not completed
(there is no allowlist, per ADR-0030 §13 OOS-3); a typed function
name simply is not a candidate.
- **Hint-panel prose** at each grammar slot.
- **The `[ERR]` / `[WRN]` validity indicator** (ADR-0027).
- **Per-command parse-error usage** (ADR-0021).
The `name_or_call` identifier slot resolves to `Columns` because,
at the moment the identifier is typed, the common case is a
column reference and column completion is the helpful default; a
function call is recognised a token later when `(` follows. The
grammar does not need to decide between the two (§2), so the slot
can optimise for the common completion.
### 6. Errors and the unsupported surface
A construct outside this grammar — a window function's `OVER`
clause, a `CAST` with `::` syntax, an array literal — is an
ordinary walker parse error, carrying the expected-set and
routed through the friendly-error layer with engine-neutral
wording (ADR-0030 §9, ADR-0019). There is no separate
"valid SQL but unsupported" classifier — ADR-0030 §1 dropped the
batch parser that would be needed for one.
Expression-level engine neutrality is **best-effort**, exactly as
ADR-0030 §7 states: the grammar enforces the *structural* subset
(operators, `CASE`, call syntax), but because there is no
function allowlist, an engine-specific function the grammar
admits and the engine then rejects surfaces an engine-neutral
*execution* error rather than being caught at parse time. This is
the accepted honest limitation; a function allowlist remains
ADR-0030 §13 OOS-3.
### 7. Out of scope
- **OOS-1. Subquery expressions.** A `( SELECT … )` as a
`primary`, `<op> ( SELECT … )`, `IN ( SELECT … )`, and
`EXISTS ( SELECT … )` are part of the eventual surface
(ADR-0030 §3) but cannot be realised until the `SELECT`
grammar itself exists and is recursive — that is ADR-0030
Phase 2 ("`SELECT` — full"). This ADR's grammar is authored so
that adding a subquery branch to `primary` (and an
`IN ( subquery )` / `EXISTS` form) is an additive change: a new
`Choice` branch guarded by `(`/`EXISTS`, recursing through
`Subgrammar` into the `SELECT` fragment. No restructuring is
foreseen.
- **OOS-2. Qualified column references** (`table.column`,
`alias.column`). A single-table `SELECT` (ADR-0030 Phase 1)
never needs them; they become meaningful with `JOIN`s
(Phase 2). `name_or_call` takes an unqualified identifier for
now; a `[ '.' identifier ]` tail is an additive extension.
- **OOS-3. Quoted identifiers** (`"column name"`). The DSL has no
quoted-identifier syntax; introducing one is a cross-cutting
lexer change, tracked separately.
- **OOS-4. A function allowlist** — ADR-0030 §13 OOS-3,
restated: function calls are admitted generically.
- **OOS-5. An expression AST.** Explicitly not built (§2). If a
future consumer genuinely needs structured expression data
(none is foreseen — DDL `CHECK`/`DEFAULT` store text), that is
a new decision, not a deferral.
## Consequences
- A new grammar file, `src/dsl/grammar/sql_expr.rs`, exporting a
single `pub static SQL_EXPRESSION: Node` (a
`Subgrammar(&SQL_OR_EXPR)`) that any SQL `CommandNode` drops
into its `Seq` as one node — the same drop-in shape as
`expr::EXPRESSION`.
- **No new walker capability.** `Subgrammar`, the depth counter,
the cap, and the friendly depth error are all reused from
ADR-0026 unchanged.
- **No expression AST, no fragment builder** — a deliberate
simplification over ADR-0026 (§2).
- `expr.rs` and the simple-mode `WHERE` surface are **untouched**;
the 1240-test baseline is insulated by construction (§4).
- The command `ast_builder` signature gains a `source: &str`
parameter (§2) — a ~21-site mechanical sweep, executed as part
of the Phase 1 `SELECT` work (ADR-0030), not here.
- Subquery expressions and qualified column references are
authored later as additive `primary` branches (§7) — the
grammar is shaped to receive them.
- The fragment is the shared dependency of every advanced-mode
expression slot — `WHERE`, `HAVING`, `SELECT` projections,
`CHECK`, `DEFAULT` — defined once.
## Implementation notes
A build order, each step guarded by the test suite. Steps 1–5 are
ADR-0030 Phase 1; the fragment is consumed first by the
single-table `SELECT`'s `WHERE` and projection slots.
1. **The grammar fragment**: `sql_expr.rs` with the stratified
tiers of §1 as named `static` `Node`s, recursion via
`Subgrammar`. No builder. `pub static SQL_EXPRESSION`.
2. **Unit tests** walking representative inputs against the
fragment directly (the `expr.rs` test pattern): every operator
and precedence pair, `CASE` both forms, function calls
including `count(*)` and `count(distinct …)`, the full
predicate set, parenthesised regrouping, the depth cap, and
the keyword-case-insensitivity check.
3. **Wire it into the Phase 1 `SELECT` grammar** — the `WHERE`
slot and the projection items reference `SQL_EXPRESSION`
(ADR-0030 Phase 1).
4. **Highlighting / completion / hint** spot-checks — confirm the
§5 assistance works through a SQL expression with no
expression-specific code, via the typing-surface matrix.
5. **Engine-neutral error** spot-checks for out-of-subset
constructs (§6).
Later phases extend the same fragment:
- **ADR-0030 Phase 2** adds the subquery `primary` branches and
qualified column references (OOS-1, OOS-2) once the recursive
`SELECT` grammar exists, and exercises the fragment from
`HAVING`.
- **ADR-0030 Phase 4** consumes the fragment from advanced-mode
DDL `CHECK` and `DEFAULT`.
## See also
- ADR-0019 — the friendly-error layer SQL parse and execution
errors route through (§6).
- ADR-0021 — per-command parse-error usage, free for SQL (§5).
- ADR-0022 — ambient typing assistance; §5 is its reach into the
SQL expression.
- ADR-0023 / ADR-0024 — the unified grammar tree this fragment
is authored into.
- ADR-0026 — the DSL `WHERE` expression grammar this is the
superset of: the `Subgrammar` node, the stratified-grammar
technique, the depth cap, and the `predicate_tail` factoring
are all inherited from it.
- ADR-0027 — the validity indicator, free for SQL (§5).
- ADR-0030 — advanced mode's SQL surface; §3 commissions this
ADR, §4/§6 are the source of the no-AST decision (§2), §7/§13
set the engine-neutrality posture and the no-allowlist rule.
- `docs/simple-mode-limitations.md` — the DSL limits this grammar
lifts for advanced mode (§1, §4).
## Status note — known-function list layered on the slot (2026-05-30)
The `sql_expr_ident` slot is `IdentSource::Columns` and, per §1 / §5,
does **not** itself know which identifiers are function names — it
optimises for the common case (a column reference) and admits the
function-call shape structurally; §5 explicitly noted "function names
are not completed … a typed function name simply is not a candidate".
**ADR-0022 Amendment 6** layers a curated known-function list
(`src/dsl/sql_functions.rs`) on top of this slot, consumed two ways:
as Tab-completion candidates so a learner can discover `sum` / `upper`
/ … (issue #15 — softening §5's "not completed" line to "completed
from a curated pedagogical list, not an allowlist for validation"),
and as the allowlist that lets the typing-time column-typo hint stay
strict at this slot — flag a partial as "no such column" only when it
matches neither a schema column nor a known function name (issue #16).
The grammar here is unchanged, and §6/§7's no-validation-allowlist
posture stands: the list drives completion + the typo hint, **not**
parse-time acceptance (an unknown function still parses and surfaces an
engine-neutral execution error). The list sits in the completion /
hint layer above the grammar.
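A sketch of how the two consumers might behave, under stated assumptions. `KNOWN_SQL_FUNCTIONS` and `CandidateKind::Function` are named in the commit message; the four-entry list, the `Column` variant, the function names `candidates` and `is_column_typo`, and the use of case-insensitive prefix matching for the schema-column check are all hypothetical.

```rust
// Illustrative subset only; the real list lives in src/dsl/sql_functions.rs.
// `cast` is deliberately absent: CAST(x AS type) is not a plain-call shape.
const KNOWN_SQL_FUNCTIONS: &[&str] = &["count", "sum", "upper", "lower"];

#[derive(Debug, PartialEq, Clone, Copy)]
enum CandidateKind {
    Column,
    Function,
}

fn has_prefix(name: &str, partial: &str) -> bool {
    name.to_ascii_lowercase()
        .starts_with(&partial.to_ascii_lowercase())
}

/// Tab candidates at the identifier slot: schema columns first, then known
/// functions. A function that shares a name with a column is dropped, so the
/// column wins.
fn candidates(partial: &str, columns: &[&str]) -> Vec<(String, CandidateKind)> {
    let mut out = Vec::new();
    for col in columns.iter().copied() {
        if has_prefix(col, partial) {
            out.push((col.to_string(), CandidateKind::Column));
        }
    }
    for func in KNOWN_SQL_FUNCTIONS.iter().copied() {
        let shadowed = columns.iter().any(|c| c.eq_ignore_ascii_case(func));
        if has_prefix(func, partial) && !shadowed {
            out.push((func.to_string(), CandidateKind::Function));
        }
    }
    out
}

/// Typing-time hint: flag "no such column" only when the partial
/// prefix-matches neither a known function nor a schema column.
fn is_column_typo(partial: &str, columns: &[&str]) -> bool {
    if KNOWN_SQL_FUNCTIONS.iter().any(|f| has_prefix(f, partial)) {
        return false; // could still become a function call; do not flag
    }
    !columns.iter().any(|c| has_prefix(c, partial))
}
```

Neither function takes part in parsing: an identifier that is absent from the list still walks through `name_or_call` unchanged, which keeps the list out of the validation path.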