# ADR-0031: The SQL expression grammar

## Status

Accepted

## Context

ADR-0030 made advanced mode a body of **SQL grammar inside the
unified grammar tree** (ADR-0023/0024) rather than a separate
batch parser. It deferred two large grammar slices to their own
focused ADRs (ADR-0030 §3): the **full `SELECT` grammar** and the
**SQL expression grammar**. This ADR fixes the second.

The SQL expression grammar is the fragment that fills every
expression slot in advanced-mode SQL — ADR-0030 §3 names them:
`WHERE`, `HAVING`, `CHECK`, `SELECT` projections, and `DEFAULT`.
ADR-0030 §3 describes it as "the superset of ADR-0026's `WHERE`
grammar" — adding arithmetic, function calls, `CASE`, and
(eventually) subquery expressions on top of the comparison /
`LIKE` / `IN` / `BETWEEN` / `IS NULL` predicate set that ADR-0026
already authored for the DSL.

It is the first concrete piece of ADR-0030's phased plan: ADR-0030
Phase 1 ("Foundations + first `SELECT`") opens with "Author the
core SQL **expression grammar** — the ADR-0026 superset — as its
own ADR." This is that ADR.

### What ADR-0026 already established

ADR-0026 authored a recursive `WHERE` expression for the DSL. The
machinery this ADR builds on is all in place:

- **`Node::Subgrammar(&'static Node)`** — a reference-following
  node that lets a named `static` grammar fragment appear inside
  its own subtree, so a recursive grammar can be expressed even
  though `Seq`/`Choice` embed children by value and cannot close
  a cycle.
- **A stratified grammar** — one named `static` `Node` per
  precedence tier — which removes left recursion (every recursion
  is guarded by a token) and encodes precedence in the layering.
- **`WalkContext::subgrammar_depth`** and
  `MAX_SUBGRAMMAR_DEPTH = 64` — a stack-overflow guard that turns
  pathologically nested input into a friendly error.
- **The factored `predicate_tail`** — the shared operand prefix
  matched once; the infix `NOT` factored as an explicit
  `NOT negatable` branch; no `Choice` branch starting with an
  `Optional` (an `Optional`-first `Seq` "commits" and discards
  sibling branches' expected sets).

This ADR reuses every one of those. The new grammar is larger,
but it is the same *kind* of grammar, walked by the same walker.

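The inherited machinery can be sketched in a few lines. This is a minimal illustration, not the project's walker: every `Node` variant other than `Subgrammar`, the `WalkError` type, the token-slice input, and the plain ordered-choice backtracking are assumptions made for the example. Only `Node::Subgrammar(&'static Node)`, `WalkContext::subgrammar_depth`, and `MAX_SUBGRAMMAR_DEPTH = 64` come from the text. It shows a named `static` fragment appearing inside its own subtree, with the recursion guarded by a `(` token and bounded by the depth counter.

```rust
/// Hypothetical, simplified grammar node; the real `Node` has more variants.
enum Node {
    Token(&'static str),
    Seq(&'static [Node]),
    Choice(&'static [Node]),
    /// Follows a reference to a named `static`, which is what closes a cycle.
    Subgrammar(&'static Node),
}

// primary := "x" | "(" primary ")"
// The recursion sits behind the "(" token, so every re-entry consumes input.
static PAREN_GROUP: [Node; 3] = [
    Node::Token("("),
    Node::Subgrammar(&PRIMARY),
    Node::Token(")"),
];
static PRIMARY_ALTS: [Node; 2] = [Node::Token("x"), Node::Seq(&PAREN_GROUP)];
static PRIMARY: Node = Node::Choice(&PRIMARY_ALTS);

const MAX_SUBGRAMMAR_DEPTH: usize = 64;

#[derive(Debug, PartialEq)]
enum WalkError {
    Mismatch,
    NestedTooDeeply,
}

struct WalkContext {
    subgrammar_depth: usize,
}

/// Returns the token position after the match, or an error.
fn walk(node: &Node, toks: &[&str], pos: usize, ctx: &mut WalkContext) -> Result<usize, WalkError> {
    match node {
        Node::Token(t) => {
            if toks.get(pos).copied() == Some(*t) {
                Ok(pos + 1)
            } else {
                Err(WalkError::Mismatch)
            }
        }
        Node::Seq(children) => {
            let mut p = pos;
            for child in children.iter() {
                p = walk(child, toks, p, ctx)?;
            }
            Ok(p)
        }
        Node::Choice(alts) => {
            // Simplification: try each branch in order, backtracking on mismatch.
            for alt in alts.iter() {
                match walk(alt, toks, pos, ctx) {
                    Err(WalkError::Mismatch) => continue,
                    other => return other,
                }
            }
            Err(WalkError::Mismatch)
        }
        Node::Subgrammar(inner) => {
            // Count active Subgrammar frames; refuse to go past the cap.
            if ctx.subgrammar_depth >= MAX_SUBGRAMMAR_DEPTH {
                return Err(WalkError::NestedTooDeeply);
            }
            ctx.subgrammar_depth += 1;
            let result = walk(inner, toks, pos, ctx);
            ctx.subgrammar_depth -= 1;
            result
        }
    }
}

/// Test helper: `depth` opening parens, an `x`, then `depth` closing parens.
fn nested(depth: usize) -> Vec<&'static str> {
    let mut toks = vec!["("; depth];
    toks.push("x");
    toks.extend(std::iter::repeat(")").take(depth));
    toks
}
```

Under these assumptions, `walk(&PRIMARY, &nested(64), 0, &mut ctx)` succeeds, while one more level of nesting returns `WalkError::NestedTooDeeply` instead of overflowing the stack.
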
### Why this is not just "extend `expr.rs`"

The DSL's `WHERE` grammar (`src/dsl/grammar/expr.rs`) is bound by
ADR-0026's deliberate teaching limits, recorded in
`docs/simple-mode-limitations.md`: operands are a column or a
literal — *no* arithmetic, *no* string concatenation, *no* scalar
functions, *no* subqueries. Those limits are a feature of simple
mode, not an accident; the DSL `WHERE` grammar must keep them.

Advanced mode is the surface that lifts them (ADR-0030 §4). So
the SQL expression grammar cannot be the DSL grammar with a few
nodes added — it has a different operand set (a full scalar
expression, not column-or-literal) and a different relationship
to its consumers (see Decision §2). It is a parallel fragment.
Keeping it parallel also keeps simple mode's 1240-test surface
untouched: nothing in `expr.rs` changes.

## Decision

### 1. One unified expression ladder

ADR-0026's DSL grammar stratifies into a *boolean* layer
(`or`/`and`/`not`/`bool_primary`) sitting above a *predicate*
layer, because the DSL deliberately forbids a boolean
sub-expression as a comparison operand — `(a > b) = (c > d)`
cannot be written.

Standard SQL draws no such line: a boolean *is* a value, `AND` /
`OR` / `NOT` and the comparison operators are simply operators at
their own precedence tiers, and a parenthesised group is a whole
expression regardless of whether it reads as "boolean" or
"scalar". The SQL expression grammar therefore is a **single
precedence ladder**, loosest tier to tightest:

```
expr            := or_expr
or_expr         := and_expr ( OR and_expr )*
and_expr        := not_expr ( AND not_expr )*
not_expr        := NOT not_expr | predicate
predicate       := additive predicate_tail?
predicate_tail  := cmp_op additive
                 | [ NOT ] LIKE additive
                 | [ NOT ] BETWEEN additive AND additive
                 | [ NOT ] IN ( additive ( , additive )* )
                 | IS [ NOT ] NULL
cmp_op          := = | <> | != | < | <= | > | >=
additive        := multiplicative ( ( + | - | || ) multiplicative )*
multiplicative  := unary ( ( * | / | % ) unary )*
unary           := ( - | + ) unary | primary
primary         := literal
                 | ( or_expr )
                 | case_expr
                 | name_or_call
name_or_call    := identifier [ '(' call_args? ')' ]
call_args       := '*' | [ DISTINCT ] or_expr ( , or_expr )*
case_expr       := CASE [ or_expr ]
                     ( WHEN or_expr THEN or_expr )+
                     [ ELSE or_expr ]
                   END
literal         := number | string | TRUE | FALSE | NULL
```

Precedence, loosest first: `OR`, `AND`, `NOT`, the comparison /
predicate tier, additive (`+ - ||`), multiplicative (`* / %`),
unary sign, primary. This is standard SQL operator precedence
restricted to the teaching-relevant operators.

Notes on specific productions:

- **`name_or_call` is factored, not a `Choice`.** A function call
  (`upper(Name)`) and a column reference (`Name`) share an
  identifier prefix. Splitting them into two `Choice` branches
  would let the function-call branch *commit* on the identifier
  and then fail at the missing `(`, discarding the column-ref
  branch (the ADR-0026 "no `Optional`-first branch" hazard, in
  reverse). Instead the identifier is matched once and the
  `( call_args )` group is an `Optional` tail: present → a call,
  absent → a column reference. The grammar need not decide which
  — see §2 — it only validates that one of the two shapes holds.
- **`call_args` handles `*` and `DISTINCT`.** `count(*)` is the
  one place `*` is an argument; `count(distinct col)` the one
  place `DISTINCT` leads an argument list. (The projection-level
  `select *` is *not* an expression — it belongs to the `SELECT`
  grammar, ADR-0030 / Phase 1, not here.) The grammar admits
  function calls structurally; it does not know which names are
  aggregates — that distinction is the engine's, and matters
  only once `GROUP BY` lands (ADR-0030 Phase 2).
- **`case_expr` covers both forms** — searched `CASE WHEN … END`
  and simple `CASE <operand> WHEN … END`. Every sub-part is an
  `or_expr` for uniformity (SQL allows any expression in each
  slot); `END` closes it.
- **`||` is string concatenation**, standard SQL, at the additive
  tier. It lifts `simple-mode-limitations.md`'s "no string
  concatenation".
- **`%` is modulo.** It is not in ISO SQL (which spells it
  `MOD(a, b)`), but it is near-universal across mainstream
  engines and is what a learner expects. ADR-0030's "pedagogy
  wins ties" admits it; `MOD` also remains reachable through the
  generic `name_or_call` path.

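To make the ladder and the factored `name_or_call` concrete, here is a hand-written recursive-descent recogniser for a subset of the grammar. It is a sketch under stated assumptions, not the project's `Node`-tree walker: the `Cursor` type, the pre-split token input, and all function names are hypothetical, and `LIKE` / `BETWEEN` / `IN` / `CASE` are omitted for brevity. Each function corresponds to one tier, and the function only answers accept or reject.

```rust
/// Cursor over pre-split tokens (a hypothetical stand-in for the real lexer).
struct Cursor<'a> {
    toks: &'a [&'a str],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn peek(&self) -> Option<&'a str> {
        self.toks.get(self.pos).copied()
    }
    /// Consume the next token if it equals one of `opts`, ignoring ASCII case.
    fn eat(&mut self, opts: &[&str]) -> bool {
        let hit = self
            .peek()
            .map_or(false, |t| opts.iter().any(|o| o.eq_ignore_ascii_case(t)));
        if hit {
            self.pos += 1;
        }
        hit
    }
}

const RESERVED: &[&str] = &["and", "or", "not", "is", "distinct"];

fn is_literal(t: &str) -> bool {
    t.starts_with('\'')
        || t.chars().all(|c| c.is_ascii_digit())
        || ["true", "false", "null"].iter().any(|k| k.eq_ignore_ascii_case(t))
}

fn is_identifier(t: &str) -> bool {
    t.starts_with(|c: char| c.is_ascii_alphabetic() || c == '_')
        && !RESERVED.iter().any(|k| k.eq_ignore_ascii_case(t))
}

/// Shared shape of the left-associative tiers: operand ( op operand )*.
fn chain(c: &mut Cursor<'_>, ops: &[&str], operand: fn(&mut Cursor<'_>) -> bool) -> bool {
    if !operand(c) {
        return false;
    }
    while c.eat(ops) {
        if !operand(c) {
            return false;
        }
    }
    true
}

fn or_expr(c: &mut Cursor<'_>) -> bool { chain(c, &["or"], and_expr) }
fn and_expr(c: &mut Cursor<'_>) -> bool { chain(c, &["and"], not_expr) }
fn not_expr(c: &mut Cursor<'_>) -> bool {
    if c.eat(&["not"]) { not_expr(c) } else { predicate(c) }
}
fn predicate(c: &mut Cursor<'_>) -> bool {
    if !additive(c) {
        return false;
    }
    if c.eat(&["=", "<>", "!=", "<", "<=", ">", ">="]) {
        return additive(c);
    }
    if c.eat(&["is"]) {
        c.eat(&["not"]); // optional NOT
        return c.eat(&["null"]);
    }
    true // no tail: the bare operand is the whole predicate
}
fn additive(c: &mut Cursor<'_>) -> bool { chain(c, &["+", "-", "||"], multiplicative) }
fn multiplicative(c: &mut Cursor<'_>) -> bool { chain(c, &["*", "/", "%"], unary) }
fn unary(c: &mut Cursor<'_>) -> bool {
    if c.eat(&["-", "+"]) { unary(c) } else { primary(c) }
}
fn primary(c: &mut Cursor<'_>) -> bool {
    if c.eat(&["("]) {
        return or_expr(c) && c.eat(&[")"]);
    }
    match c.peek() {
        Some(t) if is_literal(t) => { c.pos += 1; true }
        // The identifier is matched once; the call tail is optional.
        Some(t) if is_identifier(t) => { c.pos += 1; call_tail(c) }
        _ => false,
    }
}
/// The optional `( call_args? )` tail: absent means a column reference.
fn call_tail(c: &mut Cursor<'_>) -> bool {
    if !c.eat(&["("]) || c.eat(&[")"]) {
        return true;
    }
    if c.eat(&["*"]) {
        return c.eat(&[")"]);
    }
    c.eat(&["distinct"]); // optional DISTINCT
    chain(c, &[","], or_expr) && c.eat(&[")"])
}

fn accepts(toks: &[&str]) -> bool {
    let mut c = Cursor { toks, pos: 0 };
    or_expr(&mut c) && c.pos == toks.len()
}
```

With this sketch, `( a > b ) = ( c > d )` is accepted because a parenthesised group is a whole expression, and `upper ( name )` and `name` both pass through the same identifier match.
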
### 2. The fragment validates; it builds no AST

ADR-0026's `WHERE` grammar carries an AST-fragment builder
(`build_expr`) that folds the matched terminals into a recursive
`Expr`, because its consumers — `update` / `delete` / `show data`
— are typed `Command`s whose executor compiles that `Expr` to
parameterised SQL.

**The SQL expression grammar deliberately builds no AST.** This
follows directly from ADR-0030 §4 and §6:

- `WHERE` / `HAVING` / `SELECT` projections live inside a
  `SELECT` or a DML statement, and ADR-0030 §4 executes those
  "as the validated SQL itself … they change no schema, so
  modelling them as a typed `Command` buys nothing." There is no
  `Expr` to compile — the engine parses the SQL.
- `CHECK` and `DEFAULT` live inside advanced-mode DDL. ADR-0030
  §11 stores their expressions in `project.yaml` "as SQL the user
  could re-enter" — text, not a structured tree. ADR-0030 §4 is
  explicit that these expressions are "**not** lowered into the
  DSL's deliberately-limited `Expr`."

So no consumer of this grammar wants an `Expr`. The fragment's
entire job is the other three walker outputs:

1. **Accept or reject** — the input either is or is not a
   well-formed in-subset SQL expression.
2. **The flat `MatchedPath`** of matched terminals — which is
   what drives syntax highlighting, completion, the expected-set,
   and the hint panel (§5).
3. **A source span.** A consumer that needs the expression *as
   text* (the `SELECT` builder assembling `Command::Select`'s
   SQL; a future `CHECK` builder) recovers it by slicing the
   original source between the first and last matched terminal's
   byte offsets. The terminals already carry `span` for
   highlighting; nothing new is needed on the matched path.

This is a real simplification over ADR-0026 — no `build_expr`
analogue, no second structural pass, no expression AST type — and
it is the correct shape for a grammar whose consumers run SQL
rather than compile it. The grammar tier still owns validation,
highlighting, completion, and the no-left-recursion guarantee;
it simply has no tree to hand back.

**Consequence for the `SELECT` builder (ADR-0030 / Phase 1).**
A command `ast_builder` today receives only `&MatchedPath`. The
`SELECT` builder needs the original source to populate
`Command::Select`'s validated SQL text. The builder signature
gains a `source: &str` parameter — a mechanical sweep across the
~21 existing `CommandNode` builders (most ignore it), of the same
category as ADR-0030's noted `match Command` sweep. It is called
out here because it is a direct consequence of the no-AST
decision; the change itself belongs to the Phase 1 SELECT work,
governed by ADR-0030.

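The span-slicing recovery and the widened builder signature might look like the following. This is a sketch only: the field layout of `Terminal` and `MatchedPath`, the `sql` field on `Command::Select`, and the function names `expression_text` and `build_select` are assumptions for illustration. The text above states only that terminals carry a `span` of byte offsets and that builders gain a `source: &str` parameter.

```rust
use std::ops::Range;

/// Hypothetical shape: a matched terminal with its byte span in the source.
struct Terminal {
    span: Range<usize>,
}

/// Hypothetical shape: the flat list of matched terminals.
struct MatchedPath {
    terminals: Vec<Terminal>,
}

/// Recover the matched text by slicing from the first terminal's start
/// to the last terminal's end. Returns None for an empty path or a span
/// that does not fall on valid boundaries of `source`.
fn expression_text<'s>(source: &'s str, path: &MatchedPath) -> Option<&'s str> {
    let first = path.terminals.first()?;
    let last = path.terminals.last()?;
    source.get(first.span.start..last.span.end)
}

/// Hypothetical command shape holding the validated SQL as text.
#[derive(Debug, PartialEq)]
enum Command {
    Select { sql: String },
}

/// A builder with the extra `source` parameter. Builders for typed DSL
/// commands would accept the parameter and ignore it.
fn build_select(path: &MatchedPath, source: &str) -> Option<Command> {
    let sql = expression_text(source, path)?;
    Some(Command::Select { sql: sql.to_string() })
}
```

Because the slice borrows from `source`, no copy is made until a consumer such as `build_select` needs an owned `String`.
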
### 3. Recursion, and the depth cap

The grammar's recursion points are all **token-guarded** — each
consumes at least one token before recursing, so the greedy
top-down walker always makes progress:

- `not_expr := NOT not_expr` — after `NOT`.
- `primary := ( or_expr )` — after `(`.
- `unary := ( - | + ) unary` — after a sign.
- `call_args` operands — after the call's `(`.
- `case_expr` sub-parts — after `CASE` / `WHEN` / `THEN` /
  `ELSE`.
- `IN ( … )` operands — after `IN (`.

Every recursion is wired through `Node::Subgrammar(&NAMED)`
referencing a named `static` tier, exactly as in `expr.rs`. The
walker counts active `Subgrammar` frames in
`WalkContext::subgrammar_depth`; this grammar reuses ADR-0026's
`MAX_SUBGRAMMAR_DEPTH = 64` cap and its friendly
"expression nested too deeply" error — no new walker capability
is required. The ladder descends a few `Subgrammar` frames per
nesting level, so the effective hand-written nesting limit is
comfortably past anything a learner types; the cap is purely a
stack-overflow guard.

### 4. A separate fragment, parallel to the DSL grammar

The SQL expression grammar is authored in a new file,
`src/dsl/grammar/sql_expr.rs`, parallel to `expr.rs` (which keeps
the DSL `WHERE` grammar). They are deliberately *not* merged:

- **Different operand sets.** The DSL operand is a column or a
  literal; the SQL operand is a full scalar expression.
- **Different output.** `expr.rs` builds an `Expr`;
  `sql_expr.rs` builds nothing (§2).
- **Mode isolation.** Simple mode must never gain arithmetic or
  functions — the limits in `simple-mode-limitations.md` are a
  teaching feature. A shared fragment risks leaking the SQL
  surface into the DSL grammar.
- **Regression containment.** `expr.rs` is exercised by a large
  share of the 1240-test suite. A parallel file changes none of
  it.

The predicate-tail shapes (`cmp_op` / `LIKE` / `BETWEEN` / `IN` /
`IS NULL`) look structurally identical between the two grammars,
but each branch's operand sub-node differs (column-or-literal vs
`additive`), so the `static` nodes cannot literally be shared.
The *design* is shared — `sql_expr.rs` follows `expr.rs`'s
factoring (operand prefix matched once, infix `NOT` as an
explicit branch, no `Optional`-first branch) — and that is the
reuse that matters.

### 5. Ambient assistance comes for free

Because the fragment is grammar in the unified tree, the walker
gives it — with no expression-specific assistance code — the same
ambient assistance every DSL command gets (ADR-0030 §8,
ADR-0022):

- **Syntax highlighting** of SQL keywords, identifiers, literals,
  and operators, from the per-byte highlight classes the walk
  records.
- **Tab completion** of SQL keywords (`and`, `or`, `like`,
  `between`, `case`, `when`, …) and of column names — the
  `name_or_call` identifier slot uses `IdentSource::Columns`, so
  it completes against the statement's table(s) from the same
  `SchemaCache` the DSL uses. Function names are not completed
  (there is no allowlist — ADR-0030 §13 OOS-3); a typed function
  name simply is not a candidate.
- **Hint-panel prose** at each grammar slot.
- **The `[ERR]` / `[WRN]` validity indicator** (ADR-0027).
- **Per-command parse-error usage** (ADR-0021).

The `name_or_call` identifier slot resolves to `Columns` because,
at the moment the identifier is typed, the common case is a
column reference and column completion is the helpful default; a
function call is recognised a token later when `(` follows. The
grammar does not need to decide between the two (§2), so the slot
can optimise for the common completion.

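As an illustration of how per-byte highlight classes can fall out of the matched terminals with no expression-specific code, consider the sketch below. The `HighlightClass` variants, the `class` field on `Terminal`, and the `highlight_classes` function are all hypothetical names chosen for the example. The text above says only that the walk records per-byte classes for keywords, identifiers, literals, and operators.

```rust
use std::ops::Range;

/// Hypothetical highlight classes, mirroring the categories named above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum HighlightClass {
    Plain,
    Keyword,
    Identifier,
    Literal,
    Operator,
}

/// Hypothetical terminal: a byte span plus the class the walk assigned.
struct Terminal {
    span: Range<usize>,
    class: HighlightClass,
}

/// Paint one class per source byte; bytes no terminal covers stay Plain.
fn highlight_classes(source: &str, terminals: &[Terminal]) -> Vec<HighlightClass> {
    let mut classes = vec![HighlightClass::Plain; source.len()];
    for t in terminals {
        // Clamp defensively so a bad span cannot panic.
        let end = t.span.end.min(classes.len());
        let start = t.span.start.min(end);
        classes[start..end].fill(t.class);
    }
    classes
}
```

The function never inspects which grammar produced the terminals, which is the sense in which a SQL expression gets highlighting for free.
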
### 6. Errors and the unsupported surface

A construct outside this grammar — a window function's `OVER`
clause, a `CAST` with `::` syntax, an array literal — is an
ordinary walker parse error, carrying the expected-set and
routed through the friendly-error layer with engine-neutral
wording (ADR-0030 §9, ADR-0019). There is no separate
"valid SQL but unsupported" classifier — ADR-0030 §1 dropped the
batch parser that would be needed for one.

Expression-level engine neutrality is **best-effort**, exactly as
ADR-0030 §7 states: the grammar enforces the *structural* subset
(operators, `CASE`, call syntax), but because there is no
function allowlist, an engine-specific function the grammar
admits and the engine then rejects surfaces an engine-neutral
*execution* error rather than being caught at parse time. This is
the accepted honest limitation; a function allowlist remains
ADR-0030 §13 OOS-3.

### 7. Out of scope

- **OOS-1. Subquery expressions.** A `( SELECT … )` as a
  `primary`, `<op> ( SELECT … )`, `IN ( SELECT … )`, and
  `EXISTS ( SELECT … )` are part of the eventual surface
  (ADR-0030 §3) but cannot be realised until the `SELECT`
  grammar itself exists and is recursive — that is ADR-0030
  Phase 2 ("`SELECT` — full"). This ADR's grammar is authored so
  that adding a subquery branch to `primary` (and an
  `IN ( subquery )` / `EXISTS` form) is an additive change: a new
  `Choice` branch guarded by `(`/`EXISTS`, recursing through
  `Subgrammar` into the `SELECT` fragment. No restructuring is
  foreseen.
- **OOS-2. Qualified column references** (`table.column`,
  `alias.column`). A single-table `SELECT` (ADR-0030 Phase 1)
  never needs them; they become meaningful with `JOIN`s
  (Phase 2). `name_or_call` takes an unqualified identifier for
  now; a `[ '.' identifier ]` tail is an additive extension.
- **OOS-3. Quoted identifiers** (`"column name"`). The DSL has no
  quoted-identifier syntax; introducing one is a cross-cutting
  lexer change, tracked separately.
- **OOS-4. A function allowlist** — ADR-0030 §13 OOS-3,
  restated: function calls are admitted generically.
- **OOS-5. An expression AST.** Explicitly not built (§2). If a
  future consumer genuinely needs structured expression data
  (none is foreseen — DDL `CHECK`/`DEFAULT` store text), that is
  a new decision, not a deferral.

## Consequences

- A new grammar file, `src/dsl/grammar/sql_expr.rs`, exporting a
  single `pub static SQL_EXPRESSION: Node` (a
  `Subgrammar(&SQL_OR_EXPR)`) that any SQL `CommandNode` drops
  into its `Seq` as one node — the same drop-in shape as
  `expr::EXPRESSION`.
- **No new walker capability.** `Subgrammar`, the depth counter,
  the cap, and the friendly depth error are all reused from
  ADR-0026 unchanged.
- **No expression AST, no fragment builder** — a deliberate
  simplification over ADR-0026 (§2).
- `expr.rs` and the simple-mode `WHERE` surface are **untouched**;
  the 1240-test baseline is insulated by construction (§4).
- The command `ast_builder` signature gains a `source: &str`
  parameter (§2) — a ~21-site mechanical sweep, executed as part
  of the Phase 1 `SELECT` work (ADR-0030), not here.
- Subquery expressions and qualified column references are
  authored later as additive `primary` branches (§7) — the
  grammar is shaped to receive them.
- The fragment is the shared dependency of every advanced-mode
  expression slot — `WHERE`, `HAVING`, `SELECT` projections,
  `CHECK`, `DEFAULT` — defined once.

## Implementation notes

A build order, each step guarded by the test suite. Steps 1–5 are
ADR-0030 Phase 1; the fragment is consumed first by the
single-table `SELECT`'s `WHERE` and projection slots.

1. **The grammar fragment** — `sql_expr.rs` with the stratified
   tiers of §1 as named `static` `Node`s, recursion via
   `Subgrammar`. No builder. `pub static SQL_EXPRESSION`.
2. **Unit tests** walking representative inputs against the
   fragment directly (the `expr.rs` test pattern): every operator
   and precedence pair, `CASE` both forms, function calls
   including `count(*)` and `count(distinct …)`, the full
   predicate set, parenthesised regrouping, the depth cap, and
   the keyword-case-insensitivity check.
3. **Wire it into the Phase 1 `SELECT` grammar** — the `WHERE`
   slot and the projection items reference `SQL_EXPRESSION`
   (ADR-0030 Phase 1).
4. **Highlighting / completion / hint** spot-checks — confirm the
   §5 assistance works through a SQL expression with no
   expression-specific code, via the typing-surface matrix.
5. **Engine-neutral error** spot-checks for out-of-subset
   constructs (§6).

Later phases extend the same fragment:

- **ADR-0030 Phase 2** adds the subquery `primary` branches and
  qualified column references (OOS-1, OOS-2) once the recursive
  `SELECT` grammar exists, and exercises the fragment from
  `HAVING`.
- **ADR-0030 Phase 4** consumes the fragment from advanced-mode
  DDL `CHECK` and `DEFAULT`.

## See also

- ADR-0019 — the friendly-error layer SQL parse and execution
  errors route through (§6).
- ADR-0021 — per-command parse-error usage, free for SQL (§5).
- ADR-0022 — ambient typing assistance; §5 is its reach into the
  SQL expression.
- ADR-0023 / ADR-0024 — the unified grammar tree this fragment
  is authored into.
- ADR-0026 — the DSL `WHERE` expression grammar this is the
  superset of: the `Subgrammar` node, the stratified-grammar
  technique, the depth cap, and the `predicate_tail` factoring
  are all inherited from it.
- ADR-0027 — the validity indicator, free for SQL (§5).
- ADR-0030 — advanced mode's SQL surface; §3 commissions this
  ADR, §4/§6 are the source of the no-AST decision (§2), §7/§13
  set the engine-neutrality posture and the no-allowlist rule.
- `docs/simple-mode-limitations.md` — the DSL limits this grammar
  lifts for advanced mode (§1, §4).

## Status note — known-function list layered on the slot (2026-05-30)

The `sql_expr_ident` slot is `IdentSource::Columns` and, per §1 / §5,
does **not** itself know which identifiers are function names — it
optimises for the common case (a column reference) and admits the
function-call shape structurally; §5 explicitly noted "function names
are not completed … a typed function name simply is not a candidate".
**ADR-0022 Amendment 6** layers a curated known-function list
(`src/dsl/sql_functions.rs`) on top of this slot, consumed two ways:
as Tab-completion candidates so a learner can discover `sum` / `upper`
/ … (issue #15 — softening §5's "not completed" line to "completed
from a curated pedagogical list, not an allowlist for validation"),
and as the allow-list that lets the typing-time column-typo hint stay
strict at this slot — flag a partial as "no such column" only when it
matches neither a schema column nor a known function name (issue #16).
The grammar here is unchanged, and §6/§7's no-validation-allowlist
posture stands: the list drives completion + the typo hint, **not**
parse-time acceptance (an unknown function still parses and surfaces an
engine-neutral execution error). The list sits in the completion /
hint layer above the grammar.

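The two uses of the list can be sketched as follows. Everything here is illustrative: the three-entry `KNOWN_SQL_FUNCTIONS` is a stand-in for the real contents of `src/dsl/sql_functions.rs`, the `CandidateKind` enum and both function names are hypothetical, and prefix matching for both columns and functions is an assumption about how a "partial" is compared. Letting a column shadow a same-named function is also an assumed policy.

```rust
/// Illustrative subset only; the real list lives in src/dsl/sql_functions.rs.
const KNOWN_SQL_FUNCTIONS: &[&str] = &["count", "sum", "upper"];

/// Hypothetical candidate kinds for the completion popup.
#[derive(Debug, PartialEq)]
enum CandidateKind {
    Column,
    Function,
}

/// Case-insensitive prefix test (assumed matching rule).
fn starts_with_ignore_case(name: &str, partial: &str) -> bool {
    let p = partial.to_ascii_lowercase();
    name.to_ascii_lowercase().starts_with(p.as_str())
}

/// Completion: schema columns first, then known functions. A function
/// whose name equals a column is dropped so the column wins (assumed).
fn completion_candidates(partial: &str, columns: &[&str]) -> Vec<(String, CandidateKind)> {
    let mut out: Vec<(String, CandidateKind)> = columns
        .iter()
        .filter(|c| starts_with_ignore_case(c, partial))
        .map(|c| (c.to_string(), CandidateKind::Column))
        .collect();
    for f in KNOWN_SQL_FUNCTIONS {
        let shadowed = columns.iter().any(|c| c.eq_ignore_ascii_case(f));
        if !shadowed && starts_with_ignore_case(f, partial) {
            out.push((f.to_string(), CandidateKind::Function));
        }
    }
    out
}

/// Typo hint: flag "no such column" only when the partial matches
/// neither a schema column nor a known function name.
fn flags_unknown_column(partial: &str, columns: &[&str]) -> bool {
    if partial.is_empty() {
        return false;
    }
    let matches_function = KNOWN_SQL_FUNCTIONS
        .iter()
        .any(|f| starts_with_ignore_case(f, partial));
    let matches_column = columns.iter().any(|c| starts_with_ignore_case(c, partial));
    !matches_function && !matches_column
}
```

Neither function is consulted by the grammar walk, which keeps the list out of parse-time acceptance: an identifier that appears in neither the schema nor the list still parses.
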