docs: ADR-0031 — SQL expression grammar

ADR-0030 §3 commissioned a focused ADR for the stratified SQL expression grammar fragment. ADR-0031 records the decisions:

- One unified precedence ladder (OR/AND/NOT, comparison/LIKE/IN/BETWEEN/IS NULL predicates, arithmetic incl. `||`, function calls, CASE). SQL treats booleans as values, so unlike ADR-0026's bool/scalar split this is a single ladder.
- No AST. Every Phase-1 consumer (SELECT projection, WHERE) runs validated SQL as text per ADR-0030 §4/§6; CHECK/DEFAULT in Phase 4 store text too. The fragment's job is accept / reject + the matched-terminal path + a source span.
- Recursion via Subgrammar with ADR-0026's depth cap reused.
- A parallel `grammar/sql_expr.rs`, separate from `expr.rs` so simple mode's 1240-test surface is untouched by construction.
- Subquery expressions and qualified `t.c` column refs deferred to ADR-0030 Phase 2 (they need the recursive SELECT grammar).

`%` modulo is included alongside `+ - * /` and `||`. It isn't ISO SQL but is near-universal across mainstream engines and matches learner expectations (pedagogy wins ties, ADR-0030).

Status: Accepted. The implementation lands in subsequent commits.
2026-05-19 21:37:23 +00:00
parent 5438ba6a47
commit 81793a3a85
2 changed files with 412 additions and 0 deletions
@@ -0,0 +1,411 @@
+# ADR-0031: The SQL expression grammar
+
+## Status
+
+Accepted
+
+## Context
+
+ADR-0030 made advanced mode a body of **SQL grammar inside the
+unified grammar tree** (ADR-0023/0024) rather than a separate
+batch parser. It deferred two large grammar slices to their own
+focused ADRs (ADR-0030 §3): the **full `SELECT` grammar** and the
+**SQL expression grammar**. This ADR fixes the second.
+
+The SQL expression grammar is the fragment that fills every
+expression slot in advanced-mode SQL — ADR-0030 §3 names them:
+`WHERE`, `HAVING`, `CHECK`, `SELECT` projections, and `DEFAULT`.
+ADR-0030 §3 describes it as "the superset of ADR-0026's `WHERE`
+grammar" — adding arithmetic, function calls, `CASE`, and
+(eventually) subquery expressions on top of the comparison /
+`LIKE` / `IN` / `BETWEEN` / `IS NULL` predicate set that ADR-0026
+already authored for the DSL.
+
+It is the first concrete piece of ADR-0030's phased plan: ADR-0030
+Phase 1 ("Foundations + first `SELECT`") opens with "Author the
+core SQL **expression grammar** — the ADR-0026 superset — as its
+own ADR." This is that ADR.
+
+### What ADR-0026 already established
+
+ADR-0026 authored a recursive `WHERE` expression for the DSL. The
+machinery this ADR builds on is all in place:
+
+- **`Node::Subgrammar(&'static Node)`** — a reference-following
+  node that lets a named `static` grammar fragment appear inside
+  its own subtree, so a recursive grammar can be expressed even
+  though `Seq`/`Choice` embed children by value and cannot close
+  a cycle.
+- **A stratified grammar** — one named `static` `Node` per
+  precedence tier — which removes left recursion (every recursion
+  is guarded by a token) and encodes precedence in the layering.
+- **`WalkContext::subgrammar_depth`** and
+  `MAX_SUBGRAMMAR_DEPTH = 64` — a stack-overflow guard that turns
+  pathologically nested input into a friendly error.
+- **The factored `predicate_tail`** — the shared operand prefix
+  matched once; the infix `NOT` factored as an explicit
+  `NOT negatable` branch; no `Choice` branch starting with an
+  `Optional` (an `Optional`-first `Seq` "commits" and discards
+  sibling branches' expected sets).
+
+This ADR reuses every one of those. The new grammar is larger,
+but it is the same *kind* of grammar, walked by the same walker.
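As a rough illustration of the inherited machinery, the sketch below closes a grammar cycle through a reference-following node. The `Node` enum here is a hypothetical, pared-down stand-in (the project's real `Node`, its variants, and its walker are richer), and the helper statics, `walk`, and `accepts` are names invented for this example only.

```rust
/// Hypothetical, pared-down stand-in for the project's grammar node.
pub enum Node {
    /// Matches one keyword or punctuation token, case-insensitively.
    Token(&'static str),
    /// Matches any token that is not reserved in this tiny grammar.
    Ident,
    /// Children are embedded by value, so a `Seq` cannot contain itself.
    Seq(&'static [Node]),
    Choice(&'static [Node]),
    /// Follows a reference to a named `static`, which may be an ancestor.
    Subgrammar(&'static Node),
}

// not_expr := NOT not_expr | primary
// primary  := ( not_expr ) | identifier
// Each tier is a named static; the cycle is closed only through references.
static NOT_BRANCH: [Node; 2] = [Node::Token("NOT"), Node::Subgrammar(&NOT_EXPR)];
static NOT_ALTS: [Node; 2] = [Node::Seq(&NOT_BRANCH), Node::Subgrammar(&PRIMARY)];
pub static NOT_EXPR: Node = Node::Choice(&NOT_ALTS);

static PAREN_BRANCH: [Node; 3] =
    [Node::Token("("), Node::Subgrammar(&NOT_EXPR), Node::Token(")")];
static PRIMARY_ALTS: [Node; 2] = [Node::Seq(&PAREN_BRANCH), Node::Ident];
pub static PRIMARY: Node = Node::Choice(&PRIMARY_ALTS);

/// Returns the position after a successful match, or `None` on failure.
pub fn walk(node: &Node, toks: &[&str], pos: usize) -> Option<usize> {
    match node {
        Node::Token(t) => toks.get(pos)?.eq_ignore_ascii_case(t).then_some(pos + 1),
        Node::Ident => {
            let tok = toks.get(pos)?;
            let reserved = tok.eq_ignore_ascii_case("NOT") || *tok == "(" || *tok == ")";
            (!reserved).then_some(pos + 1)
        }
        // Every child must match in order, threading the position through.
        Node::Seq(children) => children.iter().try_fold(pos, |p, c| walk(c, toks, p)),
        // First alternative that matches wins.
        Node::Choice(alts) => alts.iter().find_map(|alt| walk(alt, toks, pos)),
        // Recursion happens here, always after a guarding token was consumed.
        Node::Subgrammar(target) => walk(target, toks, pos),
    }
}

/// True when the whole token slice is one well-formed `not_expr`.
pub fn accepts(toks: &[&str]) -> bool {
    walk(&NOT_EXPR, toks, 0) == Some(toks.len())
}
```

For example, `accepts(&["NOT", "(", "NOT", "active", ")"])` holds, while a dangling `NOT` is rejected. The sketch omits the depth counter; in the real walker that guard lives in `WalkContext`.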
+
+### Why this is not just "extend `expr.rs`"
+
+The DSL's `WHERE` grammar (`src/dsl/grammar/expr.rs`) is bound by
+ADR-0026's deliberate teaching limits, recorded in
+`docs/simple-mode-limitations.md`: operands are a column or a
+literal — *no* arithmetic, *no* string concatenation, *no* scalar
+functions, *no* subqueries. Those limits are a feature of simple
+mode, not an accident; the DSL `WHERE` grammar must keep them.
+
+Advanced mode is the surface that lifts them (ADR-0030 §4). So
+the SQL expression grammar cannot be the DSL grammar with a few
+nodes added — it has a different operand set (a full scalar
+expression, not column-or-literal) and a different relationship
+to its consumers (see Decision §2). It is a parallel fragment.
+Keeping it parallel also keeps simple mode's 1240-test surface
+untouched: nothing in `expr.rs` changes.
+
+## Decision
+
+### 1. One unified expression ladder
+
+ADR-0026's DSL grammar stratifies into a *boolean* layer
+(`or`/`and`/`not`/`bool_primary`) sitting above a *predicate*
+layer, because the DSL deliberately forbids a boolean
+sub-expression as a comparison operand — `(a > b) = (c > d)`
+cannot be written.
+
+Standard SQL draws no such line: a boolean *is* a value, `AND` /
+`OR` / `NOT` and the comparison operators are simply operators at
+their own precedence tiers, and a parenthesised group is a whole
+expression regardless of whether it reads as "boolean" or
+"scalar". The SQL expression grammar therefore is a **single
+precedence ladder**, loosest tier to tightest:
+
+```
+expr            := or_expr
+or_expr         := and_expr      ( OR  and_expr )*
+and_expr        := not_expr      ( AND not_expr )*
+not_expr        := NOT not_expr  |  predicate
+predicate       := additive predicate_tail?
+predicate_tail  := cmp_op additive
+                 | [ NOT ] LIKE additive
+                 | [ NOT ] BETWEEN additive AND additive
+                 | [ NOT ] IN ( additive ( , additive )* )
+                 | IS [ NOT ] NULL
+cmp_op          := =  |  <>  |  !=  |  <  |  <=  |  >  |  >=
+additive        := multiplicative ( ( + | - | || ) multiplicative )*
+multiplicative  := unary ( ( * | / | % ) unary )*
+unary           := ( - | + ) unary  |  primary
+primary         := literal
+                 | ( or_expr )
+                 | case_expr
+                 | name_or_call
+name_or_call    := identifier  [ '(' call_args? ')' ]
+call_args       := '*'  |  [ DISTINCT ] or_expr ( , or_expr )*
+case_expr       := CASE [ or_expr ]
+                        ( WHEN or_expr THEN or_expr )+
+                        [ ELSE or_expr ]
+                   END
+literal         := number | string | TRUE | FALSE | NULL
+```
+
+Precedence, loosest first: `OR`, `AND`, `NOT`, the comparison /
+predicate tier, additive (`+ - ||`), multiplicative (`* / %`),
+unary sign, primary. This is standard SQL operator precedence
+restricted to the teaching-relevant operators.
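To make the tier ordering visible, here is a small sketch that re-emits a pre-tokenised expression with explicit parentheses. It is illustrative only: the real fragment builds no such structure (Decision §2), the `Grouper` type and `group` function are invented for this example, the predicate tier is cut down to the comparison operators, and operands are limited to bare words and numbers.

```rust
const RESERVED: [&str; 3] = ["OR", "AND", "NOT"];

struct Grouper<'a> {
    toks: &'a [&'a str],
    pos: usize,
}

impl<'a> Grouper<'a> {
    fn peek(&self) -> Option<&'a str> {
        self.toks.get(self.pos).copied()
    }

    /// Consumes the next token if it is one of `ops` (case-insensitive).
    fn eat_any(&mut self, ops: &[&str]) -> Option<&'a str> {
        let tok = self.peek()?;
        if ops.iter().any(|op| op.eq_ignore_ascii_case(tok)) {
            self.pos += 1;
            Some(tok)
        } else {
            None
        }
    }

    /// Shared shape of every left-associative tier: `next ( op next )*`.
    fn left_assoc(&mut self, ops: &[&str], next: fn(&mut Self) -> Option<String>) -> Option<String> {
        let mut lhs = next(self)?;
        while let Some(op) = self.eat_any(ops) {
            let rhs = next(self)?;
            lhs = format!("({lhs} {op} {rhs})");
        }
        Some(lhs)
    }

    fn or_expr(&mut self) -> Option<String> { self.left_assoc(&["OR"], Self::and_expr) }
    fn and_expr(&mut self) -> Option<String> { self.left_assoc(&["AND"], Self::not_expr) }

    fn not_expr(&mut self) -> Option<String> {
        if self.eat_any(&["NOT"]).is_some() {
            return Some(format!("(NOT {})", self.not_expr()?));
        }
        self.predicate()
    }

    /// Comparison only; LIKE / BETWEEN / IN / IS NULL are left out of the sketch.
    fn predicate(&mut self) -> Option<String> {
        let lhs = self.additive()?;
        match self.eat_any(&["=", "<>", "!=", "<", "<=", ">", ">="]) {
            Some(op) => Some(format!("({lhs} {op} {})", self.additive()?)),
            None => Some(lhs),
        }
    }

    fn additive(&mut self) -> Option<String> { self.left_assoc(&["+", "-", "||"], Self::multiplicative) }
    fn multiplicative(&mut self) -> Option<String> { self.left_assoc(&["*", "/", "%"], Self::unary) }

    fn unary(&mut self) -> Option<String> {
        if let Some(sign) = self.eat_any(&["-", "+"]) {
            return Some(format!("({sign}{})", self.unary()?));
        }
        self.primary()
    }

    fn primary(&mut self) -> Option<String> {
        if self.eat_any(&["("]).is_some() {
            let inner = self.or_expr()?;
            self.eat_any(&[")"])?;
            return Some(inner);
        }
        let tok = self.peek()?;
        let is_word = !tok.is_empty() && tok.chars().all(|c| c.is_alphanumeric() || c == '_');
        let is_reserved = RESERVED.iter().any(|kw| kw.eq_ignore_ascii_case(tok));
        if is_word && !is_reserved {
            self.pos += 1;
            Some(tok.to_string())
        } else {
            None
        }
    }
}

/// Fully parenthesised rendering, or `None` if the tokens are not one expression.
pub fn group(toks: &[&str]) -> Option<String> {
    let mut g = Grouper { toks, pos: 0 };
    let out = g.or_expr()?;
    (g.pos == toks.len()).then_some(out)
}
```

Each tier calls only the next tighter tier for its operands, so a looser operator can never end up inside a tighter one unless parentheses re-enter the ladder at `or_expr`.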
+
+Notes on specific productions:
+
+- **`name_or_call` is factored, not a `Choice`.** A function call
+  (`upper(Name)`) and a column reference (`Name`) share an
+  identifier prefix. Splitting them into two `Choice` branches
+  would let the function-call branch *commit* on the identifier
+  and then fail at the missing `(`, discarding the column-ref
+  branch (the ADR-0026 "no `Optional`-first branch" hazard, in
+  reverse). Instead the identifier is matched once and the
+  `( call_args )` group is an `Optional` tail: present → a call,
+  absent → a column reference. The grammar need not decide which
+  — see §2 — it only validates that one of the two shapes holds.
+- **`call_args` handles `*` and `DISTINCT`.** `count(*)` is the
+  one place `*` is an argument; `count(distinct col)` the one
+  place `DISTINCT` leads an argument list. (The projection-level
+  `select *` is *not* an expression — it belongs to the `SELECT`
+  grammar, ADR-0030 / Phase 1, not here.) The grammar admits
+  function calls structurally; it does not know which names are
+  aggregates — that distinction is the engine's, and matters
+  only once `GROUP BY` lands (ADR-0030 Phase 2).
+- **`case_expr` covers both forms** — searched `CASE WHEN … END`
+  and simple `CASE <operand> WHEN … END`. Every sub-part is an
+  `or_expr` for uniformity (SQL allows any expression in each
+  slot); `END` closes it.
+- **`||` is string concatenation**, standard SQL, at the additive
+  tier. It lifts `simple-mode-limitations.md`'s "no string
+  concatenation".
+- **`%` is modulo.** It is not in ISO SQL (which spells it
+  `MOD(a, b)`), but it is near-universal across mainstream
+  engines and is what a learner expects. ADR-0030's "pedagogy
+  wins ties" admits it; `MOD` also remains reachable through the
+  generic `name_or_call` path.
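The `name_or_call` factoring above can be sketched as a hand-written recogniser: the identifier is matched once, and the call tail is entered only when `(` is actually present. This is a simplified illustration, not the grammar-node implementation; `Shape`, `is_identifier`, and the function signature are invented here, and each argument is reduced to a bare identifier where the real grammar allows any `or_expr`.

```rust
#[derive(Debug, PartialEq)]
pub enum Shape {
    ColumnRef,
    Call,
}

fn is_identifier(tok: &str) -> bool {
    let mut chars = tok.chars();
    matches!(chars.next(), Some(c) if c.is_alphabetic() || c == '_')
        && chars.all(|c| c.is_alphanumeric() || c == '_')
}

/// name_or_call := identifier [ '(' call_args? ')' ]
/// Returns the matched shape and the position just past it.
pub fn name_or_call(toks: &[&str], pos: usize) -> Option<(Shape, usize)> {
    // The shared prefix is matched exactly once.
    if !is_identifier(toks.get(pos)?) {
        return None;
    }
    let after_ident = pos + 1;

    // Optional tail: without `(` this is a column reference, and nothing
    // has been committed that could hide that reading.
    if toks.get(after_ident) != Some(&"(") {
        return Some((Shape::ColumnRef, after_ident));
    }

    // From here on the input must be a well-formed call.
    let mut p = after_ident + 1;
    if toks.get(p) == Some(&"*") {
        p += 1; // count(*)
    } else if toks.get(p) != Some(&")") {
        if toks.get(p).is_some_and(|t| t.eq_ignore_ascii_case("DISTINCT")) {
            p += 1; // count(distinct col)
        }
        loop {
            // Simplification: an argument is a single identifier here.
            if !is_identifier(toks.get(p)?) {
                return None;
            }
            p += 1;
            if toks.get(p) == Some(&",") {
                p += 1;
            } else {
                break;
            }
        }
    }
    (toks.get(p) == Some(&")")).then_some((Shape::Call, p + 1))
}
```

The `Shape` value exists only to make the two outcomes observable in the sketch; per Decision §2 the real fragment does not need to report which shape matched.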
+
+### 2. The fragment validates; it builds no AST
+
+ADR-0026's `WHERE` grammar carries an AST-fragment builder
+(`build_expr`) that folds the matched terminals into a recursive
+`Expr`, because its consumers — `update` / `delete` / `show data`
+— are typed `Command`s whose executor compiles that `Expr` to
+parameterised SQL.
+
+**The SQL expression grammar deliberately builds no AST.** This
+follows directly from ADR-0030 §4 and §6:
+
+- `WHERE` / `HAVING` / `SELECT` projections live inside a
+  `SELECT` or a DML statement, and ADR-0030 §4 executes those
+  "as the validated SQL itself … they change no schema, so
+  modelling them as a typed `Command` buys nothing." There is no
+  `Expr` to compile — the engine parses the SQL.
+- `CHECK` and `DEFAULT` live inside advanced-mode DDL. ADR-0030
+  §11 stores their expressions in `project.yaml` "as SQL the user
+  could re-enter" — text, not a structured tree. ADR-0030 §4 is
+  explicit that these expressions are "**not** lowered into the
+  DSL's deliberately-limited `Expr`."
+
+So no consumer of this grammar wants an `Expr`. The fragment's
+entire job is the other three walker outputs:
+
+1. **Accept or reject** — the input either is or is not a
+   well-formed in-subset SQL expression.
+2. **The flat `MatchedPath`** of matched terminals — which is
+   what drives syntax highlighting, completion, the expected-set,
+   and the hint panel (§5).
+3. **A source span.** A consumer that needs the expression *as
+   text* (the `SELECT` builder assembling `Command::Select`'s
+   SQL; a future `CHECK` builder) recovers it by slicing the
+   original source between the first and last matched terminal's
+   byte offsets. The terminals already carry `span` for
+   highlighting; nothing new is needed on the matched path.
+
+This is a real simplification over ADR-0026 — no `build_expr`
+analogue, no second structural pass, no expression AST type — and
+it is the correct shape for a grammar whose consumers run SQL
+rather than compile it. The grammar tier still owns validation,
+highlighting, completion, and the no-left-recursion guarantee;
+it simply has no tree to hand back.
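The span-based recovery described in output 3 might look like the following minimal sketch. `MatchedTerminal` is a hypothetical stand-in for one entry of the real `MatchedPath` (which carries more than a span), and `expression_text` is a name chosen for this example.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one matched terminal on the path.
pub struct MatchedTerminal {
    /// Byte offsets of the terminal in the original source.
    pub span: Range<usize>,
}

/// Recovers the expression as text by slicing the source between the first
/// and last matched terminal. Returns `None` for an empty path or for
/// offsets that do not fall on valid boundaries of `source`.
pub fn expression_text<'s>(source: &'s str, path: &[MatchedTerminal]) -> Option<&'s str> {
    let start = path.first()?.span.start;
    let end = path.last()?.span.end;
    source.get(start..end)
}
```

Using `str::get` rather than direct indexing keeps a bad span from panicking; whether the real builder prefers that or an assertion is a separate choice.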
+
+**Consequence for the `SELECT` builder (ADR-0030 / Phase 1).**
+A command `ast_builder` today receives only `&MatchedPath`. The
+`SELECT` builder needs the original source to populate
+`Command::Select`'s validated SQL text. The builder signature
+gains a `source: &str` parameter — a mechanical sweep across the
+~21 existing `CommandNode` builders (most ignore it), of the same
+category as ADR-0030's noted `match Command` sweep. It is called
+out here because it is a direct consequence of the no-AST
+decision; the change itself belongs to the Phase 1 SELECT work,
+governed by ADR-0030.
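A sketch of what the signature sweep could look like, under heavy assumptions: the `MatchedPath` and `Command` shapes below are invented placeholders (the real types are not shown in this ADR), as are the `AstBuilder` alias and the two example builders. The only point illustrated is that most builders accept and ignore the new `source` parameter while the `SELECT` builder slices it.

```rust
/// Placeholder: the real `MatchedPath` is a list of matched terminals.
pub struct MatchedPath {
    pub first_byte: usize,
    pub last_byte: usize,
}

/// Placeholder commands; the field name `sql` is an assumption.
#[derive(Debug, PartialEq)]
pub enum Command {
    ShowTables,
    Select { sql: String },
}

/// Assumed builder shape after the sweep: the source text rides along.
pub type AstBuilder = fn(path: &MatchedPath, source: &str) -> Option<Command>;

/// Typical existing builder: gains the parameter and ignores it.
pub fn build_show_tables(_path: &MatchedPath, _source: &str) -> Option<Command> {
    Some(Command::ShowTables)
}

/// The one builder that needs the text: it slices the validated SQL out of
/// the original source instead of reassembling it from a tree.
pub fn build_select(path: &MatchedPath, source: &str) -> Option<Command> {
    let sql = source.get(path.first_byte..path.last_byte)?;
    Some(Command::Select { sql: sql.to_string() })
}

/// Both builders fit the same function-pointer type.
pub const BUILDERS: [AstBuilder; 2] = [build_show_tables, build_select];
```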
+
+### 3. Recursion, and the depth cap
+
+The grammar's recursion points are all **token-guarded** — each
+consumes at least one token before recursing, so the greedy
+top-down walker always makes progress:
+
+- `not_expr := NOT not_expr` — after `NOT`.
+- `primary := ( or_expr )` — after `(`.
+- `unary := ( - | + ) unary` — after a sign.
+- `call_args` operands — after the call's `(`.
+- `case_expr` sub-parts — after `CASE` / `WHEN` / `THEN` /
+  `ELSE`.
+- `IN ( … )` operands — after `IN (`.
+
+Every recursion is wired through `Node::Subgrammar(&NAMED)`
+referencing a named `static` tier, exactly as in `expr.rs`. The
+walker counts active `Subgrammar` frames in
+`WalkContext::subgrammar_depth`; this grammar reuses ADR-0026's
+`MAX_SUBGRAMMAR_DEPTH = 64` cap and its friendly
+"expression nested too deeply" error — no new walker capability
+is required. The ladder descends a few `Subgrammar` frames per
+nesting level, so the effective nesting limit is lower than 64
+levels, but it remains comfortably past anything a learner types
+by hand; the cap is purely a stack-overflow guard.
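A minimal sketch of how a frame counter of this kind turns runaway nesting into an error instead of a stack overflow. Only `MAX_SUBGRAMMAR_DEPTH = 64` and the `subgrammar_depth` field name come from the text above; the `WalkError` enum, the `walk_primary` function, and the one-frame-per-level toy grammar are assumptions made for the example.

```rust
pub const MAX_SUBGRAMMAR_DEPTH: usize = 64;

/// Stand-in for the walker context; only the depth counter is modelled.
#[derive(Default)]
pub struct WalkContext {
    pub subgrammar_depth: usize,
}

#[derive(Debug, PartialEq)]
pub enum WalkError {
    /// Would surface as the friendly "expression nested too deeply" error.
    TooDeep,
    Syntax,
}

/// Toy grammar: primary := ( primary ) | x
/// Each call models entering one `Subgrammar` frame.
pub fn walk_primary(
    toks: &[&str],
    pos: usize,
    ctx: &mut WalkContext,
) -> Result<usize, WalkError> {
    // Refuse to open another frame once the cap is reached.
    if ctx.subgrammar_depth >= MAX_SUBGRAMMAR_DEPTH {
        return Err(WalkError::TooDeep);
    }
    ctx.subgrammar_depth += 1;
    let result = match toks.get(pos) {
        // Token-guarded recursion: `(` is consumed before descending.
        Some(&"(") => walk_primary(toks, pos + 1, ctx).and_then(|p| match toks.get(p) {
            Some(&")") => Ok(p + 1),
            _ => Err(WalkError::Syntax),
        }),
        Some(&"x") => Ok(pos + 1),
        _ => Err(WalkError::Syntax),
    };
    // The counter tracks *active* frames, so it is restored on the way out.
    ctx.subgrammar_depth -= 1;
    result
}
```

In this toy grammar the innermost operand also occupies a frame, so 63 levels of parentheses pass and 64 do not; the real ladder spends several frames per level, which lowers the practical limit accordingly.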
+
+### 4. A separate fragment, parallel to the DSL grammar
+
+The SQL expression grammar is authored in a new file,
+`src/dsl/grammar/sql_expr.rs`, parallel to `expr.rs` (which keeps
+the DSL `WHERE` grammar). They are deliberately *not* merged:
+
+- **Different operand sets.** The DSL operand is a column or a
+  literal; the SQL operand is a full scalar expression.
+- **Different output.** `expr.rs` builds an `Expr`;
+  `sql_expr.rs` builds nothing (§2).
+- **Mode isolation.** Simple mode must never gain arithmetic or
+  functions — the limits in `simple-mode-limitations.md` are a
+  teaching feature. A shared fragment risks leaking the SQL
+  surface into the DSL grammar.
+- **Regression containment.** `expr.rs` is exercised by a large
+  share of the 1240-test suite. A parallel file changes none of
+  it.
+
+The predicate-tail shapes (`cmp_op` / `LIKE` / `BETWEEN` / `IN` /
+`IS NULL`) look structurally identical between the two grammars,
+but each branch's operand sub-node differs (column-or-literal vs
+`additive`), so the `static` nodes cannot literally be shared.
+The *design* is shared — `sql_expr.rs` follows `expr.rs`'s
+factoring (operand prefix matched once, infix `NOT` as an
+explicit branch, no `Optional`-first branch) — and that is the
+reuse that matters.
+
+### 5. Ambient assistance comes for free
+
+Because the fragment is grammar in the unified tree, the walker
+gives it — with no expression-specific assistance code — the same
+ambient assistance every DSL command gets (ADR-0030 §8,
+ADR-0022):
+
+- **Syntax highlighting** of SQL keywords, identifiers, literals,
+  and operators, from the per-byte highlight classes the walk
+  records.
+- **Tab completion** of SQL keywords (`and`, `or`, `like`,
+  `between`, `case`, `when`, …) and of column names — the
+  `name_or_call` identifier slot uses `IdentSource::Columns`, so
+  it completes against the statement's table(s) from the same
+  `SchemaCache` the DSL uses. Function names are not completed:
+  there is no allowlist (ADR-0030 §13 OOS-3), so a function name
+  is never offered as a candidate.
+- **Hint-panel prose** at each grammar slot.
+- **The `[ERR]` / `[WRN]` validity indicator** (ADR-0027).
+- **Per-command parse-error usage** (ADR-0021).
+
+The `name_or_call` identifier slot resolves to `Columns` because,
+at the moment the identifier is typed, the common case is a
+column reference and column completion is the helpful default; a
+function call is recognised a token later when `(` follows. The
+grammar does not need to decide between the two (§2), so the slot
+can optimise for the common completion.
+
+### 6. Errors and the unsupported surface
+
+A construct outside this grammar (a window function's `OVER`
+clause, a PostgreSQL-style `::` cast, an array literal) is an
+ordinary walker parse error, carrying the expected-set and
+routed through the friendly-error layer with engine-neutral
+wording (ADR-0030 §9, ADR-0019). There is no separate
+"valid SQL but unsupported" classifier — ADR-0030 §1 dropped the
+batch parser that would be needed for one.
+
+Expression-level engine neutrality is **best-effort**, exactly as
+ADR-0030 §7 states: the grammar enforces the *structural* subset
+(operators, `CASE`, call syntax), but because there is no
+function allowlist, an engine-specific function the grammar
+admits and the engine then rejects surfaces an engine-neutral
+*execution* error rather than being caught at parse time. This is
+the accepted honest limitation; a function allowlist remains
+ADR-0030 §13 OOS-3.
+
+### 7. Out of scope
+
+- **OOS-1. Subquery expressions.** A `( SELECT … )` as a
+  `primary`, `<op> ( SELECT … )`, `IN ( SELECT … )`, and
+  `EXISTS ( SELECT … )` are part of the eventual surface
+  (ADR-0030 §3) but cannot be realised until the `SELECT`
+  grammar itself exists and is recursive — that is ADR-0030
+  Phase 2 ("`SELECT` — full"). This ADR's grammar is authored so
+  that adding a subquery branch to `primary` (and an
+  `IN ( subquery )` / `EXISTS` form) is an additive change: a new
+  `Choice` branch guarded by `(`/`EXISTS`, recursing through
+  `Subgrammar` into the `SELECT` fragment. No restructuring is
+  foreseen.
+- **OOS-2. Qualified column references** (`table.column`,
+  `alias.column`). A single-table `SELECT` (ADR-0030 Phase 1)
+  never needs them; they become meaningful with `JOIN`s
+  (Phase 2). `name_or_call` takes an unqualified identifier for
+  now; a `[ '.' identifier ]` tail is an additive extension.
+- **OOS-3. Quoted identifiers** (`"column name"`). The DSL has no
+  quoted-identifier syntax; introducing one is a cross-cutting
+  lexer change, tracked separately.
+- **OOS-4. A function allowlist** — ADR-0030 §13 OOS-3,
+  restated: function calls are admitted generically.
+- **OOS-5. An expression AST.** Explicitly not built (§2). If a
+  future consumer genuinely needs structured expression data
+  (none is foreseen — DDL `CHECK`/`DEFAULT` store text), that is
+  a new decision, not a deferral.
+
+## Consequences
+
+- A new grammar file, `src/dsl/grammar/sql_expr.rs`, exporting a
+  single `pub static SQL_EXPRESSION: Node` (a
+  `Subgrammar(&SQL_OR_EXPR)`) that any SQL `CommandNode` drops
+  into its `Seq` as one node — the same drop-in shape as
+  `expr::EXPRESSION`.
+- **No new walker capability.** `Subgrammar`, the depth counter,
+  the cap, and the friendly depth error are all reused from
+  ADR-0026 unchanged.
+- **No expression AST, no fragment builder** — a deliberate
+  simplification over ADR-0026 (§2).
+- `expr.rs` and the simple-mode `WHERE` surface are **untouched**;
+  the 1240-test baseline is insulated by construction (§4).
+- The command `ast_builder` signature gains a `source: &str`
+  parameter (§2) — a ~21-site mechanical sweep, executed as part
+  of the Phase 1 `SELECT` work (ADR-0030), not here.
+- Subquery expressions and qualified column references are
+  authored later as additive extensions (§7): new `primary`
+  branches and a `name_or_call` tail respectively. The grammar
+  is shaped to receive them.
+- The fragment is the shared dependency of every advanced-mode
+  expression slot — `WHERE`, `HAVING`, `SELECT` projections,
+  `CHECK`, `DEFAULT` — defined once.
+
+## Implementation notes
+
+A build order, each step guarded by the test suite. Steps 1–5 are
+ADR-0030 Phase 1; the fragment is consumed first by the
+single-table `SELECT`'s `WHERE` and projection slots.
+
+1. **The grammar fragment** — `sql_expr.rs` with the stratified
+   tiers of §1 as named `static` `Node`s, recursion via
+   `Subgrammar`. No builder. `pub static SQL_EXPRESSION`.
+2. **Unit tests** walking representative inputs against the
+   fragment directly (the `expr.rs` test pattern): every operator
+   and precedence pair, `CASE` both forms, function calls
+   including `count(*)` and `count(distinct …)`, the full
+   predicate set, parenthesised regrouping, the depth cap, and
+   the keyword-case-insensitivity check.
+3. **Wire it into the Phase 1 `SELECT` grammar** — the `WHERE`
+   slot and the projection items reference `SQL_EXPRESSION`
+   (ADR-0030 Phase 1).
+4. **Highlighting / completion / hint** spot-checks — confirm the
+   §5 assistance works through a SQL expression with no
+   expression-specific code, via the typing-surface matrix.
+5. **Engine-neutral error** spot-checks for out-of-subset
+   constructs (§6).
+
+Later phases extend the same fragment:
+
+- **ADR-0030 Phase 2** adds the subquery `primary` branches and
+  qualified column references (OOS-1, OOS-2) once the recursive
+  `SELECT` grammar exists, and exercises the fragment from
+  `HAVING`.
+- **ADR-0030 Phase 4** consumes the fragment from advanced-mode
+  DDL `CHECK` and `DEFAULT`.
+
+## See also
+
+- ADR-0019 — the friendly-error layer SQL parse and execution
+  errors route through (§6).
+- ADR-0021 — per-command parse-error usage, free for SQL (§5).
+- ADR-0022 — ambient typing assistance; §5 is its reach into the
+  SQL expression.
+- ADR-0023 / ADR-0024 — the unified grammar tree this fragment
+  is authored into.
+- ADR-0026 — the DSL `WHERE` expression grammar this is the
+  superset of: the `Subgrammar` node, the stratified-grammar
+  technique, the depth cap, and the `predicate_tail` factoring
+  are all inherited from it.
+- ADR-0027 — the validity indicator, free for SQL (§5).
+- ADR-0030 — advanced mode's SQL surface; §3 commissions this
+  ADR, §4/§6 are the source of the no-AST decision (§2), §7/§13
+  set the engine-neutrality posture and the no-allowlist rule.
+- `docs/simple-mode-limitations.md` — the DSL limits this grammar
+  lifts for advanced mode (§1, §4).
@@ -36,3 +36,4 @@ This directory contains the project's ADRs, recorded per
 - [ADR-0028 — Query plans (`EXPLAIN QUERY PLAN`)](0028-query-plans.md) — **Accepted**, an `explain` prefix command over `show data` / `update` / `delete`; an annotated, span-styled plan tree; introduces the `OutputLine` styled-runs mechanism (ADR-0016's deferred per-span styling) (`QA1` / `QA2`)
 - [ADR-0029 — Column constraints (NOT NULL / UNIQUE / CHECK / DEFAULT)](0029-column-constraints.md) — **Accepted**, the four column-level constraints declared in the column-spec suffix (`create table` / `add column`) and modified on existing columns via `add constraint …` / `drop constraint …`; a pre-flight dry-run guards populated columns; `CHECK` reuses the ADR-0026 expression grammar via `Subgrammar` (`C3`)
 - [ADR-0030 — Advanced mode: the standard-SQL surface](0030-advanced-mode-sql-surface.md) — **Accepted**, SQL added as grammar *within the unified grammar tree* (ADR-0024), not a separate batch parser — so SQL gets the same completion / highlighting / hints / parse-errors as the DSL; mode gates the SQL forms; DDL routes through the typed `Command` executor (metadata + type vocabulary preserved), DML and `SELECT` execute as validated SQL; engine-neutral posture, the DSL→SQL teaching echo; supersedes ADR-0001's `sqlparser-rs` reservation; phased plan (`Q1` / `Q2` / `Q4`)
+- [ADR-0031 — The SQL expression grammar](0031-sql-expression-grammar.md) — **Accepted**, the stratified SQL expression grammar fragment commissioned by ADR-0030 §3: a single precedence ladder (`OR`/`AND`/`NOT`, the comparison/`LIKE`/`IN`/`BETWEEN`/`IS NULL` predicate set, arithmetic incl. `||`, function calls, `CASE`) — the superset of ADR-0026's DSL `WHERE` grammar, authored as a parallel fragment so simple mode is untouched; pure validation, builds **no** AST (consumers run/store SQL as text per ADR-0030 §4/§6); reuses ADR-0026's `Subgrammar` recursion + depth cap unchanged; subquery expressions and qualified column refs deferred to ADR-0030 Phase 2