rdbms-playground/docs/adr/0031-sql-expression-grammar.md
claude@clouddev1 6d8c9eea36 feat: curated SQL function list — Tab completion (#15) + typing-time typo hint (#16)
Add src/dsl/sql_functions.rs (KNOWN_SQL_FUNCTIONS) as the shared source
of truth at sql_expr_ident slots:

- #15: offer the functions as Tab candidates under a new
  CandidateKind::Function + ninth Theme colour tok_function (blue,
  distinct from keyword/identifier/type).
- #16: restore the column-typo flag the #6 fix had dropped wholesale —
  invalid_ident_at_cursor now bails only when the partial prefix-matches
  a known function, else falls through to the schema-column check.

A column named like a function (e.g. `count`) is deduped (column wins).
`cast` is excluded — CAST(x AS type) is not a plain-call shape.
The no-validation-allowlist posture stands: the list drives completion +
the typo hint only, never parse-time acceptance.

Docs: ADR-0022 Amendment 6, ADR-0031 status note, README index,
requirements I3/I4 + refreshed test baseline.
2026-05-31 11:49:10 +00:00

ADR-0031: The SQL expression grammar

Status

Accepted

Context

ADR-0030 made advanced mode a body of SQL grammar inside the unified grammar tree (ADR-0023/0024) rather than a separate batch parser. It deferred two large grammar slices to their own focused ADRs (ADR-0030 §3): the full SELECT grammar and the SQL expression grammar. This ADR fixes the second.

The SQL expression grammar is the fragment that fills every expression slot in advanced-mode SQL — ADR-0030 §3 names them: WHERE, HAVING, CHECK, SELECT projections, and DEFAULT. ADR-0030 §3 describes it as "the superset of ADR-0026's WHERE grammar" — adding arithmetic, function calls, CASE, and (eventually) subquery expressions on top of the comparison / LIKE / IN / BETWEEN / IS NULL predicate set that ADR-0026 already authored for the DSL.

It is the first concrete piece of ADR-0030's phased plan: ADR-0030 Phase 1 ("Foundations + first SELECT") opens with "Author the core SQL expression grammar — the ADR-0026 superset — as its own ADR." This is that ADR.

What ADR-0026 already established

ADR-0026 authored a recursive WHERE expression for the DSL. The machinery this ADR builds on is all in place:

  • Node::Subgrammar(&'static Node) — a reference-following node that lets a named static grammar fragment appear inside its own subtree, so a recursive grammar can be expressed even though Seq/Choice embed children by value and cannot close a cycle.
  • A stratified grammar — one named static Node per precedence tier — which removes left recursion (every recursion is guarded by a token) and encodes precedence in the layering.
  • WalkContext::subgrammar_depth and MAX_SUBGRAMMAR_DEPTH = 64 — a stack-overflow guard that turns pathologically nested input into a friendly error.
  • The factored predicate_tail — the shared operand prefix matched once; the infix NOT factored as an explicit NOT negatable branch; no Choice branch starting with an Optional (an Optional-first Seq "commits" and discards sibling branches' expected sets).

This ADR reuses every one of those. The new grammar is larger, but it is the same kind of grammar, walked by the same walker.
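
A minimal Rust sketch of that inherited machinery, under stated assumptions: the Node variants, the token slice, and the walk function below are simplified stand-ins invented for illustration, not the project's definitions. Only Node::Subgrammar, MAX_SUBGRAMMAR_DEPTH = 64, and the "expression nested too deeply" wording come from the text above. The depth argument plays the role of WalkContext::subgrammar_depth.

```rust
/// Hypothetical, simplified stand-in for the grammar node type.
pub enum Node {
    Word(&'static str),        // keyword terminal, matched case-insensitively
    Ident,                     // any identifier-shaped token
    Seq(&'static [Node]),      // children embedded by value in an array
    Choice(&'static [Node]),   // first matching branch wins
    Subgrammar(&'static Node), // follows a reference to a named static
}

pub const MAX_SUBGRAMMAR_DEPTH: usize = 64;

// not_expr := NOT not_expr | identifier
// The cycle is closed only by Subgrammar(&NOT_EXPR); the arrays cannot contain themselves.
static NOT_BRANCH: [Node; 2] = [Node::Word("not"), Node::Subgrammar(&NOT_EXPR)];
static NOT_ALTS: [Node; 2] = [Node::Seq(&NOT_BRANCH), Node::Ident];
pub static NOT_EXPR: Node = Node::Choice(&NOT_ALTS);

fn is_ident(t: &str) -> bool {
    !t.is_empty()
        && !t.eq_ignore_ascii_case("not")
        && t.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Ok(Some(next)) on a match, Ok(None) on a mismatch, Err when the cap is hit.
pub fn walk(
    node: &Node,
    toks: &[&str],
    pos: usize,
    depth: usize,
) -> Result<Option<usize>, String> {
    match node {
        Node::Word(w) => Ok(match toks.get(pos) {
            Some(t) if t.eq_ignore_ascii_case(w) => Some(pos + 1),
            _ => None,
        }),
        Node::Ident => Ok(match toks.get(pos) {
            Some(t) if is_ident(t) => Some(pos + 1),
            _ => None,
        }),
        Node::Seq(children) => {
            let mut p = pos;
            for child in children.iter() {
                match walk(child, toks, p, depth)? {
                    Some(next) => p = next,
                    None => return Ok(None),
                }
            }
            Ok(Some(p))
        }
        Node::Choice(branches) => {
            for branch in branches.iter() {
                if let Some(next) = walk(branch, toks, pos, depth)? {
                    return Ok(Some(next));
                }
            }
            Ok(None)
        }
        Node::Subgrammar(target) => {
            // Every recursion passes through here, after a token was consumed.
            if depth >= MAX_SUBGRAMMAR_DEPTH {
                return Err("expression nested too deeply".to_string());
            }
            walk(target, toks, pos, depth + 1)
        }
    }
}
```

In this sketch, walking `not not x` from NOT_EXPR consumes all three tokens, while an input with a hundred leading `not` tokens returns the depth error instead of exhausting the stack.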

Why this is not just "extend expr.rs"

The DSL's WHERE grammar (src/dsl/grammar/expr.rs) is bound by ADR-0026's deliberate teaching limits, recorded in docs/simple-mode-limitations.md: operands are a column or a literal — no arithmetic, no string concatenation, no scalar functions, no subqueries. Those limits are a feature of simple mode, not an accident; the DSL WHERE grammar must keep them.

Advanced mode is the surface that lifts them (ADR-0030 §4). So the SQL expression grammar cannot be the DSL grammar with a few nodes added — it has a different operand set (a full scalar expression, not column-or-literal) and a different relationship to its consumers (see Decision §2). It is a parallel fragment. Keeping it parallel also keeps simple mode's 1240-test surface untouched: nothing in expr.rs changes.

Decision

1. One unified expression ladder

ADR-0026's DSL grammar stratifies into a boolean layer (or/and/not/bool_primary) sitting above a predicate layer, because the DSL deliberately forbids a boolean sub-expression as a comparison operand — (a > b) = (c > d) cannot be written.

Standard SQL draws no such line: a boolean is a value, AND / OR / NOT and the comparison operators are simply operators at their own precedence tiers, and a parenthesised group is a whole expression regardless of whether it reads as "boolean" or "scalar". The SQL expression grammar therefore is a single precedence ladder, loosest tier to tightest:

expr            := or_expr
or_expr         := and_expr      ( OR  and_expr )*
and_expr        := not_expr      ( AND not_expr )*
not_expr        := NOT not_expr  |  predicate
predicate       := additive predicate_tail?
predicate_tail  := cmp_op additive
                 | [ NOT ] LIKE additive
                 | [ NOT ] BETWEEN additive AND additive
                 | [ NOT ] IN '(' additive ( , additive )* ')'
                 | IS [ NOT ] NULL
cmp_op          := =  |  <>  |  !=  |  <  |  <=  |  >  |  >=
additive        := multiplicative ( ( + | - | || ) multiplicative )*
multiplicative  := unary ( ( * | / | % ) unary )*
unary           := ( - | + ) unary  |  primary
primary         := literal
                 | '(' or_expr ')'
                 | case_expr
                 | name_or_call
name_or_call    := identifier  [ '(' call_args? ')' ]
call_args       := '*'  |  [ DISTINCT ] or_expr ( , or_expr )*
case_expr       := CASE [ or_expr ]
                        ( WHEN or_expr THEN or_expr )+
                        [ ELSE or_expr ]
                   END
literal         := number | string | TRUE | FALSE | NULL

Precedence, loosest first: OR, AND, NOT, the comparison / predicate tier, additive (+ - ||), multiplicative (* / %), unary sign, primary. This is standard SQL operator precedence restricted to the teaching-relevant operators.
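
The ladder can be read as one function per tier, each calling the next tighter tier. The following Rust recogniser is an illustrative sketch only: it covers a subset (no IN, CASE, or function calls), expects tokens already separated by whitespace, and every name in it (Ladder, chain, accepts) is hypothetical rather than taken from the project's walker. It accepts or rejects and builds nothing.

```rust
/// Illustrative recogniser for a subset of the ladder (hypothetical names).
pub struct Ladder<'a> {
    toks: Vec<&'a str>,
    pos: usize,
}

const RESERVED: [&str; 6] = ["or", "and", "not", "like", "between", "is"];

fn is_operand(t: &str) -> bool {
    // A bare word or integer that is not a reserved keyword.
    !t.is_empty()
        && t.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
        && !RESERVED.iter().any(|r| t.eq_ignore_ascii_case(r))
}

impl<'a> Ladder<'a> {
    /// Consume the next token if it equals one of `words` (case-insensitive).
    fn eat(&mut self, words: &[&str]) -> bool {
        let hit = self
            .toks
            .get(self.pos)
            .map_or(false, |t| words.iter().any(|w| t.eq_ignore_ascii_case(w)));
        if hit {
            self.pos += 1;
        }
        hit
    }

    /// tier := next ( op next )*
    fn chain(&mut self, ops: &[&str], next: fn(&mut Self) -> bool) -> bool {
        if !next(self) {
            return false;
        }
        while self.eat(ops) {
            if !next(self) {
                return false;
            }
        }
        true
    }

    fn or_expr(&mut self) -> bool { self.chain(&["or"], Self::and_expr) }
    fn and_expr(&mut self) -> bool { self.chain(&["and"], Self::not_expr) }
    fn not_expr(&mut self) -> bool {
        if self.eat(&["not"]) { self.not_expr() } else { self.predicate() }
    }

    /// predicate := additive predicate_tail?
    fn predicate(&mut self) -> bool {
        if !self.additive() {
            return false;
        }
        if self.eat(&["=", "<>", "!=", "<", "<=", ">", ">="]) {
            return self.additive();
        }
        if self.eat(&["is"]) {
            self.eat(&["not"]);
            return self.eat(&["null"]);
        }
        let negated = self.eat(&["not"]);
        if self.eat(&["like"]) {
            return self.additive();
        }
        if self.eat(&["between"]) {
            return self.additive() && self.eat(&["and"]) && self.additive();
        }
        !negated // no tail is fine; an infix NOT with nothing after it is not
    }

    fn additive(&mut self) -> bool { self.chain(&["+", "-", "||"], Self::multiplicative) }
    fn multiplicative(&mut self) -> bool { self.chain(&["*", "/", "%"], Self::unary) }
    fn unary(&mut self) -> bool {
        if self.eat(&["-", "+"]) { self.unary() } else { self.primary() }
    }
    fn primary(&mut self) -> bool {
        if self.eat(&["("]) {
            return self.or_expr() && self.eat(&[")"]);
        }
        match self.toks.get(self.pos) {
            Some(t) if is_operand(t) => {
                self.pos += 1;
                true
            }
            _ => false,
        }
    }
}

/// True when the whole input is one well-formed expression in the subset.
pub fn accepts(src: &str) -> bool {
    let mut w = Ladder { toks: src.split_whitespace().collect(), pos: 0 };
    w.or_expr() && w.pos == w.toks.len()
}
```

Because BETWEEN operands are parsed at the additive tier, the AND inside `x not between 1 and 10` is never mistaken for the boolean AND, which sits two tiers looser.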

Notes on specific productions:

  • name_or_call is factored, not a Choice. A function call (upper(Name)) and a column reference (Name) share an identifier prefix. Splitting them into two Choice branches would let the function-call branch commit on the identifier and then fail at the missing (, discarding the column-ref branch (the ADR-0026 "no Optional-first branch" hazard, in reverse). Instead the identifier is matched once and the ( call_args ) group is an Optional tail: present → a call, absent → a column reference. The grammar need not decide which — see §2 — it only validates that one of the two shapes holds.
  • call_args handles * and DISTINCT. count(*) is the one place * is an argument; count(distinct col) the one place DISTINCT leads an argument list. (The projection-level select * is not an expression — it belongs to the SELECT grammar, ADR-0030 / Phase 1, not here.) The grammar admits function calls structurally; it does not know which names are aggregates — that distinction is the engine's, and matters only once GROUP BY lands (ADR-0030 Phase 2).
  • case_expr covers both forms — searched CASE WHEN … END and simple CASE <operand> WHEN … END. Every sub-part is an or_expr for uniformity (SQL allows any expression in each slot); END closes it.
  • || is string concatenation, standard SQL, at the additive tier. It lifts simple-mode-limitations.md's "no string concatenation".
  • % is modulo. It is not in ISO SQL (which spells it MOD(a, b)), but it is near-universal across mainstream engines and is what a learner expects. ADR-0030's "pedagogy wins ties" admits it; MOD also remains reachable through the generic name_or_call path.
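
The name_or_call factoring can be sketched in a few lines of Rust. This is a hedged illustration, not the project's node definition: the function name, the pre-split token slice, and the restriction of arguments to bare words are all assumptions made to keep the example short. The point it shows is that the identifier is matched once and the parenthesised group is an optional tail.

```rust
fn is_word(t: &str) -> bool {
    !t.is_empty() && t.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Returns how many tokens a leading name_or_call consumes, or None.
/// Hypothetical helper; arguments are limited to bare words in this sketch.
pub fn name_or_call(toks: &[&str]) -> Option<usize> {
    // The shared identifier prefix is matched exactly once.
    if !toks.first().map_or(false, |t| is_word(t)) {
        return None;
    }
    // Optional tail absent: the shape is a column reference.
    if toks.get(1) != Some(&"(") {
        return Some(1);
    }
    // Optional tail present: from here on the call shape must complete.
    let mut pos = 2;
    if toks.get(pos) == Some(&"*") {
        pos += 1; // count(*)
    } else if toks.get(pos) != Some(&")") {
        if toks.get(pos).map_or(false, |t| t.eq_ignore_ascii_case("distinct")) {
            pos += 1; // count(distinct col)
        }
        loop {
            if !toks.get(pos).map_or(false, |t| is_word(t)) {
                return None;
            }
            pos += 1;
            if toks.get(pos) == Some(&",") {
                pos += 1;
            } else {
                break;
            }
        }
    }
    if toks.get(pos) == Some(&")") {
        Some(pos + 1)
    } else {
        None
    }
}
```

A lone `name` consumes one token and `upper ( name )` consumes four, through the same entry point. Neither result records which shape was seen, in line with the grammar only validating that one of the two shapes holds.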

2. The fragment validates; it builds no AST

ADR-0026's WHERE grammar carries an AST-fragment builder (build_expr) that folds the matched terminals into a recursive Expr, because its consumers — update / delete / show data — are typed Commands whose executor compiles that Expr to parameterised SQL.

The SQL expression grammar deliberately builds no AST. This follows directly from ADR-0030 §4 and §6:

  • WHERE / HAVING / SELECT projections live inside a SELECT or a DML statement, and ADR-0030 §4 executes those "as the validated SQL itself … they change no schema, so modelling them as a typed Command buys nothing." There is no Expr to compile — the engine parses the SQL.
  • CHECK and DEFAULT live inside advanced-mode DDL. ADR-0030 §11 stores their expressions in project.yaml "as SQL the user could re-enter" — text, not a structured tree. ADR-0030 §4 is explicit that these expressions are "not lowered into the DSL's deliberately-limited Expr."

So no consumer of this grammar wants an Expr. The fragment's entire job is the other three walker outputs:

  1. Accept or reject — the input either is or is not a well-formed in-subset SQL expression.
  2. The flat MatchedPath of matched terminals — which is what drives syntax highlighting, completion, the expected-set, and the hint panel (§5).
  3. A source span. A consumer that needs the expression as text (the SELECT builder assembling Command::Select's SQL; a future CHECK builder) recovers it by slicing the original source between the first and last matched terminal's byte offsets. The terminals already carry span for highlighting; nothing new is needed on the matched path.

This is a real simplification over ADR-0026 — no build_expr analogue, no second structural pass, no expression AST type — and it is the correct shape for a grammar whose consumers run SQL rather than compile it. The grammar tier still owns validation, highlighting, completion, and the no-left-recursion guarantee; it simply has no tree to hand back.

Consequence for the SELECT builder (ADR-0030 / Phase 1). A command ast_builder today receives only &MatchedPath. The SELECT builder needs the original source to populate Command::Select's validated SQL text. The builder signature gains a source: &str parameter — a mechanical sweep across the ~21 existing CommandNode builders (most ignore it), of the same category as ADR-0030's noted match Command sweep. It is called out here because it is a direct consequence of the no-AST decision; the change itself belongs to the Phase 1 SELECT work, governed by ADR-0030.
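
A sketch of what the span slicing and the widened builder signature might look like in Rust. The struct fields, the AstBuilder alias, the Command::Other variant, and the function names are assumptions for illustration; only MatchedPath, Command::Select, and the source: &str parameter are named in this ADR.

```rust
/// Hypothetical stand-in: only the byte span of a matched terminal matters here.
pub struct Terminal {
    pub start: usize, // byte offset of the terminal's first byte
    pub end: usize,   // byte offset one past its last byte
}

/// Hypothetical stand-in for the flat path of matched terminals.
pub struct MatchedPath {
    pub terminals: Vec<Terminal>,
}

#[derive(Debug, PartialEq)]
pub enum Command {
    Select { sql: String },
    Other, // placeholder for the existing typed commands
}

/// Assumed post-sweep builder shape: every builder also receives the source.
pub type AstBuilder = fn(&MatchedPath, &str) -> Option<Command>;

/// Recover the matched text by slicing between the first and last terminal.
pub fn matched_text<'s>(path: &MatchedPath, source: &'s str) -> Option<&'s str> {
    let first = path.terminals.first()?;
    let last = path.terminals.last()?;
    // `get` returns None instead of panicking on a bad range.
    source.get(first.start..last.end)
}

/// Most existing builders would simply ignore the new parameter.
pub fn build_other(_path: &MatchedPath, _source: &str) -> Option<Command> {
    Some(Command::Other)
}

/// The SELECT builder keeps the validated text rather than building a tree.
pub fn build_select(path: &MatchedPath, source: &str) -> Option<Command> {
    matched_text(path, source).map(|sql| Command::Select { sql: sql.to_string() })
}
```

Both builders fit the same AstBuilder alias, which is what makes the sweep mechanical: the signature changes everywhere, the bodies change only where the text is needed.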

3. Recursion, and the depth cap

The grammar's recursion points are all token-guarded — each consumes at least one token before recursing, so the greedy top-down walker always makes progress:

  • not_expr := NOT not_expr — after NOT.
  • primary := ( or_expr ) — after (.
  • unary := ( - | + ) unary — after a sign.
  • call_args operands — after the call's (.
  • case_expr sub-parts — after CASE / WHEN / THEN / ELSE.
  • IN ( … ) operands — after IN (.

Every recursion is wired through Node::Subgrammar(&NAMED) referencing a named static tier, exactly as in expr.rs. The walker counts active Subgrammar frames in WalkContext::subgrammar_depth; this grammar reuses ADR-0026's MAX_SUBGRAMMAR_DEPTH = 64 cap and its friendly "expression nested too deeply" error — no new walker capability is required. The ladder descends a few Subgrammar frames per nesting level, so the effective hand-written nesting limit is comfortably past anything a learner types; the cap is purely a stack-overflow guard.

4. A separate fragment, parallel to the DSL grammar

The SQL expression grammar is authored in a new file, src/dsl/grammar/sql_expr.rs, parallel to expr.rs (which keeps the DSL WHERE grammar). They are deliberately not merged:

  • Different operand sets. The DSL operand is a column or a literal; the SQL operand is a full scalar expression.
  • Different output. expr.rs builds an Expr; sql_expr.rs builds nothing (§2).
  • Mode isolation. Simple mode must never gain arithmetic or functions — the limits in simple-mode-limitations.md are a teaching feature. A shared fragment risks leaking the SQL surface into the DSL grammar.
  • Regression containment. expr.rs is exercised by a large share of the 1240-test suite. A parallel file changes none of it.

The predicate-tail shapes (cmp_op / LIKE / BETWEEN / IN / IS NULL) look structurally identical between the two grammars, but each branch's operand sub-node differs (column-or-literal vs additive), so the static nodes cannot literally be shared. The design is shared — sql_expr.rs follows expr.rs's factoring (operand prefix matched once, infix NOT as an explicit branch, no Optional-first branch) — and that is the reuse that matters.

5. Ambient assistance comes for free

Because the fragment is grammar in the unified tree, the walker gives it — with no expression-specific assistance code — the same ambient assistance every DSL command gets (ADR-0030 §8, ADR-0022):

  • Syntax highlighting of SQL keywords, identifiers, literals, and operators, from the per-byte highlight classes the walk records.
  • Tab completion of SQL keywords (and, or, like, between, case, when, …) and of column names: the name_or_call identifier slot uses IdentSource::Columns, so it completes against the statement's table(s) from the same SchemaCache the DSL uses. Function names are not completed (there is no allowlist; see ADR-0030 §13 OOS-3); a typed function name simply is not a candidate.
  • Hint-panel prose at each grammar slot.
  • The [ERR] / [WRN] validity indicator (ADR-0027).
  • Per-command parse-error usage (ADR-0021).

The name_or_call identifier slot resolves to Columns because, at the moment the identifier is typed, the common case is a column reference and column completion is the helpful default; a function call is recognised a token later when ( follows. The grammar does not need to decide between the two (§2), so the slot can optimise for the common completion.

6. Errors and the unsupported surface

A construct outside this grammar — a window function's OVER clause, a CAST with :: syntax, an array literal — is an ordinary walker parse error, carrying the expected-set and routed through the friendly-error layer with engine-neutral wording (ADR-0030 §9, ADR-0019). There is no separate "valid SQL but unsupported" classifier — ADR-0030 §1 dropped the batch parser that would be needed for one.

Expression-level engine neutrality is best-effort, exactly as ADR-0030 §7 states: the grammar enforces the structural subset (operators, CASE, call syntax), but because there is no function allowlist, an engine-specific function the grammar admits and the engine then rejects surfaces an engine-neutral execution error rather than being caught at parse time. This is the accepted honest limitation; a function allowlist remains ADR-0030 §13 OOS-3.

7. Out of scope

  • OOS-1. Subquery expressions. A ( SELECT … ) as a primary, <op> ( SELECT … ), IN ( SELECT … ), and EXISTS ( SELECT … ) are part of the eventual surface (ADR-0030 §3) but cannot be realised until the SELECT grammar itself exists and is recursive — that is ADR-0030 Phase 2 ("SELECT — full"). This ADR's grammar is authored so that adding a subquery branch to primary (and an IN ( subquery ) / EXISTS form) is an additive change: a new Choice branch guarded by (/EXISTS, recursing through Subgrammar into the SELECT fragment. No restructuring is foreseen.
  • OOS-2. Qualified column references (table.column, alias.column). A single-table SELECT (ADR-0030 Phase 1) never needs them; they become meaningful with JOINs (Phase 2). name_or_call takes an unqualified identifier for now; a [ '.' identifier ] tail is an additive extension.
  • OOS-3. Quoted identifiers ("column name"). The DSL has no quoted-identifier syntax; introducing one is a cross-cutting lexer change, tracked separately.
  • OOS-4. A function allowlist — ADR-0030 §13 OOS-3, restated: function calls are admitted generically.
  • OOS-5. An expression AST. Explicitly not built (§2). If a future consumer genuinely needs structured expression data (none is foreseen — DDL CHECK/DEFAULT store text), that is a new decision, not a deferral.

Consequences

  • A new grammar file, src/dsl/grammar/sql_expr.rs, exporting a single pub static SQL_EXPRESSION: Node (a Subgrammar(&SQL_OR_EXPR)) that any SQL CommandNode drops into its Seq as one node — the same drop-in shape as expr::EXPRESSION.
  • No new walker capability. Subgrammar, the depth counter, the cap, and the friendly depth error are all reused from ADR-0026 unchanged.
  • No expression AST, no fragment builder — a deliberate simplification over ADR-0026 (§2).
  • expr.rs and the simple-mode WHERE surface are untouched; the 1240-test baseline is insulated by construction (§4).
  • The command ast_builder signature gains a source: &str parameter (§2) — a ~21-site mechanical sweep, executed as part of the Phase 1 SELECT work (ADR-0030), not here.
  • Subquery expressions and qualified column references are authored later as additive primary branches (§7) — the grammar is shaped to receive them.
  • The fragment is the shared dependency of every advanced-mode expression slot — WHERE, HAVING, SELECT projections, CHECK, DEFAULT — defined once.

Implementation notes

A build order, each step guarded by the test suite. Steps 1-5 are ADR-0030 Phase 1; the fragment is consumed first by the single-table SELECT's WHERE and projection slots.

  1. The grammar fragment: sql_expr.rs with the stratified tiers of §1 as named static Nodes, recursion via Subgrammar. No builder. pub static SQL_EXPRESSION.
  2. Unit tests walking representative inputs against the fragment directly (the expr.rs test pattern): every operator and precedence pair, CASE both forms, function calls including count(*) and count(distinct …), the full predicate set, parenthesised regrouping, the depth cap, and the keyword-case-insensitivity check.
  3. Wire it into the Phase 1 SELECT grammar — the WHERE slot and the projection items reference SQL_EXPRESSION (ADR-0030 Phase 1).
  4. Highlighting / completion / hint spot-checks — confirm the §5 assistance works through a SQL expression with no expression-specific code, via the typing-surface matrix.
  5. Engine-neutral error spot-checks for out-of-subset constructs (§6).

Later phases extend the same fragment:

  • ADR-0030 Phase 2 adds the subquery primary branches and qualified column references (OOS-1, OOS-2) once the recursive SELECT grammar exists, and exercises the fragment from HAVING.
  • ADR-0030 Phase 4 consumes the fragment from advanced-mode DDL CHECK and DEFAULT.

See also

  • ADR-0019 — the friendly-error layer SQL parse and execution errors route through (§6).
  • ADR-0021 — per-command parse-error usage, free for SQL (§5).
  • ADR-0022 — ambient typing assistance; §5 is its reach into the SQL expression.
  • ADR-0023 / ADR-0024 — the unified grammar tree this fragment is authored into.
  • ADR-0026 — the DSL WHERE expression grammar this is the superset of: the Subgrammar node, the stratified-grammar technique, the depth cap, and the predicate_tail factoring are all inherited from it.
  • ADR-0027 — the validity indicator, free for SQL (§5).
  • ADR-0030 — advanced mode's SQL surface; §3 commissions this ADR, §4/§6 are the source of the no-AST decision (§2), §7/§13 set the engine-neutrality posture and the no-allowlist rule.
  • docs/simple-mode-limitations.md — the DSL limits this grammar lifts for advanced mode (§1, §4).

Status note — known-function list layered on the slot (2026-05-30)

The sql_expr_ident slot is IdentSource::Columns and, per §1 / §5, does not itself know which identifiers are function names. It optimises for the common case (a column reference) and admits the function-call shape structurally; §5 explicitly noted "function names are not completed … a typed function name simply is not a candidate".

ADR-0022 Amendment 6 layers a curated known-function list (src/dsl/sql_functions.rs) on top of this slot. The list is consumed two ways:

  • As Tab-completion candidates, so a learner can discover sum / upper / … (issue #15). This softens §5's "not completed" line to "completed from a curated pedagogical list, not an allowlist for validation".
  • As the allow-list that lets the typing-time column-typo hint stay strict at this slot: a partial is flagged as "no such column" only when it matches neither a schema column nor a known function name (issue #16).

The grammar here is unchanged, and §6/§7's no-validation-allowlist posture stands: the list drives completion and the typo hint, not parse-time acceptance (an unknown function still parses and surfaces an engine-neutral execution error). The list sits in the completion / hint layer above the grammar.
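
The two uses of the list can be sketched in Rust as follows. This is an assumption-laden illustration: the four-entry list is a made-up excerpt (the real KNOWN_SQL_FUNCTIONS contents are not reproduced here), the function names and signatures are hypothetical, and prefix matching for the column check is a guess at the behaviour rather than a description of invalid_ident_at_cursor.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CandidateKind {
    Column,
    Function,
}

/// Made-up excerpt for illustration; not the project's actual list.
pub const KNOWN_SQL_FUNCTIONS: [&str; 4] = ["count", "lower", "sum", "upper"];

fn has_prefix(name: &str, partial: &str) -> bool {
    name.to_ascii_lowercase()
        .starts_with(&partial.to_ascii_lowercase())
}

/// Tab candidates at the slot: schema columns first, then known functions.
/// A column named like a function is offered once, as a column.
pub fn candidates(partial: &str, columns: &[&str]) -> Vec<(String, CandidateKind)> {
    let mut out: Vec<(String, CandidateKind)> = columns
        .iter()
        .filter(|c| has_prefix(c, partial))
        .map(|c| (c.to_string(), CandidateKind::Column))
        .collect();
    for f in KNOWN_SQL_FUNCTIONS.iter() {
        let shadowed = columns.iter().any(|c| c.eq_ignore_ascii_case(f));
        if has_prefix(f, partial) && !shadowed {
            out.push((f.to_string(), CandidateKind::Function));
        }
    }
    out
}

/// Typing-time hint: true when the partial should be flagged "no such column".
pub fn flags_unknown_column(partial: &str, columns: &[&str]) -> bool {
    if partial.is_empty() {
        return false;
    }
    // Bail out when the partial could still become a known function name.
    if KNOWN_SQL_FUNCTIONS.iter().any(|f| has_prefix(f, partial)) {
        return false;
    }
    // Otherwise fall through to the schema-column check.
    !columns.iter().any(|c| has_prefix(c, partial))
}
```

Neither function is consulted when parsing: an unknown function name is still accepted by the grammar, so the list affects only what is offered and what is hinted.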