Files

T

claude@clouddev1 a50c6cdf70 WHERE expressions: matrix cells + predicate_tail grammar fix (ADR-0026 step 6)

Adds tests/typing_surface/where_expression.rs — 9 matrix
cells for the complex WHERE / show-data limit typing surface:
operator candidates after an operand, AND / OR after a
predicate, NOT, BETWEEN / IN bounds, and `show data`
where / limit.

Writing the cells surfaced a grammar bug. `predicate_tail`'s
`[NOT] negatable` branch started with `Optional(not)`, and an
Optional-first `Seq` always "commits" — so on an incomplete
input the walker's `Choice` returned that branch's
`Incomplete` early and discarded every sibling branch's
expected set, dropping `is` and the comparison operators from
completion after a column. Fixed by splitting it into
explicit `NOT negatable` and bare `negatable` branches — no
`predicate_tail` branch starts with an `Optional` now. The
matched terminal sequence is unchanged, so `build_expr` is
untouched.

Docs: ADR-0026 gains an "As-built notes" section recording
the option-1 builder realization, its two deviations from the
§3 sketch, and the deferral of §7 diagnostic flagging to
ADR-0027. requirements.md C5a -> [x] (steps 1-4) with the
test baseline refreshed to 1079; CLAUDE.md's deferred list
reconciled (C5a implemented; the QA1/QA2 note now points at
ADR-0028).

2026-05-18 23:19:53 +00:00

20 KiB

Raw Blame History

ADR-0026: Complex WHERE expressions

Status

Accepted

Context

The requirements checklist commits, in C5a, to complex WHERE expressions — AND / OR, comparison operators, LIKE — for the update, delete, and show data row filters. It is described there as the bridge from DSL fluency toward real SQL. Today the DSL is well short of that:

The only filter the DSL parses is a single where <col> = <val> equality. There is no AND / OR, no operator other than =, no LIKE, no IS NULL, no parentheses.
update and delete carry that filter as RowFilter::Where { column, value }. show data <T> carries no filter at all — it always selects every row.
The Value AST is purely syntactic (Number, Text, Bool, Null); per-column type handling happens at bind time in db.rs.

Three things make this the right moment, and shape the decision:

QA1 (EXPLAIN QUERY PLAN) needs a filtered query. An unfiltered SELECT * FROM t always plans as a full scan; an index can never appear. QA1's pedagogical payoff depends on a WHERE whose plan flips between a scan and an index search.
CHECK constraints (C3) need an expression grammar. A CHECK (Age >= 0 AND Age < 150) is the same expression problem. A throwaway mini-grammar for CHECK plus a second one for WHERE would be waste; this ADR builds the grammar once.
The grammar architecture must grow to host it. The unified walker grammar (ADR-0023 / ADR-0024) is a non-recursive trie of &'static Nodes, and its parse output (MatchedPath) is a flat list of matched terminals. A WHERE expression is recursive and its shape is data-dependent — neither fits as the grammar stands.

The architectural problem, precisely

A boolean expression — a = 1 AND (b > 2 OR c LIKE 'x%') — is recursive (a parenthesised group is itself an expression) and carries operator precedence. Two facts about the current walker:

The Node tree is acyclic. Every combinator references its children through &'static [Node] / &'static Node, and the registry lives in consts. A const cannot refer to itself, so no node can close a cycle. Optional and Repeated already hold a &'static Node reference — recursion through a reference is expressible if the fragment is a named static — but Seq and Choice embed their children by value in a slice, and a cyclic value has no finite representation.
MatchedPath is flat. It is a Vec<MatchedItem> of matched terminals in source order; the combinators shape the order but record no grouping. For every command today that is enough: each command's shape is a fixed template, so the AST builder reads terminals by position or by role. A recursive expression has no fixed template — where a = 1 and where (a=1 or b=2) and c=3 have different shapes — so a flat terminal list cannot be rebuilt into the expression tree without parsing it a second time.

Left recursion is a third fact, and it is a property of parsing technique rather than of this codebase: a top-down walker cannot consume a rule whose leftmost symbol is the rule itself (expr := expr OP expr recurses without consuming input). The standard remedy — a stratified, left-factored grammar — is adopted below.

Decision

1. The expression grammar

The WHERE expression is a stratified grammar — one layer per precedence tier. Stratification removes left recursion (every recursion is guarded by a token) and encodes operator precedence in the layering, so there is no separate precedence-resolution step.

or_expr      := and_expr ( OR and_expr )*
and_expr     := not_expr ( AND not_expr )*
not_expr     := NOT not_expr | bool_primary
bool_primary := ( or_expr ) | predicate
predicate    := operand cmp_op operand
              | operand [ NOT ] LIKE operand
              | operand [ NOT ] BETWEEN operand AND operand
              | operand [ NOT ] IN ( operand [ , operand ]* )
              | operand IS [ NOT ] NULL
operand      := literal | column_ref
cmp_op       := =  |  !=  |  <>  |  <  |  <=  |  >  |  >=

Operator set: the six comparisons, with both != and <> accepted (<> is standard SQL, != the common variant — the engine accepts both); AND / OR / NOT; parentheses; LIKE with % / _ wildcards; IS NULL / IS NOT NULL; IN; BETWEEN. LIKE, IN, and BETWEEN take an optional infix NOT, mirroring IS NOT NULL.
Operands are a column reference or a literal — not a nested expression. Parentheses group boolean sub-expressions (bool_primary), not comparison operands. A bare column reference is not a boolean expression: a predicate always has an operator (write Active = true, not Active).
The only recursion is ( or_expr ) and NOT not_expr; each consumes a token (( or NOT) before recursing, so the greedy top-down walker always makes progress.
Nesting depth is capped at 64. Hand-written WHERE clauses do not approach this; the cap exists only so pathological input (((((…))))) yields a friendly "expression nested too deeply (limit 64)" error rather than a stack overflow.

The grammar is deliberately a subset of standard SQL's WHERE syntax, so a learner's knowledge transfers directly when advanced-mode SQL (Q1) lands.

2. Grammar architecture: a reference-following node

Seq / Choice embed children by value and cannot hold a cyclic node. One new Node variant closes the gap:

/// Walks the referenced node once, mandatory. Because the
/// reference is a `&'static Node`, a named `static`
/// fragment may appear inside its own subtree — the
/// mechanism that lets the expression grammar recurse.
/// (ADR-0023 sketched this as `SubgrammarRef`.)
Subgrammar(&'static Node),

Subgrammar is the static counterpart of the existing DynamicSubgrammar (a walk-time factory). The expression grammar's tiers are declared as named static items; bool_primary's ( or_expr ) branch reaches or_expr through Subgrammar(&OR_EXPR), and not_expr reaches itself the same way. The walker gains one match arm — walk the referenced node once — plus the depth counter for the §1 cap.

The expression grammar is one fragment, referenced by update, delete, and show data alike — defined once.

3. The expression result — built selectively

MatchedPath — the walker's flat list of matched terminals — is left unchanged. The recursive structure lives only where it is needed: inside the expression.

The expression grammar fragment carries its own AST-fragment builder. As the walker recurses through the stratified tiers, that builder runs — the walker's recursion is the precedence-correct tree, so the builder assembles a nested Expr directly, with no second parse and no separate precedence pass. The finished Expr is carried as a single item in the otherwise-unchanged flat MatchedPath (MatchedKind gains one variant to hold a built expression).

Every existing command builder is therefore genuinely untouched — the flat path it reads is exactly as before. update / delete / show data take that one expression item and read its Expr.

This is the "selectively if necessary" option: the parser gains structured output exactly where the grammar is recursive, and nowhere else. A system-wide hierarchical MatchedPath was considered and rejected — it would record group structure for every command while only the expression consumed it, leaving the non-expression grouping computed but unread, and so untested. The general "a grammar fragment may carry a builder" mechanism introduced here is exercised by its one user; nothing is recorded that nothing reads.

4. The `Expr` AST

A new recursive expression AST joins the command AST:

pub enum Expr {
    Or(Vec<Expr>),
    And(Vec<Expr>),
    Not(Box<Expr>),
    Predicate(Predicate),
}

pub enum Predicate {
    Compare { left: Operand, op: CompareOp, right: Operand },
    Like    { target: Operand, pattern: Operand, negated: bool },
    Between { target: Operand, low: Operand, high: Operand, negated: bool },
    In      { target: Operand, items: Vec<Operand>, negated: bool },
    IsNull  { target: Operand, negated: bool },
}

pub enum Operand   { Column(String), Literal(Value) }
pub enum CompareOp { Eq, NotEq, Lt, LtEq, Gt, GtEq }

Or / And are n-ary — a flat a AND b AND c is one And of three. Single-child tiers collapse: a predicate reached through the or → and → not layers with no connective is just that Predicate, not three wrappers.

RowFilter changes from

RowFilter::Where { column: String, value: Value }

RowFilter::Where(Expr)

for update / delete. show data carries filter: Option<Expr> and limit: Option<u64>.

5. The commands

update <T> set <assignments> ( where <expr> | --all-rows )
delete from <T>               ( where <expr> | --all-rows )
show data <T> [ where <expr> ] [ limit <n> ]

update / delete keep ADR-0014's mandatory where-or---all-rows choice; a complex expression satisfies the where side.
show data gains an optional where. Reading every row stays the safe default for a read, so no --all-rows opt-in is needed there — the clause is simply optional.
show data gains an optional limit <n> (<n> a non-negative integer). When limit is present the query is implicitly ordered by the table's primary key, so limit 20 is a stable "first 20 by primary key" rather than an arbitrary subset — every table created through the DSL has a primary key. Explicit order by is out of scope (§10).

6. SQL generation

The Expr is compiled to a parameterised SQL WHERE string:

Every literal becomes a ? placeholder bound as a parameter — never spliced into the SQL text. Identifiers are quote_ident-quoted.
A literal compared against a column is converted to that column's storage representation through the existing bind_for_column path, exactly as the current where col = val does.
Connectives, NOT, and parentheses are emitted from the tree structure.
limit emits LIMIT ? with the bound count, plus the implicit ORDER BY over the primary-key column(s) (§5).

The application never evaluates the expression itself — the database does, and re-derives precedence from the operators. The expression is "passed through" only in that sense; the raw user text is never forwarded.

7. Type handling — permissive and advisory

A type mismatch in a comparison is flagged, not blocked. This matches the app's ambient-assistance posture (ADR-0022): the tool indicates problems, it does not refuse input.

A literal in column OP literal is type-checked against the column. When the types are compatible the literal is converted and bound per the column's type (§6).
When they are not compatible — Name > 5 on a text column, Age LIKE '5%' on an int column — the mismatch is surfaced through the existing highlight and hint channels as an error-class annotation, but the command still parses, still submits, and still runs. The literal binds by its own syntactic type and the database's comparison rules take over — which is precisely the behaviour a learner is experimenting to observe.
This relaxes current behaviour. Today bind_for_column rejects a type-mismatched WHERE literal; under this ADR it does not. The relaxation is scoped to WHERE comparisons. Writes (insert, update … set) stay strict: STRICT storage genuinely cannot hold a mistyped value, so a mistyped write is a real error, not an experiment.
= NULL / != NULL is a specific flagged case. It is valid syntax that almost never does what the user intends (in SQL it is never true). The walker special-cases a comparison whose operator is = / != and whose operand is the NULL literal: it is highlighted as an error, and the hint points at IS NULL / IS NOT NULL. As with type mismatches it still runs if submitted — a learner who wants to see what x = NULL does may.

Always-on submit-time signalling of flagged-but-runnable input (an (INVALID) / WARNING marker at the input field's edge) is a separate, general concern — see §10.

8. Completion, hints, highlighting

Because the expression is parsed in-grammar — not handed to an opaque sub-parser — the ambient-assistance machinery (ADR-0022) works inside an expression with no separate implementation:

column-name completion resolves against the command's table;
value positions carry per-type hints, as they do for the current where col = val;
operator keywords (and, or, like, between, …) surface as completion candidates where the grammar allows them;
syntax highlighting walks the same tree.

Type-mismatch and = NULL flagging (§7) surface through these same highlight and hint channels.

9. Errors

Parse errors continue to route through the existing ParseError shape, so ADR-0021's per-command usage help and the hint panel keep working for the new clauses. The depth-cap breach (§1) is a friendly error of the same kind.

10. Out of scope

ORDER BY. limit uses implicit primary-key ordering for determinism; explicit order by is a clean future addition, tracked separately.
LIMIT … OFFSET, and limit anywhere other than show data.
Operands beyond a column or a literal — arithmetic (a + b), string concatenation, scalar functions, subqueries, EXISTS. The playground's WHERE compares columns and literals.
A bare column as a boolean (where Active).
The input-field validity indicator. An always- visible (INVALID) / WARNING marker at the edge of the input field — signalling, before submit, that the current input would error or is flagged — is a general feature spanning every command, not just WHERE. It gets its own small ADR; this ADR defines only what inside a WHERE is flagged. The indicator is to carry two severities: a hard ERROR / INVALID (the input cannot run) and a softer WARNING (it runs but is probably not intended — type mismatches, = NULL).
CHECK constraints. The constraints ADR (C3) will reuse this expression grammar; it is not built here.

Consequences

C5a is satisfied; show data gains filtering and limit (advancing C5 / V5); QA1 is unblocked; the future CHECK constraint has an expression grammar to reuse.
The grammar gains one node variant (Subgrammar) and a recursion-depth counter.
MatchedPath is unchanged; the expression fragment carries an AST-builder that produces an Expr, carried as a single matched item. Existing command builders are untouched.
A new recursive Expr AST joins the command AST; RowFilter changes from Where { column, value } to Where(Expr).
Type-mismatched WHERE comparisons change from refused to flagged but runnable — a deliberate, scoped behaviour change (§7).
Old history.log lines using the previous where col = val form remain valid — that form is a strict subset of the new grammar — so replay is unaffected. (Not a design driver; noted.)
Forward-look toward Q1: advanced-mode SQL will be parsed by sqlparser-rs, a separate parser, so this Expr AST is not literally shared with it. The value of the DSL expression being a SQL subset is pedagogical — learner knowledge transfers — not code reuse.

Implementation notes

A sensible build order, each step guarded by the test suite and the typing-surface matrix:

The Subgrammar node and the recursion-depth counter — the walker capability for a recursive fragment. MatchedPath is unchanged. No user-visible change.
The expression grammar fragment, the Expr AST, and the fragment-builder the walker invokes to produce the Expr.
Wire the fragment into update / delete (replacing the old where) and into show data (new where, new limit).
Expr → parameterised SQL generation; the implicit primary-key ORDER BY for limit.
Schema-aware type-mismatch and = NULL flagging in the walker.
Typing-surface matrix cells for the new surface.

As-built notes (2026-05-18)

Steps 1–4 are implemented and committed; step 5 (the §7 diagnostic flagging) is deferred — see below. Realization choices, and where they deviate from the design sketch above:

§3 builder — option 1 ("reconstruct in builder"), chosen with the project owner before implementation. The stratified grammar is walked normally; its terminals flow into the flat MatchedPath unchanged (driving highlight / completion / the expected-set). grammar::expr::build_expr then folds that flat terminal slice into the Expr — a deterministic recursive descent mirroring the grammar tiers, run only at submit-time dispatch, never per keystroke. Two honest deviations from the §3 wording:
- No MatchedKind::Expr variant. MatchedPath stays purely terminals — arguably more faithful to "MatchedPath stays flat" than carrying a built Expr in it. The Expr is assembled in the command ast_builders (build_update / build_delete / build_show), which already reconstruct structured Commands from the flat path; build_expr is the same pattern, one tier deeper.
- There is a second structural pass over the expression tokens, scoped to submit-time dispatch. "No second parse" is read as "no separate parser framework": the walk validates and drives assistance, build_expr is the single ast_builder for the fragment — the same category as build_insert.
Grammar shape. predicate is factored as operand predicate_tail (shared operand prefix), and the infix NOT is factored in front of the LIKE / BETWEEN / IN choice — so the walker's first-commit-wins Choice semantics discriminate branches on a cleanly-failing first token.
Subgrammar depth. MAX_SUBGRAMMAR_DEPTH = 64 counts active Subgrammar recursion frames. The stratified grammar descends ~4–5 frames per parenthesis level, so the effective parenthesis-nesting limit is roughly a dozen — far past any hand-written filter; the cap is purely a stack-overflow guard.
§8 hints. The expression's right-hand operands resolve through a schema-aware DynamicSubgrammar (where_rhs_ operand) so the hint panel narrows to the compared column's type, exactly as the pre-ADR where col = val slot did. The operand grammar carries no validators — permissive per §7.
Step 5 deferred to ADR-0027. The §7 behaviour relaxation is done: bind_where_literal binds a type-mismatched WHERE literal by its syntactic shape, and the pre-ADR bind-time rejection is gone. The §7 diagnostic flagging — surfacing a type-mismatched comparison or = NULL as an error-class highlight + hint — is the seam with ADR-0027, which designs the walker diagnostics-severity model that flagging belongs in (and whose WARNING severity is defined to have "no triggers until ADR-0026 is implemented"). Building the flagging as a standalone mechanism first would be reworked by ADR-0027; the recommendation is to implement it as the first triggers of ADR-0027's model.

20 KiB Raw Blame History Unescape Escape