rdbms-playground/docs/adr/0026-complex-where-expressions.md

# ADR-0026: Complex WHERE expressions

## Status

Accepted

## Context

The requirements checklist commits, in `C5a`, to *complex
WHERE expressions* — `AND` / `OR`, comparison operators,
`LIKE` — for the `update`, `delete`, and `show data` row
filters. It is described there as the bridge from DSL
fluency toward real SQL. Today the DSL is well short of
that:

- The only filter the DSL parses is a single
  `where <col> = <val>` equality. There is no `AND` / `OR`,
  no operator other than `=`, no `LIKE`, no `IS NULL`, no
  parentheses.
- `update` and `delete` carry that filter as
  `RowFilter::Where { column, value }`. `show data <T>`
  carries **no filter at all** — it always selects every
  row.
- The `Value` AST is purely syntactic (`Number`, `Text`,
  `Bool`, `Null`); per-column type handling happens at
  bind time in `db.rs`.

Three things make this the right moment, and shape the
decision:

1. **`QA1` (`EXPLAIN QUERY PLAN`) needs a filtered query.**
   An unfiltered `SELECT * FROM t` always plans as a full
   scan; an index can never appear. QA1's pedagogical
   payoff depends on a `WHERE` whose plan flips between a
   scan and an index search.
2. **`CHECK` constraints (`C3`) need an expression
   grammar.** A `CHECK (Age >= 0 AND Age < 150)` is the
   same expression problem. A throwaway mini-grammar for
   `CHECK` plus a second one for `WHERE` would be waste;
   this ADR builds the grammar once.
3. **The grammar architecture must grow to host it.** The
   unified walker grammar (ADR-0023 / ADR-0024) is a
   non-recursive trie of `&'static` `Node`s, and its parse
   output (`MatchedPath`) is a flat list of matched
   terminals. A `WHERE` expression is recursive and its
   shape is data-dependent — neither fits as the grammar
   stands.

### The architectural problem, precisely

A boolean expression — `a = 1 AND (b > 2 OR c LIKE 'x%')`
— is recursive (a parenthesised group is itself an
expression) and carries operator precedence. Two facts
about the current walker:

- **The `Node` tree is acyclic.** Every combinator
  references its children through `&'static [Node]` /
  `&'static Node`, and the registry lives in `const`s. A
  `const` cannot refer to itself, so no node can close a
  cycle. `Optional` and `Repeated` already hold a
  `&'static Node` *reference* — recursion through a
  reference is expressible if the fragment is a named
  `static` — but `Seq` and `Choice` embed their children
  *by value* in a slice, and a cyclic value has no finite
  representation.
- **`MatchedPath` is flat.** It is a `Vec<MatchedItem>` of
  matched *terminals* in source order; the combinators
  shape the order but record no grouping. For every
  command today that is enough: each command's shape is a
  fixed template, so the AST builder reads terminals by
  position or by role. A recursive expression has no fixed
  template — `where a = 1` and
  `where (a=1 or b=2) and c=3` have different shapes — so a
  flat terminal list cannot be rebuilt into the expression
  tree without parsing it a second time.

Left recursion is a third fact, and it is a property of
parsing technique rather than of this codebase: a
top-down walker cannot consume a rule whose leftmost
symbol is the rule itself (`expr := expr OP expr` recurses
without consuming input). The standard remedy — a
stratified, left-factored grammar — is adopted below.

## Decision

### 1. The expression grammar

The `WHERE` expression is a stratified grammar — one layer
per precedence tier. Stratification removes left recursion
(every recursion is guarded by a token) and encodes
operator precedence in the layering, so there is no
separate precedence-resolution step.

```
or_expr      := and_expr ( OR and_expr )*
and_expr     := not_expr ( AND not_expr )*
not_expr     := NOT not_expr | bool_primary
bool_primary := ( or_expr ) | predicate
predicate    := operand cmp_op operand
              | operand [ NOT ] LIKE operand
              | operand [ NOT ] BETWEEN operand AND operand
              | operand [ NOT ] IN ( operand [ , operand ]* )
              | operand IS [ NOT ] NULL
operand      := literal | column_ref
cmp_op       := =  |  !=  |  <>  |  <  |  <=  |  >  |  >=
```

- **Operator set:** the six comparisons, with both `!=`
  and `<>` accepted (`<>` is standard SQL, `!=` the common
  variant — the engine accepts both); `AND` / `OR` / `NOT`;
  parentheses; `LIKE` with `%` / `_` wildcards;
  `IS NULL` / `IS NOT NULL`; `IN`; `BETWEEN`. `LIKE`,
  `IN`, and `BETWEEN` take an optional infix `NOT`,
  mirroring `IS NOT NULL`.
- **Operands are a column reference or a literal** — not a
  nested expression. Parentheses group *boolean*
  sub-expressions (`bool_primary`), not comparison
  operands. A bare column reference is not a boolean
  expression: a predicate always has an operator (write
  `Active = true`, not `Active`).
- The only recursion is `( or_expr )` and `NOT not_expr`;
  each consumes a token (`(` or `NOT`) before recursing,
  so the greedy top-down walker always makes progress.
- **Nesting depth is capped at 64.** Hand-written `WHERE`
  clauses do not approach this; the cap exists only so
  pathological input (`((((…))))`) yields a friendly
  *"expression nested too deeply (limit 64)"* error rather
  than a stack overflow.

The grammar is deliberately a subset of standard SQL's
`WHERE` syntax, so a learner's knowledge transfers
directly when advanced-mode SQL (`Q1`) lands.

### 2. Grammar architecture: a reference-following node

`Seq` / `Choice` embed children by value and cannot hold a
cyclic node. One new `Node` variant closes the gap:

```rust
/// Walks the referenced node once, mandatory. Because the
/// reference is a `&'static Node`, a named `static`
/// fragment may appear inside its own subtree — the
/// mechanism that lets the expression grammar recurse.
/// (ADR-0023 sketched this as `SubgrammarRef`.)
Subgrammar(&'static Node),
```

`Subgrammar` is the static counterpart of the existing
`DynamicSubgrammar` (a walk-time factory). The expression
grammar's tiers are declared as named `static` items;
`bool_primary`'s `( or_expr )` branch reaches `or_expr`
through `Subgrammar(&OR_EXPR)`, and `not_expr` reaches
itself the same way. The walker gains one match arm — walk
the referenced node once — plus the depth counter for the
§1 cap.

The expression grammar is one fragment, referenced by
`update`, `delete`, and `show data` alike — defined once.

### 3. The expression result — built selectively

`MatchedPath` — the walker's flat list of matched
terminals — is left unchanged. The recursive structure
lives only where it is needed: inside the expression.

The expression grammar fragment carries its own
*AST-fragment builder*. As the walker recurses through the
stratified tiers, that builder runs — the walker's
recursion *is* the precedence-correct tree, so the builder
assembles a nested `Expr` directly, with no second parse
and no separate precedence pass. The finished `Expr` is
carried as a single item in the otherwise-unchanged flat
`MatchedPath` (`MatchedKind` gains one variant to hold a
built expression).

Every existing command builder is therefore genuinely
untouched — the flat path it reads is exactly as before.
`update` / `delete` / `show data` take that one expression
item and read its `Expr`.

This is the "selectively if necessary" option: the parser
gains structured output exactly where the grammar is
recursive, and nowhere else. A *system-wide* hierarchical
`MatchedPath` was considered and rejected — it would
record group structure for every command while only the
expression consumed it, leaving the non-expression
grouping computed but unread, and so untested. The general
"a grammar fragment may carry a builder" mechanism
introduced here is exercised by its one user; nothing is
recorded that nothing reads.

### 4. The `Expr` AST

A new recursive expression AST joins the command AST:

```rust
pub enum Expr {
    Or(Vec<Expr>),
    And(Vec<Expr>),
    Not(Box<Expr>),
    Predicate(Predicate),
}

pub enum Predicate {
    Compare { left: Operand, op: CompareOp, right: Operand },
    Like    { target: Operand, pattern: Operand, negated: bool },
    Between { target: Operand, low: Operand, high: Operand, negated: bool },
    In      { target: Operand, items: Vec<Operand>, negated: bool },
    IsNull  { target: Operand, negated: bool },
}

pub enum Operand   { Column(String), Literal(Value) }
pub enum CompareOp { Eq, NotEq, Lt, LtEq, Gt, GtEq }
```

`Or` / `And` are n-ary — a flat `a AND b AND c` is one
`And` of three. Single-child tiers collapse: a `predicate`
reached through the `or → and → not` layers with no
connective is just that `Predicate`, not three wrappers.

`RowFilter` changes from

```rust
RowFilter::Where { column: String, value: Value }
```

to

```rust
RowFilter::Where(Expr)
```

for `update` / `delete`. `show data` carries
`filter: Option<Expr>` and `limit: Option<u64>`.

### 5. The commands

```
update <T> set <assignments> ( where <expr> | --all-rows )
delete from <T>               ( where <expr> | --all-rows )
show data <T> [ where <expr> ] [ limit <n> ]
```

- `update` / `delete` keep ADR-0014's mandatory
  where-or-`--all-rows` choice; a complex expression
  satisfies the `where` side.
- **`show data` gains an optional `where`.** Reading every
  row stays the safe default for a read, so no `--all-rows`
  opt-in is needed there — the clause is simply optional.
- **`show data` gains an optional `limit <n>`** (`<n>` a
  non-negative integer). When `limit` is present the query
  is implicitly ordered by the table's primary key, so
  `limit 20` is a stable "first 20 by primary key" rather
  than an arbitrary subset — every table created through
  the DSL has a primary key. Explicit `order by` is out of
  scope (§10).

### 6. SQL generation

The `Expr` is compiled to a parameterised SQL `WHERE`
string:

- Every literal becomes a `?` placeholder bound as a
  parameter — never spliced into the SQL text.
  Identifiers are `quote_ident`-quoted.
- A literal compared against a column is converted to that
  column's storage representation through the existing
  `bind_for_column` path, exactly as the current
  `where col = val` does.
- Connectives, `NOT`, and parentheses are emitted from the
  tree structure.
- `limit` emits `LIMIT ?` with the bound count, plus the
  implicit `ORDER BY` over the primary-key column(s) (§5).

The application never evaluates the expression itself —
the database does, and re-derives precedence from the
operators. The expression is "passed through" only in
that sense; the raw user text is never forwarded.

### 7. Type handling — permissive and advisory

A type mismatch in a comparison is **flagged, not
blocked**. This matches the app's ambient-assistance
posture (ADR-0022): the tool indicates problems, it does
not refuse input.

- A literal in `column OP literal` is type-checked against
  the column. When the types are compatible the literal is
  converted and bound per the column's type (§6).
- When they are **not** compatible — `Name > 5` on a text
  column, `Age LIKE '5%'` on an int column — the mismatch
  is surfaced through the existing highlight and hint
  channels as an error-class annotation, but the command
  still parses, still submits, and still runs. The literal
  binds by its own syntactic type and the database's
  comparison rules take over — which is precisely the
  behaviour a learner is experimenting to observe.
- **This relaxes current behaviour.** Today
  `bind_for_column` *rejects* a type-mismatched `WHERE`
  literal; under this ADR it does not. The relaxation is
  scoped to `WHERE` comparisons. Writes (`insert`,
  `update … set`) stay strict: STRICT storage genuinely
  cannot hold a mistyped value, so a mistyped write is a
  real error, not an experiment.
- **`= NULL` / `!= NULL`** is a specific flagged case. It
  is valid syntax that almost never does what the user
  intends (in SQL it is never true). The walker
  special-cases a comparison whose operator is `=` / `!=`
  and whose operand is the `NULL` literal: it is
  highlighted as an error, and the hint points at
  `IS NULL` / `IS NOT NULL`. As with type mismatches it
  still runs if submitted — a learner who wants to see
  what `x = NULL` does may.

Always-on submit-time signalling of flagged-but-runnable
input (an `(INVALID)` / `WARNING` marker at the input
field's edge) is a separate, general concern — see §10.

### 8. Completion, hints, highlighting

Because the expression is parsed *in-grammar* — not handed
to an opaque sub-parser — the ambient-assistance machinery
(ADR-0022) works inside an expression with no separate
implementation:

- column-name completion resolves against the command's
  table;
- value positions carry per-type hints, as they do for the
  current `where col = val`;
- operator keywords (`and`, `or`, `like`, `between`, …)
  surface as completion candidates where the grammar
  allows them;
- syntax highlighting walks the same tree.

Type-mismatch and `= NULL` flagging (§7) surface through
these same highlight and hint channels.

### 9. Errors

Parse errors continue to route through the existing
`ParseError` shape, so ADR-0021's per-command usage help
and the hint panel keep working for the new clauses. The
depth-cap breach (§1) is a friendly error of the same
kind.

### 10. Out of scope

- **`ORDER BY`.** `limit` uses implicit primary-key
  ordering for determinism; explicit `order by` is a clean
  future addition, tracked separately.
- **`LIMIT … OFFSET`**, and `limit` anywhere other than
  `show data`.
- **Operands beyond a column or a literal** — arithmetic
  (`a + b`), string concatenation, scalar functions,
  subqueries, `EXISTS`. The playground's `WHERE` compares
  columns and literals.
- **A bare column as a boolean** (`where Active`).
- **The input-field validity indicator.** An always-
  visible `(INVALID)` / `WARNING` marker at the edge of
  the input field — signalling, before submit, that the
  current input would error or is flagged — is a general
  feature spanning every command, not just `WHERE`. It
  gets its own small ADR; this ADR defines only *what*
  inside a `WHERE` is flagged. The indicator is to carry
  two severities: a hard `ERROR` / `INVALID` (the input
  cannot run) and a softer `WARNING` (it runs but is
  probably not intended — type mismatches, `= NULL`).
- **`CHECK` constraints.** The constraints ADR (`C3`) will
  reuse this expression grammar; it is not built here.

## Consequences

- `C5a` is satisfied; `show data` gains filtering and
  `limit` (advancing `C5` / `V5`); `QA1` is unblocked; the
  future `CHECK` constraint has an expression grammar to
  reuse.
- The grammar gains one node variant (`Subgrammar`) and a
  recursion-depth counter.
- `MatchedPath` is unchanged; the expression fragment
  carries an AST-builder that produces an `Expr`, carried
  as a single matched item. Existing command builders are
  untouched.
- A new recursive `Expr` AST joins the command AST;
  `RowFilter` changes from `Where { column, value }` to
  `Where(Expr)`.
- Type-mismatched `WHERE` comparisons change from
  *refused* to *flagged but runnable* — a deliberate,
  scoped behaviour change (§7).
- Old `history.log` lines using the previous
  `where col = val` form remain valid — that form is a
  strict subset of the new grammar — so `replay` is
  unaffected. (Not a design driver; noted.)
- Forward-look toward `Q1`: advanced-mode SQL will be
  parsed by `sqlparser-rs`, a separate parser, so this
  `Expr` AST is not literally shared with it. The value of
  the DSL expression being a SQL subset is pedagogical —
  learner knowledge transfers — not code reuse.

## Implementation notes

A sensible build order, each step guarded by the test
suite and the typing-surface matrix:

1. The `Subgrammar` node and the recursion-depth counter
   — the walker capability for a recursive fragment.
   `MatchedPath` is unchanged. No user-visible change.
2. The expression grammar fragment, the `Expr` AST, and
   the fragment-builder the walker invokes to produce the
   `Expr`.
3. Wire the fragment into `update` / `delete` (replacing
   the old `where`) and into `show data` (new `where`,
   new `limit`).
4. `Expr` → parameterised SQL generation; the implicit
   primary-key `ORDER BY` for `limit`.
5. Schema-aware type-mismatch and `= NULL` flagging in the
   walker.
6. Typing-surface matrix cells for the new surface.

### As-built notes (2026-05-18)

Steps 1–4 are implemented and committed; step 5 (the §7
diagnostic flagging) is deferred — see below. Realization
choices, and where they deviate from the design sketch above:

- **§3 builder — option 1 ("reconstruct in builder"),**
  chosen with the project owner before implementation. The
  stratified grammar is walked normally; its terminals flow
  into the flat `MatchedPath` unchanged (driving highlight /
  completion / the expected-set). `grammar::expr::build_expr`
  then folds that flat terminal slice into the `Expr` — a
  deterministic recursive descent mirroring the grammar
  tiers, run only at submit-time dispatch, never per
  keystroke. Two honest deviations from the §3 wording:
  - **No `MatchedKind::Expr` variant.** `MatchedPath` stays
    purely terminals — arguably more faithful to "MatchedPath
    stays flat" than carrying a built `Expr` in it. The
    `Expr` is assembled in the command `ast_builder`s
    (`build_update` / `build_delete` / `build_show`), which
    already reconstruct structured `Command`s from the flat
    path; `build_expr` is the same pattern, one tier deeper.
  - **There is a second structural pass** over the
    expression tokens, scoped to submit-time dispatch. "No
    second parse" is read as "no separate parser framework":
    the walk validates and drives assistance, `build_expr`
    is the single `ast_builder` for the fragment — the same
    category as `build_insert`.
- **Grammar shape.** `predicate` is factored as `operand
  predicate_tail` (shared operand prefix), and the infix
  `NOT` is factored in front of the `LIKE` / `BETWEEN` /
  `IN` choice — so the walker's first-commit-wins `Choice`
  semantics discriminate branches on a cleanly-failing first
  token.
- **`Subgrammar` depth.** `MAX_SUBGRAMMAR_DEPTH = 64` counts
  active `Subgrammar` recursion frames. The stratified
  grammar descends ~4–5 frames per parenthesis level, so the
  effective parenthesis-nesting limit is roughly a dozen —
  far past any hand-written filter; the cap is purely a
  stack-overflow guard.
- **§8 hints.** The expression's right-hand operands resolve
  through a schema-aware `DynamicSubgrammar` (`where_rhs_
  operand`) so the hint panel narrows to the compared
  column's type, exactly as the pre-ADR `where col = val`
  slot did. The operand grammar carries no validators —
  permissive per §7.
- **Step 5 — done, as part of ADR-0027.** The §7
  *behaviour* relaxation landed with steps 1–4:
  `bind_where_literal` binds a type-mismatched WHERE literal
  by its syntactic shape, and the pre-ADR bind-time
  rejection is gone. The §7 *diagnostic flagging* — a
  type-mismatched comparison or `= NULL` as a flagged
  finding — was folded into ADR-0027's walker
  diagnostics-severity model rather than built as a
  standalone mechanism (ADR-0027's WARNING severity is
  defined to have "no triggers until ADR-0026 is
  implemented" — those triggers are exactly these). It
  surfaces as a `Severity::Warning` `Diagnostic`, computed
  post-walk from the built `Expr` against the table's
  column types — see ADR-0027.

## See also

- ADR-0009 — DSL command syntax conventions (`--` flags,
  keyword clauses).
- ADR-0014 — data operations, the `Value` model,
  `bind_for_column`, the mandatory where-or-`--all-rows`
  rule, auto-show.
- ADR-0021 — the parser as source of truth for H1a
  parse-error help.
- ADR-0022 — ambient typing assistance: the highlight,
  hint, and completion machinery the expression plugs
  into.
- ADR-0023 / ADR-0024 — the unified grammar tree this
  extends; ADR-0023 sketched the `SubgrammarRef` node
  realised here as `Subgrammar`.
- ADR-0025 — indexes; the reason `QA1`, and thus a
  filtered query, is now worthwhile.