rdbms-playground/docs/adr/0036-typed-dml-values-vs-verbatim.md
claude@clouddev1 49ea03b0d5 feat: ADR-0036 Phase 3a — live typed-slot hints + highlighting for SQL SET values
Wire the DSL's column-typed value slots into the advanced-mode SQL
UPDATE/UPSERT `SET col = <rhs>` value position so a learner gets the same
per-column hint ("for `Email`: type a quoted string") and live numeric-
shape mismatch highlight the simple-mode DSL gives.

Discriminate literal-vs-expression with a boundary-aware lookahead
(shared::SET_VALUE), NOT the naive `Choice(typed-slot, sql_expr)` the ADR
originally sketched: the walker's Choice is first-match-wins with no
backtrack, so a typed slot would greedily match the leading `1` of `1 + 2`
and commit, regressing valid SQL (e.g. the existing `values (1, 1 + 2)`
test). The lookahead peeks the whole value position: a literal routes to
the typed slot only when it fills the position up to the next
`,`/`)`/`;`/`where`/`returning`/end; everything else falls through to the
full sql_expr grammar unchanged. The SET column ident gets
`writes_column: true` so `current_column` drives the slot + hint.

Scope: Phase 3a covers UPDATE's assignment list and INSERT's ON CONFLICT
DO UPDATE SET. Phase 3b (INSERT VALUES — needs a per-position grammar
restructure + multi-row) is deferred. Records ADR-0036 Amendment 1 with
the mechanism correction + the 3a/3b split.

Tests: 1939 passing (+5), 0 failed, 0 skipped, 1 ignored; clippy clean.
2026-05-26 22:48:46 +00:00


ADR-0036: Value validation for advanced-mode DML — validate literals, keep execution and identity mode-specific

Status

Accepted (design agreed with the user in conversation, 2026-05-26; /runda verification pass completed 2026-05-26). The mechanism was deliberately narrowed during the same conversation (see below), from "bind literal values through the DSL's path" to the surgical "validate-and-retain, execute verbatim", after the user pushed back on consolidating the two modes and a concrete auto-fill difference confirmed that even the single-row literal case is not identical across modes.

  • Phase 1, implemented 2026-05-26: INSERT … VALUES literal validation + offending-value retention (capture-at-parse, no grammar change, execution unchanged).
  • Phase 2, implemented 2026-05-26: UPDATE … SET literal validation + offending-value retention. The same capture-at-parse technique is applied to the SET assignment list (capture_set_literals in data.rs): it classifies each top-level RHS as literal or expression, validates literals in do_sql_update, and reads them in user_value_for_column. WHERE is not validated, and execution stays verbatim.
  • Phase 3a, implemented 2026-05-26: live typed-slot hints + numeric-shape highlighting for advanced-mode UPDATE/UPSERT SET col = <literal> value positions, via a boundary-aware lookahead (not the naive Choice this ADR originally sketched in §5; see Amendment 1).
  • Phase 3b, pending: INSERT … VALUES typed slots (needs a per-position grammar restructure + multi-row).

Augments ADR-0030 §4 and ADR-0033 §10 — it does not supersede them and does not change the execution model. Advanced-mode DML still executes the validated SQL verbatim; ADR-0033 Amendment 3's two-command identity (Command::Insert vs Command::SqlInsert) stands unchanged. What this ADR adds is a value-validation step: the word "validated" in "executed as the validated SQL itself" (ADR-0030 §4) is extended to mean value-validated, not merely syntactically validated — the literal data values in an advanced-mode INSERT/UPDATE are checked against the playground type system (and retained for error reporting) before the statement runs.

Builds on the ADR-0035 precedent (DDL executes structurally, not verbatim): there, structure was the first place "grammar as text" was too broad. This ADR makes a narrower correction for DML — not to how it executes, but to what gets checked before it does.

Conversation note (the principle this records). The first instinct was to consolidate — bind literals via the DSL path, even emit Command::Insert from the advanced surface. That was rejected, for a reason worth preserving: simple- and advanced-mode commands are kept distinct because they can legitimately differ, and they do — e.g. auto-fill: simple-mode do_insert fills an omitted non-PK serial with MAX(col)+1, advanced-mode does not (requirements.md X4, flagged as a possible bug to investigate separately). Collapsing the commands would silently drag in such differences. The durable principle (also requirements.md X5): keep a distinct command per distinct case; share execution mechanics as library helpers, never by fusing command identity. This ADR shares exactly one mechanic — the per-type value validators — and nothing else.

Context

How we got here

ADR-0030 §4 set the advanced-mode execute path: DDL lowers to a typed Command and runs the structural executor (to preserve the playground type vocabulary, named relationships, metadata tables, and STRICT); DML and SELECT execute "as the validated SQL itself," on the stated rationale that "they change no schema, so modelling them as a typed Command buys nothing." ADR-0033 implemented that for DML: SqlInsert/SqlUpdate/SqlDelete carry the validated statement text (row_source, the raw sql) and the worker hands it to the engine.

ADR-0035 already found the rationale too broad for DDL and went structural. This ADR finds it too broad for one more case: the literal data values inside DML.

What "verbatim text for literal values" actually costs

The simple-mode DSL never did it this way. do_insert parses each value into a typed Value, validates/normalises it (Value::bind_for_column / validate_date, shortid::validate, …), and executes INSERT INTO T (…) VALUES (?1, ?2, …) with the values bound as parameters. The value never becomes SQL text. The advanced-mode SQL path, by contrast, splices the user's literal into SQL text and lets a STRICT engine be the only check.

A date column is STRICT TEXT; a shortid is TEXT; a bool is an int — the engine's storage types do not enforce the playground's semantic types. So the two paths diverge, and advanced mode is materially weaker. Investigated 2026-05-26; the matrix:

| Feedback for a DML value | DSL (simple) | SQL (advanced) |
|---|---|---|
| Column-type hint in completion | typed slots (incl. date format examples) | raw sql_expr |
| Value-vs-column highlighting | numeric-shape mismatch at parse | none |
| Validation at parse | ⚠️ numeric shape only (int/decimal/bool); date/shortid format deferred to bind | none |
| Validation at execution (bind) | full semantic type | none (verbatim) |

Precise reading (verified 2026-05-26): the DSL typed slots (shared.rs) validate numeric shape at parse (INT_SLOT rejects decimals, DECIMAL_SLOT checks format, BOOL_SLOT restricts to boolean literals) and surface a per-type hint for every type (the DATE_SLOT carries the YYYY-MM-DD example prose). Full semantic validation (date/shortid/datetime format) happens at bind time (Value::bind_for_column / validate_date / shortid::validate). So the DSL catches a bad value somewhere (parse for numeric shape, bind for the rest); advanced-mode SQL catches it nowhere but the engine's storage-type floor. That asymmetry, "DSL always catches it, SQL never does", is the gap, and it holds across all semantic types.
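As a rough illustration of the parse-time numeric-shape checks described above, the sketch below reduces each slot to a predicate on the literal's text. `SlotKind` and `shape_ok` are hypothetical names invented for this example; the real slots in shared.rs are grammar nodes and may behave differently in detail.

```rust
/// Hypothetical stand-in for the typed slots; the real ones live in shared.rs.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SlotKind {
    Int,
    Decimal,
    Bool,
    Date, // any quoted string passes at parse; the format is checked at bind
}

/// Returns true when `text` has an acceptable shape for the slot at parse time.
fn shape_ok(kind: SlotKind, text: &str) -> bool {
    let unsigned = text.strip_prefix('-').unwrap_or(text);
    let all_digits = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit());
    match kind {
        // An int slot rejects anything with a decimal point.
        SlotKind::Int => all_digits(unsigned),
        // A decimal slot accepts `12` or `12.5`, but not `12.` or `.5`.
        SlotKind::Decimal => match unsigned.split_once('.') {
            Some((whole, frac)) => all_digits(whole) && all_digits(frac),
            None => all_digits(unsigned),
        },
        SlotKind::Bool => matches!(text.to_ascii_lowercase().as_str(), "true" | "false"),
        SlotKind::Date => text.len() >= 2 && text.starts_with('\'') && text.ends_with('\''),
    }
}
```

Under these assumptions `shape_ok(SlotKind::Int, "3.14")` is false at parse, while `shape_ok(SlotKind::Date, "'2025/01/15'")` is true: the malformed date only fails later, at bind.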

The execution-layer gap is proven by a characterization test (tests/sql_insert.rs::sql_dml_skips_app_level_value_validation_that_the_dsl_enforces): the DSL rejects the malformed date 2025/01/15; advanced-mode SQL accepts it and writes the bad row. The only advanced-mode DML diagnostics are structural (insert_arity_mismatch, auto_column_overridden, not_null_missing) — never value-vs-type.
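The divergence that the characterization test pins down can be sketched as two paths over the same input. Everything here is illustrative: this `validate_date` is a simplified stand-in with an assumed signature and wording, and `dsl_path` / `sql_path_before` are hypothetical names, not functions in the codebase.

```rust
/// Hypothetical date validator: accepts only `YYYY-MM-DD` with plausible ranges.
/// The real `validate_date` may differ in signature and wording.
fn validate_date(text: &str) -> Result<(), String> {
    let err = || format!("`{text}` is not a valid date; expected YYYY-MM-DD");
    let parts: Vec<&str> = text.split('-').collect();
    if parts.len() != 3 || parts[0].len() != 4 || parts[1].len() != 2 || parts[2].len() != 2 {
        return Err(err());
    }
    if !parts.iter().all(|p| p.chars().all(|c| c.is_ascii_digit())) {
        return Err(err());
    }
    let month: u32 = parts[1].parse().map_err(|_| err())?;
    let day: u32 = parts[2].parse().map_err(|_| err())?;
    if (1..=12).contains(&month) && (1..=31).contains(&day) {
        Ok(())
    } else {
        Err(err())
    }
}

/// Simple-mode path: validate first, then hand the value over as a bound parameter.
fn dsl_path(value: &str) -> Result<(String, Vec<String>), String> {
    validate_date(value)?;
    Ok(("INSERT INTO T (d) VALUES (?1)".to_string(), vec![value.to_string()]))
}

/// Advanced-mode path before this ADR: the literal only ever exists as SQL text,
/// so nothing checks it against the column's semantic type.
fn sql_path_before(statement: &str) -> Result<String, String> {
    Ok(statement.to_string())
}
```

With these definitions `dsl_path("2025/01/15")` returns an error, while `sql_path_before` passes the equivalent statement through untouched.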

The machinery to fix this already exists and is live for the DSL: column_value_list unfolds a per-column TypedValueSlot when the walker has schema (data.rs:141/189/269; slots in shared.rs). The SQL DML grammar simply was never wired to it — every value position is Node::Subgrammar(&sql_expr::SQL_OR_EXPR) (sql_insert.rs:75), type-blind by construction. So the asymmetry is not a deliberate "advanced mode doesn't need this" decision — no ADR says so — it is an un-wired surface. (A stale header comment at data.rs:8-17 still describes the DSL slots themselves as "deferred"; it predates the wiring that data.rs:141/189/269 now show, and should be corrected as part of this work.) For a teaching tool, where the whole point is to catch a learner's mistake and explain it, silently accepting a malformed value is a pedagogy failure, not a feature.

The same root cause behind the error-value gap

A separate symptom shares this root cause. When a SQL INSERT/UPDATE violates a UNIQUE/CHECK constraint, the friendly-error layer cannot show the offending value — because the value was discarded (only row_source text survives), so enrich_unique_violation / enrich_check_violation come up empty and degrade to a neutral "that value" (ADR-0035 Amendment 1, F2 follow-up). Validation, hinting, highlighting, and the offending-value-in-errors display are four faces of one defect: literal values are thrown away instead of owned.
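A minimal sketch of why retention matters to the error layer, assuming the enricher receives the captured literal as an `Option`. The function name and the message wording are invented for illustration; only the neutral "that value" fallback comes from the text above.

```rust
/// Hypothetical shape of the friendly-error step.
/// `retained` is the captured literal, or None when the value was discarded
/// (or was a computed expression with no input literal to show).
fn unique_violation_message(column: &str, retained: Option<&str>) -> String {
    match retained {
        Some(value) => format!("`{column}` must be unique, but {value} is already taken"),
        None => format!("`{column}` must be unique, but that value is already taken"),
    }
}
```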

The sharp edge — why we do not go fully structural

ADR-0030 §4's text choice was not gratuitous. It deliberately keeps DML/SELECT/CHECK expressions out of the DSL's intentionally limited Expr (ADR-0026), so advanced mode delivers the full SQL expression surface — arithmetic, functions, subqueries, nested boolean operands — that docs/simple-mode-limitations.md records as the inverse of the simple subset. Lowering SQL expressions into the DSL Expr would regress that surface; building a full typed SQL-expression AST + serializer is a large undertaking that ADR-0031 explicitly declined (sql_expr is validate-only, no Expr AST).

And SELECT is the proof that text-to-engine is the right tool for queries: ADR-0032 already delivers rich feedback for SELECT — completion, qualified-name resolution, predicate warnings, post-prepare type recovery — entirely from walking the validated parse, with the engine executing the text. Queries have no data values to validate against columns; owning them buys nothing and costs enormously.

So the dividing line is not "DDL vs DML." It is a static literal value (which we can validate) vs an engine-evaluated expression-or-query (which we cannot).

Decision

1. The principle

In advanced-mode INSERT/UPDATE, validate each literal data value against its target column's type before executing, and retain the literal so a constraint error can name it. Execute the statement verbatim, exactly as today. Do not bind, do not reconstruct, do not touch auto-fill, do not collapse command identity.

Only the value validation is shared between simple and advanced mode — via the existing per-type validators (Value::bind_for_column / validate_date / shortid::validate). Everything else stays mode-specific: execution is still verbatim text-to-engine, plan_shortid_autofill is untouched, and Command::SqlInsert / Command::SqlUpdate remain distinct from their DSL counterparts.

What counts as a literal (the set we validate — matching the null/true/false words plus number/string literals as the walker tokenises them): NULL, a boolean literal, a string literal, and a signed numeric literal (-5, 3.14). A signed numeric counts as a literal even though sql_expr tokenises the sign separately (Punct('-') then NumberLit) — a leading sign at the start of a value position is part of the literal, not an operator. Anything else in a value position — arithmetic, function calls, CASE, subqueries, column references — is an expression: there is no static value to validate, so it is left to the engine (unchanged).
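The literal set above can be sketched as a classifier over the tokens of one value position. The `Token` and `Literal` types are assumptions made for this example (the walker's real token types are not shown in this ADR); the point is only the rule that a leading `-` followed by a number is one literal, while anything longer is an expression.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    NumberLit(String),
    StringLit(String),
    Punct(char),
}

#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Null,
    Bool(bool),
    Number(String),
    Text(String),
}

/// Classifies the tokens of one value position.
/// `Some(_)` is a static literal to validate; `None` is an expression left to the engine.
fn classify(position: &[Token]) -> Option<Literal> {
    match position {
        [Token::Word(w)] => match w.to_ascii_lowercase().as_str() {
            "null" => Some(Literal::Null),
            "true" => Some(Literal::Bool(true)),
            "false" => Some(Literal::Bool(false)),
            _ => None, // a column reference or other identifier
        },
        [Token::NumberLit(n)] => Some(Literal::Number(n.clone())),
        [Token::StringLit(s)] => Some(Literal::Text(s.clone())),
        // A leading sign at the start of the position is part of the literal.
        [Token::Punct('-'), Token::NumberLit(n)] => Some(Literal::Number(format!("-{n}"))),
        // Arithmetic, function calls, CASE, subqueries: nothing static to validate.
        _ => None,
    }
}
```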

Why not bind / converge. Binding was the first instinct and is rejected. The two proven gaps (a malformed literal slipping through; the offending value missing from errors) are closed by validation + retention alone — binding adds nothing to either. Meanwhile, executing the user's own text verbatim is already safe (their quoting stands; no re-quoting risk because we do not reconstruct), and binding/convergence would risk dragging in genuinely mode-specific behaviour (auto-fill — X4; natural-order column mapping) that must stay separate. So we share the validators and nothing else. This keeps the modes cleanly apart (requirements.md X5) while fixing the bug that they should not differ on: whether a learner's malformed value is caught.

2. What this means per statement

Execution is unchanged for every statement below; the only addition is a pre-execution validation of literal value positions.

  • INSERT … VALUES — every literal position (single- or multi-row) is validated against its column type before the verbatim insert runs; a malformed literal is refused with the same friendly wording the DSL uses (shared bind_for_column). Expression positions are skipped (nothing to validate). RETURNING / ON CONFLICT / INSERT … SELECT need no special handling — validation simply applies to whatever literal VALUES are present, and the statement still executes verbatim.
  • UPDATE … SET: SET col = <literal> is validated; SET col = <expr> is skipped. (Phase 2 — see §5.)
  • WHERE (UPDATE/DELETE): not validated. WHERE is an expression in general; the value-feedback motivation is met by VALUES/SET (a constraint error names a written value). Deliberate scope choice, not an oversight.
  • SELECT — entirely unchanged. No data values to validate.

3. What it fixes

Validating the literal closes the validation gap (the malformed date 2025/01/15 is now refused in advanced mode, as proven by the characterization test). Retaining the literal on the command closes the error-value gap (enrich_* reads it, so a constraint error shows the real value instead of the neutral "that value"). Completion hinting / highlighting is not delivered here; it needs a grammar-level change (§5, Phase 3). The neutral "that value" safety net (ADR-0035 Amendment 1) remains correct for genuinely computed expression values, where there is no input literal to show.

4. Explicit requirement — retain the literals, change nothing else

Command::SqlInsert (and later SqlUpdate) gains a captured-literals payload (per row, per position; None for an expression position) in addition to the existing raw text. The executor validates from it and the error enricher reads it. The original source text is unchanged and is still what history.log records and replay re-runs (ADR-0034). The command variant, its execution, and plan_shortid_autofill are not modified. Validation reuses the existing value-binding helper (impl_value_for / Value::bind_for_column) for wording parity with the DSL — the resulting bound value is discarded (we do not bind for execution), only its Result is used.
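A minimal sketch of the payload and the validate-then-execute-verbatim step, under simplifying assumptions: `Value`, `ColumnType`, the struct layout, and this `bind_for_column` are reduced stand-ins written for the example, not the project's actual definitions.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Text(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Int,
    Text,
}

/// Sketch of the command: the raw text plus the captured literals,
/// per row and per position (None marks an expression position).
struct SqlInsert {
    sql: String,
    literals: Vec<Vec<Option<Value>>>,
}

/// Hypothetical stand-in for the shared binding helper.
fn bind_for_column(value: &Value, ty: ColumnType) -> Result<Value, String> {
    match (value, ty) {
        (Value::Null, _) => Ok(Value::Null),
        (Value::Int(_), ColumnType::Int) | (Value::Text(_), ColumnType::Text) => Ok(value.clone()),
        (other, ty) => Err(format!("{other:?} does not fit a {ty:?} column")),
    }
}

/// Validates every captured literal, then returns the untouched SQL text to execute.
fn validate_then_verbatim<'a>(
    cmd: &'a SqlInsert,
    columns: &[ColumnType],
) -> Result<&'a str, String> {
    for row in &cmd.literals {
        for (slot, ty) in row.iter().zip(columns) {
            if let Some(value) = slot {
                // The bound value is discarded; only the Result is used.
                bind_for_column(value, *ty)?;
            }
        }
    }
    Ok(&cmd.sql)
}
```

The expression position (`None`) is skipped without error, and the string handed back for execution is the original text, byte for byte.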

5. Mechanism + phasing

  • Phase 1 (this ADR's immediate work) — capture + validate + retain. At parse, build_sql_insert classifies each VALUES position from the matched path (a single literal token, or a signed number → a typed Value; anything else → an expression marker) and stores the per-row result on the command — no grammar change, no reparse. The executor validates the captured literals against the resolved column types before the verbatim insert; the enricher reads them. Covers single- and multi-row, with or without RETURNING/ON CONFLICT, because execution is untouched.
  • Phase 2 (implemented 2026-05-26) — UPDATE … SET literal validation. The same capture-at-parse technique on the SET assignment list: build_sql_update calls capture_set_literals, which walks the matched tokens (no reparse) and classifies each top-level SET col = <rhs> into (col, Some(Value)) for a bare literal (incl. a signed number) or (col, None) for an expression — using paren depth so a comma inside a function call or a where inside a scalar subquery is never mistaken for an assignment/clause boundary, and so the trailing top-level WHERE predicate is excluded. Command::SqlUpdate gains a set_literals payload; do_sql_update validates the literals against their column types (via the shared impl_value_for) before the still verbatim update; user_value_for_column reads them so a constraint error names the offending value. WHERE is deliberately not validated (§2).
  • Phase 3 — completion hinting / highlighting. This is the only part that needs a grammar change: a typed-literal slot vs sql_expr at each value position (reusing the DSL's live column_value_list / TypedValueSlots — data.rs:141/189/269), so the column type drives a live hint and a mismatch highlights while typing. When Phase 3 lands, the typed slot supersedes Phase 1/2's classification of literals (the validation/enrichment built on top is unaffected — that is the only throwaway, by design). The literal-vs-expression discriminator is a boundary-aware lookahead, not a naive Choice(typed-slot, sql_expr) — see Amendment 1, which corrects this section's mechanism and splits Phase 3 into 3a (SET, implemented 2026-05-26) and 3b (VALUES, pending).
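The paren-depth rule in Phase 2 can be sketched on plain text. This is a simplification: the real capture_set_literals walks the matched tokens, whereas the hypothetical `sketch_capture_set_literals` below scans characters of the text after `SET`. It also conservatively treats a string containing an escaped quote as an expression.

```rust
/// True when `kw` starts at `i` as a whole word (case-insensitive).
fn keyword_at(chars: &[char], i: usize, kw: &str) -> bool {
    let is_word = |c: char| c.is_alphanumeric() || c == '_';
    let kw: Vec<char> = kw.chars().collect();
    let end = i + kw.len();
    end <= chars.len()
        && chars[i..end].iter().zip(&kw).all(|(a, b)| a.eq_ignore_ascii_case(b))
        && (i == 0 || !is_word(chars[i - 1]))
        && (end == chars.len() || !is_word(chars[end]))
}

/// Returns the literal text when `rhs` is a bare literal, else None (an expression).
fn bare_literal(rhs: &str) -> Option<String> {
    let digits = rhs.strip_prefix('-').unwrap_or(rhs);
    let is_number = digits.chars().any(|c| c.is_ascii_digit())
        && digits.chars().all(|c| c.is_ascii_digit() || c == '.')
        && digits.chars().filter(|&c| c == '.').count() <= 1;
    let is_string = rhs.len() >= 2
        && rhs.starts_with('\'')
        && rhs.ends_with('\'')
        && !rhs[1..rhs.len() - 1].contains('\'');
    let is_word = matches!(rhs.to_ascii_lowercase().as_str(), "null" | "true" | "false");
    (is_number || is_string || is_word).then(|| rhs.to_string())
}

/// Splits the text after `SET` into top-level `(col, literal-or-None)` pairs.
fn sketch_capture_set_literals(after_set: &str) -> Vec<(String, Option<String>)> {
    let chars: Vec<char> = after_set.chars().collect();
    let (mut depth, mut in_string) = (0usize, false);
    let mut pieces = vec![String::new()];
    for (i, &c) in chars.iter().enumerate() {
        if in_string {
            in_string = c != '\'';
        } else if c == '\'' {
            in_string = true;
        } else if c == '(' {
            depth += 1;
        } else if c == ')' {
            depth = depth.saturating_sub(1);
        } else if depth == 0 && c == ',' {
            pieces.push(String::new()); // top-level comma: next assignment
            continue;
        } else if depth == 0 && keyword_at(&chars, i, "where") {
            break; // the trailing top-level WHERE predicate is excluded
        }
        pieces.last_mut().unwrap().push(c);
    }
    pieces
        .iter()
        .filter_map(|p| p.split_once('='))
        .map(|(col, rhs)| (col.trim().to_string(), bare_literal(rhs.trim())))
        .collect()
}
```

For example, given `a = 1, b = coalesce(x, 2), c = (select max(n) from t where n < 5), d = '2025/01/15' where id = 7`, the sketch yields four assignments: `a` and `d` carry literals, `b` and `c` are marked as expressions, and the final `where id = 7` is left out.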

6. Non-goals

  • Binding / statement reconstruction. Explicitly out. Execution stays verbatim. (This was the rejected first instinct.)
  • Collapsing command identity. Command::Insert and Command::SqlInsert stay distinct; ADR-0033 Amendment 3 stands.
  • Changing auto-fill. The simple-vs-advanced serial/shortid auto-fill difference (requirements.md X4) is untouched here and tracked separately as a possible bug.
  • A structural SELECT and a full typed SQL-expression AST — both out (queries and expressions stay text; ADR-0031's "no Expr AST" and ADR-0030 §4's full-surface guarantee stand).

Consequences

  • Advanced mode stops being a feedback-free zone for data values. A learner typing a malformed date/shortid/int literal in a SQL INSERT gets the same catch-and-explain they get in simple mode — via the shared validator, not a shared command.
  • The modes stay cleanly separate. Execution, auto-fill, and command identity are all unchanged; the only thing now shared is the value validators. This is the requirements.md X5 principle in practice (share a mechanic, not a command) and avoids the consolidation traps (X4 auto-fill) that the bind/converge approach would have hit.
  • Small, low-risk, no execution reconstruction. Because we do not rebuild the statement, there is no "mixed VALUES (?1, expr, ?2)" splicing problem, no multi-row execution change, and no RETURNING / ON CONFLICT / INSERT … SELECT special-casing; they keep working as the existing ADR-0033 tests assert.
  • One new seam to keep honest: the literal-vs-expression classification at parse. It must be tested (single literal / signed literal / NULL/true/false → validated; arithmetic / function / subquery → skipped), or it will drift.
  • A normalization difference is avoided, not introduced. We validate the literal but do not rewrite it; the engine stores the user's text as written. (Had we bound/normalized, advanced inserts might store a canonicalised value — a behaviour change we sidestep.)
  • Phase 3 will revisit literal detection (swapping the parse-time classification for typed slots that also drive hints). The validation/enrichment built on it is permanent; only the detection is provisional — a deliberate, documented small throwaway.

Amendment 1 — Phase 3 mechanism is a boundary-aware lookahead, not a naive Choice; Phase 3 split into 3a/3b (2026-05-26)

Status: Accepted (agreed with the user in conversation, 2026-05-26). Phase 3a implemented the same day.

§5 Phase 3 sketched the mechanism as "a Choice(typed-literal-slot, sql_expr) at each value position." Implementation found that sketch is wrong as written and would regress valid SQL, so it is corrected here.

Why the naive Choice is broken. The walker's Node::Choice is first-match-wins with no cross-branch backtracking once a branch has committed a Matched (a later failure in the enclosing Seq does not re-enter the Choice). At, say, an int column, the value 1 + 2:

  • Branch 0 (the typed slot) matches just the 1 and commits, leaving + 2 dangling — the enclosing tuple/assignment then fails on +.
  • The Choice never falls through to sql_expr, so a valid, currently parsing SQL expression is rejected.

This is not hypothetical: tests/sql_insert.rs::sql_insert_expression_value_is_not_validated_and_runs exercises exactly values (1, 1 + 2). Putting sql_expr first instead makes the typed slot unreachable (sql_expr matches bare literals too), defeating the purpose. So the discriminator must know whether a literal fills the whole value position before choosing the typed slot.
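The failure mode can be reproduced in a toy walker. The functions below are hypothetical and operate on pre-split string tokens rather than the project's Node types; they only model the stated behaviour, a first-match-wins Choice that commits once its first branch matches.

```rust
fn is_op(t: &str) -> bool {
    matches!(t, "+" | "-" | "*" | "/")
}

fn is_operand(t: &str) -> bool {
    !is_op(t) && t != "," && t != ")"
}

/// Matches a single integer token; returns how many tokens it consumed.
fn typed_int_slot(tokens: &[&str]) -> Option<usize> {
    let first = tokens.first()?;
    first.parse::<i64>().ok().map(|_| 1)
}

/// Matches `operand (op operand)*`; returns how many tokens it consumed.
fn sql_expr(tokens: &[&str]) -> Option<usize> {
    if !is_operand(*tokens.first()?) {
        return None;
    }
    let mut used = 1;
    while used + 1 < tokens.len() && is_op(tokens[used]) && is_operand(tokens[used + 1]) {
        used += 2;
    }
    Some(used)
}

/// First-match-wins Choice with no backtracking: if the slot matches, it commits,
/// and sql_expr is never tried even when the enclosing sequence later fails.
fn naive_choice(tokens: &[&str]) -> Option<usize> {
    typed_int_slot(tokens).or_else(|| sql_expr(tokens))
}

/// The enclosing sequence: a value followed by a position boundary.
fn value_then_boundary(tokens: &[&str], value: fn(&[&str]) -> Option<usize>) -> bool {
    match value(tokens) {
        Some(used) => tokens.get(used).map_or(false, |t| *t == "," || *t == ")"),
        None => false,
    }
}
```

With `["1", "+", "2", ")"]`, `value_then_boundary(.., naive_choice)` is false because the slot consumed only `1` and the sequence then meets `+`, whereas `value_then_boundary(.., sql_expr)` is true.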

The correction. Discriminate by a boundary-aware lookahead (shared::SET_VALUE / set_value_node): peek the value position and route to the column-typed slot only when it is empty, a partial string still being typed, or a single complete literal token whose next token is a position boundary (, / ) / ; / where / returning / end); otherwise route to Subgrammar(sql_expr). The empty case still routes to the slot so the per-column hint shows while the cursor sits right after =. A leading sign folds into the literal (the slot's NumberLit uses the same consume_number_literal that eats a -), so signed literals get typed treatment too. Node::Lookahead already exists and is used the same way by insert_first_paren (data.rs). The validation/enrichment from Phases 1 and 2 is unchanged; only the live-feedback detection uses this lookahead, consistent with §5's note that Phase 3's detection is the one deliberate throwaway.
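The routing rule can be sketched as a pure function over the tokens that remain at the value position. This is an illustration under stated assumptions, not the set_value_node implementation: `Route`, `route_value_position`, and the helpers are hypothetical, tokens are plain strings with the sign already folded into a number, and "empty" is taken to mean no tokens remain.

```rust
#[derive(Debug, PartialEq)]
enum Route {
    TypedSlot,
    SqlExpr,
}

/// A position boundary: `,` `)` `;` `where` `returning`, or end of input.
fn is_boundary(token: Option<&&str>) -> bool {
    match token {
        None => true,
        Some(t) => matches!(
            t.to_ascii_lowercase().as_str(),
            "," | ")" | ";" | "where" | "returning"
        ),
    }
}

/// A single complete literal: NULL/true/false, a closed quoted string, or a (signed) number.
fn is_complete_literal(token: &str) -> bool {
    let digits = token.strip_prefix('-').unwrap_or(token);
    let number = digits.chars().any(|c| c.is_ascii_digit())
        && digits.chars().all(|c| c.is_ascii_digit() || c == '.')
        && digits.chars().filter(|&c| c == '.').count() <= 1;
    let quoted = token.len() >= 2 && token.starts_with('\'') && token.ends_with('\'');
    let word = matches!(token.to_ascii_lowercase().as_str(), "null" | "true" | "false");
    number || quoted || word
}

/// Peeks the value position and decides which grammar branch should parse it.
fn route_value_position(rest: &[&str]) -> Route {
    let Some(first) = rest.first() else {
        // Empty position: use the slot so the per-column hint shows right after `=`.
        return Route::TypedSlot;
    };
    let closed = first.len() >= 2 && first.ends_with('\'');
    let partial_string = first.starts_with('\'') && !closed;
    if partial_string && rest.len() == 1 {
        Route::TypedSlot // a string still being typed
    } else if is_complete_literal(first) && is_boundary(rest.get(1)) {
        Route::TypedSlot // the literal fills the whole position
    } else {
        Route::SqlExpr // everything else goes to the full expression grammar
    }
}
```

Here `["1", "+", "2"]` routes to `SqlExpr` because the token after `1` is not a boundary, while `["3.14", "where", ...]` routes to `TypedSlot`, which is where an int column would flag the mismatch live.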

Phase 3 split into 3a + 3b. The two halves differ structurally:

  • Phase 3a (implemented) — UPDATE / UPSERT SET col = <rhs>. Low risk: the preceding SET column ident gets writes_column: true so current_column (and the pending_value_column hint framing) is set per assignment; the RHS becomes shared::SET_VALUE. Covers both sql_update's assignment list and sql_insert's ON CONFLICT … DO UPDATE SET. Mismatch examples now caught live (e.g. set k = 3.14 at an int column), matching what simple mode already does — earlier, better feedback than Phase 2's execution-time catch.
  • Phase 3b (pending) — INSERT … VALUES (…). Harder: the values list is Repeated(VALUE_EXPR) with no per-position column identity, and multi-row values (..),(..) must be handled. It needs the DSL-style per-position restructure (a DynamicSubgrammar emitting one boundary-aware position per column), tracked as its own step.

Known limitation (both phases, matches the DSL). date / shortid / datetime format is still not validated at parse: those slots accept any quoted string, and the format is checked at bind/execution time (Phase 2). So the live highlight catches numeric-shape mismatches (int/decimal/bool), not malformed dates. The column-type hint still shows for every type.

See also

  • ADR-0030 §4 / ADR-0033 §10 — the execute-path this ADR augments (adds value validation); the verbatim execution model and the SELECT/expression text path both stand.
  • ADR-0033 Amendment 3 — the two-command identity, preserved (this ADR does not collapse Insert/SqlInsert).
  • ADR-0035 — the DDL precedent (structural, not verbatim); this ADR is the narrower DML analogue (validate, don't restructure).
  • ADR-0026 — the DSL's deliberately-limited Expr; not imposed on the SQL surface. ADR-0031 — sql_expr is validate-only; unchanged.
  • ADR-0032 — SELECT feedback-from-walk; the proof that text-to-engine is right for queries.
  • ADR-0029 — the column type/constraint model the shared validators enforce.
  • ADR-0035 Amendment 1 (F2 follow-up) — the neutral "that value" safety net, correct for computed values.
  • requirements.md X4 (auto-fill difference — possible bug, untouched here) and X5 (framework cohesion / share-mechanics-not-commands — the principle this ADR follows).