claude@clouddev1 8c3b13b313 feat: ADR-0036 Phase 2 — validate advanced-mode UPDATE SET literals + retain the value

Mirror Phase 1's capture-at-parse technique on the UPDATE SET assignment
list. build_sql_update calls the new capture_set_literals (data.rs), which
walks the matched tokens (no reparse, no grammar change) and classifies
each top-level `SET col = <rhs>` as a literal (Some, incl. signed numbers)
or an expression (None), using paren depth so a comma inside a function
call or a `where` inside a scalar subquery is not mistaken for a boundary,
and the trailing top-level WHERE is excluded.

Command::SqlUpdate gains set_literals; do_sql_update validates the literals
against their column types via the shared impl_value_for before the still
verbatim update; user_value_for_column reads them so a constraint error
names the offending value. WHERE stays unvalidated; execution and command
identity are unchanged.

Also corrects the stale data.rs header comment (DSL typed slots are wired,
not "deferred") and flips ADR-0036 + README to Phases 1–2 implemented.

Tests: 1934 passing (+4), 0 failed, 0 skipped, 1 ignored; clippy clean.

2026-05-26 22:20:12 +00:00

ADR-0036: Value validation for advanced-mode DML — validate literals, keep execution and identity mode-specific

Status

Accepted. The design was agreed with the user in conversation on 2026-05-26, and the /runda verification pass completed the same day. During that conversation the mechanism was deliberately narrowed (see below) from "bind literal values through the DSL's path" to the surgical "validate-and-retain, execute verbatim", after the user pushed back on consolidating the two modes and a concrete auto-fill difference confirmed that even the single-row literal case is not identical across modes.

Phase 1 implemented 2026-05-26: INSERT … VALUES literal validation plus offending-value retention (capture-at-parse, no grammar change, execution unchanged).

Phase 2 implemented 2026-05-26: UPDATE … SET literal validation plus offending-value retention. It applies the same capture-at-parse technique to the SET assignment list (capture_set_literals in data.rs), classifying each top-level RHS as literal or expression, validating literals in do_sql_update, and reading them in user_value_for_column. WHERE is not validated and execution stays verbatim.

Phase 3 (completion hinting/highlighting, the only part needing a grammar change) is pending.

Augments ADR-0030 §4 and ADR-0033 §10 — it does not supersede them and does not change the execution model. Advanced-mode DML still executes the validated SQL verbatim; ADR-0033 Amendment 3's two-command identity (Command::Insert vs Command::SqlInsert) stands unchanged. What this ADR adds is a value-validation step: the word "validated" in "executed as the validated SQL itself" (ADR-0030 §4) is extended to mean value-validated, not merely syntactically validated — the literal data values in an advanced-mode INSERT/UPDATE are checked against the playground type system (and retained for error reporting) before the statement runs.

Builds on the ADR-0035 precedent (DDL executes structurally, not verbatim): there, structure was the first place "grammar as text" was too broad. This ADR makes a narrower correction for DML — not to how it executes, but to what gets checked before it does.

Conversation note (the principle this records). The first instinct was to consolidate — bind literals via the DSL path, even emit Command::Insert from the advanced surface. That was rejected, for a reason worth preserving: simple- and advanced-mode commands are kept distinct because they can legitimately differ, and they do — e.g. auto-fill: simple-mode do_insert fills an omitted non-PK serial with MAX(col)+1, advanced-mode does not (requirements.md X4, flagged as a possible bug to investigate separately). Collapsing the commands would silently drag in such differences. The durable principle (also requirements.md X5): keep a distinct command per distinct case; share execution mechanics as library helpers, never by fusing command identity. This ADR shares exactly one mechanic — the per-type value validators — and nothing else.

Context

How we got here

ADR-0030 §4 set the advanced-mode execute path: DDL lowers to a typed Command and runs the structural executor (to preserve the playground type vocabulary, named relationships, metadata tables, and STRICT); DML and SELECT execute "as the validated SQL itself," on the stated rationale that "they change no schema, so modelling them as a typed Command buys nothing." ADR-0033 implemented that for DML: SqlInsert/SqlUpdate/SqlDelete carry the validated statement text (row_source, the raw sql) and the worker hands it to the engine.

ADR-0035 already found the rationale too broad for DDL and went structural. This ADR finds it too broad for one more case: the literal data values inside DML.

What "verbatim text for literal values" actually costs

The simple-mode DSL never did it this way. do_insert parses each value into a typed Value, validates/normalises it (Value::bind_for_column → validate_date, shortid::validate, …), and executes INSERT INTO T (…) VALUES (?1, ?2, …) with the values bound as parameters. The value never becomes SQL text. The advanced-mode SQL path, by contrast, splices the user's literal into SQL text and lets a STRICT engine be the only check.

A date column is STRICT TEXT; a shortid is TEXT; a bool is an int — the engine's storage types do not enforce the playground's semantic types. So the two paths diverge, and advanced mode is materially weaker. Investigated 2026-05-26; the matrix:

| Feedback for a DML value | DSL (simple) | SQL (advanced) |
| --- | --- | --- |
| Column-type hint in completion | ✅ typed slots (incl. `date` format examples) | ❌ raw `sql_expr` |
| Value-vs-column highlighting | ✅ numeric-shape mismatch at parse | ❌ none |
| Validation at parse | ⚠️ numeric shape only (`int`/`decimal`/`bool`); `date`/`shortid` format deferred to bind | ❌ none |
| Validation at execution (bind) | ✅ full semantic type | ❌ none (verbatim) |

Precise reading (verified 2026-05-26): the DSL typed slots (shared.rs) validate numeric shape at parse — INT_SLOT rejects decimals, DECIMAL_SLOT checks format, BOOL_SLOT restricts to boolean literals — and surface a per-type hint for every type (the DATE_SLOT carries the YYYY-MM-DD example prose). Full semantic validation — date/shortid/datetime format — happens at bind time (Value::bind_for_column → validate_date / shortid::validate). So the DSL catches a bad value somewhere (parse for numeric shape, bind for the rest); advanced-mode SQL catches it nowhere but the engine's storage-type floor. That asymmetry — "DSL always catches it, SQL never does" — is the gap, and it holds across all semantic types.

The execution-layer gap is proven by a characterization test (tests/sql_insert.rs::sql_dml_skips_app_level_value_validation_that_the_dsl_enforces): the DSL rejects the malformed date 2025/01/15; advanced-mode SQL accepts it and writes the bad row. The only advanced-mode DML diagnostics are structural (insert_arity_mismatch, auto_column_overridden, not_null_missing) — never value-vs-type.

The machinery to fix this already exists and is live for the DSL: column_value_list unfolds a per-column TypedValueSlot when the walker has schema (data.rs:141/189/269; slots in shared.rs). The SQL DML grammar simply was never wired to it — every value position is Node::Subgrammar(&sql_expr::SQL_OR_EXPR) (sql_insert.rs:75), type-blind by construction. So the asymmetry is not a deliberate "advanced mode doesn't need this" decision — no ADR says so — it is an un-wired surface. (A stale header comment at data.rs:8-17 still describes the DSL slots themselves as "deferred"; it predates the wiring that data.rs:141/189/269 now show, and should be corrected as part of this work.) For a teaching tool, where the whole point is to catch a learner's mistake and explain it, silently accepting a malformed value is a pedagogy failure, not a feature.

The same root cause behind the error-value gap

A separate symptom shares this root cause. When a SQL INSERT/UPDATE violates a UNIQUE/CHECK constraint, the friendly-error layer cannot show the offending value — because the value was discarded (only row_source text survives), so enrich_unique_violation / enrich_check_violation come up empty and degrade to a neutral "that value" (ADR-0035 Amendment 1, F2 follow-up). Validation, hinting, highlighting, and the offending-value-in-errors display are four faces of one defect: literal values are thrown away instead of owned.

The sharp edge — why we do not go fully structural

ADR-0030 §4's text choice was not gratuitous. It deliberately keeps DML/SELECT/CHECK expressions out of the DSL's intentionally limited Expr (ADR-0026), so advanced mode delivers the full SQL expression surface — arithmetic, functions, subqueries, nested boolean operands — that docs/simple-mode-limitations.md records as the inverse of the simple subset. Lowering SQL expressions into the DSL Expr would regress that surface; building a full typed SQL-expression AST + serializer is a large undertaking that ADR-0031 explicitly declined (sql_expr is validate-only, no Expr AST).

And SELECT is the proof that text-to-engine is the right tool for queries: ADR-0032 already delivers rich feedback for SELECT — completion, qualified-name resolution, predicate warnings, post-prepare type recovery — entirely from walking the validated parse, with the engine executing the text. Queries have no data values to validate against columns; owning them buys nothing and costs enormously.

So the dividing line is not "DDL vs DML." It is a static literal value (which we can validate) vs an engine-evaluated expression-or-query (which we cannot).

Decision

1. The principle

In advanced-mode INSERT/UPDATE, validate each literal data value against its target column's type before executing, and retain the literal so a constraint error can name it. Execute the statement verbatim, exactly as today. Do not bind, do not reconstruct, do not touch auto-fill, do not collapse command identity.

Only the value validation is shared between simple and advanced mode — via the existing per-type validators (Value::bind_for_column / validate_date / shortid::validate). Everything else stays mode-specific: execution is still verbatim text-to-engine, plan_shortid_autofill is untouched, and Command::SqlInsert / Command::SqlUpdate remain distinct from their DSL counterparts.

What counts as a literal (the set we validate — matching the null/true/false words plus number/string literals as the walker tokenises them): NULL, a boolean literal, a string literal, and a signed numeric literal (-5, 3.14). A signed numeric counts as a literal even though sql_expr tokenises the sign separately (Punct('-') then NumberLit) — a leading sign at the start of a value position is part of the literal, not an operator. Anything else in a value position — arithmetic, function calls, CASE, subqueries, column references — is an expression: there is no static value to validate, so it is left to the engine (unchanged).
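The literal set above can be illustrated with a small classifier. This is a minimal sketch, not the project's code: the `Tok` and `Literal` types and the `classify_value` function are hypothetical stand-ins, since the walker's real token representation is not shown in this ADR. It only assumes what the paragraph states, namely that a leading `-` arrives as a separate punctuation token ahead of the number.

```rust
// Hypothetical token type for illustration only; the real walker's tokens
// are not shown in this ADR.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Null,
    True,
    False,
    Number(String),
    Str(String),
    Punct(char),
    Ident(String),
}

// Hypothetical captured-literal type (the ADR speaks of a typed Value).
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Null,
    Bool(bool),
    Number(String),
    Text(String),
}

/// Classify the tokens of one value position.
/// Returns Some(literal) for a bare literal, including a signed number,
/// and None for anything else (an expression, left to the engine).
fn classify_value(tokens: &[Tok]) -> Option<Literal> {
    match tokens {
        [Tok::Null] => Some(Literal::Null),
        [Tok::True] => Some(Literal::Bool(true)),
        [Tok::False] => Some(Literal::Bool(false)),
        [Tok::Number(n)] => Some(Literal::Number(n.clone())),
        [Tok::Str(s)] => Some(Literal::Text(s.clone())),
        // A leading minus directly followed by a single number is part of
        // the literal, not an operator.
        [Tok::Punct('-'), Tok::Number(n)] => Some(Literal::Number(format!("-{}", n))),
        // Arithmetic, function calls, column references, subqueries, ...
        _ => None,
    }
}
```

Under these assumptions `-5` classifies as a numeric literal while `1 + 2` and a bare column reference both classify as expressions. Whether a leading `+` should also count is not stated in this ADR, so the sketch leaves it out.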

Why not bind / converge. Binding was the first instinct and is rejected. The two proven gaps (a malformed literal slipping through; the offending value missing from errors) are closed by validation + retention alone — binding adds nothing to either. Meanwhile, executing the user's own text verbatim is already safe (their quoting stands; no re-quoting risk because we do not reconstruct), and binding/convergence would risk dragging in genuinely mode-specific behaviour (auto-fill — X4; natural-order column mapping) that must stay separate. So we share the validators and nothing else. This keeps the modes cleanly apart (requirements.md X5) while fixing the bug that they should not differ on: whether a learner's malformed value is caught.

2. What this means per statement

Execution is unchanged for every statement below; the only addition is a pre-execution validation of literal value positions.

INSERT … VALUES — every literal position (single- or multi-row) is validated against its column type before the verbatim insert runs; a malformed literal is refused with the same friendly wording the DSL uses (shared bind_for_column). Expression positions are skipped (nothing to validate). RETURNING / ON CONFLICT / INSERT … SELECT need no special handling — validation simply applies to whatever literal VALUES are present, and the statement still executes verbatim.
UPDATE … SET — SET col = <literal> is validated; SET col = <expr> is skipped. (Phase 2 — see §5.)
WHERE (UPDATE/DELETE) — not validated. WHERE is an expression in general; the value-feedback motivation is met by VALUES/SET (a constraint error names a written value). Deliberate scope choice, not an oversight.
SELECT — entirely unchanged. No data values to validate.

3. What it fixes

Validating the literal closes the validation gap: the malformed date 2025/01/15, which the characterization test showed slipping through, is now refused in advanced mode. Retaining the literal on the command closes the error-value gap (enrich_* reads it, so a constraint error shows the real value instead of the neutral "that value"). Completion hinting / highlighting is not delivered here; it needs a grammar-level change (§5, Phase 3). The neutral "that value" safety net (ADR-0035 Amendment 1) remains correct for genuinely computed expression values, where there is no input literal to show.
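The retention half of this can be pictured as a lookup with a fallback. The sketch below is illustrative only: `CapturedLiterals` and `offending_value_phrase` are hypothetical names, the exact wording of the friendly error is not taken from the codebase, and how the real enricher identifies the offending row and position is an assumption.

```rust
/// Hypothetical shape of the retained payload: one entry per row, one slot
/// per value position, None where the position held an expression.
type CapturedLiterals = Vec<Vec<Option<String>>>;

/// Return a phrase naming the offending value when a literal was retained,
/// and the neutral "that value" when the position was an expression or the
/// indices do not resolve.
fn offending_value_phrase(captured: &CapturedLiterals, row: usize, position: usize) -> String {
    match captured
        .get(row)
        .and_then(|r| r.get(position))
        .and_then(|slot| slot.as_ref())
    {
        Some(v) => format!("the value {}", v),
        None => "that value".to_string(),
    }
}
```

The point of the shape is that the neutral wording stays reachable: a computed position has no input literal, so the fallback is the correct answer there rather than a degraded one.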

4. Explicit requirement — retain the literals, change nothing else

Command::SqlInsert gains a captured-literals payload (per row, per position; None for an expression position) in addition to the existing raw text; Command::SqlUpdate gains the analogous set_literals payload in Phase 2 (§5). The executor validates from it and the error enricher reads it. The original source text is unchanged and is still what history.log records and replay re-runs (ADR-0034). The command variant, its execution, and plan_shortid_autofill are not modified. Validation reuses the existing value-binding helper (impl_value_for / Value::bind_for_column) for wording parity with the DSL. The resulting bound value is discarded, because we do not bind for execution; only its Result is used.
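A minimal sketch of "validate, keep only the Result" follows. Everything named here is hypothetical: `ColType`, `bind_for_column_sketch`, and `validate_row` stand in for the real helpers, whose signatures this ADR does not show, and the error strings are invented. The date check is shape-only (YYYY-MM-DD digits and dashes), not a calendar check.

```rust
// Hypothetical column types; the playground's real type vocabulary is larger.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int,
    Date,
}

/// Hypothetical stand-in for a per-type validator. Returns the bound
/// (normalised) value on success and a friendly message on failure.
fn bind_for_column_sketch(ty: ColType, raw: &str) -> Result<String, String> {
    match ty {
        ColType::Int => raw
            .parse::<i64>()
            .map(|n| n.to_string())
            .map_err(|_| format!("'{}' is not a whole number", raw)),
        ColType::Date => {
            let b = raw.as_bytes();
            // Shape check only: four digits, dash, two digits, dash, two digits.
            let shape_ok = b.len() == 10
                && b[4] == b'-'
                && b[7] == b'-'
                && b.iter()
                    .enumerate()
                    .all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit());
            if shape_ok {
                Ok(raw.to_string())
            } else {
                Err(format!("'{}' is not a YYYY-MM-DD date", raw))
            }
        }
    }
}

/// Validate the captured literals of one row before the verbatim statement
/// runs. None marks an expression position and is skipped.
fn validate_row(types: &[ColType], literals: &[Option<String>]) -> Result<(), String> {
    for (ty, lit) in types.iter().zip(literals) {
        if let Some(raw) = lit {
            // Only the Result matters: the bound value is dropped here, and
            // the statement text is later executed exactly as written.
            bind_for_column_sketch(*ty, raw)?;
        }
    }
    Ok(())
}
```

With these assumptions, a row of `42` and `2025-01-15` passes, the malformed `2025/01/15` is refused before execution, and a row made entirely of expression positions passes untouched.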

5. Mechanism + phasing

Phase 1 (this ADR's immediate work) — capture + validate + retain. At parse, build_sql_insert classifies each VALUES position from the matched path (a single literal token, or a signed number → a typed Value; anything else → an expression marker) and stores the per-row result on the command — no grammar change, no reparse. The executor validates the captured literals against the resolved column types before the verbatim insert; the enricher reads them. Covers single- and multi-row, with or without RETURNING/ON CONFLICT, because execution is untouched.
Phase 2 (implemented 2026-05-26) — UPDATE … SET literal validation. The same capture-at-parse technique on the SET assignment list: build_sql_update calls capture_set_literals, which walks the matched tokens (no reparse) and classifies each top-level SET col = <rhs> into (col, Some(Value)) for a bare literal (incl. a signed number) or (col, None) for an expression — using paren depth so a comma inside a function call or a where inside a scalar subquery is never mistaken for an assignment/clause boundary, and so the trailing top-level WHERE predicate is excluded. Command::SqlUpdate gains a set_literals payload; do_sql_update validates the literals against their column types (via the shared impl_value_for) before the still verbatim update; user_value_for_column reads them so a constraint error names the offending value. WHERE is deliberately not validated (§2).
Phase 3 — completion hinting / highlighting. This is the only part that needs a grammar change: a Choice(typed-literal-slot, sql_expr) at each value position (reusing the DSL's live column_value_list / TypedValueSlots — data.rs:141/189/269), so the column type drives a live hint and a mismatch highlights while typing. When Phase 3 lands, the typed slot supersedes Phase 1's classification of literals (the validation/enrichment built on top is unaffected — that is the only throwaway, by design).
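The paren-depth walk described for Phase 2 can be sketched as follows. This is an illustration under stated assumptions, not the contents of data.rs: the `Tok` and `Lit` types and the function `capture_set_literals_sketch` are hypothetical, the input is assumed to be the tokens following the SET keyword, and only a top-level WHERE is treated as the end of the assignment list (other trailing clauses are ignored for brevity).

```rust
// Hypothetical token type; the real matched-path tokens are not shown here.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Word(String), // keyword or identifier
    Number(String),
    Str(String),
    Punct(char),
}

#[derive(Debug, Clone, PartialEq)]
enum Lit {
    Null,
    Bool(bool),
    Number(String),
    Text(String),
}

/// Classify one right-hand side: Some for a bare literal, None for an expression.
fn classify_rhs(rhs: &[Tok]) -> Option<Lit> {
    match rhs {
        [Tok::Number(n)] => Some(Lit::Number(n.clone())),
        [Tok::Punct('-'), Tok::Number(n)] => Some(Lit::Number(format!("-{}", n))),
        [Tok::Str(s)] => Some(Lit::Text(s.clone())),
        [Tok::Word(w)] => match w.to_ascii_lowercase().as_str() {
            "null" => Some(Lit::Null),
            "true" => Some(Lit::Bool(true)),
            "false" => Some(Lit::Bool(false)),
            _ => None, // a column reference is an expression
        },
        _ => None,
    }
}

/// Walk the tokens after SET, splitting on top-level commas and stopping at
/// a top-level WHERE. Commas and `where` inside parentheses are ignored.
fn capture_set_literals_sketch(tokens: &[Tok]) -> Vec<(String, Option<Lit>)> {
    let mut depth: usize = 0;
    let mut start = 0;
    let mut end = tokens.len();
    let mut bounds = Vec::new(); // (start, end) of each top-level assignment
    for (i, tok) in tokens.iter().enumerate() {
        match tok {
            Tok::Punct('(') => depth += 1,
            Tok::Punct(')') => depth = depth.saturating_sub(1),
            Tok::Punct(',') if depth == 0 => {
                bounds.push((start, i));
                start = i + 1;
            }
            Tok::Word(w) if depth == 0 && w.eq_ignore_ascii_case("where") => {
                end = i;
                break;
            }
            _ => {}
        }
    }
    bounds.push((start, end));

    let mut out = Vec::new();
    for (s, e) in bounds {
        // Each assignment is expected to look like: col = <rhs...>
        if let [Tok::Word(col), Tok::Punct('='), rhs @ ..] = &tokens[s..e] {
            out.push((col.clone(), classify_rhs(rhs)));
        }
    }
    out
}
```

For a statement such as `SET name = 'Ann', n = -5, total = round(a, 2) WHERE id = 3`, this sketch yields a literal for `name` and `n`, an expression marker for `total`, and nothing for the WHERE predicate. Tracking depth rather than reparsing is what keeps the comma inside `round(a, 2)` from being read as an assignment boundary.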

6. Non-goals

Binding / statement reconstruction. Explicitly out. Execution stays verbatim. (This was the rejected first instinct.)
Collapsing command identity. Command::Insert and Command::SqlInsert stay distinct; ADR-0033 Amendment 3 stands.
Changing auto-fill. The simple-vs-advanced serial/shortid auto-fill difference (requirements.md X4) is untouched here and tracked separately as a possible bug.
A structural SELECT and a full typed SQL-expression AST — both out (queries and expressions stay text; ADR-0031's "no Expr AST" and ADR-0030 §4's full-surface guarantee stand).

Consequences

Advanced mode stops being a feedback-free zone for data values. A learner typing a malformed date/shortid/int literal in a SQL INSERT gets the same catch-and-explain they get in simple mode — via the shared validator, not a shared command.
The modes stay cleanly separate. Execution, auto-fill, and command identity are all unchanged; the only thing now shared is the value validators. This is the requirements.md X5 principle in practice (share a mechanic, not a command) and avoids the consolidation traps (X4 auto-fill) that the bind/converge approach would have hit.
Small, low-risk, no execution reconstruction. Because we do not rebuild the statement, there is no "mixed VALUES (?1, expr, ?2)" splicing problem, no multi-row execution change, and no RETURNING / ON CONFLICT / INSERT … SELECT special-casing; they keep working as the existing ADR-0033 tests assert.
One new seam to keep honest: the literal-vs-expression classification at parse. It must be tested (single literal / signed literal / NULL/true/false → validated; arithmetic / function / subquery → skipped), or it will drift.
A normalization difference is avoided, not introduced. We validate the literal but do not rewrite it; the engine stores the user's text as written. (Had we bound/normalized, advanced inserts might store a canonicalised value — a behaviour change we sidestep.)
Phase 3 will revisit literal detection (swapping the parse-time classification for typed slots that also drive hints). The validation/enrichment built on it is permanent; only the detection is provisional — a deliberate, documented small throwaway.
