rdbms-playground/docs/adr/0048-seed-fake-data-generation.md
claude@clouddev1 deb0948d6c feat(seed): year-as-int + conventional choice-set heuristics (#33, #34)
Two additive D7 catalogue rules, surfaced while writing the website seed
docs. No change to the type fallback, executor, or grammar.

#33 — year-like int columns. `published`/`birth_year` were just `int`, so
they fell to the unbounded int path and produced nonsense (`9419`). Add an
int-gated year rule (after the quantity rule, so `year_count` stays a
count): `year`/`*_year`/`published`/`founded` -> a bounded 1950-2025 year
(new `YearRecent`), or the dob-style birth window 1945-2007 for
`birth`/`born`/`dob` (new `YearBirth`). Plain int; not added to the D9
named-generator vocabulary.

#34 — conventional choice sets. A few enum-ish names have a near-canonical
small set that reads far better than lorem text. Add a type-gated PickFrom
lookup (reusing the existing generator): priority/prio, severity,
rating/stars. `status` is deliberately excluded (values too
domain-specific) and keeps the D12 advisory; a user IN-CHECK still wins.
`priority` leaves ENUM_TOKENS.

ADR-0048 Amendment 1; +8 tests (incl. a column-fill integration test that
also closes a pre-existing gap on that path).
2026-06-12 20:36:20 +00:00


ADR-0048: seed — fake-data generation command (SD1, opens SD2)

Status

Accepted (2026-06-11); Phase 1 + Phase 2 implemented (2026-06-11). Design settled with the user across an extended fork dialogue (every decision below was escalated and user-chosen), then hardened by a pre-build /runda Devil's-Advocate pass that found six blockers — undo integration (D15), replay semantics (D16), set value quoting (D2), CHECK-constraint handling (D17), a phase-ordering bug in the advisory (D13), and auto-show flooding (D18) — plus refinements (state-relative reproducibility, compound-FK tuple sampling, column-fill constraint rules, the fake dependency scan), all folded in.

Phase 1 shipped test-first across commits 202e25a (generation library + fake dependency) → f1e9484 (command skeleton) → 73493fa (FK sampling) → 9c13501 (uniqueness / junction / IN-CHECK) → 0b3ab3c (SeedResult / preview / advisory / count cap) → e6ff63d (single-transaction O(N) path) → fbd219b (--seed flag, ambient wiring, and a whole-implementation /runda pass). The post-implementation /runda found eight gaps — FK-sampling determinism (now ORDER BY), shortid reproducibility (now from the seeded RNG, so D4 holds with no exceptions), and six untested ADR decisions (D5/D15/D16/D17 + atomicity + zero-count), all closed. 2358 tests pass / 0 fail / 0 skip; clippy clean.

Implemented in Phase 1: the whole-row seed <table> [count] [--seed <n>] form and every D1–D18 decision except the two Phase-2 surfaces.

Phase 2 implemented (2026-06-11): both remaining surfaces, namely the set override clause (D2: fixed value / pick-list / named generator / range, quoted literals, type-aware) and the <table>.<column> column-fill form (D1 form 2: an UPDATE over existing rows, refusing PK/autogen targets, empty-table no-op, one undo step). The named-generator vocabulary (D9) lives in src/seed (KNOWN_GENERATORS / generator_for_name); a new range Generator (src/seed/generators.rs) backs between; the override clause is folded from the flat matched path (build_seed_overrides, src/dsl/grammar/data.rs) and applied to the per-column plan (apply_seed_overrides, src/db.rs), with column-fill in do_seed_column_fill. Full ambient wiring: completion (the generator vocabulary after as, the set/.col column slots), highlighting (HighlightClass::Function → tok_function, the generator slot), the validity indicator (IdentSource::Generators: an unknown name is flagged [ERR]), help, and parse-error pedagogy rows. The D13 advisory now carries its Phase-2/3 wording (points at set and the column-fill repair). A post-implementation /runda pass then added one user-chosen refinement: a bounded override on a UNIQUE column (a fixed value / too-short pick-list) is now a friendly error rather than a silent uniqueness cap (see D2). 2400 tests pass / 0 fail / 0 skip; clippy clean. Two implementation refinements vs. this ADR's wording, both of which meet the user-facing contract: dates in the range form are quoted (the D2 amendment, below; no date-literal token exists); and the set value slots reuse update's typed current_column_value (no spurious column-ref match) rather than the raw expression operand.

Further SD2 increments (custom user generators, NULL injection, multi-locale, recursive parent auto-seed) remain out of scope (see Out of scope).

Closes requirements.md SD1 and delivers the core of SD2 (per-type generators, determinism, the fake-backed catalogue). It also closes one of the two remaining gaps in A1 ("all canonical app-level commands") — seed; the other, hint (H2), is separate.

Builds on: ADR-0014 (data operations, the Value/Bound value model, the auto-show pattern, FK-error enrichment), ADR-0005/0011 (the type vocabulary and Type::fk_target_type()), ADR-0012/0013 (the column / relationship metadata tables, the rebuild-table primitive — read by seed for schema introspection), ADR-0024 (the unified grammar tree / CommandNode registration that gives completion, hints, help-id, usage-id for free), ADR-0022 (ambient typing assistance — the KNOWN_SQL_FUNCTIONS curated-vocabulary pattern that the generator-name list mirrors), ADR-0026 (the in (...) / between ... and ... expression grammar the override clause reuses), ADR-0027 (the validity-indicator diagnostics model), and ADR-0038 (the OutputStyleClass::Hint styled output used for the post-seed advisory). Honours ADR-0003 (both modes, no sigil), ADR-0009 (DSL conventions — keyword grammar, -- flags for opt-in choices, one sigil only), ADR-0002 (no engine name in user-facing strings), and ADR-0015 (per-command write-through persistence).

Context

seed <table> [count] is the last unbuilt data-authoring command in the requirements. The pedagogical value is high: a learner who has just modelled a schema wants rows to query against now, without hand-typing dozens of inserts. A teacher wants a one-liner that fills a demo database with believable data. SD1 commits to "plausible fake data; junction tables seeded with valid foreign-key references drawn from existing parent rows." SD2 deferred the how — "per-type generators, locale, determinism, override hooks" — explicitly pending this ADR.

The design conversation widened the scope deliberately, with the user confirming each step:

  • Realism matters more than minimalism for a teaching tool. Random text_a3f9 values teach nothing; Alice Martinez / alice.m@example.com make queries feel real. → adopt a faker library and make generation name-aware.
  • The column name is the strongest signal for what a value should look like, but it is ambiguous without the table for the name/title family (products.name ≠ users.name).
  • Heuristics will miss, so a manual override surface is required, not optional — this is SD2's "override hooks", brought forward.
  • Identifiers and enums are special: id-ish columns want uniqueness; status-ish columns have no sensible generic value and should be flagged, not guessed.

The novel work is the generation layer. Everything downstream — type validation, autogen autofill (serial/shortid), FK enforcement, per-command persistence, the auto-show outcome — is reused from the existing insert/update machinery as shared helper functions, per the X5 architecture preference (unique commands, with mechanics shared as library functions — not by emitting Command::Insert to borrow do_insert).

Decision

Add a dedicated seed command (its own AST variant and its own do_seed worker executor) available in both modes, with the surface and behaviour below. Generation is realistic, name- and table-aware, type-gated, with a manual override clause and a reproducibility flag.

Command classification (important, set by the replay decision D16). Although requirements.md A1 lists seed among the "app-level commands" (meaning: part of the canonical command surface, no sigil, both modes), seed is architecturally a data-authoring command — a sibling of insert/update/delete, not an app-lifecycle AppCommand. It is therefore not added to is_app_lifecycle_entry_word / completion's empty_input_offers_app_command_entry_keywords (those mirror the AppCommand set and must match — seed belongs in neither): replay re-runs it as a data write (D16).

D1 — Command surface (fork, user-chosen: "whole-row + column-fill")

Two forms:

  1. Whole-row generation: seed <table> [count]. Generates count new rows (an INSERT path). count defaults to 20 (D6) when omitted. Every user-fillable column is filled per the generation rules (D7–D12); serial/shortid autogen columns are left to the existing autofill helpers.

  2. Column-fill on existing rows: seed <table>.<column>. Fills <column> across the table's existing rows (an UPDATE path); it is the natural follow-up to add column. Combined with the set clause (D2) this is also the precise repair for a single mis-guessed column: seed users.work_addr set work_addr as email. Column-fill refuses PK columns and autogen (serial/shortid) columns with a friendly error (you don't "fill" an identity column), and respects the same UNIQUE / FK / required rules as whole-row generation (a UNIQUE target gets collision-free values; an FK target samples from the parent, D14). On an empty table it is a friendly no-op ("no rows to fill").

Zero / over-cap counts. seed <table> 0 is a friendly no-op; count over the maximum (D6) is a friendly error.

The column-restricted-insert form (seed t (a, b) — new rows, only some columns filled) was considered and rejected as marginal and constraint-fragile (see Alternatives).

Required-column block guard (user requirement). If seed cannot produce a value for a NOT NULL column — the only real case is a NOT NULL blob column, which has no DSL value path — it refuses the whole operation with a friendly error naming the column, rather than attempting a NULL insert that would violate the constraint. The check is a pre-flight over the resolved per-column plan, before any write.
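The pre-flight described above can be pictured as a single pass over the resolved column plan that runs before any write. The following is a minimal sketch only: `ColType`, `ColumnPlan`, and `preflight` are hypothetical names invented for illustration, not the project's actual types.

```rust
/// Hypothetical, simplified column types (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Text,
    Int,
    Blob,
}

/// Hypothetical per-column plan entry: just enough to express the guard.
struct ColumnPlan {
    name: String,
    ty: ColType,
    not_null: bool,
}

/// Refuse the whole operation if any NOT NULL column has no value path.
/// In this sketch the only such case is a NOT NULL blob column.
fn preflight(columns: &[ColumnPlan]) -> Result<(), String> {
    for column in columns {
        if column.not_null && column.ty == ColType::Blob {
            return Err(format!(
                "cannot seed: column '{}' is required but has no value that seed can generate",
                column.name
            ));
        }
    }
    Ok(())
}
```

Because the check returns before the first row is generated, a refusal leaves the table untouched.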

D2 — Manual override: the set clause (fork, user-chosen: "value + list + generator + range")

An optional, comma-separated set clause overrides generation per column. Four forms, all reusing existing grammar vocabulary so there is nothing new to learn:

| Form | Example | Meaning |
| --- | --- | --- |
| Fixed value | set status = 'pending' | every row gets the constant |
| Pick-from-list | set role in ('admin', 'editor', 'viewer') | uniform random choice from the list |
| Explicit generator | set work_addr as email | force a named generator (D9) |
| Range | set price between 10 and 100 | uniform in range; also dates: set signup between '2023-01-01' and '2024-12-31' |

Multiple clauses combine: seed users 20 set role in ('admin', 'user'), status = 'active', signup between '2023-01-01' and '2024-12-31'.

Override × UNIQUE capacity (post-implementation /runda, user-chosen: "friendly error"). A bounded override — a fixed value, or a pick-list — on a single-column-UNIQUE target (a UNIQUE column or a single-column PK) that offers fewer distinct values than the row count cannot fill the run; rather than let the D10 uniqueness machinery silently cap it (e.g. seed users 100 set email = 'x' → 1 row), seed refuses up front with a friendly error pointing at the fixes (use a generator, or a longer list). Generators and ranges are treated as effectively unbounded sources — if one genuinely exhausts, the D14 distinct-combination cap still applies. Compound uniqueness is exempt (the other key columns can still vary).
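A minimal sketch of the capacity rule above. The names `OverrideKind`, `distinct_capacity`, and `check_unique_capacity` are hypothetical and only mirror the four D2 forms; the real SeedOverride type may be shaped differently.

```rust
use std::collections::HashSet;

/// Hypothetical mirror of the four D2 override forms.
#[derive(Debug, Clone, PartialEq)]
enum OverrideKind {
    Fixed(String),
    PickList(Vec<String>),
    Generator(String),
    Range { low: String, high: String },
}

/// Number of distinct values a bounded override can supply.
/// `None` means "treated as effectively unbounded" (generators, ranges).
fn distinct_capacity(kind: &OverrideKind) -> Option<usize> {
    match kind {
        OverrideKind::Fixed(_) => Some(1),
        OverrideKind::PickList(items) => {
            let distinct: HashSet<&String> = items.iter().collect();
            Some(distinct.len())
        }
        OverrideKind::Generator(_) | OverrideKind::Range { .. } => None,
    }
}

/// Refuse up front when a bounded override cannot fill a single-column
/// UNIQUE target. Compound uniqueness is exempt, so the caller passes
/// `single_column_unique = false` for it.
fn check_unique_capacity(
    column: &str,
    kind: &OverrideKind,
    row_count: usize,
    single_column_unique: bool,
) -> Result<(), String> {
    if !single_column_unique {
        return Ok(());
    }
    match distinct_capacity(kind) {
        Some(capacity) if capacity < row_count => Err(format!(
            "'{}' must be unique, but the override offers only {} distinct value(s) for {} rows; use a generator or a longer list",
            column, capacity, row_count
        )),
        _ => Ok(()),
    }
}
```

Under these assumptions, seed users 100 set email = 'x' is rejected before any row is written, while set email as email passes the check.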

Quoting (fork, user-chosen: "quoted, grammar-consistent"). Text values and list items are quoted string literals ('admin'), exactly as everywhere else in the DSL; only numbers stay unquoted. Amendment (2026-06-11, Phase 2 build): the original wording said "numbers and dates stay unquoted", but this DSL has no date-literal token: Value is Number/Text only, and a date is a quoted string validated by bind_date ('2023-01-01') everywhere else (insert / update / where). An unquoted 2023-01-01 lexes as 2023,-,01,… and cannot parse. So dates in the range form are quoted (between '2023-01-01' and '2024-12-31'), which is in fact more faithful to this decision's own "quoted, grammar-consistent" principle. Numbers remain unquoted (NumberLit). This reuses the ADR-0026 expression grammar unchanged: the DA pass confirmed that the in (...) form's operands are typed value slots, so a bare admin would parse as a column reference (→ "unknown column"), not a string. Quoting is therefore not a style preference but a correctness requirement of grammar reuse. The range form is type-aware: numeric bounds for numeric columns, date bounds for date/datetime columns; a type-incompatible bound is a friendly error. =, in (...), and between ... and ... are the ADR-0026 expression operators; set is the ADR-0014 UPDATE keyword; as is borrowed from the SQL alias slot. The as <generator> operand is a bare name from the curated generator vocabulary (D9), not a value. The override takes precedence over every heuristic.

D3 — Generation library: fake crate + hand-rolled gaps (fork, user-chosen: "name-aware + realistic")

Add the fake crate (v5.x at time of writing; English locale for v1 per X2) for realistic values: names, emails, usernames, addresses, companies, phone numbers, lorem text, dates. Generation is driven by a per-column generator chosen by the heuristics (D7) or the override (D2), falling back to type-based generation (D8).

Implementation-time verifications (resolved 2026-06-11 when the dependency was added):

  • rand de-duplication — clean. fake 5.1.0 depends on rand = "0.10", the same major as the project's rand 0.10.1, so cargo tree -e normal resolves a single rand 0.10.1 (no runtime duplication; the rand 0.8.6 visible to cargo tree -i rand is only fake's own dev-dependency, never compiled for us). Consequence for D4: one seeded rand 0.10 StdRng can drive both fake's fake_with_rng and the hand-rolled generators — determinism is single-RNG, single-version, and shares shortid.rs's rand version.
  • fake module inventory / features — confirmed. Default features (["either"]) cover the core string fakers used here (Name/Internet/Address/Company/Lorem/PhoneNumber); fake's chrono feature is deliberately omitted (dates generated in-house for D8's bounded windows). No commerce/product module exists → product is hand-rolled (D9). (The exact faker call sites are pinned when the generation library is built.)
  • Security (new-dependency posture) — clean. The fake tree (296 packages total) scanned clean by all three mandated scanners: osv-scanner (no issues), grype (no vulnerabilities), trivy fs --scanners vuln (0). No findings to document or accept.

D4 — Determinism: --seed <n> (fork, user-chosen: "optional flag")

Generation is random by default. The optional --seed <n> flag makes a run reproducible: same database state + same --seed → identical data. The "database state" qualifier matters (DA refinement) — FK sampling (D14), identifier sequencing (D10), and UNIQUE collision-avoidance all read existing rows, so reproducibility is relative to the data already present, not absolute. Value: teachers hand out one dataset; demos are stable; and the feature's own tests can assert exact output (against a known starting state). Implemented with a seedable RNG threaded through every generator (no thread_rng on the seeded path). -- flag per ADR-0009 (opt-in choice). Naming note: the flag --seed and the command seed share a word but never collide grammatically (seed users 20 --seed 42 parses unambiguously). This flag is also the determinism lever for replay (D16): a recorded seed … --seed N line reproduces on replay; a bare seed … line regenerates fresh data.
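To illustrate the "one seeded RNG threaded through every generator" idea, here is a self-contained sketch. It deliberately uses a tiny SplitMix64 generator as a stand-in so the example has no dependencies; the implementation described above uses a seeded rand StdRng instead. `SeededRng`, `pick_role`, `small_quantity`, and `generate_rows` are hypothetical names.

```rust
/// Stand-in seedable generator (SplitMix64). An assumption made for this
/// sketch only: it replaces the seeded rand RNG the ADR describes.
struct SeededRng(u64);

impl SeededRng {
    fn new(seed: u64) -> Self {
        SeededRng(seed)
    }

    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Index in 0..n (n must be > 0). Plain modulo, so very slightly
    /// biased; good enough for a sketch.
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Every generator borrows the same RNG; none creates its own.
fn pick_role(rng: &mut SeededRng) -> &'static str {
    const ROLES: [&str; 3] = ["admin", "editor", "viewer"];
    ROLES[rng.below(ROLES.len())]
}

fn small_quantity(rng: &mut SeededRng) -> u32 {
    1 + rng.below(50) as u32
}

/// Same seed and same call order give the same rows.
fn generate_rows(seed: u64, count: usize) -> Vec<(&'static str, u32)> {
    let mut rng = SeededRng::new(seed);
    let mut rows = Vec::with_capacity(count);
    for _ in 0..count {
        let role = pick_role(&mut rng);
        let quantity = small_quantity(&mut rng);
        rows.push((role, quantity));
    }
    rows
}
```

Calling generate_rows(42, 20) twice yields identical vectors. The sketch takes no database input, which is exactly why it is reproducible in absolute terms; the real command also reads existing rows, hence the "same database state" qualifier.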

D5 — Both modes (A1)

seed is a canonical app-level command, available in simple and advanced mode, no sigil — like save/load/export/replay.

D6 — Default count: 20; bounded maximum

Omitted count → 20 rows: enough to make where, group by, order by, and limit meaningful without flooding the output pane. A maximum is enforced (proposed 10 000) to prevent a typo (seed t 1000000) from hanging the app or bloating the project; over the cap → friendly error stating the limit.
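The count rules (default, zero no-op from D1, and the cap) fit in one small resolution step. A sketch under stated assumptions: `CountPlan` and `resolve_count` are hypothetical names, and 10 000 is the proposed cap, not a confirmed constant.

```rust
const DEFAULT_COUNT: u32 = 20;
/// Proposed maximum from this decision; the shipped value may differ.
const MAX_COUNT: u32 = 10_000;

#[derive(Debug, PartialEq)]
enum CountPlan {
    /// `seed <table> 0`: friendly no-op, nothing is written.
    NoOp,
    Rows(u32),
}

fn resolve_count(requested: Option<u32>) -> Result<CountPlan, String> {
    match requested.unwrap_or(DEFAULT_COUNT) {
        0 => Ok(CountPlan::NoOp),
        n if n > MAX_COUNT => Err(format!(
            "seed can create at most {} rows at a time (asked for {})",
            MAX_COUNT, n
        )),
        n => Ok(CountPlan::Rows(n)),
    }
}
```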

D7 — Name-aware heuristics, type-gated (the catalogue)

A column's name selects a generator, but a name rule only fires when the column's type is compatible (a column named email typed int does not get a string — it falls through to type-based int). Matching is case-insensitive, token-based (split on _, camelCase, kebab), most-specific-first, with documented false-positive guards. The catalogue (representative; full table lives with the implementation):

| Column name (tokens) | Generator | Type gate |
| --- | --- | --- |
| first_name/fname · last_name/surname/lname | first / last name | text |
| name/full_name · title | table-context name (D11) | text |
| email/*_email | email | text |
| username/login/handle | username | text |
| password/pwd | password | text |
| phone/mobile/cell/tel | phone number | text |
| city/town · country · state/province | address parts | text |
| street/address/addr · zip/postcode/postal | address parts | text |
| company/employer/org · job/position/profession | company / job | text |
| description/bio/notes/summary/comment | sentence / paragraph | text |
| url/website/homepage · color/colour | URL / hex colour | text |
| price/amount/cost/salary/balance/total | currency-range number | numeric |
| age · quantity/qty/stock/count | 18–80 · small int | numeric |
| year/*_year/published/founded (Amendment 1) | bounded year (birth window for birth/born/dob, else 1950–2025) | int |
| priority/prio · severity · rating/stars (Amendment 1) | built-in PickFrom value set | text/int |
| date/*_date | date, recent ~3 yr window | date |
| dob/birthday | date, adult window (18–80 yr ago) | date |
| timestamp/datetime · created_at/updated_at/*_at | datetime, recent window (updated_at ≥ created_at) | datetime |
| is_*/has_*/active/enabled | boolean | bool |
| identifier family (D10) | unique sequential | int/text |
| enum-ish family (D12) | generic text + flag | (text) |

False-positive guards (documented): username/filename/table_name/*_name are handled before the bare name rule so they do not resolve to person-name; the bare name/title rule requires a standalone token or a recognised *_name suffix.
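The matching scheme (case-insensitive tokens, most-specific-first ordering, and the type gate) can be sketched as below. This is an illustration covering only a handful of catalogue rows; `tokens`, `select_rule`, `ColType`, and the `Rule` variants other than YearRecent/YearBirth are hypothetical names, and treating birth/born/dob as selecting the birth window ahead of the generic year rule is one reading of Amendment 1.

```rust
/// Split a column name on '_', '-', and camelCase boundaries; lowercase all.
fn tokens(name: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for ch in name.chars() {
        if ch == '_' || ch == '-' {
            if !current.is_empty() {
                out.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        // lower-to-upper transition marks a camelCase boundary
        if ch.is_uppercase() && prev_lower {
            out.push(std::mem::take(&mut current));
        }
        prev_lower = ch.is_lowercase() || ch.is_ascii_digit();
        current.extend(ch.to_lowercase());
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Text,
    Int,
}

#[derive(Debug, PartialEq)]
enum Rule {
    Username,
    Email,
    TableContextName,
    SmallCount,
    YearBirth,
    YearRecent,
    TypeFallback,
}

/// A name rule fires only under a compatible type; arm order encodes
/// "most specific first" (the quantity rule precedes the year rule).
fn select_rule(column: &str, ty: ColType) -> Rule {
    let toks = tokens(column);
    let has = |t: &str| toks.iter().any(|x| x == t);
    match ty {
        ColType::Text if has("username") || has("login") || has("handle") => Rule::Username,
        ColType::Text if has("email") => Rule::Email,
        ColType::Text if toks.len() == 1 && (has("name") || has("title")) => {
            Rule::TableContextName
        }
        ColType::Int if has("count") || has("qty") || has("quantity") || has("stock") => {
            Rule::SmallCount
        }
        ColType::Int if has("birth") || has("born") || has("dob") => Rule::YearBirth,
        ColType::Int if has("year") || has("published") || has("founded") => Rule::YearRecent,
        _ => Rule::TypeFallback,
    }
}
```

With this ordering, year_count typed int stays a count, birth_year gets the birth window, and a column named email typed int falls through to the type-based path.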

D8 — Type-based fallback

When no name rule matches (or a matching rule's type gate rejects the column), generate by type: text → realistic words/short phrase, int → bounded random, real → random double, decimal → formatted number, bool → random, date/datetime → bounded recent value (never "any point in all of history", per the user's date concern), serial/shortid → omitted (autogen helpers fill them), blob → unsupported (nullable → NULL; NOT NULL → D1 block guard).

D9 — Named generators + the product generator

The generators addressable via set ... as <generator> (D2) and chosen by D7 form a curated, named vocabulary: name, first_name, last_name, email, username, phone, city, country, street, zip, company, job, sentence, paragraph, url, color, price, age, date, datetime, bool, product, … This vocabulary is the single source of truth shared by the executor, the completion source, and the highlighter (mirroring KNOWN_SQL_FUNCTIONS, ADR-0022 Amendment 6).

product is hand-rolled (the fake crate has no commerce/product module — D3): {adjective} {material} {noun} from three small baked-in word lists (~20 each) → "Sleek Bamboo Keyboard", "Vintage Leather Backpack". Seedable through the D4 RNG. Always addressable as set <col> as product, and auto-selected by D11 for the name/title family in product-ish tables.
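A sketch of the {adjective} {material} {noun} shape. The word lists here are shortened, invented placeholders (the real lists hold about 20 entries each), and `product_name` takes a hypothetical `pick` closure standing in for the D4 seeded RNG: pick(n) must return an index in 0..n.

```rust
// Placeholder word lists, shortened for the example.
const ADJECTIVES: [&str; 4] = ["Sleek", "Vintage", "Compact", "Rugged"];
const MATERIALS: [&str; 4] = ["Bamboo", "Leather", "Steel", "Cotton"];
const NOUNS: [&str; 4] = ["Keyboard", "Backpack", "Lamp", "Bottle"];

/// Build one product name. `pick(n)` stands in for "draw an index in 0..n
/// from the seeded RNG", so the result is reproducible under --seed.
fn product_name(mut pick: impl FnMut(usize) -> usize) -> String {
    let adjective = ADJECTIVES[pick(ADJECTIVES.len())];
    let material = MATERIALS[pick(MATERIALS.len())];
    let noun = NOUNS[pick(NOUNS.len())];
    format!("{} {} {}", adjective, material, noun)
}
```

Passing the index source in as a closure keeps the generator free of any RNG of its own, which is what makes it seedable through D4.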

D10 — Identifier family → unique by name (fork, user-chosen: "unique sequential")

A column in the identifier family — id, *_id that is not an FK, code, sku, ref/reference, number/no, barcode — that is not a serial/shortid autogen column and not the PK is treated as an identifier and gets unique values: int → sequential (MAX(col)+1 ascending, reads like real ids, never collides); text → unique short code (generate-with-retry). Precedence: FK detection wins over this rule (an FK user_id should have duplicates — many children per parent), so *_id only triggers uniqueness when the column is not a foreign key.

Constraint-driven uniqueness is independent and mandatory: any column with a UNIQUE constraint (or a user-fillable single-column PK) gets guaranteed-unique generation regardless of name — a correctness requirement, not a heuristic. Generation for such columns uses retry/sequence to guarantee no collision within the batch and against existing rows.
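The two uniqueness strategies above (sequence for int, generate-with-retry for text) can be sketched as follows. `next_sequential_ids` and `unique_codes` are hypothetical helpers; starting at 1 for an empty table and the explicit attempt limit are assumptions of this sketch.

```rust
use std::collections::HashSet;

/// Int identifiers: continue ascending from the current MAX(col).
/// `existing_max` is None for an empty table (assumed to start at 1).
fn next_sequential_ids(existing_max: Option<i64>, count: usize) -> Vec<i64> {
    let start = existing_max.unwrap_or(0) + 1;
    (0..count as i64).map(|offset| start + offset).collect()
}

/// Text identifiers: draw candidates until `count` values are found that
/// collide neither with existing rows (`taken`) nor with this batch.
fn unique_codes(
    taken: &HashSet<String>,
    count: usize,
    max_attempts: usize,
    mut candidate: impl FnMut() -> String,
) -> Result<Vec<String>, String> {
    let mut used = taken.clone();
    let mut out = Vec::with_capacity(count);
    let mut attempts = 0;
    while out.len() < count {
        if attempts == max_attempts {
            return Err(format!(
                "could not find {} unique values in {} attempts",
                count, max_attempts
            ));
        }
        attempts += 1;
        let code = candidate();
        // `insert` returns false on a collision, so the draw is retried
        if used.insert(code.clone()) {
            out.push(code);
        }
    }
    Ok(out)
}
```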

D11 — Table-context disambiguation for name/title (fork, user-chosen: "table-context-aware")

For the name/title family only, the heuristic also reads the table name token:

  • product/item/goods/merchandise/catalog/inventory → product generator (D9)
  • company/companies/vendor/supplier/manufacturer/brand → company name
  • user/customer/person/people/employee/member/contact/author/student → person name
  • unrecognised table → generic word

This resolves the real ambiguity (products.name → "Sleek Bamboo Keyboard"; users.name → "Alice Martinez"; vendors.name → "Globex Corp"). It is a deliberately scoped use of table context — the only place the table name influences generation.
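The table-name lookup can be sketched as a small classifier. `NameKind`, `in_family`, and `name_kind_for_table` are hypothetical names, and accepting a trailing "s" as a plural is an assumption of this sketch (the source only says the table name token is read).

```rust
#[derive(Debug, PartialEq)]
enum NameKind {
    Product,
    Company,
    Person,
    GenericWord,
}

const PRODUCT_TABLES: &[&str] =
    &["product", "item", "goods", "merchandise", "catalog", "inventory"];
const COMPANY_TABLES: &[&str] =
    &["company", "companies", "vendor", "supplier", "manufacturer", "brand"];
const PERSON_TABLES: &[&str] = &[
    "user", "customer", "person", "people", "employee", "member", "contact", "author", "student",
];

/// True if the table name equals a keyword or is its simple "-s" plural
/// (the plural rule is an assumption made for this example).
fn in_family(table: &str, family: &[&str]) -> bool {
    family
        .iter()
        .any(|keyword| table == *keyword || table.strip_suffix('s') == Some(*keyword))
}

/// Decide which generator the name/title family should use for a table.
fn name_kind_for_table(table: &str) -> NameKind {
    let table = table.to_lowercase();
    if in_family(&table, PRODUCT_TABLES) {
        NameKind::Product
    } else if in_family(&table, COMPANY_TABLES) {
        NameKind::Company
    } else if in_family(&table, PERSON_TABLES) {
        NameKind::Person
    } else {
        NameKind::GenericWord
    }
}
```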

D12 — Enum-ish names → generic + post-seed advisory (fork, user-chosen: "flag enum-ish only")

Enum-ish names (role, status, type, state, kind, category, level, tier, stage, gender) have no sensible generic generator, so they are not guessed: they fall through to generic text (they must still be filled; a NOT NULL status cannot be left empty). priority was originally part of this family; Amendment 1 moved it to the built-in choice-set rule (D7). Seed then emits a post-seed advisory (D13) naming them and pointing at the set ... in (...) override.

D13 — Reporting: post-seed advisory (fork, user-chosen: "flag enum-ish only")

After a successful seed, in addition to the normal auto-show outcome (row count + the affected rows, per ADR-0014), seed appends an OutputStyleClass::Hint advisory only when one or more enum-ish columns (D12), or columns guarded by a CHECK that seed could not derive values from (D17), were filled generically.

The wording is phase-aware (DA finding: the advisory must not name features that ship later). In Phase 1 (no set clause yet) it names the columns and explains they were filled generically. From Phase 2/3 it points at the concrete repair:

# Phase 1 wording:
✓ Seeded 20 rows into users
   status, role were filled with generic text — they look like
    fixed value sets you may want to choose deliberately.

# Phase 2/3 wording (set clause + column-fill exist):
✓ Seeded 20 rows into users
   status, role filled generically. Fix existing rows with
    seed users.status set status in ('active','inactive'),
    or pass  set …  on the next seed.

Note the repair for already-seeded rows is the column-fill form (seed users.status set …), not "re-seed" (which would add more rows) — DA correction. This is a result-time note (cheap, reusing ADR-0038's hint rendering), not a typing-time warning. The fuller "per-column report" (every column → its generator) was considered and deferred (see Alternatives / Out of scope).

D14 — Foreign keys (SD1; fork on empty-parent, user-chosen: "friendly error")

  • Each FK is filled by sampling uniformly from the existing rows of the parent table's referenced column(s). Duplicates are expected and correct (many children per parent). For a compound FK, the referenced tuple is sampled jointly (a whole existing parent key), never per-column independently — independent sampling could fabricate a (a, b) pair that exists in no parent row and would fail FK enforcement (DA refinement).
  • Empty parent → seed refuses with a friendly error naming the parent and the FK column ("seed users first — orders.user_id references it"). Safe, predictable, teaches FK dependency order. Recursive parent auto-seed is deferred to a future --recursive opt-in (Out of scope).
  • Junction / compound-PK tables (SD1's explicit case): sample distinct combinations of the parent PK tuples to satisfy the compound PK's uniqueness; if count exceeds the number of available distinct combinations, cap at the maximum and note it in the outcome.
  • Self-referential FK (manager_id → id): if nullable, leave NULL or point at an earlier row in the same batch; if NOT NULL on an otherwise-empty table, friendly error. Documented edge case.
  • Nullable FKs are always filled in v1 (predictable); occasional-NULL injection is deferred.
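Two of the rules above lend themselves to a short sketch: joint sampling of a compound parent key, and distinct-combination sampling for a junction table with the over-cap note. Function names are hypothetical, and `pick(n)` again stands in for the seeded RNG (it must return an index in 0..n). The junction helper materialises every combination index, which is acceptable for a sketch but would need care for very large parents.

```rust
/// Sample whole parent keys. `K` is the full referenced tuple, so a
/// compound FK is never assembled from independently chosen columns.
fn sample_parent_keys<K: Clone>(
    parent_keys: &[K],
    count: usize,
    mut pick: impl FnMut(usize) -> usize,
) -> Result<Vec<K>, String> {
    if parent_keys.is_empty() {
        return Err("the parent table has no rows; seed it first".to_string());
    }
    Ok((0..count)
        .map(|_| parent_keys[pick(parent_keys.len())].clone())
        .collect())
}

/// Junction rows: distinct (left, right) pairs via a partial Fisher-Yates
/// shuffle over the combination indices. The returned bool reports
/// whether the request was capped at the number of available pairs.
fn junction_pairs<A: Clone, B: Clone>(
    left: &[A],
    right: &[B],
    count: usize,
    mut pick: impl FnMut(usize) -> usize,
) -> (Vec<(A, B)>, bool) {
    let total = left.len() * right.len();
    let wanted = count.min(total);
    let mut slots: Vec<usize> = (0..total).collect();
    let mut out = Vec::with_capacity(wanted);
    for i in 0..wanted {
        // choose one of the not-yet-used slots and move it into position i
        let j = i + pick(total - i);
        slots.swap(i, j);
        let index = slots[i];
        out.push((
            left[index / right.len()].clone(),
            right[index % right.len()].clone(),
        ));
    }
    (out, wanted < count)
}
```

With two parents of two rows each, asking for 10 junction rows yields the 4 distinct pairs and a capped flag, matching the "cap at the maximum and note it" rule.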

D15 — Undo: one snapshot per seed (DA finding; ADR-0006)

Seed is a mutation, so it must participate in undo. The draft omitted this; the DA found the codebase already has the right primitive — BeginBatch / EndBatch (db.rs), used by replay so a multi-write run collapses to one boundary snapshot. do_seed wraps its generated writes in begin_batch / end_batch, so seed users 20 is a single undo step, not 20 — matching ADR-0006 Amendment 1's batch model. Column-fill's bulk UPDATE is likewise one step. (import remains the only data-affecting op outside undo, per ADR-0015 §11; seed is firmly inside it.)

D16 — Replay: seed re-runs as a data write (fork, user-chosen)

replay re-executes a recorded seed line as a data-write command — it is not in the app-lifecycle skip-set (see Command classification, above). Consequence, accepted by the user: a bare seed users 20 regenerates fresh, divergent data on each replay; a seed users 20 --seed 42 line (the determinism lever, D4) reproduces the original data. This keeps seed faithful to its nature as a data write and puts reproducibility exactly where the --seed flag already lives. (Seeded data is in any case durable independently of replay, via the ADR-0015 CSV store + rebuild; replay is the scripting re-run path, U4.) The DA confirmed the wiring trap: because seed is not an AppCommand, it is correctly absent from is_app_lifecycle_entry_word and replay dispatches it through the normal data path rather than aborting.

D17 — CHECK constraints: derive from simple IN, else friendly-fail (fork, user-chosen)

A CHECK on a generically-filled column would otherwise fail the whole batch (DA finding — the block guard only covered NOT NULL blob). Two-tier handling, per the user:

  1. Derive from simple IN-CHECKs. When a column's CHECK is the common enum-as-CHECK shape — col IN ('a', 'b', …) (the column's own CHECK, single-column, literal list) — seed parses out the allowed values and uses them as the generator (uniform choice). The frequent CHECK (status IN ('active','closed')) case then "just works" with no override needed.
  2. Best-effort + friendly fail for the rest. For CHECKs seed cannot interpret (ranges, expressions, multi-column), it generates best-effort; if a generated row violates the CHECK, the insert fails through the existing H1 friendly-error layer (ADR-0019) naming the constraint and pointing at set. Such CHECK-guarded columns are also pre-flagged in the advisory (D13) alongside enum-ish names, so the user is warned before hitting the failure.

No new CHECK engine — tier 1 is a narrow literal-IN parse over the CHECK text already stored in metadata; tier 2 is the existing failure path.
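Tier 1's "narrow literal-IN parse" might look like the sketch below. It accepts only the exact shape col IN ('a', 'b', …) over the column's own name and returns None for anything else, so unrecognised CHECKs fall to tier 2. `derive_in_check` and `strip_prefix_ci` are hypothetical helpers; how the CHECK text is actually stored and normalised in metadata is not shown in the source.

```rust
/// Case-insensitive (ASCII) prefix strip.
fn strip_prefix_ci<'a>(text: &'a str, prefix: &str) -> Option<&'a str> {
    let head = text.get(..prefix.len())?;
    if head.eq_ignore_ascii_case(prefix) {
        Some(&text[prefix.len()..])
    } else {
        None
    }
}

/// Parse `column IN ('a', 'b', ...)` and return the literal values.
/// Any other shape (ranges, expressions, other columns) yields None.
fn derive_in_check(column: &str, check: &str) -> Option<Vec<String>> {
    let rest = strip_prefix_ci(check.trim(), column)?;
    if !rest.starts_with(char::is_whitespace) {
        return None; // e.g. "statuses ..." when the column is "status"
    }
    let rest = strip_prefix_ci(rest.trim_start(), "in")?;
    let inner = rest.trim_start().strip_prefix('(')?.strip_suffix(')')?;

    let mut chars = inner.chars().peekable();
    let mut values = Vec::new();
    loop {
        while chars.peek().map_or(false, |c| c.is_whitespace()) {
            chars.next();
        }
        if chars.next()? != '\'' {
            return None; // only quoted string literals are accepted
        }
        let mut literal = String::new();
        loop {
            let c = chars.next()?; // unterminated literal: give up
            if c != '\'' {
                literal.push(c);
            } else if chars.peek() == Some(&'\'') {
                chars.next(); // doubled quote is an escaped quote
                literal.push('\'');
            } else {
                break;
            }
        }
        values.push(literal);
        while chars.peek().map_or(false, |c| c.is_whitespace()) {
            chars.next();
        }
        match chars.next() {
            None => return Some(values),
            Some(',') => {}
            Some(_) => return None,
        }
    }
}
```

The derived list can then feed the same uniform pick-from-list generator that backs set ... in (...).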

D18 — Auto-show is capped for large seeds (DA finding)

ADR-0014 auto-show renders "the affected rows" — fine for one insert, a wall for a 10 000-row seed. Seed's outcome shows a capped preview (proposed first 20 rows) with a (showing 20 of N) note, not the full set. The row count is always reported in full; only the rendered table is capped.
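A sketch of the capped preview: the rendered rows are truncated while the note still reports the full count. `preview` and `PREVIEW_ROWS` are hypothetical names, and 20 is the proposed preview size.

```rust
/// Proposed preview size from this decision.
const PREVIEW_ROWS: usize = 20;

/// Return the rows to render plus an optional "(showing N of M)" note.
/// The note is present only when the rendered table was actually capped.
fn preview<T>(rows: &[T]) -> (&[T], Option<String>) {
    let total = rows.len();
    if total <= PREVIEW_ROWS {
        (rows, None)
    } else {
        (
            &rows[..PREVIEW_ROWS],
            Some(format!("(showing {} of {})", PREVIEW_ROWS, total)),
        )
    }
}
```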

Grammar, AST, and cross-cutting wiring

Per ADR-0024, seed is registered as a CommandNode so completion, hints, help, and usage flow from one definition. The wiring, as explicit acceptance criteria (a /runda pass must verify each — ADR-0045 showed "claimed verified" is not verified):

  • AST + executor. A dedicated command variant (Seed { table, target_column: Option<String>, count: Option<u32>, overrides: Vec<SeedOverride>, rng_seed: Option<u64> }) and a dedicated do_seed worker executor. do_seed reuses shared helpers (value binding impl_value_for, autogen autofill, FK enrichment, the multi-row parameterised-insert pattern of plan_autogen_autofill, the UPDATE path for column-fill, per-command persistence, the begin_batch/end_batch undo primitive of D15) as library functions — it does not emit Command::Insert/Command::Update (X5).
  • Replay / undo classification (D15/D16). do_seed brackets its writes in one batch (one undo step). The seed entry word is deliberately absent from is_app_lifecycle_entry_word and completion's empty_input_offers_app_command_entry_keywords (the AppCommand mirror) so replay re-runs it as a data write — an explicit acceptance check, since the default for an unlisted recognised command must be "replayed", not "abort".
  • Completion sources: table-name (existing tables); .column and set-clause column slots (columns of the named table); the generator-name vocabulary (D9) after as; count number; set / = / in / as / between / and keywords; --seed flag.
  • Syntax highlighting: seed keyword; the generator-name vocabulary highlighted as tok_function (reuse the existing ADR-0022 Amendment 6 blue; no new theme colour).
  • Hints: ambient per-slot "what's next" and usage hints, both modes.
  • Help: help seed topic (help_id + per-command block); the general help list picks it up automatically via REGISTRY.
  • Parse-error pedagogy (ADR-0042): near-miss matrix rows for seed (bare / missing-table / wrong-token / malformed set), both modes.
  • Validity indicator (ADR-0027): typing-time [ERR]/[WRN] for unknown table, unknown column (in .column or set), unknown generator name after as.
  • No DSL→SQL teaching echo (ADR-0038). seed is a utility/app command, not a DSL form of a SQL statement, so the echo does not apply. (A future "show the generated INSERTs" is out of scope — it would dump count statements.)

Implementation phasing

Design is whole; the implementation is phased into reviewable, test-first commits:

  1. Core whole-row seed (done, Phase 1): grammar/AST/executor; type-based generation + the fake-backed name heuristics (D7/D8/D11); identifier uniqueness (D10) + constraint uniqueness; FK sampling (joint tuples) + empty-parent error + junction distinct-combos (D14); --seed determinism (D4); default count + cap + zero-no-op (D6/D1); required-column block guard (D1); undo batch (D15); replay-as-data-write classification (D16); CHECK derive / friendly-fail (D17); capped auto-show (D18); the enum/CHECK advisory in its Phase-1 wording (D12/D13); full ambient wiring; both modes.
  2. The set override clause (D2) (done, Phase 2) — value / list / generator / range, type-aware, with completion + highlight + validity for the generator-name slot.
  3. Column-fill mode (seed <table>.<column>, D1 form 2) (done, Phase 2) — the UPDATE path.

Each phase is independently green before the next. (Phases 2 and 3 landed together — they share the set-override executor machinery, so splitting them risked a state where set parsed but column-fill silently no-op'd.)

Testing (ADR-0008 tiers 1–3; test-first)

  • Tier 1 (unit, deterministic via --seed): generator selection (name × type-gate matrix, including every false-positive guard of D7); table-context disambiguation (D11); identifier uniqueness and the FK-wins-over-*_id precedence (D10); bounded-date windows (D8); the product generator shape; override resolution + precedence (D2); the required-column block guard (D1); the count cap (D6). Exact-value assertions are possible because --seed fixes the RNG.
  • Tier 2 (insta snapshots): the seeded data table render and the enum advisory (D13) at representative sizes, light + dark.
  • Tier 3 (integration, full event loop): seed users 20 end to end (rows land in db + CSV + history, auto-show, persistence); FK sampling against a populated parent (incl. a compound FK, where every child tuple exists in the parent); empty-parent friendly error; junction seeding with distinct combinations and the over-cap note; the set clause forms (quoted literals); column-fill on existing rows (incl. refusal of PK/autogen targets, empty-table no-op); reproducibility (--seed 42 twice → identical data from a fixed state); both modes. Plus the DA-driven cases: one-undo-step (seed then a single undo removes all rows); replay of a bare seed line (divergent) vs a --seed line (reproduced); IN-CHECK auto-derivation ("just works") and a complex-CHECK friendly failure; capped auto-show on a large seed.

"All green, no skips" is the only acceptable end state; the Phase-1 baseline (2290 passing / 0 failing / 0 skipped / 1 ignored doctest) is the regression floor.

Out of scope / deferred (future SD2 work)

  • Recursive parent auto-seed (--recursive) — D14 errors instead.
  • NULL injection for nullable columns (teaching optional relationships / IS NULL) — v1 always fills.
  • Multi-locale generation — English only (X2).
  • User-defined custom generators (true "override hooks" — register a named generator) — the set ... as <builtin> surface covers the common need; custom generators are a later SD2 increment.
  • Full per-column seed report — D13 flags enum-ish only.
  • Column-restricted insert (seed t (a, b)) — rejected (D1).
  • "Show the generated SQL" teaching echo for seed.

Alternatives considered

  • Hand-rolled generators only (no fake): minimal dependency, but synthetic-looking data (text_a3f9) — rejected on pedagogy (pedagogy wins ties).
  • Type-only generation (no name awareness): simpler, but misses the biggest UX win (a users table that reads like real people) — rejected.
  • Column-name-only name (no table context): leaves products.name → person names, requiring a manual override on every product/company table — rejected for the name/title family (D11).
  • No override clause (heuristics + type only): leaves no way to say "the heuristic guessed wrong, fix it" and no way to supply values for enum columns — rejected; the set clause (D2) is the answer to the user's Q3.
  • Recursive auto-seed of empty parents: powerful but magical and can seed tables the user did not name — deferred behind a future flag (D14).
  • Always-random (no --seed): simplest, but no reproducible datasets and weaker tests — rejected (D4).
  • Full per-column report by default: a nice teaching artifact but verbose on wide tables — deferred; flag-only advisory chosen (D13).
  • Reuse Command::Insert/do_insert directly from seed: tempting for code reuse, but collapses command identity and violates X5 — rejected in favour of a dedicated do_seed that calls shared helpers.
  • Skip seed on replay (classify as app-lifecycle, D16): consistent with A1's "app-level" label and avoids divergent data, but seed is a data write and silently skipping it on a scripted re-run is surprising — rejected; --seed is the determinism lever instead.
  • Bare-word set list items (in (admin, …), D2): matched the early mockups and reads cleaner, but bare words are column references in the reused grammar (would error) and would force a custom list form — rejected for quoted literals (grammar reuse + DSL consistency).
  • Pre-flight refuse any CHECK-bearing table (D17): safest but blocks seeding too many legitimate tables — rejected for the derive-IN-else-friendly-fail tier.
  • set-driven NULL / per-column report / recursive parent seed: deferred — see Out of scope.

Amendment 1 — year-as-int + conventional choice sets (2026-06-12)

Two SD2-style refinements to the D7 catalogue, surfaced while writing the website seed docs. Both are additive name rules; no change to D8 (type fallback), the executor, or the grammar.

Issue #33 — year-like int columns

A column such as published or birth_year was just an int, so it fell through to the unbounded type-based int path (D8) and produced nonsense like 9419 or 1426 — implausible as years, undercutting the "realistic data" pedagogy. Added an int-gated year rule, placed after the quantity rule (so year_count stays a count):

  • year / *_year / published / founded → YearRecent, a bounded window of 1950–2025 (75 years relative to the fixed REF_YEAR, wide enough for published books / founding years / release years; matches the issue's own between 1950 and 2020 workaround).
  • the same with a birth / born / dob token (e.g. birth_year) → YearBirth, mirroring the existing dob → DateAdult adult birth window as years (1945–2007).

Both emit a plain int. published / founded are included (user-confirmed): an int so named is almost always a year (a flag would be is_published). The generators are not added to the D9 named-generator vocabulary — explicit control stays with set <col> between <lo> and <hi>.
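The ordering described above (int gate, then the quantity rule, then the year rule) can be sketched as follows. This is a minimal illustration, not the project's code: `YearGen`, `year_rule`, `has_any`, the `QUANTITY_TOKENS` list, and the underscore token split are all assumptions made for the example. Only the variant names and the two year windows come from this amendment.

```rust
/// The two year generators named in this amendment, wrapped in a
/// hypothetical enum for the sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum YearGen {
    YearRecent, // 1950..=2025
    YearBirth,  // 1945..=2007
}

impl YearGen {
    fn bounds(self) -> (i32, i32) {
        match self {
            YearGen::YearRecent => (1950, 2025),
            YearGen::YearBirth => (1945, 2007),
        }
    }

    /// Map an arbitrary random draw into the inclusive window.
    fn pick(self, draw: u64) -> i32 {
        let (lo, hi) = self.bounds();
        let span = (hi - lo + 1) as u64;
        lo + (draw % span) as i32
    }
}

/// Assumed quantity tokens; the real quantity rule's list is not given here.
const QUANTITY_TOKENS: [&str; 4] = ["count", "qty", "quantity", "total"];

fn has_any(tokens: &[&str], names: &[&str]) -> bool {
    tokens.iter().any(|t| names.iter().any(|n| n == t))
}

/// Returns a year generator for year-like int columns, else None.
fn year_rule(column: &str, is_int: bool) -> Option<YearGen> {
    if !is_int {
        return None; // int-gated: text/date columns never match
    }
    let lower = column.to_lowercase();
    let tokens: Vec<&str> = lower.split('_').filter(|t| !t.is_empty()).collect();
    if has_any(&tokens, &QUANTITY_TOKENS) {
        return None; // quantity rule runs first, so year_count stays a count
    }
    if !has_any(&tokens, &["year", "published", "founded"]) {
        return None;
    }
    if has_any(&tokens, &["birth", "born", "dob"]) {
        Some(YearGen::YearBirth)
    } else {
        Some(YearGen::YearRecent)
    }
}
```

Under these assumptions `published` and `birth_year` resolve to `YearRecent` and `YearBirth` respectively, while `year_count` is claimed by the quantity check before the year rule sees it.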

Issue #34 — built-in value sets for conventional choice names

D12 deliberately does not guess values for enum-ish names. For a few, though, there is a near-canonical small set that reads far better than lorem text. Added a type-gated PickFrom lookup (reusing the existing generator — no new machinery), placed ahead of the enum-ish fallthrough:

| Name (tokens)   | text                     | int     |
|-----------------|--------------------------|---------|
| priority / prio | low/medium/high          | 1/2/3   |
| severity        | low/medium/high/critical | 1/2/3/4 |
| rating / stars  |                          | 1–5     |

A user-declared IN-CHECK (D17) still wins — it is resolved before the heuristics. Any name that gains a set is removed from the enum-ish advisory trigger (priority left ENUM_TOKENS); since the advisory (D13) only fires on Generator::Generic, a PickFrom name is excluded either way, but the removal keeps is_enum_ish semantically "names seed still can't guess".
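A type-gated lookup of this kind might look like the sketch below. `ColType`, `choice_set`, and the constant names are hypothetical, and returning the int sets as string slices is a simplification for the example. The value sets are the ones listed in this amendment; treating rating / stars as int-only is an assumption read from the table.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ColType {
    Text,
    Int,
}

const LEVELS_3: &[&str] = &["low", "medium", "high"];
const LEVELS_4: &[&str] = &["low", "medium", "high", "critical"];
const INTS_3: &[&str] = &["1", "2", "3"];
const INTS_4: &[&str] = &["1", "2", "3", "4"];
const INTS_5: &[&str] = &["1", "2", "3", "4", "5"];

/// Returns the conventional value set for a column name, or None when the
/// name has no built-in set (e.g. status, which stays with the advisory).
fn choice_set(column: &str, ty: ColType) -> Option<&'static [&'static str]> {
    let lower = column.to_lowercase();
    // Token match on underscore-separated parts (an assumption of this sketch).
    let has = |names: &[&str]| lower.split('_').any(|t| names.iter().any(|n| *n == t));

    if has(&["priority", "prio"]) {
        return Some(match ty {
            ColType::Text => LEVELS_3,
            ColType::Int => INTS_3,
        });
    }
    if has(&["severity"]) {
        return Some(match ty {
            ColType::Text => LEVELS_4,
            ColType::Int => INTS_4,
        });
    }
    if has(&["rating", "stars"]) && ty == ColType::Int {
        return Some(INTS_5);
    }
    None
}
```

In this sketch a caller would consult a declared IN-CHECK first and only fall back to `choice_set` when none exists, matching the precedence stated above.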

status is deliberately excluded (user-confirmed on the issue): its real values are too domain-specific (active/inactive, open/closed/pending, draft/published, …), so it keeps the D12 "don't guess" stance — generic text + the advisory pointing at set status in (…). state stays its US-state-name generator (D7); type/kind/category/stage/gender and size/tier/plan were considered and left to the advisory.

Website follow-up (tracked on the website branch, not here): the seed cast exercises a tickets table with priority; it should be re-recorded so the table tightens once priority collapses to a short value — likely subsumed by the pre-publication cast sweep.