rdbms-playground/docs/adr/0048-seed-fake-data-generation.md
claude@clouddev1 78c38e8b33 docs: ADR-0048 Phase 1 accepted/implemented + handoff 65
- ADR-0048 status -> Accepted; Phase 1 implemented (commits
  202e25a..fbd219b), with the pre-build and post-implementation /runda
  passes and the 2358-test green state recorded; index entry updated.
- requirements.md: SD1 [x] (whole-row seed + FK/junction, both modes,
  --seed reproducibility with no exceptions), SD2 [/] (core generators /
  determinism done; the set override clause + column-fill are Phase 2),
  A1 14/15 (only hint/H2 remains unregistered).
- Handoff 65: the full seed Phase-1 build, the two /runda passes, where
  the code lives, and Phase-2 / next steps.
2026-06-11 21:49:06 +00:00

# ADR-0048: `seed` — fake-data generation command (SD1, opens SD2)
## Status
**Accepted (2026-06-11); Phase 1 implemented (2026-06-11).** Design
settled with the user across an extended fork dialogue (every decision
below was escalated and user-chosen), then hardened by a pre-build
`/runda` Devil's-Advocate pass that found six blockers — undo
integration (D15), replay semantics (D16), `set` value quoting (D2),
CHECK-constraint handling (D17), a phase-ordering bug in the advisory
(D13), and auto-show flooding (D18) — plus refinements (state-relative
reproducibility, compound-FK tuple sampling, column-fill constraint
rules, the `fake` dependency scan), all folded in.
**Phase 1 shipped** test-first across commits `202e25a` (generation
library + `fake` dependency) → `f1e9484` (command skeleton) →
`73493fa` (FK sampling) → `9c13501` (uniqueness / junction / IN-CHECK) →
`0b3ab3c` (`SeedResult` / preview / advisory / count cap) →
`e6ff63d` (single-transaction O(N) path) → `fbd219b` (`--seed` flag,
ambient wiring, and a whole-implementation `/runda` pass). The
post-implementation `/runda` found eight gaps — FK-sampling
determinism (now `ORDER BY`), shortid reproducibility (now from the
seeded RNG, so **D4 holds with no exceptions**), and six untested
ADR decisions (D5/D15/D16/D17 + atomicity + zero-count), all closed.
**2358 tests pass / 0 fail / 0 skip; clippy clean.**
**Implemented in Phase 1:** the whole-row `seed <table> [count]
[--seed <n>]` form and every D1–D18 decision *except* the two
deferred-to-Phase-2 surfaces below. **Deferred to Phase 2** (designed
here, not yet built): the **`set` override clause** (D2) and the
**`<table>.<column>` column-fill** form (D1 form 2). Further SD2
increments (custom user generators, NULL injection, multi-locale,
recursive parent auto-seed) remain out of scope (see Out of scope).
Closes `requirements.md` **SD1** and delivers the core of **SD2**
(per-type generators, determinism, the `fake`-backed catalogue). It
also closes one of the two remaining gaps in **A1** ("all canonical
app-level commands") — `seed`; the other, `hint` (**H2**), is
separate.
Builds on: ADR-0014 (data operations, the `Value`/`Bound` value model,
the auto-show pattern, FK-error enrichment), ADR-0005/0011 (the type
vocabulary and `Type::fk_target_type()`), ADR-0012/0013 (the column /
relationship metadata tables, the rebuild-table primitive — *read* by
seed for schema introspection), ADR-0024 (the unified grammar tree /
`CommandNode` registration that gives completion, hints, help-id,
usage-id for free), ADR-0022 (ambient typing assistance — the
`KNOWN_SQL_FUNCTIONS` curated-vocabulary pattern that the
generator-name list mirrors), ADR-0026 (the `in (...)` / `between ...
and ...` expression grammar the override clause reuses), ADR-0027 (the
validity-indicator diagnostics model), and ADR-0038 (the
`OutputStyleClass::Hint` styled output used for the post-seed
advisory). Honours ADR-0003 (both modes, no sigil), ADR-0009 (DSL
conventions — keyword grammar, `--` flags for opt-in choices, one
sigil only), ADR-0002 (no engine name in user-facing strings), and
ADR-0015 (per-command write-through persistence).
## Context
`seed <table> [count]` is the last unbuilt **data-authoring** command
in the requirements. The pedagogical value is high: a learner who has
just modelled a schema wants rows to query against *now*, without
hand-typing dozens of `insert`s. A teacher wants a one-liner that
fills a demo database with believable data. SD1 commits to "plausible
fake data; junction tables seeded with valid foreign-key references
drawn from existing parent rows." SD2 deferred the *how* — "per-type
generators, locale, determinism, override hooks" — explicitly pending
this ADR.
The design conversation widened the scope deliberately, with the user
confirming each step:
- **Realism matters more than minimalism** for a teaching tool. Random
`text_a3f9` values teach nothing; `Alice Martinez` /
`alice.m@example.com` make queries feel real. → adopt a faker
library and make generation **name-aware**.
- **The column *name* is the strongest signal** for what a value should
look like, but it is **ambiguous** without the **table** for the
`name`/`title` family (`products.name` ≠ `users.name`).
- **Heuristics will miss**, so a **manual override** surface is
required, not optional — this is SD2's "override hooks", brought
forward.
- **Identifiers and enums** are special: `id`-ish columns want
uniqueness; `status`-ish columns have no sensible generic value and
should be *flagged*, not guessed.
The novel work is the **generation layer**. Everything downstream —
type validation, autogen autofill (`serial`/`shortid`), FK
enforcement, per-command persistence, the auto-show outcome — is
reused from the existing insert/update machinery as **shared helper
functions**, per the X5 architecture preference (unique commands, with
mechanics shared as library functions — *not* by emitting
`Command::Insert` to borrow `do_insert`).
## Decision
Add a dedicated **`seed`** command (its own AST variant and its own
`do_seed` worker executor) available in **both modes**, with the
surface and behaviour below. Generation is realistic, name- and
table-aware, type-gated, with a manual override clause and a
reproducibility flag.
**Command classification (important, set by the replay decision
D16).** Although `requirements.md` A1 lists `seed` among the
"app-level commands" (meaning: part of the canonical command surface,
no sigil, both modes), `seed` is architecturally a **data-authoring
command** — a sibling of `insert`/`update`/`delete`, **not** an
app-lifecycle `AppCommand`. It is therefore **not** added to
`is_app_lifecycle_entry_word` or to completion's
`empty_input_offers_app_command_entry_keywords` (both mirror the
`AppCommand` set and must match it, so `seed` belongs in neither). As a
consequence, `replay` re-runs it as a data write (D16).
### D1 — Command surface (fork, user-chosen: "whole-row + column-fill")
Two forms:
1. **Whole-row generation:** `seed <table> [count]`
Generates `count` new rows (an INSERT path). `count` **defaults to
20** (D6) when omitted. Every user-fillable column is filled per the
generation rules (D7–D12); `serial`/`shortid` autogen columns are
left to the existing autofill helpers.
2. **Column-fill on existing rows:** `seed <table>.<column>`
Fills `<column>` across the table's **existing** rows (an UPDATE
path) — the natural follow-up to `add column`. Combined with the
`set` clause (D2) this is also the precise repair for a single
mis-guessed column: `seed users.work_addr set work_addr as email`.
Column-fill **refuses** PK columns and autogen (`serial`/`shortid`)
columns (a friendly error — you don't "fill" an identity column),
and **respects** the same UNIQUE / FK / required rules as whole-row
generation (a UNIQUE target gets collision-free values; an FK
target samples from the parent, D14). On an **empty** table it is a
friendly no-op ("no rows to fill").
**Zero / over-cap counts.** `seed <table> 0` is a friendly no-op;
`count` over the maximum (D6) is a friendly error.
The column-restricted-*insert* form (`seed t (a, b)` — new rows, only
some columns filled) was considered and **rejected** as marginal and
constraint-fragile (see Alternatives).
**Required-column block guard (user requirement).** If seed cannot
produce a value for a `NOT NULL` column — the only real case is a
`NOT NULL blob` column, which has no DSL value path — it **refuses the
whole operation with a friendly error** naming the column, rather than
attempting a NULL insert that would violate the constraint. The check
is a pre-flight over the resolved per-column plan, before any write.
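
As a rough illustration of the pre-flight guard, the sketch below walks a resolved per-column plan and refuses before any write happens. `ColumnPlan`, `Fill`, and `preflight` are hypothetical names invented for this example, not the project's actual types, and the error wording is only indicative.

```rust
/// How seed intends to fill one column (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Fill {
    Generated,   // a generator was resolved for this column
    Autogen,     // serial/shortid: left to the autofill helpers
    Unsupported, // no DSL value path (e.g. blob)
}

#[derive(Debug, Clone)]
struct ColumnPlan {
    name: String,
    not_null: bool,
    fill: Fill,
}

/// Pre-flight over the resolved plan. Reports the first required column
/// that seed cannot fill, so the caller can refuse before any write.
fn preflight(plan: &[ColumnPlan]) -> Result<(), String> {
    match plan.iter().find(|c| c.not_null && c.fill == Fill::Unsupported) {
        Some(col) => Err(format!(
            "cannot seed: required column `{}` has no value seed can generate",
            col.name
        )),
        None => Ok(()),
    }
}
```

A nullable unsupported column passes the guard (it would receive NULL); only the `NOT NULL` case blocks the whole operation.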
### D2 — Manual override: the `set` clause (fork, user-chosen: "value + list + generator + range")
An optional, comma-separated `set` clause overrides generation per
column. Four forms, all reusing existing grammar vocabulary so there
is nothing new to learn:
| Form | Example | Meaning |
|---|---|---|
| Fixed value | `set status = 'pending'` | every row gets the constant |
| Pick-from-list | `set role in ('admin', 'editor', 'viewer')` | uniform random choice from the list |
| Explicit generator | `set work_addr as email` | force a named generator (D9) |
| Range | `set price between 10 and 100` | uniform in range; **also dates**: `set signup between 2023-01-01 and 2024-12-31` |
Multiple clauses combine: `seed users 20 set role in ('admin',
'user'), status = 'active', signup between 2023-01-01 and 2024-12-31`.
**Quoting (fork, user-chosen: "quoted, grammar-consistent").** Text
values and list items are **quoted string literals** (`'admin'`),
exactly as everywhere else in the DSL — numbers and dates stay
unquoted. This reuses the ADR-0026 expression grammar **unchanged**:
the DA pass confirmed that the `in (...)` form's operands are typed
value slots, so a *bare* `admin` would parse as a **column reference**
(→ "unknown column"), not a string. Quoting is therefore not a style
preference but a correctness requirement of grammar reuse. The range
form is **type-aware**: numeric bounds for numeric columns, date
bounds for date/datetime columns; a type-incompatible bound is a
friendly error. `=`, `in (...)`, and `between ... and ...` are the
ADR-0026 expression operators; `set` is the ADR-0014 UPDATE keyword;
`as` is borrowed from the SQL alias slot. The `as <generator>` operand
is a bare name from the curated generator vocabulary (D9), not a
value. The override takes precedence over every heuristic.
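
One way to picture the four forms and the "override wins" rule is the sketch below. The `SeedOverride` name appears in this ADR's AST outline, but its variants here, the `Source` enum, and the `resolve` / `produce` functions are assumptions made for illustration. Only integer ranges are shown, column matching is assumed case-insensitive, and the random draw is passed in by the caller.

```rust
/// The four override forms of the `set` clause (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
enum SeedOverride {
    Fixed(String),      // set status = 'pending'
    OneOf(Vec<String>), // set role in ('admin', 'editor', 'viewer')
    Generator(String),  // set work_addr as email
    IntRange(i64, i64), // set price between 10 and 100
}

/// Where a column's values come from once overrides are considered.
#[derive(Debug, PartialEq)]
enum Source<'a> {
    Override(&'a SeedOverride),
    Heuristic(&'a str),
}

/// An override for the column always wins over the heuristic's choice.
fn resolve<'a>(
    column: &str,
    overrides: &'a [(String, SeedOverride)],
    heuristic: &'a str,
) -> Source<'a> {
    overrides
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case(column))
        .map(|(_, ov)| Source::Override(ov))
        .unwrap_or(Source::Heuristic(heuristic))
}

/// Turn one raw random draw into a value. Assumes a non-empty list and
/// `lo <= hi`, which a real parser would have to validate first.
fn produce(ov: &SeedOverride, draw: u64) -> String {
    match ov {
        SeedOverride::Fixed(v) => v.clone(),
        SeedOverride::OneOf(items) => items[(draw % items.len() as u64) as usize].clone(),
        SeedOverride::Generator(name) => format!("<{} generator>", name),
        SeedOverride::IntRange(lo, hi) => {
            let span = (hi - lo + 1) as u64;
            (lo + (draw % span) as i64).to_string()
        }
    }
}
```

The `Generator` arm only returns a placeholder; in the real command it would dispatch into the named-generator vocabulary (D9).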
### D3 — Generation library: `fake` crate + hand-rolled gaps (fork, user-chosen: "name-aware + realistic")
Add the **`fake`** crate (v5.x at time of writing; English locale for
v1 per X2) for realistic values: names, emails, usernames, addresses,
companies, phone numbers, lorem text, dates. Generation is driven by a
per-column **generator** chosen by the heuristics (D7) or the override
(D2), falling back to **type-based** generation (D8).
**Implementation-time verifications (resolved 2026-06-11 when the
dependency was added):**
- **`rand` de-duplication — clean.** `fake` 5.1.0 depends on
`rand = "0.10"`, the **same major** as the project's `rand 0.10.1`,
so `cargo tree -e normal` resolves a **single** `rand 0.10.1` (no
runtime duplication; the `rand 0.8.6` visible to `cargo tree -i
rand` is only `fake`'s own dev-dependency, never compiled for us).
Consequence for D4: one seeded `rand 0.10` `StdRng` can drive
**both** `fake`'s `fake_with_rng` and the hand-rolled generators —
determinism is single-RNG, single-version, and shares `shortid.rs`'s
`rand` version.
- **`fake` module inventory / features — confirmed.** Default features
(`["either"]`) cover the core string fakers used here
(Name/Internet/Address/Company/Lorem/PhoneNumber); `fake`'s `chrono`
feature is **deliberately omitted** (dates generated in-house for
D8's bounded windows). No commerce/product module exists → `product`
is hand-rolled (D9). (The exact faker call sites are pinned when the
generation library is built.)
- **Security (new-dependency posture) — clean.** The `fake` tree (296
packages total) scanned clean by **all three** mandated scanners:
`osv-scanner` (no issues), `grype` (no vulnerabilities), `trivy fs
--scanners vuln` (0). No findings to document or accept.
### D4 — Determinism: `--seed <n>` (fork, user-chosen: "optional flag")
Generation is **random by default**. The optional `--seed <n>` flag
makes a run **reproducible**: **same database state + same `--seed` →
identical data**. The "database state" qualifier matters (DA
refinement) — FK sampling (D14), identifier sequencing (D10), and
UNIQUE collision-avoidance all *read existing rows*, so reproducibility
is relative to the data already present, not absolute. Value: teachers
hand out one dataset; demos are stable; and the feature's own tests
can assert **exact** output (against a known starting state).
Implemented with a seedable RNG threaded through every generator (no
`thread_rng` on the seeded path). `--` flag per ADR-0009 (opt-in
choice). Naming note: the flag `--seed` and the command `seed` share a
word but never collide grammatically (`seed users 20 --seed 42` parses
unambiguously). This flag is also the determinism lever for **replay**
(D16): a recorded `seed … --seed N` line reproduces on replay; a bare
`seed …` line regenerates fresh data.
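
The "one seeded RNG threaded through every generator" idea can be sketched with the standard library alone. The real implementation uses `rand`'s `StdRng`; the `SplitMix64` type below is a deliberately tiny stand-in so the example has no dependencies, and `rng_for` / `generate_ages` are hypothetical names.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Tiny SplitMix64 generator: a std-only stand-in for a seeded `StdRng`.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// `--seed <n>` pins the stream; without it the seed comes from the clock.
fn rng_for(rng_seed: Option<u64>) -> SplitMix64 {
    let seed = rng_seed.unwrap_or_else(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos() as u64)
            .unwrap_or(0)
    });
    SplitMix64(seed)
}

/// Every generator takes the RNG by `&mut`; none reaches for a global one.
fn generate_ages(rng: &mut SplitMix64, count: usize) -> Vec<u64> {
    (0..count).map(|_| 18 + rng.next_u64() % 63).collect() // 18..=80
}
```

Two runs built from `rng_for(Some(42))` produce the same ages, which is the property the exact-output tests rely on. A generator that also reads existing rows would only repeat itself from the same starting state.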
### D5 — Both modes (A1)
`seed` is a canonical app-level command, available in **simple and
advanced** mode, no sigil — like `save`/`load`/`export`/`replay`.
### D6 — Default count: 20; bounded maximum
Omitted `count` → **20** rows: enough to make `where`, `group by`,
`order by`, and `limit` meaningful without flooding the output pane.
A **maximum** is enforced (proposed 10 000) to prevent a typo
(`seed t 1000000`) from hanging the app or bloating the project; over
the cap → friendly error stating the limit.
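
The count rules (default, zero no-op, cap) fit in a few lines. This is a minimal sketch: `resolve_count` and `CountPlan` are hypothetical names, the cap uses the proposed value of 10 000, and the error text is illustrative.

```rust
const DEFAULT_COUNT: u32 = 20;
const MAX_COUNT: u32 = 10_000; // the value proposed in this ADR

#[derive(Debug, PartialEq)]
enum CountPlan {
    NoOp,      // `seed t 0`: nothing to do, reported kindly
    Rows(u32), // generate this many rows
}

/// Apply the default, then the zero and over-cap rules.
fn resolve_count(requested: Option<u32>) -> Result<CountPlan, String> {
    match requested.unwrap_or(DEFAULT_COUNT) {
        0 => Ok(CountPlan::NoOp),
        n if n > MAX_COUNT => Err(format!(
            "seed can generate at most {} rows at a time (asked for {})",
            MAX_COUNT, n
        )),
        n => Ok(CountPlan::Rows(n)),
    }
}
```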
### D7 — Name-aware heuristics, type-gated (the catalogue)
A column's **name** selects a generator, but a name rule only fires
when the column's **type** is compatible (a column named `email` typed
`int` does **not** get a string — it falls through to type-based int).
Matching is **case-insensitive**, **token-based** (split on `_`,
camelCase, kebab), **most-specific-first**, with documented
false-positive guards. The catalogue (representative; full table lives
with the implementation):
| Column name (tokens) | Generator | Type gate |
|---|---|---|
| `first_name`/`fname` · `last_name`/`surname`/`lname` | first / last name | text |
| `name`/`full_name` · `title` | **table-context** name (D11) | text |
| `email`/`*_email` | email | text |
| `username`/`login`/`handle` | username | text |
| `password`/`pwd` | password | text |
| `phone`/`mobile`/`cell`/`tel` | phone number | text |
| `city`/`town` · `country` · `state`/`province` | address parts | text |
| `street`/`address`/`addr` · `zip`/`postcode`/`postal` | address parts | text |
| `company`/`employer`/`org` · `job`/`position`/`profession` | company / job | text |
| `description`/`bio`/`notes`/`summary`/`comment` | sentence / paragraph | text |
| `url`/`website`/`homepage` · `color`/`colour` | URL / hex colour | text |
| `price`/`amount`/`cost`/`salary`/`balance`/`total` | currency-range number | numeric |
| `age` · `quantity`/`qty`/`stock`/`count` | 18–80 · small int | numeric |
| `date`/`*_date` | date, recent ~3 yr window | date |
| `dob`/`birthday` | date, adult window (18–80 yr ago) | date |
| `timestamp`/`datetime` · `created_at`/`updated_at`/`*_at` | datetime, recent window (`updated_at` ≥ `created_at`) | datetime |
| `is_*`/`has_*`/`active`/`enabled` | boolean | bool |
| **identifier family** (D10) | unique sequential | int/text |
| **enum-ish family** (D12) | generic text + flag | (text) |
**False-positive guards (documented):** `username`/`filename`/
`table_name`/`*_name` handled before the bare `name` rule so they do
**not** resolve to person-name; the bare `name`/`title` rule requires a
standalone token or a recognised `*_name` suffix.
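
To make the token-based, type-gated matching concrete, here is a small sketch covering a handful of catalogue rows. `tokens`, `pick_generator`, and `ColType` are hypothetical names, the rule set is deliberately tiny, and the `type:*` strings stand in for the type-based fallback (D8).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType { Text, Int, Date, Bool }

/// Split on `_`, `-`, and camelCase boundaries; lower-case every token.
fn tokens(name: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    let mut prev_lower = false;
    for ch in name.chars() {
        if ch == '_' || ch == '-' {
            if !cur.is_empty() { out.push(std::mem::take(&mut cur)); }
            prev_lower = false;
            continue;
        }
        if ch.is_uppercase() && prev_lower && !cur.is_empty() {
            out.push(std::mem::take(&mut cur)); // camelCase boundary
        }
        prev_lower = ch.is_lowercase() || ch.is_ascii_digit();
        cur.extend(ch.to_lowercase());
    }
    if !cur.is_empty() { out.push(cur); }
    out
}

/// Most-specific-first name rules; a rule fires only if its type gate
/// matches, otherwise the column falls through to type-based generation.
fn pick_generator(column: &str, ty: ColType) -> &'static str {
    let owned = tokens(column);
    let toks: Vec<&str> = owned.iter().map(String::as_str).collect();
    let rule = match toks.as_slice() {
        ["first", "name"] | ["fname"] => Some(("first_name", ColType::Text)),
        ["last", "name"] | ["surname"] | ["lname"] => Some(("last_name", ColType::Text)),
        ["username"] | ["login"] | ["handle"] => Some(("username", ColType::Text)),
        // Bare `name`/`title` needs a standalone token or a recognised prefix,
        // so `table_name` and `file_name` do not become person names.
        ["name"] | ["full", "name"] | ["title"] => Some(("name", ColType::Text)),
        [.., "email"] => Some(("email", ColType::Text)),
        [.., "date"] => Some(("date", ColType::Date)),
        ["is", ..] | ["has", ..] | ["active"] | ["enabled"] => Some(("bool", ColType::Bool)),
        _ => None,
    };
    match rule {
        Some((generator, gate)) if gate == ty => generator,
        _ => match ty {
            ColType::Text => "type:text",
            ColType::Int => "type:int",
            ColType::Date => "type:date",
            ColType::Bool => "type:bool",
        },
    }
}
```

With these rules an `email` column typed `int` resolves to `type:int`, which mirrors the type-gate example in the text.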
### D8 — Type-based fallback
When no name rule matches (or to satisfy a name rule's type gate),
generate by **type**: `text`→realistic words/short phrase, `int`→
bounded random, `real`→random double, `decimal`→formatted number,
`bool`→random, `date`/`datetime`→**bounded recent** value (never "any
point in all of history" — per the user's date concern), `serial`/
`shortid`→omitted (autogen helpers fill them), `blob`→unsupported
(nullable→NULL; `NOT NULL`→D1 block guard).
### D9 — Named generators + the `product` generator
The generators addressable via `set ... as <generator>` (D2) and
chosen by D7 form a **curated, named vocabulary**: `name`,
`first_name`, `last_name`, `email`, `username`, `phone`, `city`,
`country`, `street`, `zip`, `company`, `job`, `sentence`, `paragraph`,
`url`, `color`, `price`, `age`, `date`, `datetime`, `bool`, `product`,
… — the single source of truth shared by the executor, the completion
source, and the highlighter (mirroring `KNOWN_SQL_FUNCTIONS`,
ADR-0022 Amendment 6).
**`product`** is **hand-rolled** (the `fake` crate has no
commerce/product module — D3): `{adjective} {material} {noun}` from
three small baked-in word lists (~20 each) → "Sleek Bamboo Keyboard",
"Vintage Leather Backpack". Seedable through the D4 RNG. Always
addressable as `set <col> as product`, and auto-selected by D11 for
the `name`/`title` family in product-ish tables.
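
A hand-rolled `{adjective} {material} {noun}` generator could look roughly like this. The word lists are shortened placeholders (only the two example names above come from this ADR), `product_name` is a hypothetical function name, and `SplitMix64` is a std-only stand-in for the seeded `rand` RNG of D4.

```rust
/// Tiny SplitMix64 generator: a std-only stand-in for a seeded `StdRng`.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
    /// Index in `0..n`; the slight modulo bias is fine for a sketch.
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

// Placeholder lists; the ADR calls for roughly 20 words each.
const ADJECTIVES: [&str; 4] = ["Sleek", "Vintage", "Compact", "Rugged"];
const MATERIALS: [&str; 4] = ["Bamboo", "Leather", "Steel", "Cotton"];
const NOUNS: [&str; 4] = ["Keyboard", "Backpack", "Lamp", "Bottle"];

/// One draw per word list, all from the caller's RNG, so `--seed`
/// reproduces product names like every other generator.
fn product_name(rng: &mut SplitMix64) -> String {
    format!(
        "{} {} {}",
        ADJECTIVES[rng.below(ADJECTIVES.len())],
        MATERIALS[rng.below(MATERIALS.len())],
        NOUNS[rng.below(NOUNS.len())]
    )
}
```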
### D10 — Identifier family → unique by name (fork, user-chosen: "unique sequential")
A column in the identifier family — `id`, `*_id` **that is not an FK**,
`code`, `sku`, `ref`/`reference`, `number`/`no`, `barcode` — that is
**not** a serial/shortid autogen column and **not** the PK is treated
as an identifier and gets **unique** values: **int → sequential**
(`MAX(col)+1` ascending, reads like real ids, never collides);
**text → unique short code** (generate-with-retry). Precedence:
**FK detection wins** over this rule (an FK `user_id` *should* have
duplicates — many children per parent), so `*_id` only triggers
uniqueness when the column is not a foreign key.
**Constraint-driven uniqueness is independent and mandatory:** any
column with a `UNIQUE` constraint (or a user-fillable single-column
PK) gets guaranteed-unique generation regardless of name — a
correctness requirement, not a heuristic. Generation for such columns
uses retry/sequence to guarantee no collision within the batch and
against existing rows.
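
The two uniqueness strategies can be sketched as follows. `sequential_ids` and `unique_codes` are hypothetical names; starting at 1 on an empty table, the six-hex-digit code shape, and the attempt limit are assumptions of this example. `SplitMix64` is again a std-only stand-in for the seeded RNG.

```rust
use std::collections::HashSet;

/// Tiny SplitMix64 generator: a std-only stand-in for a seeded `StdRng`.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Int identifiers: continue after the current maximum (`MAX(col)+1`).
fn sequential_ids(existing_max: Option<i64>, count: usize) -> Vec<i64> {
    let start = existing_max.map_or(1, |m| m + 1);
    (0..count as i64).map(|i| start + i).collect()
}

/// Text identifiers: generate-with-retry against both the existing
/// values and the batch built so far.
fn unique_codes(
    existing: &HashSet<String>,
    count: usize,
    rng: &mut SplitMix64,
    max_attempts: usize,
) -> Result<Vec<String>, String> {
    let mut taken = existing.clone();
    let mut out = Vec::with_capacity(count);
    let mut attempts = 0;
    while out.len() < count {
        attempts += 1;
        if attempts > max_attempts {
            return Err("could not find enough unique codes".to_string());
        }
        let code = format!("{:06X}", rng.next_u64() & 0xFF_FFFF);
        if taken.insert(code.clone()) {
            out.push(code); // not seen before, keep it
        }
    }
    Ok(out)
}
```

The same retry loop would serve any `UNIQUE`-constrained column, since that rule applies regardless of the column's name.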
### D11 — Table-context disambiguation for `name`/`title` (fork, user-chosen: "table-context-aware")
For the `name`/`title` family **only**, the heuristic also reads the
**table** name token:
- `product`/`item`/`goods`/`merchandise`/`catalog`/`inventory` →
`product` generator (D9)
- `company`/`companies`/`vendor`/`supplier`/`manufacturer`/`brand` →
company name
- `user`/`customer`/`person`/`people`/`employee`/`member`/`contact`/
`author`/`student` → person name
- unrecognised table → generic word
This resolves the real ambiguity (`products.name` → "Sleek Bamboo
Keyboard"; `users.name` → "Alice Martinez"; `vendors.name` → "Globex
Corp"). It is a deliberately **scoped** use of table context — the only
place the table name influences generation.
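
The table-token lookup might be sketched as below. `NameKind` and `name_kind_for_table` are hypothetical names. The naive plural handling (also trying the token without a trailing `s`) is an assumption added so that `products`, `users`, and `vendors` from the examples match the singular tokens in the list.

```rust
#[derive(Debug, PartialEq)]
enum NameKind { Product, Company, Person, GenericWord }

const PRODUCT_TABLES: [&str; 6] =
    ["product", "item", "goods", "merchandise", "catalog", "inventory"];
const COMPANY_TABLES: [&str; 6] =
    ["company", "companies", "vendor", "supplier", "manufacturer", "brand"];
const PERSON_TABLES: [&str; 9] = [
    "user", "customer", "person", "people", "employee",
    "member", "contact", "author", "student",
];

fn listed(list: &[&str], word: &str) -> bool {
    list.iter().any(|w| *w == word)
}

/// Decide which generator the `name`/`title` family uses for this table.
fn name_kind_for_table(table: &str) -> NameKind {
    let lower = table.to_ascii_lowercase();
    for token in lower.split(|c: char| c == '_' || c == '-') {
        // Assumption: try the token as written, then without a trailing "s".
        let singular = token.strip_suffix('s').unwrap_or(token);
        for word in [token, singular] {
            if listed(&PRODUCT_TABLES, word) { return NameKind::Product; }
            if listed(&COMPANY_TABLES, word) { return NameKind::Company; }
            if listed(&PERSON_TABLES, word) { return NameKind::Person; }
        }
    }
    NameKind::GenericWord
}
```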
### D12 — Enum-ish names → generic + post-seed advisory (fork, user-chosen: "flag enum-ish only")
Enum-ish names — `role`, `status`, `type`, `state`, `kind`,
`category`, `level`, `tier`, `stage`, `priority`, `gender` — have **no
sensible generic generator**, so they are **not guessed**: they fall
through to generic text (they must still be filled — a `NOT NULL`
status cannot be left empty). Seed then emits a **post-seed advisory**
(D13) naming them and pointing at the `set ... in (...)` override.
### D13 — Reporting: post-seed advisory (fork, user-chosen: "flag enum-ish only")
After a successful seed, in addition to the normal auto-show outcome
(row count + the affected rows, per ADR-0014), seed appends a
**`OutputStyleClass::Hint`** advisory **only** when one or more
enum-ish columns (D12) — **or columns guarded by a CHECK that seed
could not derive values from** (D17) — were filled generically.
The wording is **phase-aware** (DA finding: the advisory must not name
features that ship later). In **Phase 1** (no `set` clause yet) it
names the columns and explains they were filled generically. From
**Phase 2/3** it points at the concrete repair:
```
# Phase 1 wording:
✓ Seeded 20 rows into users
status, role were filled with generic text — they look like
fixed value sets you may want to choose deliberately.
# Phase 2/3 wording (set clause + column-fill exist):
✓ Seeded 20 rows into users
status, role filled generically. Fix existing rows with
seed users.status set status in ('active','inactive'),
or pass set … on the next seed.
```
Note the repair for **already-seeded rows** is the **column-fill**
form (`seed users.status set …`), not "re-seed" (which would add more
rows) — DA correction. This is a **result-time** note (cheap, reusing
ADR-0038's hint rendering), not a typing-time warning. The fuller
"per-column report" (every column → its generator) was considered and
**deferred** (see Alternatives / Out of scope).
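
A minimal sketch of how the Phase-1 advisory could be assembled from the enum-ish list in D12 follows. `is_enum_ish` and `phase1_advisory` are hypothetical names, matching on any `_`-separated token is an assumption, the wording only approximates the mock-up above, and CHECK-guarded columns (D17) are left out.

```rust
const ENUM_ISH: [&str; 11] = [
    "role", "status", "type", "state", "kind", "category",
    "level", "tier", "stage", "priority", "gender",
];

/// True when any `_`-separated token of the column name is enum-ish.
fn is_enum_ish(column: &str) -> bool {
    let lower = column.to_ascii_lowercase();
    lower.split('_').any(|tok| ENUM_ISH.iter().any(|e| *e == tok))
}

/// Build the advisory only when at least one column was flagged.
fn phase1_advisory(columns: &[&str]) -> Option<String> {
    let flagged: Vec<&str> = columns.iter().copied().filter(|c| is_enum_ish(c)).collect();
    if flagged.is_empty() {
        return None;
    }
    Some(format!(
        "{} filled with generic text; they look like fixed value sets you may want to choose deliberately.",
        flagged.join(", ")
    ))
}
```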
### D14 — Foreign keys (SD1; fork on empty-parent, user-chosen: "friendly error")
- **Each FK** is filled by sampling **uniformly** from the **existing
rows** of the parent table's referenced column(s). Duplicates are
expected and correct (many children per parent). For a **compound
FK**, the referenced **tuple is sampled jointly** (a whole existing
parent key), never per-column independently — independent sampling
could fabricate a `(a, b)` pair that exists in no parent row and
would fail FK enforcement (DA refinement).
- **Empty parent** → seed **refuses with a friendly error** naming the
parent and the FK column ("seed `users` first — `orders.user_id`
references it"). Safe, predictable, teaches FK dependency order.
Recursive parent auto-seed is **deferred** to a future `--recursive`
opt-in (Out of scope).
- **Junction / compound-PK tables** (SD1's explicit case): sample
**distinct combinations** of the parent PK tuples to satisfy the
compound PK's uniqueness; if `count` exceeds the number of available
distinct combinations, **cap** at the maximum and note it in the
outcome.
- **Self-referential FK** (`manager_id → id`): if nullable, leave NULL
or point at an earlier row in the same batch; if `NOT NULL` on an
otherwise-empty table, friendly error. Documented edge case.
- **Nullable FKs** are **always filled** in v1 (predictable);
occasional-NULL injection is deferred.
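
The sampling rules above can be sketched as two small functions: whole parent tuples are sampled jointly (duplicates allowed), and junction rows are drawn as distinct combinations with a cap. Function names are hypothetical, `SplitMix64` is a std-only stand-in for the seeded RNG, and this sketch ignores junction rows that already exist, which the real command would have to exclude.

```rust
/// Tiny SplitMix64 generator: a std-only stand-in for a seeded `StdRng`.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Sample whole parent key tuples, never column by column, so every
/// child key exists in some parent row. An empty parent is an error.
fn sample_fk_tuples<T: Clone>(
    parent_keys: &[T],
    count: usize,
    rng: &mut SplitMix64,
) -> Result<Vec<T>, String> {
    if parent_keys.is_empty() {
        return Err("parent table has no rows; seed it first".to_string());
    }
    Ok((0..count)
        .map(|_| parent_keys[rng.below(parent_keys.len())].clone())
        .collect())
}

/// Junction rows: distinct (left, right) pairs. Returns the rows and
/// whether `count` had to be capped at the number of combinations.
fn sample_junction(
    left: &[i64],
    right: &[i64],
    count: usize,
    rng: &mut SplitMix64,
) -> (Vec<(i64, i64)>, bool) {
    let mut combos: Vec<(i64, i64)> = left
        .iter()
        .flat_map(|&l| right.iter().map(move |&r| (l, r)))
        .collect();
    let take = count.min(combos.len());
    // Partial Fisher-Yates shuffle: the first `take` slots become a
    // sample without replacement.
    for i in 0..take {
        let j = i + rng.below(combos.len() - i);
        combos.swap(i, j);
    }
    combos.truncate(take);
    (combos, take < count)
}
```

Materialising every combination is acceptable at teaching-database sizes; a larger parent pair would call for a different sampling strategy.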
### D15 — Undo: one snapshot per seed (DA finding; ADR-0006)
Seed is a mutation, so it must participate in undo. The draft omitted
this; the DA found the codebase already has the right primitive —
`BeginBatch` / `EndBatch` (`db.rs`), used by `replay` so a multi-write
run collapses to **one** boundary snapshot. `do_seed` wraps its
generated writes in `begin_batch` / `end_batch`, so **`seed users 20`
is a single undo step**, not 20 — matching ADR-0006 Amendment 1's
batch model. Column-fill's bulk UPDATE is likewise one step. (`import`
remains the only data-affecting op outside undo, per ADR-0015 §11;
seed is firmly inside it.)
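
The effect of bracketing the writes in a batch can be shown with a toy model. This is not the project's `BeginBatch` / `EndBatch` code from `db.rs` (which this ADR does not show); `Db`, its fields, and `seed` are hypothetical and only illustrate the "one boundary snapshot" idea.

```rust
/// Toy store: undo restores a whole-state snapshot (hypothetical model).
struct Db {
    rows: Vec<String>,
    snapshots: Vec<Vec<String>>,
    in_batch: bool,
}

impl Db {
    fn new() -> Self {
        Db { rows: Vec::new(), snapshots: Vec::new(), in_batch: false }
    }

    /// Take one boundary snapshot; writes inside the batch take none.
    fn begin_batch(&mut self) {
        self.snapshots.push(self.rows.clone());
        self.in_batch = true;
    }

    fn end_batch(&mut self) {
        self.in_batch = false;
    }

    /// A standalone write snapshots first; a batched write does not.
    fn insert(&mut self, row: &str) {
        if !self.in_batch {
            self.snapshots.push(self.rows.clone());
        }
        self.rows.push(row.to_string());
    }

    /// Restore the most recent snapshot, if any.
    fn undo(&mut self) -> bool {
        match self.snapshots.pop() {
            Some(previous) => {
                self.rows = previous;
                true
            }
            None => false,
        }
    }
}

/// All generated writes sit between one begin/end pair.
fn seed(db: &mut Db, count: usize) {
    db.begin_batch();
    for i in 0..count {
        db.insert(&format!("generated row {}", i + 1));
    }
    db.end_batch();
}
```

After one manual insert and `seed(&mut db, 20)`, a single `undo()` removes all 20 generated rows and leaves the manual row in place.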
### D16 — Replay: seed re-runs as a data write (fork, user-chosen)
`replay` re-executes a recorded `seed` line as a **data-write
command** — it is **not** in the app-lifecycle skip-set (see Command
classification, above). Consequence, accepted by the user: a **bare**
`seed users 20` regenerates **fresh, divergent** data on each replay;
a `seed users 20 --seed 42` line (the determinism lever, D4)
**reproduces** the original data. This keeps seed faithful to its
nature as a data write and puts reproducibility exactly where the
`--seed` flag already lives. (Seeded *data* is in any case durable
independently of replay, via the ADR-0015 CSV store + `rebuild`;
replay is the scripting re-run path, U4.) The DA confirmed the wiring
trap: because seed is *not* an `AppCommand`, it is correctly absent
from `is_app_lifecycle_entry_word` and replay dispatches it through
the normal data path rather than aborting.
### D17 — CHECK constraints: derive from simple `IN`, else friendly-fail (fork, user-chosen)
A CHECK on a generically-filled column would otherwise fail the whole
batch (DA finding — the block guard only covered `NOT NULL blob`).
Two-tier handling, per the user:
1. **Derive from simple `IN`-CHECKs.** When a column's CHECK is the
common enum-as-CHECK shape — `col IN ('a', 'b', …)` (the column's
own CHECK, single-column, literal list) — seed **parses out the
allowed values and uses them as the generator** (uniform choice).
The frequent `CHECK (status IN ('active','closed'))` case then
"just works" with no override needed.
2. **Best-effort + friendly fail for the rest.** For CHECKs seed
cannot interpret (ranges, expressions, multi-column), it generates
best-effort; if a generated row violates the CHECK, the insert
fails through the existing **H1 friendly-error layer** (ADR-0019)
naming the constraint and pointing at `set`. Such CHECK-guarded
columns are also **pre-flagged in the advisory** (D13) alongside
enum-ish names, so the user is warned before hitting the failure.
No new CHECK engine — tier 1 is a narrow literal-`IN` parse over the
CHECK text already stored in metadata; tier 2 is the existing failure
path.
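
Tier 1's narrow literal-`IN` parse might look like the sketch below. `derive_in_values` is a hypothetical name, and the sketch assumes the stored CHECK text has the bare shape `col IN ('a', 'b')` with single-quoted string literals (a doubled quote as an escape) and no wrapping parentheses. Anything else returns `None` and falls to tier 2.

```rust
/// Extract the allowed values from `col IN ('a', 'b', ...)`.
/// Returns `None` for any CHECK that is not exactly that shape.
fn derive_in_values(column: &str, check: &str) -> Option<Vec<String>> {
    let text = check.trim();
    // The CHECK must start with the column's own name (case-insensitive).
    let head = text.get(..column.len())?;
    if !head.eq_ignore_ascii_case(column) {
        return None;
    }
    let after = &text[column.len()..];
    if !after.starts_with(char::is_whitespace) {
        return None;
    }
    let rest = after.trim_start();
    if !rest.get(..2)?.eq_ignore_ascii_case("in") {
        return None;
    }
    let inner = rest[2..].trim_start().strip_prefix('(')?.strip_suffix(')')?;

    let mut values = Vec::new();
    let mut chars = inner.chars().peekable();
    loop {
        while chars.peek().map_or(false, |c| c.is_whitespace()) {
            chars.next();
        }
        if chars.next()? != '\'' {
            return None; // only quoted string literals are derived
        }
        let mut value = String::new();
        loop {
            match chars.next()? {
                '\'' if chars.peek() == Some(&'\'') => {
                    chars.next(); // doubled quote is an escaped quote
                    value.push('\'');
                }
                '\'' => break,
                c => value.push(c),
            }
        }
        values.push(value);
        while chars.peek().map_or(false, |c| c.is_whitespace()) {
            chars.next();
        }
        match chars.next() {
            None => return Some(values),
            Some(',') => continue,
            Some(_) => return None, // trailing expression: not a simple IN
        }
    }
}
```

The returned list would then feed a uniform-choice generator, the same behaviour as the pick-from-list override in D2.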
### D18 — Auto-show is capped for large seeds (DA finding)
ADR-0014 auto-show renders "the affected rows" — fine for one insert,
a wall for a 10 000-row seed. Seed's outcome shows a **capped
preview** (proposed first **20** rows) with a `(showing 20 of N)`
note, not the full set. The row **count** is always reported in full;
only the rendered table is capped.
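
The capped preview reduces to a slice plus an optional note. A minimal sketch, using the proposed cap of 20; `preview` is a hypothetical helper name.

```rust
const PREVIEW_ROWS: usize = 20; // the cap proposed in this ADR

/// Return the rows to render and, when capped, the note to show.
/// The caller still reports the full row count separately.
fn preview<T>(rows: &[T]) -> (&[T], Option<String>) {
    if rows.len() <= PREVIEW_ROWS {
        (rows, None)
    } else {
        (
            &rows[..PREVIEW_ROWS],
            Some(format!("(showing {} of {})", PREVIEW_ROWS, rows.len())),
        )
    }
}
```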
## Grammar, AST, and cross-cutting wiring
Per ADR-0024, `seed` is registered as a `CommandNode` so completion,
hints, help, and usage flow from one definition. The wiring, as
**explicit acceptance criteria** (a `/runda` pass must verify each —
ADR-0045 showed "claimed verified" is not verified):
- **AST + executor.** A dedicated command variant (`Seed { table,
target_column: Option<String>, count: Option<u32>, overrides:
Vec<SeedOverride>, rng_seed: Option<u64> }`) and a dedicated
`do_seed` worker executor. `do_seed` **reuses shared helpers**
(value binding `impl_value_for`, autogen autofill, FK enrichment,
the multi-row parameterised-insert pattern of `plan_autogen_autofill`,
the UPDATE path for column-fill, per-command persistence, the
`begin_batch`/`end_batch` undo primitive of D15) as library
functions — it does **not** emit `Command::Insert`/`Command::Update`
(X5).
- **Replay / undo classification (D15/D16).** `do_seed` brackets its
writes in one batch (one undo step). The `seed` entry word is
**deliberately absent** from `is_app_lifecycle_entry_word` and
completion's `empty_input_offers_app_command_entry_keywords` (the
`AppCommand` mirror) so replay re-runs it as a data write — an
explicit acceptance check, since the default for an unlisted
recognised command must be "replayed", not "abort".
- **Completion sources:** table-name (existing tables); `.column` and
`set`-clause column slots (columns of the named table); the
generator-name vocabulary (D9) after `as`; `count` number; `set` /
`=` / `in` / `as` / `between` / `and` keywords; `--seed` flag.
- **Syntax highlighting:** `seed` keyword; the generator-name
vocabulary highlighted as **`tok_function`** (reuse the existing
ADR-0022 Amendment 6 blue, with no new theme colour).
- **Hints:** ambient per-slot "what's next" and usage hints, both
modes.
- **Help:** `help seed` topic (`help_id` + per-command block); the
general `help` list picks it up automatically via REGISTRY.
- **Parse-error pedagogy (ADR-0042):** near-miss matrix rows for `seed`
(bare / missing-table / wrong-token / malformed `set`), both modes.
- **Validity indicator (ADR-0027):** typing-time `[ERR]`/`[WRN]` for
unknown table, unknown column (in `.column` or `set`), unknown
generator name after `as`.
- **No DSL→SQL teaching echo (ADR-0038).** `seed` is a utility/app
command, not a DSL form of a SQL statement, so the echo does not
apply. (A future "show the generated INSERTs" is out of scope —
it would dump `count` statements.)
## Implementation phasing
Design is whole; the **implementation** is phased into reviewable,
test-first commits:
1. **Core whole-row seed** — grammar/AST/executor; type-based
generation + the `fake`-backed name heuristics (D7/D8/D11);
identifier uniqueness (D10) + constraint uniqueness; FK sampling
(joint tuples) + empty-parent error + junction distinct-combos
(D14); `--seed` determinism (D4); default count + cap + zero-no-op
(D6/D1); required-column block guard (D1); **undo batch (D15)**;
**replay-as-data-write classification (D16)**; **CHECK derive /
friendly-fail (D17)**; **capped auto-show (D18)**; the enum/CHECK
advisory in its **Phase-1 wording** (D12/D13); full ambient wiring;
both modes.
2. **The `set` override clause** (D2) — value / list / generator /
range, type-aware, with completion + highlight + validity for the
generator-name slot.
3. **Column-fill mode** (`seed <table>.<column>`, D1 form 2) — the
UPDATE path.
Each phase is independently green before the next.
## Testing (ADR-0008 tiers 1–3; test-first)
- **Tier 1 (unit, deterministic via `--seed`):** generator selection
(name × type-gate matrix, including every false-positive guard of
D7); table-context disambiguation (D11); identifier uniqueness and
the FK-wins-over-`*_id` precedence (D10); bounded-date windows (D8);
the `product` generator shape; override resolution + precedence (D2);
the required-column block guard (D1); the count cap (D6). Exact-value
assertions are possible because `--seed` fixes the RNG.
- **Tier 2 (insta snapshots):** the seeded data table render and the
enum advisory (D13) at representative sizes, light + dark.
- **Tier 3 (integration, full event loop):** `seed users 20` end to
end (rows land in db + CSV + history, auto-show, persistence);
FK sampling against a populated parent (incl. a **compound FK** —
every child tuple exists in the parent); **empty-parent friendly
error**; **junction** seeding with distinct combinations and the
over-cap note; the `set` clause forms (quoted literals); **column-
fill** on existing rows (incl. refusal of PK/autogen targets, empty-
table no-op); reproducibility (`--seed 42` twice → identical data
from a fixed state); both modes. Plus the DA-driven cases:
**one-undo-step** (seed then a single `undo` removes all rows);
**replay** of a bare `seed` line (divergent) vs a `--seed` line
(reproduced); **`IN`-CHECK auto-derivation** ("just works") and a
**complex-CHECK friendly failure**; **capped auto-show** on a large
seed.
"All green, no skips" is the only acceptable end state; the Phase-1
baseline (2290 passing / 0 failing / 0 skipped / 1 ignored doctest) is
the regression floor.
## Out of scope / deferred (future SD2 work)
- **Recursive parent auto-seed** (`--recursive`) — D14 errors instead.
- **NULL injection** for nullable columns (teaching optional
relationships / `IS NULL`) — v1 always fills.
- **Multi-locale** generation — English only (X2).
- **User-defined custom generators** (true "override hooks" — register
a named generator) — the `set ... as <builtin>` surface covers the
common need; custom generators are a later SD2 increment.
- **Full per-column seed report** — D13 flags enum-ish only.
- **Column-restricted insert** (`seed t (a, b)`) — rejected (D1).
- **"Show the generated SQL"** teaching echo for seed.
## Alternatives considered
- **Hand-rolled generators only (no `fake`):** minimal dependency, but
synthetic-looking data (`text_a3f9`) — rejected on pedagogy
(pedagogy wins ties).
- **Type-only generation (no name awareness):** simpler, but misses
the biggest UX win (a `users` table that reads like real people) —
rejected.
- **Column-name-only `name` (no table context):** leaves
`products.name` → person names, requiring a manual override on every
product/company table — rejected for the `name`/`title` family
(D11).
- **No override clause (heuristics + type only):** could not answer
"the heuristic guessed wrong, fix it" or enum columns — rejected;
the `set` clause (D2) is the answer to the user's Q3.
- **Recursive auto-seed of empty parents:** powerful but magical and
can seed tables the user did not name — deferred behind a future
flag (D14).
- **Always-random (no `--seed`):** simplest, but no reproducible
datasets and weaker tests — rejected (D4).
- **Full per-column report by default:** a nice teaching artifact but
verbose on wide tables — deferred; flag-only advisory chosen (D13).
- **Reuse `Command::Insert`/`do_insert` directly** from seed: tempting
for code reuse, but collapses command identity and violates X5 —
rejected in favour of a dedicated `do_seed` that calls shared
*helpers*.
- **Skip seed on replay** (classify as app-lifecycle, D16): consistent
with A1's "app-level" label and avoids divergent data, but seed is a
data write and silently skipping it on a scripted re-run is
surprising — rejected; `--seed` is the determinism lever instead.
- **Bare-word `set` list items** (`in (admin, …)`, D2): matched the
early mockups and reads cleaner, but bare words are column
references in the reused grammar (would error) and would force a
custom list form — rejected for quoted literals (grammar reuse +
DSL consistency).
- **Pre-flight refuse any CHECK-bearing table** (D17): safest but
blocks seeding too many legitimate tables — rejected for the
derive-`IN`-else-friendly-fail tier.
- **`set`-driven NULL / per-column report / recursive parent seed:**
deferred — see Out of scope.