seed: year-like int columns (*_year, published) get unbounded values #33

Closed
opened 2026-06-12 14:42:19 +01:00 by claude-clouddev1 · 1 comment
Collaborator

Summary

seed generates realistic values from a column's name, but there is
no heuristic for year-like integer columns. A column such as
published or birth_year is just an int to the generator, so it
falls through to the unbounded type-based int path (ADR-0048 D8) and
produces values like 9419, 6187, or 1426 — nonsensical as years.

This was noticed while writing the website docs for seed: the example
library's books.published and authors.birth_year columns produced
implausible years, which undercuts the "realistic data" pedagogy.

Reproduce

create table authors with pk author_id(serial)
add column to authors: name (text)
add column to authors: birth_year (int)
seed authors 5 --seed 7

Observed birth_year values (seed 7): 1426, 1427, 6187, 9512, 7436.

Suggested fix

Add a name heuristic for the year family, type-gated to int:

  • Names: year, *_year (e.g. birth_year, release_year), and
    arguably published / founded.
  • Generator: a bounded year (e.g. a recent-ish window, or an adult/birth
    window for birth_year mirroring the existing dob → DateAdult
    rule), emitted as a plain int.

This slots into the D7 catalogue (src/seed/heuristics.rs) next to the
existing date/dob/created_at rules, plus a Generator variant (or a
bounded SmallInt-style range) in src/seed/generators.rs. Cover it with
Tier-1 exact-value tests via a fixed --seed.
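
As a rough illustration of the name-matching half of this suggestion, the sketch below shows one way a type-gated year check could look. ColType and is_year_like are hypothetical names invented for this example; they are not taken from src/seed/heuristics.rs, and the real catalogue may match names differently.

```rust
/// Hypothetical column type tag; the real crate's type enum may differ.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum ColType {
    Int,
    Text,
}

/// Returns true when an int column's name looks like a calendar year.
/// Matches `year`, any `*_year` name, and the bare names `published` / `founded`.
pub fn is_year_like(name: &str, ty: ColType) -> bool {
    // Type gate: a non-int column called `year` is left to other rules.
    if ty != ColType::Int {
        return false;
    }
    let lower = name.to_ascii_lowercase();
    lower == "year"
        || lower.ends_with("_year")
        || lower == "published"
        || lower == "founded"
}
```

Under these assumptions, birth_year (int) and published (int) match, while published (text) and name (int) do not.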

Workaround (today)

Pin it explicitly with the set clause:
seed books 6 set published between 1950 and 2020.

Refs

ADR-0048 D7 (name-aware heuristics) / D8 (type-based fallback).
Scope note: this is an SD2-style refinement, not in the shipped Phase 1/2.

claude-clouddev1 added the enhancement label 2026-06-12 14:42:19 +01:00
Author
Collaborator

Fixed in deb0948.

Added an int-gated year rule to the D7 catalogue (src/seed/heuristics.rs), placed after the quantity rule so year_count (a count) stays a SmallInt:

  • year / *_year / published / founded → YearRecent, a bounded 1950–2025 window (75 years relative to the fixed REF_YEAR, matching this issue's own between 1950 and 2020 workaround).
  • the same with a birth / born / dob token (e.g. birth_year) → YearBirth, mirroring the existing dob → DateAdult adult window as years (1945–2007).

Both emit a plain int. published / founded are included (user-confirmed: an int so named is almost always a year; a flag would be is_published). The two new Generator variants (YearRecent/YearBirth) are deliberately not added to the D9 named-generator vocabulary; explicit control stays with set <col> between <lo> and <hi>.
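
To make the rule ordering concrete, here is a minimal first-match-wins sketch for int columns. It is not the code from deb0948: the Generator enum below is a hypothetical subset, UnboundedInt and pick_int_generator are invented names, the quantity rule is assumed to key on a count token, and the year rule is assumed to match a year token anywhere in the name (which is why the ordering matters for year_count).

```rust
/// Hypothetical subset of the Generator enum; only the variants needed here.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Generator {
    SmallInt,
    YearRecent,
    YearBirth,
    UnboundedInt, // stand-in name for the D8 type-based fallback
}

/// Picks a generator for an int column by name.
/// Rules are tried in order and the first match wins.
pub fn pick_int_generator(name: &str) -> Generator {
    let lower = name.to_ascii_lowercase();
    let tokens: Vec<&str> = lower.split('_').collect();

    // 1. Quantity rule first (assumed to key on a `count` token),
    //    so `year_count` is treated as a count, not a year.
    if tokens.contains(&"count") {
        return Generator::SmallInt;
    }

    // 2. Year rule: a `year` token, or the bare names `published` / `founded`.
    let year_like = tokens.contains(&"year") || lower == "published" || lower == "founded";
    if year_like {
        // A birth-related token switches to the adult birth-year window.
        let birthish = tokens.iter().any(|t| matches!(*t, "birth" | "born" | "dob"));
        return if birthish {
            Generator::YearBirth
        } else {
            Generator::YearRecent
        };
    }

    // 3. Nothing matched: fall through to the type-based path.
    Generator::UnboundedInt
}
```

If the year rule were tried before the quantity rule in this sketch, year_count would be picked up as YearRecent, which is the case the ordering avoids.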

Repro now: the issue's birth_year example produces plausible years instead of 1426/9512.

Tests: heuristic-selection unit tests (birth_year→YearBirth, published/founded/release_year→YearRecent, type-gate, year_count→SmallInt), generator-window + determinism unit tests, and a fixed-seed integration test asserting membership in the bounded windows.
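
The window and determinism checks can be pictured with a small stand-alone sketch. Everything named here is an assumption for illustration: YearGen stands in for the two new variants, REF_YEAR = 2025 is inferred from the "75 years relative to REF_YEAR" wording, the birth window is hard-coded from the figures quoted in this comment, and SplitMix64 is only a dependency-free stand-in for whatever seeded RNG the project actually uses.

```rust
/// Assumed reference year (1950 + 75); the real constant may differ.
pub const REF_YEAR: i32 = 2025;

/// Stand-ins for the YearRecent / YearBirth generator variants.
#[derive(Clone, Copy, Debug)]
pub enum YearGen {
    Recent,
    Birth,
}

impl YearGen {
    /// Inclusive (lo, hi) window, using the bounds quoted in this comment.
    pub fn window(self) -> (i32, i32) {
        match self {
            YearGen::Recent => (REF_YEAR - 75, REF_YEAR), // 1950..=2025
            YearGen::Birth => (1945, 2007),
        }
    }
}

/// Tiny SplitMix64 PRNG so the sketch needs no external crates.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Draws `n` years from the generator's window for a given seed.
/// Uses a plain modulo, so there is a slight bias; fine for a sketch.
pub fn sample_years(kind: YearGen, seed: u64, n: usize) -> Vec<i32> {
    let (lo, hi) = kind.window();
    let span = (hi - lo + 1) as u64;
    let mut rng = SplitMix64(seed);
    (0..n).map(|_| lo + (rng.next_u64() % span) as i32).collect()
}
```

A membership-style test then asserts that every sampled value lies inside the window and that two runs with the same seed return the same vector, without pinning exact values.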

Decision recorded in ADR-0048 Amendment 1. Full suite green (2433 pass, 1 ignored).
