feat(seed): fake-data generation library + fake dependency (ADR-0048 P1.1)

The pure generation half of `seed` — no command wiring yet:
- src/seed/: ColumnSpec + Generator model and a seeded StdRng; the
  type-gated name-heuristic catalogue (D7) with documented
  false-positive guards; table-context name disambiguation (D11);
  identifier (D10) and enum-ish (D12) detection; per-type + bounded-date
  generators (D8); the hand-rolled product generator (D9); and PickFrom
  for IN-CHECK / enum lists.
- Adds the `fake` crate (v5, default features). Verified: single rand
  0.10.1 (no duplication), determinism via one seeded StdRng driving
  both fake and the hand-rolled generators, security-clean across
  osv/grype/trivy.
- ADR-0048 D3 updated to record the dependency verification.

32 Tier-1 tests (exact-value via fixed --seed); 1673 lib tests pass,
clippy all-targets clean.
This commit is contained in:
claude@clouddev1
2026-06-11 15:35:17 +00:00
parent 0af7f56821
commit 202e25a94f
7 changed files with 1072 additions and 16 deletions
Generated
+18
View File
@@ -419,6 +419,12 @@ dependencies = [
"syn 2.0.117", "syn 2.0.117",
] ]
[[package]]
name = "deunicode"
version = "1.6.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "abd57806937c9cc163efc8ea3910e00a62e2aeb0b8119f1793a978088f8f6b04"
[[package]] [[package]]
name = "diff" name = "diff"
version = "0.1.13" version = "0.1.13"
@@ -518,6 +524,17 @@ dependencies = [
"num-traits", "num-traits",
] ]
[[package]]
name = "fake"
version = "5.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ea6be833b323a56361118a747470a45a1bcd5c52a2ec9b1e40c83dafe687e453"
dependencies = [
"deunicode",
"either",
"rand 0.10.1",
]
[[package]] [[package]]
name = "fallible-iterator" name = "fallible-iterator"
version = "0.3.0" version = "0.3.0"
@@ -1527,6 +1544,7 @@ dependencies = [
"crossterm", "crossterm",
"csv", "csv",
"directories", "directories",
"fake",
"futures-util", "futures-util",
"gethostname", "gethostname",
"insta", "insta",
+8
View File
@@ -24,6 +24,14 @@ chrono = { version = "0.4.44", default-features = false, features = ["clock"] }
crossterm = { version = "0.29.0", features = ["event-stream"] } crossterm = { version = "0.29.0", features = ["event-stream"] }
csv = "1.4.0" csv = "1.4.0"
directories = "6.0.0" directories = "6.0.0"
# Realistic fake-data generators for the `seed` command (ADR-0048):
# names, emails, addresses, companies, lorem, etc. Default features
# only — the basic fakers need no flags; date/datetime values are
# generated in-house (rand + the existing `chrono`) for the bounded
# windows ADR-0048 D8 requires, so `fake`'s `chrono` feature is
# deliberately omitted. No commerce/product module exists, so the
# `product` generator is hand-rolled (D9).
fake = "5"
futures-util = "0.3.32" futures-util = "0.3.32"
gethostname = "1.1.0" gethostname = "1.1.0"
rand = "0.10.1" rand = "0.10.1"
+22 -16
View File
@@ -170,23 +170,29 @@ companies, phone numbers, lorem text, dates. Generation is driven by a
per-column **generator** chosen by the heuristics (D7) or the override per-column **generator** chosen by the heuristics (D7) or the override
(D2), falling back to **type-based** generation (D8). (D2), falling back to **type-based** generation (D8).
**Two open implementation-time verifications** (flagged honestly, to **Implementation-time verifications (resolved 2026-06-11 when the
be resolved when the dependency is locked, not assumed here): dependency was added):**
- **`rand` de-duplication.** The project is on `rand 0.10.1`; `fake` - **`rand` de-duplication — clean.** `fake` 5.1.0 depends on
brings its own `rand`. Confirm a single `rand` version resolves (a `rand = "0.10"`, the **same major** as the project's `rand 0.10.1`,
duplicate is harmless but should be a conscious outcome, and so `cargo tree -e normal` resolves a **single** `rand 0.10.1` (no
`shortid.rs` + the seed RNG must share the version we standardise runtime duplication; the `rand 0.8.6` visible to `cargo tree -i
on). rand` is only `fake`'s own dev-dependency, never compiled for us).
- **`fake` module inventory.** Confirm which generators v5 actually Consequence for D4: one seeded `rand 0.10` `StdRng` can drive
ships (strong prior: it has Name/Internet/Address/Company/Lorem/ **both** `fake`'s `fake_with_rng` and the hand-rolled generators —
Chrono/Currency/Job/Color but **no commerce/product module** — see determinism is single-RNG, single-version, and shares `shortid.rs`'s
D9), and the minimal feature-flag set needed (derive, chrono-backed `rand` version.
dates). - **`fake` module inventory / features — confirmed.** Default features
- **Security (new-dependency posture).** `fake` and its transitive (`["either"]`) cover the core string fakers used here
tree must be scanned (`trivy fs`, `grype`, `osv-scanner`) before (Name/Internet/Address/Company/Lorem/PhoneNumber); `fake`'s `chrono`
merge, per the global new-dependency rule; findings documented, not feature is **deliberately omitted** (dates generated in-house for
silently accepted. D8's bounded windows). No commerce/product module exists → `product`
is hand-rolled (D9). (The exact faker call sites are pinned when the
generation library is built.)
- **Security (new-dependency posture) — clean.** The `fake` tree (296
packages total) scanned clean by **all three** mandated scanners:
`osv-scanner` (no issues), `grype` (no vulnerabilities), `trivy fs
--scanners vuln` (0). No findings to document or accept.
### D4 — Determinism: `--seed <n>` (fork, user-chosen: "optional flag") ### D4 — Determinism: `--seed <n>` (fork, user-chosen: "optional flag")
+1
View File
@@ -23,6 +23,7 @@ pub mod output_render;
pub mod persistence; pub mod persistence;
pub mod project; pub mod project;
pub mod runtime; pub mod runtime;
pub mod seed;
pub mod theme; pub mod theme;
pub mod type_change; pub mod type_change;
pub mod ui; pub mod ui;
+383
View File
@@ -0,0 +1,383 @@
//! Value production: turn a [`Generator`] + a seeded RNG into a
//! [`Value`] (ADR-0048 D8/D9). Realistic generators come from the
//! `fake` crate (English locale); `product` is hand-rolled (D9, no
//! commerce module exists); dates are generated against a **fixed
//! reference epoch** so a `--seed` run is fully reproducible without
//! depending on the wall clock (D8 bounded windows).
//!
//! The stateful markers ([`Generator::IdentitySequential`],
//! [`Generator::ForeignKeySample`]) are resolved by the executor with
//! database context; if one reaches here un-intercepted it falls back
//! to type-based generation rather than panicking.
use chrono::{Datelike, NaiveDate};
use fake::Fake;
use rand::RngExt;
use crate::dsl::types::Type;
use crate::dsl::value::Value;
use crate::seed::{Generator, SeedRng};
/// Fixed anchor for bounded date/datetime windows. Using a constant
/// (rather than `now()`) keeps `--seed` output reproducible across days
/// and makes tests deterministic. It advances with releases.
const REF_YEAR: i32 = 2025;
const REF_MONTH: u32 = 6;
const REF_DAY: u32 = 1;
/// `~3 years` window for "recent" dates, in days.
const RECENT_WINDOW_DAYS: i64 = 3 * 365;
/// Adult birth window (≈1880 years ago), in days.
const ADULT_MIN_DAYS: i64 = 18 * 365;
const ADULT_MAX_DAYS: i64 = 80 * 365;
/// Produce one value for `generator` against destination type `ty`.
#[must_use]
pub fn generate_value(generator: &Generator, ty: Type, rng: &mut SeedRng) -> Value {
use fake::faker::address::en as addr;
use fake::faker::company::en as company;
use fake::faker::internet::en as net;
use fake::faker::job::en as job;
use fake::faker::lorem::en as lorem;
use fake::faker::name::en as name;
use fake::faker::phone_number::en as phone;
match generator {
Generator::FirstName => Value::Text(name::FirstName().fake_with_rng(rng)),
Generator::LastName => Value::Text(name::LastName().fake_with_rng(rng)),
Generator::FullName => Value::Text(name::Name().fake_with_rng(rng)),
Generator::Email => Value::Text(net::FreeEmail().fake_with_rng(rng)),
Generator::Username => Value::Text(net::Username().fake_with_rng(rng)),
Generator::Password => Value::Text(net::Password(8..16).fake_with_rng(rng)),
Generator::Phone => Value::Text(phone::PhoneNumber().fake_with_rng(rng)),
Generator::City => Value::Text(addr::CityName().fake_with_rng(rng)),
Generator::Country => Value::Text(addr::CountryName().fake_with_rng(rng)),
Generator::StateName => Value::Text(addr::StateName().fake_with_rng(rng)),
Generator::Street => Value::Text(addr::StreetName().fake_with_rng(rng)),
Generator::ZipCode => Value::Text(addr::ZipCode().fake_with_rng(rng)),
Generator::Company => Value::Text(company::CompanyName().fake_with_rng(rng)),
Generator::JobTitle => Value::Text(job::Title().fake_with_rng(rng)),
Generator::ProductName => Value::Text(product_name(rng)),
Generator::Sentence => Value::Text(lorem::Sentence(5..12).fake_with_rng(rng)),
Generator::Paragraph => Value::Text(lorem::Paragraph(2..4).fake_with_rng(rng)),
Generator::Url => {
let word: String = lorem::Word().fake_with_rng(rng);
let suffix: String = net::DomainSuffix().fake_with_rng(rng);
Value::Text(format!("https://{word}.{suffix}"))
}
// Hand-rolled — `fake`'s color module is feature-gated (it pulls
// an extra crate); a hex colour is trivial from the RNG.
Generator::HexColor => Value::Text(format!("#{:06X}", rng.random_range(0..0x0100_0000))),
Generator::CurrencyAmount => currency_amount(ty, rng),
Generator::Age => Value::Number(rng.random_range(18..=80).to_string()),
Generator::SmallInt => Value::Number(rng.random_range(1..=100).to_string()),
Generator::DateRecent => Value::Text(format_date(random_past_date(rng, 0, RECENT_WINDOW_DAYS))),
Generator::DateAdult => {
Value::Text(format_date(random_past_date(rng, ADULT_MIN_DAYS, ADULT_MAX_DAYS)))
}
Generator::DateTimeRecent => Value::Text(random_recent_datetime(rng)),
Generator::Boolean => Value::Bool(rng.random_range(0..2) == 1),
Generator::PickFrom(values) if !values.is_empty() => {
let chosen: &String = pick(rng, values);
literal_to_value(chosen, ty)
}
// Un-intercepted markers + an empty pick list → type-based.
Generator::PickFrom(_)
| Generator::IdentitySequential
| Generator::ForeignKeySample
| Generator::Generic => generic_for_type(ty, rng),
}
}
/// Type-based fallback generation (D8). Never produces NULL for a
/// generatable type; `blob`/`serial`/`shortid` are handled by the
/// executor (autogen / block guard) and yield NULL here only as a
/// last resort.
fn generic_for_type(ty: Type, rng: &mut SeedRng) -> Value {
use fake::faker::lorem::en as lorem;
match ty {
Type::Text => {
let words: Vec<String> = lorem::Words(2..4).fake_with_rng(rng);
Value::Text(words.join(" "))
}
Type::ShortId => Value::Text(crate::dsl::shortid::generate()),
Type::Int => Value::Number(rng.random_range(1..=10_000).to_string()),
Type::Serial => Value::Number(rng.random_range(1..=10_000).to_string()),
Type::Real => {
let n: f64 = rng.random_range(0..100_000) as f64 / 100.0;
Value::Number(format!("{n:.2}"))
}
Type::Decimal => {
let dollars = rng.random_range(0..10_000);
let cents = rng.random_range(0..100);
Value::Number(format!("{dollars}.{cents:02}"))
}
Type::Bool => Value::Bool(rng.random_range(0..2) == 1),
Type::Date => Value::Text(format_date(random_past_date(rng, 0, RECENT_WINDOW_DAYS))),
Type::DateTime => Value::Text(random_recent_datetime(rng)),
Type::Blob => Value::Null,
}
}
/// Wrap a fixed-list literal as the right `Value` shape for `ty` (used
/// by `PickFrom` — enum / `IN`-CHECK values).
fn literal_to_value(s: &str, ty: Type) -> Value {
match ty {
Type::Int | Type::Serial | Type::Real | Type::Decimal => Value::Number(s.to_string()),
Type::Bool => Value::Bool(matches!(s.to_ascii_lowercase().as_str(), "true" | "1")),
_ => Value::Text(s.to_string()),
}
}
/// A money-shaped amount: whole for `int`/`serial`, two-decimal for the
/// fractional numeric types.
fn currency_amount(ty: Type, rng: &mut SeedRng) -> Value {
match ty {
Type::Real | Type::Decimal => {
let dollars = rng.random_range(1..=1_000);
let cents = rng.random_range(0..100);
Value::Number(format!("{dollars}.{cents:02}"))
}
// int / serial / anything else numeric → whole amount.
_ => Value::Number(rng.random_range(1..=1_000).to_string()),
}
}
// — the hand-rolled `product` generator (D9) —
const PRODUCT_ADJECTIVES: &[&str] = &[
"Sleek", "Rustic", "Ergonomic", "Handcrafted", "Refined", "Modern",
"Vintage", "Compact", "Premium", "Lightweight", "Durable", "Elegant",
"Sturdy", "Smooth", "Gorgeous", "Intelligent", "Practical", "Awesome",
"Incredible", "Recycled",
];
const PRODUCT_MATERIALS: &[&str] = &[
"Wooden", "Copper", "Granite", "Cotton", "Steel", "Leather", "Bamboo",
"Plastic", "Ceramic", "Glass", "Concrete", "Rubber", "Bronze", "Marble",
"Linen", "Silk", "Aluminum", "Wool", "Gold", "Carbon",
];
const PRODUCT_NOUNS: &[&str] = &[
"Chair", "Lamp", "Table", "Bottle", "Backpack", "Keyboard", "Mug",
"Shoes", "Jacket", "Watch", "Wallet", "Bench", "Hat", "Gloves",
"Towel", "Ball", "Bike", "Knife", "Pillow", "Blanket",
];
fn product_name(rng: &mut SeedRng) -> String {
format!(
"{} {} {}",
pick(rng, PRODUCT_ADJECTIVES),
pick(rng, PRODUCT_MATERIALS),
pick(rng, PRODUCT_NOUNS),
)
}
// — bounded dates (D8) —
const fn reference_date() -> NaiveDate {
match NaiveDate::from_ymd_opt(REF_YEAR, REF_MONTH, REF_DAY) {
Some(d) => d,
None => panic!("reference date constants must be valid"),
}
}
/// A date between `min_days_ago` and `max_days_ago` before the
/// reference epoch (inclusive).
fn random_past_date(rng: &mut SeedRng, min_days_ago: i64, max_days_ago: i64) -> NaiveDate {
let days_ago = rng.random_range(min_days_ago..=max_days_ago);
let ce = reference_date().num_days_from_ce();
let target = ce - i32::try_from(days_ago).unwrap_or(0);
NaiveDate::from_num_days_from_ce_opt(target).unwrap_or_else(reference_date)
}
fn format_date(date: NaiveDate) -> String {
date.format("%Y-%m-%d").to_string()
}
/// A recent datetime: a recent date plus a random time-of-day, rendered
/// as `YYYY-MM-DDTHH:MM:SS`.
fn random_recent_datetime(rng: &mut SeedRng) -> String {
let date = random_past_date(rng, 0, RECENT_WINDOW_DAYS);
let h = rng.random_range(0..24);
let m = rng.random_range(0..60);
let s = rng.random_range(0..60);
format!("{}T{h:02}:{m:02}:{s:02}", format_date(date))
}
/// Pick a uniformly random element from a non-empty slice.
fn pick<'a, T>(rng: &mut SeedRng, items: &'a [T]) -> &'a T {
&items[rng.random_range(0..items.len())]
}
#[cfg(test)]
mod tests {
use super::*;
use crate::seed::make_rng;
use pretty_assertions::assert_eq;
fn gen_once(generator: &Generator, ty: Type, seed: u64) -> Value {
let mut rng = make_rng(Some(seed));
generate_value(generator, ty, &mut rng)
}
#[test]
fn generation_is_deterministic_for_a_fixed_seed() {
for generator in [
Generator::FullName,
Generator::Email,
Generator::ProductName,
Generator::DateRecent,
Generator::CurrencyAmount,
] {
let a = gen_once(&generator, Type::Text, 7);
let b = gen_once(&generator, Type::Text, 7);
assert_eq!(a, b, "{generator:?} must reproduce for a fixed seed");
}
}
#[test]
fn text_generators_produce_nonempty_text() {
for generator in [
Generator::FirstName,
Generator::LastName,
Generator::FullName,
Generator::Email,
Generator::Username,
Generator::Company,
Generator::City,
Generator::ProductName,
] {
let v = gen_once(&generator, Type::Text, 3);
match v {
Value::Text(s) => assert!(!s.trim().is_empty(), "{generator:?} produced empty text"),
other => panic!("{generator:?} produced non-text {other:?}"),
}
}
}
#[test]
fn email_looks_like_an_email() {
let v = gen_once(&Generator::Email, Type::Text, 11);
let Value::Text(s) = v else { panic!("not text") };
assert!(s.contains('@'), "email should contain @: {s}");
}
#[test]
fn product_name_is_three_capitalised_words() {
let v = gen_once(&Generator::ProductName, Type::Text, 99);
let Value::Text(s) = v else { panic!("not text") };
let words: Vec<&str> = s.split(' ').collect();
assert_eq!(words.len(), 3, "product name should be 3 words: {s}");
for w in words {
assert!(w.chars().next().unwrap().is_ascii_uppercase(), "word `{w}` not capitalised");
}
}
#[test]
fn recent_dates_fall_within_the_bounded_window() {
let mut rng = make_rng(Some(1));
let earliest = reference_date()
.checked_sub_days(chrono::Days::new(RECENT_WINDOW_DAYS as u64))
.unwrap();
let latest = reference_date();
for _ in 0..200 {
let v = generate_value(&Generator::DateRecent, Type::Date, &mut rng);
let Value::Text(s) = v else { panic!("date not text") };
let d = NaiveDate::parse_from_str(&s, "%Y-%m-%d").expect("valid ISO date");
assert!(d >= earliest && d <= latest, "date {d} outside recent window");
}
}
#[test]
fn dob_dates_fall_within_the_adult_window() {
let mut rng = make_rng(Some(2));
let earliest = reference_date()
.checked_sub_days(chrono::Days::new(ADULT_MAX_DAYS as u64))
.unwrap();
let latest = reference_date()
.checked_sub_days(chrono::Days::new(ADULT_MIN_DAYS as u64))
.unwrap();
for _ in 0..200 {
let v = generate_value(&Generator::DateAdult, Type::Date, &mut rng);
let Value::Text(s) = v else { panic!("date not text") };
let d = NaiveDate::parse_from_str(&s, "%Y-%m-%d").expect("valid ISO date");
assert!(d >= earliest && d <= latest, "dob {d} outside adult window");
}
}
#[test]
fn datetime_is_iso_shaped() {
let v = gen_once(&Generator::DateTimeRecent, Type::DateTime, 5);
let Value::Text(s) = v else { panic!("not text") };
assert!(s.contains('T'), "datetime needs a T separator: {s}");
// Parses as a naive datetime.
chrono::NaiveDateTime::parse_from_str(&s, "%Y-%m-%dT%H:%M:%S")
.unwrap_or_else(|e| panic!("invalid datetime {s}: {e}"));
}
#[test]
fn currency_is_whole_for_int_and_fractional_for_decimal() {
let Value::Number(int_amt) = gen_once(&Generator::CurrencyAmount, Type::Int, 4) else {
panic!("not a number")
};
assert!(!int_amt.contains('.'), "int currency should be whole: {int_amt}");
let Value::Number(dec_amt) = gen_once(&Generator::CurrencyAmount, Type::Decimal, 4) else {
panic!("not a number")
};
assert!(dec_amt.contains('.'), "decimal currency should have cents: {dec_amt}");
}
#[test]
fn age_is_in_human_range() {
let mut rng = make_rng(Some(8));
for _ in 0..100 {
let Value::Number(a) = generate_value(&Generator::Age, Type::Int, &mut rng) else {
panic!("age not a number")
};
let n: i64 = a.parse().unwrap();
assert!((18..=80).contains(&n), "age {n} out of range");
}
}
#[test]
fn pick_from_chooses_a_listed_value() {
let generator = Generator::PickFrom(vec!["active".into(), "closed".into()]);
let mut rng = make_rng(Some(6));
for _ in 0..50 {
let Value::Text(s) = generate_value(&generator, Type::Text, &mut rng) else {
panic!("not text")
};
assert!(matches!(s.as_str(), "active" | "closed"), "unexpected pick {s}");
}
}
#[test]
fn pick_from_wraps_numeric_values_as_numbers() {
let generator = Generator::PickFrom(vec!["1".into(), "2".into(), "3".into()]);
let mut rng = make_rng(Some(6));
let v = generate_value(&generator, Type::Int, &mut rng);
assert!(matches!(v, Value::Number(_)), "numeric pick should be a Number: {v:?}");
}
#[test]
fn markers_fall_back_to_type_based_generation() {
// An un-intercepted marker must not panic; it generates by type.
let v = gen_once(&Generator::IdentitySequential, Type::Text, 1);
assert!(matches!(v, Value::Text(_)));
let v = gen_once(&Generator::ForeignKeySample, Type::Int, 1);
assert!(matches!(v, Value::Number(_)));
}
#[test]
fn generic_fallback_matches_each_type() {
let mut rng = make_rng(Some(0));
assert!(matches!(generate_value(&Generator::Generic, Type::Text, &mut rng), Value::Text(_)));
assert!(matches!(generate_value(&Generator::Generic, Type::Int, &mut rng), Value::Number(_)));
assert!(matches!(generate_value(&Generator::Generic, Type::Bool, &mut rng), Value::Bool(_)));
assert!(matches!(generate_value(&Generator::Generic, Type::Blob, &mut rng), Value::Null));
// shortid fallback is a valid base58 id.
let Value::Text(sid) = generate_value(&Generator::Generic, Type::ShortId, &mut rng) else {
panic!("shortid not text")
};
assert!(crate::dsl::shortid::validate(&sid).is_ok(), "invalid shortid {sid}");
}
}
+440
View File
@@ -0,0 +1,440 @@
//! Generator selection: the name-aware, type-gated catalogue (ADR-0048
//! D7), table-context disambiguation for `name`/`title` (D11), the
//! identifier-family rule (D10), and enum-ish detection (D12).
//!
//! Selection is **token-based**: a column name is split on `_`, `-` and
//! camelCase boundaries, lowercased, and matched against an
//! ordered, most-specific-first list. Each rule is **type-gated** — a
//! name match only fires when the column's type is compatible, so a
//! column called `email` typed `int` falls through to type-based
//! generation rather than producing a string. Documented false-positive
//! guards keep `username`/`filename` away from the bare person-name
//! rule.
use tracing::trace;
use crate::dsl::types::Type;
use crate::seed::{ColumnSpec, Generator};
/// Choose the generator for a column (ADR-0048 D7/D10/D11/D12).
///
/// Precedence: foreign keys and `IN`-CHECK columns are resolved first
/// (the executor / a fixed list), then the ordered name catalogue, then
/// the type-based fallback.
#[must_use]
pub fn choose_generator(table: &str, col: &ColumnSpec) -> Generator {
let generator = choose_generator_inner(table, col);
trace!(
table = table,
column = %col.name,
ty = %col.ty,
chosen = ?generator,
"seed: chose generator for column"
);
generator
}
fn choose_generator_inner(table: &str, col: &ColumnSpec) -> Generator {
// FK columns are filled by sampling existing parent rows (D14) —
// the executor owns that; generation here would be wrong.
if col.is_foreign_key {
return Generator::ForeignKeySample;
}
// A simple `col IN (…)` CHECK becomes the value source (D17), so the
// common enum-as-CHECK pattern just works.
if let Some(values) = &col.check_in_values
&& !values.is_empty()
{
return Generator::PickFrom(values.clone());
}
let toks = tokens(&col.name);
match_name_generator(table, &toks, col.ty).unwrap_or(Generator::Generic)
}
/// Whether a column name looks like an enum / fixed-value set that has
/// no sensible generic generator (D12). Used by the executor to drive
/// the post-seed advisory; such columns still receive generic text.
#[must_use]
pub fn is_enum_ish(name: &str) -> bool {
const ENUM_TOKENS: &[&str] = &[
"role", "status", "state", "type", "kind", "category", "level",
"tier", "stage", "priority", "gender",
];
let toks = tokens(name);
toks.iter().any(|t| ENUM_TOKENS.contains(&t.as_str()))
}
/// The ordered, most-specific-first name catalogue. Returns `None` when
/// nothing matches (→ type-based fallback) or when a name matches but
/// its type gate fails.
fn match_name_generator(table: &str, toks: &[String], ty: Type) -> Option<Generator> {
let text = type_is_text(ty);
let numeric = ty.is_numeric();
// — Person —
if text && (has_any(toks, &["fname", "firstname"]) || has_seq(toks, "first", "name")) {
return Some(Generator::FirstName);
}
if text
&& (has_any(toks, &["lname", "lastname", "surname"]) || has_seq(toks, "last", "name"))
{
return Some(Generator::LastName);
}
if text && (has_any(toks, &["username", "login", "handle"]) || has_seq(toks, "user", "name")) {
return Some(Generator::Username);
}
if text && has_any(toks, &["email", "emails"]) {
return Some(Generator::Email);
}
if text && has_any(toks, &["password", "passwd", "pwd"]) {
return Some(Generator::Password);
}
if text && has_any(toks, &["phone", "mobile", "cell", "tel", "telephone"]) {
return Some(Generator::Phone);
}
// — bare `name` / `title` → table-context (D11) —
// Guarded against the `*_name` false positives handled above (those
// returned already) plus structural names like `filename`/`table_name`.
if text && has_any(toks, &["name", "title"]) && !is_name_false_positive(toks) {
return Some(name_by_table_context(table));
}
// — Address —
if text && has_any(toks, &["city", "town"]) {
return Some(Generator::City);
}
if text && has_token(toks, "country") {
return Some(Generator::Country);
}
// `province` / explicit `state_name`/`state_abbr` → a real state name.
// Bare `state` is left to enum-ish (it usually means status), so we
// require `province` or a `state` token paired with name/abbr.
if text && (has_token(toks, "province") || (has_token(toks, "state") && has_any(toks, &["name", "abbr"]))) {
return Some(Generator::StateName);
}
if text && has_any(toks, &["street", "address", "addr"]) {
return Some(Generator::Street);
}
if text && has_any(toks, &["zip", "zipcode", "postcode", "postal"]) {
return Some(Generator::ZipCode);
}
// — Organisation / job —
if text && has_any(toks, &["company", "employer", "org", "organization", "organisation"]) {
return Some(Generator::Company);
}
if text && has_any(toks, &["job", "position", "profession", "occupation"]) {
return Some(Generator::JobTitle);
}
// — Free text —
if text && has_any(toks, &["description", "bio", "notes", "note", "summary", "comment", "comments", "about"]) {
return Some(Generator::Sentence);
}
if text && has_any(toks, &["url", "website", "homepage", "link"]) {
return Some(Generator::Url);
}
if text && has_any(toks, &["color", "colour"]) {
return Some(Generator::HexColor);
}
// — Numeric —
if numeric && has_any(toks, &["price", "amount", "cost", "salary", "balance", "total", "fee", "revenue"]) {
return Some(Generator::CurrencyAmount);
}
if numeric && has_token(toks, "age") {
return Some(Generator::Age);
}
if numeric && has_any(toks, &["quantity", "qty", "stock", "count"]) {
return Some(Generator::SmallInt);
}
// — Temporal (bounded, D8) —
if matches!(ty, Type::Date) && has_any(toks, &["dob", "birthday", "birthdate"]) {
return Some(Generator::DateAdult);
}
if matches!(ty, Type::Date) && has_token(toks, "date") {
return Some(Generator::DateRecent);
}
if matches!(ty, Type::DateTime) && has_any(toks, &["timestamp", "datetime", "at"]) {
return Some(Generator::DateTimeRecent);
}
// — Boolean —
if matches!(ty, Type::Bool)
&& (toks.first().map(String::as_str) == Some("is")
|| toks.first().map(String::as_str) == Some("has")
|| has_any(toks, &["active", "enabled", "verified", "deleted"]))
{
return Some(Generator::Boolean);
}
// — Identifier family (D10) — late so phone/email/etc. win first.
if matches!(ty, Type::Int | Type::Text) && is_identifier_name(toks) {
return Some(Generator::IdentitySequential);
}
None
}
/// Resolve a bare `name`/`title` column by the **table** it lives in
/// (D11): product-ish → a product name, company-ish → a company name,
/// person-ish → a person name, otherwise a generic person name.
fn name_by_table_context(table: &str) -> Generator {
let toks = tokens(table);
const PRODUCTY: &[&str] = &[
"product", "products", "item", "items", "good", "goods",
"merchandise", "catalog", "catalogue", "inventory", "sku", "skus",
];
const COMPANYISH: &[&str] = &[
"company", "companies", "vendor", "vendors", "supplier",
"suppliers", "manufacturer", "manufacturers", "brand", "brands",
"organization", "organisation",
];
const PERSONISH: &[&str] = &[
"user", "users", "customer", "customers", "person", "people",
"employee", "employees", "member", "members", "contact",
"contacts", "author", "authors", "student", "students",
];
if has_any(&toks, PRODUCTY) {
Generator::ProductName
} else if has_any(&toks, COMPANYISH) {
Generator::Company
} else if has_any(&toks, PERSONISH) {
Generator::FullName
} else {
// Unknown table: a person name is the most generally useful
// default for a bare `name` column.
Generator::FullName
}
}
/// Names ending in `name`/`title` that are NOT person names. The
/// specific `first`/`last`/`user` cases are matched earlier and return
/// before this guard; this catches structural names.
fn is_name_false_positive(toks: &[String]) -> bool {
const NON_PERSON: &[&str] = &[
"file", "table", "host", "domain", "field", "class", "tag",
"event", "path", "col", "column", "db", "schema", "index", "key",
"page", "node", "type",
];
has_any(toks, NON_PERSON) && has_any(toks, &["name", "title"])
}
/// Identifier-family names (D10): treated as unique identifiers. FK
/// columns never reach here (handled in [`choose_generator`]).
fn is_identifier_name(toks: &[String]) -> bool {
const ID_TOKENS: &[&str] = &["id", "code", "sku", "ref", "reference", "barcode"];
if has_any(toks, ID_TOKENS) {
return true;
}
// `*_number` / `*_no` as an identifier, but only when qualified
// (a bare `number`/`no` is too ambiguous, and `phone_number` already
// matched the phone rule earlier).
toks.len() >= 2 && has_any(toks, &["number", "no"])
}
// — token utilities —
/// Split a column/table name into lowercase tokens on `_`, `-`, spaces,
/// and camelCase boundaries. `created_at` → [`created`, `at`];
/// `firstName` → [`first`, `name`]; `DOB` → [`dob`].
fn tokens(name: &str) -> Vec<String> {
let mut out = Vec::new();
let mut cur = String::new();
let mut prev_was_lower_or_digit = false;
for ch in name.chars() {
if ch == '_' || ch == '-' || ch == ' ' {
if !cur.is_empty() {
out.push(std::mem::take(&mut cur));
}
prev_was_lower_or_digit = false;
continue;
}
// camelCase boundary: an uppercase letter following a lowercase
// letter or digit starts a new token.
if ch.is_ascii_uppercase() && prev_was_lower_or_digit && !cur.is_empty() {
out.push(std::mem::take(&mut cur));
}
cur.push(ch.to_ascii_lowercase());
prev_was_lower_or_digit = ch.is_ascii_lowercase() || ch.is_ascii_digit();
}
if !cur.is_empty() {
out.push(cur);
}
out
}
fn has_token(toks: &[String], t: &str) -> bool {
toks.iter().any(|x| x == t)
}
fn has_any(toks: &[String], candidates: &[&str]) -> bool {
candidates.iter().any(|c| has_token(toks, c))
}
/// Whether `a` is immediately followed by `b` in the token list — for
/// matching split compound names like `first name` / `user name`.
fn has_seq(toks: &[String], a: &str, b: &str) -> bool {
toks.windows(2).any(|w| w[0] == a && w[1] == b)
}
/// Text-typed for heuristic purposes — `text`, `shortid`, plus the
/// text-backed `decimal`/`date`/`datetime` are excluded here because
/// those have their own dedicated gates; only `text`/`shortid` accept
/// free-text generators.
const fn type_is_text(ty: Type) -> bool {
matches!(ty, Type::Text | Type::ShortId)
}
#[cfg(test)]
mod tests {
use super::*;
use crate::seed::ColumnSpec;
use pretty_assertions::assert_eq;
fn choose(table: &str, name: &str, ty: Type) -> Generator {
choose_generator(table, &ColumnSpec::plain(name, ty))
}
#[test]
fn person_name_fields_map_to_name_generators() {
assert_eq!(choose("users", "first_name", Type::Text), Generator::FirstName);
assert_eq!(choose("users", "firstName", Type::Text), Generator::FirstName);
assert_eq!(choose("users", "last_name", Type::Text), Generator::LastName);
assert_eq!(choose("users", "surname", Type::Text), Generator::LastName);
}
#[test]
fn contact_fields_map_correctly() {
assert_eq!(choose("users", "email", Type::Text), Generator::Email);
assert_eq!(choose("users", "work_email", Type::Text), Generator::Email);
assert_eq!(choose("users", "username", Type::Text), Generator::Username);
assert_eq!(choose("users", "user_name", Type::Text), Generator::Username);
assert_eq!(choose("users", "phone", Type::Text), Generator::Phone);
assert_eq!(choose("accounts", "password", Type::Text), Generator::Password);
}
#[test]
fn address_fields_map_correctly() {
assert_eq!(choose("a", "city", Type::Text), Generator::City);
assert_eq!(choose("a", "country", Type::Text), Generator::Country);
assert_eq!(choose("a", "street", Type::Text), Generator::Street);
assert_eq!(choose("a", "zip", Type::Text), Generator::ZipCode);
assert_eq!(choose("a", "postcode", Type::Text), Generator::ZipCode);
assert_eq!(choose("a", "province", Type::Text), Generator::StateName);
}
#[test]
fn bare_name_uses_table_context() {
// D11 — the same column name resolves differently by table.
assert_eq!(choose("products", "name", Type::Text), Generator::ProductName);
assert_eq!(choose("items", "title", Type::Text), Generator::ProductName);
assert_eq!(choose("users", "name", Type::Text), Generator::FullName);
assert_eq!(choose("customers", "name", Type::Text), Generator::FullName);
assert_eq!(choose("vendors", "name", Type::Text), Generator::Company);
// Unknown table → person name default.
assert_eq!(choose("widgets", "name", Type::Text), Generator::FullName);
}
#[test]
fn name_false_positives_do_not_become_person_names() {
// These must NOT resolve to a person/product name.
assert_ne!(choose("files", "filename", Type::Text), Generator::FullName);
assert_ne!(choose("meta", "table_name", Type::Text), Generator::FullName);
// They fall through to a generic / non-person generator.
assert_eq!(choose("files", "filename", Type::Text), Generator::Generic);
}
#[test]
fn numeric_name_heuristics_are_type_gated() {
// `price` on a numeric column → currency; on text → falls through.
assert_eq!(choose("p", "price", Type::Int), Generator::CurrencyAmount);
assert_eq!(choose("p", "price", Type::Decimal), Generator::CurrencyAmount);
assert_eq!(choose("p", "price", Type::Text), Generator::Generic);
assert_eq!(choose("u", "age", Type::Int), Generator::Age);
assert_eq!(choose("o", "quantity", Type::Int), Generator::SmallInt);
}
#[test]
fn email_on_wrong_type_falls_through() {
// The type gate: an `email` int column does NOT get a string —
// it falls through to type-based generation.
assert_eq!(choose("u", "email", Type::Int), Generator::Generic);
}
#[test]
fn temporal_fields_are_bounded_and_type_gated() {
assert_eq!(choose("u", "dob", Type::Date), Generator::DateAdult);
assert_eq!(choose("o", "order_date", Type::Date), Generator::DateRecent);
assert_eq!(choose("o", "created_at", Type::DateTime), Generator::DateTimeRecent);
assert_eq!(choose("o", "timestamp", Type::DateTime), Generator::DateTimeRecent);
// Wrong type → not a date generator.
assert_eq!(choose("o", "order_date", Type::Int), Generator::Generic);
}
#[test]
fn boolean_fields_map_to_boolean() {
assert_eq!(choose("u", "is_active", Type::Bool), Generator::Boolean);
assert_eq!(choose("u", "has_paid", Type::Bool), Generator::Boolean);
assert_eq!(choose("u", "enabled", Type::Bool), Generator::Boolean);
}
#[test]
fn identifier_family_is_unique_sequential() {
assert_eq!(choose("t", "code", Type::Text), Generator::IdentitySequential);
assert_eq!(choose("t", "sku", Type::Text), Generator::IdentitySequential);
assert_eq!(choose("t", "order_number", Type::Int), Generator::IdentitySequential);
assert_eq!(choose("t", "external_id", Type::Int), Generator::IdentitySequential);
}
#[test]
fn foreign_key_columns_defer_to_executor() {
let mut spec = ColumnSpec::plain("user_id", Type::Int);
spec.is_foreign_key = true;
assert_eq!(choose_generator("orders", &spec), Generator::ForeignKeySample);
}
#[test]
fn check_in_values_become_pick_from() {
let mut spec = ColumnSpec::plain("status", Type::Text);
spec.check_in_values = Some(vec!["active".into(), "closed".into()]);
assert_eq!(
choose_generator("orders", &spec),
Generator::PickFrom(vec!["active".into(), "closed".into()])
);
}
#[test]
fn enum_ish_names_are_detected_for_the_advisory() {
assert!(is_enum_ish("status"));
assert!(is_enum_ish("role"));
assert!(is_enum_ish("order_state"));
assert!(is_enum_ish("priority"));
assert!(!is_enum_ish("email"));
assert!(!is_enum_ish("first_name"));
}
#[test]
fn enum_ish_columns_fall_through_to_generic() {
// No special generator — generic text + the advisory flags them.
assert_eq!(choose("orders", "status", Type::Text), Generator::Generic);
assert_eq!(choose("users", "role", Type::Text), Generator::Generic);
}
#[test]
fn unmatched_columns_use_type_based_fallback() {
assert_eq!(choose("t", "some_freeform_field", Type::Text), Generator::Generic);
}
#[test]
fn tokenizer_splits_on_all_boundaries() {
assert_eq!(tokens("created_at"), vec!["created", "at"]);
assert_eq!(tokens("firstName"), vec!["first", "name"]);
assert_eq!(tokens("DOB"), vec!["dob"]);
assert_eq!(tokens("user-email"), vec!["user", "email"]);
assert_eq!(tokens("HTTPStatus"), vec!["httpstatus"]);
}
}
+200
View File
@@ -0,0 +1,200 @@
//! Pure fake-data generation library for the `seed` command (ADR-0048).
//!
//! This module is the **generation half** of `seed`: given a column's
//! shape (name, type, constraints), it chooses a *generator* and turns
//! a seeded RNG into plausible [`Value`]s. It is deliberately decoupled
//! from `db.rs` — it knows nothing about SQLite, the worker thread, or
//! persistence — so it stays pure and unit-testable, with exact-value
//! assertions made possible by the seedable RNG (ADR-0048 D4).
//!
//! The executor (`db.rs::do_seed`) adapts the real schema into
//! [`ColumnSpec`]s, calls [`choose_generator`] per column, and then
//! [`generate_value`] per row — except for the *stateful* markers
//! ([`Generator::IdentitySequential`], [`Generator::ForeignKeySample`])
//! which need database context (existing rows, the running sequence)
//! and so are resolved by the executor, not here.
//!
//! Layout:
//! - this file — the public types ([`ColumnSpec`], [`Generator`],
//! [`SeedRng`]) and the RNG constructor.
//! - [`heuristics`] — [`choose_generator`] + the name-aware catalogue
//! (D7), table-context disambiguation (D11), identifier (D10) and
//! enum-ish (D12) detection.
//! - [`generators`] — [`generate_value`]: per-generator value
//! production, the hand-rolled `product` generator (D9) and the
//! bounded date windows (D8).
mod generators;
mod heuristics;
pub use generators::generate_value;
pub use heuristics::{choose_generator, is_enum_ish};
use rand::rngs::StdRng;
use rand::{RngExt, SeedableRng};
use crate::dsl::types::Type;
/// The RNG that drives all seed generation.
///
/// A single seeded `StdRng` feeds both `fake`'s `fake_with_rng` and the
/// hand-rolled generators, so a `--seed` value fully determines the
/// output (ADR-0048 D4). `rand 0.10`'s `StdRng` satisfies `fake`'s
/// `RngExt` bound (it re-exports `rand::RngExt`), so the same handle
/// works on both sides.
pub type SeedRng = StdRng;
/// Build the seed RNG.
///
/// With `Some(seed)` the stream is reproducible; with `None` it is
/// seeded from entropy (via the thread RNG) so each run differs.
/// Seeding `StdRng` from a single `u64` in both cases keeps
/// construction uniform and avoids `rand`'s churn-prone from-entropy
/// constructors.
#[must_use]
pub fn make_rng(seed: Option<u64>) -> SeedRng {
let seed = seed.unwrap_or_else(|| rand::rng().random::<u64>());
StdRng::seed_from_u64(seed)
}
/// A column described in just enough detail to choose and run a
/// generator. Built by the executor from the real schema; kept
/// independent of `db.rs`'s `ReadColumn` so this library stays pure.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ColumnSpec {
/// The column's name — the primary signal for generator choice.
pub name: String,
/// The user-facing playground type — gates every name heuristic.
pub ty: Type,
/// `NOT NULL` — the executor uses this for the block guard (D1);
/// generation always produces a value, so it is informational here.
pub not_null: bool,
/// Part of the table's primary key.
pub primary_key: bool,
/// Carries a `UNIQUE` constraint (or is a single-column PK).
pub unique: bool,
/// A foreign-key column — generation is the executor's job
/// (sample an existing parent row, D14), so [`choose_generator`]
/// returns [`Generator::ForeignKeySample`].
pub is_foreign_key: bool,
/// Values parsed from a simple `col IN ('a', 'b', …)` CHECK
/// constraint (D17). When present, generation draws from them so
/// the common enum-as-CHECK pattern "just works".
pub check_in_values: Option<Vec<String>>,
}
impl ColumnSpec {
/// Convenience constructor for a plain, unconstrained column —
/// used heavily in tests.
#[cfg(test)]
#[must_use]
pub fn plain(name: &str, ty: Type) -> Self {
Self {
name: name.to_string(),
ty,
not_null: false,
primary_key: false,
unique: false,
is_foreign_key: false,
check_in_values: None,
}
}
}
/// The chosen generation strategy for a column.
///
/// Most variants are *stateless* — [`generate_value`] turns them into a
/// [`Value`] from the RNG alone. Two are *stateful markers* that the
/// executor must intercept (they need database context):
/// [`Self::IdentitySequential`] (the running `MAX+offset` sequence,
/// D10) and [`Self::ForeignKeySample`] (draw from existing parent
/// rows, D14). For safety [`generate_value`] treats an un-intercepted
/// marker as [`Self::Generic`] rather than panicking.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Generator {
// — Person —
FirstName,
LastName,
/// A full person name (table-context default for `name`/`title`).
FullName,
Email,
Username,
Password,
Phone,
// — Address —
City,
Country,
StateName,
Street,
ZipCode,
// — Organisation / commerce —
Company,
JobTitle,
/// Hand-rolled `{adjective} {material} {noun}` (D9) — `fake` has no
/// commerce module.
ProductName,
// — Free text —
Sentence,
Paragraph,
Url,
HexColor,
// — Numeric —
/// A money-shaped amount (whole for `int`, two-decimal otherwise).
CurrencyAmount,
/// A plausible human age (1880).
Age,
/// A small positive integer (quantities, counts).
SmallInt,
// — Temporal (bounded windows, D8) —
/// A date within the last few years.
DateRecent,
/// A date in an adult birth window (≈1880 years ago) — for `dob`.
DateAdult,
/// A datetime within the last few years.
DateTimeRecent,
// — Boolean —
Boolean,
// — Stateful markers (executor-resolved) —
/// Unique sequential identifier (D10): the executor supplies
/// `MAX(col)+offset`. Chosen for identifier-named non-FK columns.
IdentitySequential,
/// FK column (D14): the executor samples an existing parent key.
ForeignKeySample,
// — List / fallback —
/// Uniform pick from a fixed list — a simple `IN`-CHECK (D17), an
/// enum, or a future `set <col> in (…)` override.
PickFrom(Vec<String>),
/// Type-based fallback (D8) when no name heuristic matches.
Generic,
}
#[cfg(test)]
mod tests {
use super::*;
use pretty_assertions::assert_eq;
#[test]
fn same_seed_yields_identical_rng_streams() {
let mut a = make_rng(Some(42));
let mut b = make_rng(Some(42));
let xs: Vec<u64> = (0..8).map(|_| a.random::<u64>()).collect();
let ys: Vec<u64> = (0..8).map(|_| b.random::<u64>()).collect();
assert_eq!(xs, ys, "a fixed seed must reproduce the stream");
}
#[test]
fn different_seeds_yield_different_streams() {
let mut a = make_rng(Some(1));
let mut b = make_rng(Some(2));
let xs: Vec<u64> = (0..8).map(|_| a.random::<u64>()).collect();
let ys: Vec<u64> = (0..8).map(|_| b.random::<u64>()).collect();
assert_ne!(xs, ys);
}
#[test]
fn unseeded_rng_constructs_without_panicking() {
// Entropy-seeded path: just exercise it.
let mut rng = make_rng(None);
let _ = rng.random::<u64>();
}
}