
Synthetic Data

tab2seq ships with a synthetic data generator that produces four registry-style tables with realistic temporal patterns, missing data, and cross-field correlations. It is designed for testing and prototyping the pipeline without real data.

Available registries

| Registry | Key columns |
|----------|-------------|
| health   | diagnosis, procedure, department, cost, length_of_stay |
| income   | income_type, sector, income_amount |
| labour   | status, occupation, weekly_hours, residence_region, birthday |
| survey   | education_level, marital_status, self_rated_health, satisfaction_score |
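
The key columns listed above can serve as a quick sanity check on a loaded table. The following is a minimal sketch, not part of tab2seq: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical helpers, and the example header (including `entity_id` and `date`) is invented for illustration.

```python
# Key columns per registry, copied from the table above.
EXPECTED_COLUMNS = {
    "health": ["diagnosis", "procedure", "department", "cost", "length_of_stay"],
    "income": ["income_type", "sector", "income_amount"],
    "labour": ["status", "occupation", "weekly_hours", "residence_region", "birthday"],
    "survey": ["education_level", "marital_status", "self_rated_health", "satisfaction_score"],
}


def missing_columns(registry, columns):
    """Return the expected key columns of `registry` that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS[registry] if name not in present]


# Hypothetical header of a loaded income table; "entity_id" and "date" are
# made-up names, not confirmed tab2seq output columns.
header = ["entity_id", "date", "income_type", "income_amount"]
print(missing_columns("income", header))  # → ['sector']
```

An empty list means every key column for that registry is present; extra columns in the file are ignored.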

Generating data

from tab2seq.datasets import generate_synthetic_data

data_paths = generate_synthetic_data(
    output_dir="synthetic_data",
    n_entities=10_000,
    seed=742,
    registries=["health", "labour", "survey", "income"],
)
# data_paths → {"health": Path(...), "labour": Path(...), ...}
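
Because the generator takes a `seed`, one plausible use is checking that two runs produce the same files. The sketch below assumes that each value in `data_paths` points to a single file and that a fixed seed yields byte-identical output; neither is confirmed here, and some file formats embed metadata that would break a byte-level comparison. `fingerprint` is a hypothetical helper, and the demo uses stand-in temporary files rather than real generator output.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(data_paths):
    """Map each registry name to the SHA-256 hex digest of its file."""
    return {
        name: hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for name, path in data_paths.items()
    }


# Stand-in files with identical content, playing the role of two runs
# made with the same seed (the contents are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    run_a = Path(tmp) / "run_a.csv"
    run_b = Path(tmp) / "run_b.csv"
    run_a.write_text("entity_id,cost\n1,100\n")
    run_b.write_text("entity_id,cost\n1,100\n")
    same = fingerprint({"health": run_a}) == fingerprint({"health": run_b})

print(same)  # → True
```

In practice, the two dictionaries would come from two calls to `generate_synthetic_data` with the same `seed` and different `output_dir` values.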

Generating a SourceCollection directly

from tab2seq.datasets import generate_synthetic_collections

collection = generate_synthetic_collections(
    output_dir="synthetic_data",
    n_entities=10_000,
    seed=742,
)

This returns a ready-to-use SourceCollection wired to the generated files.