tab2seq

tab2seq turns multi-source tabular event data (registries, EHR, financial records) into tokenized sequences ready for Transformer-based models. It generalizes the data processing pipeline from the Life2Vec paper to arbitrary domains.

Alpha software

The core pipeline (Sources → Cohort → Vocabulary → EventDataset) is functional, but the API is not yet stable. Pin to a specific version if you depend on current behaviour.

Why tab2seq?

Building a Life2Vec-style pipeline from scratch means solving the same problems every time: multi-source schema alignment, leakage-safe vocabulary fitting, deterministic splits, and efficient Parquet-backed sequence iteration. tab2seq handles these steps so you can focus on modeling. With it, you can:

  • Work with multiple longitudinal data sources (registries, databases)
  • Define and filter cohorts based on inclusion criteria
  • Create deterministic train/val/test splits with static context
  • Fit a vocabulary on training data only (no leakage)
  • Produce tokenized, model-ready event sequences with time features
  • Generate realistic synthetic data for development and testing
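
To make "deterministic splits" and "no leakage" concrete, here is a minimal, library-independent sketch in plain Python. It does not use the tab2seq API. The function names `assign_split` and `fit_vocabulary`, the special tokens `[PAD]` and `[UNK]`, and the hash-based split strategy are all assumptions made for illustration, and tab2seq may implement these steps differently.

```python
import hashlib
from collections import Counter


def assign_split(entity_id: str, seed: int = 0,
                 val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map an entity ID to 'train', 'val', or 'test' using a stable hash.

    The same (seed, entity_id) pair always lands in the same split,
    independent of row order or of which other entities are present.
    """
    digest = hashlib.sha256(f"{seed}:{entity_id}".encode("utf-8")).digest()
    # Use 48 bits so the ratio is exactly representable and lies in [0, 1).
    u = int.from_bytes(digest[:6], "big") / 2**48
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def fit_vocabulary(events, splits, min_count: int = 1) -> dict[str, int]:
    """Build a token -> ID mapping from training entities only.

    `events` is an iterable of (entity_id, token) pairs and `splits`
    maps entity_id -> split name. Tokens that occur only in val/test
    never enter the vocabulary, which is what prevents leakage.
    """
    counts = Counter(
        token for entity_id, token in events if splits[entity_id] == "train"
    )
    vocab = {"[PAD]": 0, "[UNK]": 1}  # hypothetical special tokens
    for token in sorted(t for t, c in counts.items() if c >= min_count):
        vocab[token] = len(vocab)
    return vocab


# Hypothetical toy events: (entity_id, token)
events = [("p1", "DX_A"), ("p2", "DX_B"), ("p3", "DX_A"), ("p4", "JOB_C")]
splits = {entity_id: assign_split(entity_id) for entity_id, _ in events}
vocab = fit_vocabulary(events, splits)
```

Hashing the entity ID rather than shuffling rows keeps an entity's split assignment stable when new entities are added later, which is one common way to get reproducible splits.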

Requires: Python ≥ 3.11, NumPy ≥ 2.0, Polars ≥ 1.38, Pydantic v2.

Pipeline

Sources → Cohort → Vocabulary → Tokenizer → EventDataset → Model-ready Parquet

| Step | Class | What it does |
|------|-------|--------------|
| 1 | Source / SourceCollection | Schema declaration for each event table (categorical, continuous, temporal columns) |
| 2 | Cohort | Entity universe + inclusion criteria + deterministic train/val/test splits |
| 3 | Vocabulary / Tokenizer | Token mappings and bin edges fitted on train split only |
| 4 | EventDataset | Vectorized token-ID encoding, relative-date features, Parquet persistence |
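
Steps 3 and 4 can be illustrated with a small standard-library sketch. This is not the tab2seq implementation: quantile binning, the `COLUMN_Qn` token naming, the `[UNK]` fallback, and every function name below are assumptions chosen to show the idea of fitting bin edges on training data and then encoding events as token IDs with a relative-date feature.

```python
from bisect import bisect_right
from datetime import date
from statistics import quantiles


def fit_bin_edges(train_values, n_bins: int = 4) -> list[float]:
    """Compute n_bins - 1 quantile cut points from training values only."""
    return quantiles(train_values, n=n_bins)


def bin_token(column: str, value: float, edges: list[float]) -> str:
    """Turn a continuous value into a categorical token, e.g. 'INCOME_Q2'."""
    return f"{column}_Q{bisect_right(edges, value) + 1}"


def days_since(reference: date, event_date: date) -> int:
    """Relative-date feature: whole days between a reference date and an event."""
    return (event_date - reference).days


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map tokens to IDs, sending unseen tokens to the '[UNK]' ID."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]


# Hypothetical training incomes used to fit the bin edges.
train_incomes = [10, 20, 30, 40, 50, 60, 70]
edges = fit_bin_edges(train_incomes)

# A validation or test value is binned with the edges fitted on train.
token = bin_token("INCOME", 35, edges)
ids = encode([token, "NEVER_SEEN"], {"[PAD]": 0, "[UNK]": 1, token: 2})
age_in_days = days_since(date(2020, 1, 1), date(2020, 3, 1))
```

Because the edges come from the training split alone, values in the validation and test splits are only ever assigned to existing bins and cannot shift the cut points.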

Quick install

pip install tab2seq

See Installation for full setup options and Quick Start to run the full pipeline end-to-end.

Roadmap

  • [x] Synthetic datasets
  • [x] Source / SourceCollection
  • [x] Cohort + splits
  • [x] Vocabulary (leakage-safe)
  • [x] Tokenizer / EventDataset
  • [x] Parquet persistence + caching
  • [ ] Full Life2Vec / Life2Vec-Light preprocessing parity
  • [ ] Subsetting Cohorts for fine-tuning
  • [ ] Example with Tokenization and Transformer training
  • [ ] Documentation site

Citation

If you use tab2seq, please cite:

@software{tab2seq2026,
  author = {Savcisens, Germans},
  title = {tab2seq: Scalable Tabular to Sequential Data Processing},
  year = {2026},
  url = {https://github.com/carlomarxdk/tab2seq}
}

And the original Life2Vec paper that inspired this work:

@article{savcisens2024using,
  title={Using sequences of life-events to predict human lives},
  author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust Hvas and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
  journal={Nature Computational Science},
  volume={4},
  number={1},
  pages={43--56},
  year={2024},
  publisher={Nature Publishing Group US New York}
}

Acknowledgments

Inspired by the data processing pipeline from Life2Vec and Life2Vec-Light. Built with Polars and Pydantic.