tab2seq
tab2seq turns multi-source tabular event data (registries, EHR, financial records) into tokenized sequences ready for Transformer-based models. It generalizes the data processing pipeline from the Life2Vec paper to arbitrary domains.
Alpha software
The core pipeline (Sources → Cohort → Vocabulary → EventDataset) is functional, but the API is not yet stable. Pin to a specific version if you depend on current behaviour.
Why tab2seq?
Building a Life2Vec-style pipeline from scratch means solving the same problems every time: multi-source schema alignment, leakage-safe vocabulary fitting, deterministic splits, and efficient Parquet-backed sequence iteration. tab2seq handles these so you can focus on modeling. With it you can:
- Work with multiple longitudinal data sources (registries, databases)
- Define and filter cohorts based on inclusion criteria
- Create deterministic train/val/test splits with static context
- Fit a vocabulary on training data only (no leakage)
- Produce tokenized, model-ready event sequences with time features
- Generate realistic synthetic data for development and testing
Requires: Python ≥ 3.11, NumPy ≥ 2.0, Polars ≥ 1.38, Pydantic v2.
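To make "deterministic splits" and "no leakage" concrete, here is a minimal sketch in plain Python. It does not use the tab2seq API: `assign_split` and `fit_vocabulary` are hypothetical helpers, and the hash-based assignment and the `[PAD]`/`[UNK]` special tokens are assumptions chosen for illustration, not a description of how tab2seq implements these steps.

```python
import hashlib
from collections import Counter


def assign_split(entity_id: str, seed: int = 0,
                 val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map an entity ID to 'train', 'val', or 'test' via a stable hash.

    The result depends only on (seed, entity_id), so it is reproducible
    across runs and machines and unaffected by row order.
    """
    digest = hashlib.sha256(f"{seed}:{entity_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes into a pseudo-uniform number between 0 and 1.
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def fit_vocabulary(events, splits, min_count: int = 1) -> dict[str, int]:
    """Build a token -> ID mapping from training-split events only.

    `events` is an iterable of (entity_id, token) pairs and `splits`
    maps entity_id -> split name. Tokens seen only in val/test never
    enter the vocabulary, which is what keeps the fit leakage-free.
    """
    counts = Counter(
        token for entity_id, token in events if splits[entity_id] == "train"
    )
    vocab = {"[PAD]": 0, "[UNK]": 1}  # assumed special tokens
    for token in sorted(t for t, c in counts.items() if c >= min_count):
        vocab[token] = len(vocab)
    return vocab
```

At encoding time, tokens that appear only outside the training split fall back to the unknown ID, for example `vocab.get(token, vocab["[UNK]"])`.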
Pipeline
| Step | Class | What it does |
|---|---|---|
| 1 | Source / SourceCollection | Schema declaration for each event table (categorical, continuous, temporal columns) |
| 2 | Cohort | Entity universe + inclusion criteria + deterministic train/val/test splits |
| 3 | Vocabulary / Tokenizer | Token mappings and bin edges fitted on train split only |
| 4 | EventDataset | Vectorized token-ID encoding, relative-date features, Parquet persistence |
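As a rough illustration of what steps 3 and 4 involve for continuous and temporal columns, the sketch below fits quantile bin edges on training values, turns a value into a bin token, and computes a relative-date feature. It uses only the standard library, not tab2seq itself. The helper names, the quantile-based binning, the `COLUMN_Qn` token format, and "days since a reference date" as the time feature are all assumptions made for the example.

```python
from bisect import bisect_right
from datetime import date
from statistics import quantiles


def fit_bin_edges(train_values: list[float], n_bins: int = 4) -> list[float]:
    """Compute quantile cut points from training-split values only.

    statistics.quantiles returns n_bins - 1 cut points.
    """
    return quantiles(train_values, n=n_bins)


def to_bin_token(column: str, value: float, edges: list[float]) -> str:
    """Turn a continuous value into a categorical token such as 'INCOME_Q2'.

    Values below the first edge land in bin 1; values at or above the
    last edge land in the final bin, so unseen extremes are still encodable.
    """
    return f"{column.upper()}_Q{bisect_right(edges, value) + 1}"


def days_since(event_date: date, reference: date) -> int:
    """Relative-date feature: whole days between an event and a reference date."""
    return (event_date - reference).days


# Edges come from the train split and are then reused unchanged for val/test.
edges = fit_bin_edges([10, 20, 30, 40, 50, 60, 70, 80], n_bins=4)
token = to_bin_token("income", 1000, edges)
offset = days_since(date(2020, 1, 31), date(2020, 1, 1))
```

Once continuous values are tokens, they share the same token-to-ID vocabulary as categorical columns, so every event column can be encoded through the same integer lookup.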
Quick install
See Installation for full setup options and Quick Start to run the full pipeline end-to-end.
Roadmap
- [x] Synthetic datasets
- [x] Source / SourceCollection
- [x] Cohort + splits
- [x] Vocabulary (leakage-safe)
- [x] Tokenizer / EventDataset
- [x] Parquet persistence + caching
- [ ] Full Life2Vec / Life2Vec-Light preprocessing parity
- [ ] Subsetting Cohorts for fine-tuning
- [ ] Example with Tokenization and Transformer training
- [ ] Documentation site
Citation
If you use tab2seq, please cite:
@software{tab2seq2026,
author = {Savcisens, Germans},
title = {tab2seq: Scalable Tabular to Sequential Data Processing},
year = {2026},
url = {https://github.com/carlomarxdk/tab2seq}
}
And the original Life2Vec paper that inspired this work:
@article{savcisens2024using,
title={Using sequences of life-events to predict human lives},
author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust Hvas and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
journal={Nature Computational Science},
volume={4},
number={1},
pages={43--56},
year={2024},
publisher={Nature Publishing Group US New York}
}
Acknowledgments
Inspired by the data processing pipeline from Life2Vec and Life2Vec-Light. Built with Polars and Pydantic.