carlomarxdk/tab2seq

Record Formats

All four `EventDataset` access methods accept a `format` parameter that controls the output structure.

| Format | Returns | Best for |
|---|---|---|
| `"raw"` | Python dicts (one dict per event) | inspection, custom collation |
| `"frame"` | Polars DataFrames | filtering, feature analysis |
| `"tensor"` | Flat NumPy arrays + event lengths | custom PyTorch/JAX collation |
| `"padded_tensor"` | 2-D padded NumPy matrix + attention mask | direct DataLoader use |

`raw` (default)

record = dataset_loaded.sample_entity_record("train", seed=42, format="raw")
# record["entity_id"]  → str
# record["split"]      → "train" | "val" | "test"
# record["static"]     → {"entity_id": ..., "labour__birthday": ..., "token_ids": [...], ...}
# record["events"]     → list of dicts, one per event:
#   event["primary_timestamp"]  → "2015-01-01"
#   event["source_name"]        → "labour"
#   event["token_ids"]          → [105, 86, 98, 110, 3]
#   event["age_years"]          → 28
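
Because a raw record is plain Python data, it can be inspected with the standard library alone. The sketch below does not call the library: it uses a hand-built record that mimics the shape documented above (all values are made up for illustration) and tallies events and tokens per source.

```python
from collections import Counter

# Hypothetical record mimicking the documented "raw" shape; values are illustrative only.
record = {
    "entity_id": "e-001",
    "split": "train",
    "static": {"entity_id": "e-001", "token_ids": [1, 7]},
    "events": [
        {"primary_timestamp": "2015-01-01", "source_name": "labour",
         "token_ids": [105, 86, 98, 110, 3], "age_years": 28},
        {"primary_timestamp": "2016-03-15", "source_name": "health",
         "token_ids": [42, 17, 3], "age_years": 29},
        {"primary_timestamp": "2017-06-01", "source_name": "labour",
         "token_ids": [105, 90, 3], "age_years": 30},
    ],
}

# Number of events contributed by each source.
events_per_source = Counter(e["source_name"] for e in record["events"])

# Total number of tokens contributed by each source.
tokens_per_source = Counter()
for e in record["events"]:
    tokens_per_source[e["source_name"]] += len(e["token_ids"])
```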

`frame`

Returns the events as a Polars DataFrame, which avoids the `to_dicts()` conversion overhead when the events are filtered or analysed downstream.

record = dataset_loaded.sample_entity_record("train", seed=7, format="frame")
# record["entity_id"]        → str
# record["static_token_ids"] → list[int]
# record["events"]           → polars.DataFrame with columns:
#   primary_timestamp, source_name, token_ids (list[i64]), age_years, ...

`tensor`

Returns flat NumPy arrays. `token_ids` concatenates the tokens of all events into a single 1-D array; use `event_lengths` to split it back into per-event arrays. `temporal` stacks `time` and any relative-date features into a `[num_events, T]` float array.

Pass `include_cls=True` to prepend a `[CLS]` token and `include_sep=True` to insert `[SEP]` between events.

record = dataset_loaded.sample_entity_record(
    "train", seed=7, format="tensor", include_cls=True, include_sep=True
)
# record["token_ids"]        → ndarray shape (total_tokens,)  — all events concatenated
# record["event_lengths"]    → ndarray shape (num_events,)    — tokens per event
# record["time"]             → ndarray shape (num_events,)    — days since reference_date
# record["temporal"]         → ndarray shape (num_events, T)  — time + rel-date features
# record["static_token_ids"] → list[int]

# Reconstruct per-event token lists
import numpy as np
per_event = np.split(record["token_ids"], np.cumsum(record["event_lengths"])[:-1])
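
To make the flat layout concrete, the following sketch builds `token_ids` and `event_lengths` from made-up per-event lists, splits them back, and derives a per-token event index with `np.repeat`, which is handy for broadcasting per-event values such as `time` onto tokens. It uses only NumPy and leaves out the `[CLS]` and `[SEP]` options; the array values are illustrative assumptions.

```python
import numpy as np

# Made-up per-event token lists standing in for a real entity.
events = [[105, 86, 3], [42, 3], [7, 8, 9, 3]]

# Flat layout: all tokens concatenated, plus the number of tokens per event.
token_ids = np.concatenate([np.asarray(e, dtype=np.int64) for e in events])
event_lengths = np.array([len(e) for e in events], dtype=np.int64)
time = np.array([0.0, 439.0, 882.0])  # illustrative days since a reference date

# Split the flat array back into one array per event.
per_event = np.split(token_ids, np.cumsum(event_lengths)[:-1])

# Index of the owning event for every token, e.g. [0, 0, 0, 1, 1, 2, ...].
event_index = np.repeat(np.arange(len(event_lengths)), event_lengths)

# Broadcast the per-event time onto each token.
token_time = time[event_index]
```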

`padded_tensor`

Like `tensor`, but produces a 2-D `[num_events, max_event_len]` matrix padded with `pad_id`, together with a matching attention mask. The result drops directly into a PyTorch `DataLoader` without further collation.

record = dataset_loaded.sample_entity_record(
    "train", seed=7, format="padded_tensor", pad_id=0
)
# record["token_ids"]        → ndarray shape (num_events, max_event_len)
# record["attention_mask"]   → bool ndarray shape (num_events, max_event_len)
# record["time"]             → ndarray shape (num_events,)
# record["static_token_ids"] → list[int]
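
The relationship between the flat and padded layouts can be sketched in a few lines of NumPy. The helper `pad_events` below is hypothetical and written only for illustration; it is not the library's implementation, but it shows one way a padded matrix and boolean mask of the documented shapes could be derived from `token_ids` and `event_lengths`.

```python
import numpy as np

def pad_events(token_ids, event_lengths, pad_id=0):
    """Hypothetical helper: flat tokens + lengths -> (padded matrix, bool mask)."""
    num_events = len(event_lengths)
    max_len = int(event_lengths.max()) if num_events else 0
    # mask[i, j] is True where event i has a real token at position j.
    mask = np.arange(max_len)[None, :] < event_lengths[:, None]
    padded = np.full((num_events, max_len), pad_id, dtype=token_ids.dtype)
    # Boolean assignment fills True positions in row-major order,
    # which matches the order in which events were concatenated.
    padded[mask] = token_ids
    return padded, mask

# Made-up flat arrays: three events of 3, 2 and 4 tokens.
token_ids = np.array([105, 86, 3, 42, 3, 7, 8, 9, 3], dtype=np.int64)
event_lengths = np.array([3, 2, 4], dtype=np.int64)

padded, attention_mask = pad_events(token_ids, event_lengths, pad_id=0)
```

When locating real tokens, the mask is a safer guide than comparing against `pad_id`, since the padding value could also be a valid token ID.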