Record Formats

All four EventDataset access methods accept a format parameter that controls the output structure.

| Format | Returns | Best for |
| --- | --- | --- |
| "raw" | Python dicts (one dict per event) | inspection, custom collation |
| "frame" | Polars DataFrames | filtering, feature analysis |
| "tensor" | Flat NumPy arrays + event lengths | custom PyTorch/JAX collation |
| "padded_tensor" | 2-D padded NumPy matrix + attention mask | direct DataLoader use |

raw (default)

record = dataset_loaded.sample_entity_record("train", seed=42, format="raw")
# record["entity_id"]  → str
# record["split"]      → "train" | "val" | "test"
# record["static"]     → {"entity_id": ..., "labour__birthday": ..., "token_ids": [...], ...}
# record["events"]     → list of dicts, one per event:
#   event["primary_timestamp"]  → "2015-01-01"
#   event["source_name"]        → "labour"
#   event["token_ids"]          → [105, 86, 98, 110, 3]
#   event["age_years"]          → 28
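
Because a raw record is plain Python data, it can be walked with ordinary loops. The sketch below uses a hand-written stand-in record that follows the field layout shown above; the values (including the "health" source) are illustrative, not real dataset output, and tokens_per_source is a hypothetical helper rather than part of EventDataset.

```python
from collections import Counter

# Hand-written stand-in for a "raw" record; values are illustrative only.
record = {
    "entity_id": "e-001",
    "split": "train",
    "static": {"entity_id": "e-001", "token_ids": [1, 2]},
    "events": [
        {"primary_timestamp": "2015-01-01", "source_name": "labour",
         "token_ids": [105, 86, 98, 110, 3], "age_years": 28},
        {"primary_timestamp": "2016-01-01", "source_name": "labour",
         "token_ids": [105, 90, 3], "age_years": 29},
        {"primary_timestamp": "2016-06-01", "source_name": "health",
         "token_ids": [200, 7], "age_years": 29},
    ],
}

def tokens_per_source(rec):
    """Hypothetical helper: total token count per source_name."""
    counts = Counter()
    for event in rec["events"]:
        counts[event["source_name"]] += len(event["token_ids"])
    return dict(counts)

totals = tokens_per_source(record)
print(totals)
```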

frame

Returns Polars DataFrames, which avoids the to_dicts() conversion overhead when the events are filtered or analysed downstream.

record = dataset_loaded.sample_entity_record("train", seed=7, format="frame")
# record["entity_id"]        → str
# record["static_token_ids"] → list[int]
# record["events"]           → polars.DataFrame with columns:
#   primary_timestamp, source_name, token_ids (list[i64]), age_years, ...

tensor

Returns flat NumPy arrays. token_ids concatenates all events into a single 1-D array; use event_lengths to split them back per event. temporal stacks time and any relative-date features into a [num_events, T] float array.

Pass include_cls=True to prepend a [CLS] token and include_sep=True to insert [SEP] between events.

record = dataset_loaded.sample_entity_record(
    "train", seed=7, format="tensor", include_cls=True, include_sep=True
)
# record["token_ids"]        → ndarray shape (total_tokens,)  — all events concatenated
# record["event_lengths"]    → ndarray shape (num_events,)    — tokens per event
# record["time"]             → ndarray shape (num_events,)    — days since reference_date
# record["temporal"]         → ndarray shape (num_events, T)  — time + rel-date features
# record["static_token_ids"] → list[int]

# Reconstruct per-event token lists
import numpy as np
per_event = np.split(record["token_ids"], np.cumsum(record["event_lengths"])[:-1])
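
A related trick is to broadcast per-event values such as time down to token level, which can be handy for custom collation. The sketch below uses stand-in arrays rather than real output, and it assumes event_lengths sums to the length of token_ids; how [CLS] and [SEP] tokens are counted is not specified here, so check that on a real record first.

```python
import numpy as np

# Stand-in arrays shaped like a "tensor" record; values are illustrative only.
token_ids = np.array([105, 86, 98, 110, 3, 105, 90, 3, 200, 7])
event_lengths = np.array([5, 3, 2])
time = np.array([0.0, 365.0, 517.0])

# Assumption: event_lengths accounts for every entry in token_ids.
assert event_lengths.sum() == token_ids.shape[0]

# Event index of each token, e.g. [0, 0, ..., 1, 1, ..., 2, 2].
event_index = np.repeat(np.arange(event_lengths.shape[0]), event_lengths)

# Per-event time looked up once per token.
token_time = time[event_index]
```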

padded_tensor

Like tensor, but produces a 2-D [num_events, max_event_len] matrix padded with pad_id, together with a matching attention mask. The result drops directly into a PyTorch DataLoader without further collation.

record = dataset_loaded.sample_entity_record(
    "train", seed=7, format="padded_tensor", pad_id=0
)
# record["token_ids"]        → ndarray shape (num_events, max_event_len)
# record["attention_mask"]   → bool ndarray shape (num_events, max_event_len)
# record["time"]             → ndarray shape (num_events,)
# record["static_token_ids"] → list[int]
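
To see how the padded layout relates to the flat tensor layout, the following sketch pads flat arrays by hand. It is not the library's implementation: to_padded is a hypothetical helper, and the input arrays are stand-ins rather than real output.

```python
import numpy as np

def to_padded(token_ids, event_lengths, pad_id=0):
    """Hypothetical helper: flat tokens + lengths -> (padded matrix, bool mask)."""
    num_events = event_lengths.shape[0]
    max_len = int(event_lengths.max()) if num_events else 0
    # mask[i, j] is True where column j holds a real token of event i.
    mask = np.arange(max_len)[None, :] < event_lengths[:, None]
    padded = np.full((num_events, max_len), pad_id, dtype=token_ids.dtype)
    # Boolean assignment fills row by row, matching the flat token order.
    padded[mask] = token_ids
    return padded, mask

flat = np.array([105, 86, 98, 110, 3, 105, 90, 3, 200, 7])
lengths = np.array([5, 3, 2])
padded, mask = to_padded(flat, lengths, pad_id=0)

# The per-event lengths can be recovered from the mask.
recovered_lengths = mask.sum(axis=1)
```

Going the other way, padded[mask] returns the flat token array again, so the mask carries the same information as event_lengths.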