tab2seq.tokenization.config
Project-level configuration models for vocabulary and tokenization.
VocabularyConfig
Bases: BaseModel
Vocabulary-building configuration.
Attributes:
| Name | Type | Description |
|---|---|---|
| `max_vocab_size` | `int` | Hard cap on total tokens (includes special tokens). |
| `min_token_count` | `int` | Minimum train-split occurrences for a token to be retained. |
| `count_mode` | `Literal['overall', 'entity_unique']` | Token counting mode used for |
| `pad_token` | `str` | Padding token string. |
| `unk_token` | `str` | Unknown-value token string. |
| `cls_token` | `str` | Sequence-start token string. |
| `sep_token` | `str` | Sequence-end token string. |
| `mask_token` | `str` | Mask token for MLM pre-training. |
| `extra_tokens` | `list[str]` | Additional reserved tokens appended after the standard five. Use for domain-specific sentinels that must always be in the vocabulary regardless of training-data content (e.g. |
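The interaction between `count_mode`, `min_token_count`, and `max_vocab_size` can be sketched as follows. This is a minimal illustration, not the library's implementation: `select_tokens` is a hypothetical helper, and the reading of `'entity_unique'` as "count each token at most once per entity" is an assumption based on the mode's name.

```python
from collections import Counter


def select_tokens(rows, special_tokens, max_vocab_size, min_token_count,
                  count_mode="overall"):
    """Hypothetical helper (not part of tab2seq).

    rows: iterable of (entity_id, token) pairs from the train split.
    Returns the special tokens followed by the retained tokens.
    """
    rows = list(rows)
    if count_mode == "overall":
        # Every occurrence of a token counts.
        counts = Counter(token for _, token in rows)
    elif count_mode == "entity_unique":
        # Assumed meaning: a token counts at most once per entity.
        counts = Counter(token for _, token in set(rows))
    else:
        raise ValueError(f"unknown count_mode: {count_mode!r}")

    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, n in ranked
            if n >= min_token_count and tok not in special_tokens]

    # The cap includes special tokens, so they consume part of the budget.
    budget = max(max_vocab_size - len(special_tokens), 0)
    return list(special_tokens) + kept[:budget]


rows = [("a", "x"), ("a", "x"), ("a", "x"), ("b", "x"),
        ("a", "y"), ("a", "y"), ("b", "z")]
specials = ["[PAD]", "[UNK]"]  # placeholder strings, assumed for the example
overall = select_tokens(rows, specials, 10, 2, "overall")
per_entity = select_tokens(rows, specials, 10, 2, "entity_unique")
capped = select_tokens(rows, specials, 3, 2, "overall")
```

Under these assumptions, `y` survives in `'overall'` mode (two occurrences) but not in `'entity_unique'` mode (only one entity uses it), and a cap of 3 leaves room for a single non-special token.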
Source code in tab2seq/tokenization/config.py
special_tokens
property
Ordered list of all reserved tokens: standard five followed by extra_tokens.
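As a rough illustration of the ordering described above, the sketch below uses a standard-library dataclass as a stand-in. The real class is a pydantic `BaseModel`; the class name `VocabularyConfigSketch`, the default token strings, and the order within the standard five (taken from the attribute table) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class VocabularyConfigSketch:
    """Stand-in for illustration only; not the tab2seq class."""

    # Default strings below are assumed, not taken from the library.
    pad_token: str = "[PAD]"
    unk_token: str = "[UNK]"
    cls_token: str = "[CLS]"
    sep_token: str = "[SEP]"
    mask_token: str = "[MASK]"
    extra_tokens: list[str] = field(default_factory=list)

    @property
    def special_tokens(self) -> list[str]:
        # Standard five first (order assumed), then extra_tokens in order.
        return [
            self.pad_token,
            self.unk_token,
            self.cls_token,
            self.sep_token,
            self.mask_token,
            *self.extra_tokens,
        ]


cfg = VocabularyConfigSketch(extra_tokens=["[MISSING]"])
reserved = cfg.special_tokens
```

Keeping the reserved tokens in a fixed order means their positions in the list stay stable when `extra_tokens` grows, since new entries are only appended at the end.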
TokenizerConfig
Bases: BaseModel
Tokenizer configuration.
Note
Vocabulary size and special-token strings are controlled by
VocabularyConfig; this config governs only column filtering.
Attributes:
| Name | Type | Description |
|---|---|---|
| `id_columns` | `list[str]` | Columns treated as entity identifiers; always excluded from tokenization regardless of vocabulary content. |
| `exclude_columns` | `list[str]` | Additional columns to skip during encoding. |
| `vocabulary` | `VocabularyConfig` | Embedded vocabulary configuration (used when building a :class: |
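The column filtering that `id_columns` and `exclude_columns` describe might look like the sketch below. `columns_to_tokenize` is a hypothetical helper written for this example, not a tab2seq function, and the column names are invented.

```python
def columns_to_tokenize(all_columns, id_columns, exclude_columns):
    """Hypothetical helper: return the columns left after filtering.

    Both id_columns and exclude_columns are skipped; input order is kept.
    """
    skipped = set(id_columns) | set(exclude_columns)
    return [col for col in all_columns if col not in skipped]


remaining = columns_to_tokenize(
    all_columns=["person_id", "age", "notes", "income"],
    id_columns=["person_id"],
    exclude_columns=["notes"],
)
```

Here `person_id` is dropped as an identifier and `notes` as an explicit exclusion, leaving `age` and `income` for encoding.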