# BPE Tokenizer - 10k Vocabulary (TinyStories)
A Byte-Pair Encoding (BPE) tokenizer trained on the TinyStories dataset with a vocabulary size of 10,000 tokens.
## Model Details
- Tokenizer Type: BPE (Byte Pair Encoding), GPT-2 style
- Vocabulary Size: 10,000 tokens
- Training Dataset: roneneldan/TinyStories
- Pre-tokenizer: ByteLevel (handles spaces and bytes like GPT-2)
- Decoder: ByteLevel
- Special Tokens:
  - `<|endoftext|>`: BOS/EOS/UNK token
  - `<|padding|>`: Padding token
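Once the tokenizer is loaded with `transformers` (see Usage below), the special-token setup can be checked directly. The short sketch below assumes the tokens were registered on the exported tokenizer and uses standard `PreTrainedTokenizerFast` attributes:

```python
from transformers import PreTrainedTokenizerFast

# Load from the Hub and inspect the registered special tokens.
tokenizer = PreTrainedTokenizerFast.from_pretrained("vuiseng9/bpe-10.0k-tinystories")
print(tokenizer.vocab_size)                                           # 10000
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)  # all <|endoftext|>
print(tokenizer.pad_token)                                            # <|padding|>
```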
## Training Configuration
- Minimum Frequency: 2
- Batch Size: 1000
- Training Split: train
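Putting the model details and training settings above together, a minimal training sketch using the Hugging Face `tokenizers` and `datasets` libraries might look like the following. This is an illustrative reconstruction, not the actual `train_bpe.py`; the exact arguments and batching logic of the original script may differ.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# TinyStories training split (the "text" column holds the stories).
dataset = load_dataset("roneneldan/TinyStories", split="train")

# GPT-2-style byte-level BPE: ByteLevel pre-tokenizer and decoder.
tokenizer = Tokenizer(models.BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=10_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>", "<|padding|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

def batch_iterator(batch_size=1000):
    # Feed the trainer batches of 1000 stories at a time.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
tokenizer.save("tokenizer.json")
```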
## Example Output

```
Training tokenizer on all samples...
[Info] Tokenizer 10000 vocabs saved to './bpe-10.0k-tinystories'

--- Test ---
Input: Once upon a time, there was a tiny dragon.
Tokens: [7013, 2402, 247, 673, 13, 627, 369, 247, 5888, 10295, 15]
Decoded: Once upon a time, there was a tiny dragon.
```
## Usage

### Loading the Tokenizer
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vuiseng9/bpe-10.0k-tinystories")

# Encode text
text = "Once upon a time, there was a tiny dragon."
tokens = tokenizer.encode(text)
print(tokens)

# Decode tokens
decoded = tokenizer.decode(tokens)
print(decoded)
```
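The tokenizer also supports batch encoding. A short follow-up sketch, assuming `<|padding|>` is registered as the pad token on the exported tokenizer:

```python
# Inspect the subword pieces for the example sentence.
print(tokenizer.tokenize(text))

# Batch-encode two sentences; the shorter one is padded with <|padding|>.
batch = tokenizer(["Once upon a time", "The tiny dragon flew away."], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```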
## Reproduce the Tokenizer

```bash
# Full training on the entire dataset
python train_bpe.py --real

# See more options
python train_bpe.py -h
```