BPE Tokenizer - 10k Vocabulary (TinyStories)

A Byte-Pair Encoding (BPE) tokenizer trained on the TinyStories dataset with a vocabulary size of 10,000 tokens.

Model Details

  • Tokenizer Type: Byte-Pair Encoding (BPE), GPT-2 style
  • Vocabulary Size: 10,000 tokens
  • Training Dataset: roneneldan/TinyStories
  • Pre-tokenizer: ByteLevel (handles spaces and bytes like GPT-2)
  • Decoder: ByteLevel
  • Special Tokens:
    • <|endoftext|>: BOS/EOS/UNK token
    • <|padding|>: Padding token

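These components map onto the Hugging Face tokenizers library. The snippet below is a minimal sketch of how the pipeline described above could be assembled; it is not taken from the original training script, and the add_prefix_space setting is an assumption.

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# BPE model with GPT-2-style byte-level pre-tokenization and decoding
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # assumption: GPT-2 default
tok.decoder = decoders.ByteLevel()
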
Training Configuration

  • Minimum Frequency: 2
  • Batch Size: 1000
  • Training Split: train

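Continuing the sketch above, these settings correspond to a BpeTrainer plus a batched iterator over the TinyStories train split. This is an assumed reconstruction rather than the actual train_bpe.py script; the batch_iterator helper and the output path are illustrative.

from datasets import load_dataset
from tokenizers import trainers, pre_tokenizers

trainer = trainers.BpeTrainer(
    vocab_size=10_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>", "<|padding|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

dataset = load_dataset("roneneldan/TinyStories", split="train")

def batch_iterator(batch_size=1000):
    # Yield raw story text in batches of 1000 samples
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tok.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
tok.save("bpe-10.0k-tinystories/tokenizer.json")
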
Example Output

Training tokenizer on all samples...
[Info] Tokenizer 10000 vocabs saved to './bpe-10.0k-tinystories'

--- Test ---
Input:   Once upon a time, there was a tiny dragon.
Tokens:  [7013, 2402, 247, 673, 13, 627, 369, 247, 5888, 10295, 15]
Decoded: Once upon a time, there was a tiny dragon.

Usage

Loading the Tokenizer

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vuiseng9/bpe-10.0k-tinystories")

# Encode text
text = "Once upon a time, there was a tiny dragon."
tokens = tokenizer.encode(text)
print(tokens)

# Decode tokens
decoded = tokenizer.decode(tokens)
print(decoded)
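
Since <|padding|> is listed as the padding token, batched encoding with padding should also work. A small usage sketch (the example sentences are illustrative):

# Batch-encode with padding; assumes <|padding|> is registered as the pad token
batch = tokenizer(
    ["Once upon a time", "The tiny dragon flew over the hill."],
    padding=True,
)
print(batch["input_ids"])
print(batch["attention_mask"])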

Reproduce the Tokenizer

# Full training on the entire dataset
python train_bpe.py --real 

# see more options
python train_bpe.py -h