BPE Tokenizer - 10k Vocabulary (TinyStories)

A Byte-Pair Encoding (BPE) tokenizer trained on the TinyStories dataset with a vocabulary size of 10,000 tokens.

Model Details

  • Tokenizer Type: Byte-Pair Encoding (BPE), GPT-2 style
  • Vocabulary Size: 10,000 tokens
  • Training Dataset: roneneldan/TinyStories
  • Pre-tokenizer: ByteLevel (handles spaces and bytes like GPT-2)
  • Decoder: ByteLevel
  • Special Tokens:
    • <|endoftext|>: BOS/EOS/UNK token
    • <|padding|>: Padding token

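These components map onto the Hugging Face tokenizers library. The snippet below is a minimal sketch of how the pipeline described above could be assembled; it is not taken from the original training script, and the add_prefix_space setting is an assumption.

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# BPE model with GPT-2-style byte-level pre-tokenization and decoding
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # assumption: GPT-2 default
tok.decoder = decoders.ByteLevel()
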
Training Configuration

  • Minimum Frequency: 2
  • Batch Size: 1000
  • Training Split: train

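Continuing the sketch above, these settings correspond to a BpeTrainer plus a batched iterator over the TinyStories train split. This is an assumed reconstruction rather than the actual train_bpe.py script; the batch_iterator helper and the output path are illustrative.

from datasets import load_dataset
from tokenizers import trainers, pre_tokenizers

trainer = trainers.BpeTrainer(
    vocab_size=10_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>", "<|padding|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

dataset = load_dataset("roneneldan/TinyStories", split="train")

def batch_iterator(batch_size=1000):
    # Yield raw story text in batches of 1000 samples
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tok.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
tok.save("bpe-10.0k-tinystories/tokenizer.json")
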
Example Output

Training tokenizer on all samples...
[Info] Tokenizer 10000 vocabs saved to './bpe-10.0k-tinystories'

--- Test ---
Input:   Once upon a time, there was a tiny dragon.
Tokens:  [7013, 2402, 247, 673, 13, 627, 369, 247, 5888, 10295, 15]
Decoded: Once upon a time, there was a tiny dragon.

Usage

Loading the Tokenizer

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("vuiseng9/bpe-10.0k-tinystories")

# Encode text
text = "Once upon a time, there was a tiny dragon."
tokens = tokenizer.encode(text)
print(tokens)

# Decode tokens
decoded = tokenizer.decode(tokens)
print(decoded)
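
Since <|padding|> is listed as the padding token, batched encoding with padding should also work. A small usage sketch (the example sentences are illustrative):

# Batch-encode with padding; assumes <|padding|> is registered as the pad token
batch = tokenizer(
    ["Once upon a time", "The tiny dragon flew over the hill."],
    padding=True,
)
print(batch["input_ids"])
print(batch["attention_mask"])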

Reproduce the Tokenizer

# Full training on the entire dataset
python train_bpe.py --real 

# see more options
python train_bpe.py -h