BPE Lab

Watch your text become tokens.

A byte-pair encoding (BPE) tokenizer trained from scratch on a Wikipedia corpus. Type below to see exactly how it splits your text, then compare the result against GPT-2's tokenizer.
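For readers curious what "byte-pair encoding" means in practice, here is a minimal pure-Python sketch of the core idea: start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair into a new token id. It is an illustration only, not this lab's training code. The function names (`train_bpe`, `merge`, `encode`, `decode`) are hypothetical, and production tokenizers add pre-tokenization and other details that are omitted here.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}                        # (left_id, right_id) -> new_id, in learned order
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not shorten anything
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to text by expanding each merged id into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

For example, `merges = train_bpe("aaabdaaabac", 10)` followed by `encode("aaabdaaabac", merges)` shortens the 11-byte input to a handful of tokens, and `decode` recovers the original string.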

Your text

Type something above to see it tokenized.

Compare with GPT-2

Enter text above and run a comparison to see how this tokenizer stacks up against GPT-2.

Benchmark results

Tokens per word, byte compression, and throughput measured against standard language-modeling benchmarks.
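As a rough sketch of how such metrics could be computed for any tokenizer, the snippet below takes an `encode` callable and a text sample. The definitions used here are assumptions, not necessarily those in evaluation/evaluate.py: words are whitespace-separated, byte compression is reported as UTF-8 bytes per token, and throughput as tokens per second of wall-clock time. The function name `tokenizer_metrics` is hypothetical.

```python
import time
from typing import Callable, Dict, List


def tokenizer_metrics(text: str, encode: Callable[[str], List[int]]) -> Dict[str, float]:
    """Compute simple tokenizer statistics for one text sample.

    Assumed definitions (illustrative only):
      tokens_per_word   = tokens / whitespace-separated words
      bytes_per_token   = UTF-8 bytes / tokens (higher means better compression)
      tokens_per_second = tokens / wall-clock encoding time
    """
    start = time.perf_counter()
    ids = encode(text)
    elapsed = time.perf_counter() - start

    n_tokens = len(ids)
    n_words = len(text.split())
    n_bytes = len(text.encode("utf-8"))

    return {
        "tokens_per_word": n_tokens / n_words if n_words else 0.0,
        "bytes_per_token": n_bytes / n_tokens if n_tokens else 0.0,
        "tokens_per_second": n_tokens / elapsed if elapsed > 0 else float("inf"),
    }
```

Passing a trivial byte-level encoder such as `lambda s: list(s.encode("utf-8"))` gives a baseline of exactly one byte per token, which any trained BPE vocabulary should beat on in-domain text. Timing a single call is noisy, so a real benchmark would average over many samples.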

No evaluation report loaded yet. Run evaluation/evaluate.py to generate one.

Trained on a Wikipedia corpus (CC BY-SA 4.0). Tokenizer: byte-level BPE via Hugging Face tokenizers.