---
title: Tokenizer Playground
emoji: 🔤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
- Qwen/Qwen3-0.6B
- Qwen/Qwen2.5-7B
- meta-llama/Llama-3.1-8B
- openai-community/gpt2
- mistralai/Mistral-7B-v0.1
- google/gemma-7b
tags:
- tokenizer
- nlp
- text-processing
- research-tool
short_description: Interactive tokenizer tool for NLP researchers
---
# 🔤 Tokenizer Playground
An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare different tokenization strategies.
## Features
### 🔤 Tokenize Tab
- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs
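The workflow this tab exposes maps to a few lines of Transformers code. A minimal sketch (the model ID and sample text are arbitrary examples; `gpt2` is used because it is small):

```python
from transformers import AutoTokenizer

# Any Hub model ID works here; gpt2 is small and quick to download.
tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

text = "Tokenizers split text into subwords."
enc = tok(text, add_special_tokens=True)  # toggle special tokens, as in the UI

tokens = tok.convert_ids_to_tokens(enc["input_ids"])
print(tokens)                               # subword pieces
print(enc["input_ids"])                     # token IDs
print(len(enc["input_ids"]) / len(text))    # tokens per character
print(tok.vocab_size)                       # vocabulary size
```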
### 🔄 Detokenize Tab
- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
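The same steps can be sketched in code. The `parse_ids` helper below is a hypothetical illustration of flexible input parsing, not the app's actual parser:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

def parse_ids(raw: str) -> list[int]:
    """Accept '[1, 2]', '1, 2', or '1 2' style ID input."""
    return [int(t) for t in raw.strip("[] ").replace(",", " ").split()]

ids = tok.encode("Hello, world", add_special_tokens=False)
raw = ", ".join(map(str, ids))  # simulate user-pasted comma-separated IDs

text = tok.decode(parse_ids(raw), skip_special_tokens=True)
assert text == "Hello, world"   # round-trip verification
```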
### 📊 Compare Tab
- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count
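A comparison like this tab performs can be sketched as follows (the two model IDs are small examples; any Hub IDs work):

```python
from transformers import AutoTokenizer

text = "Comparing tokenizers helps estimate inference cost."
model_ids = ["openai-community/gpt2", "bert-base-uncased"]

results = []
for mid in model_ids:
    tok = AutoTokenizer.from_pretrained(mid)
    n = len(tok.encode(text, add_special_tokens=False))
    results.append((mid, n))

# Sort by token count: fewer tokens means a more efficient
# encoding of this particular text.
for mid, n in sorted(results, key=lambda r: r[1]):
    print(f"{mid}: {n} tokens")
```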
### 📖 Vocabulary Tab
- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse first 100 tokens in the vocabulary
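The vocabulary details shown on this tab are all available through the tokenizer object itself, for example:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

print(type(tok).__name__)       # tokenizer class (fast vs. slow)
print(tok.vocab_size)           # base vocabulary size
print(tok.special_tokens_map)   # special tokens and their roles

# Browse the first 100 vocabulary entries, ordered by token ID
vocab = tok.get_vocab()
first_100 = sorted(vocab, key=vocab.get)[:100]
print(first_100[:10])
```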
## Supported Models
### Pre-configured Models
- **Qwen Series**: Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM
### Custom Models
You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:
- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`
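Behind the field, loading a custom tokenizer is a one-liner, shown here with the first example above:

```python
from transformers import AutoTokenizer

# Any public Hub tokenizer loads the same way.
tok = AutoTokenizer.from_pretrained("facebook/bart-base")

enc = tok("custom tokenizers work too")
print(enc["input_ids"])
```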
## Technical Details
- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
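The caching behavior can be sketched with `functools.lru_cache`; this is a common pattern, and the app's actual implementation may differ:

```python
from functools import lru_cache

from transformers import AutoTokenizer

@lru_cache(maxsize=32)
def get_tokenizer(model_id: str):
    """Load a tokenizer once, then reuse it on later requests."""
    return AutoTokenizer.from_pretrained(model_id)

a = get_tokenizer("openai-community/gpt2")
b = get_tokenizer("openai-community/gpt2")
assert a is b  # second call hits the cache instead of reloading
```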
## Quick Start
1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields
## Tips
- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization allows handling of out-of-vocabulary words
- Token efficiency directly impacts model inference costs and API usage
## Local Development
To run this application locally:
```bash
# Clone the repository
git clone <repository-url>  # replace with this repository's URL
cd tokenizer-playground
# Install dependencies
pip install -r requirements.txt
# Run the application
python app.py
```
The application will be available at `http://localhost:7860`.
## License
This project is licensed under the MIT License.