---
title: Tokenizer Playground
emoji: 🔤
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
- Qwen/Qwen3-0.6B
- Qwen/Qwen2.5-7B
- meta-llama/Llama-3.1-8B
- openai-community/gpt2
- mistralai/Mistral-7B-v0.1
- google/gemma-7b
tags:
- tokenizer
- nlp
- text-processing
- research-tool
short_description: Interactive tokenizer tool for NLP researchers
---

# 🔤 Tokenizer Playground

An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare tokenization strategies.

## Features

### 🔤 Tokenize Tab

- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs

### 🔄 Detokenize Tab

- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization

### 📊 Compare Tab

- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count

### 📖 Vocabulary Tab

- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse the first 100 tokens in the vocabulary

## Supported Models

### Pre-configured Models

- **Qwen Series**: Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series**: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models**: GPT-2, GPT-NeoX
- **Google Models**: Gemma, T5, BERT
- **Mistral Models**: Mistral 7B, Mixtral 8x7B
- **Other Models**: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models

You can use any tokenizer available on the
Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:

- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`

## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies

## Quick Start

1. **Select a tokenizer** from the dropdown or enter a custom model ID
2. **Enter your text** in the input field
3. **Click the action button** (Tokenize, Decode, Compare, or Analyze)
4. **View the results** in the output fields

## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]` and `[SEP]`) are model-specific
- Subword tokenization allows out-of-vocabulary words to be handled gracefully
- Token efficiency directly impacts model inference costs and API usage

## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <repository-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at `http://localhost:7860`.

## License

This project is licensed under the MIT License.
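## Appendix: Parsing Token-ID Input

The Detokenize tab accepts token IDs as a list, comma-separated, or space-separated string. The parser below is a minimal illustrative sketch of how such flexible input can be normalized; `parse_token_ids` is a hypothetical helper name, not the app's actual implementation.

```python
import re

def parse_token_ids(raw: str) -> list[int]:
    """Parse token IDs from a list-style ("[15496, 995]"),
    comma-separated ("15496, 995"), or space-separated
    ("15496 995") string."""
    # Drop surrounding whitespace and any list brackets.
    cleaned = raw.strip().strip("[]")
    # Split on commas and/or runs of whitespace, discarding empties.
    parts = [p for p in re.split(r"[,\s]+", cleaned) if p]
    return [int(p) for p in parts]

print(parse_token_ids("[15496, 995]"))  # [15496, 995]
print(parse_token_ids("15496, 995"))    # [15496, 995]
print(parse_token_ids("15496 995"))     # [15496, 995]
```

All three input styles normalize to the same list, which can then be passed to a tokenizer's `decode` method.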