---
title: Omnilingual ASR Media Transcription
emoji: 🌍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---

# Experimental Omnilingual ASR Media Transcription Demo

A media transcription tool with a web interface for multilingual audio and video transcription using Meta's Omnilingual ASR model. Transcription is supported for 1600+ languages.

This application is designed primarily as a **web-based media transcription tool** with an intuitive frontend interface. While you can interact directly with the API endpoints, the recommended usage is through the web interface at `http://localhost:7860`.

## HuggingFace Space Configuration

This application is configured to run as a HuggingFace Space; however, it has resource limitations because it is a public Space. To run your own dedicated Space, clone it with the following recommended specifications:

- **Hardware**: A100 GPU (80GB) - required for loading the 7B-parameter Omnilingual ASR model
  - _Alternative_: Machines with less GPU memory can use smaller models by setting the `MODEL_NAME` environment variable in the HuggingFace Space settings, e.g. `omniASR_LLM_300M` (requires ~8GB of GPU memory)
- **Persistent Storage**: Medium (150GB), enabled for model caching and improved loading times
- **Docker Runtime**: Uses a custom Dockerfile for fairseq2 and PyTorch integration
- **Port**: 7860 (HuggingFace standard)

The A100 machine is specifically chosen to hold the large Omnilingual ASR model (~14GB) in GPU memory, ensuring fast inference and real-time transcription capabilities.

## Running Outside HuggingFace

While this application is designed for HuggingFace Spaces, **it can be run on any machine with Docker and GPU support** and hardware requirements similar to the HuggingFace machines described above.

## Getting Started

### Running with Docker

1. Build and run the container:

```bash
docker build -t omnilingual-asr-transcriptions .
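
# Optional: create a local directory for the model cache so downloaded models
# persist between runs; any path works as long as it matches the -v mount below.
mkdir -p ./models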

docker run --rm -p 7860:7860 --gpus all \
  -e MODEL_NAME=omniASR_LLM_300M \
  -v {your cache directory}:/home/user/app/models \
  omnilingual-asr-transcriptions
```

The media transcription app will be available at `http://localhost:7860`.

#### Docker Run Parameters Explained

- `--rm`: Automatically remove the container when it exits
- `-p 7860:7860`: Map host port 7860 to container port 7860
- `--gpus all`: Enable GPU access for CUDA acceleration
- `-e MODEL_NAME=omniASR_LLM_300M`: Set the Omnilingual ASR model variant to use
  - Options: `omniASR_LLM_1B` (default, 1B parameters), `omniASR_LLM_300M` (300M parameters, faster)
- `-e ENABLE_TOXIC_FILTERING=true`: Enable filtering of toxic words from transcription results (optional)
- `-v {your cache directory}:/home/user/app/models`: Mount a local models directory
  - **Purpose**: Persist downloaded models between container runs (14GB+ cache)
  - **Benefits**: Avoid re-downloading models on each container restart
  - **Path**: Adjust `{your cache directory}` to your local models directory

### Available API Endpoints

#### Core Transcription Routes

- `GET /health` - Comprehensive health check with GPU/CUDA status, FFmpeg availability, and transcription status
- `GET /status` - Get the current transcription status (busy/idle, progress, operation type)
- `POST /transcribe` - Audio transcription with automatic chunking for files of any length

#### Additional Routes

- `POST /combine-video-subtitles` - Combine video files with subtitle tracks
- `GET /` - Serve the web application frontend
- `GET /assets/` - Serve frontend static assets

### Usage Recommendations

**Primary Usage**: Access the web interface at `http://localhost:7860` for an intuitive media transcription experience with drag-and-drop file upload, real-time progress tracking, and downloadable results.

**API Usage**: For programmatic access or integration with other tools, you can call the API endpoints directly as shown in the examples below.

### Environment Variables

You are free to change these if you clone the Space and set them in the HuggingFace Space settings or in your own server environment. In the public shared demo they are controlled for an optimal experience.
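If you run your own container locally, the server-side variables listed below can also be passed to `docker run` as `-e` flags. The values in this sketch are only illustrative and are not the settings used by the public demo:

```bash
# Example values only - pick what fits your hardware and use case.
docker run --rm -p 7860:7860 --gpus all \
  -e API_LOG_LEVEL=INFO \
  -e MODEL_NAME=omniASR_LLM_300M \
  -e USE_CHUNKING=true \
  -e ENABLE_TOXIC_FILTERING=false \
  -v {your cache directory}:/home/user/app/models \
  omnilingual-asr-transcriptions
```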
#### Server Environment Variables

- `API_LOG_LEVEL` - Set logging level (DEBUG, INFO, WARNING, ERROR)
- `MODEL_NAME` - Omnilingual ASR model to use (default: omniASR_LLM_1B)
- `USE_CHUNKING` - Enable/disable audio chunking (default: true)
- `ENABLE_TOXIC_FILTERING` - Enable toxic word filtering of transcription results (default: false)

#### Frontend Environment Variables

- `VITE_ALLOW_ALL_LANGUAGES` - Set to `true` to show all 1,400+ supported languages in the language selector, or `false` to show only languages with error rates < 10% for the public demo (default: false)
- `VITE_ENABLE_ANALYTICS` - Set to `true` to enable Google Analytics tracking, or `false` to disable analytics (default: false)
- `VITE_REACT_APP_GOOGLE_ANALYTICS_ID` - Your Google Analytics measurement ID (e.g., `G-XXXXXXXXXX`) for tracking usage when analytics are enabled

### API Examples (For Developers)

For programmatic access or integration with other tools, you can call the API endpoints directly:

```bash
# Health check
curl http://localhost:7860/health

# Get transcription status
curl http://localhost:7860/status

# Transcribe audio file
curl -X POST http://localhost:7860/transcribe \
  -F "audio=@path/to/your/audio.wav"
```

## Project Structure

```
omnilingual-asr-transcriptions/
├── Dockerfile                           # Multi-stage build with frontend + backend
├── README.md
├── requirements.txt                     # Python dependencies
├── deploy.sh                            # Deployment script
├── run_docker.sh                        # Local Docker run script
├── frontend/                            # Web interface (React/Vite)
│   ├── package.json
│   ├── src/
│   └── dist/                            # Built frontend (served by Flask)
├── models/                              # Model files (automatically downloaded)
│   ├── ctc_alignment_mling_uroman_model.pt
│   ├── ctc_alignment_mling_uroman_model_dict.txt
│   └── [Additional model files downloaded at runtime]
└── server/                              # Flask API backend
    ├── server.py                        # Main Flask application
    ├── transcriptions_blueprint.py      # API routes
    ├── audio_transcription.py           # Core transcription logic
    ├── media_transcription_processor.py # Media processing
    ├── transcription_status.py          # Status tracking
    ├── env_vars.py                      # Environment configuration
    ├── run.sh                           # Production startup script
    ├── download_models.sh               # Model download script
    ├── wheels/                          # Pre-built Omnilingual ASR wheel packages
    └── inference/                       # Model inference components
        ├── mms_model_pipeline.py        # Omnilingual ASR model wrapper
        ├── audio_chunker.py             # Audio chunking logic
        └── audio_sentence_alignment.py  # Forced alignment
```

### Key Features

- **Simplified Architecture**: Single Docker container with built-in model management
- **Auto Model Download**: Models are downloaded automatically during container startup
- **Omnilingual ASR Integration**: Uses the latest Omnilingual ASR library with 1600+ language support
- **GPU Acceleration**: CUDA-enabled inference with automatic device detection
- **Web Interface**: Modern React frontend for easy testing and usage
- **Smart Transcription**: Single endpoint handles files of any length with automatic chunking
- **Intelligent Processing**: Automatic audio format detection and conversion

**Note**: Model files are large (14GB+ total) and are downloaded automatically when the container starts. The first run may take longer due to model downloads.
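Because of these startup downloads, the server can take several minutes before it accepts requests. The snippet below is a minimal sketch that waits for `GET /health` to respond before sending any work; it only checks for an HTTP success status and makes no assumptions about the response body:

```bash
# Poll /health until the server answers with a success status.
# The first start can take a while because models are downloaded on startup.
until curl -sf http://localhost:7860/health > /dev/null; do
  echo "Server not ready yet, retrying in 10s..."
  sleep 10
done
echo "Server is ready."
```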