---
title: Omnilingual ASR Media Transcription
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large
---
# Experimental Omnilingual ASR Media Transcription Demo
A media transcription tool with a web interface for multilingual audio and video transcription using Meta's Omnilingual ASR model. Transcriptions are supported for 1600+ languages.
This application is designed primarily as a web-based media transcription tool with an intuitive frontend interface. While you can interact directly with the API endpoints, the recommended usage is through the web interface at http://localhost:7860.
## HuggingFace Space Configuration
This application is configured to run as a HuggingFace Space; however, the public Space has resource limitations. To run your own dedicated Space, clone it with the following recommended specifications:
- Hardware: A100 GPU (80GB) - required for loading the 7B-parameter Omnilingual ASR model
- Alternative: Machines with lower GPU memory can use smaller models by setting the `MODEL_NAME` environment variable in the HuggingFace Space settings, e.g. `omniASR_LLM_300M` (requires ~8GB GPU memory)
- Persistent Storage: Enabled for model caching and improved loading times - Medium (150GB)
- Docker Runtime: Uses a custom Dockerfile for fairseq2 and PyTorch integration
- Port: 7860 (HuggingFace standard)
The A100 machine is specifically chosen to accommodate the large Omnilingual ASR model (~14GB) in GPU memory, ensuring fast inference and real-time transcription capabilities.
## Running Outside HuggingFace
While this application is designed for HuggingFace Spaces, it can be run on any machine with Docker and GPU support, with hardware requirements similar to those of the HuggingFace machines described above.
## Getting Started

### Running with Docker
1. Build and run the container:

```bash
docker build -t omnilingual-asr-transcriptions .
docker run --rm -p 7860:7860 --gpus all \
  -e MODEL_NAME=omniASR_LLM_300M \
  -v {your cache directory}:/home/user/app/models \
  omnilingual-asr-transcriptions
```
The media transcription app will be available at http://localhost:7860
**Docker Run Parameters Explained** (a convenience-script sketch follows this list):

- `--rm`: Automatically remove the container when it exits
- `-p 7860:7860`: Map host port 7860 to container port 7860
- `--gpus all`: Enable GPU access for CUDA acceleration
- `-e MODEL_NAME=omniASR_LLM_300M`: Set the Omnilingual ASR model variant to use
  - Options: `omniASR_LLM_1B` (default, 1B parameters), `omniASR_LLM_300M` (300M parameters, faster)
- `-e ENABLE_TOXIC_FILTERING=true`: Enable filtering of toxic words from transcription results (optional)
- `-v {your cache directory}:/home/user/app/models`: Mount a local models directory
  - Purpose: Persist downloaded models between container runs (14GB+ cache)
  - Benefits: Avoid re-downloading models on each container restart
  - Path: Adjust `{your cache directory}` to your local models directory
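For repeated local runs, it can be convenient to wrap the build and run commands above in a small shell script. The sketch below is only an illustration of that pattern; the variable names and the default cache path are assumptions, not the contents of the repository's `run_docker.sh`.

```bash
#!/usr/bin/env bash
# Illustrative wrapper around the docker commands shown above.
# MODEL_CACHE_DIR and the MODEL_NAME default are assumptions for this sketch.
set -euo pipefail

MODEL_CACHE_DIR="${MODEL_CACHE_DIR:-$HOME/.cache/omnilingual-asr-models}"
MODEL_NAME="${MODEL_NAME:-omniASR_LLM_300M}"

mkdir -p "$MODEL_CACHE_DIR"

docker build -t omnilingual-asr-transcriptions .
docker run --rm -p 7860:7860 --gpus all \
  -e MODEL_NAME="$MODEL_NAME" \
  -v "$MODEL_CACHE_DIR":/home/user/app/models \
  omnilingual-asr-transcriptions
```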
## Available API Endpoints

### Core Transcription Routes

- `GET /health` - Comprehensive health check with GPU/CUDA status, FFmpeg availability, and transcription status
- `GET /status` - Get current transcription status (busy/idle, progress, operation type)
- `POST /transcribe` - Audio transcription with automatic chunking for files of any length

### Additional Routes

- `POST /combine-video-subtitles` - Combine video files with subtitle tracks (an illustrative call is sketched after this list)
- `GET /` - Serve the web application frontend
- `GET /assets/<filename>` - Serve frontend static assets
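As an illustration of how the subtitle-combining route might be called from the command line, here is a hedged sketch. The multipart field names (`video`, `subtitles`) and the shape of the response are assumptions, not documented behavior; check the route implementation in `server/transcriptions_blueprint.py` for the actual parameters.

```bash
# Hypothetical example: the form field names "video" and "subtitles" are assumptions,
# and saving the response to a file presumes the endpoint returns the combined video.
curl -X POST http://localhost:7860/combine-video-subtitles \
  -F "video=@path/to/your/video.mp4" \
  -F "subtitles=@path/to/your/subtitles.srt" \
  -o video_with_subtitles.mp4
```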
## Usage Recommendations
Primary Usage: Access the web interface at http://localhost:7860 for an intuitive media transcription experience with drag-and-drop file upload, real-time progress tracking, and downloadable results.
API Usage: For programmatic access or integration with other tools, you can call the API endpoints directly as shown in the examples below.
## Environment Variables
You are free to change these if you clone the Space and set them in the HuggingFace Space settings or in your own server environment. In the public shared demo they are controlled for an optimal experience.
### Server Environment Variables
- `API_LOG_LEVEL` - Set logging level (DEBUG, INFO, WARNING, ERROR)
- `MODEL_NAME` - Omnilingual ASR model to use (default: `omniASR_LLM_1B`)
- `USE_CHUNKING` - Enable/disable audio chunking (default: true)
- `ENABLE_TOXIC_FILTERING` - Enable toxic word filtering from transcription results (default: false)
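As an example, these variables can be passed to the container at launch. The values below simply restate the documented defaults and are illustrative rather than recommended settings:

```bash
# Illustrative only: run the container with explicit server settings.
docker run --rm -p 7860:7860 --gpus all \
  -e API_LOG_LEVEL=INFO \
  -e MODEL_NAME=omniASR_LLM_1B \
  -e USE_CHUNKING=true \
  -e ENABLE_TOXIC_FILTERING=false \
  -v {your cache directory}:/home/user/app/models \
  omnilingual-asr-transcriptions
```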
### Frontend Environment Variables
- `VITE_ALLOW_ALL_LANGUAGES` - Set to `true` to show all 1,400+ supported languages in the language selector, or `false` to show only languages with error rates < 10% for the public demo (default: false)
- `VITE_ENABLE_ANALYTICS` - Set to `true` to enable Google Analytics tracking, or `false` to disable analytics (default: false)
- `VITE_REACT_APP_GOOGLE_ANALYTICS_ID` - Your Google Analytics measurement ID (e.g., `G-XXXXXXXXXX`) for tracking usage when analytics are enabled
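Because Vite reads `VITE_`-prefixed variables at build time, one way to set them for a local (non-Docker) frontend build is a `.env` file in `frontend/`. This is only a sketch for local builds: how the multi-stage Dockerfile passes these values to its own frontend build step is not covered here, and the presence of a standard `npm run build` script is assumed.

```bash
# Sketch: configure the frontend for a local build (values are examples only).
cat > frontend/.env <<'EOF'
VITE_ALLOW_ALL_LANGUAGES=true
VITE_ENABLE_ANALYTICS=false
EOF

# Build the static assets that the Flask server serves from frontend/dist/.
cd frontend && npm install && npm run build
```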
## API Examples (For Developers)
For programmatic access or integration with other tools, you can call the API endpoints directly:
```bash
# Health check
curl http://localhost:7860/health

# Get transcription status
curl http://localhost:7860/status

# Transcribe audio file
curl -X POST http://localhost:7860/transcribe \
  -F "audio=@path/to/your/audio.wav"
```
## Project Structure
```
omnilingual-asr-transcriptions/
├── Dockerfile                          # Multi-stage build with frontend + backend
├── README.md
├── requirements.txt                    # Python dependencies
├── deploy.sh                           # Deployment script
├── run_docker.sh                       # Local Docker run script
├── frontend/                           # Web interface (React/Vite)
│   ├── package.json
│   ├── src/
│   └── dist/                           # Built frontend (served by Flask)
├── models/                             # Model files (automatically downloaded)
│   ├── ctc_alignment_mling_uroman_model.pt
│   ├── ctc_alignment_mling_uroman_model_dict.txt
│   └── [Additional model files downloaded at runtime]
└── server/                             # Flask API backend
    ├── server.py                       # Main Flask application
    ├── transcriptions_blueprint.py     # API routes
    ├── audio_transcription.py          # Core transcription logic
    ├── media_transcription_processor.py # Media processing
    ├── transcription_status.py         # Status tracking
    ├── env_vars.py                     # Environment configuration
    ├── run.sh                          # Production startup script
    ├── download_models.sh              # Model download script
    ├── wheels/                         # Pre-built Omnilingual ASR wheel packages
    └── inference/                      # Model inference components
        ├── mms_model_pipeline.py       # Omnilingual ASR model wrapper
        ├── audio_chunker.py            # Audio chunking logic
        └── audio_sentence_alignment.py # Forced alignment
```
## Key Features
- Simplified Architecture: Single Docker container with built-in model management
- Auto Model Download: Models are downloaded automatically during container startup
- Omnilingual ASR Integration: Uses the latest Omnilingual ASR library with 1600+ language support
- GPU Acceleration: CUDA-enabled inference with automatic device detection
- Web Interface: Modern React frontend for easy testing and usage
- Smart Transcription: Single endpoint handles files of any length with automatic chunking
- Intelligent Processing: Automatic audio format detection and conversion
**Note:** Model files are large (14GB+ total) and are downloaded automatically when the container starts. The first run may take longer due to model downloads.