---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

# Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using the Hugging Face Jobs API.

## Setup

### Prerequisites

- Python 3.8+
- A Hugging Face account with an API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:

```bash
cd InferenceProviderTestingBackend
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Set up your Hugging Face token as an environment variable:

```bash
export HF_TOKEN="your_huggingface_token_here"
```

**Important**: Your `HF_TOKEN` must have:

- Permission to call inference providers
- Write access to the `IPTesting` organization

## Usage

### Starting the Dashboard

Run the Gradio app:

```bash
python app.py
```

### Initialize Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:

```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```

Format: one `model_name provider_name` pair per line, separated by spaces or tabs.

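
For reference, here is a minimal sketch of how a file in this format can be parsed (the actual logic lives in `utils/io.py` and may differ; `read_model_provider_pairs` is just an illustrative name):

```python
from pathlib import Path
from typing import List, Tuple

def read_model_provider_pairs(path: str = "models_providers.txt") -> List[Tuple[str, str]]:
    """Parse lines of the form '<model_name> <provider_name>' into (model, provider) pairs."""
    pairs = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        model, provider = line.split(maxsplit=1)  # any run of spaces or tabs separates the two fields
        pairs.append((model, provider.strip()))
    return pairs

print(read_model_provider_pairs())
```
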
### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**

The system will:

- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically

### Monitoring Jobs

The **Job Results** table displays all jobs with:

- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The most recent job ID; plug it into `https://huggingface.co/jobs/NAMESPACE/JOBID` to inspect the job

The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for a manual update.

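
A minimal sketch of how this kind of periodic refresh can be wired up in Gradio (illustrative only; `fetch_results` is a hypothetical stand-in for the app's own result lookup, and `app.py` may organize this differently):

```python
import gradio as gr

def fetch_results():
    # Hypothetical placeholder: the real app reads its shared job_results state here.
    return [["meta-llama/Llama-3.2-3B-Instruct", "fireworks-ai", "running", "-"]]

with gr.Blocks() as demo:
    table = gr.Dataframe(headers=["Model", "Provider", "Status", "Current Score"])
    refresh = gr.Button("Refresh Results")
    timer = gr.Timer(30)  # fires every 30 seconds

    refresh.click(fetch_results, outputs=table)  # manual refresh
    timer.tick(fetch_results, outputs=table)     # automatic refresh

demo.launch()
```
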
## Configuration

### Tasks Format

The tasks parameter follows the lighteval task-string format. Example:

- `lighteval|mmlu|0` - MMLU benchmark

### Daily Checkpoint

The system automatically saves all results to the Hugging Face dataset at **00:00 (midnight)** every day.

### Data Persistence

All job results are stored in a Hugging Face dataset (`IPTesting/inference-provider-test-results`), which means:

- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the `datasets` library, as shown below

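
For example, a minimal sketch of loading the results with the `datasets` library (the `"train"` split name is an assumption; check the dataset page for the actual configuration):

```python
from datasets import load_dataset

# Requires an HF_TOKEN with read access to the IPTesting organization.
results = load_dataset("IPTesting/inference-provider-test-results", split="train")

print(results.column_names)
print(results[0])  # inspect one stored job record
```
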
## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based); see the sketch below
- **Thread-safe**: Uses locks to prevent access issues when checking `job_results`
- **Hugging Face Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset

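
A minimal sketch of the cron-style midnight checkpoint with APScheduler (illustrative; `save_checkpoint` is a hypothetical stand-in for the app's dataset-saving routine):

```python
from apscheduler.schedulers.background import BackgroundScheduler

def save_checkpoint():
    # Hypothetical placeholder: the real app pushes job_results to the HF dataset here.
    print("Saving daily checkpoint...")

scheduler = BackgroundScheduler()
# Cron trigger: run every day at 00:00 (midnight).
scheduler.add_job(save_checkpoint, "cron", hour=0, minute=0)
scheduler.start()
```
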
## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions (a quick check is sketched after this list)
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages

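
A minimal way to sanity-check the token from Python (assumes `huggingface_hub` is installed alongside the app's other dependencies):

```python
import os
from huggingface_hub import whoami

# Raises an error if the token is missing or invalid.
info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated as: {info['name']}")
```
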
### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores (see the parsing sketch after this list)
- Example table format:

```
| Task | Version | Metric | Value | Stderr |
| extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
| lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
```

- If scores don't appear, check the console output for extraction errors or parsing issues

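
A simplified sketch of this extraction (the real implementation lives in `utils/jobs.py` and may differ in detail):

```python
from typing import Dict

def extract_scores(log_text: str) -> Dict[str, float]:
    """Pull the first score seen for each task out of a lighteval-style results table."""
    scores: Dict[str, float] = {}
    for line in log_text.splitlines():
        if not line.strip().startswith("|"):
            continue  # not a table row
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Expected columns: task | version | metric | value | stderr
        if len(cells) < 5 or cells[0] in ("", "Task"):
            continue
        task, value = cells[0], cells[3]
        if task not in scores:  # keep only the first row per task
            try:
                scores[task] = float(value)
            except ValueError:
                pass  # separator or non-numeric row
    return scores

sample = """
| Task | Version | Metric | Value | Stderr |
| extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
| lighteval:gpqa:diamond:0 | | gpqa_pass@k_with_k | 0.5000 | 0.0503 |
"""
scores = extract_scores(sample)
print(scores, sum(scores.values()) / len(scores))  # per-task scores and their average
```
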
## Files

- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file