Clémentine committed · Commit 8dafde0 · Parent(s): 7f5506e

update

Browse files:
- README.md +4 -51
- app.py +1 -1
- utils/io.py +2 -1
- utils/jobs.py +33 -47
README.md CHANGED

@@ -13,17 +13,6 @@ pinned: false
 
 A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
 
-## Features
-
-- **Automatic Model Discovery**: Fetch popular text-generation models with inference providers from Hugging Face Hub
-- **Batch Job Launching**: Run evaluation jobs for multiple model-provider combinations from a configuration file
-- **Results Table Dashboard**: View all jobs with model, provider, last run, status, current score, and previous score
-- **Score Tracking**: Automatically extracts average scores from completed jobs and tracks history
-- **Persistent Storage**: Results saved to HuggingFace dataset for persistence across restarts
-- **Individual Job Relaunch**: Easily relaunch specific model-provider combinations
-- **Real-time Monitoring**: Auto-refresh results table every 30 seconds
-- **Daily Checkpoint**: Automatic daily save at midnight to preserve state
-
 ## Setup
 
 ### Prerequisites
@@ -99,30 +88,20 @@ The **Job Results** table displays all jobs with:
 - **Status**: Current status (running/complete/failed/cancelled)
 - **Current Score**: Average score from the most recent run
 - **Previous Score**: Average score from the prior run (for comparison)
+- **Latest Job Id**: Latest job id to put in https://huggingface.co/jobs/NAMESPACE/JOBID for inspection
 
 The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
 
-### Relaunching Individual Jobs
-
-To rerun a specific model-provider combination:
-1. Enter the model name (e.g., `meta-llama/Llama-3.2-3B-Instruct`)
-2. Enter the provider name (e.g., `fireworks-ai`)
-3. Optionally modify the tasks
-4. Click "Relaunch Job"
-
-When relaunching, the current score automatically moves to previous score for comparison.
-
 ## Configuration
 
 ### Tasks Format
 
 The tasks parameter follows the lighteval format. Examples:
-- `lighteval|mmlu|0
-- `lighteval|hellaswag|0|0` - HellaSwag benchmark
+- `lighteval|mmlu|0` - MMLU benchmark
 
 ### Daily Checkpoint
 
-The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day.
+The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day.
 
 ### Data Persistence
 
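The new **Latest Job Id** column is meant to be pasted into the jobs UI. A minimal sketch of building that inspection URL from a namespace and a job id (the values below are illustrative, not from this commit):

```python
# Builds the inspection URL following the pattern quoted in the README:
# https://huggingface.co/jobs/NAMESPACE/JOBID
def job_inspection_url(namespace: str, job_id: str) -> str:
    return f"https://huggingface.co/jobs/{namespace}/{job_id}"

# Illustrative values only.
print(job_inspection_url("IPTesting", "abc123"))
# -> https://huggingface.co/jobs/IPTesting/abc123
```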
@@ -131,22 +110,12 @@ All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`):
 - Historical score comparisons are maintained
 - Data can be accessed programmatically via the HF datasets library
 
-## Job Command Details
-
-Each job runs with the following configuration:
-- **Image**: `hf.co/spaces/OpenEvals/EvalsOnTheHub`
-- **Command**: `lighteval endpoint inference-providers`
-- **Namespace**: `IPTesting`
-- **Flags**: `--push-to-hub --save-details --results-org IPTesting`
-
-Results are automatically pushed to the `IPTesting` organization on Hugging Face Hub.
-
 ## Architecture
 
 - **Main Thread**: Runs the Gradio interface
 - **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
 - **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
-- **Thread-safe
+- **Thread-safe**: Uses locks to prevent access issues when checking job_results
 - **HuggingFace Dataset Storage**: Persists results to `IPTesting/inference-provider-test-results` dataset
 
 ## Troubleshooting
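The repaired **Thread-safe** bullet refers to the lock guarding `job_results`. A minimal sketch of that pattern, assuming a module-level lock and dict like the `globals.results_lock` / `globals.job_results` pair used in utils/jobs.py:

```python
import threading

# Shared state touched by both the Gradio thread and the monitor thread.
results_lock = threading.Lock()
job_results = {}  # key -> {"job_id": ..., "status": ..., "current_score": ...}

def read_entry(key: str):
    # Hold the lock only while reading; return a copy so callers
    # never observe a dict being mutated by the monitor thread.
    with results_lock:
        entry = job_results.get(key)
        return dict(entry) if entry is not None else None
```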
@@ -157,18 +126,6 @@ Results are automatically pushed to the `IPTesting` organization on Hugging Face Hub.
 - Check that the `IPTesting` namespace exists and you have access
 - Review logs for specific error messages
 
-### Empty Models List
-
-- Ensure you have internet connectivity
-- The Hugging Face Hub API must be accessible
-- Try running the initialization again
-
-### Job Status Not Updating
-
-- Check your internet connection
-- Verify the job IDs are valid
-- Check console output for API errors
-
 ### Scores Not Appearing
 
 - Scores are extracted from job logs after completion
@@ -189,11 +146,7 @@ Results are automatically pushed to the `IPTesting` organization on Hugging Face Hub.
 - [utils/](utils/) - Utility package with helper modules:
 - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
 - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
-- [utils/__init__.py](utils/__init__.py) - Package initialization and exports
 - [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
 - [requirements.txt](requirements.txt) - Python dependencies
 - [README.md](README.md) - This file
 
-## License
-
-This project is provided as-is for evaluation testing purposes.
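For the "accessed programmatically via the HF datasets library" claim under Data Persistence, a hedged sketch; the dataset name comes from the README, while the `train` split is an assumption, not something this commit confirms:

```python
from datasets import load_dataset

# Dataset name from the README; the split is an assumption.
ds = load_dataset("IPTesting/inference-provider-test-results", split="train")

# Print a few rows to see the stored schema.
for row in ds.select(range(min(5, len(ds)))):
    print(row)
```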
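The README's daily checkpoint (APScheduler, cron-based, midnight) corresponds to a pattern like the sketch below; `save_results_to_dataset` is a hypothetical stand-in for the app's actual save routine:

```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def save_results_to_dataset():
    # Hypothetical stand-in for the app's checkpoint function.
    print("Saving results to the HF dataset...")

scheduler = BackgroundScheduler()
# Fire once a day at 00:00, matching the README's "Daily Checkpoint".
scheduler.add_job(save_results_to_dataset, CronTrigger(hour=0, minute=0))
scheduler.start()
```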
app.py CHANGED

@@ -44,7 +44,7 @@ def create_app() -> gr.Blocks:
         with gr.Column():
             gr.Markdown("## Job Results")
             results_table = gr.Dataframe(
-                headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score"],
+                headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score", "Latest Job Id"],
                 value=get_results_table(),
                 interactive=False,
                 wrap=True
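A self-contained sketch of the widened table plus the 30-second auto-refresh the README describes, assuming a Gradio version that ships `gr.Timer`; the data function here is a stub, not the app's real `get_results_table`:

```python
import gradio as gr

def get_results_table():
    # Stub returning one illustrative row; the real rows are built in utils/io.py.
    return [["meta-llama/Llama-3.2-3B-Instruct", "fireworks-ai",
             "2024-01-01 00:00", "complete", "0.6421", "0.6398", "N/A"]]

with gr.Blocks() as demo:
    gr.Markdown("## Job Results")
    results_table = gr.Dataframe(
        headers=["Model", "Provider", "Last Run", "Status",
                 "Current Score", "Previous Score", "Latest Job Id"],
        value=get_results_table(),
        interactive=False,
        wrap=True,
    )
    # Re-query the table every 30 seconds, as the README describes.
    timer = gr.Timer(30)
    timer.tick(get_results_table, outputs=results_table)

demo.launch()
```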
utils/io.py CHANGED

@@ -142,7 +142,8 @@ def get_results_table() -> List[List[str]]:
             info["last_run"],
             info["status"],
             current_score,
-            previous_score
+            previous_score,
+            info.get("job_id", "N/A")
         ])
 
     return table_data
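What the patched row construction yields for an entry that predates the new column; `info.get("job_id", "N/A")` keeps legacy entries without a stored job id from breaking the table (the dict below is illustrative):

```python
# Illustrative entry; the real dicts live in globals.job_results.
info = {
    "last_run": "2024-01-01 00:00",
    "status": "complete",
    # no "job_id" key stored for this legacy entry
}

row = [
    info["last_run"],
    info["status"],
    "0.6421",                    # current_score (illustrative)
    "0.6398",                    # previous_score (illustrative)
    info.get("job_id", "N/A"),   # falls back to "N/A" for legacy entries
]
print(row)  # ['2024-01-01 00:00', 'complete', '0.6421', '0.6398', 'N/A']
```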
utils/jobs.py CHANGED

@@ -1,4 +1,4 @@
-from huggingface_hub import run_job, inspect_job
+from huggingface_hub import run_job, inspect_job, fetch_job_logs
 import os
 import re
 import time

@@ -16,54 +16,37 @@ def extract_score_from_job(job_id: str) -> Optional[float]:
     """
     try:
         # Inspect the job to get details and logs
-        job_info = inspect_job(job_id=job_id)
+        logs = fetch_job_logs(job_id=job_id)
 
-        if hasattr(job_info, 'logs') and job_info.logs:
-            logs = job_info.logs
-            lines = logs.split('\n')
+        scores = []
 
+        for line in logs:
             # Find the results table
             # Look for lines that match the pattern: |task_name|version|metric|value|...|
             # We want to extract the score (value) from lines where the task name is not empty
-            … (deleted parsing lines not recoverable from this diff view) …
-                    score_match = re.match(r'([0-9]+\.?[0-9]*)', value_clean)
-                    if score_match:
-                        score = float(score_match.group(1))
-                        scores.append(score)
-                        print(f"Extracted score {score} for task '{task}' metric '{metric}'")
-                except (ValueError, AttributeError):
-                    continue
-
-            # Calculate average of all task scores
-            if scores:
-                average_score = sum(scores) / len(scores)
-                print(f"Calculated average score: {average_score:.4f} from {len(scores)} tasks")
-                return average_score
-            else:
-                print("No scores found in job logs")
+            if '|' in line:
+                parts = [p.strip() for p in line.split('|')]
+
+                # Skip header and separator lines
+                # Table format: | Task | Version | Metric | Value | | Stderr |
+                if len(parts) == 8:
+                    _, task, _, metric, value, _, _, _ = parts
+
+                    # Is the task name correct
+                    if task and task in [t.replace("|", ":") for t in globals.TASKS.split(",")]:
+                        # Try to extract numeric value
+                        score = float(value)
+                        scores.append(score)
+                        print(f"Extracted score {score} for task '{task}' metric '{metric}'")
+
+        # Calculate average of all task scores
+        if scores:
+            average_score = sum(scores) / len(scores)
+            print(f"Calculated average score: {average_score:.4f} from {len(scores)} tasks")
+            return average_score
+        else:
+            print("No scores found in job logs")
 
         return None
 

@@ -168,9 +151,6 @@ def update_job_statuses() -> None:
     for key in keys:
         try:
             with globals.results_lock:
-                if globals.job_results[key]["status"] in ["complete", "failed", "cancelled"]:
-                    continue  # Skip already finished jobs
-
                 job_id = globals.job_results[key]["job_id"]
 
             job_info = inspect_job(job_id=job_id)

@@ -189,6 +169,12 @@ def update_job_statuses() -> None:
                 if score is not None:
                     globals.job_results[key]["current_score"] = score
 
+            if new_status == "COMPLETED" and globals.job_results[key]["current_score"] is None:
+                score = extract_score_from_job(job_id)
+                if score is not None:
+                    globals.job_results[key]["current_score"] = score
+
+
         except Exception as e:
             print(f"Error checking job: {str(e)}")
 
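To make the new log parsing concrete, a self-contained sketch running the same split-and-filter logic over one illustrative results-table line; the task string mirrors the README's `lighteval|mmlu|0` example, and the log line is made up:

```python
# Stand-ins for globals.TASKS and one line from fetch_job_logs().
TASKS = "lighteval|mmlu|0"
log_line = "|lighteval:mmlu:0|0|acc|0.6421| |0.0123|"

# The table prints task names with ':' separators, hence the replace().
wanted = [t.replace("|", ":") for t in TASKS.split(",")]

scores = []
if "|" in log_line:
    parts = [p.strip() for p in log_line.split("|")]
    # A 6-cell table row splits into 8 parts: empty strings at both ends plus 6 cells.
    if len(parts) == 8:
        _, task, _, metric, value, _, _, _ = parts
        if task and task in wanted:
            scores.append(float(value))

print(scores)  # [0.6421]
```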
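The last hunk backfills scores for jobs that reach COMPLETED without one. A condensed sketch of that flow; the `status.stage` attribute access is an assumption about the `JobInfo` object returned by `inspect_job`:

```python
from huggingface_hub import inspect_job

from utils.jobs import extract_score_from_job  # the function patched above

def backfill_score(entry: dict) -> None:
    # Condensed version of the logic added to update_job_statuses().
    job_info = inspect_job(job_id=entry["job_id"])
    new_status = job_info.status.stage  # assumption: stage holds "COMPLETED", etc.
    if new_status == "COMPLETED" and entry.get("current_score") is None:
        score = extract_score_from_job(entry["job_id"])
        if score is not None:
            entry["current_score"] = score
```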