Clémentine committed
Commit 8dafde0 · 1 Parent(s): 7f5506e
Files changed (4):
  1. README.md +4 -51
  2. app.py +1 -1
  3. utils/io.py +2 -1
  4. utils/jobs.py +33 -47
README.md CHANGED
@@ -13,17 +13,6 @@ pinned: false
 
 A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
 
-## Features
-
-- **Automatic Model Discovery**: Fetch popular text-generation models with inference providers from Hugging Face Hub
-- **Batch Job Launching**: Run evaluation jobs for multiple model-provider combinations from a configuration file
-- **Results Table Dashboard**: View all jobs with model, provider, last run, status, current score, and previous score
-- **Score Tracking**: Automatically extracts average scores from completed jobs and tracks history
-- **Persistent Storage**: Results saved to HuggingFace dataset for persistence across restarts
-- **Individual Job Relaunch**: Easily relaunch specific model-provider combinations
-- **Real-time Monitoring**: Auto-refresh results table every 30 seconds
-- **Daily Checkpoint**: Automatic daily save at midnight to preserve state
-
 ## Setup
 
 ### Prerequisites
@@ -99,30 +88,20 @@ The **Job Results** table displays all jobs with:
 - **Status**: Current status (running/complete/failed/cancelled)
 - **Current Score**: Average score from the most recent run
 - **Previous Score**: Average score from the prior run (for comparison)
+- **Latest Job Id**: Latest job id; open https://huggingface.co/jobs/NAMESPACE/JOBID to inspect the run
 
 The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
 
-### Relaunching Individual Jobs
-
-To rerun a specific model-provider combination:
-1. Enter the model name (e.g., `meta-llama/Llama-3.2-3B-Instruct`)
-2. Enter the provider name (e.g., `fireworks-ai`)
-3. Optionally modify the tasks
-4. Click "Relaunch Job"
-
-When relaunching, the current score automatically moves to previous score for comparison.
-
 ## Configuration
 
 ### Tasks Format
 
 The tasks parameter follows the lighteval format. Examples:
-- `lighteval|mmlu|0|0` - MMLU benchmark
-- `lighteval|hellaswag|0|0` - HellaSwag benchmark
+- `lighteval|mmlu|0` - MMLU benchmark
 
 ### Daily Checkpoint
 
-The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day. This ensures data persistence and prevents data loss from long-running sessions.
+The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day.
 
 ### Data Persistence
 
@@ -131,22 +110,12 @@ All job results are stored in a HuggingFace dataset (`IPTesting/inference-provid
 - Historical score comparisons are maintained
 - Data can be accessed programmatically via the HF datasets library
 
-## Job Command Details
-
-Each job runs with the following configuration:
-- **Image**: `hf.co/spaces/OpenEvals/EvalsOnTheHub`
-- **Command**: `lighteval endpoint inference-providers`
-- **Namespace**: `IPTesting`
-- **Flags**: `--push-to-hub --save-details --results-org IPTesting`
-
-Results are automatically pushed to the `IPTesting` organization on Hugging Face Hub.
-
 ## Architecture
 
 - **Main Thread**: Runs the Gradio interface
 - **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
 - **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
-- **Thread-safe Operations**: Uses locks to prevent race conditions when accessing job_results
+- **Thread-safe**: Uses locks to prevent race conditions when accessing job_results
 - **HuggingFace Dataset Storage**: Persists results to `IPTesting/inference-provider-test-results` dataset
 
 ## Troubleshooting
@@ -157,18 +126,6 @@ Results are automatically pushed to the `IPTesting` organization on Hugging Face
 - Check that the `IPTesting` namespace exists and you have access
 - Review logs for specific error messages
 
-### Empty Models List
-
-- Ensure you have internet connectivity
-- The Hugging Face Hub API must be accessible
-- Try running the initialization again
-
-### Job Status Not Updating
-
-- Check your internet connection
-- Verify the job IDs are valid
-- Check console output for API errors
-
 ### Scores Not Appearing
 
 - Scores are extracted from job logs after completion
@@ -189,11 +146,7 @@ Results are automatically pushed to the `IPTesting` organization on Hugging Face
 - [utils/](utils/) - Utility package with helper modules:
   - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
   - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
-  - [utils/__init__.py](utils/__init__.py) - Package initialization and exports
 - [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
 - [requirements.txt](requirements.txt) - Python dependencies
 - [README.md](README.md) - This file
 
-## License
-
-This project is provided as-is for evaluation testing purposes.
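The "Data Persistence" bullets kept in the hunk above say results can be read programmatically via the HF datasets library. A minimal sketch of doing so, assuming the `IPTesting/inference-provider-test-results` dataset is readable with your token and exposes a default `train` split (the split name and record layout are assumptions, not confirmed by this commit):

```python
from datasets import load_dataset

# Load the persisted job results from the Hub; assumes read access
# and a default "train" split (an assumption, not confirmed here).
results = load_dataset("IPTesting/inference-provider-test-results", split="train")

for row in results:
    print(row)  # one record per model/provider job result
```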
app.py CHANGED
@@ -44,7 +44,7 @@ def create_app() -> gr.Blocks:
         with gr.Column():
             gr.Markdown("## Job Results")
             results_table = gr.Dataframe(
-                headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score"],
+                headers=["Model", "Provider", "Last Run", "Status", "Current Score", "Previous Score", "Latest Job Id"],
                 value=get_results_table(),
                 interactive=False,
                 wrap=True
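For reference, a self-contained sketch of the widget this hunk touches, with an illustrative row shaped to the new seven-column header list (the sample values are hypothetical; real rows come from `get_results_table()`):

```python
import gradio as gr

# Hypothetical sample row matching the new seven-column header list.
sample_rows = [
    ["meta-llama/Llama-3.2-3B-Instruct", "fireworks-ai", "2025-01-01 00:00",
     "complete", "0.6123", "0.6050", "abc123def456"],
]

with gr.Blocks() as demo:
    gr.Markdown("## Job Results")
    results_table = gr.Dataframe(
        headers=["Model", "Provider", "Last Run", "Status",
                 "Current Score", "Previous Score", "Latest Job Id"],
        value=sample_rows,
        interactive=False,
        wrap=True,
    )

if __name__ == "__main__":
    demo.launch()
```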
utils/io.py CHANGED
@@ -142,7 +142,8 @@ def get_results_table() -> List[List[str]]:
             info["last_run"],
             info["status"],
             current_score,
-            previous_score
+            previous_score,
+            info.get("job_id", "N/A")
         ])
 
     return table_data
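The appended field keeps each row aligned with the seven headers registered in app.py. A minimal sketch of the resulting row shape, assuming a `job_results` entry like the one below (only the `last_run`, `status`, and `job_id` keys are visible in this diff; the rest is illustrative):

```python
# Illustrative job_results entry; only the keys used in the diff
# (last_run, status, job_id) are confirmed by this commit.
info = {
    "last_run": "2025-01-01 00:00",
    "status": "complete",
    "job_id": "abc123def456",
}
current_score, previous_score = "0.6123", "0.6050"

row = [
    "meta-llama/Llama-3.2-3B-Instruct",  # model (illustrative)
    "fireworks-ai",                      # provider (illustrative)
    info["last_run"],
    info["status"],
    current_score,
    previous_score,
    info.get("job_id", "N/A"),  # "N/A" covers entries saved before this change
]
assert len(row) == 7  # matches the header count set in app.py
```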
utils/jobs.py CHANGED
@@ -1,4 +1,4 @@
-from huggingface_hub import run_job, inspect_job
+from huggingface_hub import run_job, inspect_job, fetch_job_logs
 import os
 import re
 import time
@@ -16,54 +16,37 @@ def extract_score_from_job(job_id: str) -> Optional[float]:
     """
     try:
         # Inspect the job to get details and logs
-        job_info = inspect_job(job_id=job_id)
+        logs = fetch_job_logs(job_id=job_id)
 
-        # Get the logs from the job
-        if hasattr(job_info, 'logs') and job_info.logs:
-            logs = job_info.logs
-            lines = logs.split('\n')
+        scores = []
 
+        for line in logs:
             # Find the results table
             # Look for lines that match the pattern: |task_name|version|metric|value|...|
             # We want to extract the score (value) from lines where the task name is not empty
-
-            scores = []
-
-            for line in lines:
-                # Check if we're in a table (contains pipe separators)
-                if '|' in line:
-                    parts = [p.strip() for p in line.split('|')]
-
-                    # Skip header and separator lines
-                    # Table format: | Task | Version | Metric | Value | | Stderr |
-                    if len(parts) >= 5:
-                        task = parts[1] if len(parts) > 1 else ""
-                        metric = parts[3] if len(parts) > 3 else ""
-                        value = parts[4] if len(parts) > 4 else ""
-
-                        # We only want lines where the task name is not empty (main metric for that task)
-                        # Skip lines with "Task", "---", or empty task names
-                        if task and task not in ["Task", ""] and not task.startswith("-"):
-                            # Try to extract numeric value
-                            # Remove any extra characters and convert to float
-                            value_clean = value.strip()
-                            try:
-                                # Extract the numeric part (may have ± symbol after)
-                                score_match = re.match(r'([0-9]+\.?[0-9]*)', value_clean)
-                                if score_match:
-                                    score = float(score_match.group(1))
-                                    scores.append(score)
-                                    print(f"Extracted score {score} for task '{task}' metric '{metric}'")
-                            except (ValueError, AttributeError):
-                                continue
-
-        # Calculate average of all task scores
-        if scores:
-            average_score = sum(scores) / len(scores)
-            print(f"Calculated average score: {average_score:.4f} from {len(scores)} tasks")
-            return average_score
-        else:
-            print("No scores found in job logs")
+            if '|' in line:
+                parts = [p.strip() for p in line.split('|')]
+
+                # Skip header and separator lines
+                # Table format: | Task | Version | Metric | Value | | Stderr |
+                if len(parts) == 8:
+                    _, task, _, metric, value, _, _, _ = parts
+
+                    # Is the task name one of the configured tasks?
+                    if task and task in [t.replace("|", ":") for t in globals.TASKS.split(",")]:
+                        # The value column holds the numeric score,
+                        # so convert it directly to float
+                        score = float(value)
+                        scores.append(score)
+                        print(f"Extracted score {score} for task '{task}' metric '{metric}'")
+
+        # Calculate average of all task scores
+        if scores:
+            average_score = sum(scores) / len(scores)
+            print(f"Calculated average score: {average_score:.4f} from {len(scores)} tasks")
+            return average_score
+        else:
+            print("No scores found in job logs")
 
         return None
 
@@ -168,9 +151,6 @@ def update_job_statuses() -> None:
     for key in keys:
         try:
            with globals.results_lock:
-                if globals.job_results[key]["status"] in ["complete", "failed", "cancelled"]:
-                    continue  # Skip already finished jobs
-
                job_id = globals.job_results[key]["job_id"]
 
                job_info = inspect_job(job_id=job_id)
@@ -189,6 +169,12 @@ def update_job_statuses() -> None:
                    if score is not None:
                        globals.job_results[key]["current_score"] = score
 
+                if new_status == "COMPLETED" and globals.job_results[key]["current_score"] is None:
+                    score = extract_score_from_job(job_id)
+                    if score is not None:
+                        globals.job_results[key]["current_score"] = score
+
+
        except Exception as e:
            print(f"Error checking job: {str(e)}")
 
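To see why the new `len(parts) == 8` guard matches exactly one results row: a full lighteval table line is wrapped in leading and trailing pipes, so splitting on `'|'` yields eight parts with empty strings at both ends. Note also that the task filter converts configured task strings like `lighteval|mmlu|0` into `lighteval:mmlu:0` before comparing. A quick sketch (the sample log line is illustrative, not taken from a real job):

```python
# Illustrative results-table line as it might appear in the job logs,
# following the "| Task | Version | Metric | Value | | Stderr |" layout.
line = "|lighteval:mmlu:0|0|acc|0.3100|±|0.0046|"

parts = [p.strip() for p in line.split('|')]
assert len(parts) == 8  # leading/trailing '|' add empty first and last parts

_, task, _, metric, value, _, _, _ = parts
print(task, metric, float(value))  # -> lighteval:mmlu:0 acc 0.31
```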