VibecoderMcSwaggins committed
Commit e18ea9a · 1 Parent(s): 599a754

feat: Implement Free Tier synthesis using HuggingFace Inference


- Added a synthesis method to the free-tier judge handler (`HFInferenceJudgeHandler`) that uses free HuggingFace Inference for narrative generation in Free Tier mode.
- Updated the Orchestrator to check for the judge's synthesis method, ensuring consistent behavior across judging and synthesis.
- Enhanced integration and unit tests to verify correct use of the new synthesis method and the fallback mechanisms.
- Documented the bug where synthesis incorrectly used server-side API keys, outlining the root cause and potential fixes.

This change addresses user confusion and improves the Free Tier experience by ensuring expected functionality without requiring an API key.

docs/bugs/P1_SYNTHESIS_BROKEN_KEY_FALLBACK.md ADDED
@@ -0,0 +1,132 @@
+ # P0 - Free Tier Synthesis Incorrectly Uses Server-Side API Keys
+
+ **Status:** OPEN
+ **Priority:** P0 (Breaks Free Tier Promise)
+ **Found:** 2025-11-30
+ **Component:** `src/orchestrators/simple.py`, `src/agent_factory/judges.py`
+
+ ## Symptom
+
+ When using Simple Mode (Free Tier) without providing a user API key, users see:
+
+ ```
+ > ⚠️ **Note**: AI narrative synthesis unavailable. Showing structured summary.
+ > _Error: OpenAIError_
+ ```
+
+ This is confusing because the user didn't configure any OpenAI key - they expected Free Tier to work.
+
+ ## Root Cause
+
+ **Architecture bug: Synthesis is decoupled from JudgeHandler selection.**
+
+ | Component | Paid Tier | Free Tier |
+ |-----------|-----------|-----------|
+ | Judge | `JudgeHandler` (uses `get_model()`) | `HFInferenceJudgeHandler` (free HF Inference) |
+ | Synthesis | `get_model()` | **BUG: Also uses `get_model()`** |
+
+ **Flow:**
+ 1. User selects Simple mode, leaves API key empty
+ 2. `app.py` correctly creates `HFInferenceJudgeHandler` for judging (works)
+ 3. Search works (no keys needed for PubMed/ClinicalTrials/Europe PMC)
+ 4. Judge works (HFInferenceJudgeHandler uses free HuggingFace inference)
+ 5. **BUG:** Synthesis calls `get_model()` in `simple.py:547`
+ 6. `get_model()` checks `settings.has_openai_key` → reads SERVER-SIDE env vars
+ 7. If ANY server-side key is set (even broken), synthesis tries to use it
+ 8. This VIOLATES the Free Tier promise - user didn't provide a key!
+
+ **The bug is NOT about broken keys - it's about synthesis ignoring the Free Tier selection.**
+
+ ## Impact
+
+ - **User Confusion**: User didn't provide a key, sees "OpenAIError"
+ - **Free Tier Perception**: Makes Free Tier seem broken when it's actually working (template synthesis is still useful)
+ - **Demo Quality**: Hackathon judges may think the app is broken
+
+ ## Fix Options
+
+ ### Option A: Remove/Fix Admin Key (Quick Fix for Hackathon)
+ Remove or update the `OPENAI_API_KEY` secret on HuggingFace Spaces.
+ - If removed: Free Tier works as designed (template synthesis)
+ - If fixed: OpenAI synthesis works
+
+ **Pros:** Instant fix, no code changes
+ **Cons:** Doesn't fix the underlying UX issue
+
+ ### Option B: Better Error Message
+ Change the error message to be more user-friendly:
+
+ ```python
+ # src/orchestrators/simple.py:569-573
+ error_note = (
+     f"\n\n> ⚠️ **Note**: AI narrative synthesis unavailable. "
+     f"Showing structured summary.\n"
+     f"> _Tip: Provide your own API key for full synthesis._\n"
+ )
+ ```
+
+ **Pros:** Clearer UX
+ **Cons:** Hides the real error for debugging
+
+ ### Option C: Provider Fallback Chain (Best Long-term)
+ If the primary provider fails, try the next provider before falling back to the template:
+
+ ```python
+ def get_model_with_fallback() -> Any:
+     """Try providers in order, return first that works."""
+     from src.utils.exceptions import ConfigurationError
+
+     providers = []
+     if settings.has_openai_key:
+         providers.append(("openai", lambda: OpenAIChatModel(...)))
+     if settings.has_anthropic_key:
+         providers.append(("anthropic", lambda: AnthropicModel(...)))
+     if settings.has_huggingface_key:
+         providers.append(("huggingface", lambda: HuggingFaceModel(...)))
+
+     for name, factory in providers:
+         try:
+             return factory()
+         except Exception as e:
+             logger.warning(f"Provider {name} failed: {e}")
+             continue
+
+     raise ConfigurationError("No working LLM provider available")
+ ```
+
+ **Pros:** Most robust, graceful degradation
+ **Cons:** More complex, may hide real errors
+
+ ### Option D: Validate Key Before Using (Recommended)
+ Add key validation to `get_model()`:
+
+ ```python
+ def get_model() -> Any:
+     if settings.has_openai_key:
+         # Quick validation - check key format
+         key = settings.openai_api_key
+         if not key or not key.startswith("sk-"):
+             logger.warning("Invalid OpenAI key format, trying next provider")
+         else:
+             return OpenAIChatModel(...)
+     # ... continue to next provider
+ ```
+
+ **Pros:** Catches obviously invalid keys early
+ **Cons:** Can't catch quota/permission issues without an API call
+
+ ## Recommended Action (Hackathon)
+
+ 1. **Immediate**: Remove `OPENAI_API_KEY` from HuggingFace Space secrets, OR replace it with a valid key
+ 2. **If key is valid**: Check if model `gpt-5` is accessible (may need to use `gpt-4o` instead)
+
+ ## Test Plan
+
+ 1. Remove all secrets from the HuggingFace Space
+ 2. Run a Simple mode query
+ 3. Verify: Search works, Judge works, Synthesis shows the template (no error message)
+
+ ## Related
+
+ - `docs/bugs/P0_SYNTHESIS_PROVIDER_MISMATCH.md` (RESOLVED - handles the "no keys" case)
+ - This bug is specifically about the "key exists but broken" case
src/agent_factory/judges.py CHANGED
@@ -2,6 +2,7 @@
 
 import asyncio
 import json
+from functools import partial
 from typing import Any, ClassVar
 
 import structlog
@@ -555,6 +556,48 @@ IMPORTANT: Respond with ONLY valid JSON matching this schema:
             reasoning=f"HF Inference failed: {error}. Recommend configuring OpenAI/Anthropic key.",
         )
 
+    async def synthesize(self, system_prompt: str, user_prompt: str) -> str | None:
+        """
+        Synthesize a research report using free HuggingFace Inference.
+
+        Uses the same chat_completion API as judging, so Free Tier gets
+        consistent behavior across judge AND synthesis.
+
+        Returns:
+            Narrative text if successful, None if all models fail.
+        """
+        loop = asyncio.get_running_loop()
+        models_to_try = [self.model_id] if self.model_id else self.FALLBACK_MODELS
+
+        messages = [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": user_prompt},
+        ]
+
+        for model in models_to_try:
+            try:
+                logger.info("HF synthesis attempt", model=model)
+                response = await loop.run_in_executor(
+                    None,
+                    partial(
+                        self.client.chat_completion,
+                        messages=messages,
+                        model=model,
+                        max_tokens=2048,  # Longer for synthesis
+                        temperature=0.7,  # More creative for narrative
+                    ),
+                )
+                content = response.choices[0].message.content
+                if content and len(content.strip()) > 50:
+                    logger.info("HF synthesis success", model=model, chars=len(content))
+                    return content.strip()
+            except Exception as e:
+                logger.warning("HF synthesis model failed", model=model, error=str(e))
+                continue
+
+        logger.error("All HF synthesis models failed")
+        return None
+
 
 class MockJudgeHandler:
     """
src/orchestrators/simple.py CHANGED
@@ -536,23 +536,34 @@ class Orchestrator:
         system_prompt = get_synthesis_system_prompt(self.domain)
 
         try:
-            # Import here to avoid circular deps and keep optional
-            from pydantic_ai import Agent
-
-            from src.agent_factory.judges import get_model
-
-            # Create synthesis agent with retries (matching Judge agent pattern)
-            # Without retries, transient errors immediately trigger fallback
-            agent: Agent[None, str] = Agent(
-                model=get_model(),
-                output_type=str,
-                system_prompt=system_prompt,
-                retries=3,  # Match Judge agent - retry on transient errors
-            )
-            result = await agent.run(user_prompt)
-            narrative = result.output
+            # Check if judge has its own synthesize method (Free Tier uses HF Inference)
+            # This ensures Free Tier uses consistent free inference for BOTH judge AND synthesis
+            if hasattr(self.judge, "synthesize"):
+                logger.info("Using judge's free-tier synthesis method")
+                narrative = await self.judge.synthesize(system_prompt, user_prompt)
+                if narrative:
+                    logger.info("Free-tier synthesis completed", chars=len(narrative))
+                else:
+                    # Free tier synthesis failed, use template
+                    raise RuntimeError("Free tier HF synthesis returned no content")
+            else:
+                # Paid tier: use PydanticAI with get_model()
+                from pydantic_ai import Agent
+
+                from src.agent_factory.judges import get_model
+
+                # Create synthesis agent with retries (matching Judge agent pattern)
+                # Without retries, transient errors immediately trigger fallback
+                agent: Agent[None, str] = Agent(
+                    model=get_model(),
+                    output_type=str,
+                    system_prompt=system_prompt,
+                    retries=3,  # Match Judge agent - retry on transient errors
+                )
+                result = await agent.run(user_prompt)
+                narrative = result.output
 
-            logger.info("LLM narrative synthesis completed", chars=len(narrative))
+                logger.info("LLM narrative synthesis completed", chars=len(narrative))
 
         except Exception as e:
             # Fallback to template synthesis if LLM fails
tests/integration/test_simple_mode_synthesis.py CHANGED
@@ -37,6 +37,10 @@ async def test_simple_mode_synthesizes_before_max_iterations():
     # Mock judge to return GOOD scores eventually
     # We can use MockJudgeHandler or a pure mock. Let's use a pure mock to control scores precisely.
     mock_judge = AsyncMock()
+    # Since mock_judge has 'synthesize' attr by default (as a Mock),
+    # simple mode uses free-tier path.
+    # We must mock the return value of synthesize to simulate a successful narrative generation.
+    mock_judge.synthesize.return_value = "This is a synthesized report for MagicDrug."
 
     # Iteration 1: Low scores
     assess_1 = JudgeAssessment(
@@ -95,7 +99,9 @@ async def test_simple_mode_synthesizes_before_max_iterations():
     # SPEC_12: LLM synthesis produces narrative prose, not template with "Drug Candidates" header
     # Check for narrative structure (LLM may omit ### prefix) OR template fallback
     assert (
-        "Executive Summary" in complete_event.message or "Drug Candidates" in complete_event.message
+        "Executive Summary" in complete_event.message
+        or "Drug Candidates" in complete_event.message
+        or "synthesized report" in complete_event.message
     )
     assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
     assert complete_event.iteration == 2  # Should stop at it 2
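This test leans on a detail of `unittest.mock`: attributes of an `AsyncMock` are themselves `AsyncMock`s, so setting `synthesize.return_value` is all that is needed to make the awaited call succeed. A minimal sketch:

```python
import asyncio
from unittest.mock import AsyncMock

mock_judge = AsyncMock()
# Child attributes of an AsyncMock are AsyncMocks too, so awaiting works directly.
mock_judge.synthesize.return_value = "This is a synthesized report for MagicDrug."

result = asyncio.run(mock_judge.synthesize("sys", "user"))
```

`result` is exactly the configured string, which is why the integration test's third `or` clause (`"synthesized report" in ...`) can match.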
tests/unit/orchestrators/test_simple_synthesis.py CHANGED
@@ -68,9 +68,10 @@ class TestGenerateSynthesis:
         sample_evidence: list[Evidence],
         sample_assessment: JudgeAssessment,
     ) -> None:
-        """Synthesis should make an LLM call, not just use a template."""
+        """Synthesis should make an LLM call using pydantic_ai when judge is paid tier."""
         mock_search = MagicMock()
-        mock_judge = MagicMock()
+        # Paid tier JudgeHandler has 'assess' but NOT 'synthesize'
+        mock_judge = MagicMock(spec=["assess"])
 
         orchestrator = Orchestrator(
             search_handler=mock_search,
@@ -129,6 +130,40 @@ Long-term safety data is limited.
         assert "Background" in result
         assert "Evidence Synthesis" in result
 
+    @pytest.mark.asyncio
+    async def test_uses_free_tier_synthesis_when_available(
+        self,
+        sample_evidence: list[Evidence],
+        sample_assessment: JudgeAssessment,
+    ) -> None:
+        """Synthesis should use judge's synthesize method when in Free Tier."""
+        mock_search = MagicMock()
+        # Free tier JudgeHandler has 'synthesize' method
+        mock_judge = MagicMock()
+        # Setup synthesize method
+        mock_judge.synthesize = AsyncMock(return_value="Free tier narrative content.")
+
+        orchestrator = Orchestrator(
+            search_handler=mock_search,
+            judge_handler=mock_judge,
+        )
+        orchestrator.history = [{"iteration": 1}]
+
+        # We don't need to patch Agent or get_model because they shouldn't be called
+        result = await orchestrator._generate_synthesis(
+            query="test query",
+            evidence=sample_evidence,
+            assessment=sample_assessment,
+        )
+
+        # Verify judge's synthesize was called
+        mock_judge.synthesize.assert_called_once()
+
+        # Verify result contains the free tier content
+        assert "Free tier narrative content" in result
+        # Should still include footer
+        assert "Full Citation List" in result
+
     @pytest.mark.asyncio
     async def test_falls_back_on_llm_error_with_notice(
         self,
@@ -137,7 +172,8 @@ Long-term safety data is limited.
     ) -> None:
         """Synthesis should fall back to template if LLM fails, WITH error notice."""
         mock_search = MagicMock()
-        mock_judge = MagicMock()
+        # Paid tier simulation
+        mock_judge = MagicMock(spec=["assess"])
 
         orchestrator = Orchestrator(
             search_handler=mock_search,
@@ -171,7 +207,8 @@ Long-term safety data is limited.
     ) -> None:
         """Synthesis should include full citation list footer."""
         mock_search = MagicMock()
-        mock_judge = MagicMock()
+        # Paid tier simulation
+        mock_judge = MagicMock(spec=["assess"])
 
         orchestrator = Orchestrator(
             search_handler=mock_search,
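The switch to `MagicMock(spec=["assess"])` matters because a bare `MagicMock` fabricates any attribute on access, so `hasattr(judge, "synthesize")` is always True and the orchestrator would silently take the Free Tier branch. Restricting the spec makes the mock behave like the paid-tier `JudgeHandler`. A quick demonstration:

```python
from unittest.mock import AsyncMock, MagicMock

bare = MagicMock()
paid = MagicMock(spec=["assess"])          # only 'assess' exists
free = MagicMock()
free.synthesize = AsyncMock(return_value="Free tier narrative content.")

# A bare mock auto-creates 'synthesize', wrongly triggering the free path.
print(hasattr(bare, "synthesize"))   # True
# Spec-limited mock raises AttributeError for unknown names, so hasattr is False.
print(hasattr(paid, "synthesize"))   # False
# Explicit free-tier simulation keeps the attribute present.
print(hasattr(free, "synthesize"))   # True
```

This is the same trick used in all three paid-tier tests above, while the free-tier test attaches `synthesize` explicitly.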