VibecoderMcSwaggins committed
Commit e18ea9a · 1 Parent(s): 599a754

feat: Implement Free Tier synthesis using HuggingFace Inference


- Added a synthesis method to the free-tier judge handler (`HFInferenceJudgeHandler`) that uses free HuggingFace Inference for narrative generation in Free Tier mode.
- Updated the Orchestrator to check for the judge's synthesis method, ensuring consistent behavior across judging and synthesis.
- Enhanced integration and unit tests to verify correct use of the new synthesis method and the fallback mechanisms.
- Documented the bug where synthesis incorrectly used server-side API keys, outlining the root cause and potential fixes.

This change addresses user confusion and improves the Free Tier experience by ensuring expected functionality without requiring an API key.

docs/bugs/P1_SYNTHESIS_BROKEN_KEY_FALLBACK.md ADDED
@@ -0,0 +1,132 @@
+ # P0 - Free Tier Synthesis Incorrectly Uses Server-Side API Keys
+
+ **Status:** OPEN
+ **Priority:** P0 (Breaks Free Tier Promise)
+ **Found:** 2025-11-30
+ **Component:** `src/orchestrators/simple.py`, `src/agent_factory/judges.py`
+
+ ## Symptom
+
+ When using Simple Mode (Free Tier) without providing a user API key, users see:
+
+ ```
+ > ⚠️ **Note**: AI narrative synthesis unavailable. Showing structured summary.
+ > _Error: OpenAIError_
+ ```
+
+ This is confusing because the user didn't configure any OpenAI key - they expected Free Tier to work.
+
+ ## Root Cause
+
+ **Architecture bug: Synthesis is decoupled from JudgeHandler selection.**
+
+ | Component | Paid Tier | Free Tier |
+ |-----------|-----------|-----------|
+ | Judge | `JudgeHandler` (uses `get_model()`) | `HFInferenceJudgeHandler` (free HF Inference) |
+ | Synthesis | `get_model()` | **BUG: Also uses `get_model()`** |
+
+ **Flow:**
+ 1. User selects Simple mode, leaves API key empty
+ 2. `app.py` correctly creates `HFInferenceJudgeHandler` for judging (works)
+ 3. Search works (no keys needed for PubMed/ClinicalTrials/Europe PMC)
+ 4. Judge works (HFInferenceJudgeHandler uses free HuggingFace inference)
+ 5. **BUG:** Synthesis calls `get_model()` in `simple.py:547`
+ 6. `get_model()` checks `settings.has_openai_key` → reads SERVER-SIDE env vars
+ 7. If ANY server-side key is set (even broken), synthesis tries to use it
+ 8. This VIOLATES the Free Tier promise - user didn't provide a key!
+
+ **The bug is NOT about broken keys - it's about synthesis ignoring the Free Tier selection.**
+
+ ## Impact
+
+ - **User Confusion**: User didn't provide a key, sees "OpenAIError"
+ - **Free Tier Perception**: Makes Free Tier seem broken when it's actually working (template synthesis is still useful)
+ - **Demo Quality**: Hackathon judges may think the app is broken
+
+ ## Fix Options
+
+ ### Option A: Remove/Fix Admin Key (Quick Fix for Hackathon)
+ Remove or update the `OPENAI_API_KEY` secret on HuggingFace Spaces.
+ - If removed: Free Tier works as designed (template synthesis)
+ - If fixed: OpenAI synthesis works
+
+ **Pros:** Instant fix, no code changes
+ **Cons:** Doesn't fix the underlying UX issue
+
+ ### Option B: Better Error Message
+ Change the error message to be more user-friendly:
+
+ ```python
+ # src/orchestrators/simple.py:569-573
+ error_note = (
+     f"\n\n> ⚠️ **Note**: AI narrative synthesis unavailable. "
+     f"Showing structured summary.\n"
+     f"> _Tip: Provide your own API key for full synthesis._\n"
+ )
+ ```
+
+ **Pros:** Clearer UX
+ **Cons:** Hides the real error for debugging
+
+ ### Option C: Provider Fallback Chain (Best Long-term)
+ If the primary provider fails, try the next provider before falling back to the template:
+
+ ```python
+ def get_model_with_fallback() -> Any:
+     """Try providers in order, return first that works."""
+     from src.utils.exceptions import ConfigurationError
+
+     providers = []
+     if settings.has_openai_key:
+         providers.append(("openai", lambda: OpenAIChatModel(...)))
+     if settings.has_anthropic_key:
+         providers.append(("anthropic", lambda: AnthropicModel(...)))
+     if settings.has_huggingface_key:
+         providers.append(("huggingface", lambda: HuggingFaceModel(...)))
+
+     for name, factory in providers:
+         try:
+             return factory()
+         except Exception as e:
+             logger.warning(f"Provider {name} failed: {e}")
+             continue
+
+     raise ConfigurationError("No working LLM provider available")
+ ```
+
+ **Pros:** Most robust, graceful degradation
+ **Cons:** More complex, may hide real errors
+
+ ### Option D: Validate Key Before Using (Recommended)
+ Add key validation to `get_model()`:
+
+ ```python
+ def get_model() -> Any:
+     if settings.has_openai_key:
+         # Quick validation - check key format
+         key = settings.openai_api_key
+         if not key or not key.startswith("sk-"):
+             logger.warning("Invalid OpenAI key format, trying next provider")
+         else:
+             return OpenAIChatModel(...)
+     # ... continue to next provider
+ ```
+
+ **Pros:** Catches obviously invalid keys early
+ **Cons:** Can't catch quota/permission issues without an API call
+
+ ## Recommended Action (Hackathon)
+
+ 1. **Immediate**: Remove `OPENAI_API_KEY` from HuggingFace Space secrets, OR replace it with a valid key
+ 2. **If key is valid**: Check if model `gpt-5` is accessible (may need to use `gpt-4o` instead)
+
+ ## Test Plan
+
+ 1. Remove all secrets from the HuggingFace Space
+ 2. Run a Simple mode query
+ 3. Verify: Search works, Judge works, Synthesis shows the template (no error message)
+
+ ## Related
+
+ - `docs/bugs/P0_SYNTHESIS_PROVIDER_MISMATCH.md` (RESOLVED - handles the "no keys" case)
+ - This bug is specifically about the "key exists but broken" case
src/agent_factory/judges.py CHANGED
@@ -2,6 +2,7 @@
 
 import asyncio
 import json
+from functools import partial
 from typing import Any, ClassVar
 
 import structlog
@@ -555,6 +556,48 @@ IMPORTANT: Respond with ONLY valid JSON matching this schema:
             reasoning=f"HF Inference failed: {error}. Recommend configuring OpenAI/Anthropic key.",
         )
 
+    async def synthesize(self, system_prompt: str, user_prompt: str) -> str | None:
+        """
+        Synthesize a research report using free HuggingFace Inference.
+
+        Uses the same chat_completion API as judging, so Free Tier gets
+        consistent behavior across judge AND synthesis.
+
+        Returns:
+            Narrative text if successful, None if all models fail.
+        """
+        loop = asyncio.get_running_loop()
+        models_to_try = [self.model_id] if self.model_id else self.FALLBACK_MODELS
+
+        messages = [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": user_prompt},
+        ]
+
+        for model in models_to_try:
+            try:
+                logger.info("HF synthesis attempt", model=model)
+                response = await loop.run_in_executor(
+                    None,
+                    partial(
+                        self.client.chat_completion,
+                        messages=messages,
+                        model=model,
+                        max_tokens=2048,  # Longer for synthesis
+                        temperature=0.7,  # More creative for narrative
+                    ),
+                )
+                content = response.choices[0].message.content
+                if content and len(content.strip()) > 50:
+                    logger.info("HF synthesis success", model=model, chars=len(content))
+                    return content.strip()
+            except Exception as e:
+                logger.warning("HF synthesis model failed", model=model, error=str(e))
+                continue
+
+        logger.error("All HF synthesis models failed")
+        return None
+
 
 class MockJudgeHandler:
     """
src/orchestrators/simple.py CHANGED
@@ -536,23 +536,34 @@ class Orchestrator:
         system_prompt = get_synthesis_system_prompt(self.domain)
 
         try:
-            # Import here to avoid circular deps and keep optional
-            from pydantic_ai import Agent
-
-            from src.agent_factory.judges import get_model
-
-            # Create synthesis agent with retries (matching Judge agent pattern)
-            # Without retries, transient errors immediately trigger fallback
-            agent: Agent[None, str] = Agent(
-                model=get_model(),
-                output_type=str,
-                system_prompt=system_prompt,
-                retries=3,  # Match Judge agent - retry on transient errors
-            )
-            result = await agent.run(user_prompt)
-            narrative = result.output
+            # Check if judge has its own synthesize method (Free Tier uses HF Inference)
+            # This ensures Free Tier uses consistent free inference for BOTH judge AND synthesis
+            if hasattr(self.judge, "synthesize"):
+                logger.info("Using judge's free-tier synthesis method")
+                narrative = await self.judge.synthesize(system_prompt, user_prompt)
+                if narrative:
+                    logger.info("Free-tier synthesis completed", chars=len(narrative))
+                else:
+                    # Free tier synthesis failed, use template
+                    raise RuntimeError("Free tier HF synthesis returned no content")
+            else:
+                # Paid tier: use PydanticAI with get_model()
+                from pydantic_ai import Agent
+
+                from src.agent_factory.judges import get_model
+
+                # Create synthesis agent with retries (matching Judge agent pattern)
+                # Without retries, transient errors immediately trigger fallback
+                agent: Agent[None, str] = Agent(
+                    model=get_model(),
+                    output_type=str,
+                    system_prompt=system_prompt,
+                    retries=3,  # Match Judge agent - retry on transient errors
+                )
+                result = await agent.run(user_prompt)
+                narrative = result.output
 
-            logger.info("LLM narrative synthesis completed", chars=len(narrative))
+                logger.info("LLM narrative synthesis completed", chars=len(narrative))
 
         except Exception as e:
             # Fallback to template synthesis if LLM fails
tests/integration/test_simple_mode_synthesis.py CHANGED
@@ -37,6 +37,10 @@ async def test_simple_mode_synthesizes_before_max_iterations():
     # Mock judge to return GOOD scores eventually
     # We can use MockJudgeHandler or a pure mock. Let's use a pure mock to control scores precisely.
     mock_judge = AsyncMock()
+    # Since mock_judge has 'synthesize' attr by default (as a Mock),
+    # simple mode uses free-tier path.
+    # We must mock the return value of synthesize to simulate a successful narrative generation.
+    mock_judge.synthesize.return_value = "This is a synthesized report for MagicDrug."
 
     # Iteration 1: Low scores
     assess_1 = JudgeAssessment(
@@ -95,7 +99,9 @@ async def test_simple_mode_synthesizes_before_max_iterations():
     # SPEC_12: LLM synthesis produces narrative prose, not template with "Drug Candidates" header
     # Check for narrative structure (LLM may omit ### prefix) OR template fallback
     assert (
-        "Executive Summary" in complete_event.message or "Drug Candidates" in complete_event.message
+        "Executive Summary" in complete_event.message
+        or "Drug Candidates" in complete_event.message
+        or "synthesized report" in complete_event.message
     )
     assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
     assert complete_event.iteration == 2  # Should stop at it 2
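This test leans on a detail of `unittest.mock`: attributes of an `AsyncMock` are themselves `AsyncMock`s, so setting `synthesize.return_value` is all that is needed to make the awaited call succeed. A minimal sketch:

```python
import asyncio
from unittest.mock import AsyncMock

mock_judge = AsyncMock()
# Child attributes of an AsyncMock are AsyncMocks too, so awaiting works directly.
mock_judge.synthesize.return_value = "This is a synthesized report for MagicDrug."

result = asyncio.run(mock_judge.synthesize("sys", "user"))
```

`result` is exactly the configured string, which is why the integration test's third `or` clause (`"synthesized report" in ...`) can match.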
tests/unit/orchestrators/test_simple_synthesis.py CHANGED
@@ -68,9 +68,10 @@ class TestGenerateSynthesis:
         sample_evidence: list[Evidence],
         sample_assessment: JudgeAssessment,
     ) -> None:
-        """Synthesis should make an LLM call, not just use a template."""
+        """Synthesis should make an LLM call using pydantic_ai when judge is paid tier."""
         mock_search = MagicMock()
-        mock_judge = MagicMock()
+        # Paid tier JudgeHandler has 'assess' but NOT 'synthesize'
+        mock_judge = MagicMock(spec=["assess"])
 
         orchestrator = Orchestrator(
             search_handler=mock_search,
@@ -129,6 +130,40 @@ Long-term safety data is limited.
         assert "Background" in result
         assert "Evidence Synthesis" in result
 
+    @pytest.mark.asyncio
+    async def test_uses_free_tier_synthesis_when_available(
+        self,
+        sample_evidence: list[Evidence],
+        sample_assessment: JudgeAssessment,
+    ) -> None:
+        """Synthesis should use judge's synthesize method when in Free Tier."""
+        mock_search = MagicMock()
+        # Free tier JudgeHandler has 'synthesize' method
+        mock_judge = MagicMock()
+        # Setup synthesize method
+        mock_judge.synthesize = AsyncMock(return_value="Free tier narrative content.")
+
+        orchestrator = Orchestrator(
+            search_handler=mock_search,
+            judge_handler=mock_judge,
+        )
+        orchestrator.history = [{"iteration": 1}]
+
+        # We don't need to patch Agent or get_model because they shouldn't be called
+        result = await orchestrator._generate_synthesis(
+            query="test query",
+            evidence=sample_evidence,
+            assessment=sample_assessment,
+        )
+
+        # Verify judge's synthesize was called
+        mock_judge.synthesize.assert_called_once()
+
+        # Verify result contains the free tier content
+        assert "Free tier narrative content" in result
+        # Should still include footer
+        assert "Full Citation List" in result
+
     @pytest.mark.asyncio
     async def test_falls_back_on_llm_error_with_notice(
         self,
@@ -137,7 +172,8 @@ Long-term safety data is limited.
     ) -> None:
         """Synthesis should fall back to template if LLM fails, WITH error notice."""
         mock_search = MagicMock()
-        mock_judge = MagicMock()
+        # Paid tier simulation
+        mock_judge = MagicMock(spec=["assess"])
 
         orchestrator = Orchestrator(
             search_handler=mock_search,
@@ -171,7 +207,8 @@ Long-term safety data is limited.
     ) -> None:
         """Synthesis should include full citation list footer."""
         mock_search = MagicMock()
-        mock_judge = MagicMock()
+        # Paid tier simulation
+        mock_judge = MagicMock(spec=["assess"])
 
         orchestrator = Orchestrator(
             search_handler=mock_search,
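The switch to `MagicMock(spec=["assess"])` matters because a bare `MagicMock` fabricates any attribute on access, so `hasattr(judge, "synthesize")` is always True and the orchestrator would silently take the Free Tier branch. Restricting the spec makes the mock behave like the paid-tier `JudgeHandler`. A quick demonstration:

```python
from unittest.mock import AsyncMock, MagicMock

bare = MagicMock()
paid = MagicMock(spec=["assess"])          # only 'assess' exists
free = MagicMock()
free.synthesize = AsyncMock(return_value="Free tier narrative content.")

# A bare mock auto-creates 'synthesize', wrongly triggering the free path.
print(hasattr(bare, "synthesize"))   # True
# Spec-limited mock raises AttributeError for unknown names, so hasattr is False.
print(hasattr(paid, "synthesize"))   # False
# Explicit free-tier simulation keeps the attribute present.
print(hasattr(free, "synthesize"))   # True
```

This is the same trick used in all three paid-tier tests above, while the free-tier test attaches `synthesize` explicitly.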