VibecoderMcSwaggins committed on
Commit
0049ad7
·
1 Parent(s): c8f7161

docs: ironclad SPEC_12 for narrative synthesis implementation


Deep audit against Microsoft Agent Framework patterns:
- Identified root cause: _generate_synthesis() has NO LLM call
- Maps exactly to codebase with line numbers
- Complete implementation plan with code examples
- Test criteria included
- Ready for async agent implementation

Key changes needed:
1. NEW: src/prompts/synthesis.py (narrative prompts)
2. MODIFY: src/orchestrators/simple.py (add LLM call)
3. NEW: tests/unit/prompts/test_synthesis.py
4. NEW: tests/unit/orchestrators/test_simple_synthesis.py

MS Agent Framework reference pattern:
concurrent_custom_aggregator.py shows LLM-based aggregation
vs our current string templating approach

Files changed (1)
  1. SPEC_12_NARRATIVE_SYNTHESIS.md +479 -218
SPEC_12_NARRATIVE_SYNTHESIS.md CHANGED
@@ -1,15 +1,18 @@
1
  # SPEC_12: Narrative Report Synthesis
2
 
3
- **Status**: Draft
4
  **Priority**: P1 - Core deliverable
5
  **Related Issues**: #85, #86
6
  **Related Spec**: SPEC_11 (Sexual Health Focus)
 
 
 
7
 
8
  ## Problem Statement
9
 
10
  DeepBoner's report generation outputs **structured metadata** instead of **synthesized prose**. The current implementation uses string templating with NO LLM call for narrative synthesis.
11
 
12
- ### Current Output (Actual)
13
 
14
  ```markdown
15
  ## Sexual Health Analysis
@@ -20,20 +23,15 @@ Testosterone therapy for hypoactive sexual desire disorder?
20
  ### Drug Candidates
21
  - **Testosterone**
22
  - **LibiGel**
23
- - **Androgel**
24
 
25
  ### Key Findings
26
- - Testosterone therapy improves sexual desire and activity in postmenopausal women with HSDD.
27
- - Transdermal testosterone is a preferred formulation.
28
 
29
  ### Assessment
30
  - **Mechanism Score**: 8/10
31
  - **Clinical Evidence Score**: 9/10
32
  - **Confidence**: 90%
33
 
34
- ### Reasoning
35
- The evidence provides a clear understanding of the mechanism of action...
36
-
37
  ### Citations (33 sources)
38
  1. [Title](url)...
39
  ```
@@ -41,7 +39,7 @@ The evidence provides a clear understanding of the mechanism of action...
41
  ### Expected Output (Professional Research Report)
42
 
43
  ```markdown
44
- ## Sexual Health Research Report: Testosterone Therapy for Hypoactive Sexual Desire Disorder
45
 
46
  ### Executive Summary
47
 
@@ -55,54 +53,41 @@ efficacy-safety profile.
55
 
56
  Hypoactive sexual desire disorder affects an estimated 12% of postmenopausal women
57
  and is characterized by persistent lack of sexual interest causing personal distress.
58
- The International Society for the Study of Women's Sexual Health (ISSWSH) published
59
- clinical guidelines in 2021 establishing testosterone as a recommended intervention...
60
 
61
  ### Evidence Synthesis
62
 
63
  **Mechanism of Action**
64
 
65
  Testosterone exerts its effects on sexual desire through multiple pathways. At the
66
- hypothalamic level, testosterone modulates dopaminergic signaling that underlies
67
- libido. Evidence from Smith et al. (2021) demonstrates that androgen receptor
68
- activation in the central nervous system correlates with subjective measures of
69
- sexual desire (r=0.67, p<0.001)...
70
-
71
- **Clinical Trial Evidence**
72
-
73
- A systematic review of 8 randomized controlled trials (N=3,035) demonstrated that
74
- transdermal testosterone significantly improved:
75
- - Satisfying sexual events: +2.1 per month (95% CI: 1.4-2.8)
76
- - Sexual desire scores: +0.4 on validated scales (p<0.001)
77
-
78
- The Global Consensus Position Statement (2019) and ISSWSH Guidelines (2021) both
79
- recommend transdermal testosterone as first-line therapy...
80
 
81
  ### Recommendations
82
 
83
- Based on this evidence synthesis:
84
  1. **Transdermal testosterone** (300 μg/day) is recommended for postmenopausal
85
  women with HSDD not primarily related to modifiable factors
86
  2. **Duration**: Continue for 6 months to assess efficacy; discontinue if no benefit
87
- 3. **Monitoring**: Lipid profile and liver function at baseline and 3-6 months
88
 
89
- ### Limitations & Future Directions
90
 
91
- - Long-term safety data beyond 24 months remains limited
92
- - Efficacy in premenopausal women less well-established
93
- - Head-to-head comparisons between formulations are needed
94
 
95
  ### References
96
-
97
- 1. Parish SJ et al. (2021). International Society for the Study of Women's Sexual
98
- Health Clinical Practice Guideline for the Use of Systemic Testosterone for
99
- Hypoactive Sexual Desire Disorder in Women. J Sex Med. https://pubmed.ncbi.nlm.nih.gov/33814355/
100
- ...
101
  ```
102
 
 
 
103
  ## Root Cause Analysis
104
 
105
- ### Current Implementation (`src/orchestrators/simple.py:448-505`)
 
 
 
 
106
 
107
  ```python
108
  def _generate_synthesis(
@@ -124,24 +109,52 @@ def _generate_synthesis(
124
  """
125
  ```
126
 
127
- **The problem**: No LLM is ever called to synthesize the report. It's just formatted
128
- data from the JudgeAssessment.
129
 
130
- ### Microsoft Agent Framework Pattern
131
 
132
- From `reference_repos/agent-framework/python/samples/getting_started/workflows/orchestration/concurrent_custom_aggregator.py`:
133
 
134
  ```python
135
  # Define a custom aggregator callback that uses the chat client to SYNTHESIZE
136
  async def summarize_results(results: list[Any]) -> str:
137
- # Collect expert outputs
138
  expert_sections: list[str] = []
139
  for r in results:
140
  messages = getattr(r.agent_run_response, "messages", [])
141
  final_text = messages[-1].text if messages else "(no content)"
142
  expert_sections.append(f"{r.executor_id}:\n{final_text}")
143
 
144
- # Ask the MODEL to synthesize
145
  system_msg = ChatMessage(
146
  Role.SYSTEM,
147
  text=(
@@ -151,147 +164,98 @@ async def summarize_results(results: list[Any]) -> str:
151
  )
152
  user_msg = ChatMessage(Role.USER, text="\n\n".join(expert_sections))
153
 
154
- # ✅ LLM CALL for synthesis
155
  response = await chat_client.get_response([system_msg, user_msg])
156
  return response.messages[-1].text
157
  ```
158
 
159
  **The pattern**: The aggregator makes an **LLM call** to synthesize, not string concatenation.
160
 
 
 
161
  ## Solution Design
162
 
163
- ### Architecture
164
 
165
  ```
166
- Current:
167
  Evidence → Judge → {structured data} → String Template → Bullet Points
168
 
169
- Proposed:
170
- Evidence → Judge → {structured data} → SynthesisAgent → Narrative Prose
171
-
172
- LLM-based synthesis
173
  ```
174
 
175
- ### Components
176
 
177
- #### 1. `SynthesisAgent` (`src/agents/synthesis.py`)
 
 
 
 
 
 
178
 
179
- A new agent dedicated to narrative report generation:
180
 
181
- ```python
182
- from pydantic import BaseModel
183
- from pydantic_ai import Agent
184
 
185
- class NarrativeReport(BaseModel):
186
- """Structured output for narrative report."""
187
- executive_summary: str # 2-3 sentences, key takeaways
188
- background: str # What is this condition, why does it matter
189
- evidence_synthesis: str # Mechanism + Clinical evidence in prose
190
- recommendations: list[str] # Actionable recommendations
191
- limitations: str # Honest limitations
192
- references: list[Reference] # Properly formatted
193
-
194
- class SynthesisAgent:
195
- """Generates narrative research reports from structured data."""
196
-
197
- async def synthesize(
198
- self,
199
- query: str,
200
- evidence: list[Evidence],
201
- assessment: JudgeAssessment,
202
- domain: ResearchDomain,
203
- ) -> NarrativeReport:
204
- """Generate narrative prose report."""
205
- # Build context
206
- context = self._build_synthesis_context(evidence, assessment)
207
-
208
- # ✅ LLM CALL for synthesis
209
- result = await self.agent.run(
210
- f"Generate a narrative research report for: {query}",
211
- context=context,
212
- )
213
- return result.data
214
- ```
215
 
216
- #### 2. Updated System Prompt (`src/prompts/synthesis.py`)
217
 
218
  ```python
219
- SYNTHESIS_SYSTEM_PROMPT = """You are a scientific writer specializing in sexual health research.
220
- Your task is to synthesize research evidence into a clear, narrative report.
 
221
 
222
- ## Writing Style
 
 
 
 
 
 
223
  - Write in PROSE PARAGRAPHS, not bullet points
224
  - Use academic but accessible language
225
- - Be specific about evidence strength (e.g., "in a randomized controlled trial of N=200")
226
  - Reference specific studies by author name
227
- - Provide quantitative results where available
228
 
229
  ## Report Structure
230
 
231
  ### Executive Summary (REQUIRED - 2-3 sentences)
232
- Summarize the key finding and clinical implication. Start with the bottom line.
233
- Example: "Testosterone therapy demonstrates consistent efficacy for HSDD in
234
- postmenopausal women, with transdermal formulations showing the best safety profile."
235
 
236
  ### Background (REQUIRED - 1 paragraph)
237
- Explain the condition, its prevalence, and why this question matters clinically.
238
 
239
  ### Evidence Synthesis (REQUIRED - 2-4 paragraphs)
240
- Weave together the evidence into a coherent narrative:
241
  - Mechanism of Action: How does the intervention work?
242
- - Clinical Evidence: What do the trials show? Be specific about effect sizes.
243
  - Comparative Evidence: How does it compare to alternatives?
244
 
245
- ### Recommendations (REQUIRED - 3-5 bullet points)
246
- Provide actionable clinical recommendations based on the evidence.
247
 
248
  ### Limitations (REQUIRED - 1 paragraph)
249
  Acknowledge gaps, biases, and areas needing more research.
250
 
251
  ### References (REQUIRED)
252
- List the key references in proper academic format.
253
 
254
  ## CRITICAL RULES
255
  1. ONLY cite papers from the provided evidence - NEVER hallucinate references
256
- 2. Write in complete sentences and paragraphs
257
- 3. Avoid lists/bullets except in Recommendations section
258
- 4. Include specific statistics when available (p-values, effect sizes, CIs)
259
- 5. Acknowledge uncertainty honestly
260
  """
261
- ```
262
 
263
- #### 3. Updated Orchestrator Integration
264
 
265
- ```python
266
- # In src/orchestrators/simple.py
267
-
268
- async def _generate_synthesis(
269
- self,
270
- query: str,
271
- evidence: list[Evidence],
272
- assessment: JudgeAssessment,
273
- ) -> str:
274
- """Generate narrative synthesis using LLM."""
275
- from src.agents.synthesis import SynthesisAgent
276
-
277
- synthesis_agent = SynthesisAgent(domain=self.domain)
278
-
279
- report = await synthesis_agent.synthesize(
280
- query=query,
281
- evidence=evidence,
282
- assessment=assessment,
283
- domain=self.domain,
284
- )
285
-
286
- return report.to_markdown()
287
- ```
288
-
289
- ### Few-Shot Example (Required for Quality)
290
-
291
- From issue #82, include a concrete example in the prompt:
292
-
293
- ```python
294
- FEW_SHOT_EXAMPLE = """
295
  ## Example: Strong Evidence Synthesis
296
 
297
  INPUT:
@@ -312,10 +276,9 @@ mechanism particularly valuable for patients who do not respond to oral therapie
312
  ### Background
313
 
314
  Erectile dysfunction affects approximately 30 million men in the United States,
315
- with prevalence increasing with age. While PDE5 inhibitors (sildenafil, tadalafil)
316
- remain first-line therapy, approximately 30% of patients are non-responders or
317
- have contraindications. Alprostadil provides an alternative mechanism of action
318
- through direct smooth muscle relaxation.
319
 
320
  ### Evidence Synthesis
321
 
@@ -323,98 +286,368 @@ through direct smooth muscle relaxation.
323
 
324
  Alprostadil works through a distinct pathway from PDE5 inhibitors. It binds to
325
  EP receptors on cavernosal smooth muscle, activating adenylate cyclase and
326
- increasing intracellular cAMP. This leads to smooth muscle relaxation and
327
- penile erection independent of nitric oxide signaling. As noted by Smith et al.
328
- (2019), this mechanism explains its efficacy in patients with endothelial
329
- dysfunction or nerve damage.
330
 
331
  **Clinical Evidence**
332
 
333
  A meta-analysis by Johnson et al. (2020) pooled data from 8 randomized controlled
334
- trials (N=3,247) comparing intracavernosal alprostadil to placebo. The primary
335
- endpoint of erection sufficient for intercourse was achieved in 87% of alprostadil
336
- patients versus 12% placebo (RR 7.25, 95% CI: 5.8-9.1, p<0.001). The number
337
- needed to treat (NNT) was 1.3, indicating robust effect size.
338
-
339
- Subgroup analysis revealed consistent efficacy across etiologies:
340
- - Vascular ED: 85% response rate
341
- - Neurogenic ED: 91% response rate
342
- - Post-prostatectomy: 82% response rate
343
 
344
  ### Recommendations
345
 
346
- 1. Consider alprostadil as second-line therapy when PDE5 inhibitors fail or are contraindicated
347
- 2. Start with 10 μg intracavernosal injection, titrate up to 40 μg based on response
348
  3. Provide in-office training for self-injection technique
349
- 4. Monitor for penile fibrosis with long-term use (occurs in 3-5% of patients)
350
 
351
  ### Limitations
352
 
353
- Long-term data beyond 2 years is limited. Head-to-head comparisons with
354
- newer therapies (low-intensity shockwave) are lacking. Most trials excluded
355
- patients with severe cardiovascular disease, limiting generalizability.
356
- The intraurethral formulation (MUSE) has lower efficacy (43%) than injection.
357
 
358
  ### References
359
 
360
- 1. Smith AB et al. (2019). Alprostadil mechanism of action in erectile tissue.
361
- J Urol. https://pubmed.ncbi.nlm.nih.gov/12345678/
362
- 2. Johnson CD et al. (2020). Meta-analysis of intracavernosal alprostadil.
363
- J Sex Med. https://pubmed.ncbi.nlm.nih.gov/23456789/
364
  """
365
  ```
366
 
367
- ## Implementation Plan
368
 
369
- ### Phase 1: Core SynthesisAgent
 
 
370
 
371
- 1. Create `src/agents/synthesis.py` with:
372
- - `SynthesisAgent` class
373
- - `NarrativeReport` Pydantic model
374
- - LLM-based synthesis method
375
 
376
- 2. Create `src/prompts/synthesis.py` with:
377
- - `SYNTHESIS_SYSTEM_PROMPT`
378
- - `FEW_SHOT_EXAMPLE`
379
- - `format_synthesis_context()` helper
380
 
381
- 3. Update `src/orchestrators/simple.py`:
382
- - Make `_generate_synthesis()` async
383
- - Call `SynthesisAgent.synthesize()`
384
- - Keep `_generate_partial_synthesis()` as fallback (free tier)
385
 
386
- ### Phase 2: Advanced Mode Integration
 
 
387
 
388
- 4. Update `src/orchestrators/advanced.py`:
389
- - Add `SynthesisAgent` to Magentic workflow
390
- - Ensure it receives all evidence from prior agents
391
 
392
- ### Phase 3: Test Coverage
393
 
394
- 5. Create `tests/unit/agents/test_synthesis.py`:
395
- - Test narrative output structure
396
- - Test reference accuracy (no hallucinated citations)
397
- - Test prose vs bullet point ratio
398
 
399
- ### Phase 4: Domain Customization
 
400
 
401
- 6. Update `src/config/domain.py`:
402
- - Add `synthesis_system_prompt` field to `DomainConfig`
403
- - Add `synthesis_few_shot_example` field
404
- - Configure for sexual health domain
405
 
406
- ## File Changes
 
 
 
 
407
 
408
- | File | Change |
409
- |------|--------|
410
- | `src/agents/synthesis.py` | NEW - SynthesisAgent |
411
- | `src/prompts/synthesis.py` | NEW - Synthesis prompts |
412
- | `src/orchestrators/simple.py` | MODIFY - Call SynthesisAgent |
413
- | `src/orchestrators/advanced.py` | MODIFY - Add to Magentic |
414
- | `src/config/domain.py` | MODIFY - Add synthesis prompts |
415
- | `src/utils/models.py` | MODIFY - Add NarrativeReport |
416
- | `tests/unit/agents/test_synthesis.py` | NEW - Tests |
417
- | `tests/unit/prompts/test_synthesis.py` | NEW - Tests |
418
 
419
  ## Acceptance Criteria
420
 
@@ -423,18 +656,21 @@ The intraurethral formulation (MUSE) has lower efficacy (43%) than injection.
423
  - [ ] Report has **background section** explaining the condition
424
  - [ ] Report has **synthesized narrative** weaving evidence together
425
  - [ ] Report has **actionable recommendations**
426
- - [ ] Report has **limitations** section (honest acknowledgment)
427
  - [ ] Citations are **properly formatted** (author, year, title, URL)
428
  - [ ] No hallucinated references (CRITICAL)
429
- - [ ] Works in both simple and advanced modes
430
- - [ ] Falls back gracefully on free tier (minimal templating OK)
 
 
 
431
 
432
  ## Test Criteria
433
 
434
  ```python
435
  def test_report_is_narrative_not_bullets():
436
  """Report should be mostly prose, not bullet points."""
437
- report = synthesis_agent.synthesize(...)
438
 
439
  # Count paragraphs vs bullet points
440
  paragraphs = len([p for p in report.split('\n\n') if len(p) > 100])
@@ -446,24 +682,49 @@ def test_report_is_narrative_not_bullets():
446
  def test_references_not_hallucinated():
447
  """All references must come from provided evidence."""
448
  evidence_urls = {e.citation.url for e in evidence}
449
- report = synthesis_agent.synthesize(...)
 
 
 
 
450
 
451
- for ref in report.references:
452
- assert ref.url in evidence_urls, f"Hallucinated reference: {ref.url}"
 
 
 
453
  ```
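The two checks above can be exercised without a model in the loop. The following is a minimal sketch, not code from the repository: `count_paragraphs_and_bullets`, `hallucinated_urls`, and `SAMPLE_REPORT` are hypothetical names and data, and the 100-character paragraph threshold is simply carried over from the test above.

```python
import re

# A bullet is a line starting with "-", "*", or a number followed by a period.
BULLET = re.compile(r"^\s*(?:[-*]|\d+\.)\s+")
URL = re.compile(r"https?://[^\s)\]]+")


def count_paragraphs_and_bullets(report: str) -> tuple[int, int]:
    """Count long prose blocks and list lines in a markdown report."""
    paragraphs = 0
    for block in report.split("\n\n"):
        block = block.strip()
        # Only blocks longer than 100 characters that are not lists count as prose.
        if len(block) > 100 and not BULLET.match(block):
            paragraphs += 1
    bullets = sum(1 for line in report.splitlines() if BULLET.match(line))
    return paragraphs, bullets


def hallucinated_urls(report: str, evidence_urls: set[str]) -> set[str]:
    """Return URLs cited in the report that are absent from the evidence set."""
    cited = {u.rstrip(".,;") for u in URL.findall(report)}
    return cited - evidence_urls


# Hypothetical report text; the second reference is deliberately not in the evidence.
SAMPLE_REPORT = """### Executive Summary

Testosterone therapy shows consistent efficacy for HSDD in postmenopausal women, and transdermal formulations appear to have the most favourable safety profile.

### Recommendations

1. Consider transdermal testosterone.
2. Reassess after six months.

### References

1. Smith AB et al. (2021). https://pubmed.ncbi.nlm.nih.gov/123/
2. Doe J (2020). https://example.org/not-in-evidence
"""
```

Checking URLs in the rendered markdown, rather than a structured `references` field, would also fit a synthesis step that returns a plain string.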
454
 
 
 
455
  ## Related Microsoft Agent Framework Patterns
456
 
457
- | Pattern | Location | Application |
458
- |---------|----------|-------------|
459
- | Custom Aggregator | `concurrent_custom_aggregator.py` | LLM-based synthesis |
460
  | Fan-Out/Fan-In | `fan_out_fan_in_edges.py` | Multi-expert synthesis |
461
- | Research Assistant | `research_assistant_agent.py` | Tool-based research |
462
- | Sequential Orchestration | `spec-001-foundry-sdk-alignment.md` | Analyst→Writer→Editor chain |
463
 
464
  ## References
465
 
466
  - GitHub Issue #85: Report lacks narrative synthesis
467
  - GitHub Issue #86: Microsoft Agent Framework patterns
468
- - LangChain Deep Agents blog: Few-shot examples importance
469
- - Open Deep Research Architecture: Scoping + Synthesis pattern
 
1
  # SPEC_12: Narrative Report Synthesis
2
 
3
+ **Status**: Ready for Implementation
4
  **Priority**: P1 - Core deliverable
5
  **Related Issues**: #85, #86
6
  **Related Spec**: SPEC_11 (Sexual Health Focus)
7
+ **Author**: Deep Audit against Microsoft Agent Framework
8
+
9
+ ---
10
 
11
  ## Problem Statement
12
 
13
  DeepBoner's report generation outputs **structured metadata** instead of **synthesized prose**. The current implementation uses string templating with NO LLM call for narrative synthesis.
14
 
15
+ ### Current Output (Simple Mode - What Users See)
16
 
17
  ```markdown
18
  ## Sexual Health Analysis
 
23
  ### Drug Candidates
24
  - **Testosterone**
25
  - **LibiGel**
 
26
 
27
  ### Key Findings
28
+ - Testosterone therapy improves sexual desire
 
29
 
30
  ### Assessment
31
  - **Mechanism Score**: 8/10
32
  - **Clinical Evidence Score**: 9/10
33
  - **Confidence**: 90%
34
 
 
 
 
35
  ### Citations (33 sources)
36
  1. [Title](url)...
37
  ```
 
39
  ### Expected Output (Professional Research Report)
40
 
41
  ```markdown
42
+ ## Sexual Health Research Report: Testosterone Therapy for HSDD
43
 
44
  ### Executive Summary
45
 
 
53
 
54
  Hypoactive sexual desire disorder affects an estimated 12% of postmenopausal women
55
  and is characterized by persistent lack of sexual interest causing personal distress.
56
+ The ISSWSH published clinical guidelines in 2021 establishing testosterone as a
57
+ recommended intervention...
58
 
59
  ### Evidence Synthesis
60
 
61
  **Mechanism of Action**
62
 
63
  Testosterone exerts its effects on sexual desire through multiple pathways. At the
64
+ hypothalamic level, testosterone modulates dopaminergic signaling. Evidence from
65
+ Smith et al. (2021) demonstrates androgen receptor activation correlates with
66
+ subjective measures of desire (r=0.67, p<0.001)...
67
 
68
  ### Recommendations
69
 
 
70
  1. **Transdermal testosterone** (300 μg/day) is recommended for postmenopausal
71
  women with HSDD not primarily related to modifiable factors
72
  2. **Duration**: Continue for 6 months to assess efficacy; discontinue if no benefit
 
73
 
74
+ ### Limitations
75
 
76
+ Long-term safety data beyond 24 months remains limited...
 
 
77
 
78
  ### References
79
+ 1. Smith AB et al. (2021). Testosterone mechanisms... https://pubmed.ncbi.nlm.nih.gov/123/
 
 
 
 
80
  ```
81
 
82
+ ---
83
+
84
  ## Root Cause Analysis
85
 
86
+ ### Location 1: Simple Orchestrator (THE PRIMARY BUG)
87
+
88
+ **File**: `src/orchestrators/simple.py`
89
+ **Lines**: 448-505
90
+ **Method**: `_generate_synthesis()`
91
 
92
  ```python
93
  def _generate_synthesis(
 
109
  """
110
  ```
111
 
112
+ **The Problem**: No LLM is ever called. It's just formatted data from JudgeAssessment.
113
+
114
+ ### Location 2: Partial Synthesis (Max Iterations Fallback)
115
+
116
+ **File**: `src/orchestrators/simple.py`
117
+ **Lines**: 507-602
118
+ **Method**: `_generate_partial_synthesis()`
119
+
120
+ Same issue - string templating, no LLM call.
121
+
122
+ ### Location 3: Report Agent (Advanced Mode)
123
+
124
+ **File**: `src/agents/report_agent.py`
125
+ **Lines**: 93-94
126
+
127
+ ```python
128
+ result = await self._get_agent().run(prompt)
129
+ report = result.output # ResearchReport (structured data)
130
+ ```
131
+
132
+ This DOES make an LLM call, but it outputs `ResearchReport` (structured Pydantic model), not narrative prose. The `to_markdown()` method just formats the structured fields.
133
 
134
+ ### Location 4: Report System Prompt
135
 
136
+ **File**: `src/prompts/report.py`
137
+ **Lines**: 13-76
138
+
139
+ The system prompt tells the LLM to output structured JSON with fields like `hypotheses_tested: [...]` and `references: [...]`. It does NOT request narrative prose.
140
+
141
+ ---
142
+
143
+ ## Microsoft Agent Framework Pattern (Reference)
144
+
145
+ **File**: `reference_repos/agent-framework/python/samples/getting_started/workflows/orchestration/concurrent_custom_aggregator.py`
146
+ **Lines**: 56-79
147
 
148
  ```python
149
  # Define a custom aggregator callback that uses the chat client to SYNTHESIZE
150
  async def summarize_results(results: list[Any]) -> str:
 
151
  expert_sections: list[str] = []
152
  for r in results:
153
  messages = getattr(r.agent_run_response, "messages", [])
154
  final_text = messages[-1].text if messages else "(no content)"
155
  expert_sections.append(f"{r.executor_id}:\n{final_text}")
156
 
157
+ # LLM CALL for synthesis
158
  system_msg = ChatMessage(
159
  Role.SYSTEM,
160
  text=(
 
164
  )
165
  user_msg = ChatMessage(Role.USER, text="\n\n".join(expert_sections))
166
 
 
167
  response = await chat_client.get_response([system_msg, user_msg])
168
  return response.messages[-1].text
169
  ```
170
 
171
  **The pattern**: The aggregator makes an **LLM call** to synthesize, not string concatenation.
172
 
173
+ ---
174
+
175
  ## Solution Design
176
 
177
+ ### Architecture Change
178
 
179
  ```
180
+ Current (Simple Mode):
181
  Evidence → Judge → {structured data} → String Template → Bullet Points
182
 
183
+ Proposed (Simple Mode):
184
+ Evidence → Judge → {structured data} → LLM Synthesis → Narrative Prose
185
+
186
+ Uses SynthesisPrompt
187
  ```
188
 
189
+ ### Components to Create/Modify
190
 
191
+ | File | Action | Description |
192
+ |------|--------|-------------|
193
+ | `src/prompts/synthesis.py` | **NEW** | Narrative synthesis prompts |
194
+ | `src/orchestrators/simple.py` | **MODIFY** | Make `_generate_synthesis()` async, add LLM call |
195
+ | `src/config/domain.py` | **MODIFY** | Add `synthesis_system_prompt` field |
196
+ | `tests/unit/prompts/test_synthesis.py` | **NEW** | Test synthesis prompts |
197
+ | `tests/unit/orchestrators/test_simple_synthesis.py` | **NEW** | Test LLM synthesis |
198
 
199
+ ---
200
 
201
+ ## Implementation Plan
 
 
202
 
203
+ ### Phase 1: Create Synthesis Prompts
204
 
205
+ **File**: `src/prompts/synthesis.py` (NEW)
206
 
207
  ```python
208
+ """Prompts for narrative report synthesis."""
209
+
210
+ from src.config.domain import ResearchDomain, get_domain_config
211
 
212
+ def get_synthesis_system_prompt(domain: ResearchDomain | str | None = None) -> str:
213
+ """Get the system prompt for narrative synthesis."""
214
+ config = get_domain_config(domain)
215
+ return f"""You are a scientific writer specializing in {config.name.lower()}.
216
+ Your task is to synthesize research evidence into a clear, NARRATIVE report.
217
+
218
+ ## CRITICAL: Writing Style
219
  - Write in PROSE PARAGRAPHS, not bullet points
220
  - Use academic but accessible language
221
+ - Be specific about evidence strength (e.g., "in an RCT of N=200")
222
  - Reference specific studies by author name
223
+ - Provide quantitative results where available (p-values, effect sizes)
224
 
225
  ## Report Structure
226
 
227
  ### Executive Summary (REQUIRED - 2-3 sentences)
228
+ Start with the bottom line. Example:
229
+ "Testosterone therapy demonstrates consistent efficacy for HSDD in postmenopausal
230
+ women, with transdermal formulations showing the best safety profile."
231
 
232
  ### Background (REQUIRED - 1 paragraph)
233
+ Explain the condition, its prevalence, and clinical significance.
234
 
235
  ### Evidence Synthesis (REQUIRED - 2-4 paragraphs)
236
+ Weave the evidence into a coherent NARRATIVE:
237
  - Mechanism of Action: How does the intervention work?
238
+ - Clinical Evidence: What do trials show? Include effect sizes.
239
  - Comparative Evidence: How does it compare to alternatives?
240
 
241
+ ### Recommendations (REQUIRED - 3-5 items)
242
+ Provide actionable clinical recommendations.
243
 
244
  ### Limitations (REQUIRED - 1 paragraph)
245
  Acknowledge gaps, biases, and areas needing more research.
246
 
247
  ### References (REQUIRED)
248
+ List key references with author, year, title, URL.
249
 
250
  ## CRITICAL RULES
251
  1. ONLY cite papers from the provided evidence - NEVER hallucinate references
252
+ 2. Write in complete sentences and paragraphs (PROSE, not lists)
253
+ 3. Include specific statistics when available
254
+ 4. Acknowledge uncertainty honestly
 
255
  """
 
256
 
 
257
 
258
+ FEW_SHOT_EXAMPLE = '''
259
  ## Example: Strong Evidence Synthesis
260
 
261
  INPUT:
 
276
  ### Background
277
 
278
  Erectile dysfunction affects approximately 30 million men in the United States,
279
+ with prevalence increasing with age. While PDE5 inhibitors remain first-line
280
+ therapy, approximately 30% of patients are non-responders. Alprostadil provides
281
+ an alternative mechanism through direct smooth muscle relaxation.
 
282
 
283
  ### Evidence Synthesis
284
 
 
286
 
287
  Alprostadil works through a distinct pathway from PDE5 inhibitors. It binds to
288
  EP receptors on cavernosal smooth muscle, activating adenylate cyclase and
289
+ increasing intracellular cAMP. As noted by Smith et al. (2019), this mechanism
290
+ explains its efficacy in patients with endothelial dysfunction.
 
 
291
 
292
  **Clinical Evidence**
293
 
294
  A meta-analysis by Johnson et al. (2020) pooled data from 8 randomized controlled
295
+ trials (N=3,247). The primary endpoint of erection sufficient for intercourse was
296
+ achieved in 87% of alprostadil patients versus 12% placebo (RR 7.25, 95% CI:
297
+ 5.8-9.1, p<0.001). The NNT was 1.3, indicating robust effect size.
 
 
 
 
 
 
298
 
299
  ### Recommendations
300
 
301
+ 1. Consider alprostadil as second-line therapy when PDE5 inhibitors fail
302
+ 2. Start with 10 μg intracavernosal injection, titrate to 40 μg
303
  3. Provide in-office training for self-injection technique
 
304
 
305
  ### Limitations
306
 
307
+ Long-term data beyond 2 years is limited. Head-to-head comparisons with newer
308
+ therapies are lacking. Most trials excluded severe cardiovascular disease.
 
 
309
 
310
  ### References
311
 
312
+ 1. Smith AB et al. (2019). Alprostadil mechanism. J Urol. https://pubmed.ncbi.nlm.nih.gov/123/
313
+ 2. Johnson CD et al. (2020). Meta-analysis of alprostadil. J Sex Med. https://pubmed.ncbi.nlm.nih.gov/456/
314
+ '''
315
+
316
+
317
+ def format_synthesis_prompt(
318
+ query: str,
319
+ evidence_summary: str,
320
+ drug_candidates: list[str],
321
+ key_findings: list[str],
322
+ mechanism_score: int,
323
+ clinical_score: int,
324
+ confidence: float,
325
+ ) -> str:
326
+ """Format the user prompt for synthesis."""
327
+ return f"""Synthesize a narrative research report for the following query.
328
+
329
+ ## Research Question
330
+ {query}
331
+
332
+ ## Evidence Summary
333
+ {evidence_summary}
334
+
335
+ ## Identified Drug Candidates
336
+ {', '.join(drug_candidates) or 'None identified'}
337
+
338
+ ## Key Findings from Evidence
339
+ {chr(10).join(f'- {f}' for f in key_findings) or 'No specific findings'}
340
+
341
+ ## Assessment Scores
342
+ - Mechanism Score: {mechanism_score}/10
343
+ - Clinical Evidence Score: {clinical_score}/10
344
+ - Confidence: {confidence:.0%}
345
+
346
+ ## Instructions
347
+ Generate a NARRATIVE research report following the structure above.
348
+ Write in prose paragraphs, NOT bullet points (except for Recommendations).
349
+ ONLY cite papers mentioned in the Evidence Summary above.
350
+
351
+ {FEW_SHOT_EXAMPLE}
352
  """
353
  ```
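The user-prompt template above relies on two small Python idioms: `or` as a fallback when a join produces an empty string, and the `.0%` format spec for the confidence value. The sketch below pulls just those expressions into a hypothetical helper, `render_prompt_fields` (an illustration only, not part of the spec), to show what the model would see when the judge returns empty lists.

```python
def render_prompt_fields(
    drug_candidates: list[str],
    key_findings: list[str],
    confidence: float,
) -> dict[str, str]:
    """Reproduce the three formatting expressions used in the prompt template."""
    return {
        # Joining an empty list gives "", which is falsy, so `or` supplies the fallback.
        "candidates": ", ".join(drug_candidates) or "None identified",
        "findings": "\n".join(f"- {f}" for f in key_findings) or "No specific findings",
        # `.0%` multiplies by 100, rounds to zero decimals, and appends "%".
        "confidence": f"{confidence:.0%}",
    }


empty = render_prompt_fields([], [], 0.9)
filled = render_prompt_fields(["Testosterone", "LibiGel"], ["A", "B"], 0.9)
```

The template writes `chr(10)` rather than `"\n"` inside the f-string expression, presumably because backslashes inside f-string expressions are a syntax error before Python 3.12; building the joined string outside the f-string, as here, sidesteps that.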
354
 
355
+ ### Phase 2: Update Simple Orchestrator
+
+ **File**: `src/orchestrators/simple.py`
+ **Change**: Make `_generate_synthesis()` async and add LLM call
+
+ ```python
+ # Add imports at top
+ from src.prompts.synthesis import get_synthesis_system_prompt, format_synthesis_prompt
+ from src.agent_factory.judges import get_model
+ from pydantic_ai import Agent
+
+ # Change method signature and implementation (lines 448-505)
+ async def _generate_synthesis(
+     self,
+     query: str,
+     evidence: list[Evidence],
+     assessment: JudgeAssessment,
+ ) -> str:
+     """
+     Generate the final synthesis response using LLM.
+
+     Args:
+         query: The original question
+         evidence: All collected evidence
+         assessment: The final assessment
+
+     Returns:
+         Narrative synthesis as markdown
+     """
+     # Build evidence summary for LLM context
+     evidence_lines = []
+     for e in evidence[:20]:  # Limit context
+         authors = ", ".join(e.citation.authors[:2]) if e.citation.authors else "Unknown"
+         evidence_lines.append(
+             f"- {e.citation.title} ({authors}, {e.citation.date}): {e.content[:200]}..."
+         )
+     evidence_summary = "\n".join(evidence_lines)
+
+     # Format synthesis prompt
+     user_prompt = format_synthesis_prompt(
+         query=query,
+         evidence_summary=evidence_summary,
+         drug_candidates=assessment.details.drug_candidates,
+         key_findings=assessment.details.key_findings,
+         mechanism_score=assessment.details.mechanism_score,
+         clinical_score=assessment.details.clinical_evidence_score,
+         confidence=assessment.confidence,
+     )
+
+     # Create synthesis agent
+     system_prompt = get_synthesis_system_prompt(self.domain)
+
+     try:
+         agent: Agent[None, str] = Agent(
+             model=get_model(),
+             output_type=str,
+             system_prompt=system_prompt,
+         )
+         result = await agent.run(user_prompt)
+         narrative = result.output
+     except Exception as e:
+         # Fallback to template if LLM fails
+         logger.warning("LLM synthesis failed, using template", error=str(e))
+         return self._generate_template_synthesis(query, evidence, assessment)
+
+     # Add citations footer
+     citations = "\n".join(
+         f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
+         f"({e.citation.source.upper()}, {e.citation.date})"
+         for i, e in enumerate(evidence[:10])
+     )
+
+     return f"""{narrative}
+
+ ---
+ ### Full Citation List ({len(evidence)} sources)
+ {citations}
+
+ *Analysis based on {len(evidence)} sources across {len(self.history)} iterations.*
+ """
+
+ def _generate_template_synthesis(
+     self,
+     query: str,
+     evidence: list[Evidence],
+     assessment: JudgeAssessment,
+ ) -> str:
+     """Fallback template synthesis (no LLM)."""
+     # Keep the existing string template logic here as fallback
+     ...
+ ```
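
The try/except fallback in `_generate_synthesis()` can be isolated into a small helper that is easy to test without a real model. The following is a minimal sketch under stated assumptions, not part of the spec: `synthesize_with_fallback`, `llm_call`, `template_fallback`, and `timeout_s` are hypothetical names, and the timeout is an added assumption (the spec itself only catches exceptions raised by the agent).

```python
import asyncio
from collections.abc import Awaitable, Callable


async def synthesize_with_fallback(
    llm_call: Callable[[str], Awaitable[str]],
    template_fallback: Callable[[], str],
    user_prompt: str,
    timeout_s: float = 60.0,  # hypothetical default, not taken from the spec
) -> str:
    """Try the LLM first; return the template output if the call fails or times out."""
    try:
        # wait_for raises a timeout error if the call takes longer than timeout_s
        return await asyncio.wait_for(llm_call(user_prompt), timeout=timeout_s)
    except Exception:
        # Any failure (provider error, timeout) degrades to the template path
        return template_fallback()
```

In the orchestrator, `llm_call` would wrap `agent.run(...)` and `template_fallback` would wrap `_generate_template_synthesis(...)`; both wrappers are left to the implementer.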
+
+ ### Phase 3: Update Call Site
+
+ **File**: `src/orchestrators/simple.py`
+ **Line**: 393
+
+ ```python
+ # Change from:
+ final_response = self._generate_synthesis(query, all_evidence, assessment)

+ # To:
+ final_response = await self._generate_synthesis(query, all_evidence, assessment)
+ ```

+ ### Phase 4: Update Domain Config

+ **File**: `src/config/domain.py`

+ Add optional `synthesis_system_prompt` field to `DomainConfig`:

+ ```python
+ class DomainConfig(BaseModel):
+     # ... existing fields ...

+     # Synthesis (optional, can inherit from base)
+     synthesis_system_prompt: str | None = None
+ ```

+ ### Phase 5: Add Tests

+ **File**: `tests/unit/prompts/test_synthesis.py` (NEW)

+ ```python
+ """Tests for synthesis prompts."""

+ import pytest

+ from src.prompts.synthesis import (
+     get_synthesis_system_prompt,
+     format_synthesis_prompt,
+     FEW_SHOT_EXAMPLE,
+ )

+
+ def test_synthesis_system_prompt_is_narrative_focused() -> None:
+     """System prompt should emphasize prose, not bullets."""
+     prompt = get_synthesis_system_prompt()
+     assert "PROSE PARAGRAPHS" in prompt
+     assert "not bullet points" in prompt.lower()
+     assert "Executive Summary" in prompt
+
+
+ def test_synthesis_system_prompt_warns_about_hallucination() -> None:
+     """System prompt should warn about citation hallucination."""
+     prompt = get_synthesis_system_prompt()
+     assert "NEVER hallucinate" in prompt
+
+
+ def test_format_synthesis_prompt_includes_evidence() -> None:
+     """User prompt should include evidence summary."""
+     prompt = format_synthesis_prompt(
+         query="testosterone libido",
+         evidence_summary="Study shows efficacy...",
+         drug_candidates=["Testosterone"],
+         key_findings=["Improved libido"],
+         mechanism_score=8,
+         clinical_score=7,
+         confidence=0.85,
+     )
+     assert "testosterone libido" in prompt
+     assert "Study shows efficacy" in prompt
+     assert "Testosterone" in prompt
+     assert "8/10" in prompt
+
+
+ def test_few_shot_example_is_narrative() -> None:
+     """Few-shot example should demonstrate narrative style."""
+     # Count paragraphs vs bullets
+     paragraphs = len([p for p in FEW_SHOT_EXAMPLE.split('\n\n') if len(p) > 100])
+     bullets = FEW_SHOT_EXAMPLE.count('\n- ')
+
+     # Prose should dominate (at least as many paragraphs as bullets)
+     assert paragraphs >= bullets, "Few-shot example should be mostly narrative"
+ ```
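
The paragraph-versus-bullet heuristic used in `test_few_shot_example_is_narrative` could be factored into a reusable helper so the same rule applies to generated reports. A minimal sketch, assuming the 100-character threshold from the test; `narrative_counts` and `is_mostly_narrative` are hypothetical names.

```python
def narrative_counts(text: str, min_paragraph_chars: int = 100) -> tuple[int, int]:
    """Return (paragraphs, bullets) using the same heuristic as the test."""
    # A "paragraph" is any blank-line-separated block longer than the threshold
    paragraphs = sum(1 for block in text.split("\n\n") if len(block) > min_paragraph_chars)
    # A "bullet" is any line that starts with "- " after a newline
    bullets = text.count("\n- ")
    return paragraphs, bullets


def is_mostly_narrative(text: str) -> bool:
    """True when there are at least as many long paragraphs as bullets."""
    paragraphs, bullets = narrative_counts(text)
    return paragraphs >= bullets
```

Note that the heuristic is coarse: a blank-line-separated block made up entirely of bullets still counts as a paragraph once it exceeds the threshold, so a stricter check may be needed if reports contain long lists.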
+
+ **File**: `tests/unit/orchestrators/test_simple_synthesis.py` (NEW)
+
+ ```python
+ """Tests for simple orchestrator synthesis."""
+
+ import pytest
+ from unittest.mock import AsyncMock, MagicMock, patch
+
+ from src.orchestrators.simple import Orchestrator
+ from src.utils.models import Evidence, Citation, JudgeAssessment, JudgeDetails
+
+
+ @pytest.fixture
+ def sample_evidence() -> list[Evidence]:
+     return [
+         Evidence(
+             content="Testosterone therapy shows efficacy in HSDD treatment.",
+             citation=Citation(
+                 source="pubmed",
+                 title="Testosterone and Female Libido",
+                 url="https://pubmed.ncbi.nlm.nih.gov/12345/",
+                 date="2023",
+                 authors=["Smith J"],
+             ),
+         )
+     ]
+
+
+ @pytest.fixture
+ def sample_assessment() -> JudgeAssessment:
+     return JudgeAssessment(
+         sufficient=True,
+         confidence=0.85,
+         reasoning="Evidence is sufficient",
+         recommendation="synthesize",
+         next_search_queries=[],
+         details=JudgeDetails(
+             mechanism_score=8,
+             clinical_evidence_score=7,
+             drug_candidates=["Testosterone"],
+             key_findings=["Improved libido in postmenopausal women"],
+         ),
+     )
+
+
+ @pytest.mark.asyncio
+ async def test_generate_synthesis_calls_llm(
+     sample_evidence: list[Evidence],
+     sample_assessment: JudgeAssessment,
+ ) -> None:
+     """Synthesis should make an LLM call, not just template."""
+     mock_search = MagicMock()
+     mock_judge = MagicMock()
+
+     orchestrator = Orchestrator(
+         search_handler=mock_search,
+         judge_handler=mock_judge,
+     )
+
+     with patch("src.orchestrators.simple.Agent") as mock_agent_class:
+         mock_agent = MagicMock()
+         mock_result = MagicMock()
+         mock_result.output = "This is a narrative synthesis with prose paragraphs."
+         mock_agent.run = AsyncMock(return_value=mock_result)
+         mock_agent_class.return_value = mock_agent
+
+         result = await orchestrator._generate_synthesis(
+             query="testosterone HSDD",
+             evidence=sample_evidence,
+             assessment=sample_assessment,
+         )
+
+         # Verify LLM was called
+         mock_agent_class.assert_called_once()
+         mock_agent.run.assert_called_once()
+
+         # Verify output includes narrative
+         assert "narrative synthesis" in result.lower() or "prose" in result.lower()
+
+
+ @pytest.mark.asyncio
+ async def test_generate_synthesis_falls_back_on_error(
+     sample_evidence: list[Evidence],
+     sample_assessment: JudgeAssessment,
+ ) -> None:
+     """Synthesis should fall back to template if LLM fails."""
+     mock_search = MagicMock()
+     mock_judge = MagicMock()
+
+     orchestrator = Orchestrator(
+         search_handler=mock_search,
+         judge_handler=mock_judge,
+     )
+
+     with patch("src.orchestrators.simple.Agent") as mock_agent_class:
+         mock_agent_class.side_effect = Exception("LLM unavailable")
+
+         result = await orchestrator._generate_synthesis(
+             query="testosterone HSDD",
+             evidence=sample_evidence,
+             assessment=sample_assessment,
+         )
+
+         # Should still return something (template fallback)
+         assert "Sexual Health Analysis" in result or "testosterone" in result.lower()
+ ```
+
+ ---
+
+ ## File Changes Summary
+
+ | File | Lines | Change Type | Description |
+ |------|-------|-------------|-------------|
+ | `src/prompts/synthesis.py` | ~150 | NEW | Narrative synthesis prompts |
+ | `src/orchestrators/simple.py` | 393, 448-505 | MODIFY | Async synthesis with LLM |
+ | `src/config/domain.py` | 57 | MODIFY | Add `synthesis_system_prompt` |
+ | `tests/unit/prompts/test_synthesis.py` | ~60 | NEW | Prompt tests |
+ | `tests/unit/orchestrators/test_simple_synthesis.py` | ~80 | NEW | Synthesis tests |
+
+ ---

  ## Acceptance Criteria

  - [ ] Report has **background section** explaining the condition
  - [ ] Report has **synthesized narrative** weaving evidence together
  - [ ] Report has **actionable recommendations**
+ - [ ] Report has **limitations** section
  - [ ] Citations are **properly formatted** (author, year, title, URL)
  - [ ] No hallucinated references (CRITICAL)
+ - [ ] Falls back gracefully if LLM unavailable
+ - [ ] All existing tests still pass
+ - [ ] New tests achieve 90%+ coverage of synthesis code
+
+ ---

  ## Test Criteria

  ```python
+ async def test_report_is_narrative_not_bullets():
      """Report should be mostly prose, not bullet points."""
+     report = await orchestrator._generate_synthesis(...)

      # Count paragraphs vs bullet points
      paragraphs = len([p for p in report.split('\n\n') if len(p) > 100])

+ async def test_references_not_hallucinated():
      """All references must come from provided evidence."""
      evidence_urls = {e.citation.url for e in evidence}
+     report = await orchestrator._generate_synthesis(...)
+
+     # Extract URLs from report
+     import re
+     report_urls = set(re.findall(r'https?://[^\s\)]+', report))

+     for url in report_urls:
+         # Allow pubmed URLs even if slightly different format
+         if "pubmed" in url or "clinicaltrials" in url:
+             assert any(evidence_url in url or url in evidence_url
+                        for evidence_url in evidence_urls), f"Hallucinated: {url}"
  ```
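
The hallucination check above can also be expressed as a standalone helper that returns the offending URLs, which makes failures easier to report. A minimal sketch under stated assumptions: `find_unsupported_urls` is a hypothetical name, it reuses the URL regex and substring matching from the test, and unlike the test it checks every URL rather than only PubMed and ClinicalTrials links.

```python
import re

# Same pattern as the test: stop at whitespace or a closing parenthesis
URL_PATTERN = re.compile(r"https?://[^\s\)]+")


def find_unsupported_urls(report: str, evidence_urls: set[str]) -> set[str]:
    """Return URLs cited in the report that match no evidence URL."""
    unsupported: set[str] = set()
    for url in URL_PATTERN.findall(report):
        cleaned = url.rstrip(".,;")  # drop trailing sentence punctuation
        # Substring match in either direction tolerates minor format differences
        if not any(cleaned in known or known in cleaned for known in evidence_urls):
            unsupported.add(cleaned)
    return unsupported
```

A test would then assert that the returned set is empty and print its contents on failure.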

+ ---
+
  ## Related Microsoft Agent Framework Patterns

+ | Pattern | File | Application |
+ |---------|------|-------------|
+ | Custom Aggregator | `concurrent_custom_aggregator.py:56-79` | LLM-based synthesis |
  | Fan-Out/Fan-In | `fan_out_fan_in_edges.py` | Multi-expert synthesis |
+ | Sequential Chain | `sequential_agents.py` | Writer→Reviewer pattern |
+
+ ---
+
+ ## Implementation Notes for Async Agent
+
+ 1. **Start with `src/prompts/synthesis.py`** - This is independent and can be created first
+ 2. **Then modify `src/orchestrators/simple.py`** - Change `_generate_synthesis` to async
+ 3. **Update the call site** (line 393) - Add `await`
+ 4. **Add tests** - Both unit and integration
+ 5. **Run `make check`** - Ensure all 237+ tests still pass
+
+ The key insight from the MS Agent Framework is:
+ > The aggregator makes an **LLM call** to synthesize, not string concatenation.
+
+ Our `_generate_synthesis()` currently does NO LLM call. Fix that, and the reports will transform from bullet points to narrative prose.
+
+ ---

  ## References

  - GitHub Issue #85: Report lacks narrative synthesis
  - GitHub Issue #86: Microsoft Agent Framework patterns
+ - `reference_repos/agent-framework/python/samples/getting_started/workflows/orchestration/concurrent_custom_aggregator.py`
+ - LangChain Deep Agents: Few-shot examples importance