CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning¶
Conference: NeurIPS 2025 arXiv: 2512.05576 Code: https://github.com/June01/CureAgent Area: Clinical AI / Multi-Agent Systems Keywords: Clinical Reasoning, Multi-Agent, Executor-Analyst, Stratified Ensemble, Training-Free Architecture Engineering
TL;DR¶
CureAgent proposes an Executor-Analyst collaborative framework that decouples precise tool invocation (TxAgent/Llama-8B as Executor) from high-level clinical reasoning (Gemini 2.5 as Analyst). Combined with a Stratified Ensemble Late Fusion topology that preserves evidence diversity, the system achieves 83.8% accuracy on CURE-Bench without end-to-end fine-tuning, and reveals two critical scaling findings: the context–performance paradox and the curse of dimensionality in action space.
Background & Motivation¶
Background: Large language models show great promise for clinical decision support (Med-PaLM, GPT-4), yet real-world medical reasoning requires actively retrieving and integrating information from continuously updated biomedical sources (FDA labels, OpenTargets, HPO, etc.). The CURE-Bench competition evaluates agent ability to perform clinical reasoning using ToolUniverse (200+ biomedical tools).
Limitations of Prior Work: (a) Context utilization failure: TxAgent (fine-tuned Llama-3.1-8B) successfully retrieves biomedical evidence but fails to leverage it in final diagnoses, resulting in hallucinations (65.8% of error cases); (b) Output parsing errors (19.2%) and instruction-following failures (12.3%) stem from inherent limitations of small models; (c) General-purpose closed-source models (Gemini 2.5) possess strong reasoning capabilities but lack precise tool-calling training, yielding zero-shot performance inferior to TxAgent.
Key Challenge: Tool invocation demands syntactic precision (requiring domain fine-tuning), while clinical reasoning demands semantic robustness (requiring large model capacity)—a single model cannot simultaneously satisfy both requirements. TxAgent has tool-calling ability but weak reasoning (8B); Gemini has reasoning ability but poor tool-calling (not fine-tuned).
Goal: Rather than end-to-end fine-tuning, decouple and combine the "hands" of tool execution with the "brain" of clinical reasoning through architecture engineering.
Key Insight: Error analysis reveals that 65.8% of failures are "retrieval succeeded but reasoning failed"—the bottleneck is reasoning, not retrieval. The solution is to assign retrieval to a dedicated Executor and reasoning to a dedicated Analyst.
Core Idea: TxAgent as "hands" for precise retrieval + Gemini as "brain" for deep reasoning + Stratified Ensemble to preserve evidence diversity = training-free SOTA clinical agent.
Method¶
Overall Architecture¶
Input: Clinical question (multiple-choice format requiring retrieval of biomedical evidence). Output: Final diagnosis + reasoning chain. Pipeline: Three stages—(1) Executors (multiple TxAgent instances in parallel) perform tool calls to collect evidence → (2) Analyst (Gemini 2.5) integrates evidence, supplements via search, generates reasoning chain and preliminary diagnosis → (3) Post-processing module (regex matching + deduplication) ensures output format compliance.
Key Designs¶
- Executor — Specialized Tool-Retrieval Agent:
- Function: Precisely invoke 200+ biomedical tools in ToolUniverse to collect evidence.
- Mechanism: Uses TxAgent (domain fine-tuned Llama-3.1-8B) to decompose input questions into sub-queries and orchestrate multi-step tool calls and reasoning. Key innovation: self-consistency mechanism — \(n_1\) Executors run in parallel (temperature \(T=0.8\)), aggregating the top-\(k\) most frequent tool-call results and reasoning trajectories.
- Design Motivation: Executors do not generate final answers—they are solely responsible for evidence collection. Multi-sampling with majority voting reduces retrieval randomness, ensuring the downstream Analyst receives a comprehensive and robust evidence set.
- Analyst — Long-Context Clinical Reasoner:
- Function: Synthesize reasoning from the noisy evidence stream produced by Executors to generate reliable clinical diagnoses.
- Mechanism: Gemini 2.5 (Flash/Pro) serves as the reasoning backbone, relieved of the syntactic burden of tool invocation, and focuses on: (a) cross-referencing tool outputs with patient-specific comorbidities; (b) proactively searching the internet to supplement missing information; (c) filtering irrelevant noise and resolving contradictory data points. Its long context window and "System 2" reasoning capacity are leveraged to generate chain-of-thought reasoning.
- Design Motivation: Context utilization failure in small models is fundamentally a reasoning capacity deficiency—using a large model for reasoning eliminates this bottleneck entirely.
- Stratified Ensemble Topology (Late Fusion):
- Function: Maximize evidence diversity preservation under a fixed computational budget.
- Mechanism: Two topologies are compared—Config A (Global Pooling / Early Fusion): all Executors pool into a single context → multiple Analysts vote via self-consistency. Config B (Stratified Ensemble / Late Fusion): the Executor budget is divided into \(n_2\) parallel subgroups (each with \(n_1\) instances); each subgroup independently aggregates → independent Analyst → final Late Fusion vote. Key advantage of Config B: different subgroups may explore distinct retrieval paths, and Late Fusion preserves this diversity.
- Design Motivation: Early consensus in Config A filters out minority yet critical evidence—rare drug interactions are discarded by majority voting. Config B allows each retrieval path to complete the full reasoning pipeline independently, reducing collective hallucination.
- Post-Processing Module:
- Function: Ensure output format compliance and deterministic behavior.
- Mechanism: (a) Format calibration: regular expressions map natural language conclusions to the structured output required by the benchmark; (b) Response deduplication: identical inputs produce identical outputs, eliminating LLM generation randomness.
- Design Motivation: Clinical decision support systems require deterministic behavior—the result for the same case must be consistent across queries.
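The post-processing step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the answer format ("Answer: C" style) and all function names here are assumptions.

```python
import hashlib
import re

# Cache of previously computed results, giving deterministic behavior:
# identical inputs always map to the same structured output.
_cache: dict[str, str] = {}

def extract_choice(analyst_output: str) -> str:
    """Map a free-text conclusion to a structured option label (A-E)."""
    # Prefer an explicit "final answer" pattern, then fall back to the
    # last standalone option letter mentioned anywhere in the text.
    m = re.search(r"(?:final answer|answer)\s*[:\-]?\s*\(?([A-E])\)?",
                  analyst_output, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    letters = re.findall(r"\b([A-E])\b", analyst_output)
    return letters[-1] if letters else "UNPARSEABLE"

def postprocess(question: str, analyst_output: str) -> str:
    """Deterministic wrapper: repeated queries return the cached result."""
    key = hashlib.sha256((question + analyst_output).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_choice(analyst_output)
    return _cache[key]
```

The cache keyed on a hash of the full input is what turns a stochastic LLM pipeline into a system that answers the same case the same way every time.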
Loss & Training¶
- Training-free: The entire framework requires no end-to-end fine-tuning. TxAgent uses pre-existing fine-tuned weights; Gemini is accessed via API.
- Executor temperature \(T=0.8\) (selected via search over \(T \in \{0.6, 0.7, 0.8, 0.9\}\)), balancing exploration and reliability.
- Computational budget allocation: \(N_{\text{total}} = n_1 \times n_2\); Stratified Ensemble uses \(n_1=10, n_2=3\).
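With these settings, the Config B topology can be sketched as follows. `run_executor` and `run_analyst` are hypothetical stand-ins for the TxAgent and Gemini calls; the aggregation (frequency voting over sampled evidence within a subgroup, then a final majority vote over analyst answers) follows the description above but is only a sketch.

```python
from collections import Counter
from typing import Callable

def self_consistent_evidence(run_executor: Callable[[str], str],
                             question: str, n1: int, top_k: int = 5) -> list[str]:
    """Sample n1 Executor runs and keep the top-k most frequent results."""
    samples = [run_executor(question) for _ in range(n1)]
    return [result for result, _ in Counter(samples).most_common(top_k)]

def stratified_ensemble(run_executor: Callable[[str], str],
                        run_analyst: Callable[[str, list[str]], str],
                        question: str, n1: int = 10, n2: int = 3) -> str:
    """Late fusion: n2 subgroups reason independently, then vote."""
    answers = []
    for _ in range(n2):
        # Each subgroup aggregates its own evidence and gets its own Analyst,
        # so a minority retrieval path survives to the final vote.
        evidence = self_consistent_evidence(run_executor, question, n1)
        answers.append(run_analyst(question, evidence))
    return Counter(answers).most_common(1)[0][0]
```

Note how the vote happens only once, at the very end: Config A would instead pool all \(n_1 \times n_2\) Executor samples before any Analyst sees them, which is exactly the early-consensus bottleneck the paper argues against.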
Key Experimental Results¶
Main Results — CURE-Bench Phase 2¶
| Architecture | Executor | \(n_1\) | Analyst | \(n_2\) | Accuracy |
|---|---|---|---|---|---|
| Baseline | gemini-2.5-flash | 1 | — | — | 63.1 |
| Baseline | TxAgent | 1 | — | — | 69.3 |
| SC only | TxAgent | 30 | — | — | 73.5 |
| Config A | TxAgent | 30 | gemini-flash | 3 | 80.5 |
| Config B | TxAgent | 10 | gemini-flash | 3 | 81.4 |
| Config B + search | TxAgent | 10 | gemini-flash+search | 3 | 83.8 |
Ablation Study — Impact of Architecture Choices¶
| Configuration | Accuracy | Notes |
|---|---|---|
| TxAgent alone | 69.3% | Baseline |
| Decoupled (1 Exec + 1 Ana) | 74.7% | Decoupling alone yields +5.4 pts |
| Config A (30+3) | 80.5% | Early fusion, information bottleneck |
| Config B (10×3) | 81.4% | Late fusion, diversity preserved (+0.9 pts) |
| Config B + search | 83.8% | Search supplements missing tool information (+2.4 pts) |
Scaling Findings¶
| Finding | Data | Implication |
|---|---|---|
| Context–performance paradox | Accuracy drops from 94% to 87.93% when reasoning context exceeds 12k tokens | Excessive raw evidence introduces noise that overwhelms the attention mechanism |
| Curse of dimensionality in action space | ToolUniverse v1→v2 (200→600 tools): accuracy drops from 92.0% to 87.5% | Increased tool count degrades retrieval precision |
Key Findings¶
- Decoupling is the largest source of gain: A single Executor + single Analyst (74.7%) already outperforms both TxAgent (69.3%) and Gemini (63.1%).
- Topology matters: Under the same computational budget, Config B (81.4%) > Config A (80.5%); Late Fusion preserves diversity.
- Self-consistency converges rapidly: performance improves quickly for \(n<15\) and plateaus once \(n>20\) (around 74.2%).
- Temperature \(T=0.8\) is optimal: at \(T=0.9\), accuracy collapses to 56.7% because outputs become too stochastic.
- Gemini 2.5 Pro with search (81.3%, evaluated post-competition) comes close to the full system on its own, suggesting future foundation models may reduce reliance on a dedicated Executor.
Highlights & Insights¶
- "Hand-brain separation" as an architecture engineering philosophy: Rather than end-to-end fine-tuning, specialized models are assigned distinct roles—small fine-tuned models perform precise tool invocation while large models handle deep reasoning. This principle is broadly applicable to all tool-augmented agent systems.
- Late Fusion preserves evidence diversity: the information bottleneck of Early Fusion is quantified directly (Config B beats Config A by 0.9 pts), with the core insight that "premature consensus discards rare but critical evidence."
- Context–performance paradox: Performance degrades beyond 12k tokens, demonstrating that more retrieval is not always better in RAG systems—information compression or early rejection strategies are needed.
- Full modularity: The modular design allows independent upgrading of the Executor and Analyst (e.g., substituting stronger models) without retraining.
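As one hedged illustration of the "early rejection" idea above, evidence snippets could be scored for relevance and truncated to a token budget before reaching the Analyst. The keyword-overlap score and whitespace tokenization below are crude placeholders for illustration, not the paper's method.

```python
def filter_evidence(question: str, snippets: list[str],
                    max_tokens: int = 12_000) -> list[str]:
    """Keep the most question-relevant snippets that fit a token budget."""
    q_terms = set(question.lower().split())

    def score(snippet: str) -> float:
        # Placeholder relevance score: fraction of snippet terms that
        # overlap the question. A real system would use a learned ranker.
        s_terms = set(snippet.lower().split())
        return len(q_terms & s_terms) / (len(s_terms) or 1)

    kept, used = [], 0
    for snippet in sorted(snippets, key=score, reverse=True):
        cost = len(snippet.split())  # rough whitespace-token proxy
        if used + cost > max_tokens:
            break
        kept.append(snippet)
        used += cost
    return kept
```

The 12k-token default mirrors the threshold at which the paper observes degradation; the right budget would need to be tuned per backbone model.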
Limitations & Future Work¶
- The method is essentially systems engineering (multi-agent orchestration + voting), offering limited technical novelty.
- High computational cost: \(n_1 \times n_2 = 30\) Executor calls (plus \(n_2\) Analyst calls) per question incurs substantial API expense.
- The context–performance paradox is observed but not resolved—confidence-based filtering strategies (e.g., DeepConf) are needed.
- Tool-count scaling degradation (a 4.5-point drop at 600 tools) calls for hierarchical tool retrieval or RAG over tool documentation.
- Reliance on closed-source APIs (Gemini) limits reproducibility and deployment flexibility.
Related Work & Insights¶
- vs. TxAgent (single model): the fine-tuned 8B model handles the full pipeline; strong in tool-calling but weak in reasoning. CureAgent's decoupled design yields +14.5 pts.
- vs. Gemini-2.5-Pro (single model + search): search provides broad knowledge but lacks the precision of specialized tool invocation; CureAgent combines both advantages for +9 pts.
- vs. ReAct: ReAct interleaves reasoning and action within a single model; CureAgent delegates reasoning and action to different models, better suited to asymmetric capability scenarios.
- CureAgent's Stratified Ensemble paradigm generalizes to any multi-agent RAG system: independent retrieval → independent reasoning → final vote.
Rating¶
- Novelty: ⭐⭐⭐ Multi-agent decoupling and ensemble are established paradigms; innovation lies in systematic design and empirical analysis tailored to clinical scenarios.
- Experimental Thoroughness: ⭐⭐⭐⭐ Rich ablations, multi-model comparisons, and scaling analyses, though evaluation is limited to the single CURE-Bench benchmark.
- Writing Quality: ⭐⭐⭐⭐ Error-analysis-driven motivation chain is clear; figures and tables are professional; quantitative analysis is thorough.
- Value: ⭐⭐⭐⭐ Training-free architecture engineering holds high practical value for clinical AI; scaling findings are informative for the broader community.