MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks¶
Conference: NeurIPS 2025 (Datasets & Benchmarks Track)
arXiv: 2505.12371
Code: GitHub | Project Page
Area: Medical Imaging / AI for Medicine
Keywords: multi-agent collaboration, medical benchmarking, LLM, clinical workflow, EHR prediction
TL;DR¶
This paper proposes MedAgentBoard, a comprehensive benchmark that systematically evaluates multi-agent collaboration, single-LLM, and conventional methods across diverse medical tasks, revealing that multi-agent collaboration does not consistently outperform strong single models or specialized conventional approaches.
Background & Motivation¶
- The multi-agent LLM trend: A growing body of work has introduced multi-agent collaboration into the medical domain, yet its practical advantages remain unclear.
- Limitations of existing evaluations:
- Task coverage is insufficiently broad, lacking diversity representative of real clinical scenarios.
- Rigorous comparisons with specialized conventional methods are absent (most works only compare across LLMs).
- Data modalities are limited, overlooking structured EHR data and medical imaging.
- Core Problem: Does the added complexity and overhead of multi-agent systems genuinely yield performance gains?
- Research positioning: To provide a comprehensive, evidence-based evaluation that assists researchers in selecting appropriate AI solutions.
Method¶
Overall Architecture¶
MedAgentBoard covers 4 major task categories spanning 3 data modalities (text, medical imaging, and structured EHR) and systematically compares 3 classes of methods. The table below lists the tasks, with medical QA and VQA (a single category) split across two rows:
| Task Category | Data Modality | Datasets |
|---|---|---|
| Medical QA | Text | MedQA, PubMedQA |
| Medical VQA | Image + Text | PathVQA, VQA-RAD |
| Lay Summary Generation | Text | PLOS/eLife |
| EHR Predictive Modeling | Structured Data | MIMIC-III/IV |
| Clinical Workflow Automation | Multimodal | Custom scenarios |
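For concreteness, the matrix above can be captured as a small task registry; the `TaskSpec` structure below is an illustrative assumption, not MedAgentBoard's actual configuration format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    category: str
    modality: str                 # "text", "image+text", "structured", or "multimodal"
    datasets: tuple[str, ...]

TASKS = (
    TaskSpec("medical_qa", "text", ("MedQA", "PubMedQA")),
    TaskSpec("medical_vqa", "image+text", ("PathVQA", "VQA-RAD")),
    TaskSpec("lay_summary", "text", ("PLOS", "eLife")),
    TaskSpec("ehr_prediction", "structured", ("MIMIC-III", "MIMIC-IV")),
    TaskSpec("workflow_automation", "multimodal", ("custom_scenarios",)),
)

for t in TASKS:
    print(f"{t.category:22s} {t.modality:12s} {', '.join(t.datasets)}")
```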
Key Designs¶
Three-Way Comparison Framework¶
- Conventional Methods:
  - Text QA: BioLinkBERT, GatorTron
  - VQA: specialized VLMs such as M³AE
  - EHR: XGBoost, LSTM, Transformer, etc.
- Single-LLM Methods:
  - Prompting strategies: zero-shot, few-shot ICL, chain-of-thought (CoT)
  - Models include GPT-4o, Claude 3.5, Gemini, etc.
- Multi-Agent Collaboration Frameworks:
  - MedAgents: multi-role discussion and collaboration
  - ReConcile: multi-model voting and reconciliation
  - General-purpose frameworks such as AutoGen
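To make the three-way comparison concrete, here is a minimal sketch of a shared evaluation loop, assuming every method class exposes the same `predict(question) -> answer` interface. The class names, the simplified role/reconciliation logic, and the `evaluate` helper are illustrative assumptions, not MedAgentBoard's actual code:

```python
from typing import Callable, Protocol

class Method(Protocol):
    name: str
    def predict(self, question: str) -> str: ...

class SingleLLM:
    """One model, one call; optionally wraps the prompt in a CoT template."""
    def __init__(self, name: str, call: Callable[[str], str], cot: bool = False):
        self.name, self.call, self.cot = name, call, cot
    def predict(self, question: str) -> str:
        prompt = f"{question}\nLet's think step by step." if self.cot else question
        return self.call(prompt)

class MultiAgent:
    """Several role-conditioned calls plus a reconciliation call, loosely in
    the spirit of MedAgents/ReConcile (heavily simplified here)."""
    def __init__(self, name: str, call: Callable[[str], str], roles: list[str]):
        self.name, self.call, self.roles = name, call, roles
    def predict(self, question: str) -> str:
        opinions = [self.call(f"As a {r}, answer: {question}") for r in self.roles]
        return self.call("Reconcile these answers:\n" + "\n".join(opinions))

# Conventional task-specific models (BioLinkBERT, XGBoost, ...) plug into the
# same loop through any object implementing Method.predict.
def evaluate(methods: list, items: list[tuple[str, str]]) -> dict[str, float]:
    """Exact-match accuracy under one shared protocol for every method class."""
    return {m.name: sum(m.predict(q).strip() == a for q, a in items) / len(items)
            for m in methods}
```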
Evaluation Dimensions¶
- Correctness: Accuracy (multiple-choice), BLEU/ROUGE (generation tasks)
- Clinical Relevance: LLM-as-a-judge scoring
- Efficiency: API call count, token consumption, latency
- Robustness: Consistency across datasets
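The efficiency dimensions are cheap to instrument per query; below is a hedged sketch in which the `RunLog` structure and the whitespace-based token count are illustrative assumptions (provider-reported token counts would be used in practice):

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunLog:
    api_calls: int = 0
    tokens: int = 0
    latencies: list[float] = field(default_factory=list)

    def timed_call(self, fn, prompt: str) -> str:
        """Wrap one model call, recording latency, call count, and a token proxy."""
        start = time.perf_counter()
        out = fn(prompt)
        self.latencies.append(time.perf_counter() - start)
        self.api_calls += 1
        self.tokens += len(prompt.split()) + len(out.split())  # crude whitespace proxy
        return out

log = RunLog()
answer = log.timed_call(lambda p: "B", "Which drug ...? A) ... B) ...")
print(log.api_calls, log.tokens, sum(log.latencies))
```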
Loss & Training¶
As a benchmark paper, the focus lies on evaluation protocol design rather than model training:
- All LLM-based methods use unified prompt templates.
- Conventional methods follow the optimal configurations reported in their original papers.
- Evaluation metrics are standardized across tasks.
- Results are averaged over multiple runs to reduce variance.
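As an illustration of the unified-prompt-template point, here is a minimal sketch in which the zero-shot and CoT variants differ only in the final instruction; the wording is an assumption, not the paper's actual template:

```python
# Hypothetical unified template shared by all LLM-based methods.
TEMPLATE = (
    "You are a medical expert. Answer the following question.\n"
    "Question: {question}\nOptions: {options}\n{instruction}"
)

ZERO_SHOT = "Respond with the letter of the correct option only."
COT = "Think step by step, then give the letter of the correct option."

def build_prompt(question: str, options: str, cot: bool = False) -> str:
    return TEMPLATE.format(question=question, options=options,
                           instruction=COT if cot else ZERO_SHOT)

print(build_prompt("Which vessel supplies the SA node?",
                   "A) LAD  B) RCA  C) LCx  D) PDA", cot=True))
```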
Key Experimental Results¶
Main Results¶
Medical Text QA Results¶
| Method Category | Method | MedQA Acc↑ | PubMedQA Acc↑ |
|---|---|---|---|
| Conventional | BioLinkBERT | 45.2 | 72.8 |
| Conventional | GatorTron | 48.1 | 74.5 |
| Single LLM | GPT-4o (Zero-shot) | 82.3 | 78.1 |
| Single LLM | GPT-4o (CoT) | 85.7 | 80.4 |
| Single LLM | Claude 3.5 (CoT) | 83.9 | 79.2 |
| Multi-Agent | MedAgents | 83.1 | 78.8 |
| Multi-Agent | ReConcile | 84.2 | 79.5 |
Finding: On medical text QA, a strong single LLM (GPT-4o + CoT) already achieves the best results; multi-agent systems bring no significant improvement over it.
Medical VQA and EHR Prediction Results¶
| Method | PathVQA Acc↑ | VQA-RAD Acc↑ | MIMIC Mortality AUROC↑ |
|---|---|---|---|
| Conventional VLM (M³AE) | 72.3 | 74.8 | — |
| GPT-4o Vision | 65.7 | 68.2 | 0.71 |
| Multi-Agent VQA | 64.9 | 67.5 | 0.69 |
| XGBoost | — | — | 0.84 |
| LSTM | — | — | 0.81 |
| LLM (numerical reasoning) | — | — | 0.68 |
Finding: Specialized conventional methods still significantly outperform LLM-based approaches on VQA and EHR prediction.
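To illustrate why tabular learners dominate EHR prediction, here is a minimal sketch of a conventional baseline in the spirit of the XGBoost row, with synthetic features standing in for MIMIC data (real cohort selection, feature extraction, and labeling are far more involved):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))  # stand-in for labs/vitals features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)  # mortality label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```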
Ablation Study¶
Multi-Agent vs. Single-LLM Efficiency Comparison¶
| Method | Accuracy | API Calls | Token Consumption | Latency (s) |
|---|---|---|---|---|
| GPT-4o (Single) | 85.7 | 1 | 2.1K | 3.2 |
| MedAgents (3 roles) | 83.1 | 5–8 | 12.5K | 18.7 |
| ReConcile (3 models) | 84.2 | 3 | 7.8K | 11.4 |
Finding: Multi-agent systems consume roughly 4–6× more tokens than a single-LLM call while yielding limited or even negative performance gains.
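The trade-off can be checked directly against the reported numbers in the table above:

```python
# Cost-benefit arithmetic from the reported table (accuracy points vs. tokens).
single = {"acc": 85.7, "tokens": 2_100}
medagents = {"acc": 83.1, "tokens": 12_500}
reconcile = {"acc": 84.2, "tokens": 7_800}

for name, m in [("MedAgents", medagents), ("ReConcile", reconcile)]:
    ratio = m["tokens"] / single["tokens"]
    delta = m["acc"] - single["acc"]
    print(f"{name}: {ratio:.1f}x tokens, {delta:+.1f} accuracy points")
# -> MedAgents: 6.0x tokens, -2.6 points; ReConcile: 3.7x tokens, -1.5 points
```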
Key Findings¶
- Multi-agent ≠ better: Across 4 task categories, multi-agent systems demonstrate advantages only in task completeness within clinical workflow automation.
- Conventional methods remain competitive: Specialized fine-tuned models significantly outperform all LLM-based methods on VQA and EHR prediction.
- Single-LLM CoT is sufficiently powerful: A high-quality single model with a well-designed prompt outperforms collaboration among multiple mediocre models.
- Asymmetric cost–benefit trade-off: Multi-agent systems incur roughly 4–6× greater computational overhead, yet on average improve accuracy by less than 1% and in some settings degrade it.
- Task specificity: No universally optimal method exists; method selection must be guided by the specific task at hand.
Highlights & Insights¶
- A grounded benchmark: This work offers a sober evaluation amid the multi-agent enthusiasm, demonstrating that "multi-agent is not a silver bullet."
- Fair comparison design: Incorporating conventional methods into the comparison is a core contribution of this benchmark, filling a critical gap in existing evaluations.
- Full multimodal coverage: Simultaneously covering text, imaging, and structured data reflects the diversity of real clinical settings.
- Actionable guidance: The benchmark provides practitioners with principled guidelines on when to use multi-agent systems, single models, or conventional methods.
Limitations & Future Work¶
- Rapid LLM evolution: Benchmark results for specific LLMs may become outdated quickly as newer models (e.g., GPT-5) emerge.
- Limited multi-agent framework coverage: Only a small number of frameworks are evaluated; novel collaboration paradigms (e.g., debate, reflection) could be incorporated.
- Incomplete task coverage: Important clinical tasks such as medical image segmentation and radiology report generation are not included.
- Fairness of comparison: Conventional methods are carefully fine-tuned, whereas LLMs are largely evaluated in zero/few-shot settings, introducing an inherent asymmetry.
- Real-world deployment evaluation: Assessment of practical clinical deployment scenarios—such as latency requirements and privacy constraints—is absent.
Related Work & Insights¶
- MedAgents (Tang et al., 2024): A multi-agent discussion framework for medical reasoning.
- AgentBench (Liu et al., 2024): A general-purpose benchmark for LLM agents.
- HELM-Med: A medical LLM evaluation suite.
- Insights: The three-way comparison paradigm proposed in this benchmark (conventional vs. single LLM vs. multi-agent) is generalizable to other domains such as law and finance.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 3.5 | Contribution lies in evaluation perspective rather than technical methodology |
| Technical Depth | 3 | Benchmark work; moderate technical complexity |
| Experimental Thoroughness | 4.5 | Broad coverage and comprehensive comparisons |
| Practical Value | 4.5 | Directly actionable for medical AI method selection |
| Writing Quality | 4 | Clear structure; findings are precisely articulated |
| Overall | 4.0 | An important benchmark contribution with sober and insightful findings |