MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks¶
Conference: NeurIPS 2025 (Datasets & Benchmarks Track)
arXiv: 2505.12371
Code: GitHub | Project Page
Area: Medical Imaging / AI for Medicine
Keywords: multi-agent collaboration, medical benchmarking, LLM, clinical workflow, EHR prediction
TL;DR¶
This paper proposes MedAgentBoard, a comprehensive benchmark that systematically evaluates multi-agent collaboration, single-LLM, and conventional methods across diverse medical tasks, revealing that multi-agent collaboration does not consistently outperform strong single models or specialized conventional approaches.
Background & Motivation¶
- The multi-agent LLM trend: A growing body of work has introduced multi-agent collaboration into the medical domain, yet its practical advantages remain unclear.
- Limitations of existing evaluations:
- Task coverage is insufficiently broad, lacking diversity representative of real clinical scenarios.
- Rigorous comparisons with specialized conventional methods are absent (most works only compare across LLMs).
- Data modalities are limited, overlooking structured EHR data and medical imaging.
- Core Problem: Does the added complexity and overhead of multi-agent systems genuinely yield performance gains?
- Research positioning: To provide a comprehensive, evidence-based evaluation that assists researchers in selecting appropriate AI solutions.
Method¶
Overall Architecture¶
MedAgentBoard covers 4 major task categories spanning 3 data modalities (text, medical imaging, and structured EHR) and systematically compares 3 classes of methods. The table below lists the tasks, with medical QA and VQA (a single category) split across two rows:
| Task Category | Data Modality | Datasets |
|---|---|---|
| Medical QA | Text | MedQA, PubMedQA |
| Medical VQA | Image + Text | PathVQA, VQA-RAD |
| Lay Summary Generation | Text | PLOS/eLife |
| EHR Predictive Modeling | Structured Data | MIMIC-III/IV |
| Clinical Workflow Automation | Multimodal | Custom scenarios |
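For concreteness, the matrix above can be captured as a small task registry; the `TaskSpec` structure below is an illustrative assumption, not MedAgentBoard's actual configuration format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    category: str
    modality: str                 # "text", "image+text", "structured", or "multimodal"
    datasets: tuple[str, ...]

TASKS = (
    TaskSpec("medical_qa", "text", ("MedQA", "PubMedQA")),
    TaskSpec("medical_vqa", "image+text", ("PathVQA", "VQA-RAD")),
    TaskSpec("lay_summary", "text", ("PLOS", "eLife")),
    TaskSpec("ehr_prediction", "structured", ("MIMIC-III", "MIMIC-IV")),
    TaskSpec("workflow_automation", "multimodal", ("custom_scenarios",)),
)

for t in TASKS:
    print(f"{t.category:22s} {t.modality:12s} {', '.join(t.datasets)}")
```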
Key Designs¶
Three-Way Comparison Framework¶
- Conventional Methods:
  - Text QA: BioLinkBERT, GatorTron
  - VQA: specialized VLMs such as M³AE
  - EHR: XGBoost, LSTM, Transformer, etc.
- Single-LLM Methods:
  - Prompting strategies: zero-shot, few-shot ICL, chain-of-thought (CoT)
  - Models include GPT-4o, Claude 3.5, Gemini, etc.
- Multi-Agent Collaboration Frameworks:
  - MedAgents: multi-role discussion and collaboration
  - ReConcile: multi-model voting and reconciliation
  - General-purpose frameworks such as AutoGen
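To make the three-way comparison concrete, here is a minimal sketch of a shared evaluation loop, assuming every method class exposes the same `predict(question) -> answer` interface. The class names, the simplified role/reconciliation logic, and the `evaluate` helper are illustrative assumptions, not MedAgentBoard's actual code:

```python
from typing import Callable, Protocol

class Method(Protocol):
    name: str
    def predict(self, question: str) -> str: ...

class SingleLLM:
    """One model, one call; optionally wraps the prompt in a CoT template."""
    def __init__(self, name: str, call: Callable[[str], str], cot: bool = False):
        self.name, self.call, self.cot = name, call, cot
    def predict(self, question: str) -> str:
        prompt = f"{question}\nLet's think step by step." if self.cot else question
        return self.call(prompt)

class MultiAgent:
    """Several role-conditioned calls plus a reconciliation call, loosely in
    the spirit of MedAgents/ReConcile (heavily simplified here)."""
    def __init__(self, name: str, call: Callable[[str], str], roles: list[str]):
        self.name, self.call, self.roles = name, call, roles
    def predict(self, question: str) -> str:
        opinions = [self.call(f"As a {r}, answer: {question}") for r in self.roles]
        return self.call("Reconcile these answers:\n" + "\n".join(opinions))

# Conventional task-specific models (BioLinkBERT, XGBoost, ...) plug into the
# same loop through any object implementing Method.predict.
def evaluate(methods: list, items: list[tuple[str, str]]) -> dict[str, float]:
    """Exact-match accuracy under one shared protocol for every method class."""
    return {m.name: sum(m.predict(q).strip() == a for q, a in items) / len(items)
            for m in methods}
```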
Evaluation Dimensions¶
- Correctness: Accuracy (multiple-choice), BLEU/ROUGE (generation tasks)
- Clinical Relevance: LLM-as-a-judge scoring
- Efficiency: API call count, token consumption, latency
- Robustness: Consistency across datasets
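The efficiency dimensions are cheap to instrument per query; below is a hedged sketch in which the `RunLog` structure and the whitespace-based token count are illustrative assumptions (provider-reported token counts would be used in practice):

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunLog:
    api_calls: int = 0
    tokens: int = 0
    latencies: list[float] = field(default_factory=list)

    def timed_call(self, fn, prompt: str) -> str:
        """Wrap one model call, recording latency, call count, and a token proxy."""
        start = time.perf_counter()
        out = fn(prompt)
        self.latencies.append(time.perf_counter() - start)
        self.api_calls += 1
        self.tokens += len(prompt.split()) + len(out.split())  # crude whitespace proxy
        return out

log = RunLog()
answer = log.timed_call(lambda p: "B", "Which drug ...? A) ... B) ...")
print(log.api_calls, log.tokens, sum(log.latencies))
```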
Loss & Training¶
As a benchmark paper, the focus lies on evaluation protocol design rather than model training:
- All LLM-based methods use unified prompt templates.
- Conventional methods follow the optimal configurations reported in their original papers.
- Evaluation metrics are standardized across tasks.
- Results are averaged over multiple runs to reduce variance.
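As an illustration of the unified-prompt-template point, here is a minimal sketch in which the zero-shot and CoT variants differ only in the final instruction; the wording is an assumption, not the paper's actual template:

```python
# Hypothetical unified template shared by all LLM-based methods.
TEMPLATE = (
    "You are a medical expert. Answer the following question.\n"
    "Question: {question}\nOptions: {options}\n{instruction}"
)

ZERO_SHOT = "Respond with the letter of the correct option only."
COT = "Think step by step, then give the letter of the correct option."

def build_prompt(question: str, options: str, cot: bool = False) -> str:
    return TEMPLATE.format(question=question, options=options,
                           instruction=COT if cot else ZERO_SHOT)

print(build_prompt("Which vessel supplies the SA node?",
                   "A) LAD  B) RCA  C) LCx  D) PDA", cot=True))
```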
Key Experimental Results¶
Main Results¶
Medical Text QA Results¶
| Method Category | Method | MedQA Acc↑ | PubMedQA Acc↑ |
|---|---|---|---|
| Conventional | BioLinkBERT | 45.2 | 72.8 |
| Conventional | GatorTron | 48.1 | 74.5 |
| Single LLM | GPT-4o (Zero-shot) | 82.3 | 78.1 |
| Single LLM | GPT-4o (CoT) | 85.7 | 80.4 |
| Single LLM | Claude 3.5 (CoT) | 83.9 | 79.2 |
| Multi-Agent | MedAgents | 83.1 | 78.8 |
| Multi-Agent | ReConcile | 84.2 | 79.5 |
Finding: On medical text QA, a strong single LLM (GPT-4o + CoT) already achieves the best results; multi-agent systems bring no significant improvement over it.
Medical VQA and EHR Prediction Results¶
| Method | PathVQA Acc↑ | VQA-RAD Acc↑ | MIMIC Mortality AUROC↑ |
|---|---|---|---|
| Conventional VLM (M³AE) | 72.3 | 74.8 | — |
| GPT-4o Vision | 65.7 | 68.2 | 0.71 |
| Multi-Agent VQA | 64.9 | 67.5 | 0.69 |
| XGBoost | — | — | 0.84 |
| LSTM | — | — | 0.81 |
| LLM (numerical reasoning) | — | — | 0.68 |
Finding: Specialized conventional methods still significantly outperform LLM-based approaches on VQA and EHR prediction.
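To illustrate why tabular learners dominate EHR prediction, here is a minimal sketch of a conventional baseline in the spirit of the XGBoost row, with synthetic features standing in for MIMIC data (real cohort selection, feature extraction, and labeling are far more involved):

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))  # stand-in for labs/vitals features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)  # mortality label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```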
Ablation Study¶
Multi-Agent vs. Single-LLM Efficiency Comparison¶
| Method | Accuracy | API Calls | Token Consumption | Latency (s) |
|---|---|---|---|---|
| GPT-4o (Single) | 85.7 | 1 | 2.1K | 3.2 |
| MedAgents (3 roles) | 83.1 | 5–8 | 12.5K | 18.7 |
| ReConcile (3 models) | 84.2 | 3 | 7.8K | 11.4 |
Finding: Multi-agent systems consume roughly 4–6× more tokens than a single-LLM call while yielding limited or even negative performance gains.
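The trade-off can be checked directly against the reported numbers in the table above:

```python
# Cost-benefit arithmetic from the reported table (accuracy points vs. tokens).
single = {"acc": 85.7, "tokens": 2_100}
medagents = {"acc": 83.1, "tokens": 12_500}
reconcile = {"acc": 84.2, "tokens": 7_800}

for name, m in [("MedAgents", medagents), ("ReConcile", reconcile)]:
    ratio = m["tokens"] / single["tokens"]
    delta = m["acc"] - single["acc"]
    print(f"{name}: {ratio:.1f}x tokens, {delta:+.1f} accuracy points")
# -> MedAgents: 6.0x tokens, -2.6 points; ReConcile: 3.7x tokens, -1.5 points
```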
Key Findings¶
- Multi-agent ≠ better: Across 4 task categories, multi-agent systems demonstrate advantages only in task completeness within clinical workflow automation.
- Conventional methods remain competitive: Specialized fine-tuned models significantly outperform all LLM-based methods on VQA and EHR prediction.
- Single-LLM CoT is sufficiently powerful: A high-quality single model with a well-designed prompt outperforms collaboration among multiple mediocre models.
- Asymmetric cost–benefit trade-off: Multi-agent systems incur roughly 4–6× greater computational overhead, yet on average improve accuracy by less than 1% and in some settings degrade it.
- Task specificity: No universally optimal method exists; method selection must be guided by the specific task at hand.
Highlights & Insights¶
- A grounded benchmark: This work offers a sober evaluation amid the multi-agent enthusiasm, demonstrating that "multi-agent is not a silver bullet."
- Fair comparison design: Incorporating conventional methods into the comparison is a core contribution of this benchmark, filling a critical gap in existing evaluations.
- Full multimodal coverage: Simultaneously covering text, imaging, and structured data reflects the diversity of real clinical settings.
- Actionable guidance: The benchmark provides practitioners with principled guidelines on when to use multi-agent systems, single models, or conventional methods.
Limitations & Future Work¶
- Rapid LLM evolution: Benchmark results for specific LLMs may become outdated quickly as newer models (e.g., GPT-5) emerge.
- Limited multi-agent framework coverage: Only a small number of frameworks are evaluated; novel collaboration paradigms (e.g., debate, reflection) could be incorporated.
- Incomplete task coverage: Important clinical tasks such as medical image segmentation and radiology report generation are not included.
- Fairness of comparison: Conventional methods are carefully fine-tuned, whereas LLMs are largely evaluated in zero/few-shot settings, introducing an inherent asymmetry.
- Real-world deployment evaluation: Assessment of practical clinical deployment scenarios—such as latency requirements and privacy constraints—is absent.
Related Work & Insights¶
- MedAgents (Tang et al., 2024): A multi-agent discussion framework for medical reasoning.
- AgentBench (Liu et al., 2024): A general-purpose benchmark for LLM agents.
- HELM-Med: A medical LLM evaluation suite.
- Insights: The three-way comparison paradigm proposed in this benchmark (conventional vs. single LLM vs. multi-agent) is generalizable to other domains such as law and finance.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 3.5 | Contribution lies in evaluation perspective rather than technical methodology |
| Technical Depth | 3 | Benchmark work; moderate technical complexity |
| Experimental Thoroughness | 4.5 | Broad coverage and comprehensive comparisons |
| Practical Value | 4.5 | Directly actionable for medical AI method selection |
| Writing Quality | 4 | Clear structure; findings are precisely articulated |
| Overall | 4.0 | An important benchmark contribution with sober and insightful findings |