MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning

Conference: ICLR 2026 | arXiv: 2506.00555 | Code: Not released | Area: Medical Imaging | Keywords: multi-agent collaboration, reinforcement learning, medical VQA, curriculum learning, GRPO, clinical reasoning

TL;DR

This paper proposes MMedAgent-RL, a multi-agent system that simulates clinical consultation workflows (triage → specialist → attending physician) optimized via reinforcement learning. The core innovation is Curriculum-guided Multi-Agent Reinforcement Learning (C-MARL) with entropy-aware exploration, enabling the attending physician agent to adopt differentiated explore–exploit strategies when faced with correct, conflicting, or erroneous specialist opinions. The system achieves state-of-the-art performance on 5 medical VQA benchmarks spanning both in-domain and out-of-domain settings.

Background & Motivation

  • Background: Medical image diagnosis spans multiple subspecialties (radiology, pathology, oncology, etc.), which a single Med-LVLM cannot adequately cover.
  • Limitations of Prior Work: Static multi-agent frameworks such as MedAgents and MDAgents rely on fixed GP→Specialist→GP pipelines with pre-defined, non-learnable interaction patterns. Specialist outputs are not always reliable and may introduce noise or misleading signals; majority voting can suppress correct minority opinions.
  • Key Challenge: The attending physician must learn when to trust specialist consensus (exploit) and when to challenge it and reason independently (explore).
  • Key Insight: Works such as DeepSeek-R1 demonstrate that RL can substantially enhance LLM reasoning, yet RL optimization for multi-agent medical collaboration remains unexplored.

Method

Overall Architecture: Clinical Consultation Simulation (GP→Specialists→GP)

The pipeline comprises three roles; the two GP agents (triage doctor and attending physician) are trained via staged RL on top of Qwen2.5-VL:

  1. Triage Doctor: Selects the appropriate specialty based on the input image and text.
  2. Specialists: Played by strong off-the-shelf models (e.g., GPT) and not further trained; they provide specialist diagnostic opinions.
  3. Attending Physician: Integrates specialist opinions with its own knowledge to make the final decision.
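The three-role workflow can be sketched as a minimal orchestration loop. This is an illustrative reconstruction, not the authors' released code: the agent callables, the `k`-specialist sampling, and the field names are assumptions.

```python
# Hypothetical sketch of the triage -> specialist -> attending pipeline.
SPECIALTIES = ["Pathologist", "Radiologist", "Surgeon", "Oncologist",
               "Endocrinologist", "Ophthalmologist", "Dermatologist"]

def consult(image, question, triage_agent, specialist_pool, attending_agent, k=3):
    """Run one simulated consultation round and return the final answer."""
    # 1. Triage doctor routes the case to one of the seven specialties.
    specialty = triage_agent(image, question, choices=SPECIALTIES)
    # 2. k specialists of that specialty each give a diagnostic opinion.
    opinions = [specialist_pool[specialty](image, question) for _ in range(k)]
    # 3. Attending physician integrates the opinions with its own reasoning.
    return attending_agent(image, question, opinions)
```

The key property this structure exposes is that the attending physician receives the specialist opinions as context rather than as a hard vote, leaving room for it to overrule them.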

Key Design 1: RL Optimization of the Triage Doctor

  • Dataset-provided image modality labels serve as ground truth (e.g., histology slide → Pathologist).
  • Seven candidate specialties: Pathologist, Radiologist, Surgeon, Oncologist, Endocrinologist, Ophthalmologist, Dermatologist.
  • Optimized with GRPO; reward function: \(R = R_{\text{format}} + R_{\text{accuracy}}\), where \(R_{\text{format}} \in \{0, 0.5\}\) and \(R_{\text{accuracy}} \in \{0, 1\}\).
  • Improves not only triage accuracy but also the reasoning process (justifying triage decisions).
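A minimal sketch of this composite reward, assuming a `<think>…</think><answer>…</answer>` output template (the exact template is an assumption; the paper only specifies the \{0, 0.5\} format and \{0, 1\} accuracy terms):

```python
import re

def triage_reward(response: str, gold_specialty: str) -> float:
    """Composite GRPO reward: format term in {0, 0.5} plus accuracy in {0, 1}.
    The <think>/<answer> template is illustrative, not from the paper."""
    m = re.search(r"<think>.*</think>\s*<answer>(.*?)</answer>", response, re.DOTALL)
    r_format = 0.5 if m else 0.0                       # well-formed output
    r_accuracy = 1.0 if m and m.group(1).strip() == gold_specialty else 0.0
    return r_format + r_accuracy
```

A fully correct, well-formatted response thus scores 1.5, while a correct answer in the wrong format scores 0, which is what pushes the model to justify its triage decisions inside the reasoning template.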

Key Design 2: Curriculum-Guided Entropy-Aware RL (C-MARL)

Core Idea: Training difficulty is defined by the reliability of specialist opinions, and the attending physician is trained in progressive stages.

Data Stratification (based on specialist accuracy \(s = \text{Acc}(y_d, y^*)\)):

  • Easy (\(s=1\)): All specialists answer correctly.
  • Medium (\(0<s<1\)): Partial agreement; conflicting opinions exist.
  • Hard (\(s=0\)): All specialists are wrong, forming a misleading consensus.
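The stratification reduces to bucketing each training sample by the fraction of correct specialist answers. A sketch, with illustrative field names:

```python
def stratify(samples):
    """Split training samples into Easy / Medium / Hard by specialist
    accuracy s; 'specialist_answers' and 'gold' are illustrative fields."""
    buckets = {"easy": [], "medium": [], "hard": []}
    for sample in samples:
        answers = sample["specialist_answers"]
        s = sum(a == sample["gold"] for a in answers) / len(answers)
        if s == 1.0:
            buckets["easy"].append(sample)    # unanimous, correct consensus
        elif s == 0.0:
            buckets["hard"].append(sample)    # unanimous, misleading consensus
        else:
            buckets["medium"].append(sample)  # conflicting opinions
    return buckets
```

Note that this labeling requires ground-truth answers, which is why (as the Limitations section observes) the same stratification cannot be applied at inference time.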

Dynamic Entropy Regulation — a curriculum-dependent entropy regularization term is added to the standard GRPO objective:

\[\mathcal{J}_{C\text{-}MARL}(\theta) = \mathbb{E}[\mathcal{J}_{GRPO}(\theta) + \gamma_s \cdot H_t(\pi_\theta)]\]
  • \(\gamma_{easy} \approx 0\): No additional exploration needed in the Easy stage; the model focuses on exploiting reliable specialist knowledge.
  • \(\gamma_{medium} > 0\): Moderate exploration is encouraged in the Medium stage to prevent overconfidence when facing conflicting information.
  • \(\gamma_{hard} \gg \gamma_{medium}\): Strong exploration is enforced in the Hard stage to compel the model to overcome misleading specialist consensus.

Training Order: Easy → Medium → Hard, following a curriculum from simple to difficult.
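The stage-dependent entropy bonus can be sketched as follows. The \(\gamma_s\) values below are placeholders chosen only to respect the ordering \(\gamma_{easy} \approx 0 < \gamma_{medium} \ll \gamma_{hard}\); the paper's actual coefficients are not released, and `grpo_objective` stands in for the full GRPO loss.

```python
import math

# Placeholder stage coefficients; only the ordering matches the paper.
GAMMA = {"easy": 0.0, "medium": 0.01, "hard": 0.1}

def token_entropy(probs):
    """Shannon entropy of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def c_marl_objective(grpo_objective, token_dists, stage):
    """J_C-MARL = J_GRPO + gamma_s * mean token entropy of the policy."""
    h = sum(token_entropy(p) for p in token_dists) / len(token_dists)
    return grpo_objective + GAMMA[stage] * h
```

With \(\gamma_{easy} = 0\) the objective collapses to plain GRPO (pure exploitation), while in the Hard stage the large coefficient keeps the policy's output distribution diffuse enough to escape the misleading specialist consensus.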

Key Design 3: Theoretical Analysis

  • A convergence proof (Theorem 4.1) is provided demonstrating the advantage of curriculum learning over standard SGD.
  • The total training time under curriculum learning depends on the sum of distances between optimal policies in adjacent stages: \(\sum_{j=0}^{J-1}\log\|\theta_j^\star - \theta_{j+1}^\star\|_2^2\).
  • An effective curriculum ensures small inter-stage distances, with each stage's solution serving as a warm start for the next.
  • Standard SGD is shown to be unable to converge to the optimal policy under equivalent conditions (lower-bound proof).

Key Experimental Results

Main Results (Accuracy, %)

| Model | VQA-RAD | SLAKE | PathVQA | In-Domain Avg | OmniMedVQA | MMMU-Med | Out-of-Domain Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | 61.0 | 75.5 | 69.4 | 68.6 | 68.5 | 69.7 | 69.1 |
| Qwen2.5-VL-7B | 61.8 | 64.7 | 60.5 | 62.3 | 60.8 | 56.6 | 58.7 |
| MedVLThinker-7B | 63.7 | 67.8 | 65.2 | 65.6 | 62.4 | 57.0 | 59.7 |
| MDAgents | 66.8 | 68.2 | 65.4 | 66.8 | 58.2 | 52.3 | 55.1 |
| MMedAgent-RL (7B) | 71.5 | 76.2 | 72.3 | 73.3 | 73.3 | 71.9 | 72.6 |
| + Test-Time Scaling | 73.9 | 80.1 | 74.3 | 76.1 | 79.6 | 73.5 | 76.6 |

Ablation Study

| Configuration | VQA-RAD | SLAKE | OmniMedVQA | MMMU-Med |
| --- | --- | --- | --- | --- |
| Full MMedAgent-RL | 71.5 | 76.2 | 73.3 | 71.9 |
| w/o Triage | 66.3 | 69.9 | 66.2 | 59.3 |
| w/o Specialists | 65.8 | 67.8 | 64.4 | 54.2 |
| w/o C-MARL | 63.5 | 65.5 | 57.9 | 50.2 |
| + Easy only | 64.7 | 69.3 | 68.2 | 58.0 |
| + Easy + Medium | 69.4 | 76.9 | 70.8 | 68.8 |
| + Easy + Medium + Hard | 71.5 | 76.2 | 73.3 | 71.9 |

Key Findings

  1. C-MARL contributes most: Removing C-MARL results in an average drop of 18.6%, identifying it as the most critical component.
  2. Each curriculum stage contributes incrementally: Easy→Medium→Hard yields progressively cumulative gains, validating the curriculum design.
  3. The Hard stage is crucial: The largest improvements are observed under the "Hard" scenario where all specialists are incorrect (+20%), demonstrating that the model learns to resist blind conformity.
  4. Strong out-of-domain generalization: OmniMedVQA (+13%) and MMMU-Med (+15%) both substantially outperform the backbone and surpass GPT-4o.
  5. Triage accuracy propagates through the pipeline: The optimized triage doctor establishes the foundation for accurate specialist consultation downstream (+3%).
  6. Test-Time Scaling (TTS) further boosts performance: Majority voting yields an additional gain of 4.5%.
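The TTS step is standard self-consistency: sample several final answers from the attending physician and keep the most frequent one. A minimal sketch (sampling-temperature details are not specified in these notes and are omitted):

```python
from collections import Counter

def majority_vote(answers):
    """Test-time scaling by self-consistency: return the most frequent
    of several independently sampled final answers."""
    return Counter(answers).most_common(1)[0][0]
```

For multiple-choice VQA this is cheap and well-defined; for open-ended outputs a vote would first require normalizing or clustering answers, which is one reason the evaluation stays in the multiple-choice setting.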

Highlights & Insights

  • The entropy regulation mechanism in C-MARL is elegant: it addresses the specialist noise problem in multi-agent collaboration from an explore–exploit trade-off perspective.
  • The combination of curriculum learning and RL is theoretically grounded (convergence proof) rather than purely heuristic.
  • A 7B model surpasses GPT-4o on multiple benchmarks, demonstrating the substantial potential of RL-based optimization.
  • The experimental design is comprehensive, covering in-domain/out-of-domain evaluation, ablation studies, specialty selection analysis, and difficulty stratification.

Limitations & Future Work

  • Specialists are portrayed by closed-source models such as GPT, resulting in high deployment costs and dependence on third-party APIs.
  • The triage specialty set is fixed at 7 categories, which may not cover all clinical scenarios.
  • The three-tier difficulty stratification relies on ground-truth labels, which are unavailable at inference time.
  • Evaluation is limited to multiple-choice VQA; open-ended clinical reasoning and report generation are not addressed.
  • The theoretical analysis rests on strong assumptions (e.g., strong convexity) that may not hold for practical deep networks.

Related Work

  • Medical VLMs: Single-agent models including LLaVA-Med, HuatuoGPT-Vision, and VILA-M3.
  • Medical Multi-Agent Systems: Agent Hospital, MedAgents, and MDAgents — all rely on static, pre-defined pipelines.
  • RL-Enhanced Reasoning: GRPO-based post-training paradigms such as DeepSeek-R1 and VLM-R1.
  • Curriculum Learning: Progressive training from easy to hard (Bengio et al., 2009).

Rating

⭐⭐⭐⭐⭐ (5/5)

The method is elegantly designed; the entropy regulation strategy in C-MARL is both intuitively well-motivated and theoretically supported. The experiments are comprehensive and convincing, and the result of a 7B model surpassing GPT-4o is impressive. The integration of curriculum learning and RL establishes a new paradigm for multi-agent collaborative reasoning.