
PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis

Conference: AAAI 2026 arXiv: 2601.10945 Code: https://vl2g.github.io/projects/pcdf Area: Multimodal VLM Keywords: Medical Diagnosis, Multi-turn Dialogue, VLM Interaction, Data Synthesis, Dialogue-driven Fine-tuning

TL;DR

This paper proposes the Pre-Consultation Dialogue Framework (PCDF), which simulates multi-turn doctor–patient dialogues using two VLMs (DocVLM and PatientVLM) to generate image–dialogue–diagnosis triplets for fine-tuning DocVLM, achieving an average F1 improvement of 11.48 points across four medical imaging benchmarks.

Background & Motivation

State of the Field

Medical imaging AI research has long revolved around the "image → diagnosis" paradigm, progressing from early CNN-based classification to medical adaptations of CLIP (MedCLIP, BioMedCLIP), and then to large VLMs (MedPaLM2, MedGemma, LLaVA-Med), with continuously improving visual understanding capabilities.

Limitations of Prior Work

In clinical practice, however, diagnosis rarely relies on imaging alone. Physicians engage in multi-turn interactions with patients, progressively eliciting symptoms and medical history to narrow the differential diagnosis. This dialogue-driven reasoning process is central to clinical diagnosis, yet existing models entirely neglect this aspect, resulting in fragile predictions.

Data Collection Challenges

Collecting real doctor–patient dialogue data is extremely difficult: it requires IRB ethical approval, patient informed consent, and faces physician concerns over legal liability and workflow disruption, making large-scale data collection practically infeasible.

Root Cause

Existing efforts (Yang et al. 2024; Chen et al. 2023) have attempted to generate synthetic dialogues using a single LLM playing both doctor and patient, but suffer from two critical limitations: (1) they operate in text-only settings without medical images; and (2) having a single model generate both roles undermines role separation and interactional authenticity.

Starting Point

This paper proposes PCDF — employing two independent VLMs to respectively play the doctor and patient roles, conducting joint visual-dialogue reasoning over medical images. PatientVLM generates symptom-based responses conditioned on ground-truth diagnostic labels (while being explicitly instructed not to reveal the diagnosis), and DocVLM generates follow-up questions based on the image and dialogue history. This design preserves the information asymmetry inherent in real clinical consultations.

Method

Overall Architecture

PCDF consists of two stages: a dialogue simulation stage and a dialogue-conditioned fine-tuning stage.

Stage 1: Given a medical dataset \(\mathcal{D}=\{(I_i, C_i)\}_{i=1}^N\), T-turn DocVLM–PatientVLM dialogues are simulated for each sample to produce an augmented dataset \(\hat{\mathcal{D}}=\{(I_i, H_i, C_i)\}_{i=1}^N\).

Stage 2: DocVLM is fine-tuned on the augmented dataset to learn \(P(C|I,H)\), i.e., diagnosis conditioned on both image and dialogue history.
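To fix the data flow, here is a minimal Python sketch of Stage 1's output; `build_augmented_dataset` and `simulate_dialogue` are illustrative names rather than the paper's code (`simulate_dialogue` itself is sketched after the Key Designs list below).

```python
def build_augmented_dataset(dataset, doc_vlm, patient_vlm, candidates, T=8):
    """Stage 1: attach a simulated T-turn dialogue H_i to every (I_i, C_i) pair."""
    return [
        (image, simulate_dialogue(doc_vlm, patient_vlm, image, diagnosis, candidates, T), diagnosis)
        for image, diagnosis in dataset
    ]

# Stage 2 then LoRA-fine-tunes DocVLM on these (I, H, C) triplets so that it
# models P(C | I, H); see "Loss & Training" below.
```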

Key Designs

  1. DocVLM (Doctor Model):

    • Generates follow-up questions based on the medical image \(I_i\), dialogue history \(H_{i,<t}\), and the full set of candidate diagnoses \(\mathcal{C}\)
    • Core formulation: \(Q_{i,t} = \text{DocVLM}(P_{doc}(I_i, H_{i,<t}, \mathcal{C}))\)
    • Design Motivation: Including all candidate diagnoses in the prompt encourages the generation of discriminative questions that help distinguish between similar conditions
  2. PatientVLM (Patient Model):

    • Generates responses based on the image \(I_i\), ground-truth diagnosis \(C_i\), and DocVLM's question \(Q_{i,t}\)
    • Core formulation: \(A_{i,t} = \text{PatientVLM}(P_{pat}(I_i, C_i, Q_{i,t}))\)
    • Design Motivation: The ground-truth diagnosis guides symptom expression, while the model is explicitly instructed not to disclose the diagnosis itself, preserving information asymmetry
    • PatientVLM parameters remain frozen throughout dialogue simulation
  3. Iterative Dialogue Generation:

    • DocVLM and PatientVLM interact for up to T rounds (T=8 in experiments)
    • Each round consists of one question from DocVLM and one response from PatientVLM
    • The process ultimately yields image–dialogue–diagnosis triplets (a minimal simulation sketch follows this list)
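Putting the three designs together, Stage 1's inner loop can be sketched as follows. `doc_vlm.generate` / `patient_vlm.generate` are assumed chat-style wrappers (image + text prompt → text completion), and the prompt wording paraphrases the design points above rather than quoting the paper's templates.

```python
def simulate_dialogue(doc_vlm, patient_vlm, image, gt_diagnosis, candidates, T=8):
    """Simulate one T-turn DocVLM-PatientVLM consultation.

    Returns the dialogue history H_i as a list of (question, answer) pairs.
    doc_vlm.generate / patient_vlm.generate are assumed wrappers that take an
    image plus a text prompt and return a text completion.
    """
    history = []
    for _ in range(T):
        transcript = "\n".join(f"Doctor: {q}\nPatient: {a}" for q, a in history)
        # DocVLM sees the image, the dialogue so far, and ALL candidate
        # diagnoses, which encourages discriminative follow-up questions.
        question = doc_vlm.generate(image, (
            "You are a doctor examining this medical image. "
            f"Candidate diagnoses: {', '.join(candidates)}.\n"
            f"Dialogue so far:\n{transcript}\n"
            "Ask one follow-up question that best narrows down the diagnosis."
        ))
        # PatientVLM (kept frozen) is grounded on the true diagnosis so its
        # symptom descriptions stay consistent, but it is told never to reveal
        # the label, preserving the doctor-patient information asymmetry.
        answer = patient_vlm.generate(image, (
            f"You are a patient whose actual condition is '{gt_diagnosis}'. "
            "Answer the doctor's question by describing your symptoms "
            "truthfully, but never state the diagnosis itself.\n"
            f"Question: {question}"
        ))
        history.append((question, answer))
    return history
```

Running this over every \((I_i, C_i)\) in \(\mathcal{D}\) yields the image–dialogue–diagnosis triplets of \(\hat{\mathcal{D}}\) consumed in Stage 2.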

Loss & Training

  • Diagnosis classification is formulated as a text generation problem, with diagnoses generated autoregressively
  • Standard generation loss: \(\mathcal{L}_{gen}(\theta) = -\mathbb{E}_{(I,H,C)}\left[\sum_m \log P_\theta(C_m|C_{<m}, I, H)\right]\)
  • DocVLM is fine-tuned with LoRA: rank=16, alpha=32, dropout=0.05 (mapped to a PEFT config in the sketch after this list)
  • Training for 10 epochs with batch size=8
  • mPLUG-Owl3 is used as the default PatientVLM in experiments
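The reported hyperparameters map directly onto a standard PEFT setup. A minimal sketch, assuming a Hugging Face-style causal VLM; the `target_modules` choice is an assumption (typical attention projections), since the summary does not specify it.

```python
from peft import LoraConfig, get_peft_model

def wrap_with_lora(base_doc_vlm):
    """Attach LoRA adapters to a pretrained DocVLM backbone (HF-style model)."""
    lora_config = LoraConfig(
        r=16,             # rank, as reported
        lora_alpha=32,    # scaling, as reported
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumption: not specified in the summary
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base_doc_vlm, lora_config)

# The generation loss is ordinary next-token cross-entropy restricted to the
# diagnosis tokens: label positions covering the image/dialogue prompt are set
# to -100 (ignored by the loss), so only P_theta(C_m | C_<m, I, H) is supervised:
#   labels = [-100] * len(prompt_ids) + diagnosis_ids
```

With the prompt labels masked to -100, the standard causal-LM cross-entropy reduces exactly to \(\mathcal{L}_{gen}\) above.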

Key Experimental Results

Main Results

Evaluated on four datasets from MedMNIST v2:

| Model | Setting | DermaMNIST F1 | PneumoniaMNIST F1 | RetinaMNIST F1 | PathMNIST F1 |
| --- | --- | --- | --- | --- | --- |
| InternVL3-2B | Image-only SFT | 36.5 | 88.4 | 31.5 | 70.9 |
| InternVL3-2B | +PCDF | 73.7 (+37.2) | 98.6 (+10.2) | 54.9 (+23.4) | 85.5 (+14.6) |
| Qwen2.5-VL-7B | Image-only SFT | 56.5 | 83.3 | 33.8 | 73.5 |
| Qwen2.5-VL-7B | +PCDF | 81.0 (+24.5) | 94.5 (+11.2) | 39.7 (+5.9) | 77.9 (+4.4) |
| Gemma3-4B | Image-only SFT | 78.3 | 95.7 | 47.7 | 86.0 |
| Gemma3-4B | +PCDF | 81.9 (+3.6) | 99.0 (+3.3) | 67.7 (+20.0) | 90.2 (+4.2) |
| MedGemma3-4B | Image-only SFT | 81.5 | 99.1 | 71.2 | 90.9 |
| MedGemma3-4B | +PCDF | 86.4 (+4.9) | 99.3 (+0.2) | 81.3 (+10.1) | 96.9 (+6.0) |

Key Findings: PCDF-augmented VLMs achieve an average F1 gain of 11.48 points, with general-purpose VLMs benefiting most (InternVL3: +37.2 F1 on DermaMNIST).

Ablation Study

Dialogue Length Analysis (Gemma3 + mPLUG-Owl3):

| Turns T | DermaMNIST F1 | PneumoniaMNIST F1 | RetinaMNIST F1 | PathMNIST F1 |
| --- | --- | --- | --- | --- |
| 2 | 63.5 | 78.8 | 27.8 | 59.1 |
| 4 | 70.3 | 80.3 | 36.6 | 49.5 |
| 6 | 71.9 | 91.7 | 44.1 | 71.8 |
| 8 | 81.9 | 99.0 | 67.7 | 90.2 |

PatientVLM Selection Analysis (DocVLM = Qwen2.5-VL-7B):

| PatientVLM | Avg. F1 | Notes |
| --- | --- | --- |
| Image-only SFT | 61.8 | Baseline (no dialogue) |
| mPLUG-Owl3 | 73.3 | Best PatientVLM |
| InternVL3 | 70.1 | Second best |
| Qwen2.5-VL | 72.7 | Same architecture as DocVLM, different role |
| MedGemma | 70.5 | Medical-domain model |

Key Findings

  1. General-purpose VLMs benefit more: InternVL3 achieves F1 +37.2 on DermaMNIST due to the lack of medical domain pre-training
  2. Longer dialogues yield better performance: T from 2 to 8 increases RetinaMNIST F1 from 27.8 to 67.7 (+39.9 absolute)
  3. PCDF outperforms CoT reasoning: with MedGemma, PCDF exceeds zero-shot CoT reasoning by an average of 23.6 F1 points
  4. Clinical validation: 96.9% of simulated dialogues are rated as clinically relevant, with no cases of diagnostic leakage

Highlights & Insights

  • Dual-VLM role separation is an elegant design — it preserves the information asymmetry between doctor and patient inherent in real consultations, yielding greater authenticity than single-model generation
  • Model-agnostic: PCDF is applicable to arbitrary VLMs without architectural modification, requiring only LoRA fine-tuning
  • Even medically specialized models such as MedGemma benefit, indicating that dialogue-based supervisory signals are complementary to conventional domain adaptation
  • Zero-cost clinical dialogue data: No real doctor–patient dialogues are required, entirely circumventing the ethical and financial barriers to data collection

Limitations & Future Work

  • Clinical validation is limited in scale (210 cases), necessitating larger and more diverse evaluations
  • DocVLM-generated questions tend toward professional terminology, potentially difficult for lay patients to understand
  • Current support is limited to English, constraining applicability in multilingual healthcare settings
  • MedMNIST datasets are relatively straightforward; validation in more complex clinical scenarios (e.g., multi-morbidity) is lacking
  • The quality of symptom generation by PatientVLM depends on the underlying VLM's medical knowledge
  • Positioning: unlike evaluation-oriented works such as MedIQ and 3MDBench, PCDF is a training framework, directly inspired by real consultation workflows in which physicians do not only read images but also elicit symptoms
  • Future work could extend PCDF to multimodal settings (incorporating laboratory test data) or multilingual scenarios

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Dual-VLM dialogue simulation for medical consultation is an entirely novel framework design)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Four datasets + four VLMs + multi-dimensional ablation)
  • Writing Quality: ⭐⭐⭐⭐⭐ (Clear problem motivation and naturally presented methodology)
  • Value: ⭐⭐⭐⭐ (Demonstrates practical application potential in medical AI)