PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis¶
Conference: AAAI 2026 | arXiv: 2601.10945 | Code: https://vl2g.github.io/projects/pcdf | Area: Multimodal VLM | Keywords: Medical Diagnosis, Multi-turn Dialogue, VLM Interaction, Data Synthesis, Dialogue-driven Fine-tuning
TL;DR¶
This paper proposes the Pre-Consultation Dialogue Framework (PCDF), which simulates multi-turn doctor–patient dialogues between two VLMs (DocVLM as the doctor, PatientVLM as the patient) to generate image–dialogue–diagnosis triplets for fine-tuning DocVLM, achieving an average F1 improvement of 11.48 points across four medical imaging benchmarks.
Background & Motivation¶
State of the Field¶
Medical imaging AI research has long revolved around the "image → diagnosis" paradigm, progressing from early CNN-based classification to medical adaptations of CLIP (MedCLIP, BioMedCLIP), and then to large VLMs (MedPaLM2, MedGemma, LLaVA-Med), with continuously improving visual understanding capabilities.
Limitations of Prior Work¶
In clinical practice, however, diagnosis rarely relies on imaging alone. Physicians engage in multi-turn interactions with patients, progressively eliciting symptoms and medical history to narrow the differential diagnosis. This dialogue-driven reasoning process is central to clinical diagnosis, yet existing models entirely neglect this aspect, resulting in fragile predictions.
Data Collection Challenges¶
Collecting real doctor–patient dialogue data is extremely difficult: it requires IRB ethical approval, patient informed consent, and faces physician concerns over legal liability and workflow disruption, making large-scale data collection practically infeasible.
Root Cause¶
Existing efforts (Yang et al. 2024; Chen et al. 2023) have attempted to generate synthetic dialogues using a single LLM playing both doctor and patient, but suffer from two critical limitations: (1) they operate in text-only settings without medical images; and (2) having a single model generate both roles undermines role separation and interactional authenticity.
Starting Point¶
This paper proposes PCDF — employing two independent VLMs to respectively play the doctor and patient roles, conducting joint visual-dialogue reasoning over medical images. PatientVLM generates symptom-based responses conditioned on ground-truth diagnostic labels (while being explicitly instructed not to reveal the diagnosis), and DocVLM generates follow-up questions based on the image and dialogue history. This design preserves the information asymmetry inherent in real clinical consultations.
Method¶
Overall Architecture¶
PCDF consists of two stages: a dialogue simulation stage and a dialogue-conditioned fine-tuning stage.
Stage 1: Given a medical dataset \(\mathcal{D}=\{(I_i, C_i)\}_{i=1}^N\), T-turn DocVLM–PatientVLM dialogues are simulated for each sample to produce an augmented dataset \(\hat{\mathcal{D}}=\{(I_i, H_i, C_i)\}\).
Stage 2: DocVLM is fine-tuned on the augmented dataset to learn \(P(C|I,H)\), i.e., diagnosis conditioned on both image and dialogue history.
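To make the two-stage interface concrete, here is a minimal sketch of the Stage-1 data flow; the `Triplet` container and the function names are illustrative assumptions, not the paper's released code.

```python
# Hedged sketch of the PCDF Stage-1 augmentation (assumed types/names).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Triplet:
    image: Any            # medical image I_i
    dialogue: list[dict]  # simulated dialogue history H_i
    diagnosis: str        # ground-truth label C_i

def build_augmented_dataset(
    dataset: list[tuple[Any, str]],                            # D = {(I_i, C_i)}
    simulate_dialogue: Callable[[Any, str, int], list[dict]],  # Stage-1 simulator
    T: int = 8,                                                # dialogue turns
) -> list[Triplet]:
    """Stage 1: pair each (I_i, C_i) with a T-turn simulated dialogue."""
    return [Triplet(image=I, dialogue=simulate_dialogue(I, C, T), diagnosis=C)
            for I, C in dataset]

# Stage 2 then fine-tunes DocVLM on these triplets to model P(C | I, H).
```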
Key Designs¶
- DocVLM (Doctor Model):
  - Generates follow-up questions based on the medical image \(I_i\), dialogue history \(H_{i,<t}\), and the full set of candidate diagnoses \(\mathcal{C}\)
  - Core formulation: \(Q_{i,t} = \text{DocVLM}(P_{doc}(I_i, H_{i,<t}, \mathcal{C}))\)
  - Design Motivation: Including all candidate diagnoses in the prompt encourages the generation of discriminative questions that help distinguish between similar conditions
- PatientVLM (Patient Model):
  - Generates responses based on the image \(I_i\), ground-truth diagnosis \(C_i\), and DocVLM's question \(Q_{i,t}\)
  - Core formulation: \(A_{i,t} = \text{PatientVLM}(P_{pat}(I_i, C_i, Q_{i,t}))\)
  - Design Motivation: The ground-truth diagnosis guides symptom expression, while the model is explicitly instructed not to disclose the diagnosis itself, preserving information asymmetry
  - PatientVLM parameters remain frozen throughout dialogue simulation
- Iterative Dialogue Generation (see the sketch after this list):
  - DocVLM and PatientVLM interact for up to T rounds (T=8 in experiments)
  - Each round consists of one question from DocVLM and one response from PatientVLM
  - The process ultimately yields image–dialogue–diagnosis triplets
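The turn-level loop can be sketched as follows. This is a minimal illustration, not the authors' code: the `.generate(image, prompt)` interface and the prompt wording are assumptions standing in for the paper's \(P_{doc}\)/\(P_{pat}\) templates and actual VLM inference.

```python
# Hedged sketch of one PCDF simulation episode (Stage 1). All names and
# the .generate() interface are illustrative assumptions.
def simulate_episode(doc_vlm, patient_vlm, image, gt_diagnosis, candidates, T=8):
    """Run up to T question-answer rounds between DocVLM and PatientVLM."""
    history = []
    for t in range(T):
        # DocVLM sees the image, the dialogue so far, and ALL candidate
        # diagnoses, which encourages discriminative follow-up questions.
        doc_prompt = (f"Candidate diagnoses: {', '.join(candidates)}.\n"
                      f"Dialogue so far: {history}\n"
                      "Ask one follow-up question to narrow the diagnosis.")
        question = doc_vlm.generate(image, doc_prompt)

        # PatientVLM is conditioned on the ground-truth label so its symptom
        # descriptions stay consistent, but it is explicitly told NOT to name
        # the label, preserving doctor-patient information asymmetry.
        pat_prompt = (f"You are a patient whose condition is '{gt_diagnosis}'. "
                      "Answer honestly but never reveal the diagnosis itself.\n"
                      f"Doctor's question: {question}")
        answer = patient_vlm.generate(image, pat_prompt)  # PatientVLM stays frozen

        history.append({"doctor": question, "patient": answer})
    return history
```

Partially applying `doc_vlm`, `patient_vlm`, and `candidates` (e.g., via `functools.partial`) turns this into the `simulate_dialogue` callable assumed in the Stage-1 sketch above.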
Loss & Training¶
- Diagnosis classification is formulated as a text generation problem, with diagnoses generated autoregressively
- Standard generation loss: \(\mathcal{L}_{gen}(\theta) = -\mathbb{E}_{(I,H,C)}\left[\sum_m \log P_\theta(C_m|C_{<m}, I, H)\right]\)
- DocVLM is fine-tuned with LoRA: rank=16, alpha=32, dropout=0.05
- Training for 10 epochs with batch size=8
- mPLUG-Owl3 is used as the default PatientVLM in experiments
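For reference, the stated fine-tuning hyperparameters map directly onto a standard PEFT configuration. The sketch below is an assumption about the setup, not the authors' code; in particular, `target_modules` is guessed, since the paper does not name which layers receive adapters.

```python
# Hedged LoRA sketch using the paper's stated hyperparameters
# (rank 16, alpha 32, dropout 0.05); target_modules is assumed.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

# doc_vlm = get_peft_model(base_vlm, lora_cfg)  # base_vlm: any DocVLM backbone
# Train for 10 epochs with batch size 8, minimizing the autoregressive
# cross-entropy over diagnosis tokens C_m given (I, H), i.e. L_gen above.
```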
Key Experimental Results¶
Main Results¶
Evaluated on four datasets from MedMNIST v2:
| Model | Setting | DermaMNIST F1 | PneumoniaMNIST F1 | RetinaMNIST F1 | PathMNIST F1 |
|---|---|---|---|---|---|
| InternVL3-2B | Image-only SFT | 36.5 | 88.4 | 31.5 | 70.9 |
| InternVL3-2B | +PCDF | 73.7 (+37.2) | 98.6 (+10.2) | 54.9 (+23.4) | 85.5 (+14.6) |
| Qwen2.5-VL-7B | Image-only SFT | 56.5 | 83.3 | 33.8 | 73.5 |
| Qwen2.5-VL-7B | +PCDF | 81.0 (+24.5) | 94.5 (+11.2) | 39.7 (+5.9) | 77.9 (+4.4) |
| Gemma3-4B | Image-only SFT | 78.3 | 95.7 | 47.7 | 86.0 |
| Gemma3-4B | +PCDF | 81.9 (+3.6) | 99.0 (+3.3) | 67.7 (+20.0) | 90.2 (+4.2) |
| MedGemma3-4B | Image-only SFT | 81.5 | 99.1 | 71.2 | 90.9 |
| MedGemma3-4B | +PCDF | 86.4 (+4.9) | 99.3 (+0.2) | 81.3 (+10.1) | 96.9 (+6.0) |
Key Findings: PCDF-augmented VLMs achieve an average F1 gain of 11.48 points, with general-purpose VLMs benefiting most (e.g., InternVL3-2B gains +37.2 F1 on DermaMNIST).
Ablation Study¶
Dialogue Length Analysis (Gemma3 + mPLUG-Owl3):
| Turns T | DermaMNIST F1 | PneumoniaMNIST F1 | RetinaMNIST F1 | PathMNIST F1 |
|---|---|---|---|---|
| 2 | 63.5 | 78.8 | 27.8 | 59.1 |
| 4 | 70.3 | 80.3 | 36.6 | 49.5 |
| 6 | 71.9 | 91.7 | 44.1 | 71.8 |
| 8 | 81.9 | 99.0 | 67.7 | 90.2 |
PatientVLM Selection Analysis (DocVLM = Qwen2.5-VL-7B):
| PatientVLM | Avg. F1 | Notes |
|---|---|---|
| None (image-only SFT) | 61.8 | Baseline without simulated dialogue |
| mPLUG-Owl3 | 73.3 | Best PatientVLM |
| InternVL3 | 70.1 | Second best |
| Qwen2.5-VL | 72.7 | Same backbone as DocVLM, different role |
| MedGemma | 70.5 | Medical-domain model |
Key Findings¶
- General-purpose VLMs benefit more: InternVL3 gains +37.2 F1 on DermaMNIST, likely because models lacking medical domain pre-training have the most headroom
- Longer dialogues yield better performance: T from 2 to 8 increases RetinaMNIST F1 from 27.8 to 67.7 (+39.9 absolute)
- PCDF outperforms CoT reasoning: on MedGemma, PCDF exceeds zero-shot CoT prompting by an average of 23.6 F1
- Clinical validation: 96.9% of simulated dialogues are rated as clinically relevant, with no cases of diagnostic leakage
Highlights & Insights¶
- Dual-VLM role separation is an elegant design — it preserves the information asymmetry between doctor and patient inherent in real consultations, yielding greater authenticity than single-model generation
- Model-agnostic: PCDF is applicable to arbitrary VLMs without architectural modification, requiring only LoRA fine-tuning
- Even medically specialized models such as MedGemma benefit, indicating that dialogue-based supervisory signals are complementary to conventional domain adaptation
- Zero-cost clinical dialogue data: No real doctor–patient dialogues are required, entirely circumventing the ethical and financial barriers to data collection
Limitations & Future Work¶
- Clinical validation is limited in scale (210 cases), necessitating larger and more diverse evaluations
- DocVLM-generated questions tend toward professional terminology, potentially difficult for lay patients to understand
- Current support is limited to English, constraining applicability in multilingual healthcare settings
- MedMNIST datasets are relatively straightforward; validation in more complex clinical scenarios (e.g., multi-morbidity) is lacking
- The quality of symptom generation by PatientVLM depends on the underlying VLM's medical knowledge
Related Work & Insights¶
- Unlike evaluation-oriented works such as MediQ and 3MDBench, PCDF is a training framework
- Inspired by real clinical consultation workflows: physicians do not only read images but also elicit symptoms
- Future work could extend PCDF to multimodal settings (incorporating laboratory test data) or multilingual scenarios
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Dual-VLM dialogue simulation for medical consultation is an entirely novel framework design)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Four datasets + four VLMs + multi-dimensional ablation)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear problem motivation and naturally presented methodology)
- Value: ⭐⭐⭐⭐ (Demonstrates practical application potential in medical AI)