Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning¶
Conference: CVPR 2026 · arXiv: 2603.08921 · Code: None · Area: Multimodal VLM · Keywords: Concept Bottleneck Models, Medical Imaging, Explainable AI, Clinical Guidelines, CLIP
TL;DR¶
This paper proposes MedCBR, a framework that integrates clinical diagnostic guidelines (e.g., BI-RADS) into the training and inference pipeline of concept bottleneck models. It leverages LVLMs to generate guideline-consistent reports for enhanced concept supervision, and combines multi-task CLIP training with a large reasoning model for structured clinical explanation generation. MedCBR achieves AUROCs of 94.2% and 84.0% on ultrasound (BUS-BRA) and mammography (CBIS-DDSM) cancer detection, respectively.
Background & Motivation¶
Background: Concept Bottleneck Models (CBMs) connect model predictions to human-interpretable concepts through an intermediate concept layer, representing a dominant paradigm in explainable AI and proving particularly valuable in medical imaging.
Limitations of Prior Work: Standard CBMs rely on discrete concept representations, neglecting broader clinical context such as diagnostic guidelines and expert heuristics, which leads to reduced reliability in complex cases. Specific issues include: (a) concept annotations are noisy and incomplete due to inter-observer variability; (b) CBMs fail to capture experience-driven reasoning, such as cases that appear benign but require holistic assessment within the context of clinical guidelines.
Key Challenge: CBMs require complete and noise-free concept annotations and assume that diagnostic reasoning is a deterministic function of concept presence—yet medical diagnosis depends on contextual information and structured reasoning embedded in clinical guidelines.
Goal: (a) Address concept annotation noise and incompleteness; (b) remedy the lack of clinical context in concept-to-diagnosis reasoning; (c) provide auditable explanations for model predictions.
Key Insight: Diagnostic reasoning is modeled as inference over multiple evidence sources rather than a direct function of concepts, with clinical guidelines introduced as a structured knowledge source.
Core Idea: Enrich concept representations via LVLM-generated guideline-consistent reports, combined with multi-task contrastive learning and a large reasoning model for interpretable diagnostic narrative generation.
Method¶
Overall Architecture¶
MedCBR comprises three stages: (1) guideline-driven concept enrichment—converting discrete concept labels into guideline-consistent textual reports using an LVLM; (2) vision-language concept modeling—training CLIP with multi-task objectives that jointly optimize cross-modal alignment, concept prediction, and diagnostic classification; (3) concept-based clinical reasoning—using a frozen large reasoning model (LRM) to integrate predicted concepts with guidelines to produce structured diagnostic explanations.
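The three stages above can be sketched end to end as follows. Every function below is a stand-in assumption for illustration only: the paper's actual components are an LVLM (stage 1), a multi-task-trained CLIP model (stage 2), and a frozen large reasoning model (stage 3).

```python
# Minimal runnable sketch of the three MedCBR stages; all function bodies are
# hypothetical stand-ins, not the paper's implementation.

def enrich_concepts(image_path, positive_concepts, label, guidelines):
    """Stage 1 stand-in: turn discrete concept labels into a guideline-style report."""
    findings = ", ".join(positive_concepts)
    return (f"Findings for {image_path}: {findings}. "
            f"Per guidelines ({guidelines}), consistent with: {label}.")

def predict(image_path, report):
    """Stage 2 stand-in: the trained model would return (y_hat, c_hat) from the image."""
    y_hat = 0.87                                  # predicted malignancy probability
    c_hat = {"irregular margin": 0.91,            # per-concept confidences
             "posterior shadowing": 0.64}
    return y_hat, c_hat

def explain(y_hat, c_hat, guidelines):
    """Stage 3 stand-in: the frozen LRM would generate the diagnostic narrative."""
    steps = "; ".join(f"{k} (confidence {v:.2f})" for k, v in sorted(c_hat.items()))
    return (f"Observed {steps}. Applying {guidelines}, "
            f"estimated malignancy risk is {y_hat:.2f}.")

report = enrich_concepts("case_001.png", ["irregular margin"], "malignant", "BI-RADS")
y_hat, c_hat = predict("case_001.png", report)
print(explain(y_hat, c_hat, "BI-RADS"))
```

Note that the enriched report from stage 1 is consumed only at training time (as the text side of the CLIP objective); at inference, stage 2 predicts directly from the image and stage 3 reasons over the structured outputs.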
Key Designs¶
- Guideline-Driven Concept Enrichment:
- Function: Transforms discrete concept vectors \(c\) into continuous, guideline-conditioned textual representations \(r\).
- Mechanism: An LVLM receives the image \(x\), the positive concept label set \(c^+\), the label \(y\), and clinical guidelines \(\mathcal{G}\), and generates a structured report describing visual findings and summarizing their diagnostic implications according to \(\mathcal{G}\).
- Design Motivation: Discrete concept labels merely indicate which findings are present and cannot express inter-concept relationships or diagnostic significance. LVLM-generated enriched reports capture contextual and relational semantics among concepts, providing a more consistent supervision signal.
- Multi-Task Vision-Language Concept Model:
- Function: Jointly learns image-text alignment, concept prediction, and diagnostic classification.
- Mechanism: Built on a CLIP backbone, the model is simultaneously optimized with three losses: a contrastive loss \(\mathcal{L}_{CLIP}\) aligning images with LVLM-generated reports; a diagnostic loss \(\mathcal{L}_y\) for cancer classification over visual embeddings; and a concept loss \(\mathcal{L}_c\) predicting individual concepts via \(N_c\) dedicated lightweight adapters. The total loss is \(\mathcal{L} = \lambda\mathcal{L}_{CLIP} + \mu\mathcal{L}_y + \nu\mathcal{L}_c\).
- Design Motivation: Multi-task training simultaneously enforces (i) cross-modal consistency, (ii) concept-level interpretability, and (iii) diagnostic discriminability, yielding representations that are both semantically rich and clinically grounded.
- Concept-Based Clinical Reasoning:
- Function: Converts model predictions into structured diagnostic narratives.
- Mechanism: A frozen LRM receives a structured prompt \(\pi = (\mathcal{Q}, \hat{y}, \hat{c}, \mathcal{G})\) comprising task instructions, predicted cancer probability, concept prediction confidences, and clinical guidelines, and generates a step-by-step diagnostic reasoning explanation.
- Design Motivation: Since the LRM operates on structured inputs and explicit guidelines \(\mathcal{G}\), its reasoning is anchored to verifiable clinical knowledge, reducing the risk of hallucination.
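The multi-task objective \(\mathcal{L} = \lambda\mathcal{L}_{CLIP} + \mu\mathcal{L}_y + \nu\mathcal{L}_c\) from stage (2) can be sketched with toy losses. The symmetric-InfoNCE form of the contrastive term and the binary cross-entropy used for the diagnostic and concept heads are illustrative assumptions; the paper specifies only the weighted-sum structure.

```python
# NumPy sketch of L = lambda*L_CLIP + mu*L_y + nu*L_c; the specific loss forms
# and weights are illustrative assumptions, not the paper's hyperparameters.
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/report embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))            # matched pairs sit on the diagonal
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()
    return 0.5 * (xent(logits) + xent(logits.T))

def bce(p, y):
    """Binary cross-entropy on probabilities in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def medcbr_loss(img_emb, txt_emb, y_prob, y_true, c_prob, c_true,
                lam=1.0, mu=1.0, nu=1.0):
    """Weighted sum of alignment, diagnosis, and per-concept losses."""
    return (lam * clip_loss(img_emb, txt_emb)   # image <-> enriched report
            + mu * bce(y_prob, y_true)          # cancer classification
            + nu * bce(c_prob, c_true))         # N_c concept adapters, averaged
```

In the paper the concept term is computed through \(N_c\) dedicated lightweight adapters; here the per-concept predictions are simply averaged into one BCE term for brevity.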
Key Experimental Results¶
Main Results — Cancer Detection¶
| Method | BUS-BRA (AUROC) | CBIS-DDSM (AUROC) | CUB-200 (Acc.) |
|---|---|---|---|
| CBM | 84.8 | 79.6 | 62.9 |
| CLIP ViT-L/14 | 93.5 | 82.4 | 85.7 |
| AdaCBM | 87.9 | 75.6 | 69.8 |
| Label-free CBM | 60.0 | 70.0 | 74.3 |
| MedCBR | 94.2 | 84.0 | 86.1 |
Ablation Study — Component Contributions¶
| Configuration | BUS-BRA | CBIS-DDSM | CUB-200 |
|---|---|---|---|
| CLIP ViT | 93.5 | 82.4 | 85.7 |
| CLIP+CBL | 91.8 | 81.8 | 67.0 |
| CLIP+CBL+Guideline | 92.0 | 83.1 | 72.9 |
| CLIP+MTL | 93.6 | 83.2 | 82.3 |
| CLIP+MTL+Guideline (MedCBR) | 94.2 | 84.0 | 86.1 |
Key Findings¶
- MedCBR consistently outperforms all CBM variants and vanilla CLIP across all three datasets, demonstrating the superiority of combining guideline-driven concept enrichment with multi-task learning.
- Introducing the concept bottleneck layer (CBL) alone degrades performance; however, incorporating guidelines recovers and surpasses the baseline, indicating that guideline information effectively compensates for the information loss induced by the bottleneck structure.
- Strong performance on CUB-200 bird classification (86.1%) validates the framework's generalizability beyond the medical domain.
- Concept-level detection performance is also consistently superior, with multi-modal supervision enabling the model to simultaneously capture visually grounded and modality-specific features.
Highlights & Insights¶
- Clinical Guidelines as a Structured Knowledge Source: Unlike prior work that treats concepts or guidelines as auxiliary context, MedCBR integrates guidelines throughout the entire pipeline from training to inference, ensuring that concept-to-decision reasoning is constrained and validated.
- LVLM-Driven Concept Enrichment: The framework cleverly leverages LVLMs to transform noisy and incomplete discrete annotations into high-quality structured reports, addressing the practical challenge of concept annotation in medical data.
- End-to-End Interpretable Pipeline: The full chain from image → concepts → guidelines → diagnostic explanation is auditable at every step, satisfying the stringent transparency requirements of clinical practice.
Limitations & Future Work¶
- The inference stage depends on an external frozen LRM, increasing deployment complexity and latency.
- Evaluation is limited to binary classification (benign/malignant); multi-class or finer-grained grading tasks have not been explored.
- Guidelines are provided as fixed text, with no exploration of dynamic retrieval or personalized guideline adaptation.
- The concept set relies on manual definition; extending the framework to new diseases requires domain experts to redefine the concept taxonomy.
- Radiologist evaluation covers only 20 cases, limiting statistical power.
Related Work & Insights¶
- vs. AdaCBM: AdaCBM mitigates CLIP domain shift via learnable adapters but does not incorporate clinical knowledge; MedCBR provides stronger inductive bias through guideline-driven training.
- vs. Label-free CBM: Automatically generated concepts may omit clinically important features or introduce spurious correlations; MedCBR constrains concept discovery through guidelines.
- vs. Agent-based methods (e.g., MAGDA, MedRAX): These approaches use guidelines or tools as reasoning aids but do not deeply integrate them into model training.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Deeply integrating clinical guidelines into CBM training and inference is a novel contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset validation with ablation and clinical evaluation, though the clinical evaluation sample size is limited.
- Writing Quality: ⭐⭐⭐⭐ — The framework is clearly presented, formulations are rigorous, and clinical relevance is well-motivated.
- Value: ⭐⭐⭐⭐ — Provides a practical guideline-integration paradigm for medical explainable AI.