
MedMKG: Benchmarking Medical Knowledge Exploitation with Multimodal Knowledge Graph

Conference: NeurIPS 2025 | arXiv: 2505.17214 | Code: GitHub | Area: Medical Imaging | Keywords: Multimodal Knowledge Graph, Medical VQA, Text-Image Retrieval, Knowledge Augmentation, Link Prediction

TL;DR

This paper constructs MedMKG, a medical multimodal knowledge graph that integrates MIMIC-CXR imaging data with UMLS clinical concepts, proposes a Neighbor-aware Filtering (NaF) algorithm for image selection, and conducts comprehensive benchmarking of 24 baseline methods across three tasks: link prediction, text-image retrieval, and VQA.

Background & Motivation

Medical deep learning models heavily rely on domain knowledge for knowledge-intensive clinical tasks. Existing approaches primarily leverage unimodal knowledge graphs (e.g., UMLS) to enhance textual understanding, but achieve limited performance on multimodal clinical tasks such as VQA and text-image retrieval, due to the absence of explicit associations between visual data and clinical concepts.

Constructing a multimodal medical knowledge graph faces two major challenges:

Quality (C1): Accurately identifying and representing intra- and inter-modal relations requires a carefully designed construction pipeline.

Utility (C2): The graph must encode clinically meaningful multimodal knowledge that effectively improves downstream task performance.

Most existing multimodal knowledge graph construction pipelines rely on search engines or web crawlers, which lack sufficient precision for the medical domain. The core idea of this paper is to use UMLS as the backbone and extract cross-modal edges from MIMIC-CXR to construct a high-quality medical multimodal knowledge graph, with its practical utility validated through large-scale benchmark evaluation.

Method

Overall Architecture

MedMKG adopts a "modality expansion" strategy: starting from UMLS as a unimodal knowledge graph, a multi-stage pipeline extracts image nodes and cross-modal edges from MIMIC-CXR, forming a multimodal graph containing two types of nodes (clinical concepts and images) and two types of edges (intra-modal and cross-modal).
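To fix the terminology, here is a minimal sketch of this two-node-type, two-edge-type structure using a NetworkX multigraph with type attributes. The identifiers, concept names, and relation labels below are illustrative, not taken from the released graph.

```python
import networkx as nx

# Two node types (concept, image) and two edge types (intra-modal, cross-modal),
# stored as attributes on a directed multigraph. All IDs and labels are made up.
G = nx.MultiDiGraph()
G.add_node("C0032285", node_type="concept", name="Pneumonia")   # UMLS-style CUI
G.add_node("C0024117", node_type="concept", name="COPD")
G.add_node("img_00017", node_type="image", source="MIMIC-CXR")

# Intra-modal edge: concept -> concept, relation as annotated in UMLS
G.add_edge("C0024117", "C0032285", relation="associated_with", modality="intra")
# Cross-modal edge: image -> concept, labeled with a polarity (Positive/Negative/Uncertain)
G.add_edge("img_00017", "C0032285", relation="Positive", modality="cross")

print(G.number_of_nodes(), G.number_of_edges())  # 3 2
```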

Key Designs

  1. Two-Stage Concept Extraction (addressing C1)

Exploiting the complementary strengths of rule-based systems and LLMs:

  • Stage I (Concept Recognition): MetaMap is applied to each radiology report to match candidate UMLS concepts, providing broad coverage of domain terminology; semantically irrelevant types are filtered out using domain knowledge.
  • Stage II (Concept Disambiguation): ChatGPT-4o uses the full report context together with the candidate list to select the most contextually appropriate concept for each mention, leveraging the LLM's semantic understanding to resolve ambiguity.

This "broad coverage, then fine selection" design ensures both accuracy and completeness of cross-modal edges.

  2. Relation Extraction

  • Intra-modal relations: retrieved directly from the UMLS database as its annotated inter-concept relations.

  • Cross-modal relations: concurrently with concept disambiguation, the LLM determines the semantic polarity (Positive/Negative/Uncertain) of each image–concept relation, and every cross-modal edge is annotated accordingly.

  3. Neighbor-aware Filtering (NaF)

The fully constructed graph is large in scale, with many redundant images capturing similar regions. NaF selects the most informative images by balancing connectivity and uniqueness.

The informativeness score for image \(m\) is defined as:

\[\text{NaF}(m) = \sum_{(r,c) \in \mathcal{N}_m} \log \frac{M}{|\mathcal{N}_{(r,c)}|}\]

where \(\mathcal{N}_m\) is the 1-hop neighbor set of image \(m\), \(M\) is the total number of images in the graph, and \(\mathcal{N}_{(r,c)}\) is the set of images linked to concept \(c\) via relation \(r\). Intuitively, if a (relation, concept) pair is associated with only a few images, those images carry more unique clinical information — analogous to the TF-IDF weighting scheme.

Images are ranked by NaF score in descending order and selected greedily until all concepts are covered, achieving redundancy reduction while preserving knowledge completeness.
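To make the scoring and greedy selection concrete, here is a minimal Python sketch. The triple-list input, the per-concept coverage criterion, and the stopping rule are my reading of the description above; the authors' released implementation may differ in these details.

```python
import math
from collections import defaultdict

def naf_select(cross_edges):
    """cross_edges: list of (image_id, relation, concept_id) triples."""
    neighbors = defaultdict(set)    # image -> {(relation, concept)} 1-hop neighbors
    pair_images = defaultdict(set)  # (relation, concept) -> images linked to it
    for img, rel, con in cross_edges:
        neighbors[img].add((rel, con))
        pair_images[(rel, con)].add(img)

    M = len(neighbors)  # total number of images in the graph

    # NaF(m) = sum over (r, c) neighbors of log(M / |N_(r,c)|)  (IDF-style weighting)
    score = {img: sum(math.log(M / len(pair_images[p])) for p in pairs)
             for img, pairs in neighbors.items()}

    all_concepts = {con for _, _, con in cross_edges}
    selected, covered = [], set()
    # Rank images by score and greedily keep those that cover new concepts.
    for img in sorted(score, key=score.get, reverse=True):
        img_concepts = {c for _, c in neighbors[img]}
        if img_concepts - covered:
            selected.append(img)
            covered |= img_concepts
        if covered == all_concepts:
            break
    return selected
```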

Graph Statistics & Quality

MedMKG final statistics: 3,149 concept nodes, 4,868 image nodes, 262 relation types, and 35,387 edges (including 20,705 cross-modal edges). Human quality evaluation achieves approximately 80% average score across three dimensions: concept coverage, relation correctness, and image diversity.

Key Experimental Results

Link Prediction on MedMKG

| Model  | Head Hits@10 ↑ | Rel Hits@10 ↑ | Tail Hits@10 ↑ |
|--------|----------------|---------------|----------------|
| TransD | 11.89          | 48.53         | 18.87          |
| TransE | 9.58           | 41.05         | 14.21          |
| TransH | 9.15           | 41.61         | 15.03          |
| TuckER | 6.92           | 65.28         | 9.75           |
| AttH   | 0.20           | 63.08         | 14.80          |
| ConvE  | 4.27           | 41.02         | 10.79          |

Knowledge-Augmented Text-Image Retrieval (MedCSPCLIP backbone)

| Method          | OpenI P@10 | OpenI R@100 | MIMIC P@10 | MIMIC R@100 |
|-----------------|------------|-------------|------------|-------------|
| MedCSPCLIP      | 1.60       | 52.14       | 3.77       | 81.58       |
| + FashionKLIP   | 1.81       | 57.65       | 4.02       | 84.98       |
| + KnowledgeCLIP | 1.90       | 59.55       | 4.95       | 88.99       |

Knowledge-Augmented VQA (MedCSPCLIP backbone)

| Method     | VQA-RAD Acc | SLAKE Acc | PathVQA Acc |
|------------|-------------|-----------|-------------|
| MedCSPCLIP | 68.13       | 66.20     | 77.72       |
| + MR-MKG   | 78.49       | 83.94     | 86.53       |
| + KRISP    | 80.08       | 70.70     | 83.19       |
| + EKGRL    | 76.10       | 69.30     | 84.92       |

Key Findings

  1. Translation-based models perform best on multimodal KGs: TransD achieves the best overall performance in link prediction, suggesting that translation-based models are better suited for heterogeneous multimodal graph structures. Tensor decomposition models (SimplE, RESCAL, etc.) generally underperform.
  2. Knowledge augmentation is broadly effective: Integrating MedMKG consistently improves performance across retrieval and VQA tasks, with more pronounced gains at smaller top-K values.
  3. Divergence between pre-training and fine-tuning strategies: KnowledgeCLIP (pre-training fusion) shows larger advantages on MIMIC-CXR, while FashionKLIP (joint fine-tuning) yields greater improvements on OpenI.
  4. Contrastive learning fusion is most robust: MR-MKG achieves visual-knowledge alignment via contrastive learning and demonstrates the most stable performance across different backbones and datasets.

Highlights & Insights

  • MedMKG is the first multimodal medical knowledge graph to integrate chest X-ray images with UMLS clinical knowledge.
  • The NaF algorithm effectively adapts the IDF concept from information retrieval to address image redundancy in a simple yet principled manner.
  • The two-stage concept extraction pipeline (rule-based + LLM) fully exploits the complementary strengths of both approaches: rule systems provide broad coverage while LLMs offer deep semantic understanding.
  • The breadth of the benchmark is impressive: 3 tasks × 2 settings × 24 baselines × 4 backbones × 6 datasets.

Limitations & Future Work

  • The current graph is limited to chest X-rays (MIMIC-CXR); extension to additional modalities and anatomical regions is desirable.
  • NaF operates only at the image selection level and does not address redundancy on the concept side.
  • Knowledge fusion strategies are relatively straightforward, lacking a backbone-agnostic adaptive framework.
  • The three-way classification of cross-modal relations (Positive/Negative/Uncertain) is coarse-grained and could be refined further.
  • The MedMKG construction pipeline can serve as a reference template for building other medical multimodal knowledge graphs.
  • The NaF strategy is generalizable to redundancy reduction in other multimodal knowledge graph settings.
  • The comprehensive benchmark results provide an empirical basis for selecting knowledge augmentation methods in future work.
  • Future direction: a unified adaptive knowledge fusion framework applicable across both pre-training and fine-tuning stages.

Rating

  • Novelty: ⭐⭐⭐⭐ First-of-its-kind multimodal medical KG with a well-designed construction pipeline.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 24 baselines × 4 backbones × 6 datasets.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed experimental setup.
  • Value: ⭐⭐⭐⭐ High contribution as a resource and benchmark, though generalizability is constrained by the single-modality scope.