Text-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation
Conference: ACL 2026 · arXiv: 2604.13331 · Code: None · Area: Electronic Health Records / Graph Learning · Keywords: Medical Concept Representation, Knowledge Graph, LLM-GNN Joint Learning, Electronic Health Records, Text-Attributed Graph
TL;DR
This paper proposes CoMed, an LLM-empowered graph learning framework that constructs a global medical knowledge graph by combining EHR statistical evidence with type-constrained LLM inference, then enriches it into a text-attributed graph via LLM-generated node descriptions and edge rationales. A LoRA-finetuned LLaMA encoder is jointly trained with a heterogeneous GNN to learn unified medical concept embeddings, yielding significant improvements in diagnosis prediction on MIMIC-III/IV.
Background & Motivation
Background: Learning high-quality medical concept representations (embeddings of diagnosis, medication, and procedure codes) from EHR data is fundamental to clinical prediction. Existing methods primarily leverage hierarchical structures in medical ontologies (e.g., parent-child relationships in ICD) or limited cross-type semantics (e.g., UMLS) to construct knowledge graphs for guiding representation learning.
Limitations of Prior Work: (1) Cross-type dependencies (e.g., diagnosis–medication treatment relationships, drug–procedure associations) are largely missing or incomplete in existing ontologies; (2) rich clinical semantics encoded in text are difficult to integrate with KG structure; (3) unconstrained LLM prompting may produce plausible but unsupported edges with inconsistent outputs.
Key Challenge: LLMs encode broad biomedical knowledge, yet KG inference for clinical modeling must remain evidence-grounded, type-aware, and globally consistent — requiring a balance between the semantic richness of LLMs and the empirical grounding of EHR data.
Goal: To construct a clinically interpretable and empirically grounded heterogeneous KG, and to learn unified medical concept embeddings that integrate textual semantics with graph structure.
Key Insight: Statistically significant code pairs are first extracted from EHR data as candidate relations, and LLMs then infer semantic relation types conditioned on type constraints and statistical evidence — a dual-safeguard strategy of "statistical filtering + LLM inference."
Core Idea: EHR statistical evidence provides an empirical foundation while LLMs supply semantic interpretation and relation typing — the two are complementary for KG construction, followed by LLM-GNN joint learning to fuse textual and structural information.
Method
Overall Architecture
CoMed proceeds in four stages: (1) extracting co-occurrence and temporal transition statistics from EHR data, retaining statistically significant code pairs; (2) using type-constrained LLM prompting to infer directed relation types, confidence scores, and rationales for each code pair; (3) enriching the KG with LLM-generated node descriptions and edge metadata; (4) jointly training a LoRA-finetuned LLaMA-1B encoder and a heterogeneous GNN to learn concept embeddings.
Key Designs
- EHR Statistical Evidence Extraction and Filtering:
- Function: Identify empirically supported candidate relations from the data.
- Mechanism: Three statistics are computed for each code pair — smoothed conditional probability, PMI association, and chi-square independence-test p-value — under both intra-visit co-occurrence and inter-visit temporal transition settings. Code pairs with low support, low association, or non-significant p-values (\(p > 0.05\)) are filtered out (see the sketch after this item).
- Design Motivation: Pure LLM inference is prone to hallucination; statistical filtering ensures that every candidate edge is grounded in actual observations within the target EHR dataset — relations are not only "clinically plausible" but also "empirically attested in this dataset."
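A minimal sketch of this filtering stage for the intra-visit co-occurrence setting; the smoothing constant, thresholds, and the use of SciPy's chi-square test are illustrative assumptions (the paper reports eight indicators in total, not all reproduced here):

```python
from itertools import combinations
from collections import Counter
from math import log

from scipy.stats import chi2_contingency

def filter_candidate_pairs(visits, min_support=10, min_pmi=0.0, alpha=0.05, smooth=1.0):
    """Keep code pairs with significant intra-visit co-occurrence.

    visits: list of visits, each an iterable of medical codes.
    Thresholds and Laplace smoothing are illustrative, not the paper's values.
    """
    n = len(visits)
    code_count, pair_count = Counter(), Counter()
    for visit in visits:
        codes = sorted(set(visit))
        code_count.update(codes)
        pair_count.update(combinations(codes, 2))

    kept = {}
    for (a, b), n_ab in pair_count.items():
        if n_ab < min_support:
            continue  # too few joint observations to trust
        n_a, n_b = code_count[a], code_count[b]
        cond_prob = (n_ab + smooth) / (n_a + 2 * smooth)   # smoothed P(b | a)
        pmi = log(n_ab * n / (n_a * n_b))                  # pointwise mutual information
        # chi-square independence test on the 2x2 contingency table
        table = [[n_ab, n_a - n_ab],
                 [n_b - n_ab, n - n_a - n_b + n_ab]]
        _, p_value, _, _ = chi2_contingency(table)
        if pmi >= min_pmi and p_value <= alpha:
            kept[(a, b)] = {"cond_prob": cond_prob, "pmi": pmi, "p": p_value}
    return kept
```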
- Type-Constrained LLM Relation Inference:
- Function: Infer semantic relation types for statistically significant code pairs.
- Mechanism: A predefined candidate relation pool is specified for each combination of code types (dx-dx, rx-dx, px-dx, etc.), containing labels such as causes, treats, and diagnostic_of. The structured prompt includes code identifiers, frequency information, and eight statistical indicators with explanations; the LLM returns a relation label, a directed triple, a confidence score, and a 50–60-word clinical rationale (see the prompt sketch after this item).
- Design Motivation: Type constraints prevent semantically implausible relations (e.g., a diagnosis "treating" another diagnosis), while evidence conditioning lets the LLM integrate clinical knowledge with statistical signals. Clinical expert auditing of 50 edges yielded a mean rating of 4.84/5, confirming the quality of the inferred relations.
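A sketch of how such type-constrained prompting could be assembled; the relation labels causes, treats, and diagnostic_of come from the paper, but the pool composition per type pair, the prompt wording, and the JSON reply schema are assumptions:

```python
import json

# Candidate relation pools per (head_type, tail_type) combination.
# The pool contents per type pair are an illustrative assumption.
RELATION_POOLS = {
    ("dx", "dx"): ["causes", "complicates", "co_occurs_with"],
    ("rx", "dx"): ["treats", "indicated_for", "side_effect_of"],
    ("px", "dx"): ["diagnostic_of", "treats", "performed_for"],
}

PROMPT_TEMPLATE = """You are a clinical knowledge expert.
Head code ({head_type}): {head_code} - {head_name}
Tail code ({tail_type}): {tail_code} - {tail_name}
Statistical evidence from the EHR dataset:
{evidence}
Choose exactly one relation from: {pool}.
Respond in JSON with keys: relation, direction (head->tail or tail->head),
confidence (0-1), rationale (50-60 words)."""

def build_prompt(head, tail, stats):
    """Assemble a type-constrained relation-inference prompt.

    head/tail: dicts with 'type', 'code', 'name'; stats: dict of the
    statistical indicators (e.g. cond_prob, pmi, p) for this pair.
    """
    pool = RELATION_POOLS[(head["type"], tail["type"])]
    evidence = "\n".join(f"- {k}: {v}" for k, v in stats.items())
    return PROMPT_TEMPLATE.format(
        head_type=head["type"], head_code=head["code"], head_name=head["name"],
        tail_type=tail["type"], tail_code=tail["code"], tail_name=tail["name"],
        evidence=evidence, pool=", ".join(pool),
    )

def parse_edge(llm_output):
    """Validate the LLM's JSON reply into an edge record (sketch only)."""
    edge = json.loads(llm_output)
    assert edge["relation"] in {r for pool in RELATION_POOLS.values() for r in pool}
    return edge
```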
- LLM-GNN Joint Learning (CoMed):
- Function: Fuse textual semantics and graph structure to learn unified concept embeddings.
- Mechanism: A LoRA-finetuned LLaMA-1B encodes node descriptions into text embeddings, which are projected into the GNN space via type-specific linear projections; a heterogeneous GNN then performs relation-aware message passing over the KG to produce final concept embeddings (see the encoder sketch after this item). The model is trained end-to-end with a two-stage LoRA update schedule — a "least-updated-first" strategy in early training ensures broad coverage of codes, while later stages mix low- and high-frequency codes.
- Design Motivation: GNNs excel at aggregating graph structure but do not interpret long-form text; LLMs encode semantics but do not exploit global relational constraints — joint learning enables mutual complementarity. The two-stage schedule addresses the under-updating of rare codes inherent in mini-batch training.
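A condensed sketch of the joint encoder, assuming a pooled-output text encoder in place of the LoRA-finetuned LLaMA-1B and a simple RGCN-style message-passing layer; dimensions and layer choices are illustrative, and the two-stage LoRA schedule is omitted:

```python
import torch
import torch.nn as nn

class CoMedStyleEncoder(nn.Module):
    """Sketch: LLM text embeddings -> type-specific projections ->
    relation-aware message passing. Not the paper's exact architecture."""

    def __init__(self, text_encoder, node_types, relations, llm_dim=2048, hid=256):
        super().__init__()
        self.text_encoder = text_encoder  # e.g. a LoRA-finetuned LLaMA-1B pooler
        # Type-specific linear projections from LLM space into GNN space
        self.project = nn.ModuleDict({t: nn.Linear(llm_dim, hid) for t in node_types})
        # One message transform per relation type (RGCN-style)
        self.rel_msg = nn.ModuleDict({r: nn.Linear(hid, hid) for r in relations})
        self.out = nn.Linear(hid, hid)

    def forward(self, descriptions, node_type, edges):
        # descriptions: list of node description strings (one per node)
        # node_type: list of type names ("dx"/"rx"/"px"), one per node
        # edges: dict relation -> (src_idx, dst_idx) LongTensor pair
        text_emb = self.text_encoder(descriptions)           # [N, llm_dim]
        h = torch.stack([self.project[t](e) for t, e in zip(node_type, text_emb)])
        agg = torch.zeros_like(h)
        for rel, (src, dst) in edges.items():
            msg = self.rel_msg[rel](h[src])                  # transform per relation
            agg.index_add_(0, dst, msg)                      # aggregate at targets
        return torch.relu(self.out(h + agg))                 # fused concept embeddings
```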
Loss & Training
A multi-label cross-entropy loss is used for the next-visit diagnosis prediction task. CoMed serves as a plug-and-play concept encoder integrated into standard EHR models for end-to-end training.
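A minimal sketch of this objective, where BCEWithLogitsLoss plays the role of the multi-label cross-entropy; the backbone, head, and batch layout are placeholder assumptions:

```python
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # multi-label cross-entropy over diagnosis codes

def training_step(concept_encoder, backbone, head, visit_codes, next_dx):
    """visit_codes: patient visit sequences of code indices;
    next_dx: [batch, num_dx] multi-hot tensor of next-visit diagnoses."""
    code_emb = concept_encoder(visit_codes)   # CoMed as plug-and-play concept encoder
    logits = head(backbone(code_emb))         # any EHR backbone (Transformer, RETAIN, ...)
    return criterion(logits, next_dx.float())
```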
Key Experimental Results
Main Results
Diagnosis Prediction Performance on MIMIC-III
| Method | AUPRC (%) | F1 (%) | Acc@15 (%) |
|---|---|---|---|
| Base Transformer | 41.00 | 33.16 | 47.20 |
| GRAM | 41.70 | 34.60 | 48.60 |
| LINKO | 44.91 | 38.20 | 52.30 |
| GraphCare | 43.35 | 35.46 | 52.76 |
| CoMed | 47.21 | 42.28 | 54.20 |
Ablation Study
Plug-and-Play Analysis (AUPRC When CoMed Is Integrated into Different Backbones)
| Backbone | w/o CoMed | w/ CoMed | Gain |
|---|---|---|---|
| Transformer | 41.00 | 47.21 | +6.21 |
| RETAIN | ~40 | ~46 | +6 |
| GRAM | 41.70 | ~47 | +5 |
Key Findings
- CoMed improves AUPRC from 41.00 to 47.21 (+6.21) on MIMIC-III, outperforming all baselines.
- Gains are particularly pronounced for rare diagnosis labels (0–25% frequency) — from 40.60 to 47.67 (+7.07) — as KG relations allow rare concepts to borrow information from associated concepts.
- CoMed consistently improves performance across multiple backbones as a plug-and-play concept encoder.
- Clinical experts rated LLM-inferred edges at 4.84 ± 0.29 / 5, validating the clinical validity of the KG.
- Consistent improvements on MIMIC-IV demonstrate cross-dataset generalizability.
Highlights & Insights
- The dual-safeguard KG construction strategy of "statistical filtering + LLM inference" ensures both empirical grounding and semantic plausibility.
- The two-stage LoRA update schedule elegantly addresses the training imbalance induced by the long-tailed distribution of medical codes.
- The substantial gains on rare diagnoses carry significant clinical implications — rare diseases are often the hardest to predict and the most in need of attention.
Limitations & Future Work
- LLM-generated node descriptions and relational rationales may contain subtle hallucinations or biases.
- Evaluation is limited to the diagnosis prediction task; effectiveness on medication recommendation, readmission prediction, and other tasks remains unverified.
- KG construction relies on statistics derived from the target dataset, meaning different hospitals' EHR data may yield different KGs.
- The text encoding capacity of LLaMA-1B is limited; larger LLMs may produce higher-quality embeddings.
Related Work & Insights
- vs. GRAM: GRAM relies solely on the ICD hierarchy, whereas CoMed introduces cross-type relations and textual semantics — AUPRC gain of +5.51.
- vs. GraphCare: GraphCare employs an external medical KG without alignment to EHR data; CoMed ensures empirical grounding through statistical filtering.
- vs. LINKO: LINKO constructs a KG via link prediction but does not incorporate textual semantics; CoMed's LLM-GNN joint learning provides a more comprehensive integration.
Rating
- Novelty: ⭐⭐⭐⭐ The KG construction paradigm combining EHR statistics with LLM inference, and the LLM-GNN joint learning framework, are both novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ MIMIC-III/IV × multiple baselines + plug-and-play analysis + clinical expert validation.
- Writing Quality: ⭐⭐⭐⭐ The methodological pipeline is clearly presented, with explicit motivation for each design choice.
- Value: ⭐⭐⭐⭐⭐ The plug-and-play concept encoder offers high practical value for the EHR research community.