Text-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation
Conference: ACL 2026 · arXiv: 2604.13331 · Code: None · Area: Electronic Health Records / Graph Learning · Keywords: Medical Concept Representation, Knowledge Graph, LLM-GNN Joint Learning, Electronic Health Records, Text-Attributed Graph
TL;DR
This paper proposes CoMed, an LLM-empowered graph learning framework that constructs a global medical knowledge graph by combining EHR statistical evidence with type-constrained LLM inference, then enriches it into a text-attributed graph via LLM-generated node descriptions and edge rationales. A LoRA-finetuned LLaMA encoder is jointly trained with a heterogeneous GNN to learn unified medical concept embeddings, yielding significant improvements in diagnosis prediction on MIMIC-III/IV.
Background & Motivation
Background: Learning high-quality medical concept representations (embeddings of diagnosis, medication, and procedure codes) from EHR data is fundamental to clinical prediction. Existing methods primarily leverage hierarchical structures in medical ontologies (e.g., parent-child relationships in ICD) or limited cross-type semantics (e.g., UMLS) to construct knowledge graphs for guiding representation learning.
Limitations of Prior Work: (1) Cross-type dependencies (e.g., diagnosis–medication treatment relationships, drug–procedure associations) are largely missing or incomplete in existing ontologies; (2) rich clinical semantics encoded in text are difficult to integrate with KG structure; (3) unconstrained LLM prompting may produce plausible but unsupported edges with inconsistent outputs.
Key Challenge: LLMs encode broad biomedical knowledge, yet KG inference for clinical modeling must remain evidence-grounded, type-aware, and globally consistent — requiring a balance between the semantic richness of LLMs and the empirical grounding of EHR data.
Goal: To construct a clinically interpretable and empirically grounded heterogeneous KG, and to learn unified medical concept embeddings that integrate textual semantics with graph structure.
Key Insight: Statistically significant code pairs are first extracted from EHR data as candidate relations, and LLMs then infer semantic relation types conditioned on type constraints and statistical evidence — a dual-safeguard strategy of "statistical filtering + LLM inference."
Core Idea: EHR statistical evidence provides an empirical foundation while LLMs supply semantic interpretation and relation typing — the two are complementary for KG construction, followed by LLM-GNN joint learning to fuse textual and structural information.
Method
Overall Architecture
CoMed proceeds in four stages: (1) extracting co-occurrence and temporal transition statistics from EHR data, retaining statistically significant code pairs; (2) using type-constrained LLM prompting to infer directed relation types, confidence scores, and rationales for each code pair; (3) enriching the KG with LLM-generated node descriptions and edge metadata; (4) jointly training a LoRA-finetuned LLaMA-1B encoder and a heterogeneous GNN to learn concept embeddings.
Key Designs
- EHR Statistical Evidence Extraction and Filtering:
- Function: Identify empirically supported candidate relations from the data.
- Mechanism: Three statistics are computed for each code pair — smoothed conditional probability, PMI association, and chi-square independence-test p-value — under both intra-visit co-occurrence and inter-visit temporal transition settings. Code pairs with low support, low association, or non-significant p-values (\(p > 0.05\)) are filtered out (see the sketch after this item).
- Design Motivation: Pure LLM inference is prone to hallucination; statistical filtering ensures that every candidate edge is grounded in actual observations within the target EHR dataset — relations are not only "clinically plausible" but also "empirically attested in this dataset."
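A minimal sketch of this filtering stage for the intra-visit co-occurrence setting; the smoothing constant, thresholds, and the use of SciPy's chi-square test are illustrative assumptions (the paper reports eight indicators in total, not all reproduced here):

```python
from itertools import combinations
from collections import Counter
from math import log

from scipy.stats import chi2_contingency

def filter_candidate_pairs(visits, min_support=10, min_pmi=0.0, alpha=0.05, smooth=1.0):
    """Keep code pairs with significant intra-visit co-occurrence.

    visits: list of visits, each an iterable of medical codes.
    Thresholds and Laplace smoothing are illustrative, not the paper's values.
    """
    n = len(visits)
    code_count, pair_count = Counter(), Counter()
    for visit in visits:
        codes = sorted(set(visit))
        code_count.update(codes)
        pair_count.update(combinations(codes, 2))

    kept = {}
    for (a, b), n_ab in pair_count.items():
        if n_ab < min_support:
            continue  # too few joint observations to trust
        n_a, n_b = code_count[a], code_count[b]
        cond_prob = (n_ab + smooth) / (n_a + 2 * smooth)   # smoothed P(b | a)
        pmi = log(n_ab * n / (n_a * n_b))                  # pointwise mutual information
        # chi-square independence test on the 2x2 contingency table
        table = [[n_ab, n_a - n_ab],
                 [n_b - n_ab, n - n_a - n_b + n_ab]]
        _, p_value, _, _ = chi2_contingency(table)
        if pmi >= min_pmi and p_value <= alpha:
            kept[(a, b)] = {"cond_prob": cond_prob, "pmi": pmi, "p": p_value}
    return kept
```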
- Type-Constrained LLM Relation Inference:
- Function: Infer semantic relation types for statistically significant code pairs.
- Mechanism: A predefined candidate relation pool is specified for each combination of code types (dx-dx, rx-dx, px-dx, etc.), containing labels such as causes, treats, and diagnostic_of. The structured prompt includes code identifiers, frequency information, and eight statistical indicators with explanations; the LLM returns a relation label, a directed triple, a confidence score, and a 50–60-word clinical rationale (see the prompt sketch after this item).
- Design Motivation: Type constraints prevent semantically implausible relations (e.g., a diagnosis "treating" another diagnosis), while evidence conditioning lets the LLM integrate clinical knowledge with statistical signals. Clinical expert auditing of 50 edges yielded a mean rating of 4.84/5, confirming the quality of the inferred relations.
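A sketch of how such type-constrained prompting could be assembled; the relation labels causes, treats, and diagnostic_of come from the paper, but the pool composition per type pair, the prompt wording, and the JSON reply schema are assumptions:

```python
import json

# Candidate relation pools per (head_type, tail_type) combination.
# The pool contents per type pair are an illustrative assumption.
RELATION_POOLS = {
    ("dx", "dx"): ["causes", "complicates", "co_occurs_with"],
    ("rx", "dx"): ["treats", "indicated_for", "side_effect_of"],
    ("px", "dx"): ["diagnostic_of", "treats", "performed_for"],
}

PROMPT_TEMPLATE = """You are a clinical knowledge expert.
Head code ({head_type}): {head_code} - {head_name}
Tail code ({tail_type}): {tail_code} - {tail_name}
Statistical evidence from the EHR dataset:
{evidence}
Choose exactly one relation from: {pool}.
Respond in JSON with keys: relation, direction (head->tail or tail->head),
confidence (0-1), rationale (50-60 words)."""

def build_prompt(head, tail, stats):
    """Assemble a type-constrained relation-inference prompt.

    head/tail: dicts with 'type', 'code', 'name'; stats: dict of the
    statistical indicators (e.g. cond_prob, pmi, p) for this pair.
    """
    pool = RELATION_POOLS[(head["type"], tail["type"])]
    evidence = "\n".join(f"- {k}: {v}" for k, v in stats.items())
    return PROMPT_TEMPLATE.format(
        head_type=head["type"], head_code=head["code"], head_name=head["name"],
        tail_type=tail["type"], tail_code=tail["code"], tail_name=tail["name"],
        evidence=evidence, pool=", ".join(pool),
    )

def parse_edge(llm_output):
    """Validate the LLM's JSON reply into an edge record (sketch only)."""
    edge = json.loads(llm_output)
    assert edge["relation"] in {r for pool in RELATION_POOLS.values() for r in pool}
    return edge
```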
- LLM-GNN Joint Learning (CoMed):
- Function: Fuse textual semantics and graph structure to learn unified concept embeddings.
- Mechanism: A LoRA-finetuned LLaMA-1B encodes node descriptions into text embeddings, which are projected into the GNN space via type-specific linear projections; a heterogeneous GNN then performs relation-aware message passing over the KG to produce final concept embeddings (see the encoder sketch after this item). The model is trained end-to-end with a two-stage LoRA update schedule — a "least-updated-first" strategy in early training ensures broad coverage of codes, while later stages mix low- and high-frequency codes.
- Design Motivation: GNNs excel at aggregating graph structure but do not interpret long-form text; LLMs encode semantics but do not exploit global relational constraints — joint learning enables mutual complementarity. The two-stage schedule addresses the under-updating of rare codes inherent in mini-batch training.
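A condensed sketch of the joint encoder, assuming a pooled-output text encoder in place of the LoRA-finetuned LLaMA-1B and a simple RGCN-style message-passing layer; dimensions and layer choices are illustrative, and the two-stage LoRA schedule is omitted:

```python
import torch
import torch.nn as nn

class CoMedStyleEncoder(nn.Module):
    """Sketch: LLM text embeddings -> type-specific projections ->
    relation-aware message passing. Not the paper's exact architecture."""

    def __init__(self, text_encoder, node_types, relations, llm_dim=2048, hid=256):
        super().__init__()
        self.text_encoder = text_encoder  # e.g. a LoRA-finetuned LLaMA-1B pooler
        # Type-specific linear projections from LLM space into GNN space
        self.project = nn.ModuleDict({t: nn.Linear(llm_dim, hid) for t in node_types})
        # One message transform per relation type (RGCN-style)
        self.rel_msg = nn.ModuleDict({r: nn.Linear(hid, hid) for r in relations})
        self.out = nn.Linear(hid, hid)

    def forward(self, descriptions, node_type, edges):
        # descriptions: list of node description strings (one per node)
        # node_type: list of type names ("dx"/"rx"/"px"), one per node
        # edges: dict relation -> (src_idx, dst_idx) LongTensor pair
        text_emb = self.text_encoder(descriptions)           # [N, llm_dim]
        h = torch.stack([self.project[t](e) for t, e in zip(node_type, text_emb)])
        agg = torch.zeros_like(h)
        for rel, (src, dst) in edges.items():
            msg = self.rel_msg[rel](h[src])                  # transform per relation
            agg.index_add_(0, dst, msg)                      # aggregate at targets
        return torch.relu(self.out(h + agg))                 # fused concept embeddings
```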
Loss & Training
A multi-label cross-entropy loss is used for the next-visit diagnosis prediction task. CoMed serves as a plug-and-play concept encoder integrated into standard EHR models for end-to-end training.
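A minimal sketch of this objective, where BCEWithLogitsLoss plays the role of the multi-label cross-entropy; the backbone, head, and batch layout are placeholder assumptions:

```python
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # multi-label cross-entropy over diagnosis codes

def training_step(concept_encoder, backbone, head, visit_codes, next_dx):
    """visit_codes: patient visit sequences of code indices;
    next_dx: [batch, num_dx] multi-hot tensor of next-visit diagnoses."""
    code_emb = concept_encoder(visit_codes)   # CoMed as plug-and-play concept encoder
    logits = head(backbone(code_emb))         # any EHR backbone (Transformer, RETAIN, ...)
    return criterion(logits, next_dx.float())
```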
Key Experimental Results
Main Results
Diagnosis Prediction Performance on MIMIC-III
| Method | AUPRC (%) | F1 (%) | Acc@15 (%) |
|---|---|---|---|
| Base Transformer | 41.00 | 33.16 | 47.20 |
| GRAM | 41.70 | 34.60 | 48.60 |
| LINKO | 44.91 | 38.20 | 52.30 |
| GraphCare | 43.35 | 35.46 | 52.76 |
| CoMed | 47.21 | 42.28 | 54.20 |
Ablation Study
Plug-and-Play Analysis (AUPRC When CoMed Is Integrated into Different Backbones)
| Backbone | w/o CoMed | w/ CoMed | Gain |
|---|---|---|---|
| Transformer | 41.00 | 47.21 | +6.21 |
| RETAIN | ~40 | ~46 | +6 |
| GRAM | 41.70 | ~47 | +5 |
Key Findings
- CoMed improves AUPRC from 41.00 to 47.21 (+6.21) on MIMIC-III, outperforming all baselines.
- Gains are particularly pronounced for rare diagnosis labels (0–25% frequency) — from 40.60 to 47.67 (+7.07) — as KG relations allow rare concepts to borrow information from associated concepts.
- CoMed consistently improves performance across multiple backbones as a plug-and-play concept encoder.
- Clinical experts rated LLM-inferred edges at 4.84 ± 0.29 / 5, validating the clinical validity of the KG.
- Consistent improvements on MIMIC-IV demonstrate cross-dataset generalizability.
Highlights & Insights
- The dual-safeguard KG construction strategy of "statistical filtering + LLM inference" ensures both empirical grounding and semantic plausibility.
- The two-stage LoRA update schedule elegantly addresses the training imbalance induced by the long-tailed distribution of medical codes.
- The substantial gains on rare diagnoses carry significant clinical implications — rare diseases are often the hardest to predict and the most in need of attention.
Limitations & Future Work
- LLM-generated node descriptions and relational rationales may contain subtle hallucinations or biases.
- Evaluation is limited to the diagnosis prediction task; effectiveness on medication recommendation, readmission prediction, and other tasks remains unverified.
- KG construction relies on statistics derived from the target dataset, meaning different hospitals' EHR data may yield different KGs.
- The text encoding capacity of LLaMA-1B is limited; larger LLMs may produce higher-quality embeddings.
Related Work & Insights
- vs. GRAM: GRAM relies solely on the ICD hierarchy, whereas CoMed introduces cross-type relations and textual semantics — AUPRC gain of +5.51.
- vs. GraphCare: GraphCare employs an external medical KG without alignment to EHR data; CoMed ensures empirical grounding through statistical filtering.
- vs. LINKO: LINKO constructs a KG via link prediction but does not incorporate textual semantics; CoMed's LLM-GNN joint learning provides a more comprehensive integration.
Rating
- Novelty: ⭐⭐⭐⭐ The KG construction paradigm combining EHR statistics with LLM inference, and the LLM-GNN joint learning framework, are both novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ MIMIC-III/IV × multiple baselines + plug-and-play analysis + clinical expert validation.
- Writing Quality: ⭐⭐⭐⭐ The methodological pipeline is clearly presented, with explicit motivation for each design choice.
- Value: ⭐⭐⭐⭐⭐ The plug-and-play concept encoder offers high practical value for the EHR research community.