A General Knowledge Injection Framework for ICD Coding¶

Conference: ACL 2025
arXiv: 2505.18708
Code: GitHub
Area: Knowledge Editing
Keywords: ICD Coding, Knowledge Injection, Multi-task Learning, Guideline Synthesis, Clinical Text Classification

TL;DR¶

This paper proposes GKI-ICD, a general knowledge injection framework. By employing guideline synthesis and multi-task learning mechanisms, it simultaneously integrates three types of ICD knowledge—Description, Synonym, and Hierarchy—without requiring extra network modules, achieving SOTA performance on the MIMIC-III benchmark.

Background & Motivation¶

Limitations of Prior Work¶

Research problem: ICD coding requires assigning codes from a very large label set to clinical texts. It faces two major challenges: the long-tail distribution of codes and the lack of code-level evidence annotation.

Background¶

Single knowledge type: Existing methods typically focus on only one type of knowledge (description, synonym, or hierarchical relation) and design a specialized module for it.

Key Challenge¶

Incompatible modules: The multi-synonym attention mechanism designed for synonyms and the graph neural network designed for hierarchical relations are difficult to integrate into a unified model.

Key Insight¶

Poor scalability: The complexity of specialized modules makes them difficult to extend to more advanced models.

Core Motivation: To design a general framework that injects three complementary types of ICD code knowledge (description, synonym, and hierarchy) in a unified manner, without relying on specialized network modules.

Method¶

Overall Architecture¶

GKI-ICD consists of two core components:

1. Guideline Synthesis: Synthesizes training guidelines from code knowledge, replacing specialized network modules.
2. Multi-task Learning: The model learns simultaneously from raw samples and synthesized guidelines, aligning them via a semantic consistency constraint.

Key Designs¶

1. Guideline Synthesis: Given a clinical document and its annotated set of ICD codes, the following steps are performed:

- Description Parsing: Extract the official ICD-9 description for each positive code, removing non-standard terms such as "NOS" (not otherwise specified).
- Synonym Substitution: Extract synonyms for each code from the UMLS knowledge base and randomly replace portions of the descriptions to increase diversity.
- Hierarchical Retrieval: Add the descriptions of the groups to which each code belongs (e.g., 038.9 \(\rightarrow\) 030-041 \(\rightarrow\) 001-139).
- Shuffling & Concatenation: Randomly shuffle the order of the codes and concatenate their texts into one long sequence, which serves as the synthesized guideline \(\hat{x}\).
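The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the lookup tables `DESCRIPTIONS`, `SYNONYMS`, and `GROUPS` are hand-written toy stand-ins for the ICD-9 code list, UMLS, and the ICD-9 group hierarchy, and the function names, the substitution probability `p_syn`, and the separators are all assumptions made for the example.

```python
import random
import re

# Toy knowledge tables (illustrative only). A real pipeline would load official
# ICD-9 descriptions, UMLS synonyms, and the ICD-9 group hierarchy instead.
DESCRIPTIONS = {
    "038.9": "Septicemia NOS",
    "428.0": "Congestive heart failure NOS",
}
SYNONYMS = {
    "septicemia": ["sepsis", "blood poisoning"],
    "congestive heart failure": ["CHF", "cardiac failure"],
}
GROUPS = {  # (first category, last category) -> group description
    (30, 41): "Other bacterial diseases",
    (1, 139): "Infectious and parasitic diseases",
    (420, 429): "Other forms of heart disease",
    (390, 459): "Diseases of the circulatory system",
}


def parse_description(code):
    """Step 1: fetch the description and drop the non-standard term 'NOS'."""
    text = re.sub(r"\bNOS\b", "", DESCRIPTIONS[code])
    return " ".join(text.split())


def substitute_synonym(text, rng, p_syn):
    """Step 2: with probability p_syn, swap a known term for a random synonym."""
    for term, alternatives in SYNONYMS.items():
        found = re.search(re.escape(term), text, flags=re.IGNORECASE)
        if found and rng.random() < p_syn:
            text = re.sub(re.escape(term), rng.choice(alternatives), text,
                          flags=re.IGNORECASE)
    return text


def retrieve_hierarchy(code):
    """Step 3: group descriptions from narrowest to broadest (numeric codes only)."""
    category = int(code.split(".")[0])
    ranges = sorted(GROUPS, key=lambda r: r[1] - r[0])
    return [GROUPS[(lo, hi)] for lo, hi in ranges if lo <= category <= hi]


def synthesize_guideline(codes, rng, p_syn=0.5):
    """Step 4: shuffle the codes and concatenate their texts into one guideline."""
    codes = list(codes)
    rng.shuffle(codes)
    parts = []
    for code in codes:
        desc = substitute_synonym(parse_description(code), rng, p_syn)
        parts.append("; ".join([desc, *retrieve_hierarchy(code)]))
    return ". ".join(parts)


if __name__ == "__main__":
    print(synthesize_guideline(["038.9", "428.0"], random.Random(0)))
```

Because the shuffle and the synonym choice both draw from `rng`, calling `synthesize_guideline` repeatedly for the same label set yields different surface forms, which is the source of the diversity that the substitution step is meant to provide.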

2. Multi-task Learning Mechanism:

- Raw Text Prediction: The standard ICD coding loss, \(L_{raw} = L_{BCE}(f(x), y)\).
- Guideline Prediction: The model must also predict the correct codes from the synthesized guideline, \(L_{guide} = L_{BCE}(f(\hat{x}), y)\).
- Semantic Consistency Constraint: The code-specific representations extracted from the raw text (\(E\)) and from the guideline (\(\hat{E}\)) are aligned, \(L_{sim} = 1 - \cos(E, \hat{E})\).

Loss & Training¶

\[L = L_{raw}(x, y) + L_{guide}(\hat{x}, y) + \lambda L_{sim}(E, \hat{E})\]

where \(\lambda\) weights the semantic consistency term, which narrows the gap between textbook-style code knowledge and the way conditions are actually expressed in clinical notes.
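To make the objective concrete, the sketch below evaluates the three terms for a single sample using only the Python standard library. It is an illustration under stated assumptions rather than the paper's code: the helper names are hypothetical, \(L_{sim}\) is assumed to be averaged over the code-specific vectors passed in, and the default `lam=1.0` is a placeholder, not a reported hyperparameter.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over codes, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(E, E_hat):
    """1 - cosine similarity, averaged over code-specific vectors (assumption)."""
    return sum(1.0 - cosine(e, e_hat) for e, e_hat in zip(E, E_hat)) / len(E)


def gki_loss(logits_raw, logits_guide, targets, E, E_hat, lam=1.0):
    """L = L_raw + L_guide + lam * L_sim for one (document, guideline) pair."""
    l_raw = bce_with_logits(logits_raw, targets)      # f(x) vs. y
    l_guide = bce_with_logits(logits_guide, targets)  # f(x_hat) vs. y
    l_sim = consistency_loss(E, E_hat)                # E vs. E_hat
    return l_raw + l_guide + lam * l_sim
```

Both prediction terms share the same label vector `targets`, which is what ties the synthesized guideline to the raw document; only the consistency term compares the two inputs directly.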

Key Experimental Results¶

Main Results¶

Results on MIMIC-III-Full:

Model	MacroAUC	MicroAUC	MacroF1	MicroF1	P@8
CAML	0.895	0.986	0.088	0.539	0.709
PLM-CA	0.916	0.989	0.103	0.599	0.772
MSMN	0.950	0.992	0.103	0.584	0.752
CoRelation	0.952	0.992	0.102	0.591	0.762
GKI-ICD	0.962	0.993	0.123	0.612	0.777

GKI-ICD achieves the best score on every listed metric on MIMIC-III-Full. Compared to the baseline PLM-CA, MacroAUC rises by 4.6 points in absolute terms (0.916 to 0.962), and MacroF1 rises by 19.4% in relative terms (0.103 to 0.123).
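The two gains quoted against PLM-CA are measured on different bases, which a quick check against the table values makes explicit. The snippet below only redoes the arithmetic; the dictionary and function names are chosen for the example.

```python
# Values copied from the PLM-CA and GKI-ICD rows of the results table.
baseline = {"MacroAUC": 0.916, "MacroF1": 0.103}
gki_icd = {"MacroAUC": 0.962, "MacroF1": 0.123}


def gains(metric):
    """Return (absolute gain, relative gain) of GKI-ICD over PLM-CA."""
    absolute = gki_icd[metric] - baseline[metric]
    relative = absolute / baseline[metric]
    return absolute, relative


for metric in baseline:
    absolute, relative = gains(metric)
    print(f"{metric}: +{absolute * 100:.1f} points absolute, "
          f"+{relative * 100:.1f}% relative")
# MacroAUC: +4.6 points absolute, +5.0% relative
# MacroF1: +2.0 points absolute, +19.4% relative
```

The 4.6 figure is therefore an absolute difference in MacroAUC, while the 19.4% figure is a relative improvement in MacroF1.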

Ablation Study¶

Knowledge Combination	Effectiveness
No Knowledge (baseline PLM-CA)	MacroAUC 0.916, MacroF1 0.103
+ Description	Gain
+ Description + Synonym	Further Gain
+ Description + Synonym + Hierarchy	Optimal (MacroAUC 0.962, MacroF1 0.123)

The incremental contributions of the three knowledge types support the claim that they are complementary and that integrating all of them is worthwhile.

Key Findings¶

Effectiveness of General Framework: Multiple types of knowledge can be injected without specialized modules, and the performance surpasses methods that utilize specialized modules.
Strong Knowledge Complementarity: The three types of knowledge (descriptions, synonyms, and hierarchies) provide progressive incremental gains.
Zero Inference Overhead: Knowledge is injected solely via synthesized guidelines during the training phase; guidelines are not used during inference, resulting in no increased computational overhead.
Outperforming Methods with Additional Labels: GKI-ICD remains highly competitive even when compared to methods that use additional manual annotations (such as DRG/CPT).
Significant Long-tail Improvement: The substantial increase in MacroF1 demonstrates a significantly improved capability to handle low-frequency codes.
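The link between MacroF1 and low-frequency codes follows from how the two averages are defined: MacroF1 averages per-code F1 scores with equal weight, while MicroF1 pools counts across codes and is dominated by frequent ones. The toy example below uses made-up confusion counts (not data from the paper) to show that fixing two rare codes moves MacroF1 sharply while MicroF1 barely changes.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts maps each code to a (tp, fp, fn) tuple."""
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


# Hypothetical counts: one frequent code and two rare codes.
before = {"frequent": (900, 100, 100), "rare_a": (0, 0, 5), "rare_b": (0, 0, 5)}
after = {"frequent": (900, 100, 100), "rare_a": (4, 1, 1), "rare_b": (4, 1, 1)}

macro_before, micro_before = macro_micro_f1(before)
macro_after, micro_after = macro_micro_f1(after)
print(f"MacroF1: {macro_before:.3f} -> {macro_after:.3f}")
print(f"MicroF1: {micro_before:.3f} -> {micro_after:.3f}")
```

In this toy setting MacroF1 climbs from 0.300 to about 0.833, whereas MicroF1 moves by less than 0.01, which is why a MacroF1 gain is usually read as evidence of better long-tail behavior.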

Highlights & Insights¶

Integrates three complementary types of ICD code knowledge in a unified manner for the first time, without introducing extra network modules.
The guideline synthesis method elegantly incorporates discrete knowledge into continuous text sequences, leveraging the semantic understanding capabilities of language models.
Injecting knowledge during training with zero overhead during inference represents a highly practical design philosophy.
Achieves the best results on every reported metric on MIMIC-III, the widely used public benchmark for ICD coding.

Limitations & Future Work¶

Only validated on the ICD-9 coding system, with future verification needed on updated versions like ICD-10.
Guideline synthesis utilizes ground-truth labels, limiting direct generalization to semi-supervised scenarios.
The randomness in synonym substitution might introduce training noise.
Experiments were restricted to RoBERTa-PM as the encoder, leaving the effects on other pre-trained models unverified.

Related Work¶

ICD Coding Networks: CAML (Mullenbach et al., 2018) pioneered the label attention mechanism; PLM-ICD (Huang et al., 2022) and PLM-CA (Edin et al., 2024) introduced pre-trained language models.
Knowledge Injection - Description: ISD (Zhou et al., 2021) combined code descriptions using self-distillation; KEPTLongformer (Yang et al., 2022) treated descriptions as prompts.
Knowledge Injection - Synonym: MSMN (Yuan et al., 2022) utilized multi-synonym attention to learn diverse code representations.
Knowledge Injection - Hierarchy: MSATT-KG (Xie et al., 2019) employed graph convolutional networks to capture hierarchical relations between codes.

Rating¶

Dimension	Score
Novelty	⭐⭐⭐⭐
Technical Depth	⭐⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐
Value	⭐⭐⭐⭐⭐
Overall Rating	8.0/10