
Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs¶

Conference: NeurIPS 2025
arXiv: 2511.10768
Code: Undisclosed
Area: Medical NLP
Keywords: Medical Text Summarization, Faithfulness, Cross-lingual, TextRank, Named Entity Recognition, LLaMA

TL;DR¶

The paper proposes a framework that combines TextRank-based extractive sentence selection with medical Named Entity Recognition (NER) to guide LLMs toward faithful medical summaries. Fine-tuning LLaMA-2-7B on the English MeQSum and Bengali BanglaCHQ-Summ datasets yields consistent improvements in both quality and faithfulness: SummaC reaches 0.57 on MeQSum, and human evaluation shows that 82% of the summaries retain key medical information.

Background & Motivation¶

Growth in Online Health Consultation: Online consultation platforms driven by the pandemic have become crucial sources of medical information, but long and redundant patient queries impose a burden on healthcare professionals, who must identify core concerns before responding.

Inadequacy of Traditional Metrics: Existing summarization evaluations mainly rely on quality metrics like ROUGE/BERTScore to measure lexical/semantic similarity, but ignore faithfulness—the factual consistency between the summary and the source text.

Specific Risks in Medical Scenarios: Abstractive models often generate intrinsic errors (misrepresenting entities/relations) and extrinsic errors (introducing unsupported facts). In the medical domain, even minor distortions can mislead patients/doctors and jeopardize health outcomes.

Faithfulness is Understudied: Compared to readability and general accuracy, faithfulness has received little attention in medical summarization, and frameworks specifically designed to enhance it are lacking.

Cross-Lingual Gap: Cross-lingual evaluation of faithfulness in medical summarization is virtually non-existent, and medical NLP for low-resource languages (such as Bengali) is severely lacking.

Method¶

Overall Architecture¶

Three-stage pipeline: Medical NER + TextRank Extraction → LLM Fine-Tuning → Best-of-N Selection

Step 1: Preprocessing and Relevant Sentence Extraction¶

Standardize datasets into a question-summary format.
Identify overlapping medical entities and negation words to ensure key information is preserved.
Apply the TextRank algorithm to extract sentences containing medical entities and query-relevant words.
Core Goal: Ensure that the input is anchored to medically important content prior to abstractive generation by the LLM.
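
The entity and negation check in Step 1 can be illustrated with a minimal sketch. It assumes a toy keyword lexicon in place of a real medical NER model; the names `MEDICAL_TERMS`, `NEGATIONS`, and `anchor_sentences` are hypothetical and do not come from the paper.

```python
import re

# Hypothetical toy lexicons; a real system would use a medical NER model.
MEDICAL_TERMS = {"diabetes", "insulin", "metformin", "headache"}
NEGATIONS = {"no", "not", "never", "without"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on '.', '?' and '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def anchor_sentences(query):
    """Keep sentences that mention a medical term and record their negations."""
    kept = []
    for sent in split_sentences(query):
        tokens = set(tokenize(sent))
        entities = sorted(tokens & MEDICAL_TERMS)
        if entities:
            kept.append({
                "sentence": sent,
                "entities": entities,
                "negations": sorted(tokens & NEGATIONS),
            })
    return kept


query = ("Hello, I hope you are well. My father has diabetes and takes metformin. "
         "He does not use insulin. Thanks for your time!")
result = anchor_sentences(query)
for item in result:
    print(item)
```

In this toy example the greeting and the closing sentence are dropped, while the negation in "does not use insulin" is recorded so that it can be preserved downstream.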

TextRank is a graph-based ranking algorithm that treats sentences as nodes and inter-sentence similarities as edge weights, iteratively computing sentence importance scores:

\[S(V_i) = (1 - d) + d \cdot \sum_{V_j \in \text{In}(V_i)} \frac{w_{ji}}{\sum_{V_k \in \text{Out}(V_j)} w_{jk}} S(V_j)\]

where \(d\) is the damping factor, and \(w_{ji}\) is the weight of the edge from sentence \(j\) to \(i\).
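
As a rough illustration of the update rule above, the following sketch iterates the TextRank formula over a small sentence graph. The Jaccard word-overlap similarity, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example; the notes do not state which similarity function or settings the authors used.

```python
import re


def tokenize(sentence):
    """Return the set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Jaccard word overlap; an assumed stand-in for the paper's similarity."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def textrank(sentences, d=0.85, iterations=50):
    """Iterate S(V_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * S(V_j)."""
    n = len(sentences)
    # Symmetric edge weights, no self-loops.
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]  # total outgoing weight of each node
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0 and w[j][i] > 0)
            new_scores.append((1 - d) + d * rank)
        scores = new_scores
    return scores


sentences = [
    "I have had type two diabetes for ten years.",
    "My doctor changed my diabetes medication last month.",
    "Can the new diabetes medication cause dizziness?",
    "Thank you very much.",
]
scores = textrank(sentences)
for sentence, score in sorted(zip(sentences, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {sentence}")
```

The closing sentence shares no words with the others, so it receives only the baseline score \(1 - d\), while the two sentences about the medication rank highest.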

Step 2: LLM Fine-Tuning¶

Base Model: LLaMA-2-7B
Training Approach: LoRA (Low-Rank Adaptation) parameter-efficient fine-tuning
Input: Sentences containing medical entities filtered by TextRank
Output: Concise and faithful summaries
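
To make the parameter-efficiency argument for LoRA concrete, the sketch below computes the adapted output \(y = Wx + \frac{\alpha}{r} B A x\) with plain Python lists and counts the trainable parameters of the low-rank factors. The rank of 8 and the choice of a single 4096 × 4096 projection are illustrative assumptions; the notes do not report the authors' LoRA hyperparameters.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def lora_forward(x, W, A, B, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x) for plain Python lists."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]

    base = matvec(W, x)               # frozen pretrained projection
    update = matvec(B, matvec(A, x))  # trainable low-rank update
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative sizes only: one 4096 x 4096 projection with an assumed rank of 8.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
ratio = lora / full
print(f"full: {full}, LoRA: {lora}, ratio: {ratio:.4%}")

# With B initialised to zero, the adapted layer reproduces the frozen layer.
identity = [[1.0, 0.0], [0.0, 1.0]]
y_init = lora_forward([1.0, 2.0], identity, [[1.0, 1.0]], [[0.0], [0.0]], alpha=2, rank=1)
y_tuned = lora_forward([1.0, 2.0], identity, [[1.0, 1.0]], [[1.0], [0.0]], alpha=2, rank=1)
print(y_init, y_tuned)
```

Under these assumed sizes, the low-rank factors hold well under 1% of the parameters of the full projection matrix, which is what makes fine-tuning a 7B model on small datasets practical.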

Step 3: Best-of-N Selection¶

Generate multiple candidate summaries (temperature \(t = 0.7\)), selecting the optimal one using two strategies:
ROUGE-1 Selection: Maximize lexical coverage (quality-oriented)
SummaC Selection: Maximize factual consistency (faithfulness-oriented)

Temperature sweep \(t \in \{0.1, 0.3, 0.5, 0.7, 0.9\}\) reveals a trade-off: lower temperatures favor ROUGE, while higher temperatures favor SummaC, with \(t = 0.7\) serving as the equilibrium point.
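
The Best-of-N selection described in this step reduces to scoring each candidate and keeping the maximum. The sketch below uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1; the `target` text that candidates are scored against is an assumption, since the notes do not say whether selection uses the source question or a reference summary. A SummaC-based selector would pass a different scoring function to the same `best_of_n` helper (hypothetical name).

```python
import re
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    c = Counter(re.findall(r"\w+", candidate.lower()))
    r = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((c & r).values())  # clipped counts of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def best_of_n(candidates, score_fn):
    """Return the candidate with the highest score under score_fn."""
    return max(candidates, key=score_fn)


# Assumed scoring target for illustration only.
target = "What are the side effects of metformin?"
candidates = [
    "What is metformin?",
    "What are the side effects of metformin?",
    "Is insulin safe during pregnancy?",
]
best = best_of_n(candidates, lambda c: rouge1_f1(c, target))
print(best)
```

Keeping the scoring function as a parameter mirrors the two strategies above: the generation step is shared, and only the selection criterion changes.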

Evaluation Metrics¶

Quality Metrics: ROUGE-1/2/L, BERTScore
Faithfulness Metrics: SummaC (NLI-based factual consistency), AlignScore (semantic alignment)
Readability: Flesch Reading Ease (FRE)
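
For reference, the standard definitions behind two of these metrics are given here. The notes do not state which implementations or variants the authors used, so the exact forms should be read as the textbook versions rather than the paper's own.

\[\text{FRE} = 206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}\]

\[\text{ROUGE-}N_{\text{recall}} = \frac{\sum_{g \in \text{ngrams}_N(\text{ref})} \min\left(\text{Count}_{\text{cand}}(g), \text{Count}_{\text{ref}}(g)\right)}{\sum_{g \in \text{ngrams}_N(\text{ref})} \text{Count}_{\text{ref}}(g)}\]

where \(g\) ranges over the distinct \(N\)-grams of the reference summary. A higher FRE indicates easier text, with scores between 60 and 80 conventionally read as plain to fairly easy English.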

Key Experimental Results¶

MeQSum (English, 1000 samples)¶

Setting	R1	R2	RL	BERT	Readability	SummaC	AlignScore
Zero-shot	21.97	6.48	19.98	0.60	65.16	0.28	21.80
FT (w/o TR)	44.23	27.36	41.55	0.71	70.21	0.31	38.45
FT + TR	47.07	29.44	44.08	0.72	70.69	0.37	45.65
Best-of-3 (R1)	50.50	34.38	47.74	0.74	71.56	0.40	39.24
Best-of-3 (SummaC)	48.27	31.38	45.34	0.73	71.56	0.57	45.91

Comparison with SOTA¶

Model	R1	R2	RL	BERT	SummaC	AlignScore
Mixtral-8x7B-Inst.	32.47	36.38	16.86	0.72	-	-
BioBART + FaMeSumm	31.76	11.71	29.64	0.74	0.46	-
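Ours (Best-of-3)	50.50	34.38	47.74	0.74	0.57	0.46
Note: In the last row, the ROUGE and BERTScore values match the Best-of-3 (R1) setting in the MeQSum table, while SummaC and AlignScore match the Best-of-3 (SummaC) setting; the AlignScore of 0.46 is the value 45.91 expressed on a 0 to 1 scale.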

BanglaCHQ-Summ (Bengali, 2350 samples)¶

Setting	R1	R2	RL	BERT	SummaC
Zero-shot	19.10	8.21	18.97	0.62	0.22
FT (w/o TR)	28.24	14.22	24.54	0.71	0.26
FT + TR	30.71	15.71	28.95	0.74	0.28
Best-of-3 (R1)	32.35	16.32	29.09	0.76	0.29
Best-of-3 (SummaC)	30.92	15.74	27.35	0.73	0.32

Key Findings¶

TextRank is consistently effective: Adding TextRank to fine-tuning improves both quality and faithfulness in both languages (MeQSum: ROUGE-1 44.23 → 47.07, SummaC 0.31 → 0.37; BanglaCHQ-Summ: ROUGE-1 28.24 → 30.71, SummaC 0.26 → 0.28).
Best-of-N selection adds further gains: Selecting the best of 3 candidates improves over single-pass generation, with ROUGE-1 selection optimizing quality (MeQSum ROUGE-1 47.07 → 50.50) and SummaC selection optimizing faithfulness (MeQSum SummaC 0.37 → 0.57).
Cross-lingual applicability: The same pipeline improves results on Bengali as well as English, supporting its cross-lingual generalization, although absolute faithfulness on Bengali stays lower (best SummaC 0.32 vs. 0.57 on MeQSum).
Human evaluation validation: Clinical MD evaluations indicate that 82% of the summaries retain all key medical information while maintaining factual consistency.
Temperature trade-off: A quality-faithfulness trade-off exists, where \(t = 0.7\) acts as the optimal balance point.
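
The size of these effects is easier to compare as relative gains. The short script below recomputes them from the ROUGE-1 and SummaC values in the two result tables; the helper name `relative_gain` is introduced here for illustration only.

```python
# (ROUGE-1, SummaC) values copied from the MeQSum and BanglaCHQ-Summ tables.
results = {
    "MeQSum": {
        "FT (w/o TR)": (44.23, 0.31),
        "FT + TR": (47.07, 0.37),
        "Best-of-3 (SummaC)": (48.27, 0.57),
    },
    "BanglaCHQ-Summ": {
        "FT (w/o TR)": (28.24, 0.26),
        "FT + TR": (30.71, 0.28),
        "Best-of-3 (SummaC)": (30.92, 0.32),
    },
}


def relative_gain(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


gains = {}
for dataset, rows in results.items():
    base_r1, base_sc = rows["FT (w/o TR)"]
    tr_r1, tr_sc = rows["FT + TR"]
    _, best_sc = rows["Best-of-3 (SummaC)"]
    gains[dataset] = {
        "textrank_r1": relative_gain(base_r1, tr_r1),
        "textrank_summac": relative_gain(base_sc, tr_sc),
        "best_of_3_summac": relative_gain(tr_sc, best_sc),
    }
    print(dataset, {k: round(v, 1) for k, v in gains[dataset].items()})
```

The comparison shows that SummaC-based selection contributes a much larger relative faithfulness gain on MeQSum than on BanglaCHQ-Summ.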

Highlights & Insights¶

⭐⭐⭐⭐ Focus on Faithfulness: First to treat faithfulness as the core optimization objective in medical text summarization rather than solely pursuing ROUGE.
⭐⭐⭐ Cross-lingual Validation: Validated on both English and Bengali (a low-resource language), presenting the first cross-lingual faithfulness evaluation in medical summarization.
⭐⭐⭐ Practical Pipeline: The TextRank + NER + LoRA fine-tuning + Best-of-N pipeline is simple, practical, and highly reproducible.
⭐⭐⭐ Human Validation: An 82% pass rate in human evaluation provides credibility beyond automated metrics.
⭐⭐ Clear Ablation: The incremental contribution of each component is clearly demonstrated through ablation studies.

Limitations & Future Work¶

Single LLM: Evaluated only on LLaMA-2-7B, without testing newer or larger models (e.g., LLaMA-3, Mistral).
Small Dataset Scale: MeQSum has only 1,000 samples and BanglaCHQ-Summ has 2,350 samples, which limits the statistical power of the reported comparisons.
Only Two Languages: Covers only English and Bengali; generalization to other low-resource languages remains unknown.
Limitations of TextRank: Because the graph ranking relies on surface-level term statistics, it might miss semantically crucial but low-frequency information in complex medical narratives.
Limited Evaluation Dimensions: SummaC and AlignScore are themselves proxy metrics that may fail to capture all types of factual inconsistency.

Overall Evaluation ⭐⭐⭐¶

The research direction is sound: treating faithfulness as a core focus in medical summarization is an important contribution. The methodological design is straightforward yet effective (TextRank + NER guidance + Best-of-N), and the cross-lingual evaluation is a key highlight. However, the overall scale is small (single LLM, small datasets, only two languages), and the depth of technical contribution is limited, resembling a systematic engineering integration rather than a methodological innovation. Its positioning as a workshop paper (Muslims in ML @ NeurIPS) is appropriate.