Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs¶
Conference: NeurIPS 2025
arXiv: 2511.10768
Code: Undisclosed
Area: Medical NLP
Keywords: Medical Text Summarization, Faithfulness, Cross-lingual, TextRank, Named Entity Recognition, LLaMA
TL;DR¶
The paper proposes a framework that combines TextRank-based extractive sentence selection with medical Named Entity Recognition (NER) to guide LLMs toward faithful medical summaries. Fine-tuning LLaMA-2-7B on the English MeQSum and Bengali BanglaCHQ-Summ datasets yields consistent gains in both quality and faithfulness: SummaC reaches 0.57 on MeQSum, and in human evaluation 82% of the summaries retain key medical information.
Background & Motivation¶
Growth in Online Health Consultation: Online consultation platforms driven by the pandemic have become crucial sources of medical information, but long and redundant patient queries impose a burden on healthcare professionals, who must identify core concerns before responding.
Inadequacy of Traditional Metrics: Existing summarization evaluations mainly rely on quality metrics like ROUGE/BERTScore to measure lexical/semantic similarity, but ignore faithfulness—the factual consistency between the summary and the source text.
Specific Risks in Medical Scenarios: Abstractive models often generate intrinsic errors (misrepresenting entities/relations) and extrinsic errors (introducing unsupported facts). In the medical domain, even minor distortions can mislead patients/doctors and jeopardize health outcomes.
Faithfulness is Understudied: Compared to readability and general accuracy, faithfulness has received little attention in medical summarization, and the field lacks frameworks specifically designed to enhance it.
Cross-Lingual Gap: Cross-lingual evaluation of faithfulness in medical summarization is virtually non-existent, and medical NLP for low-resource languages (such as Bengali) is severely lacking.
Method¶
Overall Architecture¶
Three-stage pipeline: Medical NER + TextRank Extraction → LLM Fine-Tuning → Best-of-N Selection
Step 1: Preprocessing and Relevant Sentence Extraction¶
- Standardize datasets into a question-summary format.
- Identify overlapping medical entities and negation words to ensure key information is preserved.
- Apply the TextRank algorithm to extract sentences containing medical entities and query-relevant words.
- Core Goal: Ensure that the input is anchored to medically important content prior to abstractive generation by the LLM.
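
The anchoring idea in the steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tiny `MEDICAL_TERMS` and `NEGATION_WORDS` sets are hypothetical stand-ins for a real medical NER model and negation lexicon, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical stand-ins: a real system would use a medical NER model here.
MEDICAL_TERMS = {"diabetes", "metformin", "insulin", "nausea"}
NEGATION_WORDS = {"no", "not", "never", "without", "denies"}


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased alphabetic tokens of a sentence."""
    return re.findall(r"[a-z]+", sentence.lower())


def anchor_sentences(text):
    """Keep sentences that mention a medical term; record entities and negations."""
    kept = []
    for sent in split_sentences(text):
        toks = set(tokens(sent))
        entities = sorted(toks & MEDICAL_TERMS)
        if entities:
            kept.append({
                "sentence": sent,
                "entities": entities,
                "negations": sorted(toks & NEGATION_WORDS),
            })
    return kept


query = ("Hi, I hope you are well. I have diabetes and take metformin. "
         "I do not use insulin. Thanks for your time!")
result = anchor_sentences(query)
for item in result:
    print(item)
```

Greetings and sign-offs are dropped, while the negated statement about insulin is kept together with its negation cue, which is the kind of detail a faithful summary must not invert.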
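TextRank is a graph-based ranking algorithm that treats sentences as nodes and inter-sentence similarities as edge weights, iteratively computing sentence importance scores:
\[ WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} \, WS(V_j) \]
where \(WS(V_i)\) is the score of sentence \(i\), \(d\) is the damping factor, \(w_{ji}\) is the weight of the edge from sentence \(j\) to \(i\), and \(\mathrm{In}(V_i)\) and \(\mathrm{Out}(V_j)\) are the sets of incoming and outgoing neighbours.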
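
The iterative scoring described above can be sketched in plain Python. The details are assumptions made for illustration, since the summary does not state them: a damping factor of 0.85 (the conventional PageRank value), a fixed number of iterations, and a word-overlap similarity normalised by log sentence lengths.

```python
import math
import re


def tokenize(sentence):
    """Set of lowercased alphabetic tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalised by log lengths (assumed similarity measure)."""
    ta, tb = tokenize(a), tokenize(b)
    if len(ta) < 2 or len(tb) < 2:
        return 0.0  # avoids log(1) + log(1) = 0 in the denominator
    return len(ta & tb) / (math.log(len(ta)) + math.log(len(tb)))


def textrank(sentences, d=0.85, iters=50):
    """Weighted TextRank scores, one per sentence."""
    n = len(sentences)
    # w[j][i] is the weight of the edge from sentence j to sentence i.
    w = [[0.0 if i == j else similarity(sentences[i], sentences[j])
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - d) + d * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "I have diabetes and take metformin daily.",
    "Can metformin cause nausea in diabetes patients?",
    "My dog likes long walks.",
]
scores = textrank(sentences)
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print([sentences[i] for i in ranked[:2]])
```

The off-topic sentence shares no words with the others, so it receives only the baseline score of \(1 - d\) and falls to the bottom of the ranking.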
Step 2: LLM Fine-Tuning¶
- Base Model: LLaMA-2-7B
- Training Approach: LoRA (Low-Rank Adaptation) parameter-efficient fine-tuning
- Input: Sentences containing medical entities filtered by TextRank
- Output: Concise and faithful summaries
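
To make the parameter-efficiency argument concrete, the following dependency-free sketch shows the low-rank update that LoRA adds to a frozen weight matrix, plus a parameter count. The rank `r = 8` and scaling `alpha` are assumed values chosen for illustration, as the summary does not report the LoRA hyperparameters; 4096 is the hidden size of LLaMA-2-7B.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train.

    W is d x k, A is r x k, B is d x r.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_count(d, k, r):
    """Trainable parameters in the low-rank pair (A, B)."""
    return r * (d + k)


W = [[1, 0], [0, 1]]       # frozen weights (identity, for readability)
A = [[1, 1]]               # r = 1, k = 2
B_zero = [[0], [0]]        # B starts at zero, so training begins from W
B_trained = [[1], [2]]     # pretend values after some training
x = [3, 4]

y_start = lora_forward(x, W, A, B_zero, alpha=2)
y_tuned = lora_forward(x, W, A, B_trained, alpha=2)
print(y_start, y_tuned)

full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, r=8)
print(full, low_rank, full // low_rank)
```

Under these assumed settings, one adapted 4096 x 4096 projection trains 65,536 parameters instead of about 16.8 million, a 256-fold reduction per matrix.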
Step 3: Best-of-N Selection¶
Generate multiple candidate summaries (temperature \(t = 0.7\)), then select the best one using one of two strategies:
- ROUGE-1 Selection: Maximize lexical coverage (quality-oriented)
- SummaC Selection: Maximize factual consistency (faithfulness-oriented)
Temperature sweep \(t \in \{0.1, 0.3, 0.5, 0.7, 0.9\}\) reveals a trade-off: lower temperatures favor ROUGE, while higher temperatures favor SummaC, with \(t = 0.7\) serving as the equilibrium point.
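
A minimal sketch of the selection step, assuming the candidates have already been sampled. The `rouge1_f1` function is a simplified unigram-overlap F1 rather than an official ROUGE implementation, and scoring against a caller-supplied target text is an assumption, since the summary does not say what the candidates are compared with. A faithfulness-oriented variant would pass a SummaC-style scorer as `score_fn` instead; that scorer is not implemented here.

```python
import re
from collections import Counter


def rouge1_f1(candidate, target):
    """Simplified ROUGE-1: F1 over unigram overlap between two texts."""
    c = Counter(re.findall(r"\w+", candidate.lower()))
    t = Counter(re.findall(r"\w+", target.lower()))
    overlap = sum((c & t).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)


def best_of_n(candidates, score_fn):
    """Return the candidate with the highest score under score_fn."""
    return max(candidates, key=score_fn)


target = "can metformin cause nausea"
candidates = [
    "Is metformin safe?",
    "Can metformin cause nausea?",
    "What causes nausea and headaches?",
]
best = best_of_n(candidates, lambda c: rouge1_f1(c, target))
print(best)
```

Because the selector only needs a scoring callable, swapping the quality-oriented criterion for a faithfulness-oriented one changes a single argument.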
Evaluation Metrics¶
- Quality Metrics: ROUGE-1/2/L, BERTScore
- Faithfulness Metrics: SummaC (NLI-based factual consistency), AlignScore (semantic alignment)
- Readability: Flesch Reading Ease (FRE)
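
For reference, the standard Flesch Reading Ease formula can be written as a one-line function. The sketch takes word, sentence, and syllable counts as inputs because syllable counting is tool-specific; the counts in the example are made up. The constants are calibrated for English text.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease; higher scores mean easier text."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Made-up counts: 100 words, 5 sentences, 150 syllables.
fre = flesch_reading_ease(100, 5, 150)
print(round(fre, 3))
```

Shorter sentences and shorter words both push the score up.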
Key Experimental Results¶
MeQSum (English, 1000 samples)¶
| Setting | R1 | R2 | RL | BERT | Readability | SummaC | AlignScore |
|---|---|---|---|---|---|---|---|
| Zero-shot | 21.97 | 6.48 | 19.98 | 0.60 | 65.16 | 0.28 | 21.80 |
| FT (w/o TR) | 44.23 | 27.36 | 41.55 | 0.71 | 70.21 | 0.31 | 38.45 |
| FT + TR | 47.07 | 29.44 | 44.08 | 0.72 | 70.69 | 0.37 | 45.65 |
| Best-of-3 (R1) | 50.50 | 34.38 | 47.74 | 0.74 | 71.56 | 0.40 | 39.24 |
| Best-of-3 (SummaC) | 48.27 | 31.38 | 45.34 | 0.73 | 71.56 | 0.57 | 45.91 |
Comparison with SOTA¶
| Model | R1 | R2 | RL | BERT | SummaC | AlignScore |
|---|---|---|---|---|---|---|
| Mixtral-8x7B-Inst. | 32.47 | 36.38 | 16.86 | 0.72 | - | - |
| BioBART + FaMeSumm | 31.76 | 11.71 | 29.64 | 0.74 | 0.46 | - |
| Ours (Best-of-3) | 50.50 | 34.38 | 47.74 | 0.74 | 0.57 | 0.46 |
BanglaCHQ-Summ (Bengali, 2350 samples)¶
| Setting | R1 | R2 | RL | BERT | SummaC |
|---|---|---|---|---|---|
| Zero-shot | 19.10 | 8.21 | 18.97 | 0.62 | 0.22 |
| FT (w/o TR) | 28.24 | 14.22 | 24.54 | 0.71 | 0.26 |
| FT + TR | 30.71 | 15.71 | 28.95 | 0.74 | 0.28 |
| Best-of-3 (R1) | 32.35 | 16.32 | 29.09 | 0.76 | 0.29 |
| Best-of-3 (SummaC) | 30.92 | 15.74 | 27.35 | 0.73 | 0.32 |
Key Findings¶
- TextRank is consistently effective: Adding TextRank extraction improves both quality and faithfulness in both languages. On MeQSum, ROUGE-1 rises from 44.23 to 47.07 and SummaC from 0.31 to 0.37; on BanglaCHQ-Summ, ROUGE-1 rises from 28.24 to 30.71 and SummaC from 0.26 to 0.28.
- Best-of-N selection helps further: Choosing the best of 3 candidates improves on single-pass generation. On MeQSum, ROUGE-1 selection raises ROUGE-1 from 47.07 to 50.50 (quality), while SummaC selection raises SummaC from 0.37 to 0.57 (faithfulness).
- Cross-lingual applicability: The same pipeline remains effective when applied to Bengali, supporting its generalization beyond English, although the Bengali SummaC scores (at most 0.32) stay well below the English ones (up to 0.57).
- Human evaluation validation: Clinical MD evaluations indicate that 82% of the summaries retain all key medical information while maintaining factual consistency.
- Temperature trade-off: A quality-faithfulness trade-off exists, where \(t = 0.7\) acts as the optimal balance point.
Highlights & Insights¶
- ⭐⭐⭐⭐ Focus on Faithfulness: First to treat faithfulness as the core optimization objective in medical text summarization rather than solely pursuing ROUGE.
- ⭐⭐⭐ Cross-lingual Validation: Validated on both English and Bengali (a low-resource language), presenting the first cross-lingual faithfulness evaluation in medical summarization.
- ⭐⭐⭐ Practical Pipeline: The TextRank + NER + LoRA fine-tuning + Best-of-N pipeline is simple, practical, and highly reproducible.
- ⭐⭐⭐ Human Validation: An 82% pass rate in human evaluation provides credibility beyond automated metrics.
- ⭐⭐ Clear Ablation: The incremental contribution of each component is clearly demonstrated through ablation studies.
Limitations & Future Work¶
- Single LLM: Evaluated only on LLaMA-2-7B, without testing newer or larger models (e.g., LLaMA-3, Mistral).
- Small Dataset Scale: MeQSum has only 1,000 samples and BanglaCHQ-Summ has 2,350 samples, which limits the statistical power of the reported comparisons.
- Only Two Languages: Covers only English and Bengali; generalization to other low-resource languages remains unknown.
- Limitations of TextRank: Graph-based ranking that relies on surface-level similarity between sentences might miss semantically crucial but rarely mentioned information in complex medical narratives.
- Limited Evaluation Dimensions: SummaC and AlignScore are themselves proxy metrics that may fail to capture all types of factual inconsistency.
Overall Evaluation ⭐⭐⭐¶
The research direction is sound: treating faithfulness as a core focus in medical summarization is an important contribution. The methodological design is straightforward yet effective (TextRank + NER guidance + Best-of-N). The cross-lingual evaluation is a key highlight. However, the overall scale is relatively small (single LLM, small datasets, only two languages), and the technical contribution is limited in depth, resembling a systematic engineering integration more than a methodological innovation. Its positioning as a workshop paper (Muslims in ML @ NeurIPS) is appropriate.