
🩺 Medical LLM

💬 ACL2025 · 31 paper notes

📌 Same area in other venues: 📷 CVPR2026 (1) · 🔬 ICLR2026 (20) · 💬 ACL2026 (47) · 🧪 ICML2026 (4) · 🤖 AAAI2026 (12) · 🧠 NeurIPS2025 (17)

🔥 Top topics: Medical Imaging ×24 · LLM ×6 · Dialogue ×4 · Question Answering ×4 · RAG ×4

A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment: This paper proposes a modular framework to efficiently adapt Small Language Models (SLMs) into clinical domain models. It includes pre-instruction tuning for domain experts (training multiple expert models on medical corpora), model merging (combining multiple experts into a unified MediPhi), and clinical-task alignment based on 2.5 million synthetic instructions (MediFlow). Ultimately, the 3.8B-parameter MediPhi outperforms GPT-4 on several clinical tasks.
A Retrieval-Based Approach to Medical Procedure Matching in Romanian: This work models Romanian medical procedure name matching as a retrieval problem rather than a classification problem, under an extreme long-tail scenario of 39,097 standard entries (50% with only a single sample). It compares BM25 sparse retrieval with three dense embedding models (mE5, RoBERT, BioClinicalBERT). After fine-tuning via metric learning, mE5 achieves 85.2% Acc@1. In real-world deployment, verification by doctors yields 94.7% accuracy, and the system is 1,200 times faster than manual matching.
A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions: The first survey to systematically organize and review LLM research in psychotherapy using the APA three-stage (Assessment \(\to\) Diagnosis \(\to\) Treatment) conceptual taxonomy. Covering over 60 works, it comprehensively analyzes four levels from symptom detection to virtual therapists, revealing a four-fold imbalance across disorder coverage, language bias, methodology fragmentation, and theoretical integration.
Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication Training: This paper proposes the Adaptive-VP framework, which uses LLMs to build Virtual Patients (VPs) that dynamically adjust their behavior based on the communication quality of nursing trainees. Through a four-module pipeline of multi-agent evaluation \(\rightarrow\) dynamic adaptation \(\rightarrow\) dialogue generation \(\rightarrow\) safety monitoring, the framework significantly improves the perceived realism of VP interactions (persona fidelity \(\eta_p^2 = 0.151\), dialogue realism \(\eta_p^2 = 0.254\)) in a between-subjects experiment with 28 nursing experts.
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset: This work constructs AfriMed-QA (15,275 questions across 32 specialties in 16 countries), the first large-scale pan-African medical QA benchmark, systematically evaluates 30 LLMs, and reveals significant regional performance gaps and the counter-intuitive phenomenon where domain-specific biomedical models underperform general-purpose models in African healthcare contexts.
Are LLMs Effective Psychological Assessors? Leveraging Adaptive RAG for Interpretable Mental Health Screening through Psychometric Practice: This paper proposes a questionnaire-guided mental health screening framework. By leveraging adaptive RAG to retrieve relevant content from users' Reddit posts, LLMs are used to fill out standardized psychometric scales (such as BDI-II) on behalf of users. It matches or outperforms supervised methods without requiring training data, while providing clinically interpretable assessment results.
ArgHiTZ at ArchEHR-QA 2025: A Two-Step Divide and Conquer Approach to Patient Question Answering for Top Factuality: This paper proposes a two-step "divide and conquer" approach for the ArchEHR-QA 2025 shared task: first, a re-ranking model extracts key sentences from electronic health records, and then a small medical LLM generates the response. Without using any external knowledge, the approach ranks first in factuality and 8th out of 30 in overall score.
Automated Structured Radiology Report Generation: This work proposes a new task, Structured Radiology Report Generation (SRRG), which leverages LLMs to restructure free-text reports into standardized formats. It also introduces SRR-BERT, a 55-label disease classification model, and F1-SRR-BERT, an evaluation metric, addressing the challenges of report generation and evaluation caused by highly diverse reporting styles.
The Impact of Auxiliary Patient Data on Automated Chest X-Ray Report Generation and How to Incorporate It: This paper investigates how to integrate Emergency Department (ED) patient data (vital signs, medications, triage information, etc.) into multimodal language models for automated chest X-ray report generation. It proposes a method to convert heterogeneous tabular data, text, and images into unified embeddings, which significantly improves the clinical accuracy of reports on the MIMIC-CXR + MIMIC-IV-ED datasets, outperforming multiple baseline models including CXRMate-RRG24.
Improving Automatic Evaluation of LLMs in Biomedical Relation Extraction via LLMs-as-the-Judge: This paper presents the first systematic study of LLM-as-the-Judge in evaluating biomedical relation extraction. The authors find that its accuracy is typically below 50%, and propose structured output formatting (JSON) and domain adaptation techniques to improve evaluation accuracy by approximately 15%.
CheXalign: Preference Fine-tuning in Chest X-ray Interpretation Models without Human Feedback: CheXalign proposes an automated preference data generation pipeline without radiologist feedback. It leverages reference reports from public datasets and reference-based evaluation metrics (such as GREEN and BERTScore) to construct preference pairs, and performs preference fine-tuning on chest X-ray report generation models using direct alignment algorithms like DPO, achieving SOTA CheXbert scores on MIMIC-CXR.
Aligning AI Research with the Needs of Clinical Coding Workflows: Eight Recommendations Based on US Data Analysis and Critical Review: Through an in-depth analysis of the MIMIC dataset and existing automated clinical coding research, this position paper points out that current evaluation methodologies (such as focusing only on the top-50 high-frequency codes and using inappropriate metrics) are severely disconnected from real clinical scenarios. It proposes eight specific recommendations to improve evaluation methods and research directions.
CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation: This work constructs CliniDial, a dataset collected from natural dialogues during simulated clinical operations, containing multimodal data like audio transcriptions, dual-angle videos, and patient physiological signals. Annotated with team reflection action coding, CliniDial reveals substantial shortcomings of state-of-the-art LLMs in handling class imbalance, natural conversational interactions, and domain-specific multimodal data.
CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization: Proposes CSTRL, a context-driven sequential transfer learning approach for abstractive radiology report summarization. By optimizing Gap Sentence Generation (GSG) pre-training, utilizing Fisher matrix regularization to prevent catastrophic forgetting, and combining knowledge distillation for model compression, it significantly outperforms existing methods on the MIMIC-CXR and Open-I datasets.
Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment: This paper proposes MedRef, a medical dialogue system that integrates a knowledge refinement mechanism and a dynamic prompt adjustment strategy. It filters irrelevant knowledge graph triplets using latent variables, conducts joint entity-action prediction, and dynamically builds system prompts via a triplet filter and an exemplar selector, achieving SOTA performance on both the MedDG and KaMed benchmarks.
Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings: A systematic benchmark study reveals that LLM performance significantly degrades under high OOV (out-of-vocabulary) and high-novelty medical text summarization scenarios. Through various vocabulary adaptation strategies (MEDVOC, MEDVOC-LLM, ScafFix), it demonstrates that even Llama-3.1 (128K vocabulary size) still suffers from over-fragmentation, and vocabulary adaptation yields remarkable improvements.
LLMs Can Simulate Standardized Patients via Agent Coevolution: EvoPatient proposes a multi-agent coevolution framework. Through autonomous simulated dialogues between patient and doctor agents, LLMs learn to simulate standardized patients (SPs) without human supervision, surpassing existing reasoning methods by more than 10% in requirement alignment.
Follow-up Question Generation for Enhanced Patient-Provider Conversations: This paper proposes FollowupQ, a multi-agent framework that integrates EHR reasoning, differential diagnosis, and message clarification agents to automatically generate personalized follow-up questions for asynchronous patient-provider conversations. FollowupQ improves the RIM score by 17% and 5% on real and semi-synthetic datasets, respectively, compared to baselines, and reduces the need for clinicians to send additional information-gathering messages by 34%.
ANGEL: Learning from Negative Samples in Biomedical Generative Entity Linking: The ANGEL framework is proposed, introducing negative sample training to generative Biomedical Entity Linking (BioEL) for the first time. Through a two-stage strategy (positive-only training + negative-aware preference optimization), it significantly improves the model's ability to distinguish between entities with similar surface forms but different semantics, achieving an average top-1 accuracy improvement of 1.7% across five benchmark datasets.
MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA: MedBioRAG proposes a retrieval-augmented generation framework that combines semantic search, document retrieval, and fine-tuned LLMs, comprehensively outperforming GPT-4o baselines and prior SOTAs on three types of biomedical QA tasks: text retrieval, closed-book QA, and long-text QA.
MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA: MedBioRAG proposes a retrieval-augmented generation framework that integrates semantic search, document retrieval, and fine-tuned LLMs for biomedical QA tasks. It outperforms previous SOTA and GPT-4o baseline models across multiple benchmarks in three task types: text retrieval (NFCorpus, TREC-COVID), closed-domain QA (MedQA, PubMedQA, BioASQ), and long-text QA.
Online Iterative Self-Alignment for Radiology Report Generation: This paper proposes the Online Iterative Self-Alignment (OISA) method. Through a four-stage self-loop consisting of self-generation \(\rightarrow\) self-evaluation \(\rightarrow\) self-alignment \(\rightarrow\) self-iteration, it leverages Multi-Objective Preference Optimization (MODPO) to continuously improve the quality of radiology reports generated by a lightweight RRG model without requiring external large language models or human annotations, achieving SOTA performance on MIMIC-CXR and IU-Xray.
Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications: This paper proposes MedOmniKB, a multi-source medical knowledge base, and the Source Planning Optimisation (SPO) method. By enabling an expert model to explore multi-source retrieval plans and training smaller models to learn source alignment, this work significantly enhances multi-source retrieval planning capabilities, allowing a 7B small model to outperform a 72B large model.
One Size Fits None: Rethinking Fairness in Medical AI: This paper conducts a subpopulation performance analysis across three multimodal medical prediction tasks (ICU mortality, graft failure, and emergency triage), exposing performance disparities among groups that are otherwise masked by aggregated metrics. It advocates for tightly coupling fairness with transparency to promote responsible medical AI deployment through routine subpopulation reporting.
Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine: This paper constructs a medical multiple-choice question (MCQ) benchmark centered around a fictitious organ, "Glianorex," to reveal that LLMs predominantly rely on pattern recognition and test-taking heuristics rather than genuine clinical reasoning in medical MCQ tests. Models scored an average of 64% on entirely fictitious medical knowledge, whereas medical doctors scored only 27%.
Radar: Enhancing Radiology Report Generation with Supplementary Knowledge Injection: This paper proposes the Radar framework, which systematically fuses internal and external knowledge sources for more accurate radiology report generation by distinguishing between trusted internal knowledge the LLM has already mastered and external knowledge that needs to be supplemented.
RedactX: An LLM-Powered Framework for Automatic Clinical Data De-Identification: RedactX is proposed as a fully automated, multimodal clinical data de-identification framework. By combining multi-round LLM extraction, rule-based processing, and retrieval-based relexicalization, it achieves an F1 score (0.9646) comparable to specialized commercial systems on the i2b2 dataset, while optimizing token usage efficiency.
ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents: ReflecTool proposes a reflection-aware, tool-augmented clinical agent framework. It accumulates successful trajectories and tool-level experience in the optimization stage, then retrieves similar cases and refines tool usage with a validator in the inference stage. On ClinicalAgent Bench, which spans 18 tasks, it outperforms pure LLMs by 10+ points and existing agent methods by 3 points.
SECRET: Semi-supervised Clinical Trial Document Similarity Search: Proposes SECRET, a semi-supervised clinical trial protocol similarity search method. By converting clinical trial documents into Q/A pair representations and combining local (Q/A-level) and global (trial-level) contrastive learning to generate embeddings, it improves recall@1 by 78% relative to the best baseline in full trial search.
Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies: This paper proposes the URCA (Uniform Retrieval Clustered Augmentation) framework, which automatically extracts scientific evidence and conclusions related to clinical questions from the full texts of RCT studies using a RAG pipeline of uniform retrieval, clustering, and knowledge extraction. It achieves an 8.81% F1 improvement over the best baseline on the newly constructed CochraneForest dataset.
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare: This paper constructs VITAL, the first pluralistic alignment benchmark dataset for the healthcare domain, containing 13.1K value scenarios and 5.4K multiple-choice questions. Extensive evaluation of 8 LLMs demonstrates that existing pluralistic alignment techniques (especially ModPlural) perform poorly in medical scenarios, and simple prompting yields better results.
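
The CheXalign entry above describes building preference pairs by scoring candidate reports against a reference report with a reference-based metric. The following is a minimal sketch of that general idea, not the authors' pipeline. The function names `token_f1` and `build_preference_pairs`, the `min_margin` threshold, and the example reports are all hypothetical, and the token-overlap F1 is only a toy stand-in for metrics such as GREEN or BERTScore.

```python
from collections import Counter
from itertools import combinations


def token_f1(candidate: str, reference: str) -> float:
    """Toy stand-in for a reference-based metric: token-overlap F1."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_preference_pairs(candidates, reference, metric=token_f1, min_margin=0.1):
    """Return chosen/rejected pairs whose metric scores differ by at least min_margin.

    min_margin is an assumed hyperparameter for this sketch.
    """
    scored = [(metric(c, reference), c) for c in candidates]
    pairs = []
    for (score_a, a), (score_b, b) in combinations(scored, 2):
        if abs(score_a - score_b) < min_margin:
            continue  # skip near-ties: the metric gives no clear preference
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical example: one reference report and three sampled candidates.
reference = "no acute cardiopulmonary abnormality"
candidates = [
    "no acute cardiopulmonary abnormality",
    "no acute abnormality",
    "large right pleural effusion",
]
pairs = build_preference_pairs(candidates, reference)
```

Pairs in this `chosen`/`rejected` shape are the usual input to direct alignment objectives such as DPO. Skipping near-ties is one possible design choice to avoid training on pairs where the metric expresses no clear preference.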