🩺 Medical LLM

💬 ACL2026 · 47 paper notes

📌 Same area in other venues: 📷 CVPR2026 (1) · 🔬 ICLR2026 (20) · 🧪 ICML2026 (4) · 🤖 AAAI2026 (12) · 🧠 NeurIPS2025 (17) · 🧪 ICML2025 (4)

🔥 Top topics: Medical Imaging ×31 · LLM ×12 · Reasoning ×8 · Reinforcement Learning ×4 · Question Answering ×3

"Excuse Me, May I Say Something…" CoLabScience: A Proactive AI Assistant for Biomedical Discovery: CoLabScience utilizes the PULI (Positive-Unlabeled Learning for Intervention) framework to train an LLM assistant capable of proactively deciding when and how to intervene in biomedical team discussions. It leverages GRPO and an RL coordinator to automatically identify optimal intervention timings and generate scientific suggestions from streaming dialogues.
Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives: Anonpsy recasts the de-identification of psychiatric narratives as a graph-guided semantic rewriting problem: narratives are first converted into semantic graphs, constrained perturbations on the graph then modify identity information while preserving clinical structure, and the narrative is finally reconstructed through graph-conditional generation.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering: This paper proposes StsPatient, which simulates standardized patients across various cognitive impairment domains and severity levels by extracting domain-specific Steering Vectors from contrastive instruction/response pairs. Combined with a Stochastic Token Modulation (STM) mechanism to control injection probability, it achieves an average improvement of 11.23% in clinical authenticity compared to prompt engineering methods and exceeds the best baseline by 18.54% in severity controllability.
Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents: The proposed Aegle framework virtualizes Multi-Disciplinary Teams (MDT) through a graph-structured multi-agent architecture. By introducing decoupled parallel reasoning and dynamic topology into the clinical intake process, it outperforms SOTA models on 53 metrics across 24 clinical departments.
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models: The authors propose MedCheck—the first evaluation framework for the lifecycle of medical LLM benchmarks, decomposing benchmark construction into 5 stages with a total of 46 criteria. Auditing 56 medical benchmarks using this framework reveals three systemic issues: (1) 50% do not align with any medical standards (ICD/SNOMED), (2) 88% do not handle data contamination, and (3) 89% do not test model robustness while 91% do not test uncertainty—concluding that current "leaderboard progress" is largely an illusion.
BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels: BioHiCL utilizes hierarchical multi-label annotations of MeSH (Medical Subject Headings) to provide structured supervision for dense retrievers. By aligning the embedding space with the MeSH semantic space through depth-weighted label similarity, a 0.1B model outperforms most specialized models on biomedical retrieval, sentence similarity, and question-answering tasks.
Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA: This paper investigates how social identity markers (sexual orientation and religious beliefs) distort the accuracy and confidence calibration of LLMs in medical QA. It finds that "homosexual" markers consistently lead to performance degradation and calibration crises across 9 LLMs, and that intersectional identities produce non-additive, specific harm.
Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?: This paper constructs a high-quality German medical corpus, FineMed-de (7.3 million documents / 5.1 billion tokens filtered from FineWeb2), then applies continual pre-training and SLERP model merging to three LLMs (7B-24B) to create the DeFineMed model family. It demonstrates that domain-specialized 7B models can substantially narrow the performance gap with general 24B models on German medical tasks (win rate improved by ~3.5x).
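For readers unfamiliar with the SLERP merging mentioned in the DeFineMed entry, the following is a minimal sketch of spherical linear interpolation between two flattened weight vectors. It illustrates the general technique only; the function name `slerp`, the flat-list representation, and the fallback threshold `eps` are assumptions for illustration, not details taken from the paper.

```python
import math

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero flat weight vectors.

    Hypothetical helper for illustration; real merging tools operate per tensor.
    """
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    # Angle between the two vectors, from the normalized dot product.
    cos_omega = sum(x * y for x, y in zip(w_a, w_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly (anti-)parallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]
    s_a = math.sin((1 - t) * omega) / math.sin(omega)
    s_b = math.sin(t * omega) / math.sin(omega)
    return [s_a * x + s_b * y for x, y in zip(w_a, w_b)]

# Halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike plain averaging, the interpolated vector keeps unit norm when both inputs are unit vectors, which is the usual motivation for preferring SLERP over linear merging.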
CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation: The authors decompose the ambiguous question of "quality of a CT report" into a QA checklist of "whether each fine-grained attribute of every finding matches," constructing the CT-FineBench benchmark with 44k questions. Its sensitivity to clinical errors and correlation with human expert scores significantly outperform existing metrics such as BLEU, BERTScore, RadGraph, RaTEScore, and GREEN.
CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers: The authors remodel 3D CT interpretation as an agentic task where "radiologists iteratively explore via tools." By exposing four categories of tools—Data Ingestion, Global Navigation, Detailed Observation, and Advanced Analysis—through the Model Context Protocol (MCP), they construct CT-FlowBench with 2000+300 executable trajectories. They subsequently perform SFT to develop CT-Flow-8B, which achieves 69.46% ACC on 3D-RAD (a +22.46% improvement over slice-only baselines) with a tool name error rate of only 0.007/case.
CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction: CURA proposes a dual-level uncertainty calibration framework: the individual level aligns prediction uncertainty with error probability, while the cohort level regularizes predictions via neighborhood risk rates in the embedding space. It consistently improves calibration metrics across five clinical risk prediction tasks on MIMIC-IV without sacrificing discriminative performance.
CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning: The authors construct CureMed-Bench, a medical reasoning dataset covering 13 languages (including low-resource languages like Amharic, Yoruba, and Swahili) with 15,774 open-ended questions. They propose Cure-Med: a two-stage "code-switching aware SFT + curriculum GRPO" framework that jointly optimizes reasoning correctness and language consistency. At 7B, it achieves a language consistency/logical accuracy of 85.21% / 54.35%, and at 32B, it reaches 94.96% / 70.04%.
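Several entries in this list (CoLabScience, CURE-Med, Ryze) train with GRPO. As a rough sketch of the core idea, the snippet below computes group-relative advantages by standardizing rewards within a group of responses sampled for the same prompt. The function name, the use of the population standard deviation, and the `eps` value are assumptions; implementations differ on these details, and none of them come from the papers listed here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Illustrative only: each response's advantage is its reward minus the
    group mean, divided by the group standard deviation (plus eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, without needing a separately learned value function.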
Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning: This paper proposes the Clinical Diagnostic Reasoning Data (CDRD) structure to capture the abstract clinical reasoning logic from symptoms to differential diagnosis. Based on CDRD, the Dr. Assistant model (14B) is developed using a two-stage SFT+RL training process. It outperforms HuatuoGPT-o1-72B by 13.59% in ICD-Recall on clinical inquiry benchmarks, achieving performance competitive with GPT-5.
Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction: This paper proposes the K2K framework, which treats the LLM's FFN parameter space as a retrievable knowledge base. By injecting clinical knowledge via LoRA, constructing precise retrieval with activation-guided probes, and adaptively integrating via cross-attention re-ranking, it achieves medical prediction SOTA without external retrieval latency.
Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach: This paper proposes the MedSSR framework, which efficiently enhances the medical reasoning capabilities of LLMs through controllable data synthesis injected with rare disease knowledge and a "Self-supervised RL → Supervised RL" semi-supervised training paradigm. It achieves a maximum improvement of +5.93% on rare disease tasks, breaking the +3% improvement ceiling of existing methods.
Empathy Applicability Modeling for General Health Queries: This paper proposes the Empathy Applicability Framework (EAF) to determine whether it is "appropriate" to express emotional reactions or interpretive understanding in single-turn health queries. By constructing a benchmark with human and GPT-4o annotations and training classifiers, the study provides upstream signals for empathy requirement identification in medical LLMs before response generation.
Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence: This paper constructs the MedCounterFact dataset—systematically replacing interventions in clinical trials with nonsense words, medical terms, non-medical objects, and toxic substances. It finds that leading LLMs exhibit nearly unconditional compliance with the context in the face of counterfactual medical evidence, confidently providing answers even when "evidence" suggests heroin or mustard gas is effective, revealing a severe lack of defined boundaries between faithfulness and safety.
Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech: This paper evaluates TF-IDF, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog using 4,000 parallel English-Filipino DementiaBank dialogue transcripts. It finds that cross-lingual robustness in dementia detection primarily stems from language coverage during training rather than modern encoder architectures.
From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning: This paper adapts the Toulmin argumentation model to the clinical diagnostic process and proposes the CGCL three-stage curriculum training framework (Fact Collection → Hypothesis Testing → Comprehensive Conclusion). Coupled with T-Eval for quantifying reasoning structural integrity, it achieves diagnostic reasoning quality comparable to RL methods without requiring RL.
HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks: HeteroRAG constructs the MedAtlas knowledge base with 2.7 million image-text pairs and five types of corpora. It decomposes medical multimodal RAG into three components—ModCLIPs trained by modality to retrieve reports, MQG generating customized queries per corpus to retrieve documents, and HKPT preference fine-tuning to align cross-modality and multi-source knowledge—enabling a 7B model to consistently outperform open-source Med-LVLMs with 4-5× its parameters across 11 datasets.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering: This paper proposes HypEHR, a Lorentz hyperbolic model with only 22M parameters. It embeds medical codes, visit records, and questions into hyperbolic space and aligns them with the ICD ontology structure via hierarchy-aware regularization, achieving performance close to LLM-based methods on the MIMIC-IV EHR-QA task.
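To make the "Lorentz hyperbolic" embedding in the HypEHR entry more concrete, here is a small sketch of the standard hyperboloid (Lorentz) model with curvature -1: lifting a Euclidean vector onto the hyperboloid and measuring geodesic distance. The helper names are hypothetical, and the paper's actual parameterization and curvature may differ.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v onto the unit hyperboloid (curvature -1)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    # Clamp to 1.0 to guard against rounding just below the acosh domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift_to_hyperboloid([0.0])
point = lift_to_hyperboloid([math.sinh(1.0)])
dist = lorentz_distance(origin, point)  # one unit of hyperbolic distance
```

Distances grow quickly away from the origin, which is why hyperbolic space is often used for tree-like structures such as code ontologies.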
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages: This paper constructs IndicMedDialog, the first parallel multi-turn medical diagnostic dialogue dataset covering English and 9 Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu), totaling 29,800 instances (2,980 dialogues × 10 languages). The dataset was created using LLaMA-3.3-70B for dialogue synthesis, TranslateGemma for translation, native speaker verification, and script-aware post-processing for phonetic/spelling/spacing corrections. Furthermore, IndicMedLM was trained using 4-bit quantized LLaMA-3.2-3B with LoRA, achieving the highest post-processed accuracy in 7 out of 10 languages and a 95.3% medical safety pass rate, while identifying 5 systematic failure modes (ID/LC/CDC/TTF/PLG).
Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation: This paper proposes DyReMe, a dynamic medical diagnostic evaluation framework. It utilizes the DyGen module to generate brand-new diagnostic cases incorporating clinical distractors such as differential diagnoses and misdiagnosis factors. Through the EvalMed module, LLMs are evaluated across four dimensions—Accuracy, Veracity, Helpfulness, and Consistency—revealing that existing static benchmarks overestimate the diagnostic capabilities of LLMs. For instance, GPT-5's accuracy dropped by 8.25% on DyReMe, and 12 LLMs all exhibited significant deficiencies in trustworthiness.
Language Reconstruction with Brain Predictive Coding from fMRI Data: This paper proposes PredFT, an end-to-end fMRI-to-Text decoding model that integrates a main network (language decoding) and a side network (brain predictive coding representation). By extracting forward-looking semantic representations from predictive brain regions (PTO areas) and fusing them into the decoding process, PredFT achieves a BLEU-1 of 34.95% (Sub-1) on the LeBel dataset, a gain of 7.84 percentage points over the strongest baseline, MapGuide.
Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness: This paper proposes the OPL-MT-MNAR framework, which learns dynamic ICU patient representations by combining MNAR-aware multimodal encoders, Bayesian filtered latent states, and offline policy learning. By utilizing "information carried by the missingness patterns themselves" in structured data and clinical text, it achieves sepsis treatment policies superior to clinician behavior (FQE 0.679 vs 0.528).
LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification: This PsyDefDetect competition system utilizes Qwen3-8B QLoRA, minority lexical augmentation, grouped 5-fold cross-validation, Out-of-Fold (OOF) logit bias, and multi-seed ensembles to improve the official macro F1 for psychological defense mechanism classification to 0.3917, ranking 4th among 21 teams.
MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation: This paper proposes MARCH, a multi-agent framework that simulates the hierarchical collaboration of radiology Residents, Fellows, and Attending physicians. Through a three-stage process (initial drafting, retrieval-augmented revision, and consensus-driven finalization), it generates CT reports. On the RadGenome-ChestCT dataset, it achieves a CE-F1 of 0.399, representing a 57.7% improvement over the best baseline, Reg2RG (0.253).
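The 57.7% figure in the MARCH entry is a relative improvement, not a difference in absolute score. A quick check using only the two CE-F1 values quoted above (the helper name is hypothetical):

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# CE-F1 values quoted in the MARCH entry: 0.399 (MARCH) vs 0.253 (Reg2RG).
march_gain = relative_improvement(0.399, 0.253)
print(round(march_gain, 1))  # 57.7
```

The absolute difference is 0.146 CE-F1, so summaries that report "percentage points" (absolute) and "% improvement" (relative) are not directly comparable.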
Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation: This paper proposes the CARE framework and the FAITH-M benchmark dataset. By integrating local dialogue context encoding with contrastive exemplar retrieval and Knowledge Distillation Chain-of-Thought (KD-CoT), it performs fine-grained ordinal assessment of AI-generated psychotherapy dialogues across six therapeutic principles. The framework achieves a weighted F1 of 63.34, representing a 64.26% gain over the strongest baseline, Qwen3.
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts: MedFact establishes an expert-annotated fact-checking benchmark covering real-world Chinese medical texts. Testing 20 LLMs shows that while current models can easily judge "whether an error exists," they struggle to precisely locate errors. RAG is beneficial, whereas multi-agent systems and reasoning-time scaling tend to amplify "over-criticism."
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models: MHGraphBench automatically constructs 9 categories of multiple-choice tasks from the mental health subgraph of PrimeKG. It finds that LLMs achieve near-perfect scores in entity recognition but remain significantly deficient in drug-disease relationship judgment, contraindication boundaries, and two-hop KG reasoning.
MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models: This paper proposes the R-MHSafe role-aware mental health safety taxonomy and the MHSafeEval closed-loop agent evaluation framework. Through adversarial multi-turn counseling interactions, it systematically identifies role-dependent cumulative safety failures in LLMs within mental health counseling scenarios, revealing interaction-level harms that traditional static benchmarks fail to capture.
Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection: This paper proposes decomposing utterances into Emotion-Logic-Behavior (ELB) components and utilizing LLMs to reason about multiple cognitive distortion instances. These instances are then aggregated using a multi-view gated attention MIL framework for bag-level classification. The method outperforms direct LLM reasoning baselines on both Korean (KoACD) and English (Therapist QA) datasets.
MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning: MultiDx integrates web retrieval, SOAP structured cases, similar case libraries, and fine-grained reasoning trace retrieval into a two-stage diagnostic reasoning framework. By first generating candidate diseases from multi-path evidence and then performing disease matching, voting, and differential diagnosis reranking, it simultaneously improves diagnostic accuracy and reasoning recall on both MedCaseReasoning and DiReCT benchmarks.
PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution: PCoA constructs a medical aspect-based summarization benchmark for Randomized Controlled Trial (RCT) abstracts, aligning each aspect summary with both supporting sentences and contributory phrases, and utilizes a three-tier metric system (claim, citation, and phrase) to evaluate LLM capabilities in verifiable medical summarization.
PrinciplismQA: A Philosophy-Grounded Approach to Assessing LLM-Human Clinical Medical Ethics Alignment: This paper constructs the PrinciplismQA benchmark (3,648 questions, including knowledge MCQA and open-ended clinical ethical dilemmas) based on Principlism (the four principles of Autonomy, Non-maleficence, Beneficence, and Justice), the international gold standard for medical ethics. Supported by an expert-calibrated evaluation pipeline, the study reveals that high accuracy on knowledge benchmarks does not equate to clinical ethical reasoning capability—the strongest model, o3, achieved an overall score of only 77.5%.
ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection: ProMedical utilizes hierarchical, fine-grained clinical rubrics co-constructed with medical doctors to guide preference datasets, reward modeling, and benchmarks. Through explicit criteria injection, a multi-dimensional reward model is trained, achieving an improvement of 22.3% in overall accuracy and 21.7% in safety compliance for Qwen3-8B in medical alignment.
Query Pipeline Optimization for Cancer Patient Question Answering Systems: This paper proposes CoMeta, a three-layer controllable metadata-aware RAG framework for Cancer Patient Question Answering (CPQA). By integrating Clinical Hybrid Semantic-Symbolic Document Retrieval (CHSDR)—which fuses E-Utilities real-time Boolean search with MedCPT semantic retrieval—and Semantic-Enhanced Overlapping Segmentation (SEOS) to prevent context fragmentation, the framework improves Claude-3-Haiku's answer accuracy on the CMMQA dataset by 5.24% (vs. CoT) and approximately 3% (vs. naive RAG).
RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction: RA-RRG uses LLMs to extract clinical key phrases from radiology reports and build a retrieval database. Given a chest X-ray image, relevant phrases are retrieved and passed to an LLM to generate the report. This effectively suppresses hallucinations without requiring LLM fine-tuning, achieving SOTA on CheXbert metrics with only 18 GPU hours of training.
RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings: This paper proposes RADS (Reinforcement Adaptive Domain Sampling), an RL-based sample selection strategy that significantly improves cross-domain disease detection in extreme low-resource and imbalanced clinical scenarios by intelligently selecting a few target domain samples for annotation and joint fine-tuning.
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework: This paper introduces VietPET-RoI, the first 3D PET/CT dataset (Vietnamese) with fine-grained ROI annotations, and HiRRA, a hierarchical report generation framework that simulates the diagnostic workflow of radiologists. By modeling spatial-morphological relationships between ROIs using Graph Neural Networks, the framework achieves a 19.7% improvement in BLEU-4 and a 45.8% increase in the clinical metric RoIQ.
Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification: This paper proposes a dual-validation selective triage framework for early HIV suspicion identification in Spanish clinical notes, utilizing MCP to handle aleatoric uncertainty and MCMD geometric veto to handle epistemic uncertainty. The system automatically processes 67.7% of cases while achieving a 0.982 Clear \(F_2\) under strict safety constraints.
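The triage entry above reports an F2 score, which weights recall more heavily than precision. As a reminder of the general F-beta definition, here is a minimal sketch; how the paper defines its "Clear" subset is not stated in this summary, so the snippet covers only the generic metric, and the function name is an assumption.

```python
def f_beta(precision, recall, beta=2.0):
    """General F-beta score; beta=2 gives F2, which favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Same two numbers, swapped: F2 rewards the high-recall configuration more.
high_recall = f_beta(precision=0.5, recall=1.0)     # 2.5 / 3.0
high_precision = f_beta(precision=1.0, recall=0.5)  # 2.5 / 4.5
```

In a screening setting, a missed suspicious case is usually costlier than a false alarm, which is the usual reason for choosing F2 over F1.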
ReMedi: Reasoner for Medical Clinical Prediction: ReMedi reformulates EHR clinical prediction as a "rationale-prediction" generation and preference learning task. By utilizing hard sample regeneration with ground-truth outcome hints, SFT, and DPO, it teaches medical LLMs to provide fine-grained explanations for patient risks. It achieves up to a 19.9 F1 point improvement over KARE across three clinical prediction tasks on MIMIC-IV.
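The ReMedi entry mentions DPO for preference learning. The sketch below shows the standard per-pair DPO loss computed from sequence log-probabilities under the policy and a frozen reference model. The argument names, the `beta` default, and the example log-probabilities are illustrative assumptions, not values from the paper.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    pi_* are log-probs under the policy, ref_* under the frozen reference.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Made-up log-probabilities: the policy already prefers the chosen rationale.
loss = dpo_loss(pi_chosen=-4.0, pi_rejected=-6.0,
                ref_chosen=-5.0, ref_rejected=-5.0)
```

The loss equals log 2 when the policy and reference agree, and shrinks as the policy assigns relatively more probability to the preferred output.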
RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models: This paper proposes RePrompT, a time-aware LLM framework that consistently outperforms EHR and LLM baselines on readmission and mortality prediction tasks in MIMIC-III/IV through two complementary mechanisms: recurrent prompt tuning (using the hidden state of the previous visit as a soft prompt for the next) and struct-encoded prompt tuning (injecting embeddings from population-level EHR encoders).
Responsible Evaluation of AI for Mental Health: Through a systematic analysis of 135 ACL Anthology papers, this work reveals five major flaws in the evaluation of AI mental health tools (reliance on generic metrics, lack of human evaluation, neglect of safety and fairness, etc.) and proposes an interdisciplinary evaluation taxonomy (assessment/intervention/information synthesis × validity/reliability/implementation/maintenance) that integrates clinical psychometrics and implementation science.
Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers: Ryze automatically converts biomedical paper PDFs into evidence-enriched QA data that preserves figures, captions, structured extractions, and cited paragraphs. Using a progress-gated SFT+GRPO strategy to train BioVLM-8B, it achieves 48.0% weighted accuracy on LAB-Bench, outperforming the Qwen3-VL-8B base by 12.6 percentage points and GPT-5.2 by 3.8 percentage points.
SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning: The authors propose SEMA-RAG, a self-evolving multi-agent Retrieval-Augmented Generation framework. By simulating phased clinical reasoning via three specialized agents (Interpreter, Explorer, Arbiter), it outperforms the strongest baselines by an average of +6.46 accuracy points across 5 medical QA benchmarks.
Text-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation: This paper proposes CoMed, an LLM-empowered graph learning framework. It constructs a global medical knowledge graph by combining EHR statistical evidence with type-constrained LLM reasoning. It then enriches the graph into a text-attributed graph using LLM-generated node descriptions and edge rationales. Finally, it jointly trains a LoRA-finetuned LLaMA encoder and a heterogeneous GNN to learn unified medical concept embeddings, significantly improving diagnosis prediction performance on MIMIC-III/IV.