🏥 Medical Imaging

💬 ACL2026 · 42 paper notes

"Excuse Me, May I Say Something…" CoLabScience: A Proactive AI Assistant for Biomedical Discovery

CoLabScience introduces the PULI (Positive-Unlabeled Learning Intervention) framework to train an LLM assistant capable of proactively determining when and how to intervene in biomedical team discussions. By leveraging GRPO and a reinforcement learning coordinator, the system automatically identifies optimal intervention moments and generates scientific suggestions from streaming conversations.

Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives

This paper proposes Anonpsy, a framework that reformulates the de-identification of psychiatric narratives as a graph-guided semantic rewriting problem. The approach first converts narratives into semantic graphs, applies constrained perturbations on the graph to modify identity-related information while preserving clinical structure, and finally reconstructs the narrative via graph-conditioned generation.

AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling

This paper proposes the AROMA framework, which integrates textual evidence, knowledge graph topological information, and protein sequence features within a multimodal architecture, combined with a two-stage training strategy (SFT + GRPO), to achieve interpretable and accurate prediction of genetic perturbation effects.

Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders

This paper proposes CMedTEB (Chinese Medical Text Embedding Benchmark) and CARE (asymmetric retrieval framework). CMedTEB constructs a high-quality Chinese medical retrieval/reranking/STS benchmark via multi-LLM voting with expert validation, while CARE adopts an asymmetric architecture that encodes queries with a lightweight BERT and documents with a large LLM. Through a two-stage progressive alignment strategy, CARE achieves LLM-level retrieval accuracy at BERT-level online latency.

Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering

This paper proposes StsPatient, which extracts domain-specific steering vectors from contrastive instruction/response pairs and applies a Stochastic Token Modulation (STM) mechanism to control injection probability, enabling simulation of standardized patients across different cognitive impairment domains and severity levels. Compared to prompt engineering methods, StsPatient achieves an average improvement of 11.23% in clinical authenticity and surpasses the best baseline by 18.54% in severity controllability.
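The core mechanism can be pictured with a minimal sketch: at each decoding step a steering vector is added to the hidden state only with some probability, so the injection probability acts as a severity dial. The function and parameter names (`steer_vec`, `inject_prob`, `alpha`) are illustrative assumptions, not the paper's API.

```python
import random

def stochastic_steering(hidden, steer_vec, inject_prob, alpha):
    """Hypothetical sketch of Stochastic Token Modulation: with probability
    inject_prob, add a scaled steering vector to the hidden state of a
    generated token, so higher inject_prob yields stronger impairment."""
    out = []
    for h in hidden:  # one hidden vector per generated token
        if random.random() < inject_prob:
            h = [hi + alpha * vi for hi, vi in zip(h, steer_vec)]
        out.append(h)
    return out
```

With `inject_prob=0.0` the model is untouched; with `inject_prob=1.0` every token is steered, and intermediate values interpolate severity.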

Beyond the Individual: Virtualizing Multi-Disciplinary Reasoning for Clinical Intake via Collaborative Agents

This paper proposes Aegle, a graph-structured multi-agent framework that virtualizes multidisciplinary team (MDT) consultation for clinical intake. By introducing decoupled parallel reasoning and dynamic topology into the outpatient interview workflow, Aegle surpasses state-of-the-art models across 53 metrics spanning 24 clinical departments.

BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels

BioHiCL leverages the hierarchical multi-label annotations of MeSH (Medical Subject Headings) to provide structured supervision for dense retrievers. By aligning the embedding space with the MeSH semantic space via depth-weighted label similarity, a 0.1B model surpasses most specialized models on biomedical retrieval, sentence similarity, and question answering tasks.
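A depth-weighted label similarity can be sketched as a weighted Jaccard over two documents' MeSH code sets, where deeper (more specific) shared labels contribute more. This formula is an illustrative assumption; the paper's exact weighting may differ.

```python
def depth_weighted_similarity(labels_a, labels_b, depth):
    """Illustrative depth-weighted similarity between two MeSH label sets:
    shared labels contribute their hierarchy depth, normalized by the
    total depth mass of the union (a weighted Jaccard)."""
    shared = labels_a & labels_b
    num = sum(depth[l] for l in shared)
    den = sum(depth[l] for l in labels_a | labels_b)
    return num / den if den else 0.0
```

Under this scheme, two documents sharing only a shallow root category score low, while agreement on a deep, specific heading dominates the target similarity used to supervise the embedding space.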

Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA

This paper investigates how social identity markers (sexual orientation and religious affiliation) distort LLM accuracy and confidence calibration in medical question answering. It finds that the "homosexual" marker consistently degrades performance and induces calibration crises across 9 LLMs, and that intersectional identities produce non-additive, identity-specific harms.

Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

This paper constructs a high-quality German medical corpus, FineMed-de (7.3 million documents / 5.1 billion tokens filtered from FineWeb2), applies continual pre-training and SLERP model merging to three LLMs (7B–24B), and creates the DeFineMed model family. The results demonstrate that a domain-specialized 7B model can substantially narrow the performance gap with a 24B general-purpose model on German medical tasks, improving the win rate by approximately 3.5×.

Cognitive Policy-Driven LLM for Diagnosis and Intervention of Cognitive Distortions in Emotional Support Conversation

This paper proposes CoPoLLM, a framework that constructs CogBiasESC — the first emotional support conversation dataset annotated with cognitive distortions — and integrates a Cognitive Policy Reinforcement Learning (CPRL) engine with Dual-Stream Conditional Optimization (DSCO) to enable LLMs to diagnose eight types of cognitive distortions and generate strategy-aware intervention responses, achieving state-of-the-art performance over 15 baselines.

CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction

CURA proposes a dual-level uncertainty calibration framework: at the individual level, it aligns predictive uncertainty with error probability; at the cohort level, it regularizes predictions using neighborhood event rates in the embedding space. The framework consistently improves calibration metrics across five clinical risk prediction tasks on MIMIC-IV without sacrificing discriminative performance.
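The cohort-level signal can be sketched as the observed event rate among a patient's nearest neighbors in embedding space; a regularizer would then penalize predictions that stray from this rate. The brute-force Euclidean search and squared-error penalty suggested below are assumptions for illustration.

```python
import math

def neighborhood_event_rate(embeddings, outcomes, idx, k):
    """Sketch of the cohort-level signal: the observed event rate among
    the k nearest neighbors (Euclidean) of patient idx in embedding
    space. A calibration term could penalize (pred - this rate)**2."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    order = sorted((i for i in range(len(embeddings)) if i != idx),
                   key=lambda i: dist(embeddings[i], embeddings[idx]))
    return sum(outcomes[i] for i in order[:k]) / k
```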

DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

DART identifies and addresses "harm drift"—a phenomenon whereby fine-tuning LLMs to improve difference-aware classification accuracy (e.g., recognizing legitimate demographic distinctions) causes the model's generated explanations to become increasingly harmful. Through a three-stage Distill-Audit-Repair pipeline, DART improves Llama-3-8B accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

This work proposes four audio-attention-based metrics (AudioRatio, AudioConsistency, AudioEntropy, TextEntropy) and trains lightweight logistic regression classifiers to detect hallucinations in Speech Large Language Models (SpeechLLMs) at inference time, achieving up to +0.23 PR-AUC improvement on in-domain data.
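An AudioRatio-style score can be sketched as the fraction of a generated token's attention mass that falls on audio-input positions; the thresholding and the aggregation across heads and layers are assumptions not taken from the paper.

```python
def audio_ratio(attn_row, is_audio):
    """Sketch of an AudioRatio-style metric: the share of one generated
    token's attention mass landing on audio-input positions. Low values
    may flag text-driven (potentially hallucinated) tokens."""
    total = sum(attn_row)
    audio = sum(a for a, flag in zip(attn_row, is_audio) if flag)
    return audio / total if total else 0.0
```

Such per-token scores, pooled over the generated answer, would form one feature of the lightweight logistic regression classifier.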

Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning

This paper proposes the Clinical Diagnostic Reasoning Data (CDRD) structure to capture abstract clinical reasoning logic from symptoms to differential diagnosis. Based on CDRD, a two-stage SFT+RL training pipeline is employed to build the Dr. Assistant model (14B), which surpasses HuatuoGPT-o1-72B by 13.59% in ICD-Recall on clinical inquiry benchmarks, reaching a level competitive with GPT-5.

Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction

This paper proposes the K2K framework, which treats the FFN parameter space of LLMs as a retrievable knowledge base. Clinical knowledge is injected via LoRA, activation-guided probes enable precise retrieval, and cross-attention reranking adaptively integrates multi-source internal knowledge — achieving state-of-the-art healthcare prediction without external retrieval latency.

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

This paper proposes MedSSR, a framework that enhances LLM medical reasoning through controllable data synthesis with rare disease knowledge injection and a semi-supervised training paradigm of "self-supervised RL → supervised RL." MedSSR achieves up to +5.93% improvement on rare disease tasks, surpassing the +3% ceiling observed in all prior methods.

Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence

This paper introduces the MedCounterFact dataset—constructed by systematically replacing interventions in clinical trials with nonsense words, medical terminology, non-medical objects, and toxic substances—and finds that state-of-the-art LLMs almost unconditionally defer to context when presented with counterfactual medical evidence, confidently providing answers even when the "evidence" attributes therapeutic efficacy to heroin or mustard gas. The findings expose a critical lack of a well-defined boundary between faithfulness and safety.

From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning

This paper adapts the Toulmin argument model to clinical diagnosis and proposes CGCL, a three-stage curriculum training framework (fact collection → hypothesis testing → synthesis), paired with T-Eval for quantifying reasoning structural completeness. The approach achieves diagnostic reasoning quality comparable to RL-based methods without requiring reinforcement learning.

HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

This paper introduces HCFD, a codec-based audio deepfake detection task for healthcare settings. It constructs HCFK, the first codec-forged speech dataset covering multiple clinical pathological conditions (depression, Alzheimer's disease, dysarthria), and proposes the PHOENIX-Mamba framework, which models heterogeneous forgery evidence prototypes in hyperbolic space, achieving 97.04% accuracy on English depression detection.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

This paper proposes HypEHR, a 22M-parameter Lorentz hyperbolic model that embeds medical codes, patient visits, and questions into hyperbolic space. Through hierarchy-aware regularization aligned with the ICD ontology structure, HypEHR achieves performance comparable to LLM-based approaches on the MIMIC-IV EHR question answering task.

Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation

This paper proposes DyReMe, a dynamic medical diagnostic evaluation framework. Its DyGen module generates novel diagnostic cases containing clinically grounded distractors—including differential diagnoses and misdiagnosis factors—while the EvalMed module assesses LLMs across four dimensions: accuracy, veracity, helpfulness, and consistency. The results reveal that existing static benchmarks systematically overestimate LLM diagnostic capability; GPT-5 suffers an 8.25% accuracy drop on DyReMe, and all 12 evaluated LLMs exhibit significant trustworthiness deficiencies.

Language Reconstruction with Brain Predictive Coding from fMRI Data

This paper proposes PredFT, an end-to-end fMRI-to-Text decoding model comprising a main network (language decoding) and a side network (brain predictive coding representations). By extracting prospective semantic representations from prediction-related brain regions (PTO areas) and integrating them into the decoding process, PredFT achieves a BLEU-1 of 34.95% on the LeBel dataset (Sub-1), outperforming the strongest baseline MapGuide by 7.84 percentage points.

Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness

This work proposes OPL-MT-MNAR, a framework that learns dynamic patient representations from ICU data by leveraging the information embedded in missingness patterns of structured observations and clinical text. It combines an MNAR-aware multimodal encoder, Bayesian filtering for latent belief states, and offline policy learning to derive sepsis treatment policies that outperform clinician behavior (FQE 0.679 vs. 0.528).

LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval

This paper proposes LogosKG, a hardware-aligned knowledge graph retrieval framework that reformulates graph traversal as multiplications over three sparse associative matrices (SUB/OBJ/REL). Combined with degree-aware graph partitioning, cross-graph routing, and on-demand LRU caching, LogosKG enables scalable and interpretable high-hop retrieval over billion-edge KGs on a single device. Downstream KG-LLM interaction experiments further reveal the systematic influence of graph topology on LLM diagnostic reasoning.
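The traversal-as-sparse-multiplication idea can be illustrated with a toy one-hop step over three associative maps standing in for the SUB/OBJ/REL matrices; the actual data layout, partitioning, and caching from the paper are omitted, and the dict representation is an assumption.

```python
def one_hop(frontier, SUB, REL, OBJ, relation):
    """Toy version of one traversal hop: SUB maps a subject entity to its
    triple ids, REL maps a triple id to its relation, OBJ maps a triple
    id to its object. One hop is a gather-filter-gather, mirroring a
    product of the three sparse associative matrices."""
    result = set()
    for entity in frontier:
        for t in SUB.get(entity, ()):
            if REL[t] == relation:
                result.add(OBJ[t])
    return result
```

Chaining `one_hop` calls on the resulting frontier gives the high-hop retrieval that the sparse-matrix formulation makes hardware-friendly at billion-edge scale.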

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

This paper proposes MARCH, a multi-agent framework that simulates the resident–fellow–attending hierarchical collaboration process in radiology. Through three stages—initial report drafting, retrieval-augmented revision, and consensus-driven finalization—MARCH generates CT reports achieving a CE-F1 of 0.399 on the RadGenome-ChestCT dataset, representing a 57.7% improvement over the best baseline Reg2RG (0.253).

Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

This paper proposes the CARE framework and the FAITH-M benchmark dataset. By integrating conversational context encoding, contrastive exemplar retrieval, and knowledge distillation chain-of-thought reasoning (KD-CoT), CARE performs fine-grained ordinal evaluation of AI-generated psychotherapeutic responses across six therapeutic principle dimensions, achieving a weighted F1 of 63.34—a 64.26% improvement over the strongest baseline, Qwen3.

MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

This paper proposes R-MHSafe, a role-aware mental health safety taxonomy, and MHSafeEval, a closed-loop agent evaluation framework. Through adversarial multi-turn counseling interactions, the framework systematically uncovers role-dependent cumulative safety failures of LLMs in mental health counseling scenarios, revealing interaction-level harms that existing static benchmarks fail to capture.

Model-Agnostic Meta Learning for Class Imbalance Adaptation

This paper proposes HAMR (Hardness-Aware Meta-Resample), a unified meta-learning framework that dynamically estimates instance-level importance weights via bi-level optimization to prioritize genuinely difficult samples, coupled with a neighborhood-aware resampling mechanism that shifts training focus toward hard samples and their semantic neighbors. HAMR consistently outperforms strong baselines across 6 imbalanced NLP datasets.

Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection

This paper proposes decomposing utterances into Emotion–Logic–Behavior (ELB) components and leveraging LLM reasoning to generate multiple cognitive distortion instances, which are subsequently aggregated via a multi-view gated attention MIL framework for bag-level classification. The approach outperforms direct LLM inference baselines on both the Korean (KoACD) and English (Therapist QA) datasets.

OmniCompliance-100K: A Multi-Domain Rule-Grounded Real-World Safety Compliance Dataset

This paper introduces OmniCompliance-100K, the first large-scale, multi-domain, regulation-grounded safety compliance dataset built upon real-world cases. It comprises 12,985 manually curated regulatory rules and 106,009 real-world compliance cases collected via a Web search agent, spanning 9 domains including AI safety, data privacy, finance, and healthcare. Extensive benchmarking reveals systematic deficiencies in current LLMs' safety compliance capabilities.

PrinciplismQA: A Philosophy-Grounded Approach to Assessing LLM-Human Clinical Medical Ethics Alignment

This paper constructs the PrinciplismQA benchmark (3,648 questions, including knowledge-based MCQA and open-ended clinical ethics dilemmas) grounded in the internationally recognized gold standard of medical ethics—Principlism (the four principles of Autonomy, Non-maleficence, Beneficence, and Justice)—and develops an expert-calibrated evaluation pipeline. The study finds that high accuracy on knowledge benchmarks does not imply clinical ethical reasoning capability: even the strongest model, o3, achieves only 77.5% overall.

ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design

ProtoCycle proposes a reflective agent framework that positions an LLM as a planner coupled with a lightweight tool environment for text-guided protein sequence design. Through a multi-round feedback-driven decision loop and online reinforcement learning training, the framework achieves strong language alignment while maintaining competitive foldability.

Query Pipeline Optimization for Cancer Patient Question Answering Systems

This paper proposes CoMeta, a three-tier controllable metadata-aware RAG framework for Cancer Patient Question Answering (CPQA). It integrates Clinical Hybrid Semantic-symbolic Document Retrieval (CHSDR), which fuses real-time Boolean search via E-Utilities with MedCPT semantic retrieval, and employs Semantically Enhanced Overlapping Segmentation (SEOS) to prevent context fragmentation. On the CMMQA dataset, CoMeta improves Claude-3-Haiku answer accuracy by 5.24% over CoT and approximately 3% over naive RAG.

RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction

This paper proposes RA-RRG, a framework that leverages an LLM to extract clinically relevant key phrases from radiology reports and construct a retrieval database. Given a chest X-ray image, relevant phrases are retrieved and fed to an LLM for report generation—without any LLM fine-tuning—effectively suppressing hallucinations. The approach requires only 18 GPU hours of training and achieves state-of-the-art performance on CheXbert metrics.

RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings

This paper proposes RADS (Reinforcement Adaptive Domain Sampling), a reinforcement learning-based sample selection strategy that significantly improves cross-domain disease detection under extreme low-resource and class-imbalanced clinical settings by intelligently selecting a small number of target-domain samples for annotation and joint fine-tuning.

Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

This paper presents VietPET-RoI, the first 3D PET/CT dataset with fine-grained ROI annotations (in Vietnamese), along with HiRRA, a hierarchical report generation framework that emulates the diagnostic workflow of radiologists. By modeling spatial-morphological inter-ROI relationships via GATv2 graph neural networks, HiRRA achieves a 19.7% improvement in BLEU-4 and a 45.8% improvement in the clinical metric RoIQ.

RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

This paper proposes RePrompT, a temporally aware LLM framework that consistently outperforms both EHR and LLM baselines on readmission and mortality prediction tasks on MIMIC-III/IV through two complementary mechanisms: recurrent prompt tuning (propagating the hidden state of the previous visit as a soft prompt for the current visit) and struct-encoded prompt tuning (injecting embeddings from a population-level EHR encoder).

RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs in Medicine

RiTeK constructs two large-scale medical textual knowledge graphs (TKGs) and corresponding complex reasoning QA datasets, covering 6 topological structures with rich textual descriptions. It evaluates 11 retrieval methods and reveals critical deficiencies in existing LLM-driven retrieval systems for medical TKG reasoning.

Semi-Supervised Disease Detection from Speech Dialogues with Multi-Level Data Modeling

This paper proposes an audio-only semi-supervised learning framework that jointly models pathological speech features in clinical dialogues at three levels—session, clip, and frame—using an EMA teacher-student network to dynamically generate high-quality pseudo-labels. With only 11 annotated samples, the framework achieves 90% of fully supervised performance on depression and Alzheimer's disease detection.

Stable On-Policy Distillation through Adaptive Target Reformulation

This paper proposes Veto, a target-level reformulation method that stabilizes on-policy knowledge distillation by constructing a geometric bridging distribution between teacher and student in logit space. A single parameter \(\beta\) simultaneously serves as an adaptive gradient veto in forward KL (suppressing harmful gradients from low-confidence tokens) and a decisiveness knob in reverse KL (balancing reward-driven optimization and output diversity), achieving a 9.2% improvement over SFT on GSM8K.
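The bridging target can be sketched as a geometric interpolation in logit space: mix teacher and student logits with weight \(\beta\), then renormalize. The exact parameterization of the adaptive veto is not reproduced here; this only shows the bridge itself, under the assumption of a simple convex combination of logits.

```python
import math

def geometric_bridge(teacher_logits, student_logits, beta):
    """Sketch of a geometric bridging distribution in logit space:
    interpolate teacher and student logits with weight beta, then
    softmax. beta -> 1 recovers the teacher; beta -> 0 the student."""
    mixed = [beta * t + (1 - beta) * s
             for t, s in zip(teacher_logits, student_logits)]
    m = max(mixed)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in mixed]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the mixture is geometric (additive in log space) rather than arithmetic, a confident student can damp low-confidence teacher tokens, which is the intuition behind the gradient-veto role of \(\beta\).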

Text-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation

This paper proposes CoMed, an LLM-empowered graph learning framework that constructs a global medical knowledge graph by combining EHR statistical evidence with type-constrained LLM inference, enriches it into a text-attributed graph via LLM-generated node descriptions and edge rationales, and jointly trains a LoRA-finetuned LLaMA encoder with a heterogeneous GNN to learn unified medical concept embeddings, achieving significant improvements in diagnosis prediction on MIMIC-III/IV.

Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

This paper proposes PlantInquiryVQA, a benchmark comprising 24,950 plant images and 138,068 question–answer pairs, along with a Chain-of-Inquiry (CoI) framework that simulates the adaptive diagnostic inquiry strategies of expert botanists. The benchmark is used to evaluate 18 MLLMs on multi-step visual reasoning for plant pathology diagnosis. Results show that structured inquiry significantly improves diagnostic accuracy and reduces hallucinations; nonetheless, even the strongest model achieves a clinical utility score of only 0.188.