
🩺 Medical LLM

🧠 NeurIPS2025 · 17 paper notes


🔥 Top topics: Medical Imaging ×6 · LLM ×4 · Reasoning ×2 · RAG ×2 · Summarization ×2

AI Should Sense Better, Not Just Scale Bigger: Adaptive Sensing as a Paradigm Shift: Inspired by biological sensory systems, this position paper argues that AI research must shift from simply scaling models to optimizing inputs—by dynamically adjusting sensor-level parameters (exposure, gain, multimodal configuration, etc.) to produce inputs most favorable to the model. Under ideal sensor adaptation, a small model (EfficientNet-B0, 5M parameters) can outperform a large model (OpenCLIP-H, 632M parameters), and the paper proposes a progressive formalization framework ranging from single-shot perception to closed-loop perception–action coupling.
CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research: This paper introduces CGBench, a clinical genetics benchmark grounded in ClinGen expert annotations, designed to evaluate the scientific literature reasoning capabilities of LLMs from both variant and gene curation perspectives. The benchmark encompasses three tasks—evidence scoring, evidence verification, and experimental evidence extraction—and finds that reasoning models perform best on fine-grained tasks but underperform non-reasoning models on high-level judgments.
CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning: CureAgent proposes an Executor-Analyst collaborative framework that decouples precise tool invocation (TxAgent/Llama-8B as Executor) from high-level clinical reasoning (Gemini 2.5 as Analyst). Combined with a Stratified Ensemble Late Fusion topology that preserves evidence diversity, the system achieves 83.8% accuracy on CURE-Bench without end-to-end fine-tuning, and reveals two critical scaling findings: the context–performance paradox and the curse of dimensionality in action space.
Demo: Guide-RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID: This paper systematically evaluates six RAG corpus configurations for Long COVID clinical QA. The GS-4 configuration—combining clinical guidelines with high-quality systematic reviews—consistently outperforms both single-guideline and large-scale literature retrieval baselines across faithfulness, relevance, and comprehensiveness. The authors further introduce the Guide-RAG framework and the LongCOVID-CQ evaluation dataset.
Document Summarization with Conformal Importance Guarantees: This work presents the first application of Conformal Prediction to document summarization. By calibrating a threshold on sentence importance scores, it provides rigorous statistical guarantees on user-controllable coverage (\(1-\alpha\)) and recall (\(\beta\)) for extractive summaries. The method is model-agnostic and requires only a small calibration set.
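The calibration step described in this summary can be illustrated with a minimal split-conformal sketch. It is not taken from the paper: the function names (`doc_threshold`, `calibrate`, `select`), the binary "important sentence" labels, and the choice of a per-document threshold as the conformity score are all assumptions made for illustration.

```python
import math

def doc_threshold(scores, important, beta):
    """Largest score threshold that still keeps a fraction >= beta of the
    sentences labelled important in one calibration document (assumed labels)."""
    imp = sorted((s for s, y in zip(scores, important) if y), reverse=True)
    need = math.ceil(beta * len(imp))
    if need == 0:
        # No important sentences (or beta == 0): any threshold satisfies recall.
        return math.inf
    return imp[need - 1]

def calibrate(cal_docs, alpha, beta):
    """Pick a global threshold q_hat from calibration documents.

    Under exchangeability, a new document's own threshold falls below q_hat
    with probability at most alpha, so recall >= beta holds with
    probability >= 1 - alpha.
    """
    thresholds = sorted(doc_threshold(s, y, beta) for s, y in cal_docs)
    k = math.floor(alpha * (len(thresholds) + 1))
    # k == 0 means too few calibration documents: keep every sentence.
    return -math.inf if k == 0 else thresholds[k - 1]

def select(sentences, scores, q_hat):
    """Extractive summary: keep every sentence scoring at least q_hat."""
    return [s for s, sc in zip(sentences, scores) if sc >= q_hat]

# Toy calibration set: (importance scores, 0/1 importance labels) per document.
cal_docs = [
    ([0.9, 0.2, 0.7], [1, 0, 1]),
    ([0.4, 0.8], [1, 0]),
    ([0.6, 0.5, 0.1], [1, 1, 0]),
]
q_hat = calibrate(cal_docs, alpha=0.5, beta=1.0)
summary = select(["s1", "s2", "s3"], [0.9, 0.45, 0.5], q_hat)
```

The importance scores can come from any scoring model, which is consistent with the model-agnostic claim above; only the threshold is calibrated.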
Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs: A framework combining TextRank-based extractive sentence selection and medical Named Entity Recognition (NER) is proposed to guide LLMs in generating faithful medical summaries. By fine-tuning LLaMA-2-7B on the English MeQSum and Bengali BanglaCHQ-Summ datasets, consistent improvements in both quality and faithfulness are achieved, with SummaC reaching 0.57 and human evaluation showing that 82% of the summaries retain key medical information.
H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis: H-DDx proposes a differential diagnosis evaluation framework grounded in the ICD-10 classification hierarchy. By expanding both predicted and ground-truth diagnoses to their ancestor nodes and computing a Hierarchical Diagnostic F1 (HDF1), the framework rewards "clinically relevant approximate correctness" rather than exact match only. Evaluating 22 LLMs reveals that the domain-specialized model MediPhi rises from 20th to 2nd place under HDF1, an advantage completely obscured by Top-5 metrics.
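The ancestor-expansion idea behind HDF1 can be sketched as follows. This is an illustrative reading of the summary, not the paper's implementation: the `PARENT` map is a small hand-written fragment in the style of ICD-10, and the per-case set-based F1 (with no averaging across cases) is an assumption.

```python
# Hand-written toy fragment of an ICD-10-style hierarchy (child -> parent).
# Illustrative only; a real evaluation would load the full classification.
PARENT = {
    "J18.9": "J18", "J15.9": "J15",
    "J18": "J09-J18", "J15": "J09-J18",
    "J09-J18": "J00-J99",
    "I21.9": "I21", "I21": "I20-I25", "I20-I25": "I00-I99",
}

def expand(codes):
    """Return the codes together with all of their ancestors."""
    out = set()
    for code in codes:
        # Stop early once a node is already present: its ancestors are too.
        while code is not None and code not in out:
            out.add(code)
            code = PARENT.get(code)
    return out

def hdf1(predicted, gold):
    """Set-based F1 between ancestor-expanded prediction and ground truth."""
    p, g = expand(predicted), expand(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

# A near miss within the same block earns partial credit,
# while an unrelated chapter earns none.
near_miss = hdf1(["J15.9"], ["J18.9"])
unrelated = hdf1(["I21.9"], ["J18.9"])
```

Under exact-match scoring both predictions above would count as equally wrong, which is the gap the hierarchical metric is meant to expose.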
HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring: This paper presents the first benchmark to systematically evaluate small language models (SLMs, 1–4B parameters) on mobile and wearable health monitoring tasks. It covers zero-shot, few-shot, and instruction fine-tuning paradigms and validates on-device deployment on an iPhone.
Large Language Models as Medical Codes Selectors: A Benchmark Using the International Classification of Primary Care: This work constructs a medical coding benchmark based on an extract-retrieve-select framework, evaluating ICPC-2 code selection capability across 33 LLMs. Results show that 28 models achieve F1 > 0.8, demonstrating that LLMs can effectively automate primary care coding without fine-tuning.
LLM-Assisted Emergency Triage Benchmark: Bridging Hospital-Rich and MCI-Like Field Simulation: This work constructs an open, LLM-assisted emergency triage benchmark based on MIMIC-IV-ED, defining two evaluation scenarios—hospital-rich and mass casualty incident (MCI)-like field simulation—and providing baseline models along with SHAP-based interpretability analysis to promote reproducibility and accessibility in triage prediction research.
Mind the Gap: Aligning Knowledge Bases with User Needs to Enhance Mental Health Retrieval: This paper proposes a knowledge base augmentation framework grounded in "demand gap" analysis. By overlaying real user data (forum posts) onto existing mental health resource repositories to identify content voids, the framework applies targeted augmentation strategies to achieve near-full-corpus RAG retrieval quality with minimal document additions.
PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions: This paper introduces PatientSim — an LLM-based patient simulator grounded in real MIMIC clinical data and a four-dimensional persona framework (personality, language proficiency, medical history recall, and cognitive confusion), generating 37 unique persona combinations. The system is evaluated across 8 LLMs for factual accuracy and persona fidelity, and validated by 4 clinical experts with a mean quality score of 3.89/4.
Position: Thematic Analysis of Unstructured Clinical Transcripts with Large Language Models: This position paper systematically reviews the current state of LLM-assisted thematic analysis (TA) on unstructured clinical transcripts, identifies highly fragmented evaluation practices across the literature, and proposes a standardized evaluation framework centered on three dimensions: Validity, Reliability, and Interpretability.
RAxSS: Retrieval-Augmented Sparse Sampling for Explainable Variable-Length Medical Time Series Classification: This paper proposes RAxSS, a framework that integrates retrieval augmentation into the stochastic sparse sampling (SSS) pipeline. By replacing uniform averaging with intra-window similarity-weighted aggregation, RAxSS maintains competitive performance on variable-length medical time series classification while providing an interpretable evidence chain spanning from "where" to "why."
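The contrast between uniform averaging and similarity-weighted aggregation of per-window predictions can be sketched as below. The summary does not say how similarities are computed or normalized, so the softmax weighting, the `temperature` parameter, and both function names are assumptions for illustration only.

```python
import math

def uniform_aggregate(window_probs):
    """Baseline: average the per-window class probabilities with equal weight."""
    n, k = len(window_probs), len(window_probs[0])
    return [sum(p[j] for p in window_probs) / n for j in range(k)]

def weighted_aggregate(window_probs, similarities, temperature=1.0):
    """Weight each window by a softmax over its similarity score (assumed form).

    Returns the aggregated probabilities and the per-window weights; the
    weights indicate which windows drove the final prediction.
    """
    top = max(similarities)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in similarities]
    total = sum(exps)
    weights = [e / total for e in exps]
    k = len(window_probs[0])
    agg = [sum(w * p[j] for w, p in zip(weights, window_probs)) for j in range(k)]
    return agg, weights

# Two sampled windows with class probabilities [class 0, class 1].
probs = [[0.9, 0.1], [0.2, 0.8]]
baseline = uniform_aggregate(probs)
agg, weights = weighted_aggregate(probs, similarities=[2.0, 0.0])
```

With equal similarity scores the weighted version reduces to the uniform average, so the baseline is a special case of this sketch.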
Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs: This paper proposes the MedQA-Followup framework to systematically evaluate the multi-turn robustness of medical LLMs. It reveals that models exhibit acceptable performance under single-turn perturbations (shallow robustness), yet accuracy can catastrophically drop from 91.2% to 13.5% under multi-turn follow-up challenges (deep vulnerability). Notably, indirect contextual manipulation proves more destructive than direct incorrect suggestions.
The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum: This paper proposes a framework for studying world model formation in human neural organoids, comprising three progressively complex virtual environments (conditioned avoidance, predator–prey, Pong) and a meta-learning approach in which an LLM automatically generates experimental protocols, complemented by a multi-scale biophysical evaluation strategy to quantify the physical basis of biological learning.
Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series: This work constructs Time-IMM — the first multimodal multivariate time series benchmark that categorizes irregularity according to causal mechanisms (9 irregularity types organized into three classes: Trigger, Constraint, and Artifact, spanning 9 datasets). An accompanying forecasting library, IMM-TSF, supports asynchronous multimodal fusion. Experiments demonstrate that explicitly modeling multimodal information reduces MSE by 6.71% on average across irregular time series settings, with a maximum improvement of 38.38%.