
🩺 Medical LLM

🔬 ICLR2026 · 20 paper notes

📌 Same area in other venues: 📷 CVPR2026 (1) · 💬 ACL2026 (47) · 🧪 ICML2026 (4) · 🤖 AAAI2026 (12) · 🧠 NeurIPS2025 (17) · 🧪 ICML2025 (4)

🔥 Top topics: Medical Imaging ×6 · Dialogue ×3 · LLM ×3 · Question Answering ×3 · Reasoning ×3

ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue: This paper proposes the ATPO (Adaptive Tree Policy Optimization) algorithm, which models multi-turn medical dialogues as a Hierarchical Markov Decision Process (H-MDP). It dynamically allocates rollout budgets through an uncertainty-aware adaptive tree expansion mechanism, guiding exploration via a composite uncertainty measure of Bellman error and action-value variance. Using Qwen3-8B, it outperforms GPT-4o on three medical dialogue benchmarks.
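The summary above does not give ATPO's actual formulas, so the following is only a minimal sketch of the general idea of spending more rollouts where a composite uncertainty score is high. The `Node` fields, the mixing weight `alpha`, the discount `gamma`, and the proportional allocation rule are all illustrative assumptions, not the paper's method.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import List


@dataclass
class Node:
    """Hypothetical dialogue-tree node; field names are illustrative only."""
    name: str
    q_samples: List[float]  # action-value estimates from sampled continuations
    reward: float           # immediate reward r
    value: float            # current estimate V(s)
    next_value: float       # estimate V(s') of the successor state


def composite_uncertainty(node: Node, gamma: float = 0.99, alpha: float = 0.5) -> float:
    """Assumed score: weighted sum of |Bellman error| and action-value variance."""
    bellman_error = abs(node.reward + gamma * node.next_value - node.value)
    q_variance = pvariance(node.q_samples)
    return alpha * bellman_error + (1 - alpha) * q_variance


def allocate_budget(nodes: List[Node], total_budget: int) -> List[int]:
    """Split an integer rollout budget across nodes in proportion to uncertainty."""
    scores = [composite_uncertainty(n) for n in nodes]
    total = sum(scores)
    if total == 0:
        # No uncertainty signal anywhere: fall back to a uniform split.
        shares = [total_budget / len(nodes)] * len(nodes)
    else:
        shares = [total_budget * s / total for s in scores]
    budget = [int(s) for s in shares]  # floor each share
    # Hand out the rollouts lost to flooring, largest fractional part first.
    leftover = total_budget - sum(budget)
    order = sorted(range(len(nodes)), key=lambda i: shares[i] - budget[i], reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget


# Made-up numbers: one node with stable estimates, one with conflicting ones.
nodes = [
    Node("confident", q_samples=[0.8, 0.8, 0.8], reward=0.0, value=0.8, next_value=0.8),
    Node("uncertain", q_samples=[0.1, 0.9, 0.5], reward=0.0, value=0.2, next_value=0.7),
]
budget = allocate_budget(nodes, total_budget=16)
print(dict(zip((n.name for n in nodes), budget)))
```

In this toy setup the node with conflicting value estimates receives nearly the whole budget, which is the qualitative behavior the summary describes.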
Can Large Language Models Match the Conclusions of Systematic Reviews?: The authors construct the MedEvidence benchmark by rewriting conclusions from 100 Cochrane Systematic Reviews (SRs) into 284 closed-ended questions paired with their source studies. This tests whether LLMs can reproduce expert conclusions when given the same material as the reviewers. An evaluation of 25 LLMs shows that reasoning models are not necessarily better, gains diminish as model size grows, and medical fine-tuning often decreases performance. Models generally lack "scientific skepticism" toward low-quality evidence and fail to match expert conclusions in at least 37% of cases.
Can SAEs Reveal and Mitigate Racial Biases of LLMs in Healthcare?: This paper investigates whether Sparse Autoencoders (SAEs) can reveal and mitigate racial biases in LLMs within healthcare contexts. It finds that SAEs can identify harmful racial associations (e.g., Black patients with violence), but the effectiveness of mitigating bias in complex clinical tasks is limited (FLDD < 3%), significantly underperforming simple prompting strategies (FLDD 8-15%).
Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions: This paper constructs Cancer-Myth—an adversarial dataset verified by hemato-oncologists containing 585 oncology patient questions with false presuppositions. The study finds that leading LLMs, including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet, achieve a success rate of no more than 43% in correcting these false presuppositions. Furthermore, mitigation techniques such as defensive prompting trigger significant over-corrections on "no-false-presupposition" questions and degrade performance on other medical benchmarks, highlighting a critical safety gap in medical LLM patient communication.
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering: The authors collaborated with 100 licensed mental health professionals to construct CounselBench, a dual-component benchmark for open-ended mental health QA. It includes 2,000 expert evaluations with dimension-level scoring and span annotations (CounselBench-Eval), and 120 clinician-authored adversarial prompts designed to induce specific failure modes (CounselBench-Adv). The study reveals that LLMs currently exhibit "high scores alongside persistent safety hazards" in counseling scenarios and demonstrates that LLM-as-Judge is unreliable in this high-risk domain.
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of LLMs in Mental Health QA: In collaboration with 100 licensed mental health experts, CounselBench was constructed as a dual-component benchmark—CounselBench-EVAL (2,000 expert evaluations across six dimensions) and CounselBench-Adv (120 adversarial questions with 1,080 response annotations). The study systematically reveals that while LLMs achieve high superficial scores in open-ended mental health QA, they harbor safety hazards such as overgeneralization and unauthorized medical advice, while also proving that LLM-as-Judge is severely unreliable in safety-critical domains.
Critic-Adviser-Reviser Cyclic Refinement: Towards High-Quality EMR Corpus Generation with LLMs: LLMs that directly generate Electronic Medical Records (EMRs) "only imitate, suffer from distribution distortion, and lack quality constraints." To address this, the paper proposes LLM-CARe, a framework that works at three levels of granularity ("corpus → section → document") and refines each level with a Critic/Adviser/Reviser agent cycle. Without accessing any real EMR text, it pushes synthetic record quality and downstream clinical task performance significantly beyond the previous SOTA.
Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning: Doctor-R1 models outpatient inquiry as a partially observable multi-turn decision-making process. By utilizing a "multi-agent interaction environment + two-level reward architecture + experience memory" for experiential agentic reinforcement learning, an 8B doctor agent learns to ask questions strategically and empathetically while maintaining diagnostic accuracy. It outperforms 32B open-source models and closed-source models like GPT-4.1 on HealthBench and MAQuE.
From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents: This paper proposes the EHR-ChatQA benchmark to evaluate the end-to-end interaction workflow of database agents in EHR scenarios (clarifying vague queries → resolving term mismatches → generating SQL → returning answers). Findings reveal that while the strongest model (o4-mini) achieves over 90% Pass@5, its Pass^5 (all five attempts successful) drops significantly (gap up to 60%), exposing robustness defects in safety-critical domains.
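The difference between the two metrics in this entry can be made concrete with a small sketch. It assumes Pass@5 counts a task as solved when at least one of five attempts succeeds, while Pass^5 requires all five to succeed, as the entry's "all successful" gloss suggests. The task names and outcomes below are invented for illustration and are not results from the benchmark.

```python
from typing import Dict, List


def pass_at_k(trials: List[bool]) -> bool:
    """Solved if at least one of the k attempts succeeds."""
    return any(trials)


def pass_hat_k(trials: List[bool]) -> bool:
    """Solved only if every one of the k attempts succeeds."""
    return all(trials)


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Average both metrics over tasks and report the gap between them."""
    n = len(results)
    p_any = sum(pass_at_k(t) for t in results.values()) / n
    p_all = sum(pass_hat_k(t) for t in results.values()) / n
    return {"pass@k": p_any, "pass^k": p_all, "gap": p_any - p_all}


# Hypothetical per-task outcomes over k = 5 attempts each.
results = {
    "task_a": [True, True, True, True, True],      # consistently solved
    "task_b": [True, False, True, True, False],    # solved sometimes
    "task_c": [False, False, True, False, False],  # solved once
    "task_d": [False, False, False, False, False], # never solved
}
summary = summarize(results)
print(summary)
```

An agent that succeeds only occasionally still scores under the "at least once" metric but fails the "every time" metric, so a large gap between the two signals inconsistent behavior across repeated runs.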
From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity: This paper proposes a two-stage pipeline of "social media posts → structured electronic medical records (EMR) → multi-agent diagnostic dialogues." By adapting the SCID-5 clinical interview protocol into a Hierarchical Diagnostic State Machine (HDSM) and a Diagnostic Context Tree (DCT), the authors construct PsyCoTalk, the first large-scale psychiatric comorbidity diagnostic dialogue dataset (3,000 multi-turn dialogues), validated by practicing psychiatrists for clinical authenticity.
GALAX: Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine: GALAX treats a pre-trained GNN as a "process judge," using reinforcement learning to guide an LLM in incrementally constructing disease-related subgraphs. This enables explainable, patient-specific cancer target prediction without the need for step-by-step annotations.
HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction: This paper proposes HistoPrism, an efficient Transformer architecture that predicts pan-cancer gene expression from H&E histology images by injecting cancer type conditions via cross-attention. It introduces the Gene Pathway Coherence (GPC) evaluation framework based on Hallmark/GO pathways, significantly outperforming STPath in pathway-level prediction, especially for core biological pathways with low variance.
KnowGuard: Knowledge-Driven Abstention for Multi-Round Clinical Reasoning: Addressing the overconfidence issue where LLMs provide diagnoses despite incomplete information in multi-round clinical consultations, KnowGuard proposes an "investigate-before-abstain" paradigm. This approach shifts abstention decisions from model self-assessment to systematic cross-round evidence exploration over a medical knowledge graph. By using a rolling-updated contextual evidence pool to identify "missing evidence," the model decides whether to continue questioning or provide a diagnosis. On a self-constructed open-ended multi-round benchmark, KnowGuard improved average diagnostic accuracy by 3.93% and converged in an average of only 5.74 rounds.
Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine: This paper proposes LEON (LLM-based Entropy-guided Optimization with kNowledgeable priors), a mathematically rigorous method that models personalized medical treatment design as a conditional black-box optimization problem. It guides an LLM to serve as a zero-shot optimizer for personalized treatment plans without fine-tuning, utilizing entropy constraints and an adversarial source critic model.
mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules: The paper proposes mCLM (Modular Chemical Language Model), which represents molecules as sequences of synthesizable building blocks. This allows LLMs to generate molecules that satisfy both pharmacological functions and automated synthesis feasibility, showing significant improvements in pharmacokinetic and toxicological properties across 430 FDA-approved drugs.
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science: The authors constructed MedAgentGym, the first unified agent training environment for biomedical data science. It comprises 72,413 task instances covering 12 real-world scenarios across 129 categories, equipped with executable sandboxes and verifiable ground truth. A systematic benchmark evaluation of 29 LLMs revealed a gap between commercial and open-source models. Using efficient multi-threaded trajectory sampling and offline/online RL, they trained Med-Copilot, achieving +43.02%/+45.28% improvements and reaching performance competitive with GPT-4o.
MedAraBench: Large-scale Arabic Medical Question Answering Dataset and Benchmark: The authors manually digitized and cleaned paper-based exam questions from medical schools in the Arabic region, producing 24,883 medical Multiple-Choice Questions (MCQs) annotated with medical specialty and difficulty. The resulting large-scale Arabic medical QA benchmark, MedAraBench, was quality-checked twice, by expert review and by LLM-as-a-judge, and then used to evaluate 16 open-source and closed-source LLMs in a zero-shot setting. Even the strongest model, GPT-o3, achieves an accuracy of only 0.765, exposing significant weaknesses in current models' Arabic medical reasoning.
Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis: This paper proposes Resp-Agent, a closed-loop multi-agent framework that coordinates a controllable respiratory sound generator and a multimodal diagnoser via an active adversarial curriculum planner (Thinker-A2CA). It achieves generation↔diagnosis co-design on a 229k-scale benchmark, significantly improving diagnostic performance for long-tail categories.
SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs: SimpleToM reveals a critical deficiency in LLM Theory of Mind: while frontier models accurately infer others' mental states (Explicit ToM), their performance drops sharply when applying this knowledge to predict or judge behaviors (Applied ToM), exposing a significant gap between "knowing what" and "how to use what is known."
SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis: This paper proposes SurvHTE-Bench, the first comprehensive benchmark for Heterogeneous Treatment Effect (HTE) estimation on right-censored survival data. It includes 40 synthetic datasets, 10 semi-synthetic datasets, and 2 real datasets to systematically evaluate 53 estimation methods under different causal assumption violations and censoring levels. The study finds that no single method dominates, but survival meta-learners (especially S-Learner-Survival and Matching-Survival) are the most robust under high censoring and assumption violations.