
🏥 Medical Imaging

🧪 ICML2025 · 21 paper notes

📌 Same area in other venues: 📷 CVPR2026 (172) · 🔬 ICLR2026 (88) · 🧪 ICML2026 (28) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (77) · 📹 ICCV2025 (31)

🔥 Top topics: Medical Imaging ×6 · Segmentation ×2

Bayesian Inference for Correlated Human Experts and Classifiers: A general Bayesian framework is proposed to model the joint labeling behavior between correlated human experts and classifiers. It captures correlations among experts using latent representations and evaluates the utility of additional queries via simulation-based inference, significantly reducing the number of expert queries in medical classification and image annotation while maintaining predictive accuracy.
Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners (D-BETA): D-BETA proposes a contrastive learning framework that integrates generative masked autoencoders with enhanced discriminative capabilities. Through the ECG-Text Sigmoid (ETS) loss and Nearest Neighbor Negative Sampling (N3S) strategy, it significantly outperforms existing methods in cross-modal ECG-text representation learning, achieving a 15% average AUC improvement in linear probing with only 1% of the training data, and a 2% improvement in zero-shot performance.
Certification for Differentially Private Prediction in Gradient-Based Training: An Abstract Gradient Training (AGT) framework is proposed to compute upper bounds on the reachable set of model parameters during training using convex relaxation and bound propagation techniques. These bounds are used with the smooth sensitivity mechanism to significantly tighten the privacy analysis of private prediction, achieving privacy bounds several orders of magnitude tighter than global sensitivity on medical imaging and NLP tasks.
Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images: This paper proposes the Querent framework, which achieves efficient long-range context modeling in gigapixel Whole Slide Images (WSIs) through query-aware dynamic region importance evaluation. It theoretically achieves a bounded approximation of full self-attention and outperforms state-of-the-art (SOTA) methods in biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across 10+ WSI datasets.
Do Multiple Instance Learning Models Transfer?: This work presents the first systematic evaluation of the transfer learning capabilities of MIL models in computational pathology, finding that MIL models pre-trained on a pancancer dataset can generalize across organs and tasks, outperforming self-supervised slide foundation models (CHIEF, GigaPath) using less than 10% of the pre-training data.
EEG-Language Pretraining for Highly Label-Efficient Clinical Phenotyping: This paper pioneers the EEG-Language Model (ELM). Trained on 15,000 EEG recordings and clinical reports, ELM integrates time-series cropping, text segmentation, and multi-instance learning strategies. It achieves zero-shot EEG classification and cross-modal retrieval for the first time, significantly outperforming EEG-only self-supervised methods in low-label scenarios.
Efficient Noise Calculation in Deep Learning-based MRI Reconstructions: An efficient method based on Jacobian Sketching is proposed. By probing the Jacobian diagonal elements of DL reconstruction networks via random phase vectors, it accelerates the computation of voxel-level noise variance in MRI reconstruction using an unbiased estimator. The computation and memory requirements are reduced by more than an order of magnitude while maintaining a 99.8% correlation coefficient with the Monte Carlo reference.
Enhancing Statistical Validity and Power in Hybrid Controlled Trials: A Randomization Inference Approach with Conformal Selective Borrowing: An inference framework for hybrid controlled trials is proposed based on the Fisher Randomization Test (FRT) and Conformal Selective Borrowing (CSB). It achieves finite-sample exact Type I error rate control and model-free statistical inference, minimizing MSE through adaptive thresholding to enhance statistical power while maintaining strict Type I error control.
From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining: MELP proposes a multi-scale ECG-language pretraining model. By utilizing cross-modal supervisory signals at three levels (Token, Beat, and Rhythm) combined with domain-specific cardiology language model pretraining, it comprehensively outperforms existing self-supervised and multimodal ECG methods in zero-shot classification, linear probing, and transfer learning.
I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts: I2MoE proposes an interpretable multimodal interaction-aware mixture-of-experts framework. By incorporating four interaction experts (uniqueness ×2 + synergy + redundancy) combined with a weakly supervised interaction loss, it explicitly models heterogeneous interactions between modalities. Furthermore, it provides sample-level and dataset-level interpretability through a reweighting model, improving accuracy on the ADNI dataset by 5.5%.
iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection: This work proposes the iDPA framework, which implements Incremental Medical Object Detection (IMOD) on a frozen vision-language object detection model through two key modules: Instance-level Prompt Generation (IPG) and Decoupled Prompt Attention (DPA). By training only 1.4% of the parameters, it substantially outperforms SOTA methods across 13 cross-modal medical datasets.
Implementing Adaptations for Vision AutoRegressive Model: This paper presents the first systematic implementation and evaluation of various adaptation methods (FFT/LoRA/LNTuning) and differential privacy (DP) adaptation for the Vision AutoRegressive (VAR) model. It finds that VAR significantly outperforms diffusion model adaptation (DiffFit) in non-DP scenarios with faster convergence and higher computational efficiency, but its DP adaptation performance remains poor, revealing an important research gap in the field of privacy-preserving image generation.
LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation: LangDAug uses an energy-based model (EBM) trained via contrastive divergence to generate intermediate samples by traversing between source domains with Langevin dynamics, thereby achieving multi-source domain generalization for medical image segmentation. The authors also prove theoretically that the augmentation induces a regularization effect and bound the Rademacher complexity.
Mastering Multiple-Expert Routing: Realizable H-Consistency and Strong Guarantees: This paper proposes new surrogate loss functions and efficient algorithms for the multiple-expert routing (learning to defer) problem, establishing theoretical guarantees for realizable H-consistency, H-consistency bounds, and Bayes consistency across both single-stage and two-stage learning scenarios.
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding: MedXpertQA constructs an expert-level medical QA benchmark comprising 4,460 questions across 17 specialties and 11 organ systems. It applies rigorous filtering and augmentation together with data synthesis to prevent leakage, evaluates 18 mainstream models, and introduces a reasoning subset designed specifically for assessing o1-like reasoning models.
Neural Stochastic Differential Equations on Compact State Spaces: Theory, Methods and Applications: This paper proposes a Neural SDE parameterization method (WSP) based on stochastic viability theory, ensuring that SDE trajectories are provably constrained within compact polytopic spaces with continuous dynamics and a strong inductive bias, overcoming the limitations of chain-rule methods and reflected SDEs.
Raptor: Scalable Train-Free Embeddings for 3D Medical Volumes Leveraging Pretrained 2D Foundation Models: This paper proposes Raptor (Random Planar Tensor Reduction), a completely train-free method that uses a frozen 2D foundation model (DINOv2-L) to extract visual tokens from 3D medical volumes along three orthogonal axes and then substantially compresses the dimensions via random projection, outperforming all SOTA methods requiring large-scale pre-training across 10 medical tasks.
SGD Jittering: A Training Strategy for Robust and Accurate Model-Based Architectures: This paper proposes SGD jittering, a training strategy that progressively injects zero-mean Gaussian noise during the iterative reconstruction process of the model. Theoretical analysis demonstrates that it simultaneously enhances model robustness and generalization accuracy without the high computational overhead of adversarial training.
The Brain's Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning: By developing neuroscience-inspired self-supervised pretext tasks and a heterogeneous brain signal processing architecture, this work scales MEG speech decoding to approximately 400 hours and 900 subjects, outperforming the SOTA by 15-27%. It matches surgical-grade decoding performance using non-invasive data for the first time, demonstrating robust generalization across datasets, subjects, and tasks.
The Disparate Benefits of Deep Ensembles: Through large-scale empirical studies on facial analysis and medical imaging datasets, this paper reveals a neglected phenomenon, the "disparate benefits effect": while Deep Ensembles improve overall performance, they benefit protected groups unequally (often favoring already advantaged groups), thereby undermining group fairness. The authors attribute this effect to disparities in predictive diversity across groups and demonstrate that the classic Hardt Post-Processing (HPP) can effectively repair fairness while preserving performance gains.
The Four Color Theorem for Cell Instance Segmentation: This paper introduces the four-color theorem to cell instance segmentation, treating each cell as a "country" and the background as the "ocean", thereby replacing instance segmentation with a constrained 4-class semantic segmentation task. It designs a progressive training strategy and an encoding transformation method to resolve the non-uniqueness of four-color coding, achieving SOTA performance across diverse imaging modalities while significantly reducing model complexity.
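
The idea behind the "Four Color Theorem for Cell Instance Segmentation" entry, recoding an instance map so that touching cells never share a class, can be illustrated with a small NumPy sketch. This is not the paper's encoding or training strategy: it assumes a 2D integer label map with 0 as background, uses 4-connectivity to decide which cells touch, and applies a plain greedy graph coloring, which keeps neighbors distinct but does not by itself guarantee that four colors suffice. The helper names (`adjacency_from_labels`, `greedy_coloring`, `encode`) are hypothetical.

```python
import numpy as np

def adjacency_from_labels(labels):
    """Map each instance id to the set of instance ids it touches (0 = background)."""
    adj = {int(i): set() for i in np.unique(labels) if i != 0}
    # Compare every pixel with its right neighbor, then with its bottom neighbor.
    pairs = ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :]))
    for a, b in pairs:
        touching = (a != b) & (a != 0) & (b != 0)
        for u, v in zip(a[touching].tolist(), b[touching].tolist()):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def greedy_coloring(adj):
    """Give each instance the smallest color (1, 2, ...) not used by a colored neighbor."""
    colors = {}
    # Visit high-degree instances first; a common heuristic, not a 4-color guarantee.
    for node in sorted(adj, key=lambda n: -len(adj[n])):
        used = {colors[n] for n in adj[node] if n in colors}
        c = 1
        while c in used:
            c += 1
        colors[node] = c
    return colors

def encode(labels):
    """Turn an instance map into a color-class map; background stays 0."""
    colors = greedy_coloring(adjacency_from_labels(labels))
    lut = np.zeros(int(labels.max()) + 1, dtype=np.int64)
    for inst, c in colors.items():
        lut[inst] = c
    return lut[labels], colors

# Toy instance map: four cells, with a strip of background between cells 3 and 4.
labels = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 0, 4],
    [3, 3, 0, 4],
])
color_map, colors = encode(labels)
print(color_map)
```

A segmentation network trained on `color_map` would predict a handful of semantic classes instead of an unbounded number of instance ids; individual cells can then be recovered as connected components within each color class.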
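
The "Efficient Noise Calculation in Deep Learning-based MRI Reconstructions" entry mentions probing diagonal elements with random phase vectors and an unbiased estimator. The sketch below shows a generic randomized diagonal estimator of that flavor, not the paper's algorithm. It assumes a linear map with an explicit matrix `J` standing in for a network Jacobian, white input noise of variance `sigma2`, and access only to matrix-vector products; all names are hypothetical. For probes `v` with independent unit-modulus entries, the expectation of `conj(v) * (A v)` equals the diagonal of `A`, because the cross terms average to zero.

```python
import numpy as np

def random_phase_probes(n, num_probes, rng):
    """Return an (n, num_probes) array of unit-modulus entries exp(i * theta)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=(n, num_probes))
    return np.exp(1j * theta)

def estimate_diagonal(matvec, n, num_probes, rng):
    """Estimate diag(A) from products A @ V only, averaging conj(V) * (A V)."""
    V = random_phase_probes(n, num_probes, rng)
    return np.mean(np.conj(V) * matvec(V), axis=1)

rng = np.random.default_rng(0)
n = 32
# Stand-in for a reconstruction Jacobian (assumption: small dense complex matrix).
J = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(n)
sigma2 = 0.5  # assumed variance of white input noise

def cov_matvec(V):
    """Apply the output noise covariance sigma2 * J J^H without forming it."""
    return sigma2 * (J @ (J.conj().T @ V))

# Exact per-element output variance, available here only because J is explicit.
exact = sigma2 * np.sum(np.abs(J) ** 2, axis=1)
est = estimate_diagonal(cov_matvec, n, 4000, rng).real

print("max abs error:", np.max(np.abs(est - exact)))
print("correlation:", np.corrcoef(est, exact)[0, 1])
```

With a real reconstruction network, `cov_matvec` would be replaced by Jacobian-vector and vector-Jacobian products from an autodiff framework, so the covariance matrix itself never needs to be stored.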