
🏥 Medical Imaging

🧪 ICML2026 · 28 paper notes

📌 Same area in other venues: 📷 CVPR2026 (172) · 🔬 ICLR2026 (88) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (77) · 📹 ICCV2025 (31) · 🧪 ICML2025 (21)

🔥 Top topics: Medical Imaging ×14 · Segmentation ×4 · Reasoning ×4 · Multimodal/VLM ×3 · Alignment/RLHF ×3

Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation?: This paper highlights two types of overconfidence in semi-supervised 3D medical image segmentation: model overconfidence in pseudo-labels and overly optimistic evaluation protocols. It proposes TCSeg, which uses confidence-uncertainty dual-axis reliability and tri-space calibration (probability, feature, and image spaces) to suppress confirmation bias. It also advocates a rigorous evaluation protocol involving multiple random seeds and simultaneous reporting of both best and last checkpoints.
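The evaluation protocol in the entry above (several seeds, best and last checkpoints reported side by side) can be sketched roughly as follows. The function name `summarize_runs` and the score values are hypothetical placeholders for illustration, not results or code from the paper.

```python
from statistics import mean, stdev

def summarize_runs(runs):
    """Summarize per-seed score curves.

    runs: list of lists, one per random seed, each holding one score
    (e.g. Dice) per saved checkpoint in training order.
    Returns (mean, std) across seeds for the best and the last checkpoint.
    """
    best = [max(curve) for curve in runs]   # best checkpoint of each seed
    last = [curve[-1] for curve in runs]    # final checkpoint of each seed
    return {
        "best": (mean(best), stdev(best)),
        "last": (mean(last), stdev(last)),
    }

# Placeholder curves for three seeds (illustrative numbers only).
runs = [
    [0.70, 0.82, 0.79],
    [0.68, 0.80, 0.80],
    [0.71, 0.84, 0.78],
]
summary = summarize_runs(runs)
print(summary)
```

Reporting both rows makes the gap between the best and last checkpoint visible, which is one way an optimistic protocol can be spotted.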
Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions: This paper proposes S(H)NAP—a generative interventional framework based on 3D diffusion bridges for "removal + insertion." It decomposes the decisions of Sybil, a leading lung cancer risk prediction model, into a Linear + Second-order Interaction Model (LMPI) consisting of "nodule main effects + pairwise interactions + background." For the first time, it audits the model's dependence on in-hospital artifacts (e.g., ECG electrodes, metal buttons) and identifies a severe "radial insensitivity" failure mode for peripheral nodules through causal rather than correlative methods.
CASCADE Conformal Prediction: Uncertainty-Adaptive Prediction Intervals for Two-Stage Clinical Decision Support: The CASCADE framework is proposed to propagate epistemic uncertainty from a first-stage classifier (quantified via Venn-Abers predictors) into second-stage regression prediction intervals. This enables a 38.9% reduction in interval width for high-confidence patients while automatically expanding safety buffers for uncertain cases, achieving adaptive coverage guarantees.
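To illustrate the general idea of uncertainty-adaptive intervals in the entry above, here is a minimal sketch using normalized split conformal prediction. This is a generic technique swapped in for illustration, not the CASCADE algorithm itself. The uncertainty value `u` is assumed to be something like the width of a Venn-Abers probability interval, and all function names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def calibrate(y_true, y_pred, uncertainty, alpha=0.1):
    """Compute the quantile of residuals normalized by (1 + u)."""
    scores = [abs(y - p) / (1.0 + u)
              for y, p, u in zip(y_true, y_pred, uncertainty)]
    return conformal_quantile(scores, alpha)

def predict_interval(y_pred, u, q):
    """Interval whose half-width grows with the first-stage uncertainty u."""
    half = q * (1.0 + u)
    return y_pred - half, y_pred + half

# Toy calibration set: 19 points with residuals 0.1, 0.2, ..., 1.9 and u = 0.
y_true = [(i + 1) * 0.1 for i in range(19)]
y_pred = [0.0] * 19
q = calibrate(y_true, y_pred, [0.0] * 19, alpha=0.1)

confident = predict_interval(5.0, 0.0, q)  # low first-stage uncertainty
uncertain = predict_interval(5.0, 1.0, q)  # high first-stage uncertainty
```

With this scaling, confident cases receive narrower intervals and uncertain cases wider ones, while the calibration quantile is shared across both.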
DGNO: Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring: DGNO reformulates defocus deblurring of pathological microscopy images as an inverse problem of "spatially varying integral operators." Using a Discontinuous Galerkin (DG) style, it decomposes the global kernel into element-local integral operators and interface numerical fluxes. This preserves the physical interpretability of neural operators while effectively handling the inherently local discontinuous blur in pathological images, surpassing SOTAs such as NAFNet, Restormer, and MambaIRv2 on datasets like BBBC006w1.
DIYHealth Suite: Dataset, Model, and Benchmark for Health Management at Home: Addressing the "Diagnosis-It-Yourself" scenario—a field overlooked by existing medical LLMs—this work delivers an integrated suite comprising a dataset (DIYHealth-900K, 900,000 multimodal home health QAs), a model (DIYHealthGPT, centered on the newly proposed H2LoRA parameter-efficient fine-tuning mechanism), and a benchmark (DIYHealthBench, the first evaluation covering 11 home health tasks). The suite achieves SOTA performance across both general and medical-specific baselines.
DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning: This paper proposes DP-KFC: based on the observation that "the scaling of the Fisher matrix is determined by the architecture, and the correlation structure can be approximated by modality-level spectral statistics," it reconstructs KFAC preconditioners by probing the network with structured synthetic noise (\(1/f^\alpha\) pink noise for images, Zipf sampling for text). This approach neither consumes the privacy budget nor introduces distribution shifts, consistently outperforming DP-SGD and public data preconditioning methods under strong privacy (\(\varepsilon \le 3\)).
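As a rough illustration of the \(1/f^\alpha\) probing noise mentioned in the entry above, the sketch below builds a 1-D signal by summing sinusoids whose power falls off as \(1/f^\alpha\). It is a toy under stated assumptions: the paper works with images, and the 1-D construction, the function names, and the default parameters here are hypothetical.

```python
import math
import random

def pink_noise(n, alpha=1.0, seed=0):
    """1-D signal whose power spectrum falls off as 1/f^alpha."""
    rng = random.Random(seed)
    signal = [0.0] * n
    for k in range(1, n // 2):            # positive frequencies below Nyquist
        amp = k ** (-alpha / 2.0)         # power = amp^2, proportional to 1/k^alpha
        phase = rng.uniform(0.0, 2.0 * math.pi)
        for t in range(n):
            signal[t] += amp * math.cos(2.0 * math.pi * k * t / n + phase)
    return signal

def power_at(signal, k):
    """Squared magnitude of DFT bin k, computed directly."""
    n = len(signal)
    re = sum(x * math.cos(2.0 * math.pi * k * t / n) for t, x in enumerate(signal))
    im = sum(x * math.sin(2.0 * math.pi * k * t / n) for t, x in enumerate(signal))
    return re * re + im * im

x = pink_noise(64, alpha=1.0)
ratio = power_at(x, 1) / power_at(x, 4)   # expected to be close to 4**alpha
```

Because the noise is generated from a formula rather than from data, producing it does not touch any private or public dataset.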
EEG-Based Multimodal Learning via Hyperbolic Mixture-of-Curvature Experts: EEG-MoCE assigns a Lorentz manifold expert with learnable curvature to each modality in EEG-based multimodal learning (emotion/sleep/cognition). It utilizes curvature-aware attention, where "higher curvature signifies richer hierarchical structure and thus higher weight in fusion," to perform cross-modal integration. This approach achieves cross-subject accuracy gains of +14.14%, +3.34%, and +7.98% on the EAV, ISRUC, and Cognitive datasets, respectively.
Evidential Reasoning Advances Interpretable Real-World Disease Screening: EviScreen utilizes "Normal + Pathological" dual knowledge banks for region-level evidence retrieval, followed by cross-attention and self-attention to perform evidential reasoning between the current case and retrieved evidence. This approach provides both retrospective interpretability (identifying which historical cases support the current judgment) and localization interpretability (abnormality maps from contrastive retrieval), achieving SOTA specificity at high recall levels across four real-world external test sets.
Factored Classifier-Free Guidance: This paper identifies the "attribute amplification" failure mode of Classifier-Free Guidance (CFG) in counterfactual generation—where a single global \(\omega\) amplifies attributes that should remain unchanged. The authors propose FCFG: grouping attributes based on a causal graph and assigning independent guidance weights to each group. This approach significantly reduces off-target attribute drift and improves counterfactual reversibility on CelebA-HQ, EMBED, and MIMIC-CXR.
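The contrast in the entry above can be written out as follows. This is a sketch of one plausible form under assumptions, not necessarily the paper's exact formulation: \(\epsilon_\theta\) denotes the noise predictor, \(\varnothing\) the null condition, \(c_g\) the conditioning restricted to attribute group \(g\), and \(G\) the number of groups. Standard CFG, in one common convention, applies a single global weight:

\[
\tilde{\epsilon}_{\text{CFG}} = \epsilon_\theta(x_t, \varnothing) + \omega \bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
\]

A factored variant would instead assign one weight per attribute group:

\[
\tilde{\epsilon}_{\text{FCFG}} = \epsilon_\theta(x_t, \varnothing) + \sum_{g=1}^{G} \omega_g \bigl( \epsilon_\theta(x_t, c_g) - \epsilon_\theta(x_t, \varnothing) \bigr)
\]

Setting a small \(\omega_g\) for groups that should stay fixed and a larger one for the intervened group is what would limit amplification of untouched attributes.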
Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration: This paper proposes FedHD: In heterogeneous federated pathology scenarios, it employs Gaussian-mixture feature alignment for "one-to-one" WSI feature-level distillation. It then progressively injects cross-institutional synthetic features into local training via curriculum learning. This allows institutions to collaborate without sharing raw data or exchanging model parameters. Compatible with heterogeneous MIL architectures and feature extractors, it comprehensively outperforms existing federated and distillation baselines on TCGA-IDH, CAMELYON16, and CAMELYON17.
Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation: This paper demonstrates a counter-intuitive yet practical finding: Foundation VAEs pre-trained on natural images/videos can serve as a unified interface for CT reconstruction, augmentation, and generation without any medical fine-tuning. The reconstruction acts as a boundary-preserving denoiser (improving pancreatic/lung tumor NSD by +3.9%), while its latent space supports conditional CT diffusion generation (FVD −3.9%, CT-CLIP +36.2%, and multi-disease fidelity AUC +2.76%).
OT-Bridge Editor: Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport: OT-Bridge Editor reformulates "editing a vascular stenosis on coronary angiography" as a constrained entropic OT problem in a vessel-structure composite domain. By employing Schrödinger Bridge with path-level geometric projection supervision, it achieves pixel-level shape/position controllable synthetic angiography, resulting in a 27.8% relative gain in downstream stenosis detection [email protected] on the ARCADE public dataset.
Learning Multi-Scale Hypergraph for High-Order Brain Connectivity Analysis: MuHL utilizes graph wavelets with learnable scales to decompose brain ROI features into multi-resolution representations. It dynamically generates soft hyperedges via a "node embedding × shared projection matrix" mechanism, achieving 93.2% Acc for multi-stage AD classification on ADNI and 76.8% Acc for PD classification on PPMI, while providing interpretable key ROIs and hypergraphs.
Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning: This paper proposes DiffDT: a conditional Latent Diffusion framework connecting electronic health records (ICD-coded event sequences) with multi-organ biomarker digital twins (tabular features derived from brain/heart/liver/kidney imaging and brain functional connectivity SPD matrices). The key innovation is an SPD-VQVAE based on Cholesky decomposition that reduces \(\mathcal{O}(N^3)\) SPD manifold diffusion to a manifold-preserving and efficient latent space. An AR model performs multi-pathway disease reasoning via the intermediary path "Generate Digital Twin \(\to\) Predict next ICD." On UKB data, it achieves a next-event prediction AUC of 0.91 for 1,944 disease categories, setting a new SOTA.
MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery: The authors utilize the Chinese Restaurant Process (CRP) for online Bayesian nonparametric clustering of clinical text prompts to automatically discover "semantic modalities." They assign independent LoRA adapters to each semantic modality and implement intra-modality EWC. This approach pushes the Dice coefficient to 73.3% while reducing the forgetting rate to 4.1% across 16 medical segmentation tasks, using only 1/6 of the parameters required by MoE baselines.
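The Chinese Restaurant Process clustering mentioned in the entry above can be illustrated with a toy sketch. It uses a greedy (maximum a posteriori) assignment on 1-D points instead of sampling and instead of text-prompt embeddings, so it is a simplification. The function names and the `alpha`, `sigma`, and `prior_sigma` parameters are hypothetical.

```python
import math

def log_normal(x, mu, sigma):
    """Log density of a 1-D Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def crp_assign(points, alpha=1.0, sigma=1.0, prior_sigma=10.0):
    """Greedy CRP clustering of 1-D points, processed one at a time."""
    clusters = []  # each cluster is the list of its member points
    labels = []
    for x in points:
        scores = []
        for members in clusters:
            mu = sum(members) / len(members)
            # CRP prior (cluster size) times Gaussian likelihood around the cluster mean
            scores.append(math.log(len(members)) + log_normal(x, mu, sigma))
        # New cluster: concentration alpha times a broad base likelihood
        scores.append(math.log(alpha) + log_normal(x, 0.0, prior_sigma))
        k = max(range(len(scores)), key=scores.__getitem__)
        if k == len(clusters):
            clusters.append([x])   # open a new cluster
        else:
            clusters[k].append(x)  # join an existing cluster
        labels.append(k)
    return labels

labels = crp_assign([0.1, -0.2, 8.0, 0.0, 8.3, 7.9])
print(labels)
```

The number of clusters is not fixed in advance. A new one opens whenever a point is better explained by the base distribution than by any existing cluster, which is how new "semantic modalities" could be discovered online.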
MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training: MEG-XL utilizes a 2.5-minute (191k tokens) MEG context for masked token pre-training (5–300× longer than previous methods) and fine-tunes on a 50-word brain-to-text task. With only 1 hour of data, it achieves the decoding accuracy of SOTA supervised methods trained on 50 hours of data, significantly outperforming all existing brain foundation models.
PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder: PaCX-MAE builds upon an MAE pre-trained chest X-ray ViT, using LoRA fine-tuning to treat ECG and laboratory test encoders as frozen teachers. Through dual distillation involving InfoNCE contrastive loss and cosine regression, "invisible physiological context" is injected into the image-only encoder. During inference, the model requires only chest X-rays to outperform the same-architecture MAE baseline across 9 downstream benchmarks, with significant gains on physiology-dependent tasks (MedMod +2.7 AUROC, VinDr +6.5 F1).
Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction: This paper reintroduces ADMM dual variables into the PnP diffusion prior loop, utilizing "duality" to provide integral feedback that eliminates steady-state bias. A frequency-domain Spectral Homogenization module is employed to whiten structured dual residuals into pseudo-AWGN, preventing the triggering of OOD hallucinations in the diffusion denoiser. It achieves SOTA fidelity and approximately 3× inference acceleration in sparse-view/limited-angle CT and accelerated MRI.
PRISM: A 3D Probabilistic Neural Representation for Interpretable Shape Modeling: PRISM bridges implicit neural representations (INRs) with uncertainty-aware statistical shape analysis. It models the mean trajectory and spatially heterogeneous variation of anatomical structures evolving with covariates (e.g., age) using a conditional heteroscedastic Gaussian field. By deriving a closed-form Fisher information metric, it analytically quantifies the local uncertainty of "intrinsic developmental time," supporting shape evolution, personalized prediction, and anomaly detection on both synthetic and clinical pediatric airway data.
Scaling Vision Transformers for Functional MRI with Flat Maps: By projecting 3D fMRI volumes into 2D videos via "cortical flat maps" and feeding them into a standard spacetime MAE-ViT, the authors develop CortexMAE trained on 2.1K hours of HCP data. It significantly outperforms SOTA in cognitive state decoding, validating that the flat map is the "goldilocks zone" between voxel-wise (volume) and region-averaged (parcellation) representations. Simultaneously, the first open-source fMRI foundation model benchmark, Brainmarks, reveals the first systematic scaling laws for fMRI models and an "honest null result" showing that trait prediction still fails to beat simple functional connectivity baselines.
Seizure-Semiology-Suite (S³): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding: This paper constructs S³, the first large-scale expert-annotated seizure video dataset (438 videos, 35,000+ dense labels, 20 ILAE semiological features), accompanied by a seven-level hierarchical task benchmark and a clinically aligned Seizure-RQI report quality metric. It systematically exposes the failure modes of 11 open-source MLLMs in temporal localization, spatial lateralization, and clinical faithfulness, and elevates seizure vs. non-seizure classification F1 to 0.96 through domain fine-tuning and a two-stage neuro-symbolic framework.
SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation: SEMIR treats the voxel grid as a parent graph \(G\) and compresses it into a "boundary-aligned" graph minor \(H\) via parameterized edge contraction, node deletion, and edge deletion (reducing nodes from \(\sim10^7\) to \(\sim10^3\)). It utilizes 5–20 few-shot samples to maximize boundary Dice via black-box optimization of \(\Theta\), uses a GNN for supernode classification on the minor, and finally returns to the original grid through a bijective exact lifting. It consistently outperforms nnU-Net on minority class Dice for BraTS, KiTS, and LiTS tumor segmentation tasks while requiring only a 16GB T4 GPU.
Shift-Dependent Asymmetry: Orthogonal Inverse Low-Rank Adaptation for Federated Medical Segmentation: Addressing the issue of data heterogeneity across clients when using Federated LoRA to fine-tune large medical segmentation models, this paper discovers that encoders and decoders face fundamentally different sources of heterogeneity (encoders are dominated by appearance/acquisition shifts, while decoders are dominated by annotation/concept shifts). Consequently, it proposes IAT to inversely allocate shared/local LoRA factors across these two modules and utilizes SOR subspace orthogonal regularization to block the leakage of "local updates into shared directions" caused by bilinear parameterization. This approach consistently outperforms strong Federated LoRA baselines on histopathology and fundus medical segmentation tasks.
SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment: SynerMedGen proposes the "generation-aligned understanding" principle—deriving understanding tasks directly from the same paired synthetic data (via CTS, MI, and TIA tasks). By employing a two-stage training process, the understanding branch first learns representations beneficial for synthesis before transitioning to the latent flow matching generation branch. This approach outperforms both specialized synthesis models and existing unified MLLMs across 22 medical synthesis tasks.
CAME-Grad: The Double Dilemma in Multi-Task Radiology Report Generation — A Gradient Dynamics Analysis and Solution: This paper utilizes an SDE framework to analyze the dual nature of gradient conflicts between "report generation vs. clinical constraints" in Radiology Report Generation (RRG) — drift term deviation from Pareto optimality and diffusion term decay failing to escape local optima. The authors propose the CAME-Grad optimizer (Direction Rectification + Energy Injection + Adaptive Fusion) as a plug-and-play alternative to linear scaling, achieving average gains of +2.3% and +1.9% in clinical efficacy across 8 RRG methods on MIMIC-CXR and IU X-Ray.
PathCTM: Thinking in Scales — Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning: PathCTM reframes Whole Slide Image (WSI) analysis from "exhaustive high-magnification patching" to "low-magnification global to high-magnification local" continuous multi-scale reasoning. Based on the Continuous Thought Machine, it introduces a "thinking-in-scales" paradigm combined with attention-guided region pruning and confidence-aware early stopping, reducing patch counts by 95.95% and inference time by 95.62% while maintaining or even improving AUC.
Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments: This paper reinterprets the reasoning "drift" among multiple MLLMs as negative constraints in DPO. By utilizing a Plackett-Luce preference loss to simultaneously suppress divergent trajectories from \(N\) source models, a 7B student model outperforms all source teachers in chest X-ray classification and report generation tasks using only 10% of MIMIC-CXR without requiring ground-truth reports.
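The Plackett-Luce preference loss named in the entry above can be sketched as the negative log-likelihood of a ranking. How the scores are obtained is an assumption here: in a DPO-style setup they would typically be scaled log-probability ratios between policy and reference model. The function name is hypothetical.

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores are listed from most preferred to least preferred. At each
    position, the item is chosen by a softmax over the items not yet ranked.
    """
    nll = 0.0
    for i in range(len(scores) - 1):
        tail = scores[i:]
        m = max(tail)
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))  # stable log-sum-exp
        nll -= scores[i] - log_z
    return nll

pairwise = plackett_luce_nll([1.0, 0.0])           # two items: Bradley-Terry case
agrees = plackett_luce_nll([2.0, 1.0, 0.0])        # scores agree with the ranking
disagrees = plackett_luce_nll([0.0, 1.0, 2.0])     # scores contradict the ranking
```

With two items the loss reduces to the pairwise logistic (Bradley-Terry) form used in standard DPO. With more items, several dispreferred trajectories are ranked below the preferred one within a single term.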
Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction: This paper constructs a "low-label + constrained compute" anatomy-aware benchmark using the public ACDC cardiac MRI dataset. By performing 5-class cardiac pathology classification using patient-level shape descriptors derived from segmentation masks, it systematically demonstrates that when labels are scarce, choosing the right anatomical representation is more important than increasing model complexity—specifically, the myocardium (MYO) provides the strongest signal among single structures, while multi-structure combinations achieve the best overall performance.