
🏥 Medical Imaging

🧠 NeurIPS2025 · 74 paper notes


🔥 Top topics: Medical Imaging ×30 · Multimodal/VLM ×8 · Segmentation ×8 · Self-Supervised Learning ×4 · Diffusion Models ×4

3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks: This paper introduces 3D-RAD — the first large-scale 3D medical VQA benchmark, comprising 170K CT-based question-answer pairs across six clinical task categories (including a novel multi-temporal diagnosis task), accompanied by a 136K training set. The benchmark reveals critical deficiencies of existing VLMs in 3D temporal reasoning.
A Novel Approach to Classification of ECG Arrhythmia Types with Latent ODEs: This work combines a path-minimized Latent ODE encoder with a gradient-boosted decision tree (GBDT) into a two-stage ECG arrhythmia classification pipeline. On the MIT-BIH dataset, the macro AUC-ROC degrades only marginally from 0.984 at 360 Hz to 0.976 at 45 Hz, demonstrating strong robustness to sampling frequency variation.
A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking: This paper proposes UniVF, the first unified video fusion framework based on multi-frame learning, optical flow feature warping, and temporal consistency loss, along with VF-Bench, the first video fusion benchmark covering four major fusion tasks (multi-exposure, multi-focus, infrared-visible, and medical), achieving state-of-the-art performance across all sub-tasks.
Active Target Discovery under Uninformative Prior: The Power of Permanent and Transient Memory: This paper proposes EM-PTDM, a framework inspired by the dual-memory system in neuroscience. It leverages a pretrained diffusion model as "permanent memory" and incorporates a lightweight "transient memory" module based on Doob's h-transform to achieve efficient active target discovery without any domain-specific prior data, with theoretical guarantees of monotonic prior improvement.
Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?: This paper shows that pixel-level metrics such as PSNR and SSIM fail to capture anatomical structural completeness in sparse-view CT reconstruction (correlation of only 0.16–0.30). It proposes anatomy-aware metrics (NSD and clDice) computed from automated segmentation, together with the CARE framework, which incorporates a segmentation-guided loss into diffusion model training. CARE improves structural completeness by 32% for large organs and 36% for vessels.
Brain Harmony: A Multimodal Foundation Model Unifying Morphology and Function into 1D Tokens: This paper presents the first multimodal brain foundation model that unifies structural morphology (T1 sMRI) and functional dynamics (fMRI), compressing high-dimensional neuroimaging data into compact 1D token representations via Geometric Harmonics Pre-alignment and Temporally Adaptive Patch Embedding (TAPE). The model consistently outperforms prior methods on neurodevelopmental/neurodegenerative disease diagnosis and cognitive prediction tasks.
BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals: This paper proposes BrainOmni—the first brain signal foundation model that unifies EEG and MEG—by discretizing heterogeneous brain signals into a unified token space via BrainTokenizer (incorporating a physical Sensor Encoder), followed by self-supervised masked prediction pretraining with a Criss-Cross Transformer. The model achieves an 11.7 percentage-point improvement on Alzheimer's disease detection and demonstrates zero-shot reconstruction generalization to completely unseen devices.
Care-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson's Disease Gait Assessment: This work introduces Care-PD — the largest multi-site anonymized 3D mesh dataset for Parkinson's disease (PD) gait analysis to date, comprising 9 cohorts, 8 clinical centers, 362 subjects, and 8,477 walking bouts. It provides a systematic benchmark for UPDRS gait scoring and motion pre-training tasks, demonstrating that fine-tuning on Care-PD reduces MPJPE from 60.8 mm to 7.5 mm and improves F1 by 17 percentage points.
Convolutional Monge Mapping between EEG Datasets to Support Independent Component Labeling: This paper extends CMMN (Convolutional Monge Mapping Normalization) by proposing two strategies — channel-averaged PSD with \(\ell_1\)-normalized barycenter and subject-to-subject matching — to generate a single time-domain filter for domain adaptation across EEG datasets with differing channel counts. On independent component (IC) brain/non-brain classification, the F1 score improves from 0.77 to 0.84, surpassing ICLabel (0.88→0.91).
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays: This paper proposes CheXStruct and CXReasonBench — a structured diagnostic reasoning evaluation framework for chest X-rays that employs multi-path, multi-stage assessment to reveal critical deficiencies in existing LVLMs at intermediate reasoning steps.
DCA: Graph-Guided Deep Embedding Clustering for Brain Atlases: DCA (Deep Cluster Atlas) proposes a graph-guided deep embedding clustering framework that combines voxel-level spatiotemporal embeddings from a pretrained Swin-UNETR with KNN graph spatial regularization. By aligning soft assignments with atlas clustering auxiliary labels via KL divergence, the framework generates functionally homogeneous and spatially contiguous individualized brain atlases. On the HCP dataset, DCA achieves 98.8% improvement in homogeneity and 29% improvement in silhouette coefficient, and outperforms existing atlases on downstream tasks including autism diagnosis and cognitive decoding.
Demo: Generative AI helps Radiotherapy Planning with User Preference: This paper proposes the Flexible Dose Proposer (FDP), a two-stage training framework (VQ-VAE pretraining + multi-condition encoding) that enables slider-based interactive 3D dose distribution prediction incorporating user preferences. The system is integrated into the Eclipse clinical treatment planning system and outperforms Varian RapidPlan in head-and-neck cancer radiotherapy scenarios.
DermaCon-IN: A Multi-concept Annotated Dermatological Image Dataset of Indian Skin Disorders: This work introduces DermaCon-IN—the first densely annotated dermatological image dataset predominantly featuring Indian skin tones (5,450 images / 3,002 patients / 245 diagnoses)—providing three-level hierarchical diagnostic labels, 47 lesion descriptors, and 49 anatomical site annotations, with benchmark evaluations using CNN, ViT, and concept bottleneck model architectures.
DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging: This paper proposes Decentralized Isolation Networks (DIsoN), which detects OOD samples by training a binary classifier to "isolate" a test sample from training data, and leverages training data information without sharing it through decentralized parameter exchange. The method achieves state-of-the-art performance across 12 OOD detection tasks on 4 medical imaging datasets.
Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum: A fully self-supervised noise-robust representation learning framework is proposed, leveraging a "denoised→noisy" data curriculum strategy combined with denoised-teacher regularization. This enables SSL models such as DINOv2 to directly process noisy inputs at inference time without any denoiser, achieving a 4.8% improvement in linear probing accuracy under extreme Gaussian noise on ImageNet-1k.
Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback: This paper proposes MAGIC, a framework that encodes dermatologist-defined clinical checklists into structured evaluation prompts executable by MLLMs (e.g., GPT-4o), and uses the resulting feedback to fine-tune diffusion models via DPO or reward-based fine-tuning (RFT), generating clinically accurate skin disease images for data augmentation. MAGIC achieves +9.02% improvement on a 20-class skin disease classification task and +13.89% in few-shot settings.
Domain-Adaptive Transformer for Data-Efficient Glioma Segmentation in Sub-Saharan MRI: This paper proposes SegFormer3D+, a domain-adaptive Transformer architecture tailored for heterogeneous MRI data from Sub-Saharan Africa. By integrating histogram matching, radiomics-guided stratified sampling, a frequency-aware dual-path encoder, and a dual attention mechanism, the model achieves a mean Dice of 0.81 for glioma segmentation with only 60 annotated cases for fine-tuning, outperforming nnU-Net by +2.5%.
Dual Mixture-of-Experts Framework for Discrete-Time Survival Analysis: This paper proposes a Dual Mixture-of-Experts (Dual MoE) framework for discrete-time survival analysis, combining a feature encoder MoE (for modeling patient subgroup heterogeneity) with a hazard network MoE (for capturing temporal dynamics). The framework achieves improvements of up to 0.04 in time-dependent C-index on the METABRIC and GBSG breast cancer datasets.
DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs: DyG-Mamba introduces continuous state space models (SSMs) into dynamic graph learning. It proposes a temporal span-aware continuous SSM that models irregular time intervals via an exponential decay function inspired by the Ebbinghaus forgetting curve, combined with input-dependent parameters constrained by spectral norm for Lipschitz robustness. The method achieves an average rank of 2.42 across 12 dynamic graph benchmarks (vs. DyGFormer's 2.92) while maintaining \(O(bdL)\) linear complexity.
Dynamic Causal Discovery in Alzheimer's Disease through Latent Pseudotime Modelling: This paper applies BN-LTE (Bayesian Network with Latent Time Embedding) to real-world ADNI data from AD patients to infer dynamic causal graphs that evolve along a disease pseudotime axis. The learned pseudotime achieves a diagnostic AUC of 0.82, substantially outperforming chronological age (AUC 0.59), and reveals dynamic causal relationships between emerging biomarkers NfL/GFAP and established AD markers.
EEGReXferNet: A Lightweight Gen-AI Framework for EEG Subspace Reconstruction via Cross-Subject Transfer Learning and Channel-Aware Embedding: This paper proposes EEGReXferNet, a lightweight generative AI framework that achieves EEG subspace reconstruction under a cross-subject transfer learning setting via neighborhood channel-aware input selection, band-specific sub-window convolutional encoding/decoding, a dynamic sliding-window latent space, and reference statistics scaling. The framework reduces parameter count by approximately 45% and achieves inference latency <1ms, while maintaining PSD correlation \(\geq 0.95\) and spectrogram RV coefficient \(\geq 0.85\).
EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis: This paper introduces EndoBench, the first comprehensive MLLM evaluation benchmark covering 4 endoscopic scenarios, 12 clinical tasks, and 5 levels of visual prompt granularity, comprising 6,832 clinically validated VQA pairs. Evaluation of 23 MLLMs reveals that commercial models generally outperform open-source and medical-specific counterparts, yet all remain below human expert performance.
EWC-Guided Diffusion Replay for Exemplar-Free Continual Learning in Medical Imaging: This paper proposes an exemplar-free continual learning framework that combines class-conditional DDPM diffusion replay with Elastic Weight Consolidation (EWC). On MedMNIST v2 (8 tasks across 2D/3D) and CheXpert, it achieves an AUROC of 0.851, approaching the joint training upper bound (0.869), and reduces forgetting by over 30% compared to DER++, without storing any original patient data.
FAPEX: Fractional Amplitude-Phase Expressor for Robust Cross-Subject Seizure Prediction: This paper proposes FAPEX, a framework that achieves adaptive time-frequency decomposition via a learnable Fractional Neural Frame Operator (FrNFO), combined with Amplitude-Phase Cross-Encoding (APCE) and Spatial Correlation Aggregation (SCA). FAPEX comprehensively outperforms 33 baseline methods across 12 cross-species, cross-modality seizure prediction benchmarks.
Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling: This paper proposes HiVE-MIL, a hierarchical vision-language MIL framework that constructs a unified heterogeneous graph to model cross-scale hierarchical relationships (5× and 20×) and intra-scale multimodal alignment. Combined with a text-guided dynamic filtering mechanism and a hierarchical contrastive loss, HiVE-MIL consistently outperforms existing methods under the 16-shot setting on three TCGA datasets (lung, breast, and renal cancer), achieving up to 4.1% improvement in Macro F1.
FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification: This paper proposes FireGNN, which for the first time embeds trainable fuzzy rules into the GNN forward pass. Using three topological descriptors—node degree, clustering coefficient, and label consistency—FireGNN achieves endogenous interpretability for medical image classification, outperforming standard GCN/GAT/GIN and auxiliary-task baselines on 5 MedMNIST datasets and MorphoMNIST.
FOXES: A Framework For Operational X-ray Emission Synthesis: This paper proposes FOXES, a Vision Transformer-based framework that translates multi-channel solar EUV observation images into soft X-ray (SXR) flux, achieving an overall Pearson correlation of 0.982. The framework lays the groundwork for far-side solar flare detection and the construction of more complete flare catalogs.
Generalizable, Real-Time Neural Decoding with Hybrid State-Space Models: POSSM proposes a hybrid SSM-attention architecture that combines spike-level tokenization with a recurrent state-space model backbone, achieving generalizable real-time neural decoding with inference speeds up to 9× faster than Transformers while maintaining comparable accuracy.
GeoDynamics: A Geometric State-Space Neural Network for Understanding Brain Dynamics on Riemannian Manifolds: This paper proposes GeoDynamics, which generalizes the classical state-space model (SSM) from Euclidean space to the symmetric positive definite (SPD) manifold. By employing weighted Fréchet mean aggregation and orthogonal group translations, it achieves geometrically consistent state evolution on the manifold, attaining state-of-the-art performance on brain connectome analysis (early diagnosis of AD/PD/ASD) and human action recognition.
ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression: This work re-evaluates the feature reliance of CNNs using a systematic feature suppression framework in place of cue-conflict experiments. It finds that CNNs are not inherently texture-biased but instead rely primarily on local shape features, and that feature reliance patterns differ substantially across domains (computer vision, medical imaging, and remote sensing).
Interpretable Next-token Prediction via the Generalized Induction Head: This paper proposes the Generalized Induction-Head Model (GIM), an interpretable language model that combines exact n-gram matching with fuzzy matching. By constructing a "generalized induction head" to retrieve similar sequences from the input context for next-token prediction, it achieves up to 25 percentage points improvement over interpretable baselines and a 20% improvement in fMRI brain response prediction.
LoMix: Learnable Weighted Multi-Scale Logits Mixing for Medical Image Segmentation: LoMix introduces a Combinatorial Mutation Module (CMM) that generates "mutant" logits from multi-scale outputs via four fusion operators (addition / multiplication / concatenation / attention-weighted fusion) across all subset combinations, paired with NAS-style Softplus learnable weights for automatic contribution balancing. On Synapse 8-organ segmentation, Dice improves from 80.9% to 85.1% (+4.2 percentage points), and by +9.23% when only 5% of the training data is used.
Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation: This paper proposes Magical, an asymmetric LoRA architecture for medical lay language generation (MLLG) that enforces a semantic invariance constraint on the shared matrix \(A\) while employing multiple independent matrices \(B\) to enable semantically faithful and stylistically diverse lay language generation. Magical reduces trainable parameters by 31.66% while outperforming all LoRA variants.
Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation: This paper proposes Mamba-HoME, an architecture that integrates a Hierarchical Soft Mixture-of-Experts (HoME) with the Mamba SSM. Through a two-level token routing mechanism, it achieves local-to-global feature modeling and surpasses existing state-of-the-art methods on 3D medical image segmentation across CT, MRI, and ultrasound modalities, while maintaining linear computational complexity.
MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation: This paper proposes the MATCH framework, which tightly couples topological reasoning with the perturbation-robustness principle of semi-supervised learning. By exploiting dual-level topological consistency across random perturbations and temporal training snapshots, MATCH adaptively identifies reliable topological structures without requiring manually defined thresholds, substantially reducing topological errors in histopathology image segmentation.
MEGState: Phoneme Decoding from Magnetoencephalography Signals: This paper proposes MEGState, an architecture combining multi-resolution convolution and sensor-wise state space models (SSMs) for decoding phonemes from magnetoencephalography (MEG) signals, achieving substantial improvements over baseline methods on the LibriBrain dataset.
Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex: This paper proposes BraInCoRL (Brain In-Context Representation Learning), a Transformer-based meta-learning framework that predicts voxel-level neural responses for new subjects directly from a small number of stimulus–response samples via in-context learning (ICL), requiring no fine-tuning to generalize to new subjects or stimuli. With only 100 images as context, it approaches the performance of a reference model fully trained on 9,000 images.
Mind the (Data) Gap: Evaluating Vision Systems in Small Data Applications: This paper systematically compares MLLMs (e.g., Gemini, Qwen2.5-VL) and vision encoder + SVM pipelines on the NeWT ecological classification benchmark across the "small data regime" (10–1000 labeled samples). MLLMs plateau after 10–30 samples, whereas vision-based methods exhibit near-logarithmic growth throughout. The authors call on the community to prioritize small-data evaluation.
Modeling X-ray Photon Pile-up with a Normalizing Flow: This paper proposes a Simulation-Based Inference (SBI) framework based on Normalizing Flows. A CNN extracts spatially resolved X-ray spectral features, which are then passed to a neural spline flow to perform accurate posterior estimation of astrophysical source parameters in the presence of photon pile-up, substantially outperforming the conventional PSF-core excision approach.
MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding: This paper proposes MoRE-Brain, a neuroscience-inspired fMRI visual decoding framework that employs a hierarchical Mixture-of-Experts (MoE) architecture to simulate the specialized processing of the brain's visual pathway. Combined with a dynamic temporal-spatial dual-routing mechanism that guides image generation via a diffusion model, MoRE-Brain achieves high-fidelity reconstruction while enabling efficient cross-subject generalization and unprecedented mechanistic interpretability.
MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology: This paper introduces MTBBench—the first clinical benchmark simultaneously covering three dimensions: multimodality, longitudinal temporal sequencing, and interactive agent workflows. It simulates the decision-making process of Molecular Tumor Boards (MTBs) to evaluate and enhance the multimodal longitudinal reasoning capabilities of AI agents in precision oncology.
Multimodal Bayesian Network for Robust Assessment of Casualties in Autonomous Triage: This paper proposes an expert-knowledge-driven Bayesian network decision-support framework that fuses outputs from multiple computer vision models to assess casualty conditions. Requiring no training data and supporting inference under incomplete information, the framework improved triage accuracy from 14% to 53% and diagnostic coverage from 31% to 95% in the DARPA Triage Challenge.
NeurIPT: Foundation Model for Neural Interfaces: NeurIPT is an EEG foundation model for diverse brain–computer interface (BCI) applications. Through four key innovations—Amplitude-Aware Masking Pre-training (AAMP), Progressive Mixture-of-Experts (PMoE) architecture, 3D electrode spatial encoding, and Intra- and Inter-Lobe Pooling (IILP)—it achieves state-of-the-art performance across eight downstream BCI tasks.
Online Feedback Efficient Active Target Discovery in Partially Observable Environments: This paper proposes DiffATD, which leverages the reverse process of diffusion models to construct a belief distribution for balancing exploration and exploitation, enabling efficient target region discovery in partially observable environments without any supervised training. The framework is applicable across multiple domains including medical imaging, species discovery, and remote sensing.
Ordinal Label-Distribution Learning with Constrained Asymmetric Priors for Imbalanced Retinal Grading: This paper proposes CAP-WAE (Constrained Asymmetric Prior Wasserstein Autoencoder), which addresses the challenges of long-tailed distribution and ordinal structure in diabetic retinopathy (DR) grading through three innovations: asymmetric priors, a margin-aware orthogonality and compactness loss, and a direction-aware ordinal loss, achieving state-of-the-art performance on multiple DR benchmarks.
Orochi: Versatile Biomedical Image Processor: This paper proposes Orochi—the first general-purpose foundation model for low-level biomedical image processing. Through Task-related Joint-embedding Pre-training (TJP) and a Multi-head Hierarchy Mamba architecture, Orochi matches or surpasses task-specific state-of-the-art models across four tasks—registration, fusion, restoration, and super-resolution—with lightweight fine-tuning of fewer than 5% of parameters.
Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains: This paper proposes the Pancakes framework, which, given a collection of biomedical images from an unseen domain, automatically generates label maps for multiple plausible segmentation protocols, ensuring semantic consistency across images within the same protocol—i.e., the same label refers to the same anatomical structure across all images.
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation: This paper proposes PhysioWave, a multi-scale Transformer architecture based on learnable wavelet decomposition and frequency-guided masking. It establishes, for the first time, large-scale pretrained foundation models for EMG and ECG, and achieves state-of-the-art performance on both unimodal and multimodal physiological signal tasks through a multimodal fusion framework.
PolyPose: Deformable 2D/3D Registration via Polyrigid Transformations: This paper presents PolyPose, a deformable 2D/3D registration method based on polyrigid transformations. Leveraging the anatomical prior that bones are rigid bodies, PolyPose parameterizes complex 3D deformation fields as weighted combinations of multiple rigid transformations in the Lie algebra \(\mathfrak{se}(3)\), enabling accurate 3D volumetric registration from as few as two X-ray images without any regularization or hyperparameter tuning.
Posterior Sampling by Combining Diffusion Models with Annealed Langevin Dynamics: This paper proposes an algorithm combining diffusion models with annealed Langevin dynamics that requires only \(L^4\)-accurate score estimates to achieve polynomial-time posterior sampling under (locally) log-concave distributions, providing the first theoretical guarantees for warm-started inverse problem solving.
QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training: QoQ-Med constructs a multimodal clinical foundation model spanning 9 clinical modalities (1D ECG + 6 types of 2D images + 2 types of 3D scans), and proposes Domain-aware Relative Policy Optimization (DRPO)—which employs hierarchical temperature scaling (inter-domain × intra-domain K-means clustering) to address modality/difficulty imbalance. Trained on 2.61 million instruction-tuning pairs, it achieves an average F1 of 0.295 (vs. GRPO 0.193, +52.8%), ranking best in 6 out of 8 modalities.
RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray: This paper proposes RadZero, a framework centered on VL-CABS (Vision-Language Cross-Attention Based on Similarity), enabling explainable and fine-grained vision-language alignment on chest X-rays with unified support for zero-shot classification, localization, and segmentation.
RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis: RAM-W600 is the first publicly available multi-task wrist conventional radiograph dataset, comprising 1,048 images and supporting two clinically relevant tasks: carpal bone instance segmentation and Sharp/van der Heijde (SvdH) bone erosion (BE) scoring, accompanied by comprehensive benchmarking.
Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology: This paper revisits end-to-end (E2E) learning with slide-level supervision in computational pathology, and is the first to identify optimization difficulties induced by sparse-attention MIL under E2E training. It proposes ABMILX, which addresses this issue via multi-head attention and a global attention correction module, enabling E2E-trained ResNets to surpass state-of-the-art foundation models on multiple benchmarks.
Riemannian Flow Matching for Brain Connectivity Matrices via Pullback Geometry: This paper proposes DiffeoCFM, which leverages pullback metrics induced by global diffeomorphisms to equivalently reformulate conditional flow matching on Riemannian manifolds as standard CFM in Euclidean space. The method enables efficient generation of brain connectivity matrices (SPD/correlation) while strictly preserving manifold constraints, achieving state-of-the-art performance on 3 fMRI and 2 EEG datasets.
Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis: This paper proposes the first diffusion Transformer for voxel-level whole-brain 4D fMRI conditional generation, combining 3D VQ-GAN latent space compression, a CNN-Transformer hybrid backbone, and strong conditioning via AdaLN-Zero and cross-attention. The model achieves a task activation map correlation of 0.83, RSA of 0.98, and perfect condition specificity across seven cognitive tasks from the HCP dataset.
Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation: This paper proposes DISCOVR, a self-supervised dual-branch framework that transfers fine-grained spatial semantics from an image encoder to the temporal representations of a video encoder via online semantic cluster distillation, achieving state-of-the-art performance across six cross-population cardiac ultrasound datasets on anomaly detection, classification, and segmentation tasks.
Self-Supervised Learning via Flow-Guided Neural Operator on Time-Series Data: This paper proposes FGNO (Flow-Guided Neural Operator), which combines Flow Matching with operator learning for self-supervised pre-training on time-series data. By leveraging STFT for resolution-invariant function-space learning and treating flow time and network layer depth as adjustable "knobs" for controlling feature granularity, FGNO substantially outperforms baselines such as MAE on biomedical tasks.
Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology: This paper proposes HeteroTissue-Diffuse (HTD), a dual-conditioned Latent Diffusion Model that generates heterogeneous pathology images by simultaneously conditioning on semantic segmentation maps and real tissue crops (visual crops). On Camelyon16, the method reduces Fréchet Distance from 430 to 72 (a 6× improvement). A DeepLabv3+ segmentation model trained on the synthetic data reaches an IoU within 1–2% of that of models trained on real data. The approach is further extended to 11,765 unannotated TCGA whole-slide images via self-supervised clustering.
Sequential Attention-based Sampling for Histopathological Analysis: This paper proposes SASHA, a framework integrating a Hierarchical Attention-based Feature Distillation (HAFED) module with deep reinforcement learning (RL). By sampling only 10–20% of high-resolution patches, SASHA achieves classification performance on par with full-resolution SOTA methods, while yielding a 4–8× inference speedup and a WSI compression ratio exceeding 16×.
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning: This paper presents SMMILE — the first expert-driven benchmark for multimodal medical in-context learning (ICL), comprising 111 questions (517 image-text QA triplets) spanning 6 medical specialties and 13 imaging modalities, constructed by 11 clinical experts. The benchmark systematically exposes critical deficiencies of current MLLMs in medical multimodal ICL and reveals the pivotal impact of in-context example quality and ordering on model performance.
STAMP: Spatial-Temporal Adapter with Multi-Head Pooling: STAMP introduces a lightweight spatial-temporal adapter with only 750K parameters for Time Series Foundation Models (TSFMs). Through three sets of positional encodings (token/spatial/temporal), cross-gated MLP mixing, and multi-head attention pooling, it enables a frozen TSFM (e.g., MOMENT 385M) to compete with or surpass EEG-specific models with 29M parameters (CBraMod) across 8 EEG datasets, achieving 193% higher Kappa than CBraMod on BCIC-IV-2a.
STARC-9: A Large-scale Dataset for Multi-Class Tissue Classification for CRC Histopathology: This paper introduces STARC-9, a large-scale colorectal cancer (CRC) tissue classification dataset comprising 630K patches across 9 tissue classes, along with its construction framework DeepCluster++. The framework combines domain-specific autoencoder feature extraction, K-means clustering, and equal-frequency binning sampling to ensure morphological diversity. Models trained on STARC-9 significantly outperform those trained on NCT and HMU.
Surf2CT: Cascaded 3D Flow Matching Models for Torso 3D CT Synthesis from Skin Surface: This paper proposes Surf2CT, a cascaded 3D Flow Matching framework that, for the first time, synthesizes complete high-resolution 3D CT volumes solely from external body surface scans and demographic data (age, sex, height, weight), without requiring any internal imaging input.
SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning: This paper proposes SynBrain, a framework that models fMRI responses as visual-semantic-conditioned probability distributions via BrainVAE, and employs an S2N Mapper for one-step semantic-to-neural-space mapping. SynBrain substantially outperforms MindSimulator on visual-to-fMRI synthesis (65% reduction in MSE, 96% improvement in Pearson correlation), and the synthesized fMRI signals effectively enhance few-shot cross-subject decoding performance.
The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective: FairTTE is the first comprehensive framework to systematically investigate fairness in time-to-event (TTE) prediction for medical imaging. It leverages causal analysis to quantify five sources of bias, and through training over 20,000 models, reveals the limitations of existing fairness methods — particularly the fundamental challenge of maintaining fairness under distribution shift.
The Human Brain as a Combinatorial Complex: This paper proposes a data-driven framework that constructs Combinatorial Complexes (CCs) directly from fMRI time series using two information-theoretic measures, S-information and O-information. The construction encodes higher-order synergistic interactions among brain regions into topological structures, laying the groundwork for applying topological deep learning to brain network analysis.
THUNDER: Tile-level Histopathology image UNDERstanding benchmark: This paper presents THUNDER, a comprehensive tile-level benchmark for digital pathology foundation models, enabling efficient comparison of 23 foundation models across 16 datasets, covering downstream task performance, feature space analysis, robustness, and uncertainty estimation.
Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation: This work introduces ViMed-PET, the first Vietnamese PET/CT image-report dataset comprising 2,757 whole-body PET/CT volumes paired with complete clinical reports. Through a data augmentation strategy and a three-stage fine-tuning pipeline, the approach substantially improves VLM performance on medical report generation and VQA tasks. Novel evaluation metrics based on clinically critical information are also proposed.
UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation: This paper proposes UniMRSeg, a unified missing-modality segmentation framework built on a Hierarchical Self-Supervised Compensation (HSSC) mechanism that spans input-level modality reconstruction, feature-level contrastive learning, and output-level consistency regularization. With a single set of fully shared parameters, it achieves the best average performance and the lowest performance variance across all possible modality combinations.
Unpaired Image-to-Image Translation for Segmentation and Signal Unmixing: This paper proposes Ui2i, a model built upon CycleGAN that achieves high content-fidelity unpaired image-to-image translation through four key innovations: a UNet-based generator, approximate bidirectional spectral normalization (ABSN) as a replacement for feature normalization, channel-spatial attention, and scale augmentation. The model is successfully applied to two biomedical tasks: IHC→H&E domain adaptation for nucleus segmentation and single-channel immunofluorescence signal unmixing.
Variational Autoencoder with Normalizing Flow for X-ray Spectral Fitting: This work embeds a Normalizing Flow (NF) into an autoencoder architecture to enable fast physical parameter inference and full posterior distribution estimation for NICER spectral data of black hole X-ray binaries, achieving approximately 2000× speedup over traditional MCMC methods while maintaining comparable accuracy.
VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation: This paper proposes VQ-Seg, the first method to introduce vector quantization into semi-supervised medical image segmentation. A Quantization Perturbation Module (QPM) replaces conventional dropout to achieve more controllable feature perturbation, complemented by a dual-branch architecture and foundation-model-guided alignment that compensate for the information lost to quantization.
Zebra: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding: This paper proposes Zebra, the first zero-shot brain visual decoding framework, which disentangles fMRI representations into subject-invariant and semantic-specific components via adversarial training and residual decomposition, enabling visual reconstruction that generalizes across subjects without fine-tuning on new ones.
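
Several summaries above report relative figures ("193% higher Kappa", "65% reduction in MSE", "approximately 2000× speedup"). As a reading aid, the sketch below shows how such figures are conventionally computed. The function names and all input values are hypothetical illustrations, not numbers taken from any of the papers, and the papers themselves may define their metrics differently.

```python
def relative_change_pct(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`.

    Positive means an increase, negative means a reduction.
    """
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return (new - baseline) / abs(baseline) * 100.0


def ratio_from_pct_higher(pct_higher: float) -> float:
    """Convert 'X% higher' into a multiplicative ratio (new / baseline)."""
    return 1.0 + pct_higher / 100.0


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Speedup factor: how many times faster the new method runs."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic.
    kappa_baseline, kappa_new = 0.100, 0.293
    print(round(relative_change_pct(kappa_baseline, kappa_new), 1))  # percent higher
    print(round(ratio_from_pct_higher(193.0), 2))                    # ratio new / baseline

    mse_baseline, mse_new = 1.00, 0.35
    print(round(relative_change_pct(mse_baseline, mse_new), 1))      # negative = reduction

    print(speedup(baseline_seconds=2000.0, new_seconds=1.0))         # times faster
```

Note that "193% higher" corresponds to a ratio of about 2.93 rather than 1.93, which is an easy figure to misread when comparing entries.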