
🏥 Medical Imaging

📹 ICCV2025 · 40 paper notes

AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images

This work proposes AcZeroTS, a framework that integrates active learning with a VLM-based prototype-guided zero-shot segmentation model (ProZS). By simultaneously accounting for uncertainty, diversity, and the ability of selected samples to improve prototype coverage over unseen classes, the framework selects the most informative samples for annotation, achieving high-quality segmentation of both seen and unseen tissue types under minimal annotation budgets.
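The selection criterion described above can be illustrated with a toy sketch. The function below is hypothetical (not the paper's actual scoring rule): it mixes normalized predictive entropy (uncertainty) with greedy farthest-point distance (diversity), which is one common way active-learning methods combine the two signals.

```python
import numpy as np

def select_for_annotation(probs, feats, budget, alpha=0.5):
    """Pick `budget` samples by a weighted mix of predictive entropy
    (uncertainty) and distance to already-selected samples (diversity).
    probs: (N, C) softmax outputs; feats: (N, D) embeddings."""
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)   # uncertainty
    entropy = entropy / (entropy.max() + eps)              # normalize to [0, 1]

    selected = [int(entropy.argmax())]                     # seed with most uncertain
    for _ in range(budget - 1):
        # distance from each candidate to its nearest already-selected sample
        d = np.linalg.norm(feats[:, None] - feats[selected][None], axis=2).min(axis=1)
        d = d / (d.max() + eps)
        score = alpha * entropy + (1 - alpha) * d
        score[selected] = -np.inf                          # never re-pick
        selected.append(int(score.argmax()))
    return selected
```

AcZeroTS additionally scores how much a candidate would improve prototype coverage of unseen classes, which this toy version omits.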

Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation

This paper proposes ProLearn, a framework that introduces a Prototype-driven Semantic Approximation (PSA) module to fundamentally alleviate textual reliance in medical language-guided segmentation. The prototype space is initialized from a small number of image-text pairs; thereafter, both training and inference require no text input. ProLearn maintains strong performance with only 1% of text available (Dice = 0.857 on QaTa-COV19), using 1000× fewer parameters and 100× faster inference than LLM-based solutions.

An OpenMind for 3D Medical Vision Self-Supervised Learning

This work releases OpenMind, the largest publicly available 3D medical imaging pre-training dataset (114k brain MRIs), and systematically compares 7+ SSL methods across two architectures — a CNN (ResEnc-L) and a Transformer (Primus-M) — on 15 downstream datasets. Key findings: MAE pre-training yields the best segmentation performance, contrastive learning excels at classification, and for the first time, a pre-trained Transformer is shown to outperform a randomly initialized CNN on select datasets.

An OpenMind for 3D Medical Vision Self-supervised Learning

This work releases OpenMind, the largest publicly available 3D medical imaging pretraining dataset (114k brain MRI volumes), and conducts a systematic benchmark of existing 3D SSL methods on this dataset using state-of-the-art CNN (ResEnc-L) and Transformer (Primus-M) architectures, establishing the current SOTA for 3D medical image SSL.

Beyond Brain Decoding: Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI

This paper proposes NeuroCreat — a multimodal brain-decoding architecture that integrates the visual and textual capabilities of LLMs — extending fMRI decoding from single-task visual stimulus reconstruction to three tasks: image reconstruction, text captioning, and mental creation. A Prompt Variant Alignment (PVA) module is introduced to effectively bridge the gap between low-resolution fMRI signals and high-level semantic representations.

Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training

This paper proposes ViSD-Boost, which addresses the alignment bias caused by low visual semantic density in medical vision-language pre-training (VLP). The method employs disease-level visual contrastive learning to enhance visual semantics and VQ-VAE-based anatomical normality modeling to amplify abnormality signals, achieving 84.9% AUC in zero-shot diagnosis across 54 diseases spanning 15 organs.

COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation

This paper proposes COIN, a three-stage framework that addresses the critical "error-free instance absence" problem in annotation-free cell instance segmentation. The framework combines unsupervised semantic segmentation with optimal transport for pixel-level cell propagation, model–SAM consistency for instance-level confidence scoring, and confidence-guided recursive self-distillation, achieving performance on MoNuSeg and TNBC that surpasses semi-supervised and weakly supervised methods.

Controllable Latent Space Augmentation for Digital Pathology

This paper proposes HistAug — a lightweight Transformer-based latent space augmentation model that simulates realistic image transformations (hue shifts, erosion, etc.) in feature space via conditional cross-attention, providing controllable and computationally efficient data augmentation for pathology MIL training at minimal overhead.

Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography

This paper proposes an efficient self-supervised joint reconstruction method that parameterizes the speed of sound (SOS) as either a pixel grid or a neural field, recovering SOS and high-quality photoacoustic images by backpropagating gradients through a differentiable imaging forward model. The method surpasses the current state of the art in accuracy while achieving a 35× speedup (40 seconds vs. 23 minutes).

CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy

CryoFastAR is proposed as the first geometric foundation model for cryo-EM, which employs a ViT architecture to directly predict Fourier Planar Maps from multi-view noisy particle images in a feed-forward manner for pose estimation, achieving over 10× speedup while maintaining comparable reconstruction quality on both synthetic and real datasets.

CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy

The first work to introduce the DUSt3R-style geometric foundation model paradigm into cryo-EM, achieving feedforward pose prediction from large sets of noisy particle images via a ViT encoder with cross-view attention decoder—without iterative optimization—enabling ab initio protein 3D reconstruction 10–33× faster than traditional methods.

CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

This paper proposes CuMPerLay, a differentiable Cubical Multiparameter Persistence (CMP) vectorization layer that decomposes CMP into multiple learnable single-parameter persistence lines. By jointly learning bifiltration functions for end-to-end training and embedding the layer into Swin Transformer, the method achieves significant improvements on medical image classification and semantic segmentation tasks, particularly in data-scarce settings.

DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup

Inspired by the intuition of human inspectors "consulting a dictionary," DictAS reformulates few-shot anomaly segmentation (FSAS) as a dictionary lookup task—a query feature is deemed anomalous if it cannot be retrieved from a dictionary of normal samples. Through self-supervised training, the framework acquires class-agnostic lookup capability and achieves state-of-the-art FSAS performance and inference speed across 7 industrial and medical datasets.
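The lookup intuition reduces to a simple scoring rule, sketched below under my own assumptions (this is not DictAS's actual retrieval mechanism, which is learned): a query feature is scored by how well its best match in a dictionary of normal features explains it, with poor matches flagged as anomalous.

```python
import numpy as np

def anomaly_scores(queries, dictionary):
    """Score each query feature by how poorly it can be retrieved from a
    dictionary of normal features: 1 - max cosine similarity.
    queries: (Q, D); dictionary: (K, D) features from normal samples."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    sim = q @ d.T                    # (Q, K) cosine similarities
    return 1.0 - sim.max(axis=1)     # high score = no good normal match
```

A query identical to a dictionary entry scores near 0; one unlike every normal feature scores near 1, matching the "cannot be retrieved = anomalous" intuition.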

G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion

This paper proposes G2PDiffusion, the first diffusion model-based cross-species genotype-to-phenotype prediction framework, which generates morphological images conditioned on evolutionary signals (multiple sequence alignments (MSA) and environmental context) to predict species appearance from DNA sequences.

GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule

This paper proposes GDKVM, an echocardiography video segmentation architecture based on linear key-value association and the gated delta rule, achieving state-of-the-art performance on CAMUS and EchoNet-Dynamic through efficient memory management and multi-scale feature fusion while maintaining real-time inference speed.

GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology

GECKO is proposed as a WSI-level MIL aggregator pretraining method that requires no additional clinical data modalities. By automatically extracting interpretable concept priors from H&E WSIs and aligning them with deep features via contrastive learning, GECKO surpasses existing unimodal and multimodal pretraining methods on five classification tasks while providing pathologist-interpretable WSI-level descriptions.

GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

This paper presents GEMeX, the largest chest X-ray VQA dataset to date (151K images, 1.6M questions), which for the first time simultaneously provides textual reasoning explanations and visual region grounding across four question types, and systematically evaluates 12 representative large vision-language models.

IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising

This paper proposes Iterative Dynamic Filtering Networks (IDF), which achieve strong out-of-distribution (OOD) denoising performance using only ~0.04M parameters. By combining per-pixel dynamic kernel prediction with an adaptive iterative refinement strategy, IDF generalizes to diverse unseen noise types (Gaussian, Poisson, salt-and-pepper, Monte Carlo rendering, and real noise) despite being trained exclusively on single-level Gaussian noise.
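The two ingredients can be sketched in a few lines; this is a minimal toy version, not IDF's architecture, and `predict_kernels` stands in for the learned network. Each pixel gets its own softmax-normalized k×k kernel, and refinement re-predicts kernels from the current estimate.

```python
import numpy as np

def apply_dynamic_kernels(img, kernels):
    """Filter img (H, W) with a distinct normalized k x k kernel per pixel.
    kernels: (H, W, k, k) raw logits, softmax-normalized per pixel."""
    H, W = img.shape
    k = kernels.shape[-1]
    pad = np.pad(img, k // 2, mode="edge")
    # softmax so each pixel's kernel sums to 1
    w = np.exp(kernels - kernels.max(axis=(2, 3), keepdims=True))
    w = w / w.sum(axis=(2, 3), keepdims=True)
    out = np.zeros_like(img, dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = (pad[i:i + k, j:j + k] * w[i, j]).sum()
    return out

def iterative_denoise(img, predict_kernels, steps=3):
    """Adaptive iterative refinement: re-predict kernels from the current
    estimate and filter again. predict_kernels stands in for the network."""
    x = img.astype(float)
    for _ in range(steps):
        x = apply_dynamic_kernels(x, predict_kernels(x))
    return x
```

Because each kernel sums to 1, a constant image passes through unchanged, which is the sanity check one would expect of any denoising filter.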

InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation

InsideOut extends 3D Gaussian Splatting (3DGS) beyond RGB surface modeling to simultaneously represent internal X-ray structures, achieving joint representation of RGB appearance and internal radiative structure through hierarchical fitting and an X-ray reference loss.

Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines

This paper proposes integrating external biological knowledge — protein–protein interaction graphs and transcriptomic features from single-cell foundation models — into microscopy image pretraining, explicitly decoupling perturbation-specific and cell-line-specific representations to improve generalization of perturbation screening on unseen (de novo) cell lines.

M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast

M-Net reinterprets the spatial continuity between adjacent MRI slices as "quasi-temporal" data, and proposes the Mesh-Cast mechanism to seamlessly integrate arbitrary sequential models (LSTM, Transformer, Mamba SSM, etc.) into both channel and temporal information processing. Combined with a Two-Phase Sequential training strategy (TPS), M-Net achieves state-of-the-art segmentation performance on BraTS2019 and BraTS2023.

MRGen: Segmentation Data Engine for Underrepresented MRI Modalities

To address the lack of segmentation annotations for scarce MRI modalities, this work constructs a large-scale radiological image dataset MRGen-DB (~250K slices, 100+ modalities) and trains a controllable diffusion-based data engine MRGen. Using dual-condition control via text prompts and segmentation masks, MRGen generates high-quality MR images in target modalities for training segmentation models. Across 10 cross-modal segmentation experiments, the average DSC improves from 10%–27% to 43%–45%, enabling "zero-shot" segmentation for annotation-scarce modalities.

MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance

This paper proposes MultiverSeg, a progressive interactive segmentation system in which each image annotated by the user reduces the number of interactions required for subsequent images. By incorporating previously segmented images as in-context inputs, the system improves with use. On 12 unseen datasets, it reduces click counts by 36% and scribble steps by 25% compared to ScribblePrompt.

NEURONS: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction

This paper proposes NEURONS, a framework inspired by the hierarchical structure of the human visual cortex that decouples fMRI-to-video reconstruction into four sub-tasks (key object segmentation, concept recognition, scene description, and blurry video reconstruction), emulating the functional specialization of cortical regions V1/V2/V4/ITC. NEURONS substantially outperforms state-of-the-art methods in video consistency (+26.6%) and semantic accuracy (+19.1%).

ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users

This paper presents ProGait—the first multi-purpose video dataset targeting transfemoral amputee prosthesis users—supporting three tasks: video object segmentation, 2D human pose estimation, and gait analysis. Baseline models are provided to demonstrate the dataset's effectiveness in improving prosthesis detection.

Progressive Test Time Energy Adaptation for Medical Image Segmentation

This paper proposes a progressive test-time adaptation method based on energy-based models. A shape energy model is trained as an in-distribution/out-of-distribution discriminator; at test time, energy minimization guides the segmentation model to adapt to the target domain. The method consistently outperforms baselines across 8 public datasets covering cardiac, spinal cord, and lung segmentation tasks.
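The test-time step amounts to gradient descent on a learned energy. The sketch below is a generic stand-in, not the paper's method: it treats the shape energy as a black box and descends it with central-difference gradients, where `params` would be the quantities being adapted (e.g., segmentation logits).

```python
import numpy as np

def tta_energy_minimize(params, energy, lr=0.1, steps=50, eps=1e-4):
    """Test-time adaptation sketch: nudge `params` down the gradient of a
    learned shape-energy. Gradients via central differences so any
    black-box scalar `energy` function works."""
    z = params.astype(float).copy()
    for _ in range(steps):
        g = np.zeros_like(z)
        for i in range(z.size):
            e = np.zeros_like(z)
            e.flat[i] = eps
            g.flat[i] = (energy(z + e) - energy(z - e)) / (2 * eps)
        z -= lr * g                  # move toward the in-distribution manifold
    return z
```

In the paper the energy model plays the role of an in-distribution/out-of-distribution discriminator, so minimizing it pulls target-domain predictions toward shapes the model considers in-distribution.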

PVChat: Personalized Video Chat with One-Shot Learning

This paper proposes PVChat, the first video large language model supporting personalized subject learning from a single reference video. Through a ReLU-routed Mixture-of-Heads (ReMoH) attention mechanism, a systematic data augmentation pipeline, and a progressive image-to-video training strategy, PVChat achieves identity-aware video question answering and surpasses existing state-of-the-art ViLLMs across diverse scenarios including medical, TV drama, and anime settings.

RadGPT: Constructing 3D Image-Text Tumor Datasets

This paper proposes RadGPT — an anatomy-aware vision-language AI pipeline that converts radiologist-revised tumor segmentation masks into structured reports via deterministic algorithms, then adapts them into narrative-style reports using an LLM. This pipeline is used to construct AbdomenAtlas 3.0, the first large-scale public abdominal CT image-text tumor dataset (9,262 CT scans with per-voxel annotations and reports). The work demonstrates that segmentation assistance significantly improves tumor detection rates in AI-generated reports.

Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data

Through a systematic study of data scaling laws on a large-scale private dataset, this work demonstrates that synthetic tumors can substantially reduce the need for real annotations (from 1,500 to 500 cases). Building on these findings, the authors construct AbdomenAtlas 2.0—the first large-scale manually annotated CT dataset with over 10,000 scans covering six organ tumor types—achieving significant improvements on both in-distribution and out-of-distribution benchmarks.

SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications

This paper introduces SciVid, a benchmark comprising five interdisciplinary scientific video tasks—including animal behavior classification, tissue tracking, and weather forecasting—that systematically evaluates six categories of Video Foundation Models (ViFMs). The study finds that adapting a frozen ViFM backbone with a simple trainable readout suffices to achieve state-of-the-art performance on multiple scientific tasks, providing the first systematic evidence of the transferability of general-purpose ViFMs to scientific domains.

SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images

This paper introduces PETS-5k, the largest PET segmentation dataset to date (5,731 3D whole-body PET scans, over 1.3 million 2D slices), and proposes SegAnyPET — the first 3D promptable segmentation foundation model tailored for PET imaging. Through a Cross-Prompt Confidence Learning (CPCL) strategy to handle inconsistent annotation quality, SegAnyPET substantially outperforms existing foundation models and task-specific models on both seen and unseen targets.

Semi-supervised Deep Transfer for Regression without Domain Alignment

This paper proposes CRAFT (Contradistinguisher-based Regularization Approach for Flexible Training), a semi-supervised transfer learning framework that requires neither source data nor domain alignment, specifically designed for regression tasks. CRAFT jointly optimizes a supervised loss and an unsupervised Contradistinguisher-based regularization term to substantially improve prediction performance under label-scarce conditions.

SIC: Similarity-Based Interpretable Image Classification with Neural Networks

This paper proposes SIC, an inherently interpretable neural network that simultaneously provides local, global, and faithful explanations. By extracting class-representative support vectors from training images and computing input-to-support-vector similarities via B-cos transformations for classification, SIC achieves accuracy comparable to black-box models while delivering pixel-level contribution maps and case-based global explanations. On the FunnyBirds benchmark, SIC outperforms ProtoPNet on 8 out of 9 interpretability metrics.
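The case-based decision rule can be sketched as follows. This is a simplified stand-in (plain cosine similarity rather than SIC's B-cos transformations): each class is scored by the input's best similarity to that class's support vectors, so every prediction comes with the concrete training cases that justify it.

```python
import numpy as np

def sic_predict(x, supports, labels):
    """Case-based classification sketch: score each class by the input's
    best cosine similarity to that class's support vectors.
    x: (D,); supports: (S, D); labels: (S,) class of each support."""
    xs = x / np.linalg.norm(x)
    s = supports / np.linalg.norm(supports, axis=1, keepdims=True)
    sims = s @ xs                                   # similarity to every support
    classes = np.unique(labels)
    scores = np.array([sims[labels == c].max() for c in classes])
    return classes[scores.argmax()], scores
```

The per-support similarities are what make the explanation local and faithful: the winning score points directly at the most similar stored case.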

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality

This paper proposes SimMLM, a simple yet effective framework for multi-modal learning under missing modality conditions. It consists of a Dynamic Mixture of Modality Experts (DMoME) architecture and a More vs. Fewer (MoFe) ranking loss. SimMLM comprehensively outperforms state-of-the-art methods on brain tumor segmentation and multi-modal classification tasks with fewer parameters and lower computational cost, while providing interpretable modality importance estimates.

TeethGenerator: A Two-Stage Framework for Paired Pre- and Post-Orthodontic 3D Dental Data Generation

This paper proposes TeethGenerator, a two-stage framework for generating paired pre- and post-orthodontic 3D dental point cloud models. Stage I employs a VQ-VAE combined with a diffusion model to generate post-treatment tooth morphology, while Stage II uses a Transformer conditioned on a style model to generate the corresponding pre-treatment dental arrangement.

Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts

This paper proposes a new task and benchmark for Long-Tailed Online Anomaly Detection (LTOAD). The core innovation is replacing class-label dependency with a learnable class-agnostic concept set, combined with a Concept VQ-VAE and a comprehensive prompt learning framework. The proposed method achieves state-of-the-art performance in both offline and online settings without requiring class labels.

UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation

This paper introduces UKBOB—the largest annotated medical image segmentation dataset to date (51,761 MRI 3D volumes, 72 organ classes, 1.37 billion 2D segmentation masks)—and proposes a Specialized Organ Label Filter (SOLF) for cleaning automated annotations and an Entropy Test-Time Adaptation (ETTA) method for handling domain shift under noisy labels. The resulting Swin-BOB foundation model achieves state-of-the-art performance on the BRATS and BTCV benchmarks.

Vector Contrastive Learning for Pixel-wise Pretraining in Medical Vision

This paper proposes Vector Contrastive Learning (Vector CL), which reformulates standard contrastive learning from a binary optimization problem into a vector regression problem. By modeling feature distances to quantify the degree of dispersion, it addresses the over-dispersion problem in pixel-wise medical vision pretraining, achieving significant improvements over 17 methods across 8 downstream tasks.
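The shift from binary labels to regression can be made concrete with a toy loss; this is my illustrative construction, not the paper's exact objective. Instead of pushing negative pairs apart without bound, pairwise feature distance is regressed onto a target distance (here, scaled pixel-coordinate distance), so dispersion is controlled rather than maximized.

```python
import numpy as np

def vector_cl_loss(feats, coords, scale=1.0):
    """Vector-regression sketch: rather than labeling pairs positive/negative,
    regress pairwise feature distance onto (scaled) pixel-coordinate distance.
    feats: (N, D) pixel embeddings; coords: (N, 2) pixel locations."""
    fd = np.linalg.norm(feats[:, None] - feats[None], axis=2)       # feature dists
    td = scale * np.linalg.norm(coords[:, None].astype(float)
                                - coords[None], axis=2)             # target dists
    return float(((fd - td) ** 2).mean())
```

A binary contrastive loss rewards ever-larger negative-pair distances; here the loss is zero exactly when feature geometry matches the target geometry, which is the over-dispersion fix in miniature.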

ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis

This paper proposes ViCTr, a two-stage framework that combines Rectified Flow with a Tweedie-corrected diffusion process to achieve high-fidelity pathology-aware medical image synthesis. The method reduces inference steps from 50 to 3–4 and, for the first time, enables graded-severity pathology synthesis for abdominal MRI.

Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves

This paper proposes VSWE (Visual Surface Wave Elastography), a method that extracts the dispersion relation from a video of surface wave propagation and combines it with physics-based finite element optimization to infer subsurface layer thickness and stiffness. High-accuracy parameter recovery is demonstrated in both simulated and real gelatin experiments, providing a proof-of-concept for at-home health monitoring.