
🏥 Medical Imaging

📷 CVPR2025 · 78 paper notes

📌 Same area in other venues: 📷 CVPR2026 (172) · 🔬 ICLR2026 (88) · 🧪 ICML2026 (28) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (77) · 📹 ICCV2025 (31)

🔥 Top topics: Medical Imaging ×30 · Segmentation ×21 · Diffusion Models ×9 · Multimodal/VLM ×5 · Reasoning ×4

A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement: This paper proposes a semi-supervised breast ultrasound segmentation framework combining training-free pseudo-label generation from VLMs (Grounding DINO + SAM driven by appearance description prompts) and dual-teacher uncertainty fusion refinement, achieving performance close to fully supervised learning with only 2.5% of labeled data.
Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning: Drawing on the foundation model paradigm, a Diffusion Probabilistic Model (DPM) is pre-trained on large-scale public brain MRI data and then fine-tuned on data from only 20 stroke patients. This workflow enables accelerated MRI reconstruction in data-constrained scenarios. A clinical reader study confirms that the image quality with 2× acceleration is non-inferior to the standard-of-care.
Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions: Proposes the SFDA-DeP method. Inspired by machine unlearning, it identifies and corrects the prediction bias (over-predicting certain classes) of the source model in the target domain. This addresses the challenge of amplified prediction bias in weakly supervised localization models during cross-organ/cross-center domain adaptation in histopathology.
Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding: A two-stage label-efficient learning framework is proposed: first, a 3D U-Net encoder is pre-trained via self-supervised Masked Image Modeling on 1,206 unlabeled CT scans; then, combined with VDETR + Vertex RPE and Mean Teacher semi-supervised learning, it achieves a 3D abdominal trauma detection score of 45.30% (+115%) using only 144 labeled cases.
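The Mean Teacher component named in the entry above keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The following is a minimal sketch of that update only, assuming weights stored as flat lists of floats and a hypothetical `decay` value; the paper's actual decay and training schedule are not stated in this summary.

```python
def ema_update(teacher, student, decay=0.99):
    """Blend student weights into the teacher: decay * t + (1 - decay) * s."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy weights (hypothetical values, not taken from the paper).
teacher = [0.0, 1.0]
student = [1.0, 3.0]
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)  # each teacher weight moves 10% of the way toward the student
```

A decay close to 1 makes the teacher change slowly, which is what lets it provide stable targets for unlabeled scans.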
Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation?: By establishing unified training and evaluation protocols, this study compares 11 specialized and general-purpose vision models across three heterogeneous medical datasets. The findings reveal that General-Purpose Vision Models (GP-VMs) can systematically outperform most Specialized Medical Segmentation Architectures (SMAs) in both segmentation accuracy and interpretability, challenging the traditional assumption that "medical image segmentation necessitates domain-specific architectures."
Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts: This study validates that the progression of PPFE (pleuroparenchymal fibroelastosis) automatically quantified by deep learning is independently associated with all-cause mortality across two large-scale lung cancer screening cohorts (NLST: 7,980 cases; SUMMIT: 8,561 cases). It proposes that longitudinal changes in PPFE can serve as an imaging biomarker to identify individuals at high risk for respiratory morbidity in screening populations.
Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI: Detects ovarian cancer and its subtypes on histopathology images using 15 CNN variants (LeNet, ResNet, VGG, Inception), selects InceptionV3 (ReLU) as the optimal model (average 94.58% accuracy), and interprets model predictions using three XAI methods: LIME, SHAP, and Integrated Gradients.
BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation: BiCLIP proposes a bidirectional consistent vision-language segmentation framework. Through bidirectional multimodal fusion (BMF, in which visual features in turn refine the text embeddings) and image augmentation consistency (IAC, a regularizer across weak and strong perturbations), it maintains robust performance on COVID-19 CT segmentation with only 1% of labeled data and tolerates clinical image degradation (noise/blur).
Boltzmann Attention Sampling for Image Analysis with Small Objects: Proposes BoltzFormer, a novel transformer decoder architecture that dynamically samples sparse attention regions using a Boltzmann distribution to focus on small objects. Combining an annealing temperature schedule (exploration in early layers, exploitation in later layers) and the PiGMA multi-query aggregation module, it achieves a 3-12% improvement in Dice score compared to SOTA on small object segmentation (where objects occupy <0.1% of the image area), while reducing attention computation by an order of magnitude.
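The sampling idea in the entry above can be sketched as drawing patch indices from a Boltzmann (temperature-scaled softmax) distribution over relevance scores, with the temperature lowered layer by layer. This is an illustrative sketch, not BoltzFormer's implementation: the score values, the temperature schedule, and the names `boltzmann_probs` and `sample_patches` are all assumptions.

```python
import math
import random


def boltzmann_probs(scores, temperature):
    """Softmax of scores / temperature, i.e. a Boltzmann distribution."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_patches(scores, temperature, k, rng):
    """Draw k patch indices (with replacement) from the Boltzmann distribution."""
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=k)


# Hypothetical relevance scores for four patches; patch 2 holds the small object.
scores = [0.1, 0.2, 3.0, 0.1]
rng = random.Random(0)
for layer, temperature in enumerate([4.0, 1.0, 0.25]):  # assumed annealing schedule
    picks = sample_patches(scores, temperature, k=8, rng=rng)
    print(layer, temperature, picks)
```

High temperatures spread samples across patches (exploration), while low temperatures concentrate them on the highest-scoring patch (exploitation).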
Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD: Proposes CBCTRepD, the first bilingual report generation system for oral and maxillofacial CBCT. By constructing a dataset of 7,408 high-quality CBCT-report pairs and establishing a multi-level clinical evaluation framework, it consistently improves report quality across radiologists of different experience levels, especially in reducing missed lesions and standardizing report structures.
CARL: A Framework for Equivariant Image Registration: Proposes CARL (Coordinate Attention with Refinement Layers), a deep registration framework that achieves \([W,U]\)-equivariance to translations and rotations via a coordinate attention mechanism. By replacing only the first step in a multi-step registration pipeline with CARL, global \([W,U]\)-equivariance is obtained. It matches or exceeds SOTA performance on three medical registration benchmarks (abdomen, lung, and brain), significantly leading on abdominal registration tasks featuring varying fields of view.
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools: This paper presents the CholecTrack20 dataset, which is the first to introduce three perspective-based trajectory definitions (intraoperative, intra-abdominal, and visibility) for laparoscopic tool tracking. It comprises 20 full surgical videos, more than 35K frames, and more than 65K annotated tool instances. Benchmarking results indicate that current SOTA methods (<45% HOTA) fall far short of clinical demands.
CLoE: Expert Consistency Learning for Missing Modality Segmentation: This paper proposes the CLoE framework, which reformulates the robustness challenge of missing modality segmentation as a decision-level expert consistency control problem. It reduces expert drift through dual-branch constraints: Modality Expert Consistency (MEC) globally and Region Expert Consistency (REC) regionally. A lightweight gating network is employed to convert consistency scores into reliability weights to guide feature fusion, outperforming SOTA methods on BraTS 2020 and MSD Prostate.
CrossSDF: 3D Reconstruction of Thin Structures From Cross-Sections: CrossSDF is proposed to reconstruct a 3D SDF from 2D cross-sectional Signed Distance Fields. By combining hybrid encoding (hash grid + random Fourier features) and symmetric difference loss, it achieves accurate reconstruction of thin tubular structures (such as blood vessels) for the first time.
CycleULM: A Unified Label-Free Deep Learning Framework for Ultrasound Localisation Microscopy: CycleULM is proposed as the first unified, label-free deep learning framework for ultrasound localization microscopy (ULM). It bridges the simulation-to-real domain gap by employing CycleGAN to learn a physics-informed bidirectional translation between contrast-enhanced ultrasound (CEUS) frames and a simplified microbubble (MB) domain. This delivers improvements in MB localization accuracy by up to 40% recall and 46% precision, while enabling real-time processing at 18.3 fps.
Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation: Deco-Mamba, a decoder-centric hybrid Transformer-CNN-Mamba architecture, is proposed. It enhances decoder capabilities via a Co-Attention Gate, Visual State Space Mamba Block, and Deformable Residual Block. By introducing a distribution-aware deep supervision strategy based on windowed KL divergence, it achieves SOTA performance across 7 medical image segmentation benchmarks while maintaining moderate model complexity.
Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning: This paper compares the performance of three learning paradigms—Local Learning (LL), Federated Learning (FL), and Centralized Learning (CL)—in automatically classifying the overlapping relationship between the third molar and the mandibular canal on panoramic radiographs. Utilizing a pre-trained ResNet-34 as the backbone, the study finds that centralized training achieves the best performance (AUC 0.831), while FL significantly outperforms purely local training under privacy-preserving assumptions.
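The summary above does not say which aggregation rule the federated setup uses. As a point of reference, the sketch below shows federated averaging (FedAvg), a common baseline in which a server averages client parameters weighted by local sample counts. The function name `fedavg` and the toy numbers are assumptions for illustration only.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clinics with 1 and 3 local samples respectively.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Only parameters leave each site, never the radiographs themselves, which is the privacy-preserving assumption the comparison with local and centralized training rests on.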
Deep Learning Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging: Proposes ScleraGluNet, which utilizes five-direction scleral blood vessel images integrated with a multi-branch CNN, MRFO feature selection, and Transformer cross-view fusion, achieving a three-class metabolic state classification accuracy of 93.8% and continuous fasting plasma glucose estimation with an MAE of 6.42 mg/dL, offering a novel approach for non-invasive blood glucose monitoring.
Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography: Constructs the largest PET segmentation dataset, PETWB-Seg11K (11,041 whole-body PET cases + 59,831 segmentation masks), and proposes SegAnyPET, a foundation model for universal PET segmentation based on a 3D architecture + prompt engineering. It demonstrates strong zero-shot generalization capabilities across multi-center, multi-tracer, and multi-disease scenarios.
DFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data: DFLMoE is proposed to handle medical data heterogeneity in decentralized federated learning using a Mixture of Experts (MoE) mechanism, enabling collaborative training without a central server while preserving privacy.
Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification: Presents a brain tumor classification framework combining Non-Negative Matrix Factorization (NNMF) feature extraction, statistical feature selection, lightweight CNN classification, and diffusion-based feature space denoising. While maintaining ~85% clean accuracy, it improves robust accuracy under AutoAttack from 0.47% to 59.5%.
DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels: This paper proposes the DiN framework, applying diffusion models to the noisy-label medical VQA (NM-VQA) scenario for the first time. Through a diffusion-based answer classifier, it screens answers from coarse to fine from a generative perspective. Combined with a noisy label refinement module to dynamically correct labels, DiN achieves an accuracy of 74.24% on VQA-RAD under 10% semantic noise, outperforming SNLC's 69.65%.
Distilled Prompt Learning for Incomplete Multimodal Survival Prediction: This paper proposes DisPro (Distilled Prompt Learning), a two-stage prompt learning framework—UniPro to distill the knowledge distribution of each modality, and MultiPro to leverage LLMs to infer missing modalities from available ones. By simultaneously compensating for both modality-specific and modality-common information of the missing modalities, DisPro achieves SOTA performance on five TCGA survival prediction datasets.
Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data: This paper proposes GUES (Generative Unadversarial Examples) to improve the performance of frozen source models on target domain diabetic retinopathy (DR) grading under the extreme scenario of Online Model-aGnostic Domain Adaptation (OMG-DA), where target data arrives in a streaming fashion without accessing source model parameters and labels. Specifically, the method generates personalized unadversarial perturbations via a VAE and utilizes saliency maps as pseudo-supervision.
EchoONE: Segmenting Multiple Echocardiography Planes in One Model: This paper proposes EchoONE, the first unified model to address the Multi-Plane Segmentation (MPS) problem in echocardiography. By employing a Prior Composable Mask learning (PC-Mask) module to generate semantically-aware dense prompts, and designing a Local Feature Fusion and Adaptation (LFFA) module to inject local CNN features into the SAM decoder, EchoONE consistently achieves SOTA performance across 6 planes.
EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance: This paper proposes EchoWorld, a motion-aware world modeling framework for echocardiography probe guidance. It first undergoes pre-training via spatial world modeling (masked reconstruction) and motion world modeling (predicting visual changes based on probe motion) to encode cardiac anatomical knowledge. In the fine-tuning stage, a motion-aware attention mechanism is introduced to fuse historical visual-motion sequences, significantly reducing guidance errors across 10 standard views.
Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation: This paper proposes MLRG, a two-stage framework that integrates the spatial information of current multi-view images and the temporal information of historical longitudinal data during vision-language pre-training via multi-view longitudinal contrastive learning. It flexibly handles missing patient prior knowledge using tokenized absence encoding, achieving a 2.3% improvement in BLEU-4 on MIMIC-CXR and a 5.5% improvement in F1 on MIMIC-ABN.
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-Supervised Medical Image Segmentation: Proposes a semi-supervised medical image segmentation framework that enhances SAM. Using CLIP and VQA, it generates efficient prompts carrying semantic, location, and shape information without supervision or expert annotations. It then employs Direct Preference Optimization (DPO) combined with a virtual annotator (replacing human annotators to provide rankings/scores) to train the optimal segmentation policy, achieving SOTA performance on multi-modal tasks including lung, breast tumor, and organ segmentation.
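For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for one preference pair: the negative log-sigmoid of a scaled margin between how much the policy and a frozen reference model favor the preferred candidate over the rejected one. How the paper scores segmentation candidates is not described in this summary, so the scalar log-probability inputs, the `beta` value, and the function name are assumptions.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) equals softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities for a preferred and a rejected mask candidate.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))  # policy already prefers the chosen mask
print(dpo_loss(-2.0, -0.5, -1.0, -1.0))  # policy prefers the rejected mask: higher loss
```

The loss falls as the policy raises the preferred candidate's probability relative to the reference model, so rankings from a virtual annotator can stand in for human preference labels.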
Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling: This paper proposes enhancing training data for virtual try-on by extracting synthetic garment-person pairs backward from human images. It designs an Error-Aware Refinement Schrödinger Bridge (EARSB) model to perform local error correction on the generation results of existing try-on models, achieving SOTA performance on VITON-HD and DressCode with a high user preference (59%).
EquivAnIA: A Spectral Method for Rotation-Equivariant Anisotropic Image Analysis: EquivAnIA, a spectral method based on cake wavelets and ridge filters, is proposed for rotation-equivariant anisotropic image analysis, demonstrating superior rotation robustness to traditional angular binning on synthetic and real images (including CT).
Evidential learning driven Breast Tumor Segmentation with Stage-divided Vision-Language Interaction: This paper proposes the TextBCS model, which utilizes text prompts to assist breast tumor segmentation through a Stage-divided Vision-Language Interaction (SVLI) module and an Evidential Learning (EL) strategy. It achieves a Dice score of 85.33% on the Duke-Breast-Cancer-MRI dataset, outperforming all baseline methods.
Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation: This work proposes FedMEPD, a federated learning framework that simultaneously addresses inter-modality heterogeneity and client personalization in multimodal MRI brain tumor segmentation through modality-specific encoders (globally federated) and a partially personalized fusion decoder. It achieves an average client mDSC of 75.70%/75.90% on BraTS 2018/2020.
GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis: This paper proposes GIIM, a multi-view medical image classification framework based on Multi-Heterogeneous Graphs (MHG). It simultaneously models intra-view and inter-view lesion dependencies, significantly outperforming existing multi-view methods across three modalities—liver CT, breast mammography, and breast MRI—while maintaining robustness to missing views.
Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization: Proposes GenEval, which quantifies the causal coverage gap through the Domain Conformal Boundary (DCB) theory and integrates human expert knowledge with the MedGemma-4B vision-language model to achieve single-source domain generalization (SDG). It significantly outperforms existing methods on diabetic retinopathy grading (8 datasets) and seizure onset zone detection (2 datasets).
Interactive Medical Image Analysis with Concept-based Similarity Reasoning: This paper proposes the CSR (Concept-based Similarity Reasoning) network, which performs classification reasoning by learning the similarity of concept prototypes in local image regions. It simultaneously supports interactive intervention by clinicians across spatial and conceptual levels during both training and testing, outperforming existing explainable methods by up to a 4.5% F1 gain across three medical datasets.
Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline: This paper proposes IMed-361M, a large-scale interactive medical image segmentation benchmark dataset containing 6.4 million images and 361 million masks (averaging 56 masks per image) covering 14 imaging modalities and 204 segmentation targets. Based on this, the authors develop an IMIS baseline network that supports click, bounding box, text, and combined interactions, which outperforms existing vision foundation models across multiple scenarios.
Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis: This paper proposes Latent Drifting (LD), which introduces a scalar drift parameter \(\delta\) into both the forward and reverse processes of diffusion models to bridge the gap between pre-trained natural image models and target medical image distributions, significantly improving medical image generation and counterfactual image synthesis across various fine-tuning strategies.
MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification: This paper proposes the MIL-PF framework, which leverages precomputed features from frozen foundation vision models and employs an ultra-lightweight MIL aggregation head with only ~40k parameters to achieve SOTA performance on mammography classification tasks, significantly reducing training costs.
MoEdit: On Learning Quantity Perception for Multi-Object Image Editing: Proposes MoEdit, an auxiliary-tool-free multi-object image editing framework. It compensates for the cross-attribute confusion in CLIP embeddings via the FeCom module, and injects quantity perception into the U-Net through the QTTN module, ensuring quantity consistency and attribute independence during editing.
Multi-modal Vision Pre-training for Medical Image Analysis (BrainMVP): BrainMVP proposes the first multi-modal vision pre-training paradigm. By using three pretext tasks—cross-modal masked reconstruction, modal template distillation, and modality-aware contrastive learning—it pre-trains a ViT on 16,022 multi-parametric brain MRI scans (over 2.4 million images). It outperforms SOTA methods on six segmentation and four classification downstream tasks, with an improvement in Dice Score of up to 14.47%.
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation: This work proposes MR-PLIP, the first vision-language model for pathology-language pre-training across multiple resolutions (5×/10×/20×/40×). By leveraging Cross-Resolution Visual-Textual Alignment (CVTA) and Multi-Resolution Text-guided Visual representation Alignment (MRTVA), and being trained on 34M image-text pairs, it comprehensively outperforms existing state-of-the-art (SOTA) foundation models across 26 benchmark datasets.
Multimodal Classification of Radiation-Induced Contrast Enhancements and Tumor Recurrence Using Deep Learning: This work proposes RICE-NET, a multimodal 3D deep learning model that integrates longitudinal MRI data with radiotherapy radiation dose (RD) maps to distinguish between post-operative radiation-induced contrast enhancement (RICE) and tumor recurrence in glioblastoma, achieving an F1-score of 0.92 on an independent test set.
MultiMorph: On-demand Atlas Construction: This paper proposes MultiMorph, a feed-forward brain atlas construction model. By leveraging a linear-complexity GroupBlock feature-sharing layer and a Centrality Layer, it generates an unbiased group atlas in a single forward pass given an arbitrary number of 3D brain images. It operates over 100 times faster than traditional optimization methods and generalizes to unseen modalities and populations without any fine-tuning.
Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation: This paper proposes the MSG-LDM framework, which explicitly decouples style and structure information in the latent space. It extracts modality-invariant multiscale structural priors to guide the diffusion process through the High-Frequency Injection Block (HFIB), Multimodal Structural Feature Fusion (MMSF), and Multiscale Structural Feature Enhancement (MSSE), thereby addressing anatomical inconsistency and texture degradation in MRI translation under arbitrary missing modalities.
NOIR: Neural Operator Mapping for Implicit Representations: NOIR reformulates medical image computation tasks as operator learning problems between continuous function spaces. It embeds discrete medical signals into a continuous function space via Implicit Neural Representations (INR), and then learns mappings between functions using Neural Operators (NO), achieving resolution-agnostic segmentation, shape completion, image translation, and synthesis.
Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation: A Siamese-Diffusion dual-component model (Mask-Diffusion + Image-Diffusion) is proposed, wherein the noise consistency loss allows the predicted noise from the Image-Diffusion to guide the Mask-Diffusion toward high morphological fidelity. During inference, only the Mask-Diffusion is used to maintain diversity, improving SANet's mDice by 3.6 and mIoU by 4.4 on Polyps.
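The mDice and mIoU figures quoted above are means of the per-case Dice coefficient and intersection-over-union. A minimal sketch of both metrics on binary masks follows, assuming masks flattened to 0/1 sequences; the toy masks and the both-empty convention are illustrative choices, not the paper's evaluation code.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0, 1.0  # both masks empty: treated here as a perfect match
    dice = 2.0 * inter / total
    iou = inter / (total - inter)  # total - inter is the size of the union
    return dice, iou


# Two hypothetical cases: a partial overlap and an exact match.
cases = [([1, 1, 0, 0], [1, 0, 1, 0]), ([1, 1, 1, 0], [1, 1, 1, 0])]
scores = [dice_and_iou(pred, target) for pred, target in cases]
m_dice = sum(d for d, _ in scores) / len(scores)
m_iou = sum(i for _, i in scores) / len(scores)
print(m_dice, m_iou)
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.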
Novel Architecture of RPA In Oral Cancer Lesion Detection: This paper integrates Singleton and Batch Processing design patterns into a Python-based RPA automation pipeline, combining them with the EfficientNetV2B1 model for oral cancer lesion detection, achieving a 60-100× inference speedup compared to traditional RPA platforms such as UiPath and Automation Anywhere.
Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era: Nyxus is a next-generation image feature extraction library designed for the big data and AI era. It supports out-of-core scalable extraction of 2D/3D data, covering 261+ features across both radiomics and cell profiling domains, enabling speedups of 3–131× over CellProfiler and several to hundreds of times over PyRadiomics/MITK.
OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection: This paper proposes OpenMIBOOD, a comprehensive benchmark framework for OOD detection specifically designed for medical imaging. It contains 14 datasets from three medical domains (histopathology, endoscopy, and brain MRI), evaluates 24 post-hoc methods, and reveals that findings from natural image OOD benchmarks cannot be directly transferred to medical scenarios.
LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments: LoV3D proposes an end-to-end longitudinal 3D brain MRI vision-language model pipeline. Through a structured verifiable output design, the framework achieves concurrent anatomical region assessment, longitudinal comparison, and three-class diagnostic reasoning. Fueled by a clinically-weighted verifier to drive Direct Preference Optimization (DPO) training without human annotation, it achieves a 93.7% three-class classification accuracy on ADNI and zero non-adjacent diagnostic errors.
Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting: ProtoSR proposes to mine template-aligned prototype knowledge bases from large-scale free-text radiology reports and inject them into structured report prediction through a prototype-conditioned late-fusion residual module, achieving SOTA on the Rad-ReStruct benchmark, particularly gaining a 72.1% relative improvement on fine-grained attribute questions (L3).
Reanimating Images using Neural Representations of Dynamic Stimuli: The BrainNRDS framework is proposed to decouple static image representations from motion generation. By leveraging fMRI brain activity to decode optical flow information and combining it with motion-conditioned diffusion models, the model generates videos from an initial frame. Additionally, video encoders (VideoMAE) are found to outperform image encoders in predicting brain activity.
Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration: Modular replacement of classic skull stripping and tissue segmentation steps in the SIENA pipeline with deep learning modules (SynthStrip/SynthSeg) significantly improves the clinical sensitivity and robustness of longitudinal brain volume change (PBVC) estimation while preserving pipeline interpretability. Validated on two longitudinal cohorts, ADNI and PPMI.
Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning: To address domain-incremental learning (DIL) without task IDs and data replay, this paper proposes the Residual SODAP framework. It concurrently solves representation adaptation and classifier forgetting through \(\alpha\)-entmax sparse prompt selection with residual aggregation, pseudo-replay distillation based on feature statistics, prompt usage pattern drift detection, and uncertainty weighting. It achieves state-of-the-art (SOTA) performance on diabetic retinopathy (DR), skin cancer, and CORe50 datasets.
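The \(\alpha\)-entmax family interpolates between softmax (\(\alpha = 1\)) and sparsemax (\(\alpha = 2\)), and for \(\alpha > 1\) it can assign exactly zero weight to low-scoring items. The sketch below implements only the \(\alpha = 2\) special case, sparsemax, to show how sparse selection arises; the \(\alpha\) used by Residual SODAP is not given in this summary, and the prompt scores are hypothetical.

```python
def sparsemax(scores):
    """Sparsemax: Euclidean projection of the scores onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, value in enumerate(ordered, start=1):
        cumsum += value
        if 1.0 + j * value > cumsum:  # item j is still inside the support
            tau = (cumsum - 1.0) / j  # threshold for a support of size j
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical similarity scores between an input and three prompts.
weights = sparsemax([1.0, 0.8, 0.1])
print(weights)  # the lowest-scoring prompt receives exactly zero weight
```

Unlike softmax, which gives every prompt some weight, the zeros here mean only a subset of prompts contributes to the aggregated result.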
Revisiting MAE Pre-Training for 3D Medical Image Segmentation: This work systematically addresses three major pitfalls in 3D medical imaging SSL research (small datasets, non-SOTA architectures, and insufficient evaluation). By leveraging an optimized MAE to pre-train a ResEnc U-Net CNN on 39K brain MRI scans, it outperforms the nnU-Net baseline by an average of approximately 3 Dice points across 11 downstream segmentation datasets.
SACB-Net: Spatial-Awareness Convolutions for Medical Image Registration: This paper proposes a 3D Spatial-Awareness Convolutional Block (SACB) that performs unsupervised clustering on feature maps and generates adaptive convolution kernels for different spatial regions. Combined with a pyramid flow estimator to achieve multi-scale deformation field composition, this method outperforms existing state-of-the-art (SOTA) methods on brain and abdomen CT registration tasks.
SALIENT: Frequency-Aware Paired Diffusion for Controllable Long-Tail CT Detection: Proposes SALIENT, a mask-conditioned generative framework based on wavelet-domain diffusion. Through frequency-aware, interpretable optimization objectives and paired lesion-mask volume generation, it achieves controllable and efficient synthetic data augmentation and precision recovery in long-tail CT detection. It systematically characterizes the augmentation dose-response curve for the first time.
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation: This paper proposes an enhanced SAM framework that generates unsupervised semantic/locational/shape prompts using BiomedCLIP, VQA, and GPT-4, and introduces a DPO-inspired preference alignment loss to simulate human feedback, achieving superior performance in lung, breast tumor, and abdominal organ segmentation under a semi-supervised setup with only 10% labeled data.
SapiensID: Foundation for Human Recognition: This paper proposes SapiensID, a unified human recognition model. Through three key designs—Retina Patch (dynamic patch allocation), Masked Recognition Model (variable token length training), and Semantic Attention Head (keypoint-based pose-invariant feature pooling)—it addresses both face and full-body recognition within a single model for the first time, achieving SOTA performance on multiple ReID benchmarks.
SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation: This paper proposes SeaLion, a semantic part-aware latent point diffusion technique that jointly predicts noise and point-wise segmentation labels during the denoising process, and decodes point clouds conditioned on these segmentation labels. It generates 3D point clouds with high-quality inter-part coherence and precise segmentation labels. Additionally, a part-aware Chamfer distance (p-CD) evaluation metric is proposed, achieving substantial improvements over DiffFacto on ShapeNet and IntrA datasets.
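The exact definition of the part-aware Chamfer distance (p-CD) is not given in this summary. The sketch below shows the ordinary symmetric Chamfer distance (squared-distance variant) and one plausible part-aware reading: computing it separately for points sharing a part label and averaging over parts. The function `per_part_chamfer` is a hypothetical illustration and should not be read as the paper's metric.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor squared distance, both ways."""
    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def per_part_chamfer(points_a, labels_a, points_b, labels_b):
    """Assumed variant: average the Chamfer distance over part labels shared by both clouds."""
    shared = sorted(set(labels_a) & set(labels_b))
    if not shared:
        raise ValueError("the two clouds share no part labels")
    total = 0.0
    for part in shared:
        part_a = [p for p, lab in zip(points_a, labels_a) if lab == part]
        part_b = [p for p, lab in zip(points_b, labels_b) if lab == part]
        total += chamfer(part_a, part_b)
    return total / len(shared)


# Two tiny hypothetical clouds with two labeled parts each.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer(cloud_a, cloud_b))
print(per_part_chamfer(cloud_a, [0, 1], cloud_b, [0, 1]))
```

Restricting matches to points with the same part label penalizes shapes whose overall outline is right but whose parts are misplaced, which a global Chamfer distance can miss.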
Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation: Proposes the plug-and-play Semantic Class Distribution Learning (SCDL) module, which learns class-conditional proxy distributions and performs Class-conditional Distribution Bi-directional Alignment (CDBA) along with Semantic Anchor Constraint (SAC). This explicitly reshapes the class-conditional feature structure in the embedding space to alleviate supervision bias and representation imbalance in semi-supervised medical image segmentation.
SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation: This paper proposes SemiTooth, a multi-teacher-student semi-supervised framework, which achieves cross-domain generalization for multi-source CBCT tooth segmentation via a Stricter Weighted-Confidence Constraint.
Show and Segment: Universal Medical Image Segmentation via In-Context Learning: Iris is proposed as a framework that extracts task embeddings from reference image-label pairs using a lightweight task encoding module to guide target image segmentation. It adapts to new tasks without fine-tuning, achieving or exceeding the performance of task-specific models across 12 datasets, while demonstrating excellent generalization capabilities on 7 unseen datasets.
Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support: Surg-R1 proposes a hierarchical reasoning visual-language model (VLM) for surgical scenes. Through a three-level reasoning hierarchy (Perception-Relationship-Context) and a four-stage training pipeline (SFT \(\rightarrow\) GRPO \(\rightarrow\) self-iteration), trained on the largest surgical CoT dataset containing 320K reasoning pairs, it achieves a 64.9% Arena Score on SurgBench, significantly outperforming Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%).
T-FAKE: Synthesizing Thermal Images for Facial Landmarking: Proposes the T-FAKE dataset and the RGB2Thermal loss function, which utilize semi-supervised thermal image synthesis to generate the first large-scale synthetic thermal facial landmark dataset (200k images), achieving SOTA sparse/dense facial landmark detection in the thermal domain.
Thin-Shell-SfT: Fine-Grained Monocular Non-Rigid 3D Surface Tracking with Neural Deformation Fields: Thin-Shell-SfT proposes a monocular non-rigid 3D surface tracking method based on continuous neural deformation fields and Kirchhoff-Love thin-shell physical priors, combined with surface-induced 3D Gaussian splatting for differentiable rendering, achieving unprecedented accuracy in fine-grained wrinkle reconstruction.
TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model: This paper proposes TopoCellGen, the first diffusion model for generating multi-class cell topological layouts in digital pathology. It introduces intra-class spatial consistency and inter-class structural regularization constraints via persistent homology, and proposes the Topological Fréchet Distance (TopoFD) evaluation metric.
Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging for Osteoporosis Classification: This paper is the first to apply SegFormer to automatic multi-region (bone + soft tissue) segmentation and radiomic analysis of HR-pQCT imaging, finding that tendon tissue characteristics outperform traditional bone metrics in osteoporosis classification.
UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis: This paper proposes UltrasoundAgents, a hierarchical multi-agent framework that mirrors the clinical breast ultrasound diagnostic workflow: a main agent locates lesions and sub-agents identify attributes, followed by evidence-chain reasoning. This yields traceable BI-RADS assessment and benign/malignant classification.
Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos: This paper proposes the SMART framework, which utilizes a SAM3-based teacher-student architecture combined with text concept prompts, confidence-aware consistency regularization, and dual-stream temporal consistency to achieve semi-supervised vessel segmentation in X-ray coronary angiography videos.
UNIStainNet: Foundation-Model-Guided Virtual Staining of H&E to IHC: This paper proposes UNIStainNet, which is the first to utilize frozen dense spatial tokens from the pathology foundation model UNI as direct conditioning signals for a generator. This achieves virtual H&E-to-IHC staining, where a single unified model simultaneously supports four IHC markers and achieves state-of-the-art performance.
Unleashing Video Language Models for Fine-grained HRCT Report Generation: This work proposes the AbSteering framework, which efficiently transfers general Video Language Models (VideoLMs) to the HRCT report generation task via abnormality-centered CoT training and DPO optimization based on clinically confusing hard negatives, outperforming specialized CT foundation models.
Unmasking Biases and Reliability Concerns in Convolutional Neural Networks Analysis of Cancer Pathology Images: By training four CNNs (ResNet50, DenseNet121, InceptionV3, VGG16) on \(20 \times 20\) pixel background patches cropped from 13 cancer pathology benchmark datasets (containing no clinical diagnostic information), this work discovers that classification accuracy far exceeds random guessing (up to 93%). This systematically reveals that CNNs in cancer pathology analysis may rely on dataset collection biases (such as staining protocols and scanner differences) rather than genuine pathological features for decision-making.
Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization: UNA proposes a fluid-driven anomaly randomization method that generates unlimited, diverse pathology patterns on the fly via advection-diffusion PDEs, yielding the first contrast-agnostic model for reconstructing normal brain anatomy that can process both healthy and diseased CT/MRI scans.
vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation: vesselFM is the first foundation model dedicated to 3D blood vessel segmentation. By integrating three heterogeneous data sources—a curated large-scale real annotated dataset, domain-randomized synthetic data, and flow matching-based generative data—it achieves state-of-the-art (SOTA) results in zero-shot, one-shot, and few-shot segmentation across four clinical imaging modalities.
VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging: This paper proposes VISTA3D, the first unified 3D medical image segmentation foundation model. It simultaneously supports automatic segmentation of 127 classes, 3D interactive editing, and zero-shot segmentation. Using a 3D supervoxel method distilled from SAM, VISTA3D achieves state-of-the-art (SOTA) zero-shot performance and matches or exceeds specialized expert models on 14 datasets.
Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation: This paper proposes a weakly supervised teacher-student framework that leverages sparse pathologist annotations and an EMA-stabilized teacher network to generate progressively refined pseudo-masks, achieving an mIoU of 80.10 and an mDice of 89.10 on the gland segmentation task with far fewer annotations than full supervision.
WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression: To address the failure of existing lossless compression methods caused by the "information irregularity" (widespread high-frequency signals and high volatility) of WSIs, this paper proposes the WISE three-step compression framework (hierarchical projection coding \(\rightarrow\) bitmap encoding \(\rightarrow\) dictionary coding), achieving an average of 36× and up to 136× lossless compression.
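
Several entries above rely on teacher-student training, and the gland segmentation entry specifically mentions an EMA-stabilized teacher. As a rough illustration of that general idea (not the implementation from any of the listed papers), the sketch below shows the standard exponential-moving-average update in which the teacher's weights slowly track the student's. The function name `ema_update`, the dictionary-of-lists weight layout, and the default `decay` value are all assumptions made for this example.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # hypothetical layout: parameter name -> flat values


def ema_update(teacher: Weights, student: Weights, decay: float = 0.99) -> Weights:
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")

    updated: Weights = {}
    for name, t_values in teacher.items():
        s_values = student[name]
        if len(t_values) != len(s_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise moving average; a high decay keeps the teacher stable.
        updated[name] = [
            decay * t + (1.0 - decay) * s for t, s in zip(t_values, s_values)
        ]
    return updated


if __name__ == "__main__":
    teacher = {"w": [1.0, 0.0]}
    student = {"w": [0.0, 1.0]}
    # After each step the teacher moves a small fraction toward the student.
    for _ in range(3):
        teacher = ema_update(teacher, student, decay=0.9)
    print(teacher)
```

With a decay close to 1 the teacher changes slowly, which is why its predictions are often treated as a more stable source of pseudo-labels than the student's own outputs; the right decay value is task-dependent and not taken from the summaries here.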