
🏥 Medical Imaging

📷 CVPR2026 · 154 paper notes

A protocol for evaluating robustness to H&E staining variation in computational pathology models

A three-step evaluation protocol (select reference staining conditions → characterize test-set staining properties → simulate staining conditions for inference) is proposed to systematically quantify the robustness of 306 microsatellite instability (MSI) classification models to H&E staining variation. The study finds a weak negative correlation between robustness and classification performance (\(r = -0.28\)), indicating that high performance does not imply high robustness.

A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement

This paper proposes a semi-supervised framework for breast ultrasound (BUS) image segmentation. It employs GPT-5-generated appearance descriptions combined with Grounding DINO and SAM for training-free pseudo-label generation (APPG), and refines labels via a dual-teacher framework (static + dynamic) using Uncertainty-Entropy Weighted Fusion (UEWF) and Adaptive Uncertainty-guided Reverse Contrastive Learning (AURCL). The method approaches fully supervised performance using only 2.5% labeled data.

A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement

Simple appearance descriptions (e.g., "dark oval") are used to drive Grounding DINO + SAM for training-free pseudo-label generation in breast ultrasound segmentation. A dual-teacher uncertainty-entropy weighted fusion mechanism and adaptive reverse contrastive learning further refine pseudo-label quality. With only 2.5% labeled data, the proposed method matches or surpasses the fully supervised upper bound.
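
The dual-teacher uncertainty-entropy weighted fusion can be pictured as weighting each teacher's per-pixel probability map inversely to its predictive entropy, so the more certain teacher dominates. A minimal sketch — the exact weighting form in the paper may differ:

```python
import numpy as np

def entropy(p, eps=1e-8):
    """Pixel-wise Shannon entropy of a class-probability map (C, H, W)."""
    return -np.sum(p * np.log(p + eps), axis=0)

def uewf_fuse(p_static, p_dynamic, eps=1e-8):
    """Hypothetical uncertainty-entropy weighted fusion of two teachers:
    the lower-entropy (more certain) teacher receives the larger weight."""
    w_s = 1.0 / (entropy(p_static) + eps)
    w_d = 1.0 / (entropy(p_dynamic) + eps)
    z = w_s + w_d
    return (w_s / z) * p_static + (w_d / z) * p_dynamic
```

Because the result is a convex combination of two probability maps, it remains a valid probability map at every pixel.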

Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning

Inspired by the foundation model paradigm, this work proposes a data-efficient training strategy for diffusion probabilistic models (DPMs) in accelerated MRI reconstruction. A DPM is first pre-trained on large-scale multi-contrast brain MRI data (~4,000 subjects), then fine-tuned with as few as 20 target-domain subjects. The resulting model achieves reconstruction quality comparable to large-dataset training in clinical stroke MRI, with a clinical blind reader study confirming non-inferiority to standard-of-care at 2× acceleration.

Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning

Drawing inspiration from the "pre-train then fine-tune" paradigm of foundation models, this work pre-trains a diffusion probabilistic model (DPM) at scale on ~4,000 fastMRI subjects spanning multiple contrasts, then fine-tunes on as few as 20 target-domain subjects using a low learning rate. The resulting model generalizes across contrasts and acquisition protocols for accelerated MRI reconstruction. In a clinical stroke validation, 2× accelerated images are rated non-inferior to fully-sampled images by blinded neuroradiologists.

Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning

This paper proposes HistoSelect, a framework that emulates the coarse-to-fine reasoning process of pathologists through a three-stage filtering mechanism — tissue segmentation → Group Sampler → Patch Selector — grounded in Information Bottleneck (IB) theory. By compressing task-irrelevant visual tokens, the method achieves state-of-the-art performance across three datasets while reducing computational cost by approximately 70%.
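
The Information Bottleneck view — keep only the tokens that matter for the task — reduces in its simplest form to top-k selection of patch tokens by a relevance score. A sketch, with the score source and keep ratio as assumptions (HistoSelect's actual Group Sampler and Patch Selector are learned modules):

```python
import numpy as np

def select_patches(tokens, relevance, keep_ratio=0.3):
    """Keep the top fraction of patch tokens by a task-relevance score.

    tokens:    (N, D) patch embeddings
    relevance: (N,) higher = more task-relevant (e.g. attention to [CLS])
    Returns the retained tokens and their original indices, in order.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.sort(np.argsort(relevance)[::-1][:k])  # top-k, original order
    return tokens[idx], idx
```

Dropping ~70% of tokens this way is where the reported compute reduction comes from.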

Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning

This paper proposes the UAAI framework, which for the first time introduces Active Inference into micro-gesture recognition. By combining EFE-guided temporal frame selection, spatial attention, and UMIX uncertainty-aware augmentation, UAAI achieves 63.47% on the SMG dataset (RGB modality), substantially outperforming conventional RGB-based methods.

Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions

This paper proposes SFDA-DeP, a source-free domain adaptation (SFDA) framework inspired by machine unlearning that models adaptation as an iterative process of identifying and correcting prediction bias. The method selectively reduces confidence on uncertain samples from the dominant class, retains reliable predictions, and jointly trains a pixel-level classifier to recover localization discriminability. It consistently outperforms SFDA baselines in both classification and localization across cross-organ and cross-center histopathology benchmarks.

Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions

This paper proposes SFDA-DeP, a method inspired by machine unlearning that reformulates SFDA as an iterative process of "identifying and correcting prediction bias." It applies a forgetting operation to high-entropy uncertain samples from the dominant class to force the model to abandon biased predictions, maintains self-training on reliable samples, and anchors localization capacity via a pixel-level classifier. The method consistently outperforms existing SFDA approaches on cross-organ and cross-center histopathology benchmarks.

Adaptive Confidence Regularization for Multimodal Failure Detection

This paper proposes the ACR framework, the first systematic treatment of multimodal misclassification detection, built on two complementary modules: an Adaptive Confidence Loss (ACL) that penalizes "confidence degradation," where multimodal fusion confidence falls below that of individual unimodal branches, and Multimodal Feature Swapping (MFS), which synthesizes failure-aware outlier samples in feature space. ACR consistently outperforms existing methods across four datasets.

Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding

Under extreme annotation scarcity—only 206 labeled cases (144 for training)—this work combines patch-based masked image modeling (MIM) pretraining of a 3D U-Net, a VDETR detector with 3D vertex relative position encoding (RPE), and Mean Teacher semi-supervised consistency regularization over 2,000 unlabeled volumes. The approach improves 3D abdominal trauma detection mAP@0.50 from 26.36% to 56.57% on the validation set (+115%), while a frozen encoder with a lightweight classification head achieves 94.07% accuracy on 7-class injury classification.

Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding

This paper proposes a two-stage label-efficient framework: a patch-based MIM self-supervised pretraining of a 3D U-Net encoder on 1,206 unlabeled CT volumes, followed by VDETR with 3D vertex relative position encoding for 3D lesion detection, augmented by Mean Teacher semi-supervised consistency regularization over 2,000 additional unlabeled volumes. Using only 144 annotated samples, the framework achieves 56.57% val mAP@0.50, a 115% improvement over fully supervised training.
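
The Mean Teacher scheme used here maintains the teacher as an exponential moving average (EMA) of the student and penalizes their disagreement on unlabeled volumes. A minimal numpy sketch with an assumed decay value:

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Mean Teacher: teacher weights track an exponential moving average
    of the student's weights (alpha is the EMA decay, assumed here)."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}

def consistency_loss(p_student, p_teacher):
    """Unlabeled consistency term: the student is pushed to match the
    teacher's predictions under different input perturbations."""
    return float(np.mean((p_student - p_teacher) ** 2))
```

In training, each step runs the supervised detection loss on the 144 labeled cases plus `consistency_loss` on unlabeled volumes, then calls `ema_update`.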

From Adaptation to Generalization: Adaptive Visual Prompting for Medical Image Segmentation

This paper proposes APEX (Adaptive Prompt EXtraction), which adaptively retrieves input-specific visual prompts from a learnable prompt memory bank—rather than assigning a fixed prompt per domain—and incorporates low-frequency contrastive learning (LFC) to enhance inter-domain discriminability, achieving significant improvements in medical image segmentation on both seen and unseen domains.

Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study

By evaluating 11 models on three heterogeneous medical datasets under a unified training protocol, this study demonstrates that general-purpose vision models (GP-VMs) systematically outperform most specialized medical segmentation architectures (SMAs) under standardized conditions, challenging the prevailing assumption that medical image segmentation inherently requires domain-specific architectures.

Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study

Under a unified training and evaluation protocol, this study compares 11 models — 5 specialized medical segmentation architectures (SMAs) and 6 general-purpose vision models (GP-VMs) — across 3 heterogeneous medical datasets. GP-VMs systematically outperform most SMAs on all datasets (average mDSC: VW-MiT 91.0% vs. best SMA SU-Mamba 90.5%), and Grad-CAM analysis demonstrates that GP-VMs capture clinically relevant structures.

Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts

Across two independent large-scale lung cancer screening cohorts, deep learning-based automatic segmentation is employed to quantify longitudinal changes in pleuroparenchymal fibroelastosis (PPFE), providing the first validation of the independent prognostic value of PPFE progression in a screening population.

Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts

Across two large-scale lung cancer screening cohorts (NLST n=7,980; SUMMIT n=8,561), this study employs deep learning to automatically segment PPFE volumes and defines "progressive PPFE" based on annualized volume change. Cox proportional hazards models demonstrate that PPFE progression is an independent predictor of all-cause mortality (NLST HR=1.25; SUMMIT HR=3.14), and is significantly associated with respiratory hospitalization rates, antibiotic/corticosteroid usage, and other clinical endpoints.
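
"Progressive PPFE" is defined from annualized volume change between scans; a sketch of such a flag, with a hypothetical 10%/year cut-off (the study's actual threshold is not stated in this note):

```python
def annualized_change(v_baseline, v_followup, years):
    """Relative PPFE volume change per year between two CT scans."""
    return (v_followup - v_baseline) / v_baseline / years

def is_progressive(v_baseline, v_followup, years, threshold=0.10):
    """Flag 'progressive PPFE' when annualized growth meets a
    (hypothetical) 10%/year threshold; the paper's cut-off may differ."""
    return annualized_change(v_baseline, v_followup, years) >= threshold
```

The resulting binary flag is what enters the Cox models as a covariate alongside the usual clinical adjustments.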

Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI

This work systematically compares 15 CNN variants (LeNet/ResNet/VGG/Inception) on five-class classification of ovarian cancer histopathology images. InceptionV3-A (ReLU) is selected as the final model, achieving approximately 94% averaged across evaluation metrics, with comparative explainability analysis conducted using three XAI methods: LIME, SHAP, and Integrated Gradients.

Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI

This paper systematically compares 15 variants across four major CNN families — LeNet, ResNet, VGG, and Inception — for ovarian cancer histopathology image classification. InceptionV3-ReLU is selected as the final model (average metrics ~94%), and three XAI methods — LIME, SHAP, and Integrated Gradients — are applied to provide interpretability for the classification results.

Benchmarking Endoscopic Surgical Image Restoration and Beyond

This work constructs SurgClean, the first multi-source real-world endoscopic surgical image restoration dataset (3,113 images covering three degradation types: smoke, fog, and liquid splash), and systematically benchmarks 22 representative image restoration methods (12 general-purpose + 10 task-specific) on it. The results reveal a significant gap between existing methods and clinical requirements, and further analyze the fundamental differences between surgical-scene and natural-scene degradations.

Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

This paper presents the first systematic study of aggregation strategies for converting pixel-level uncertainty maps to image-level scores in segmentation tasks. It proposes the Spatial Mass Ratio (SMR)—incorporating spatial structural information via Moran's I, Edge Density, and Shannon Entropy—alongside a GMM meta-aggregator. Experiments across 10 datasets on OoD and failure detection tasks demonstrate that spatially-aware aggregation significantly outperforms global averaging.
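
Moran's I, one of the spatial statistics behind SMR, measures how spatially clustered an uncertainty map is — clustered high-uncertainty regions are more alarming than salt-and-pepper noise that a global mean would treat identically. A minimal 4-neighbour version on a 2D map:

```python
import numpy as np

def morans_i(u):
    """Global Moran's I of a 2D map with binary 4-neighbour weights.
    Values near +1: spatially clustered; near -1: checkerboard-like."""
    x = u - u.mean()
    # sum of products over horizontally and vertically adjacent pairs
    num = (x[:, :-1] * x[:, 1:]).sum() + (x[:-1, :] * x[1:, :]).sum()
    num *= 2.0                              # weights are symmetric (i->j, j->i)
    w_sum = 2.0 * (x[:, :-1].size + x[:-1, :].size)
    return (x.size / w_sum) * (num / (x ** 2).sum())
```

An image-level score in the spirit of the paper would then combine the map's mean uncertainty with such spatial statistics rather than use the mean alone.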

Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

This paper proposes UniPath, a semantics-driven pathology image generation framework that achieves diagnostic-level controllable generation through multi-stream control (raw text + diagnostic semantic tokens distilled from a frozen pathology MLLM + prototype-bank morphology control), attaining a Patho-FID of 80.9 and outperforming the second-best method by 51%.

BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation

This paper proposes BiCLIP, a framework that employs Bidirectional Multimodal Fusion (BMF) to refine text representations using visual information, and Image Augmentation Consistency (IAC) to enforce perturbation-invariant intermediate features. BiCLIP surpasses state-of-the-art methods on COVID-19 CT segmentation while remaining robust with as little as 1% labeled data.

BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation

This paper proposes BiCLIP, a framework that introduces a Bidirectional Multimodal Fusion (BMF) module enabling text and visual features to mutually refine each other in a closed loop, and an Image Augmentation Consistency (IAC) module that enforces consistency of intermediate features under weak/strong perturbations. BiCLIP achieves robust medical image segmentation under extremely label-scarce (1% annotations only) and image-degraded (low-dose CT noise/motion blur) clinical conditions.

Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection

This paper proposes AnoPLe — a lightweight multimodal bidirectional prompt learning framework that requires neither manually crafted anomaly descriptions nor external auxiliary modules. Through text–visual prompt bidirectional interaction and scale-aware prefixes, AnoPLe achieves few-shot multi-class anomaly detection, delivering strong competitive results on MVTec-AD/VisA/Real-IAD while maintaining efficient inference (~28 FPS).

Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD

This work constructs a large-scale CBCT-report paired dataset of 7,408 cases covering 55 oral diseases, and develops CBCTRepD, a bilingual oral-maxillofacial CBCT report generation system. Through a collaborative paradigm of AI-generated drafts followed by radiologist editing, the system is shown via multi-level clinical evaluation to elevate junior radiologists to an intermediate level, intermediate radiologists to near-senior level, and reduce omissions for senior radiologists.

Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD

This paper proposes CBCTRepD, a bilingual report generation system for oral and maxillofacial CBCT, trained on a high-quality paired dataset of 7,408 cases. A multi-level evaluation framework is introduced to validate its tiered empowerment effect on novice, intermediate, and senior radiologists within a radiologist–AI collaborative workflow.

CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

This paper proposes CARE, a slide-level pathology foundation model that employs an Adaptive Region Generator (ARG) to partition WSIs into morphologically coherent irregular regions (analogous to word-level tokens in NLP), combined with two-stage pretraining via cross-modal alignment with RNA/protein expression profiles. Using approximately 1/10 the data of mainstream models, CARE achieves state-of-the-art average performance across 33 downstream tasks.

Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

This paper proposes CPNN, which constructs cell-type prototypes from publicly available single-cell RNA-seq data and models slide/patch-level gene expression as a weighted combination of these prototypes, achieving state-of-the-art performance on gene expression estimation while providing interpretability.

CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection

This paper revisits CLIP domain adaptation from a data-centric perspective and proposes CHIPS, which computes a utility score for each image-text pair by combining three factors: curvature-aware Newton alignment (fidelity), JL sketching-compressed curvature estimation (scalability), and learnability–domain-relevance weighting (retention). Using only 30% of the data, CHIPS matches full-dataset continual pre-training (CPT); using 10%, it surpasses 50%-data CPT. The method achieves state-of-the-art data selection performance across 17 medical and 31 general benchmarks.

CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection

This paper proposes CHIPS, a curvature-aware hybrid influence-based data selection method that computes Newton-style alignment scores in the CLIP endpoint subspace and combines them with learnability and domain-relevance weights. Using only 30% of the data, CHIPS matches full-dataset continual pre-training (CPT) performance and achieves state-of-the-art results across 17 medical benchmarks.

CLoE: Expert Consistency Learning for Missing Modality Segmentation

This work reformulates the robustness problem under missing modalities as decision-level expert consistency control. It proposes a dual-branch consistency learning scheme (global MEC + regional REC) coupled with a lightweight gating network that converts consistency scores into modality reliability weights, achieving an average WT Dice of 88.09% across 15 missing-modality combinations on BraTS 2020, surpassing all prior state-of-the-art methods.

CLoE: Expert Consistency Learning for Missing Modality Segmentation

This paper proposes CLoE (Consistency Learning of Experts), which reformulates missing-modality robustness as a decision-level expert consistency control problem. It reduces expert drift via two complementary consistency branches—Modality Expert Consistency (MEC) and Region Expert Consistency (REC)—and achieves reliability-weighted fusion through a consistency-score-driven gating network.
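
The gating network's role — turning per-modality consistency scores into reliability weights for fusion — can be sketched as a temperature softmax followed by a weighted sum. The actual gate is a learned lightweight network; this closed form is an assumption:

```python
import numpy as np

def reliability_weights(consistency_scores, tau=1.0):
    """Map per-modality-expert consistency scores to fusion weights via a
    temperature softmax (tau assumed); higher consistency -> more weight."""
    s = np.asarray(consistency_scores, dtype=float) / tau
    e = np.exp(s - s.max())                 # stable softmax
    return e / e.sum()

def fuse(expert_logits, weights):
    """Reliability-weighted fusion of per-expert logits (E, C) -> (C,)."""
    return np.tensordot(weights, expert_logits, axes=1)
```

Under a missing modality, the corresponding expert's consistency score drops and the gate automatically down-weights it — the decision-level robustness the entry describes.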

CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration

CRFT is a unified coarse-to-fine cross-modal image registration framework that learns modality-agnostic feature flow representations within a Transformer architecture. It employs 1/8-resolution global correspondence at the coarse stage and multi-scale local refinement at 1/2–1/4 resolution at the fine stage, coupled with iterative discrepancy-guided attention and Spatial Geometric Transform (SGT) to recursively refine flow fields and capture subtle spatial inconsistencies. CRFT outperforms SOTA methods including RAFT, GMFlow, and LoFTR across diverse cross-modal datasets covering optical, infrared, SAR, and multispectral imagery.

Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference

This paper proposes SpaHGC, a multimodal heterogeneous graph framework that constructs three types of subgraphs—intra-target-slice (TS), cross-slice (CS), and intra-reference-slice (RS)—and integrates masked graph contrastive learning with a cross-node dual attention mechanism to predict spatial gene expression from H&E histopathology images, achieving PCC improvements of 7.3%–27.1% across seven datasets.

cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold

This paper proposes cryoSENSE, the first computational framework for compressed cryo-EM imaging, demonstrating that protein cryo-EM images can be faithfully reconstructed from undersampled measurements under both sparse priors (DCT/Wavelet/TV) and generative priors (diffusion models), achieving up to 2.5× throughput gain while preserving 3D reconstruction resolution.

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

This paper proposes CURE — an error-aware curriculum learning framework for multi-task training that dynamically adjusts sampling distributions to emphasize hard samples, improving visual grounding accuracy by +0.37 IoU and reducing hallucination rate by 18.6% without introducing additional data.

Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation

This paper proposes Deco-Mamba, a decoder-centric Transformer-CNN-Mamba hybrid architecture that enhances the decoding process via Co-Attention Gates, Vision State Space Modules (VSSMs), and deformable convolutions, while introducing a distribution-aware deep supervision strategy based on windowed KL divergence. The method achieves state-of-the-art performance across 7 medical image segmentation benchmarks.

Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation

This paper proposes Deco-Mamba, a decoder-centric segmentation network that employs a Co-Attention Gate (CAG) for bidirectional encoder–decoder feature fusion, a Visual State Space Module (VSSM) for long-range dependency modeling, and deformable convolutions for detail recovery. A windowed distribution-aware KL-divergence deep supervision scheme is further introduced. The method achieves state-of-the-art performance on 7 medical segmentation benchmarks at moderate computational cost.
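
The windowed, distribution-aware KL supervision can be pictured as comparing local distributions rather than raw pixel values: each non-overlapping window of the prediction and target is normalized into a distribution before KL is taken. Window size and normalization here are assumptions:

```python
import numpy as np

def window_kl(pred, target, win=4, eps=1e-8):
    """Mean KL divergence between per-window intensity distributions of
    two maps (H, W); H and W must be divisible by win."""
    def win_dist(m):
        h, w = m.shape
        blocks = m.reshape(h // win, win, w // win, win)
        s = blocks.sum(axis=(1, 3), keepdims=True) + eps
        return (blocks + eps) / s           # each window sums to ~1
    p, q = win_dist(target), win_dist(pred)
    return float(np.mean(np.sum(p * np.log(p / q), axis=(1, 3))))
```

Identical maps give zero loss; a prediction that matches global statistics but misplaces mass within windows is still penalized, which is the point of making the supervision distribution-aware.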

Decoupling Vision and Language: Codebook Anchored Visual Adaptation

CRAFT is proposed to decouple the visual encoder from the language model via a discrete codebook, enabling domain adaptation by fine-tuning only the visual encoder. The adapted encoder can be seamlessly reused across different LLM architectures, achieving an average improvement of 13.51% across 10 domain benchmarks.

Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning

This paper compares three learning paradigms — local learning (LL), federated learning (FL), and centralized learning (CL) — for binary classification of third molar–mandibular canal overlap on panoramic radiographs. Centralized learning achieves the best performance (AUC 0.831), federated learning serves as a competitive privacy-preserving alternative (AUC 0.757), and both substantially outperform local learning (mean AUC 0.672).

Deep Learning–Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging

This paper proposes ScleraGluNet, a multi-view deep learning framework that combines five-directional scleral vessel imaging with multi-branch CNN feature extraction, MRFO-based feature refinement, and Transformer-based cross-view fusion, achieving 93.8% accuracy on three-class metabolic state classification and an MAE of 6.42 mg/dL for continuous fasting plasma glucose (FPG) estimation.

Deep Learning–Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging

This paper proposes ScleraGluNet, which captures scleral blood vessel photographs from five gaze directions, extracts direction-specific vascular features via parallel CNNs, refines them through MRFO feature selection, and fuses them across views using a Transformer. The model simultaneously performs three-class metabolic state classification (93.8% accuracy) and continuous fasting plasma glucose (FPG) estimation (MAE = 6.42 mg/dL, r = 0.983).

Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning

This work systematically compares three training paradigms—local learning (LL), federated learning (FL), and centralized learning (CL)—on cropped panoramic dental radiographs partitioned by 8 independent annotators, targeting a binary classification task of third molar–mandibular canal overlap. The study establishes a consistent performance ranking of CL > FL > LL (AUC: 0.831, 0.757, and 0.672, respectively), demonstrating that FL substantially outperforms site-independent training while preserving data privacy.

Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography

This work constructs PETWB-Seg11K, the largest whole-body PET segmentation dataset to date (11,041 3D PET scans and 59,831 segmentation masks), and proposes SegAnyPET, a foundation model enabling prompt-driven universal volumetric segmentation of organs and lesions in PET imaging. The model demonstrates strong performance in zero-shot cross-center and cross-tracer settings.

Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography

This work constructs PETWB-Seg11K, the largest whole-body PET segmentation dataset to date (11,041 3D PET scans + 59,831 masks), and proposes SegAnyPET — the first 3D promptable segmentation foundation model tailored for functional PET imaging — achieving strong zero-shot generalization across multi-center, multi-tracer, and multi-disease scenarios.

Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

This paper proposes an NNMF+CNN+diffusion defense framework for brain tumor MRI classification. MRI images are first decomposed into compact, interpretable low-rank features via NNMF; the most discriminative components are selected using AUC, Cohen's d, and p-value statistical criteria; a lightweight CNN then performs classification. At inference time, a feature-space purification module combining forward diffusion noise injection and a learned denoiser is introduced. Under AutoAttack (\(L_\infty\), \(\epsilon=0.10\)), robust accuracy improves from 0.47% to 59.53%.

Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

A four-stage pipeline is proposed consisting of NNMF feature extraction → statistical feature selection → lightweight CNN classification → feature-space diffusion purification. The method maintains 85.1% clean accuracy while substantially improving robust accuracy under AutoAttack (\(L_\infty\), \(\epsilon=0.10\)) from a baseline of 0.47% to 59.5%.
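
The NNMF front-end factorizes images into non-negative components before selection and classification. A self-contained sketch using Lee–Seung multiplicative updates (scikit-learn's `NMF` would be the usual tool), together with the Cohen's d criterion the pipeline uses for component selection:

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Lee-Seung multiplicative updates for V ~ W @ H (all non-negative)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-10)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-10)
    return W, H

def cohens_d(feat, labels):
    """Effect size of one NNMF coefficient between two classes; large
    values mark discriminative components worth keeping."""
    a, b = feat[labels == 0], feat[labels == 1]
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / (pooled + 1e-10)
```

In the pipeline, rows of `V` would be flattened MRI slices, columns of `W` the per-image coefficients scored by AUC, Cohen's d, and p-values before the CNN stage.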

EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes", "Hands" and "Minds"

This paper proposes EchoAgent, an agent system that simulates the "eyes–hands–minds" collaborative workflow of echocardiography clinicians. Through three stages—an Expertise-Driven Cognition engine (mind), a Hierarchical Collaboration Toolkit (eyes + hands), and an Orchestrated Reasoning Hub—the system achieves end-to-end reliable echocardiography interpretation, attaining state-of-the-art performance on multiple benchmarks.

Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models

This paper proposes EDA, a framework that extends the EDM design space from isotropic Gaussian noise to arbitrary noise patterns. By driving SDEs with multivariate Gaussian distributions and multiple independent Wiener processes, EDA enables flexible noise diffusion while provably introducing no additional sampling overhead. With only 5 sampling steps, EDA achieves performance on par with or superior to 100-step Refusion and task-specific methods across three tasks: MRI bias field correction, CT metal artifact removal, and natural image shadow removal.

EI: Early Intervention for Multimodal Imaging based Disease Recognition

This paper proposes EI, which injects cross-modal semantic guidance (the [INT] token) before unimodal embedding (UIE), emulating the clinical workflow in which a clinician first examines one modality to form a preliminary judgment and then uses that judgment to guide interpretation of another modality. EI also introduces MoR (multi-rank LoRA with a relaxed bypass router) for parameter-efficient VFM adaptation to the medical domain. With fewer than 9M trainable parameters, EI surpasses all full fine-tuning and prompt-learning baselines on three datasets covering retinal, dermatological, and knee-joint imaging.
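
MoR builds on LoRA, whose core idea is adding a trainable low-rank update to a frozen weight matrix — which is why the trainable-parameter count stays under 9M. A plain-LoRA numpy sketch (the multi-rank and relaxed-router components of MoR are omitted):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """y = x W^T + (alpha/r) * x A^T B^T : frozen base weight W plus a
    rank-r update B @ A. Only A (r, d_in) and B (d_out, r) are trained;
    alpha is the usual LoRA scaling hyperparameter (value assumed)."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

With `B` initialized to zero the adapter starts as an exact no-op, and after training the update can be merged into `W` for inference at zero extra cost.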

Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models (EDA)

This paper proposes the EDA framework, which extends the EDM design space from Gaussian noise to arbitrary noise patterns by parameterizing a covariance matrix via a multivariate Gaussian distribution. EDA enables flexible noise diffusion and achieves performance at or above 100-step EDM methods and task-specific approaches using only 5 sampling steps across three tasks: MRI bias field correction, CT metal artifact reduction, and natural image shadow removal.

EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

This paper proposes EMAD, an end-to-end multimodal vision-language framework for AD diagnosis that generates structured reports. Through hierarchical Sentence–Evidence–Anatomy (SEA) Grounding, each diagnostic statement is explicitly linked to clinical evidence and 3D brain anatomy. Executable rule-driven GRPO reinforcement fine-tuning is applied to ensure clinical consistency.

EquivAnIA: A Spectral Method for Rotation-Equivariant Anisotropic Image Analysis

This paper proposes EquivAnIA, which employs a family of oriented filters (cake wavelets and ridge filters) to estimate the angular distribution of an image via weighted averaging in the frequency domain, replacing conventional angular binning. The method achieves truly numerically rotation-robust anisotropic analysis, with a dominant orientation estimation error of only 0.03° on synthetic images and a CT registration error of only 0.02°.

EquivAnIA: A Spectral Method for Rotation-Equivariant Anisotropic Image Analysis

This paper proposes EquivAnIA, a spectral method that computes angular energy distributions via Cake wavelets and Ridge filters in the Fourier domain, achieving strictly numerically rotation-equivariant anisotropic image analysis. The method substantially outperforms conventional angular PSD binning approaches on both synthetic and real images.
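
The key move — estimating orientation as a weighted average over the power spectrum instead of binning angles — can be sketched with the classic angle-doubling circular mean. The oriented filter family (cake wavelets, ridge filters) is simplified away here; this is the averaging idea only:

```python
import numpy as np

def dominant_orientation(img):
    """Dominant orientation (radians, mod pi) from the power spectrum:
    a power-weighted circular mean with angle doubling, rather than
    discretized angular bins."""
    F = np.fft.fftshift(np.fft.fft2(img))
    P = np.abs(F) ** 2
    h, w = img.shape
    ky, kx = np.meshgrid(np.arange(h) - h // 2,
                         np.arange(w) - w // 2, indexing="ij")
    P[h // 2, w // 2] = 0.0                 # discard the DC component
    ang = np.arctan2(ky, kx)
    z = np.sum(P * np.exp(2j * ang))        # doubling folds theta and theta+pi
    return 0.5 * np.angle(z) % np.pi
```

Because the estimate is a continuous weighted mean, rotating the input rotates the answer by the same amount without the quantization error of angular binning — the property behind the sub-0.1° errors reported.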

Event-Level Detection of Surgical Instrument Handovers in Videos

This paper proposes a spatiotemporal visual framework for detecting instrument handovers in real surgical videos. It combines ViT-based spatial feature extraction with unidirectional LSTM temporal modeling, and employs multi-task learning to jointly predict handover events and their directions, achieving an event-level detection F1 of 0.84 on kidney transplant surgical videos.

Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning

This paper proposes PAMS (Priority-Aware Mistake Severity), a framework that significantly reduces the risk of severe misdiagnosis in multiclass MIL-based WSI diagnosis through an asymmetric severity-aware cross-entropy loss (MSCE), semantic feature remix (SFR), and an asymmetric Mikel's Wheel evaluation metric.

Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes

This work presents the first robustness extension evaluation of ZACH-ViT, a compact permutation-invariant ViT architecture, in low-data medical imaging settings. Across 7 MedMNIST datasets, ZACH-ViT ranks first under both clean and common corruption conditions (Mean Rank 1.57), ranks first under FGSM (2.00), and second under PGD (2.29).

Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning

An attention-based MIL model is built upon a ConvNeXt-Base backbone, employing a gradient reversal layer (GRL) to adversarially eliminate gender information from scan representations. Combined with focal loss (\(\gamma=2\)) + label smoothing (\(\varepsilon=0.1\)), subgroup oversampling, and 5-fold ensemble, the proposed method achieves a mean competition score of 0.685±0.030 on a four-class lung disease diagnosis task over 889 chest CT scans. The female macro-F1 (0.691) slightly exceeds the male macro-F1 (0.679), validating that GRL effectively closes the fairness gap.
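
Two ingredients named above — the gradient reversal layer and focal loss (\(\gamma=2\)) with label smoothing (\(\varepsilon=0.1\)) — have standard forms that can be sketched directly; how the paper combines smoothing with the focal term is an assumption:

```python
import numpy as np

def grl_backward(upstream_grad, lam=1.0):
    """Gradient reversal layer: identity in the forward pass, flipped and
    scaled gradient in the backward pass, so the shared encoder is pushed
    to *remove* gender information the adversary could exploit."""
    return -lam * upstream_grad

def focal_ls_loss(logits, labels, gamma=2.0, eps=0.1):
    """Focal loss (gamma=2) over label-smoothed targets (eps=0.1).
    Smoothing mass eps is spread over the non-target classes — one common
    convention; the paper's exact formulation may differ."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n, c = p.shape
    t = np.full((n, c), eps / (c - 1))
    t[np.arange(n), labels] = 1.0 - eps
    return float(-(t * (1 - p) ** gamma * np.log(p + 1e-12)).sum(axis=1).mean())
```

The focal factor \((1-p)^\gamma\) down-weights easy, already-confident samples, which pairs naturally with the subgroup oversampling described above.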

Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning

A fairness-aware framework based on attention MIL and gradient reversal layers (GRL) is proposed for multi-class lung disease diagnosis from chest CT volumes, eliminating gender bias while preserving diagnostic accuracy.

Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation

This paper proposes FedMEPD, a framework that employs modality-specific encoders to address intermodal heterogeneity, a filter-level dynamic partial personalization decoder to balance knowledge sharing and personalization, and a multi-anchor cross-attention calibration module to compensate for missing modality information. FedMEPD comprehensively outperforms existing multimodal federated learning methods on BraTS 2018/2020.

Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation

This paper proposes FedMEPD, a framework that addresses two major challenges in federated multimodal brain tumor segmentation — inter-modality heterogeneity and client personalization — through modality-specific encoders (fully federated), a partially personalized multimodal fusion decoder, and a multi-anchor cross-attention calibration module. FedMEPD surpasses existing federated methods on BraTS 2018/2020.

FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning

FedVG proposes to score each client using layer-wise gradient norms computed on a global validation set, assigning higher aggregation weights to clients whose validation gradients are smaller in norm (suggesting flatter loss regions), thereby substantially improving the generalization of federated learning under high data heterogeneity.
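
The weighting idea can be sketched as follows. The function name and the softmax-over-negative-norms form are illustrative assumptions; the paper's exact scoring formula may differ:

```python
import numpy as np

def aggregation_weights(client_grads, temperature=1.0):
    """Toy gradient-norm-guided aggregation weights.

    client_grads: list of dicts mapping layer name -> gradient array,
    each computed on a shared global validation set. Clients with smaller
    layer-wise gradient norms receive larger aggregation weights.
    """
    norms = np.array([
        np.mean([np.linalg.norm(g) for g in grads.values()])
        for grads in client_grads
    ])
    # Softmax over negative norms: smaller-norm clients score higher.
    scores = np.exp(-norms / temperature)
    return scores / scores.sum()
```

The server would then average client model updates with these weights instead of the usual data-size-proportional FedAvg weights.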

Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis

FPRL is proposed as a clinically-inspired hierarchical self-supervised framework that mitigates motion bias by first "focusing" on intra-frame lesion-centric static semantics and then "perceiving" inter-frame contextual evolution, achieving state-of-the-art performance across 11 endoscopic datasets.

Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning

This paper is the first to systematically define the task of video-based epileptic seizure forecasting (predicting whether a seizure will occur within the next 5 seconds using 3–10-second pre-ictal clips), and proposes a two-stage cross-species transfer learning framework — self-supervised pre-training of VideoMAE on a mixed dataset of rodent and human videos, followed by few-shot fine-tuning on a very limited set of human epilepsy videos. Under 2/3/4-shot settings, the framework achieves an average balanced accuracy (bacc) of 72.30% and ROC-AUC of 75.58%, outperforming all video understanding baselines.

Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning

This work introduces the first purely vision-based epileptic seizure forecasting task, leveraging large-scale rodent epilepsy videos for cross-species self-supervised pre-training via the VideoMAE framework, achieving >70% forecasting accuracy within a 3–10 second prediction window.

Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay

This paper proposes FORGE, the first continual learning framework specifically designed for cross-site fMRI-based brain disorder diagnosis. FORGE generates realistic functional connectivity matrices via a structure-aware VAE for privacy-preserving generative replay, and combines dual-level knowledge distillation with a hierarchical contextual bandit sampling strategy to effectively mitigate catastrophic forgetting.

GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction

This paper proposes GaussianPile, which extends 3D Gaussian Splatting from surface appearance modeling to slice-based volumetric reconstruction by introducing a focus-aware physical imaging model (Focus Gaussian). On ultrasound and light-sheet microscopy data, the method achieves high-quality volumetric compression and reconstruction that is 11× faster than NeRF-based methods and reduces storage by 16× compared to voxel grids.

GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis

This paper proposes the GIIM framework, which constructs a Multi-Heterogeneous Graph (MHG) to simultaneously model intra-view and inter-view dependencies among lesions in multi-view medical images, and achieves robust diagnosis on incomplete data through four missing-view representation strategies.

GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis

This paper proposes the GIIM framework, which constructs a Multi-Heterogeneous Graph (MHG) with four types of edge relations to simultaneously model the dynamic changes of individual lesions across imaging phases and the spatial associations among different lesions. Four missing-view imputation strategies are designed. GIIM achieves significant improvements over existing methods on three modalities: liver CT, breast mammography, and breast MRI.

GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification

This paper presents GLEAM, the first publicly available trimodal glaucoma dataset (SLO fundus photography + peripapillary OCT + visual field deviation maps, 1,200 cases, four-stage annotation), along with HAMM, a CNN-based hierarchical attention masked modeling framework. HAMM achieves cross-modal fusion via clinically inspired multi-head modality gating and relational graph attention, attaining a four-class classification accuracy of 81.08%.

GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification

This paper introduces GLEAM (Glaucoma Lesion Evaluation and Analysis with Multimodal imaging), the first publicly available three-modality glaucoma dataset comprising SLO fundus images, circumpapillary OCT, and visual field pattern deviation maps. It also proposes HAMM (Hierarchical Attentive Masked Modeling), a framework that concentrates cross-modal representation learning at the encoder side via a hierarchical attentive encoder and a lightweight decoder, enabling accurate four-stage glaucoma classification.

Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization

This paper proposes GenEval, which quantifies causal coverage gaps via a Domain Conformal Bound (DCB), distills human expert knowledge, and integrates it with a medical VLM (MedGemma-4B) through LoRA fine-tuning for single source domain generalization (SDG), achieving substantial gains over baselines on DR grading and seizure onset zone (SOZ) detection.

Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization

This paper proposes the Domain Conformal Bound (DCB) theoretical framework to quantify causal factor discrepancies across domains and derives an optimizable consistency metric SDCD. Expert knowledge is refined accordingly and injected into MedGemma-4B via LoRA, achieving substantial improvements over single source domain generalization SOTA on 8 DR and 2 SOZ datasets.

Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

This paper introduces instruction-guided lesion segmentation (ILS) for chest X-rays, constructs the first large-scale automatically generated instruction-answer dataset MIMIC-ILS (1.1M samples, 192K images, 91K masks), and trains the ROSALIA model to achieve gIoU of 71.2% and null-target accuracy of 91.8%, substantially outperforming existing general-purpose and medical segmentation models.

Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

This paper identifies and addresses the degradation of local feature alignment in CLIP under cross-domain few-shot learning (CDFSL), and proposes CC-CDFSL, a cycle-consistency-based framework. Through bidirectional T-I-T and I-T-I cyclic paths and a semantic anchor mechanism, CC-CDFSL improves patch-level vision-language alignment while enhancing model interpretability.

InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models

This paper proposes InvAD, which shifts diffusion-based anomaly detection from a "denoising-reconstruction in RGB space" paradigm to a "noising-inversion in latent space" paradigm. By applying DDIM inversion to directly infer the terminal latent variable and measuring deviation under the prior distribution, anomalies are detected without reconstruction. Only 3 inversion steps suffice to achieve state-of-the-art performance, with approximately 2× inference speedup.

InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models

This paper proposes a "detection via noising" paradigm to replace the conventional "detection via denoising" approach. By mapping images to the latent noise space via DDIM inversion, the method measures deviation from the prior distribution as an anomaly score using only 3 inference steps—without any reconstruction—achieving state-of-the-art accuracy at 88 FPS (more than 2× faster than OmiAD).
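
The noising-inversion idea can be illustrated with a toy deterministic DDIM inversion loop and a prior-deviation score. Everything here (`eps_model`, the schedule, and the particular deviation statistic) is an illustrative assumption, not the paper's exact formulation:

```python
import numpy as np

def ddim_invert(x0, eps_model, alphas_bar):
    """Deterministic DDIM inversion over a handful of steps (sketch).

    eps_model(x, t) -> predicted noise; alphas_bar: decreasing cumulative
    noise schedule with alphas_bar[0] close to 1.
    """
    x = x0
    for t in range(len(alphas_bar) - 1):
        a_t, a_next = alphas_bar[t], alphas_bar[t + 1]
        eps = eps_model(x, t)
        # Recover the x0 estimate, then re-noise deterministically to t+1.
        x0_hat = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0_hat + np.sqrt(1 - a_next) * eps
    return x

def anomaly_score(z_T):
    """Deviation of the inverted terminal latent from the N(0, I) prior.

    Under the prior, the mean squared coordinate concentrates around 1,
    so a large deviation flags an out-of-distribution input.
    """
    z = np.asarray(z_T).ravel()
    return abs(float(np.mean(z ** 2)) - 1.0)
```

Because the score is read directly off the inverted latent, no reverse denoising pass (reconstruction) is needed, which is where the claimed speedup comes from.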

Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision

This paper proposes MASS (MAsk-guided Self-Supervised learning), which leverages category-agnostic masks automatically generated by SAM2 as pseudo-annotations and adopts in-context segmentation as a pretext task for self-supervised pretraining. Without any manual annotation, MASS learns semantically rich and highly generalizable 3D medical image representations, achieving strong performance on both few-shot segmentation and frozen-encoder classification.

LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings

This paper presents LEMON, a large-scale endoscopic dataset comprising 4,194 surgical videos (938 hours), and proposes LemonFM, a self-supervised foundation model based on augmented knowledge distillation. LemonFM achieves state-of-the-art performance across four downstream surgical tasks: phase recognition, tool detection, action recognition, and semantic segmentation.

LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings

This paper introduces LEMON, the largest open surgical video dataset to date (4,194 videos, 938 hours, 35 procedure types), and proposes LemonFM, a foundation model based on augmented knowledge distillation, which comprehensively outperforms existing methods across four downstream tasks: surgical phase recognition, tool detection, action recognition, and semantic segmentation.

LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol

This paper introduces LUMINA, a multi-vendor full-field digital mammography (FFDM) dataset comprising 468 patients and 1,824 images, accompanied by a foreground-pixel histogram matching protocol for energy harmonization. The benchmark systematically evaluates CNN and Transformer models across three clinical tasks: diagnosis, BI-RADS classification, and breast density prediction.
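
Foreground-restricted histogram matching can be sketched as below. The background threshold and the exact matching recipe are assumptions for illustration; the dataset's harmonization protocol may differ in detail:

```python
import numpy as np

def match_foreground_histogram(src, ref, bg_threshold=0):
    """Match the intensity histogram of src's foreground to ref's foreground.

    Pixels above bg_threshold are treated as breast tissue (foreground);
    background pixels are left untouched. This maps the empirical CDF of
    src's foreground onto that of the reference vendor's image.
    """
    out = src.astype(np.float64).copy()
    fg_s = src > bg_threshold
    fg_r = ref > bg_threshold
    s_vals, s_idx, s_counts = np.unique(
        src[fg_s], return_inverse=True, return_counts=True)
    r_vals, r_counts = np.unique(ref[fg_r], return_counts=True)
    s_cdf = np.cumsum(s_counts) / s_counts.sum()
    r_cdf = np.cumsum(r_counts) / r_counts.sum()
    # Each source quantile is mapped to the reference intensity at that quantile.
    out[fg_s] = np.interp(s_cdf[s_idx], r_cdf, r_vals)
    return out
```

Restricting the match to foreground pixels avoids letting the large, vendor-dependent background region dominate the intensity mapping.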

Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies

This paper proposes a low-cost marker-based photogrammetry approach for high-quality 3D reconstruction of aggregate particles. Through a systematic comparative analysis of 2D and 3D morphological indices, it reveals significant deviations introduced by 2D projection analysis relative to true 3D morphology.

Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies

This paper proposes a low-cost marker-based photogrammetric pipeline for high-quality 3D reconstruction of aggregate particles. Through a systematic comparative analysis of 2D and 3D morphological indices, it reveals the significant limitations of relying solely on 2D images for aggregate morphology assessment.

MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

Built upon frozen CLIP encoders, MedCLIPSeg introduces a probabilistic cross-modal attention adapter (PVL) that enables bidirectional vision-language interaction and explicit prediction uncertainty modeling, complemented by a soft patch-level contrastive loss. The method achieves strong data efficiency, domain generalization, and interpretability across 16 medical segmentation datasets.

MedGEN-Bench: Contextually Entangled Benchmark for Open-Ended Multimodal Medical Generation

This paper introduces MedGEN-Bench, the first comprehensive benchmark for open-ended multimodal medical generation, comprising 6,422 expert-verified image-text pairs spanning 6 imaging modalities and 16 clinical tasks, accompanied by a three-tier evaluation framework. The benchmark reveals that compositional pipelines outperform unified models in cross-modal consistency.

MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

MedGRPO introduces two key innovations to address training collapse in multi-dataset reinforcement learning for medical video understanding: cross-dataset reward normalization (mapping median performance across datasets of varying difficulty to a uniform reward value via a logistic function) and a medical LLM judge (comparative scoring across five clinical dimensions). Built on Qwen2.5-VL-7B and trained on MedVidBench (532K video instruction pairs), the method surpasses GPT-4.1 and Gemini-2.5-Flash.
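
The cross-dataset reward normalization can be sketched with a logistic curve centered at each dataset's median score; the steepness parameter and this exact parameterization are illustrative assumptions:

```python
import math

def normalized_reward(score, dataset_median, steepness=10.0):
    """Map a raw per-dataset score to a comparable reward in (0, 1).

    The logistic midpoint is shifted to the dataset's median raw score,
    so the median of every dataset (easy or hard) maps to reward 0.5 and
    no single dataset dominates the multi-dataset RL signal.
    """
    x = steepness * (score - dataset_median)
    return 1.0 / (1.0 + math.exp(-x))
```

Above-median rollouts earn rewards above 0.5 and below-median ones below 0.5, regardless of each dataset's absolute difficulty.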

MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration

This paper proposes MedKCO, a knowledge-driven cognitive orchestration strategy for medical vision-language pretraining. It introduces a hierarchical curriculum (label-level ordering by diagnostic sensitivity + description-level ordering by sample representativeness) and a self-paced asymmetric contrastive loss, enabling the model to progressively learn from simple to complex concepts. MedKCO substantially outperforms baselines on zero-shot and downstream tasks across three medical imaging modalities.

MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification

This paper proposes MIL-PF, a framework that leverages frozen foundation vision encoders (DINOv2/MedSigLIP) to precompute features, followed by a lightweight MIL head of approximately 40K parameters for mammography classification. The method achieves state-of-the-art performance on the large-scale EMBED dataset while substantially reducing training cost.

MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification

Combining frozen general-purpose foundation encoders (DINOv2 ViT-Giant / MedSigLIP) with a lightweight MIL aggregation head of only ~40K parameters, MIL-PF uses a dual-stream aggregation strategy (global mean pooling + local Perceiver cross-attention) to achieve state-of-the-art performance on large-scale mammography classification benchmarks such as EMBED (AUC 0.916, Spec@Sens=0.9 of 0.762). Training takes 5–7 minutes with 35–458× fewer trainable parameters than baselines.
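
The dual-stream aggregation can be sketched in numpy. The projection matrices and query count here are illustrative stand-ins for learned parameters, not the paper's actual head:

```python
import numpy as np

def dual_stream_aggregate(patch_feats, queries, w_q, w_k, w_v):
    """Aggregate precomputed patch features into one slide-level vector.

    Stream 1: global mean pooling over all patch features.
    Stream 2: Perceiver-style cross-attention, where a small set of learned
    query tokens attends over all patch tokens, then the latent tokens are
    pooled. The two streams are concatenated.
    """
    d = w_k.shape[1]
    q = queries @ w_q                        # (m, d)
    k = patch_feats @ w_k                    # (n, d)
    v = patch_feats @ w_v                    # (n, d)
    att = q @ k.T / np.sqrt(d)               # (m, n)
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)    # row-wise softmax
    local_stream = (att @ v).mean(axis=0)    # pooled latent tokens
    global_stream = patch_feats.mean(axis=0)
    return np.concatenate([global_stream, local_stream])
```

Because the encoder is frozen and features are precomputed once, only this tiny head is trained, which explains the minutes-scale training times.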

Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

This paper reveals that enhancing visual discriminability during VLM fine-tuning for cross-domain few-shot learning paradoxically degrades cross-modal alignment — a phenomenon termed the "discriminability trap." Two plug-and-play modules, SVL and RA, are proposed to suppress visual learning shortcuts and guide cross-modal alignment, achieving state-of-the-art performance on 4 CDFSL benchmarks and 11 FSL datasets.

Mitigating Object Hallucination in LVLMs via Attention Imbalance Rectification

This paper introduces the concept of Attention Imbalance to explain object hallucination in LVLMs, and proposes a lightweight decoding-time intervention method, AIR, which rectifies attention imbalance via cross-modal attention reallocation and variance-constrained projection regularization. AIR reduces hallucination rates by up to 35.1% and improves general capability by up to 15.9% across four LVLMs.

MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection

MoECLIP introduces Mixture-of-Experts into zero-shot anomaly detection (ZSAD), achieving patch-level dynamic expert routing and specialization via Frozen Orthogonal Feature Separation (FOFS) and an Equiangular Tight Frame (ETF) loss, attaining state-of-the-art performance across 14 industrial and medical benchmarks.

Momentum Memory for Knowledge Distillation in Computational Pathology

This paper proposes MoMKD, which replaces conventional batch-local feature alignment with a momentum-updated class-conditional memory bank to enable cross-modal knowledge distillation from genomics to pathology whole-slide images (WSIs), achieving genomics-level predictive capability at inference using only H&E slides.

MozzaVID: Mozzarella Volumetric Image Dataset

This paper introduces MozzaVID — a mozzarella cheese microstructure volumetric image classification dataset based on synchrotron X-ray CT — comprising 591–37,824 samples of size \(192^3\), with classification targets spanning 25 cheese types and 149 individual cheese specimens. The dataset bridges the large gap in scale and task design between 3D volumetric and 2D datasets, and experiments demonstrate that 3D models significantly outperform their 2D counterparts.

MRI Contrast Enhancement Kinetics World Model

This paper presents the first MRI Contrast Enhancement Kinetics World Model (MRI CEKWorld), which leverages spatiotemporal consistency learning (STCL) on sparsely sampled data to generate continuous, high-fidelity contrast-enhanced sequences from non-contrast MRI, addressing the dual challenges of content distortion and temporal discontinuity.

Multimodal Classification of Radiation-Induced Contrast Enhancements and Tumor Recurrence Using Deep Learning

This paper proposes RICE-NET, a multimodal 3D ResNet-18 model that integrates longitudinal MRI data with radiotherapy dose distribution maps to automatically distinguish radiation-induced contrast enhancements (RICE) from tumor recurrence following glioblastoma surgery, achieving F1=0.92 on an independent test set.

Multimodal Classification of Radiation-Induced Contrast Enhancements and Tumor Recurrence Using Deep Learning

This paper proposes RICE-NET, a multimodal 3D ResNet-18 that fuses longitudinal T1-weighted MRI with radiotherapy dose distribution maps. Evaluated on a cohort of 92 glioblastoma patients, the model achieves F1=0.916 for classifying radiation-induced contrast enhancements (RICE) versus tumor recurrence. Ablation studies reveal that the radiotherapy dose map is the single most informative modality (F1=0.78).

Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation

This paper proposes ERBA (Enzyme-Reaction Bridging Adapter), which reformulates enzyme kinetic parameter prediction as a staged conditioning problem aligned with catalytic mechanisms — first injecting substrate information via MRCA to capture molecular recognition, then fusing active-site 3D geometry via G-MoE to model conformational adaptation, and applying ESDA for distribution alignment to preserve PLM priors — achieving state-of-the-art performance across three kinetic metrics.

Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation

This paper proposes ERBA (Enzyme-Reaction Bridging Adapter), which reformulates enzyme kinetic parameter prediction as a staged multimodal conditional generation problem — first injecting substrate information via MRCA to capture substrate recognition specificity, then integrating active-site 3D geometry via G-MoE to capture conformational adaptation, with ESDA distribution alignment to preserve PLM semantic priors.

MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning

This paper proposes MMPFN, the first method to extend the pretrained tabular foundation model TabPFN to multimodal settings (tabular + image/text). By introducing a Multi-head Gated MLP (MGM) and a Cross-Attention Pooler (CAP), MMPFN addresses two failure modes — over-compression of non-tabular embeddings and token-count imbalance — and achieves state-of-the-art performance on both medical and general-purpose datasets.

Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation

This paper proposes MSG-LDM, which introduces a multiscale structure-style disentanglement mechanism into a latent diffusion model. Through high-frequency injection, multimodal structural feature fusion, and structure-aware losses, MSG-LDM achieves multimodal MRI synthesis that preserves anatomical structures and fine-grained details under missing-modality scenarios.

Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation

This paper proposes MSG-LDM, a latent diffusion model-based framework for multimodal MRI translation. By explicitly disentangling style and structural information in the latent space and incorporating High-Frequency Injection Blocks (HFIB), Multi-Modal Structural Feature Fusion (MMSF), and Multi-Scale Structure Enhancement (MSSE) modules, the framework extracts modality-agnostic structural priors to guide diffusion denoising. MSG-LDM outperforms existing methods on the BraTS2020 and WMH datasets.

MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification

This paper proposes the MUSE framework, which significantly improves generalization in few-shot whole slide image (WSI) classification through MoE-driven sample-wise fine-grained semantic enhancement (SFSE) and LLM knowledge base-based stochastic multi-view model optimization (SMMO).

MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality

This paper proposes MUST, a framework that explicitly decomposes multimodal representations into modality-specific and cross-modal shared components via algebraic constraints, and employs a conditional latent diffusion model to generate modality-specific information under missing-modality scenarios. MUST achieves state-of-the-art performance with a C-index of 0.742 across five TCGA cancer datasets, with degradation of only ~0.4%–3.5% under missing-modality conditions.

MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

This paper proposes MuViT, a multi-resolution Vision Transformer that employs world-coordinate RoPE positional encoding to jointly process crops of the same scene at different physical resolutions within a single encoder, achieving substantial improvements over single-resolution baselines on microscopy image segmentation tasks.

NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization

NeurINO proposes to initialize a 3D neuron segmentation model by inflating DINOv3 pretrained 2D convolutional kernels into 3D operators, while introducing a Topology-Aware Skeleton Loss (TASL) to explicitly supervise skeleton-level structural fidelity. The method achieves average improvements of 2.9% in ESA, 2.8% in DSA, and 3.8% in PDS across four neuroimaging datasets.
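
Inflating 2D kernels into 3D typically follows the I3D recipe: replicate along the new depth axis and rescale. A minimal sketch, assuming this standard recipe (the paper's exact initialization may differ):

```python
import numpy as np

def inflate_kernel_2d_to_3d(w2d, depth):
    """Inflate a pretrained 2D conv kernel into a 3D one.

    w2d: (out_c, in_c, kh, kw). The weights are replicated `depth` times
    along a new axis and divided by `depth`, so an input that is constant
    along depth produces the same activation as the original 2D layer.
    """
    return np.repeat(w2d[:, :, None, :, :], depth, axis=2) / depth
```

The division by `depth` preserves the total kernel mass, keeping early-training activation statistics close to those of the pretrained 2D network.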

Novel Architecture of RPA in Oral Cancer Lesion Detection

This paper compares low-code RPA platforms (UiPath, Automation Anywhere) against a Python-based design pattern approach (Singleton + Batch Processing) for oral cancer detection automation. The proposed OC-RPAv2 reduces per-image inference time from 2.5 seconds to 0.06 seconds, achieving a 60–100× speedup.

Novel Architecture of RPA in Oral Cancer Lesion Detection

This work integrates software design patterns (Singleton + Batch Processing) into an EfficientNetV2B1-based oral cancer lesion detection Python pipeline, achieving a 60–100× inference speedup over conventional RPA platforms (UiPath/Automation Anywhere) — 0.06 s per image vs. 2.58 s — while maintaining diagnostic accuracy.

OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging

This paper proposes OmniFM, a modality-robust and task-agnostic federated learning framework that integrates three complementary components—Global Spectral Knowledge Retrieval, Embedding-wise Cross-Attention Fusion, and Prefix–Suffix Spectral Prompting—to support five medical imaging tasks (classification, segmentation, super-resolution, VQA, and multimodal fusion) within a unified FL pipeline, achieving substantial improvements over existing baselines under cross-modal heterogeneous settings.

OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

OraPO (Oracle-educated GRPO) injects lightweight DPO supervision when GRPO exploration fails, converting zero-reward rollouts into preference pairs. Combined with a FactScore reward, the method achieves SOTA radiology report generation on CheXpert Plus and MIMIC-CXR (F1=0.341/0.357) using only 1K training samples and a 3B model—reducing training data by 2–3 orders of magnitude compared to prior best methods.

OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

OraPO is proposed as an adaptive hybrid RL framework combining GRPO and DPO for data-efficient radiology report generation. It dynamically switches between GRPO and DPO via Zero-Reward Rate detection, and employs a FactScore-based clinical fact-level reward. Using only 1K samples (vs. 227K for baselines), OraPO achieves state-of-the-art clinical F1 scores of 0.341/0.357 on CheXpert Plus and MIMIC-CXR.
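
The switching logic can be sketched as a per-group check. The function name and threshold are illustrative assumptions:

```python
def choose_update(rollout_rewards, zrr_threshold=1.0):
    """Decide between a GRPO step and an oracle-educated DPO step.

    When every rollout in a group earns zero reward, the group-relative
    advantage in GRPO is degenerate (nothing to rank), so the step falls
    back to DPO on an (oracle report > sampled rollout) preference pair.
    """
    zero_reward_rate = sum(r == 0 for r in rollout_rewards) / len(rollout_rewards)
    if zero_reward_rate >= zrr_threshold:
        return "dpo"   # learn from oracle-vs-rollout preference pairs
    return "grpo"      # standard group-relative policy optimization
```

This converts otherwise wasted zero-reward exploration into a usable learning signal, which is what makes the 1K-sample regime viable.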

Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification

HIPSS introduces two key innovations for few-shot WSI classification: (1) parameter-efficient prompt tuning via Scaling and Shifting Features (SSF) as a replacement for CoOp, substantially reducing the number of trainable parameters; and (2) a soft hierarchical textual guidance strategy that exploits the pretrained knowledge of VLMs and the inherent hierarchical structure of WSIs without hard patch filtering. The method achieves up to 13.8% improvement across three cancer datasets.

PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation

PGR-Net proposes an explicit ROI-aware brain tumor MRI segmentation network that concentrates computational resources on lesion regions via a data-driven spatial prior template set \(\{(r_i, c_i)\}\) constructed from the training set, a hierarchical Top-K ROI selection mechanism, and a Window Gaussian-Spatial decay guidance module (WinGS-ROI). With only 8.64M parameters, the method achieves state-of-the-art performance on BraTS-2019/2023 and MSD Task01.
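
The ROI selection and Gaussian-decay guidance can be illustrated as follows; the function names, the max-combination of Gaussians, and the sigma value are assumptions for illustration rather than the paper's WinGS-ROI module:

```python
import numpy as np

def topk_roi_centers(score_map, k):
    """Pick the k highest-scoring 2D locations as ROI centers."""
    flat = np.argsort(score_map.ravel())[::-1][:k]
    return np.stack(np.unravel_index(flat, score_map.shape), axis=1)

def gaussian_guidance(shape, centers, sigma=4.0):
    """Build a spatial guidance map that decays with distance from each
    selected ROI center, concentrating attention on lesion regions."""
    rr, cc = np.mgrid[:shape[0], :shape[1]]
    g = np.zeros(shape)
    for r, c in centers:
        g = np.maximum(
            g, np.exp(-((rr - r) ** 2 + (cc - c) ** 2) / (2 * sigma ** 2)))
    return g
```

Multiplying feature maps by such a guidance map is one simple way to bias computation toward the selected ROIs while softly suppressing the background.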

Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting

This paper proposes ProtoSR, which leverages LLMs to mine a template-aligned visual prototype knowledge base from large-scale free-text radiology reports, and injects it into a structured report generation model via prototype-conditioned residuals (late fusion). ProtoSR achieves state-of-the-art performance on the Rad-ReStruct benchmark, with particularly significant gains on fine-grained attribute questions.

Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting

This paper proposes ProtoSR, which employs an LLM-driven pipeline to mine template-aligned visual prototype knowledge bases from 227,000 free-text MIMIC-CXR reports, and introduces a prototype-conditioned late-fusion module that injects retrieved prototype evidence as logit residuals into a hierarchical structured reporting model. ProtoSR achieves state-of-the-art performance on the Rad-ReStruct benchmark, improving L3 fine-grained attribute F1 from 4.3 to 7.4 (+72.1% relative gain).

RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation

This work introduces RDFace, a standardized benchmark comprising 456 pediatric facial images spanning 103 rare genetic diseases, and systematically evaluates phenotype-aware synthetic data augmentation (DreamBooth/FastGAN) for rare disease diagnosis under extremely low-sample regimes. DreamBooth-based augmentation achieves up to 13.7% improvement in diagnostic accuracy in the most data-scarce settings.

Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning

This paper identifies "Lost Layers" in CLIP's text encoder — intermediate layers whose removal paradoxically improves performance under Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL). The authors demonstrate that these layers are not redundant but rather underutilized due to visual domain shift, and propose the VtT model to reclaim this information at both the layer and encoder levels, achieving state-of-the-art performance.

Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration

This work selectively replaces the classical skull stripping (BET2) and tissue segmentation (FAST) modules in the SIENA longitudinal brain atrophy pipeline with deep learning alternatives (SynthStrip/SynthSeg). Evaluated on two large-scale longitudinal cohorts—ADNI (N=1006) and PPMI (N=310)—the proposed modifications substantially improve the correlation between PBVC and clinical disease progression (correlation coefficients increase by over 100%), while reducing scan-order error by up to 99.1%.

Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration

By replacing the classical skull stripping (BET2) and tissue segmentation (FAST) modules in the SIENA brain atrophy pipeline with deep learning alternatives (SynthStrip, SynthSeg), this work significantly improves the clinical sensitivity and robustness of PBVC estimation while preserving the interpretability of the overall pipeline.

RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference

This paper proposes RelativeFlow, a flow matching-based framework that decomposes the absolute noise-to-clean mapping into relative noisier-to-noisy mappings. By incorporating a consistent transport constraint and a simulation-based velocity field, RelativeFlow learns a unified denoising flow from heterogeneous noisy references, overcoming the reference bias limitation.

Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning

This paper proposes the Residual SODAP framework, which jointly addresses prompt-side representation adaptation and classifier-side knowledge preservation through: α-entmax sparse prompt selection with residual aggregation, data-free statistical distillation with pseudo-feature replay, prompt usage pattern drift detection (PUDD), and uncertainty-weighted multi-loss balancing. The framework achieves state-of-the-art performance on medical domain-incremental learning benchmarks.

Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning

This paper proposes the Residual SODAP framework, which jointly addresses representation adaptation (via α-entmax sparse prompt selection with residual aggregation) and classifier preservation (via statistical pseudo-feature replay and knowledge distillation) for domain-incremental learning without task IDs or data buffers, achieving state-of-the-art performance on three benchmarks: DR, Skin Cancer, and CORe50.

Robust Fair Disease Diagnosis in CT Images

This paper proposes a dual-objective training framework combining Logit-Adjusted Cross-Entropy (for class imbalance) and CVaR aggregation (for demographic fairness), achieving a gender-averaged macro F1 of 0.8403 with a fairness gap of only 0.0239 on CT disease diagnosis.
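
Logit-adjusted cross-entropy (in the style of Menon et al.) can be sketched in a few lines; the function signature is an illustrative assumption:

```python
import numpy as np

def logit_adjusted_ce(logits, labels, class_priors, tau=1.0):
    """Cross-entropy with tau * log(prior) added to each class logit.

    Rare classes get a smaller (more negative) additive offset, so the model
    must produce a larger margin for head classes, counteracting imbalance.
    """
    adj = logits + tau * np.log(np.asarray(class_priors))
    z = adj - adj.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()
```

With uniform priors the adjustment is a constant shift that cancels in the softmax, recovering plain cross-entropy; the CVaR aggregation would then be applied on top, averaging the loss over the worst-performing demographic groups.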

Robust Multi-Source Covid-19 Detection in CT Images

This paper proposes a multi-task learning framework that jointly trains a COVID-19 diagnosis head and a source hospital identification head (supervised by a logit-adjusted loss) on a shared EfficientNet-B7 backbone, encouraging the feature extractor to learn institution-invariant representations. The method achieves an F1 of 0.9098 on a multi-source CT dataset.

SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation

This paper proposes SD-FSMIS, a framework that adapts pretrained Stable Diffusion for few-shot medical image segmentation (FSMIS). Through a Support-Query Interaction module and a Visual-to-Text Conditioning Transformer, the framework achieves efficient adaptation, with particularly strong performance in cross-domain scenarios.

Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation

This paper proposes SCDL (Semantic Class Distribution Learning), a plug-and-play module that learns structured class-conditional feature distributions and aligns them bidirectionally with learnable class proxies via Class Distribution Bidirectional Alignment (CDBA). Combined with Semantic Anchor Constraints (SAC), which leverage annotated data to guide proxies toward correct semantics, SCDL mitigates both supervision bias and feature representation bias in semi-supervised medical image segmentation (SSMIS), achieving notable improvements on tail-class organs.

SCDL: Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation

This paper proposes SCDL, a plug-and-play semantic class distribution learning framework that addresses supervision bias and representation imbalance in semi-supervised medical image segmentation (SSMIS) via two components: Class Distribution Bidirectional Alignment (CDBA), which learns structured class-conditional feature distributions through proxy distributions, and Semantic Anchor Constraint (SAC), which guides proxy distributions toward true class semantics. SCDL achieves state-of-the-art performance on minority class segmentation.

SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation

This paper proposes SemiTooth, a framework that addresses annotation scarcity and cross-source domain discrepancy in multi-source CBCT tooth segmentation via a multi-teacher multi-student architecture and Strict Weighted Confidence (SWC) constraints. It also introduces MS3Toothset, the first multi-source semi-supervised tooth segmentation dataset.

SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation

This paper proposes SemiTooth, a multi-teacher multi-student semi-supervised framework coupled with a Stricter Weighted-Confidence (SWC) constraint, which effectively leverages multi-source unlabeled data for multi-source CBCT tooth segmentation and achieves cross-source generalization.

Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors

This paper proposes InvTag, a framework that, for the first time, integrates a physics-based MR forward model with a pretrained diffusion generative prior to jointly solve three sub-tasks in 3D Tagged MRI—anatomical recovery, Cine synthesis, and motion estimation—without requiring any additional training data.

Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis

This paper proposes STEPH, which efficiently transfers generalizable prognostic knowledge from multiple cancer-type models to a target cancer type via Task Vector Mixup (TVM) and hypernetwork-driven sparse aggregation, achieving an average C-Index improvement of 5.14% across 13 TCGA datasets without requiring large-scale joint training or multi-model inference.
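
Task vector arithmetic underlying TVM can be sketched in a few lines. Here a fixed top-k magnitude mask stands in for the hypernetwork-driven sparse aggregation (which, per the paper, learns both the mixing weights and the sparsity pattern); all names below are illustrative:

```python
import numpy as np

def task_vector(finetuned, base):
    """A task vector is the element-wise difference between fine-tuned
    and base weights."""
    return {k: finetuned[k] - base[k] for k in base}

def sparse_mixup(task_vectors, weights, keep_ratio=0.5):
    """Mix task vectors with convex weights, then keep only the
    largest-magnitude entries (a stand-in for learned sparse aggregation)."""
    mixed = {}
    for k in task_vectors[0]:
        v = sum(w * tv[k] for w, tv in zip(weights, task_vectors))
        thresh = np.quantile(np.abs(v), 1.0 - keep_ratio)
        mixed[k] = np.where(np.abs(v) >= thresh, v, 0.0)
    return mixed

rng = np.random.default_rng(0)
base = {"layer": rng.standard_normal(8)}
ft_a = {"layer": base["layer"] + rng.standard_normal(8)}   # cancer-type A model
ft_b = {"layer": base["layer"] + rng.standard_normal(8)}   # cancer-type B model
tvs = [task_vector(ft_a, base), task_vector(ft_b, base)]
merged = {k: base[k] + v for k, v in sparse_mixup(tvs, [0.6, 0.4]).items()}
```

The merged model is a single set of weights, which is why inference needs only one forward pass instead of running every source-cancer model.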

STEPH: Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in WSI Prognosis

STEPH proposes a model merging framework based on Task Vector Mixup (TVM) and hypernetwork-driven sparse aggregation, which efficiently transfers knowledge from multiple cancer-type-specific prognosis models into a target cancer model. It achieves a mean C-Index of 0.6949 across 13 TCGA datasets (+5.14% vs. cancer-type-specific learning, +2.01% vs. ROUPKT), while requiring only a single-model forward pass at inference—far more efficient than multi-model representation transfer approaches.

SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation

This paper proposes the SPEGC framework, which combines semantic-prompt-enhanced feature representations with a differentiable graph clustering solver to refine raw similarity matrices into higher-order structural representations. These representations guide the adaptation of medical image segmentation models to continuously shifting target domains, effectively mitigating error accumulation and catastrophic forgetting.

SVC 2026: The Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge

This paper presents the SVC 2026 challenge, which comprises two tracks (cross-domain multimodal deception detection and domain-generalized remote physiological measurement) together with a unified evaluation framework and baseline models; 22 teams submitted final results.

Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos

This work introduces SurgBlood, the first laparoscopic surgical video dataset with annotations for both bleeding regions and bleeding points, and proposes BlooDet, a SAM2-based dual-branch bidirectional guidance online detector that achieves joint bleeding region segmentation and bleeding point localization through synergistic optimization of Mask and Point branches.

T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation

This paper proposes a lightweight Temporal Gated Adapter (T-Gated Adapter) that injects adjacent-slice context into the 2D vision-language model CLIPSeg. Trained on only 30 annotated CT volumes, the method achieves an average Dice of 0.704 (+0.206), with consistent improvements on cross-domain zero-shot evaluation and CT-to-MRI cross-modal evaluation.
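
The gating idea can be sketched without the CLIPSeg specifics: a learned sigmoid gate decides, per channel, how much pooled adjacent-slice context to add back into the centre slice's features. Everything below (names, pooling, fusion rule) is an assumption, not the paper's exact adapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_gated_adapter(center, neighbors, w_gate, b_gate):
    """Hypothetical temporal gated adapter: pool adjacent slices, gate the
    pooled context with a learned sigmoid, and inject it residually."""
    context = neighbors.mean(axis=0)              # pool slice above/below
    gate = sigmoid(context @ w_gate + b_gate)     # per-channel gate in (0, 1)
    return center + gate * context                # residual gated injection

d = 16
center = rng.standard_normal(d)
neighbors = rng.standard_normal((2, d))           # adjacent CT slices
w_gate = rng.standard_normal((d, d)) * 0.1
b_gate = np.zeros(d)
fused = temporal_gated_adapter(center, neighbors, w_gate, b_gate)
```

Because the adapter is residual and the gate is bounded, the 2D backbone's behaviour is recoverable when the gate closes, which is what makes such adapters cheap to train on 30 volumes.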

Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model

This paper proposes Tell2Adapt, a unified framework for source-free unsupervised domain adaptation in medical image segmentation. It leverages the generalized knowledge of a vision foundation model (BiomedParse) to generate high-quality pseudo labels via Context-Aware Prompt Regularization (CAPR), then applies Visual Plausibility Refinement (VPR) to eliminate anatomically implausible predictions, and is validated across 10 domain transfer directions and 22 anatomical targets.

The Invisible Gorilla Effect in Out-of-distribution Detection

This paper reveals a previously unreported bias in OOD detection — the "Invisible Gorilla Effect": detection performance is substantially higher when OOD artifacts are visually similar to the model's region of interest (ROI), and degrades significantly when they are dissimilar, with feature-based OOD methods being most severely affected.

Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data

This paper proposes the Difficulty-Influence Quadrant (DIQ) data selection strategy, which jointly considers sample difficulty and gradient influence to enable VLM language backbones to match full-data SFT performance using only 1% of curated data, and to surpass full-data training with just 10%.
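
A quadrant-style selector is straightforward to sketch: split samples by median difficulty and median influence, then fill the budget from the preferred quadrants. The ranking policy below is an illustrative assumption, not the paper's exact DIQ rule:

```python
import numpy as np

def diq_select(difficulty, influence, budget):
    """Hypothetical Difficulty-Influence Quadrant selection: bucket samples
    by median difficulty and median gradient influence, then fill the
    budget preferring high-influence (and, as tiebreak, high-difficulty)
    samples."""
    difficulty = np.asarray(difficulty)
    influence = np.asarray(influence)
    hi_d = difficulty >= np.median(difficulty)
    hi_i = influence >= np.median(influence)
    score = hi_i.astype(float) * 2 + hi_d.astype(float)   # quadrant rank
    order = np.argsort(-(score + 1e-6 * influence))       # stable preference
    return np.sort(order[:budget])

rng = np.random.default_rng(0)
difficulty = rng.uniform(size=100)
influence = rng.uniform(size=100)
chosen = diq_select(difficulty, influence, budget=10)     # the "1% of data"
```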

Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging for Osteoporosis Classification

This paper proposes a fully automatic multi-region HR-pQCT segmentation framework based on SegFormer, combined with radiomic features and machine learning for binary osteoporosis classification. The key finding is that soft tissue (tendon/fat) features demonstrate greater diagnostic value than traditional bone features.

Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding

The core contribution of this paper is not merely an "ultrasound version of CLIP," but rather a redefinition of the image-text alignment objective around ultrasound-specific anatomical hierarchies and diagnostic attributes. The authors first construct the Ultrasonographic Diagnostic Taxonomy (UDT) and the large-scale US-365K dataset, then explicitly inject clinical relationships from text into contrastive learning via semantic soft labels and an attribute heterogeneous graph, yielding visual-language representations that are more genuinely "ultrasound-aware."
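
Replacing CLIP's one-hot identity targets with semantic soft labels is a small change to the loss. A sketch under assumed inputs (the taxonomy-derived relation matrix below is made up; the paper builds it from UDT and the attribute graph):

```python
import numpy as np

def soft_label_clip_loss(logits, soft_targets):
    """Contrastive loss with soft targets: each image row is matched to a
    distribution over texts reflecting semantic similarity, rather than a
    one-hot identity target as in standard CLIP InfoNCE."""
    logp = logits - logits.max(axis=1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=1, keepdims=True))
    return -(soft_targets * logp).sum(axis=1).mean()

rng = np.random.default_rng(0)
sim = rng.standard_normal((4, 4))            # image-text similarity logits
hard = np.eye(4)                             # standard CLIP target
# Illustrative relation matrix: smooth targets toward related findings.
related = np.array([[0, 1, 0, 0], [1, 0, 0, 0],
                    [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
soft = 0.8 * hard + 0.2 * related / related.sum(axis=1, keepdims=True)
loss_hard = soft_label_clip_loss(sim, hard)
loss_soft = soft_label_clip_loss(sim, soft)
```

The soft target stops the model from treating taxonomically close captions as pure negatives, which is the paper's point about injecting clinical relationships into contrastive learning.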

Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos

This paper proposes SMART, a Teacher-Student semi-supervised framework built upon SAM3's concept-prompt segmentation, integrating progressive confidence regularization and a dual-stream temporal consistency strategy to achieve state-of-the-art vessel segmentation in X-ray coronary angiography videos with minimal annotation.

UNIStainNet: Foundation-Model-Guided Virtual Staining of H&E to IHC

UNIStainNet is proposed as the first method to inject dense spatial tokens from the frozen pathology foundation model UNI directly into a generator as SPADE modulation signals. Combined with misalignment-aware losses and learnable stain embeddings, a single unified model simultaneously generates four IHC stains (HER2/Ki67/ER/PR), achieving state-of-the-art distributional metrics on the MIST and BCI benchmarks.

Unleashing Video Language Models for Fine-grained HRCT Report Generation

This paper proposes AbSteering, a two-stage framework that adapts general-purpose VideoLMs to HRCT report generation via abnormality-centric Chain-of-Thought reasoning and DPO-based hard-negative contrastive learning, substantially outperforming specialized CT foundation models on clinical efficacy metrics.

Unlocking Multi-Site Clinical Data: A Federated Approach to Privacy-First Child Autism Behavior Analysis

This paper proposes the first federated learning framework for child autism behavior recognition. Through a two-tier privacy strategy, 3D skeleton abstraction (identity removal) combined with federated optimization (data never leaves the site), the proposed approach achieves 87.80% accuracy on the MMASD dataset using the APFL personalized federated learning method, surpassing local training by 5.2% while satisfying HIPAA/GDPR compliance requirements.
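
The APFL recipe (Deng et al., 2020) personalizes by serving a convex mixture of each site's local model and the FedAvg global model. A minimal sketch with flat weight vectors standing in for skeleton-model parameters:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server step: size-weighted average of client model weights."""
    sizes = np.asarray(client_sizes, dtype=float)
    w = sizes / sizes.sum()
    return sum(wi * cw for wi, cw in zip(w, client_weights))

def apfl_personalize(local, global_, alpha=0.25):
    """APFL-style personalization: each site deploys a convex mixture of
    its local model and the shared global model (alpha is adapted during
    training in the original method; fixed here for illustration)."""
    return alpha * local + (1.0 - alpha) * global_

rng = np.random.default_rng(0)
clients = [rng.standard_normal(5) for _ in range(3)]   # per-site models
g = fedavg(clients, client_sizes=[100, 50, 50])        # shared global model
personalized = [apfl_personalize(c, g, alpha=0.25) for c in clients]
```

Only weight vectors cross the site boundary here, matching the "data never leaves the site" tier of the privacy strategy.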

Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework

This paper restructures per-instrument prompt parameters from isolated, independent prompts into a tree-structured hierarchy that progressively decomposes shared knowledge across layers. This design enables new instruments to inherit prior knowledge for rapid learning, while allowing new knowledge to gently revise existing representations, thereby simultaneously improving performance on new, regular, and old classes in surgical instrument class-incremental segmentation.

Unsupervised Domain Adaptation with Target-Only Margin Disparity Discrepancy

This paper addresses unsupervised domain adaptation (UDA) for CT→CBCT liver segmentation. It identifies a contradictory term in the classical MDD objective—where the feature extractor is optimized to maximize the discrepancy between \(f\) and \(f'\) on the source domain—and proposes Target-Only MDD, which removes this contradiction and minimizes prediction discrepancy on both domains. The method achieves state-of-the-art UDA performance in both 2D and 3D experiments.
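
The contradiction is easiest to see in a stripped-down disparity computation. Below, disparity between \(f\) and \(f'\) is approximated as cross-entropy of \(f'\) against \(f\)'s hard labels; this is a rough numpy sketch of the objective's structure, not the paper's margin formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def disparity(logits_f, logits_f_prime):
    """Prediction disparity between the main classifier f and the
    auxiliary classifier f': cross-entropy of f' vs. f's hard labels."""
    labels = logits_f.argmax(axis=1)
    p = softmax(logits_f_prime)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

rng = np.random.default_rng(0)
src_f, src_fp = rng.standard_normal((8, 3)), rng.standard_normal((8, 3))
tgt_f, tgt_fp = rng.standard_normal((8, 3)), rng.standard_normal((8, 3))

# Classical MDD adversarial term (sketch): minimizing it pushes the feature
# extractor to *maximize* source disparity -- the contradiction identified.
mdd_term = disparity(tgt_f, tgt_fp) - disparity(src_f, src_fp)
# Target-Only MDD (sketch): the adversary acts on the target domain only,
# and the feature extractor minimizes disparity on both domains.
target_only_term = disparity(tgt_f, tgt_fp)
```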

Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code

This paper proposes CodeBrain, which reformulates the any-to-any brain MRI modality imputation problem as a region-level full-stack quantised code prediction task. Through a two-stage pipeline (scalar quantisation reconstruction + grading-loss code prediction), it achieves unified missing modality synthesis and outperforms five state-of-the-art methods.

CodeBrain: Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code

CodeBrain reformulates any-to-any brain MRI modality imputation as a region-level full-stack quantised code prediction problem. Stage I encodes complete MRI sets into compact code maps and modality-agnostic common features via Finite Scalar Quantisation (FSQ); Stage II predicts code maps from incomplete modalities using a grading loss to preserve the smoothness of the quantisation space. CodeBrain surpasses five SOTA methods on IXI and BraTS 2023, and the synthesised modalities achieve brain tumour segmentation performance approaching that of real data.
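
Stage I's Finite Scalar Quantisation is small enough to show directly. A sketch of the core rounding step (odd level counts only; even counts need a half-step offset, and the encoder/decoder around FSQ are omitted):

```python
import numpy as np

def fsq(z, levels):
    """Finite Scalar Quantisation: bound each latent channel with tanh,
    then round it to one of `levels[i]` uniformly spaced values in [-1, 1].
    The codebook is implicit -- no learned embedding table is needed."""
    z = np.tanh(z)                                  # bound to (-1, 1)
    half = (np.asarray(levels) - 1) / 2.0
    return np.round(z * half) / half                # snap to the grid

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 3))                     # latent with 3 channels
codes = fsq(z, levels=[7, 7, 5])
# Each channel now takes at most levels[i] distinct values, so the full
# latent is a discrete code the Stage-II predictor can regress.
```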

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

This paper revisits the necessity of the text branch in zero-shot anomaly detection (ZSAD) and proposes VisualAD, a purely vision-based framework. Two learnable tokens (anomaly/normal) are inserted into a frozen ViT, enhanced by Spatial-Aware Cross-Attention (SCA) and a Self-Alignment Function (SAF). Without a text encoder, VisualAD achieves state-of-the-art performance across 13 industrial and medical benchmarks.

Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation

This paper proposes a weakly supervised teacher-student framework that leverages sparse pathological annotations and an EMA-stabilized teacher network to generate progressively refined pseudo-masks. Through confidence filtering, adaptive fusion, and curriculum-guided refinement, the framework achieves efficient segmentation of glandular structures in colorectal cancer pathology images.
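
Two of the named ingredients, the EMA-stabilized teacher and confidence filtering, have standard forms. A sketch with illustrative thresholds (the paper's adaptive fusion and curriculum schedule are not reproduced):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Exponential-moving-average teacher: slowly tracks the student,
    which stabilizes the pseudo-masks it produces."""
    return {k: momentum * teacher[k] + (1 - momentum) * student[k]
            for k in teacher}

def confidence_filtered_pseudo_mask(prob_map, threshold=0.9):
    """Keep only confident pixels; uncertain ones are marked -1 (ignore)
    so they contribute no gradient when training the student."""
    mask = (prob_map >= 0.5).astype(int)            # hard foreground/background
    confidence = np.maximum(prob_map, 1.0 - prob_map)
    mask[confidence < threshold] = -1               # drop uncertain pixels
    return mask

rng = np.random.default_rng(0)
teacher = {"w": np.zeros(4)}
student = {"w": np.ones(4)}
teacher = ema_update(teacher, student, momentum=0.9)
prob = rng.uniform(size=(8, 8))                     # teacher foreground probs
pmask = confidence_filtered_pseudo_mask(prob, threshold=0.9)
```

Raising the threshold over training (curriculum-guided refinement) would progressively admit more pixels as the teacher improves.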

X-WIN: Building Chest Radiograph World Model via Predictive Sensing

X-WIN is a chest radiograph world model that, for the first time, incorporates 3D CT spatial knowledge into CXR representation learning. By learning to predict 2D projections of CT volumes at varying rotation angles, the model internalizes 3D anatomical structure. Combined with affinity-guided contrastive alignment and structure-preserving domain adaptation, X-WIN achieves state-of-the-art linear probing performance across 6 CXR benchmarks.
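
The pretext task of predicting 2D projections of a rotated CT volume can be sketched in pure numpy. To stay dependency-free this toy version only rotates by multiples of 90 degrees via `np.rot90`; the paper uses arbitrary angles:

```python
import numpy as np

def axial_projections(volume, n_angles=4):
    """Sketch of the pretext task: parallel projections of a CT volume at
    several rotation angles (90-degree steps here), giving DRR-like 2D
    views the world model learns to predict from a single CXR."""
    return [np.rot90(volume, k, axes=(1, 2)).sum(axis=2)
            for k in range(n_angles)]

rng = np.random.default_rng(0)
ct = rng.uniform(size=(16, 32, 32))      # (depth, height, width) toy volume
views = axial_projections(ct)
# Rotation only permutes voxels, so every projection conserves total mass.
```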

XSeg: A Large-scale X-ray Contraband Segmentation Benchmark for Real-World Security Screening

This paper introduces XSeg, the largest X-ray contraband segmentation dataset to date (98,644 images, 295,932 instance masks, 30 fine-grained categories), and proposes APSAM, a domain-specialized model that leverages the physical dual-energy properties of X-ray imaging via an Energy-Aware Encoder (EAE) and an Adaptive Point Generator (APG) to intelligently expand user click prompts. APSAM achieves 72.83% mIoU, surpassing SAM fine-tuning by 4.96%.