🏥 Medical Imaging

🎞️ ECCV2024 · 28 paper notes

🔥 Top topics: Medical Imaging ×14 · Segmentation ×4 · Diffusion Models ×2 · Multimodal/VLM ×2

A Cephalometric Landmark Regression Method Based on Dual-Encoder for High-Resolution X-Ray Image: This paper proposes D-CeLR, an end-to-end regression method based on a dual-encoder architecture. Utilizing only Transformer encoders, it designs a three-stage framework comprising feature extraction, a reference encoder, and a finetune encoder to achieve coarse-to-fine cephalometric landmark detection, significantly outperforming existing SOTA methods in Mean Radial Error (MRE) and 2mm Successful Detection Rate (SDR) metrics.
A Rotation-Invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images: This paper proposes SRRM-ViT, which introduces a Statistical Rotation-invariant Reinforcement Mechanism (SRRM) into ViT to adaptively select key regions and fuse histogram statistical features. This achieves unbiased fine-grained classification of lesions at any radial position in endoscopic ultrasound (EUS) images of esophageal cancer, obtaining significant performance improvements on clinical and public datasets.
Adaptive Correspondence Scoring for Unsupervised Medical Image Registration: To address the issue of spurious reconstruction errors caused by confounding factors such as noise and occlusions in unsupervised medical image registration, this paper proposes an adaptive correspondence scoring framework (AdaCS). By learning pixel-wise correspondence confidence maps to re-weight error residuals, AdaCS consistently improves the performance of three mainstream registration architectures across three datasets in a plug-and-play manner.
Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation: This paper proposes AD-MT (Alternate Diverse Mean Teacher), which addresses confirmation bias in semi-supervised medical image segmentation by alternately updating two teacher models at random periodic intervals and applying an entropy-based conflict-combating strategy. It outperforms SOTA methods on the ACDC, LA, and Pancreas datasets.
Architecture-Agnostic Untrained Network Priors for Image Reconstruction with Frequency Regularization: This paper proposes three architecture-agnostic frequency regularization techniques (bandwidth-constrained input, bandwidth-controllable upsampling, and Lipschitz-regularized convolutional layers) to address the issues of architectural sensitivity, overfitting, and operational inefficiency in untrained network priors, significantly narrowing the performance gap among different architectures in MRI reconstruction tasks.
Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging: This paper proposes Brain-ID, a contrast-agnostic brain anatomical representation learning model. Through a "mild-to-severe" intra-subject image synthesis strategy, it is trained on fully synthetic data to obtain anatomical features robust to MRI contrast, resolution, orientation, and artifacts. With only a single-layer adaptation, it achieves SOTA performance on four downstream tasks and six public datasets.
Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals: This paper proposes a novel method for reconstructing videos from functional magnetic resonance imaging (fMRI) signals. Through multi-dataset, multi-subject training and a three-stage pipeline utilizing pre-trained text-to-video and video-to-video models, it achieves state-of-the-art (SOTA) video reconstruction capabilities across both datasets and subjects.
CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos: A reconstruction-based cardiac disease assessment framework, CardiacNet, is proposed. By utilizing a Consistency Deformation Codebook (CDC) and a Consistency Deformation Discriminator (CDD), the model learns structural and motion discrepancies between normal and abnormal echocardiogram videos, achieving state-of-the-art (SOTA) performance in ejection fraction prediction, pulmonary arterial hypertension classification, and atrial septal defect classification.
Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild: This work proposes Chameleon, a data-efficient vision generalist model based on meta-learning and token matching. It adapts to entirely new dense prediction tasks (including medical images, video, 3D, etc.) using only dozens of labeled images, significantly outperforming existing generalist methods across six downstream benchmarks.
ChEX: Interactive Localization and Region Description in Chest X-rays: This paper proposes ChEX, an interactive chest X-ray interpretation model that supports both text prompts and bounding box queries. Through a DETR-style prompt detector and multi-task joint training, ChEX achieves performance competitive with SOTA on 9 chest X-ray tasks while providing unique grounding interpretability and user interaction capabilities.
Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model: A context-conditioned joint diffusion model is proposed to simultaneously synthesize histopathology nuclei images, semantic labels, and distance maps. Precise control over the synthesis process is achieved through point map (centroid layout) and text prompt conditions, generating high-quality instance-level labels for downstream cell nuclei segmentation and classification tasks.
Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-Frequency Fusion and Uncertainty Correction: This paper proposes the SF-RecSAM model, which compensates for SAM's deficiencies in low-level feature extraction through a spatial-frequency feature fusion module. Additionally, a Dual False Corrector is designed to identify and correct false positive and false negative regions using uncertainty estimation, significantly outperforming SOTA methods on two breast ultrasound datasets, BUSI and UDIAT.
Energy-induced Explicit Quantification for Multi-modality MRI Fusion: This paper proposes E²PA, an energy-induced explicit propagation and alignment framework. Using two modules, Energy-guided Hierarchical Fusion (EHF) and Energy-regularized Space Alignment (ESA), it explicitly quantifies and optimizes inter-modality dependency propagation and information flow consistency in multi-modality MRI fusion, outperforming state-of-the-art (SOTA) methods across three public datasets.
GTP-4o: Modality-Prompted Heterogeneous Graph Learning for Omni-Modal Biomedical Representation: The authors propose GTP-4o, an omni-modal biomedical representation learning framework based on heterogeneous graphs. It explicitly models cross-modal relationships through heterogeneous graph embeddings, utilizes a graph prompting mechanism to complete missing modalities, and designs knowledge-guided hierarchical cross-modal aggregation. It achieves SOTA on glioma grading and survival prediction tasks.
I-MedSAM: Implicit Medical Image Segmentation with Segment Anything: This paper proposes I-MedSAM, which integrates the generalization capability of SAM with the continuous-space prediction advantages of Implicit Neural Representations (INR). By enhancing high-frequency boundary information with a frequency adapter and refining segmentation through uncertainty-guided sampling, it outperforms existing discrete and implicit methods with only 1.6M trainable parameters.
Improving Medical Multi-modal Contrastive Learning with Expert Annotations: This paper proposes eCLIP, which improves the representation quality of medical multi-modal contrastive learning by integrating radiologists' eye-tracking gaze heatmaps as an auxiliary supervisory signal, combined with mixup augmentation and curriculum learning strategies, without modifying the core CLIP architecture.
Is User Feedback Always Informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data: This paper finds that user feedback is not always beneficial in domain adaptation. "Negatively Biased Feedback" (NBF), which is biased towards correcting erroneous model predictions, degrades the performance of existing semi-supervised domain adaptation methods. To address this, the paper proposes Retrieval Latent Defending (RLD), which balances supervision signals by adding pseudo-labeled defending samples to each mini-batch.
OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding: This paper constructs OphNet, currently the largest video benchmark dataset for ophthalmic surgery (2,278 videos, 285 hours, 66 surgery types, 102 surgical phases, and 150 fine-grained steps). It supports four main tasks: surgery type recognition, phase recognition, temporal localization, and phase prediction, at a scale approximately 20 times larger than the largest prior benchmark for surgical workflow analysis.
Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification: The PEMP framework is proposed, which integrates prior pathology knowledge (visual exemplars + textual descriptions) into patch-level and slide-level prompts. Combined with CLIP for multi-instance prompt learning, it outperforms SOTA methods by an average of 4% on few-shot weakly supervised WSI classification tasks.
RadEdit: Stress-Testing Biomedical Vision Models via Diffusion Image Editing: This paper proposes RadEdit, a diffusion-based medical image editing method that introduces a dual-mask mechanism (edit mask and keep mask) to break spurious correlations in data, generating high-quality synthetic test suites to stress-test the robustness of biomedical vision models against dataset shift.
Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis: This paper proposes X-Gaussian, the first framework to apply 3D Gaussian Splatting (3DGS) to X-ray novel view synthesis. By designing a radiative Gaussian point cloud model (replacing spherical harmonics) and an angle-pose cuboid uniform initialization strategy (replacing SfM), it outperforms SOTA NeRF methods by 6.5 dB while achieving 73× inference acceleration with only 15% of the training time.
Shape-Guided Configuration-Aware Learning for Endoscopic-Image-Based Pose Estimation of Flexible Robotic Instruments: By leveraging the 3D shape prior of flexible robots to guide image feature learning, this work extracts part-level geometric representations and applies a dynamic shape deformation mechanism. This achieves highly accurate pose estimation of flexible robots from endoscopic images, significantly outperforming baseline methods such as keypoint detection, skeleton extraction, and direct regression in predicting both external orientation and internal bending angles.
NePhi: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration: NePhi proposes using neural implicit functions (SIREN) to replace traditional voxel-based deformation fields for representing deformations in image registration. By predicting latent codes via an encoder for fast inference and utilizing instance optimization to enhance accuracy, it matches SOTA precision in 3D lung and brain registration tasks while reducing training memory by fivefold, naturally yielding smooth, approximately diffeomorphic deformations.
TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data: This paper proposes the TIP framework, which jointly pre-trains on tabular data and images through three self-supervised tasks (masked tabular reconstruction, image-tabular matching, and contrastive learning) to learn multimodal representations that are robust to incomplete tabular data in downstream classification tasks.
Topology-Preserving Downsampling of Binary Images: This paper proposes the first topology-preserving binary image downsampling method based on discrete optimization (integer programming). By encoding the black-and-white decisions of downsampled pixels as Boolean variables, framing topology preservation as hard constraints, and using similarity to the original image as the objective function, the method guarantees that the downsampled results have the exact same Betti numbers (number of connected components and holes) as the original image, while maintaining competitive pixel-level similarity compared to traditional methods.
UMBRAE: Unified Multimodal Brain Decoding: This paper proposes UMBRAE, which aligns fMRI signals with image features using a universal brain encoder and feeds them into a frozen MLLM to achieve multimodal brain decoding (description, grounding, retrieval, visual reconstruction). It also introduces a cross-subject training strategy, enabling a single model to serve multiple subjects and outperform single-subject models.
Unleashing the Power of Prompt-driven Nucleus Instance Segmentation: The PromptNucSeg framework is proposed, which automatically generates nucleus center point prompts by training a prompter and fine-tunes SAM for nucleus-by-nucleus segmentation. It solves the overlapping nucleus segmentation problem by introducing neighboring nuclei as negative prompts, achieving SOTA performance on three benchmarks without complex post-processing.
Unsupervised Multi-modal Medical Image Registration via Invertible Translation: This paper proposes INNReg, which translates multi-modal medical images into a single modality using an invertible neural network, and then performs registration on the single-modality images. Combined with a barrier loss function based on normalized mutual information, it achieves registration accuracy superior to existing methods on MRI T1/T2 and MRI/CT datasets.
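
As a small illustration of the normalized mutual information (NMI) that the INNReg entry above refers to, the sketch below estimates NMI between two images from a joint intensity histogram. It is not the paper's barrier loss: the function name `normalized_mutual_information`, the `bins` parameter, and the choice of the (H(A) + H(B)) / H(A, B) form (one of several NMI definitions in use) are assumptions made for illustration only.

```python
import numpy as np


def normalized_mutual_information(a, b, bins=32):
    """Histogram estimate of NMI = (H(A) + H(B)) / H(A, B).

    Hypothetical helper for illustration; assumes both images have the
    same number of pixels and are not constant (so H(A, B) > 0).
    """
    # Joint histogram of paired pixel intensities, normalized to a distribution.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1)  # marginal distribution of image a
    py = pxy.sum(axis=0)  # marginal distribution of image b

    def entropy(p):
        p = p[p > 0]  # skip empty bins, since 0 * log(0) is taken as 0
        return -np.sum(p * np.log(p))

    return (entropy(px) + entropy(py)) / entropy(pxy.ravel())


rng = np.random.default_rng(0)
fixed = rng.random((64, 64))
unrelated = rng.random((64, 64))

nmi_same = normalized_mutual_information(fixed, fixed, bins=16)
nmi_unrelated = normalized_mutual_information(fixed, unrelated, bins=16)
```

Under this definition the score lies between 1 (statistically independent intensities) and 2 (one image is fully determined by the other), so `nmi_same` should sit at the upper end and `nmi_unrelated` near the lower end. Because the score depends only on the joint intensity statistics rather than on raw intensity differences, it is a common similarity measure when the two images come from different modalities.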