🏥 Medical Imaging
📷 CVPR2026 · 163 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (88) · 🧪 ICML2026 (28) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (74) · 📹 ICCV2025 (31)
🔥 Top topics: Medical Imaging ×79 · Segmentation ×34 · Multimodal/VLM ×24 · Reasoning ×12 · Diffusion Models ×11
- A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation
-
cryoDeRec utilizes a "generative noise modeling + physical imaging simulation" pipeline to generate paired tomograms consisting of "noisy inputs \(\leftrightarrow\) clean GT." This transforms cryo-ET denoising and missing wedge restoration, which previously relied on self-supervised methods, into fully supervised multi-task training. A single U-Net performs both tasks simultaneously, outperforming Topaz-Denoise / SC-Net / IsoNet across four real and two simulated datasets.
- Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
-
The HistoSelect framework is proposed to simulate the coarse-to-fine reasoning process of pathologists. Through a three-tier filtering mechanism consisting of tissue segmentation → Group Sampler → Patch Selector, and based on Information Bottleneck (IB) theory, irrelevant visual tokens are compressed. This achieves SOTA performance across three datasets while reducing computational overhead by approximately 70%.
- Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning
-
This paper proposes the UAAI framework, which introduces Active Inference to micro-gesture recognition for the first time. Through temporal frame selection guided by expected free energy (EFE), spatial attention, and UMIX uncertainty-aware augmentation, it achieves 63.47% on the RGB modality of the SMG dataset, significantly outperforming traditional RGB methods.
- AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
-
The study upgrades "point prototypes / isotropic balls" used as semantic anchors in UNet to differentiable granular balls with anisotropic vector scales. A bidirectional "Pixel Set ↔ Ball" aggregation-broadcasting mechanism serves as a semantic refiner for skip-connections, supplemented by two geometric regularizations to prevent anchor collapse. This approach yields consistent performance gains (average IoU +1.3~1.7%) across four medical segmentation benchmarks for both Rolling-UNet and U-KAN backbones.
- Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance
-
GaussM2ASR reformulates multi-contrast MRI arbitrary-scale super-resolution (ASSR) from "INR direct regression of pixel intensity" to "learning parameters for a set of anisotropic 2D Gaussian kernels." By using narrow kernels to fit high-frequency anatomical boundaries and wide kernels for smooth low-frequency regions, combined with three anatomy-driven modules to align structures with high-resolution reference images, it outperforms existing SOTA methods in PSNR/SSIM across IXI, BraTS, and fastMRI datasets.
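The narrow-versus-wide kernel idea can be illustrated with a minimal sketch of evaluating one anisotropic 2D Gaussian. The parameterization (a rotation angle plus two axis scales) and the function name are assumptions made for illustration, not the GaussM2ASR implementation.

```python
import math

def anisotropic_gaussian(dx, dy, scale_u, scale_v, angle):
    """Value of a 2D Gaussian kernel at offset (dx, dy) from its centre.

    scale_u / scale_v are the standard deviations along the kernel's own
    axes, and angle rotates those axes relative to the image axes.
    (Hypothetical parameterization, for illustration only.)
    """
    # Express the offset in the kernel's rotated coordinate frame.
    u = math.cos(angle) * dx + math.sin(angle) * dy
    v = -math.sin(angle) * dx + math.cos(angle) * dy
    return math.exp(-0.5 * ((u / scale_u) ** 2 + (v / scale_v) ** 2))

# A kernel that is narrow along x and wide along y, as one might use to
# follow a vertical anatomical boundary.
across_edge = anisotropic_gaussian(1.0, 0.0, scale_u=0.5, scale_v=3.0, angle=0.0)
along_edge = anisotropic_gaussian(0.0, 1.0, scale_u=0.5, scale_v=3.0, angle=0.0)
```

The kernel decays much faster across the boundary than along it, which is the property that lets narrow kernels preserve sharp edges while wide kernels cover smooth regions.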
- Adaptive Confidence Regularization for Multimodal Failure Detection
-
The ACR framework is proposed to systematically address misclassification detection in multimodal scenarios for the first time. By combining Adaptive Confidence Loss (penalizing the "confidence degradation" phenomenon where multimodal fusion confidence is lower than unimodal confidence) and Multimodal Feature Swapping (synthesizing failure samples in the feature space), ACR significantly outperforms existing methods across four datasets.
- Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
-
This work transfers 3D diffusion models pre-trained on natural videos (Wan 2.1) or public CT datasets (MAISI) to radiotherapy dose prediction. It introduces an "Any2Any" modality conditioning paradigm allowing any modality to serve as a generation target, followed by reinforcement learning post-training aligned with clinical Scorecards to match institutional preferences. It achieved a new SOTA on the GDP-HMM challenge, reducing voxel-level MAE from 2.07 to 1.93.
- BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation
-
The paper proposes BackSplit: a paradigm that subdivides the homogeneous "background" in lesion segmentation into semantic auxiliary organ/tissue classes for joint multi-class softmax training. Using Fisher information theory, it proves that this approach retains more information and produces more stable estimates than binary training, consistently improving Dice scores for small lesions across five datasets with zero additional inference overhead.
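The relabeling idea can be sketched as follows. This is a minimal illustration that assumes an auxiliary organ mask is already available; the label ids and function names are hypothetical and not taken from the paper.

```python
# Hypothetical label scheme: 0 = remaining background, 1 = lesion,
# 2 and above = auxiliary organ/tissue classes carved out of the background.
LESION = 1

def backsplit_labels(lesion_mask, organ_mask):
    """Build multi-class training targets from a binary lesion mask and an
    auxiliary organ mask (0 = no organ, k >= 1 = organ id k)."""
    targets = []
    for lesion_row, organ_row in zip(lesion_mask, organ_mask):
        row = []
        for is_lesion, organ_id in zip(lesion_row, organ_row):
            if is_lesion:
                row.append(LESION)        # the lesion label always wins
            elif organ_id > 0:
                row.append(organ_id + 1)  # shift organ ids past the lesion id
            else:
                row.append(0)             # leftover, unlabeled background
        targets.append(row)
    return targets

def collapse_to_binary(pred):
    """At inference, keep only the lesion class; everything else is background."""
    return [[1 if c == LESION else 0 for c in row] for row in pred]

lesion = [[0, 0, 1],
          [0, 0, 0]]
organs = [[1, 1, 1],
          [0, 2, 2]]
targets = backsplit_labels(lesion, organs)
binary = collapse_to_binary(targets)
```

Because the auxiliary classes are only extra softmax channels that get collapsed afterwards, the deployed model still answers the original lesion-versus-background question.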
- Benchmarking Endoscopic Surgical Image Restoration and Beyond
-
The authors constructed SurgClean, the first multi-source real-world endoscopic surgical image restoration dataset (3,113 images across desmoking, dehazing, and desplashing). They systematically evaluated 22 representative methods (12 general and 10 task-specific), revealing a significant gap between existing methods and clinical requirements, while analyzing the intrinsic differences between surgical and natural scene degradations.
- Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
-
This work presents the first systematic study of aggregation strategies from pixel-level uncertainty to image-level scores in segmentation tasks. It proposes SMR aggregators that integrate spatial structural information (Moran's I, Edge Density, Shannon Entropy) and a GMM-based meta-aggregator. Evaluation across 10 datasets demonstrates that global average (AVG) is a suboptimal choice, while GMM-All meta-aggregation performs robustly in both OoD and failure detection.
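To see why a global average can hide useful structure, here is a minimal sketch of Moran's I on a small uncertainty map. It assumes 4-neighbour (rook) adjacency with unit weights; the function name and the toy maps are illustrative and are not the SMR aggregators from the paper.

```python
def morans_i(grid):
    """Moran's I spatial autocorrelation with 4-neighbour unit weights."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    denom = sum((v - mean) ** 2 for row in grid for v in row)
    if denom == 0:
        return 0.0  # constant map: autocorrelation is undefined, report 0
    num = 0.0
    weight_total = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_total += 1
    return n * num / (weight_total * denom)

# Two toy uncertainty maps with the same mean (0.5) but different structure.
clustered = [[1, 1, 0, 0] for _ in range(4)]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

mean_clustered = sum(map(sum, clustered)) / 16
mean_checker = sum(map(sum, checkerboard)) / 16
i_clustered = morans_i(clustered)
i_checker = morans_i(checkerboard)
```

Both maps have identical average uncertainty, yet Moran's I is positive for the clustered map and negative for the checkerboard, so a spatially aware score can separate cases that AVG treats as equal.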
- Beyond the Static-World: Lifelong Learning for All-in-One Medical Image Restoration
-
To address the simultaneous "modality conflict" and "catastrophic forgetting" encountered in real-world clinical data streams for all-in-one medical image restoration (sharing one model for MRI SR, CT denoising, and PET synthesis), this paper proposes the ROME framework. It first maps different modalities to a unified Modality-Agnostic Manifold (MIDAB) via adversarial balancing, and then performs Adaptive Feature-level Consolidation (AFC) on this manifold to stabilize old knowledge, reducing average degradation after sequential training by over 10%.
- BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
-
Addressing the overlooked characteristic that "lesion evolution is bidirectional and asymmetric (including both new onset and resolution)" in longitudinal chest X-ray report generation, BiOTPrompt utilizes Bidirectional Optimal Transport to establish soft correspondences between current and historical images. By identifying "newly emerging regions" and "disappearing regions" through the asymmetry in transport quality between the two directions, their spatial coordinates are encoded into prompts to guide LLM report generation. A visual-textual consistency constraint is introduced to suppress hallucinations. The model achieves SOTA results in NLG and clinical metrics (CE-F1 0.417) on the Longitudinal-MIMIC dataset.
- Breaking the Continuum: Discrete Distribution Learning for Structural MRI Reconstruction
-
For undersampled MRI reconstruction, DiCoS moves away from the "single-trajectory" continuous manifold refinement used in diffusion models. Instead, it employs a discrete prior network to generate \(K\) anatomical candidates, applies extremely short micro-diffusion cycles for texture refinement and data consistency projection, and uses a Dual-domain Balancing Score (k-space + image domain) to select the best hypothesis. It achieves SOTA quality on fastMRI knee/brain datasets with significantly lower inference time (PSNR >1.4 dB higher than the runner-up at 12× acceleration).
- Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
-
CineNeuron adopts the dual-pathway mechanism of the brain—"bottom-up perception + top-down memory." It first uses multi-task alignment to map noisy fMRI signals into a semantic space that simultaneously encodes images, text, actions, and categories. Then, it utilizes the Mixture-of-Memories (MoM) module to retrieve and fuse multimodal "memories" from historical samples to complete details, eventually driving a video diffusion model. It comprehensively outperforms SOTA on the cc2017 and CineBrain fMRI-to-video benchmarks.
- Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
-
DFGNet treats the RGB image of H&E pathological slides and its extracted Hematoxylin component as a pair of complementary representations. By jointly modeling them with a triplet of Reciprocal Cross-scale Dynamic Fusion (RCDF), Interleaved point-Guided Attention (IGA), and Entropy Confidence Aggregation Unit (ECAU), it achieves SOTA performance on three public nuclei segmentation datasets under the weakly supervised setting using only point annotations.
- Building Robust Vision Encoders for Cross-Dataset Evaluation in Immunofluorescent Microscopy
-
Addressing the inconsistency in channel counts and configurations across immunofluorescence (IF) microscopy laboratories, which prevents existing models from migrating to "unseen channel combinations," this paper proposes C3R. It leverages biological priors to split channels into "context" and "concept" groups. By utilizing the CCE architecture with group-independent encoding and the MCD (Masked Context Distillation) pre-training strategy, the frozen encoder achieves SOTA performance on OOD datasets with unseen channel configurations without retraining.
- CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
-
CG-Reasoner integrates a lightweight encoder-decoder with LLaVA-Med and introduces a Text2Centroid module that regresses reasoning text into lesion centroid coordinates. This enables the model to produce spatially grounded, interpretable reasoning text alongside segmentation masks. Additionally, the proposed PRScore measures semantic, spatial, and visual consistency, achieving performance close to or exceeding SOTA across six medical imaging modalities.
- CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection
-
This paper proposes CHIPS, a data selection method based on curvature-aware hybrid influence. It calculates Newton-style alignment scores in the CLIP endpoint subspace and combines them with learnability and domain relevance weights. With only 30% of the data, it matches the effect of continued pre-training (CPT) on the full dataset, achieving SOTA across 17 medical benchmarks.
- Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis
-
MEDVCR enables medical video diagnosis models to perform counterfactual reasoning similar to physicians (i.e., "how would this tissue look if it were benign?"). It utilizes diffusion models to synthesize tissue evolution under different pathological hypotheses, constrains representation learning with three clinical rules, and integrates the comparison of "factual observation vs. counterfactual hypothesis" into predictions. It improves Recall@1 / AP to 93.0% (+10.2%) and 94.8% (+2.6%) on hysteroscopic biopsy localization and colonoscopic polyp detection, respectively.
- CMR-RD: Long-Tailed Adaptive VLM for Explainable CMR Diagnosis
-
CMR-RD is the first vision-language model for explainable cardiac magnetic resonance (CMR) diagnosis. It establishes a foundation through "medical alignment + Chain-of-Thought (CoT) cold start," then actively strengthens rare disease categories using GPPO—a multi-phase reinforcement learning algorithm with Thompson sampling for dynamic quota allocation. By incorporating lesion IoU grounding into the reward function, it achieves the highest accuracy and the most reliable reasoning chains across six types of heart disease.
- CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
-
To address the domain shift between "expert dermoscopic images" and "handheld clinical images," CoFiDA-M uses MONET clinical concept scores (privileged information) to guide FiLM-based visual feature editing during training, creating a "clinically reasoning" teacher. The teacher's edited features are then distilled into an image-only student. This allows the student to maintain high AUROC and melanoma recall across six unseen datasets without relying on any concept metadata during deployment.
- Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
-
Addressing the restricted diversity issue in weakly supervised WSI classification where pseudo-bag augmentation "only samples within one or two bags," C2Aug constructs pseudo-bags by sampling instances across all same-class bags in the dataset (addition-and-merge rather than reduction-and-merge). It utilizes bag-level and group-level contrastive learning to mitigate the side effect of "reduced small tumor region samples," achieving superior AUC over existing augmentation methods on CAMELYON-16, TCGA-LUNG, and TCGA-BRCA datasets.
- CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration
-
CRFT is proposed as a unified coarse-to-fine cross-modal image registration framework. It learns modality-independent feature flow representations within a Transformer architecture, utilizing \(1/8\) resolution global correspondence in the coarse stage and \(1/2\)-\(1/4\) multi-scale local refinement in the fine stage. Combined with iterative discrepancy-guided attention and Spatial Geometric Transform (SGT) to recursively refine the flow field and capture subtle spatial inconsistencies, it outperforms SOTA methods like RAFT, GMFlow, and LoFTR across various cross-modal datasets including optical/infrared, SAR, and multispectral.
- Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
-
Addressing the scarcity of PET labels by transferring knowledge from label-rich MRI, this paper proposes the DSDA framework. It explicitly decouples "classification-relevant critical brain regions" from "classification-irrelevant non-critical regions" using a brain region importance map. It then applies differential processing: non-critical regions undergo topological weighted alignment to eliminate domain discrepancy, while critical regions undergo high-confidence feature fusion to preserve pathological discriminative information. The method achieves 86.6%/87.7%/88.9% accuracy on ADNI/AIBL/PPMI respectively, setting a new SOTA.
- Cross-Modal Guided Visual Synthesis for Data-Efficient Multimodal Depression Recognition
-
This work utilizes audio and text as conditions to synthesize novel "visual behavioral features" at the feature level via a CVAE to alleviate clinical depression data scarcity. The synthesis process is backward-guided by the loss of the downstream recognizer, ensuring synthesized features prioritize "utility for recognition" over mere "realism," achieving SOTA on DAIC-WOZ and E-DAIC.
- CROWn: A Unified 3D Medical Segmentation Framework Integrating Anti-Aliased Downsampling and Phase-Calibrated Fusion
-
CROWn integrates sampling theory into the two stages of U-shaped segmentation networks most prone to information loss, downsampling and skip-connection fusion. It uses µPCAD for "pooling query × wavelet subband value" co-attention with explicit anti-aliasing low-pass filtering during extraction, and OCF to decompose high-resolution skip connections into eight phase cosets followed by phase attention and edge-gated alignment. It achieves comprehensive SOTA in IoU/Dice across 15 CT/MRI/OCT datasets with only 23.78M parameters.
- CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
-
This paper proposes CURE, an error-aware, curriculum-guided multi-task training framework. Without introducing additional data, it dynamically adjusts the sampling distribution to focus on difficult samples, improving visual grounding accuracy by +0.37 IoU and reducing the hallucination rate by 18.6% for medical VLMs.
- D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
-
The "convex segmentation" prior is reformulated from a global constraint on binary sets into a quasi-concavity constraint on the network's output probability map \(u\). This yields a threshold-free, differentiable, and densely computable convexity loss. Further, a Convex Gradient Projection Module (CGPM) is used to enforce hard convexity during inference. The method consistently improves Dice/IoU and reduces Hausdorff distance for near-convex structures such as retinal and cardiac anatomy.
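For reference, the standard definition of quasi-concavity explains why such a constraint needs no threshold. The symbol \(\Omega\) for a convex image domain and \(t\) for a threshold are notation assumed here, not taken from the paper.

\[
u\big(\lambda x + (1-\lambda) y\big) \;\ge\; \min\{u(x),\, u(y)\} \quad \forall\, x, y \in \Omega,\ \lambda \in [0,1]
\;\iff\;
\{x \in \Omega : u(x) \ge t\} \text{ is convex for every } t .
\]

In other words, if the probability map \(u\) is quasi-concave, the binary mask obtained by thresholding it at any level is convex, so the prior can be imposed on \(u\) directly.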
- D2T2 - Multimodal Automated Planning for Brachytherapy
-
D2T2 utilizes a two-stage network—where a DiT predicts dwell times for each position and a physical layer linearly combines these into dose—to directly predict clinically deliverable brachytherapy machine parameters. Combined with a proxy network that renders the Gamma index into a differentiable loss, the model achieves higher accuracy than current SOTA and reduces planning time from tens of minutes to 0.1 seconds in a single forward pass.
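The "physical layer" step, where dose is a linear combination of dwell times, can be sketched as follows. The dose-rate kernel values and the function name are made up for illustration; the actual kernel used by D2T2 is not described in these notes.

```python
def dose_from_dwell_times(dwell_times, kernel):
    """Dose at each voxel as a linear combination of dwell times.

    kernel[i][j] is the (hypothetical) dose rate delivered to voxel i per
    unit time the source spends at dwell position j.
    """
    return [sum(rate * t for rate, t in zip(row, dwell_times)) for row in kernel]

# Toy setup: three voxels, two dwell positions (illustrative numbers only).
kernel = [[2.0, 0.5],
          [0.5, 2.0],
          [1.0, 1.0]]
dwell_times = [1.0, 3.0]
dose = dose_from_dwell_times(dwell_times, kernel)
doubled = dose_from_dwell_times([2.0 * t for t in dwell_times], kernel)
```

Because the map from dwell times to dose is linear, it is differentiable by construction, which is what allows a dose-level loss to be propagated back to the predicted dwell times.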
- DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
-
DARC separately treats two types of spurious correlations in multi-label chest X-ray classification—shortcut learning from non-pathological visual confounders and feature entanglement caused by pathological co-occurrence—using a global stream for backdoor adjustment and a local stream for counterfactual reasoning. These are fused at the logit level, leading to superior classification performance, interpretability, and robustness.
- Decoupling Vision and Language: Codebook Anchored Visual Adaptation
-
This paper proposes CRAFT, which decouples the vision encoder from the language model through a discrete codebook. By fine-tuning only the vision encoder, domain adaptation is achieved in a way that allows the adapted encoder to be seamlessly reused across different LLM architectures, resulting in an average improvement of 13.51% across 10 domain benchmarks.
- Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
-
This work utilizes a frozen medical vision foundation model (MedSAM2) to extract features, performs Singular Value Decomposition (SVD) on class-wise feature matrices, and quantifies the Shannon entropy of their energy distribution. This yields a label-free "Aleatoric Uncertainty Value (AUV)" to characterize sample difficulty and noise. This value drives two plug-and-play strategies—"Data Filtering" and "Dynamic Uncertainty-aware Optimization (DUO)"—achieving consistent segmentation performance gains across five CT/MRI datasets.
- Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
-
DAE transforms a Vision Foundation Model (Depth Anything v2) into a unified self-supervised endoscopic depth network through "Dual-layer MoE adaptation + Learnable Gradient Harmonization + Semantic Distribution Calibration." Without depth annotations, it achieves State-of-the-Art (SOTA) performance in both zero-shot and in-domain depth estimation across diverse procedures like laparoscopy and colonoscopy.
- Diffusion-Based Native Adversarial Synthesis for Enhanced Medical Segmentation Generalization
-
This paper points out that when diffusion models are used for medical segmentation data augmentation, the true driver of generalization is not visual realism but "synthetic adversariality" (the empirical loss induced by synthetic samples). Furthermore, only native adversariality residing on the manifold is effective. Based on this, a lightweight plugin, Adversariality Miner, is proposed. By resampling initial noise without modifying or retraining the frozen diffusion model, it amplifies native adversariality, further improving downstream Dice gains by 4–5 points across multiple medical segmentation benchmarks.
- Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
-
To address the unique geometric structure of diffusion MRI (dMRI) data—where each volume corresponds to a sampling direction on a sphere and protocols vary across subjects—this paper proposes D-RoPE, a generalization of Rotary Positional Embedding (RoPE) to the diffusion spherical space. Combined with a Transformer using alternating spatial/diffusion attention and Masked Autoencoder (MAE) pre-training, the learned general representations achieve approximately 6% higher accuracy in Mild Cognitive Impairment (MCI) classification and a 0.05 improvement in correlation coefficients for cognitive score regression compared to baselines.
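As background, the sketch below shows standard 1D RoPE, which D-RoPE is described as generalizing. It is not the D-RoPE formulation, whose spherical construction is not detailed in these notes; the function names and the frequency base are the usual conventions, assumed here.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply standard rotary positional embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each consecutive pair of features is rotated by a position-dependent angle.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
```

The two attention scores agree up to floating-point error because rotations compose, so the score depends only on the relative position. A generalization to diffusion directions would presumably aim to preserve an analogous relative property between sampling directions on the sphere.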
- Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction
-
MCI-Diff utilizes a baseline sMRI to "reconstruct" longitudinal imaging features for the future 6–36 months. It trains a diffusion model via multi-task sequence reconstruction to address irregular follow-up intervals and employs a fine-tuned LLM as a "linguistic compass" to score candidate features based on clinical biomarkers. This steers autoregressive generation toward clinically plausible trajectories, improving early MCI conversion prediction accuracy by 5–12% while maintaining immediacy.
- Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
-
To address the "small organs drowned by large organs" issue in multi-organ semi-supervised segmentation, DCA employs a "divide and conquer" strategy using a shared encoder and three asymmetric expert decoders tailored for head, medium, and tail classes. By integrating predictions and features through logit concatenation and a Dynamic Feature Aggregation Module (DFAM), it produces unbiased results, pushing the average Dice from 68.4 to 73.2 on the Synapse 20% labeled benchmark.
- DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging
-
Addressing dynamic domain-incremental scenarios in real-world clinical practice where "imaging equipment/institutions/diseases constantly change and label spaces expand," DK-DDIL introduces a differentiable dynamic-rank LoRA adapter (DAM) to automatically scale model capacity based on domain complexity. It utilizes a knowledge inheritance mechanism (KIR) combining model fusion and prototype contrastive learning to suppress catastrophic forgetting. Without replaying historical data, it outperforms existing DIL methods on skin pathology, 3D MRI, and OfficeHome benchmarks while training only 0.26% of parameters.
- Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
-
To address the train/test distribution drift in Medical VQA, DuCoR introduces pseudo-answers from test samples into training. It adaptively fuses dual-level complementary signals—"loss-level confidence" (modeling clean/noisy loss distributions) and "feature-level confidence" (measuring the distance from sample representations to pseudo-answer prototypes)—to estimate per-sample reliability weights for weighted pseudo-supervision. This improves performance across multiple Medical VQA benchmarks and significantly enhances cross-domain generalization.
- Dual-Level Hypergraph Generation for Addressing Feature Scarcity in Whole-Slide Image Classification
-
Addressing the dual scarcity of minority class samples (ITC, micrometastasis) and positive nodes in quaternary lymph node metastasis classification, this paper proposes Dual-HGNet. It uses a category-prompt-guided hierarchical hypergraph VAE at the hypergraph level to synthesize topologically consistent minority hypergraphs and employs anchor-diffusion mixup at the node level to enhance high-attention positive node features. This approach significantly improves minority class recognition (e.g., ITC F1 on NIMM increased from 52.7 to 57.1).
- Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration
-
Addressing the combinatorial explosion of feature relationships in Deformable Medical Image Registration (DMIR) caused by "dual-image inputs," this paper proposes DySNet. It utilizes an AdSB module to dynamically deform the receptive field (shrinking the search space) and a DySA module to dynamically generate attention weights (calibrating the search direction). These two dynamic mechanisms are unified into a single dynamic convolution kernel. On 3D cardiac CT and 3D/2D brain MRI tasks, it achieves an average Dice of 82.0%, outperforming 8 SOTA methods.
- EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
-
This paper proposes EchoPOSE, a Transformer-based network that automatically regresses the 6D pose (3 translations + 3 rotations) of 5 sparse 2D ultrasound slices typically collected in clinical practice. By feeding the posed segmentation masks into a Graph Harmonic Deformation (GHD) algorithm, the 3D shape of the left ventricle (LV) is reconstructed across the cardiac cycle. On synthetic MITEA data, it achieves a pose error of 3.78 mm / 8.65°, 87.5% Dice, and 1.44% Ejection Fraction (EF) error, outperforming the clinical gold standard Simpson’s biplane method without any external tracking hardware.
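For readers unfamiliar with the EF metric, a minimal sketch of the standard ejection fraction formula follows. The volumes are made-up numbers, and reading the reported EF error as an absolute difference in percentage points is an assumption.

```python
def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml

# Illustrative volumes only (millilitres), not values from the paper.
ef_reference = ejection_fraction(120.0, 50.0)
ef_predicted = ejection_fraction(118.0, 50.0)
ef_error = abs(ef_predicted - ef_reference)  # assumed: error in percentage points
```

Since EF is a ratio of reconstructed volumes, small errors in the LV shape at end-diastole or end-systole translate directly into EF error.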
- EchoVDiff: Cardiac-Cycle Echocardiography Video Generation from Arbitrary Single Frame
-
EchoVDiff explicitly equips echocardiogram video generation with a "cardiac phase axis." By fitting left ventricular area changes into a continuous cyclic phase via multi-task learning, and then using two phase-conditioned diffusion models to reconstruct physiologically consistent ED→ES→ED cardiac cycle videos from an arbitrary single frame, it reduces FVD from 630 to 535 on EchoNet-Dynamic.
- EEGiT: Teaching Vision Transformers to Understand the EEG signal
-
EEGiT "paints" 1D EEG time-series signals into 2D EEG patches similar to image patches. This allows a ViT pre-trained on ImageNet-21K to be directly used as an EEG encoder, leveraging visual priors from the image domain to alleviate EEG data scarcity. It achieves SOTA performance in both THINGS-EEG retrieval and EEG-3D classification.
- Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
-
Addressing the pain point where unrolled networks for 3D inverse problems cause memory explosion because "network steps must run on the full-resolution volume," this paper employs domain partitioning (reconstructing one patch while treating the rest as known context) and a diagonal-circulant matrix approximation for the normal operator \(A^\top A\). This allows an unrolled network with a forward operator to be trained and deployed on a single GPU for \(501^3\) voxel sparse-view CBCT and multi-coil accelerated MRI, achieving SOTA performance.
- EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease
-
This paper proposes EMAD, an end-to-end multimodal vision-language framework that generates structured reports for AD diagnosis. It explicitly associates each diagnostic statement with clinical evidence and 3D brain anatomy through hierarchical Sentence–Evidence–Anatomy (SEA) Grounding and ensures clinical consistency via executable rule-driven GRPO reinforcement fine-tuning.
- Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
-
This paper proposes the PAMS (Priority-Aware Mistake Severity) method, which significantly reduces the risk of severe misdiagnosis in multiclass MIL WSI diagnosis through Asymmetric Mistake Severity Cross-Entropy loss (MSCE), Semantic Feature Remix (SFR), and Asymmetric Mikel's Wheel metrics.
- F\(^2\)-Assist: Multi-Phase Fetal Growth Forecast and Report Generation from Ultrasound Examination
-
F\(^2\)-Assist feeds multi-organ ultrasound images and continuous biometry (HC/AC/BPD/FL) from multiple prenatal examinations into a unified multimodal LLM. By employing "Cross-Phase Organ Alignment," "History-Aware Temporal Encoding," and "Growth Parameter Adapter," it predicts the next-phase fetal growth parameters and generates ultrasound reports simultaneously, improving the numerical prediction R² from the previous SOTA of 0.59 up to 0.78.
- Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
-
This paper addresses the clinical scenario where multimodal data (e.g., genomics, pathology reports) is available during training, but only Whole Slide Images (WSI) are available during inference. It proposes using WSI as an anchor to retrieve multimodal features of similar patients from a memory bank with soft re-ranking, followed by factorized cross-attention to reconstruct proxy representations of missing modalities into three paths: "modality-specific," "shared with WSI," and "shared with other modalities." Finally, it employs a full-modality teacher for hierarchical anchor distillation. Across 24 missing modality scenarios in 8 cancer types, it improves the survival prediction C-index to 0.617, a relative gain of ~8.5% over histology-only baselines, lagging only ~1.4% behind the full-modality upper bound.
- FBTA: Enabling Single-GPU End-to-End Gigapixel WSI Classification with Feature Bridging and Translation Alignment
-
FBTA employs "pseudo-bag proxy + feature translation + three-view consistency constraint" to compress Multiple Instance Learning (MIL) of gigapixel Whole Slide Images (WSI) into a single 24GB GPU for true end-to-end training. Compared to direct full-image end-to-end approaches, it achieves a speedup of over 100\(\times\) and provides plug-and-play performance gains across three MIL architectures and two feature extractors (e.g., ABMIL accuracy +15.8% on STAD).
- fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
-
fMRI-LM utilizes a three-stage framework that first discretizes brain signals into tokens aligned with the text embedding space, and then enables a pre-trained LLM to model brain activity as a predictable and describable "language." By compensating for the lack of natural pairs with a synthetic fMRI-to-text description corpus, a single model handles diverse tasks in zero-shot/few-shot settings, including sex, age, fluid intelligence, and AD/ADHD/ASD diagnosis. Furthermore, LoRA fine-tuning matches or even surpasses the performance of full fine-tuning.
- Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
-
This paper proposes FPRL, a cognition-inspired hierarchical self-supervised framework that mitigates motion bias by first "focusing" on key static intra-frame semantics and then "perceiving" inter-frame contextual evolution, achieving SOTA on 11 endoscopic datasets.
- Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
-
This work redefines "Few-Shot Medical Image Segmentation (FSMIS) using SAM" as a background point prompt localization problem. It proposes FoB, a plug-and-play prompt generator that generates accurate background prompt points outside foreground boundaries through background prototype construction, background-centric contextual modeling, and structure-guided iterative refinement. This constrains SAM's over-segmentation and significantly advances the SOTA in FSMIS across three medical datasets.
- Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay
-
This paper proposes FORGE, the first continual learning framework specifically designed for cross-site fMRI brain disorder diagnosis. It utilizes a structure-aware VAE to generate realistic functional connectivity (FC) matrices for privacy-preserving generative replay. Combined with dual-level knowledge distillation and a hierarchical contextual bandit sampling strategy, it effectively mitigates catastrophic forgetting.
- Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
-
PRIMED introduces Retrieval-Augmented Generation (RAG) into the continual learning of medical VLMs. It utilizes an 18-million-scale multi-modal medical retrieval library and a 3,000-item question pool as "dynamic memory." During fine-tuning, image-text pairs are retrieved in real-time as replay data. Combined with Contrastive Knowledge Distillation and Dynamic Fisher Weight constraints, it achieves SOTA across all metrics on the self-developed MGTIL benchmark.
- From Infusion to Assimilation Distillation for Medical Image Segmentation
-
Addressing the issue where existing Knowledge Distillation (KD) "simply pours knowledge in without allowing the student to digest it," leading to degraded generalization, this paper proposes a two-stage framework, IAD. It first "infuses" the semantic knowledge of a SAM teacher into a lightweight student via soft labels and class-weighted prototype alignment, then "assimilates" the knowledge through contrastive semantic self-optimization and reverse feature constraints to preserve the student's inherent advantages. It achieves Dice improvements of 4.32%, 1.85%, and 2.42% on the Synapse, ACDC, and Polyp datasets, respectively, with an average cross-dataset generalization gain of 4.16%.
- From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature
-
Addressing the issue that biomedical literature figures are typically composite images containing multiple panels and annotated arrows, whereas existing VLP methods compress the entire figure into a single coarse image-text pair, this paper proposes the Panel2Patch data pipeline. It utilizes off-the-shelf LVLMs to automatically decompose literature figures into three levels of aligned image-text pairs: "Global-Panel-Region." Combined with a zoom-in pretraining framework featuring cross-layer message passing, the method achieves SOTA on multiple biomedical benchmarks using only 1/6 of the data compared to previous works.
- Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
-
Gastric-X constructs a multimodal benchmark of 1.7K cases based on real-world gastric cancer clinical workflows. It aligns four-phase 3D CT, endoscopy images, structured biochemical indicators, and clinical reports at the patient level. By defining five tasks—VQA, report generation, cross-modal retrieval, staging classification, and lesion detection—it systematically evaluates six general/medical VLMs, revealing a significant gap in the ability of current models to achieve "cross-modal evidence corroboration."
- GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
-
GaussianPile is proposed to extend 3D Gaussian Splatting from surface appearance modeling to slice-based volumetric reconstruction by introducing a focus-aware physical imaging model (Focus Gaussian). It achieves high-quality volumetric compression and reconstruction on ultrasound and light-sheet microscopy data, performing \(11\times\) faster than NeRF-based methods and achieving \(16\times\) compression compared to voxel grids.
- GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
-
GeneVAR reformulates the synthesis of H&E pathology tiles from RNA-Seq expression profiles into a multi-scale coarse-to-fine autoregressive process. By embedding an RNA-conditioned Causal MeanFlow module within the autoregressive trajectory, it utilizes average velocity fields and counterfactual interventions to disentangle genuine gene-driven morphology from non-biological confounders like staining and contrast. It achieves SOTA performance in FID and downstream classification across five TCGA cohorts.
- GenTract: Generative Global Tractography
-
GenTract reformulates brain white matter tractography from a local search of "step-by-step progression along local directions" into a global conditional generation task that "samples entire streamline coordinates in parallel, conditioned on whole-brain dMRI." By utilizing a VAE to encode fODF and a conditional Transformer (Diffusion / Flow Matching), it achieves SOTA precision on high-quality data and outperforms the second-best method by up to ~3.5x in low-resolution and noisy scenarios.
- GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
-
Addressing two issues that arise when Mamba flattens 2D images into 1D sequences, namely that information tends to "propagate by scanning order rather than semantic relevance" and that spatial and channel modeling become decoupled ("spatial-channel decoupling"), GeoSemba introduces the Semantic-guided State Refiner (SSR) for geometrically conditioned cross-region semantic propagation and the Cross-dimensional Affinity Refiner (CAR) for coarse-to-fine spatial-channel selective enhancement. It improves on prior segmentation accuracy across six medical modalities with lower computational overhead.
- GH-NAF: Grid-Adaptive Hash-Level-Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
-
GH-NAF introduces a "spatial-position adaptive hash resolution level selection" attention mechanism to NeRF-style CBCT reconstruction. Combined with differentiable discrepancy-aware rendering and uncertainty-weighted supervision, the model suppresses high-frequency noise in homogeneous tissues while preserving details at structural boundaries, improving both intra-material contrast and edge sharpness in real CBCT.
- GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
-
This work aligns complementary "Graph Structure" and "Topological Persistent Homology" views of neuronal skeletons into a shared embedding space using CLIP-style symmetric InfoNCE. The graph encoder (TreeLSTM) captures local geometry, while the visual encoder (DINOv2 processing 3-channel persistence images) captures global topology. This approach achieves SOTA on 5 out of 6 neuronal morphology benchmarks, with gains up to 4.9 points in self-supervised settings over the previous generation.
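As a rough illustration of the CLIP-style symmetric InfoNCE objective named in the entry above, the following pure-Python sketch contrasts matched graph and topology embeddings within a batch. The function names, the temperature of 0.07, and the toy embeddings are assumptions for illustration and are not taken from the paper.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def symmetric_info_nce(graph_emb, topo_emb, temperature=0.07):
    """Average of graph->topology and topology->graph cross-entropy.

    Row i of graph_emb and row i of topo_emb are assumed to describe the
    same neuron (the positive pair); all other rows act as negatives.
    """
    g = [_normalize(v) for v in graph_emb]
    t = [_normalize(v) for v in topo_emb]
    n = len(g)
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(g[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_g2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2g = -sum(_log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_g2t + loss_t2g)
```

With toy inputs, correctly paired embeddings should give a much smaller loss than mismatched ones, which is the behavior the alignment objective relies on.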
- H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
-
H2-Surv embeds pathological WSI and genomic features into a hyperbolic (Poincaré ball) space. It models the tree-like hierarchy (patient→WSI/pathway→patch/gene) and the biological fact that genomics is more abstract than pathology using hierarchical distance constraints and cross-modal entailment cones. By employing a temporal ordinal contrastive loss to preserve the continuous ranking of survival time, it improves the average C-index from 0.684 to 0.716 across six TCGA/CPTAC datasets.
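For readers unfamiliar with the Poincaré ball mentioned in the entry above, here is a minimal sketch of its standard geodesic distance (unit curvature). It only illustrates why hyperbolic space suits tree-like hierarchies: points near the boundary are far apart even when they look close in Euclidean terms. The function name is hypothetical, and the paper's hierarchical constraints and entailment cones are not reproduced here.

```python
import math


def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    denom = max((1.0 - sq_u) * (1.0 - sq_v), eps)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)
```

A point at Euclidean radius 0.5 sits at hyperbolic distance ln 3 from the origin, and the distance grows without bound as the radius approaches 1, leaving exponentially more room for lower levels of a hierarchy.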
- Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
-
Addressing the issue that "different doctors draw different contours for the same lesion," this paper employs a lightweight Harmonizer network to first "remove" scanning device noise/artifacts from features. It then uses a high-frequency prompt module in the wavelet frequency domain to capture the stylistic preferences of each doctor. Finally, it employs GED regularization to align the model’s predicted distribution with the ground truth annotation distribution, achieving superior population-level diversity and personalized segmentation on LIDC-IDRI and NPC-170 (GED 0.1048 vs. D-Persona 0.1358).
- Hyperbolic Relational Prompts for Intersectional Fairness in Medical VLMs
-
FRP transforms "prompt generation" in medical VLMs from isolated sample processing into dynamic relational reasoning: it employs a sample relational graph to capture fine-grained dependencies and utilizes hyperbolic graph layers to explicitly model the hierarchical structure of intersectional identities (e.g., race × gender). This mitigates "intersectional blindness" while achieving SOTA diagnostic AUC (FairVLMed 77.50%, Harvard-GF 85.94%).
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
-
Addressing the issues where existing medical MLLM segmentation relies on joint fine-tuning of implicit <SEG> tokens and external pixel decoders (prone to catastrophic forgetting, poor cross-domain performance, and restricted to a single forward pass), IBISAgent reformulates segmentation as a multi-step Markov Decision Process (MDP) of "think \(\rightarrow\) click action \(\rightarrow\) call segmentation tool \(\rightarrow\) observe mask." By training Qwen2.5-VL-7B with cold-start SFT and agentic RL (fine-grained rule rewards) without changing the architecture, it enables iterative mask refinement. It significantly outperforms closed-source and open-source SOTA on multiple in-domain and out-of-domain biomedical segmentation benchmarks (In-domain IoU 85.58 vs. 50.74 for the second best).
- IEBGL: An Interpretability-Enhanced Brain Graph Learning Framework with LLM-Instructed Topology and Literature-Augmented Semantics
-
IEBGL injects two streams of external knowledge—"LLM reasoning" and "biomedical literature semantics"—into rs-fMRI brain graphs. Specifically, it uses LLMs to reconstruct brain connection topology and literature embeddings to enhance brain region node features. These are processed by a graph-bidirectional Mamba network for depression/autism diagnosis. While improving accuracy, it also aligns abnormal brain regions with relevant literature, providing interpretable diagnostic evidence.
- Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
-
This paper proposes the Instruction-guided Lesion Segmentation (ILS) task for chest X-rays, constructs the first large-scale automatically generated instruction-answering dataset MIMIC-ILS (1.1M samples, 192K images, 91K masks), and trains the ROSALIA model to achieve 71.2% gIoU and 91.8% null-target accuracy, significantly outperforming existing general and medical segmentation models.
- Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
-
This paper identifies and addresses the degradation of local feature alignment in CLIP during Cross-Domain Few-Shot Learning (CDFSL). The proposed CC-CDFSL framework utilizes a cycle-consistency-based approach with T-I-T and I-T-I bidirectional paths and a Semantic Anchor mechanism to rectify patch-level vision-language alignment while enhancing model interpretability.
- InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training
-
InvCoSS utilizes "model inversion" to synthesize images from self-supervised models of previous stages, replacing privacy-sensitive real data replay buffers. It performs continual self-supervised pre-training without storing any original data, matching or even exceeding the performance of data replay methods across nine medical downstream tasks while reducing storage overhead by up to \(590\times\).
- IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
-
To address class imbalance and organ/stain variations in nuclei "instance-level segmentation + classification" within pathological images, this paper proposes automatically generating attribute-guided pseudo-text prompts from ground truth masks. It performs instance-level vision-language contrastive alignment and models intra-class multi-modality using multiple learnable "category tokens" per class and a semantic interaction module, improving both segmentation and classification without manual text labels.
- KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
-
KAMP utilizes LLM-generated "patient-personalized diagnostic knowledge" as a semantic anchor to align medical images with multimodal biomedical signals (pathological, genomic, etc.). Through a three-stage training process (alignment → GRPO-refined generator → retraining alignment), knowledge accuracy is iteratively improved. It outperforms unimodal, bimodal, and trimodal baselines in few-shot classification for brain, bladder, and liver cancers.
- Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
-
A set of lightweight Residual Modulation Modules (RMB) is attached to a completely frozen ViT backbone. A Domain Router (DR) estimates the soft probability of a sample belonging to "medical/natural" domains in real-time. Subsequently, a Parameter Synthesis Network (PSN) generates low-rank correction parameters on-the-fly based on these probabilities to be injected into Q/V projections and attention biases. Combined with MAML-style bi-level optimization, this enables a single model to adapt to both medical (Ultrasound/CT/MRI) and natural images simultaneously without mutual performance degradation, using only approximately 3.5% trainable parameters.
- KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems
-
In the process of solving inverse problems (e.g., sparse-view CT, Gaussian deblurring) using diffusion priors, the "KL divergence between the prior distribution \(p(x)\) and the posterior distribution \(p(x|y)\)" is utilized as an OOD signal. By restricting this signal to spatial blocks and specific sampling time windows, the method detects and localizes subtle, local, yet diagnostically significant distribution shifts (such as tumors in healthy liver CT scans) without requiring any OOD calibration data.
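One way to picture the block-restricted KL signal described in the entry above: if the prior-driven and measurement-conditioned denoising steps are both modeled as Gaussians with a shared isotropic variance, the KL divergence reduces to a scaled squared difference of their means, which can be summed per spatial block. The sketch below makes exactly that simplifying assumption; the function name, the block size, and the equal-variance Gaussian form are illustrative and may differ from the paper's estimator.

```python
def blockwise_gaussian_kl(mean_prior, mean_post, sigma, block=2):
    """Per-block KL between N(mean_post, sigma^2 I) and N(mean_prior, sigma^2 I).

    Inputs are 2D lists (H x W) of predicted means at one sampling step.
    For equal isotropic variances, KL = |mean_post - mean_prior|^2 / (2 sigma^2).
    Returns a coarse 2D map with one score per spatial block.
    """
    height, width = len(mean_prior), len(mean_prior[0])
    scores = []
    for r in range(0, height, block):
        row = []
        for c in range(0, width, block):
            sq = sum(
                (mean_post[i][j] - mean_prior[i][j]) ** 2
                for i in range(r, min(r + block, height))
                for j in range(c, min(c + block, width))
            )
            row.append(sq / (2.0 * sigma ** 2))
        scores.append(row)
    return scores
```

Blocks where the measurement pulls the estimate away from what the prior expects receive high scores, so a localized shift such as a lesion would stand out in the coarse map while the rest stays near zero.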
- Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision
-
This paper proposes MASS (MAsk-guided Self-Supervised learning), which utilizes class-agnostic masks automatically generated by SAM2 as pseudo-labels. By using in-context segmentation as a pretext task for self-supervised pre-training without any manual annotation, it learns semantically rich and highly generalizable 3D medical image representations. It achieves superior performance in few-shot segmentation and frozen encoder classification.
- LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
-
This paper constructs LEMON, a large-scale endoscopic dataset containing 4194 surgical videos (938 hours), and proposes LemonFM, a self-supervised foundation model based on enhanced knowledge distillation. LemonFM outperforms existing surgical foundation models across four downstream tasks: surgical phase recognition, tool detection, action recognition, and semantic segmentation.
- LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
-
The general-domain masked diffusion language model LLaDA is introduced to the biomedical image understanding field for the first time via visual instruction tuning, resulting in the first diffusion-based biomedical VLM. It outperforms LLaVA-Med in open-ended medical dialogues, sets new SOTA records on the closed-set subsets of three VQA benchmarks, and enables explicit control over response length for more detailed answer generation.
- LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
-
The authors propose the LUMINA multi-vendor Full-Field Digital Mammography (FFDM) dataset (468 patients, 1,824 images) along with an energy harmonization preprocessing method based on foreground pixel histogram matching. They systematically evaluate CNN and Transformer models across three tasks: diagnosis, BI-RADS classification, and density estimation.
- MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
-
This work integrates Vision Mamba state-space modeling into a lightweight U-Net with only 0.494M parameters. By utilizing three specialized modules—Adaptive Multi-branch Mamba Fusion (AMF), Local-Global Feature Mixer (LGFM), and Cross-Gated Attention (CGA)—it enhances multi-scale fusion, local texture/global context interaction, and skip-connection refinement. The model achieves an average of 87.12% IoU and 93.09% Dice across four skin lesion segmentation benchmarks (ISIC2017/2018, HAM10000, PH2), outperforming numerous SOTA methods while using 93.6% fewer parameters and 97.6% fewer GFLOPs than U-Net.
- Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning
-
MDAE applies dual corruptions—spatial masking and diffusion noise—simultaneously to 3D brain MRI volumes. This allows a time-conditioned network to concurrently learn to reconstruct masked regions (capturing holistic anatomical structures) and denoise visible regions (capturing fine-grained tissue textures). It achieves an average AUROC of 73.6% in-domain and 78.6% cross-modality across 16 clinical benchmarks for self-supervised pre-training.
- MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
-
Addressing "gigapixel WSI + sparse genomes" for cancer survival prediction, MDCS-MoAME proposes a composite scanning strategy (five directions for images, interval scanning for genes) using Mamba to capture long-range dependencies. It employs a "Mixture of Attention and Mamba Experts" to dynamically select experts for cross-modal fusion based on modality pairs, incorporating alignment constraints to reduce redundancy. This approach achieved an average c-index of 0.7383 across five TCGA datasets, establishing a new SOTA.
- MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
-
Building on frozen CLIP encoders, this work achieves bidirectional image-text interaction and prediction uncertainty modeling through Probabilistic Vision-Language (PVL) adaptation. Combined with a soft patch-level contrastive loss, it balances data efficiency, domain generalization, and interpretability across 16 medical segmentation datasets.
- MedFG-VQA: Low-Frequency Memory and Graph Attention for Lightweight Medical VQA
-
MedFG-VQA equips a 795M small model with two lightweight modules—"DCT Low-Frequency Memory Bank" and "Graph-Enhanced Cross-Modal Attention"—and utilizes 2.06 million synthetic medical VQA records generated by GPT-4o. This approach achieves higher accuracy in medical visual question answering with a footprint significantly smaller than mainstream VLMs.
- MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
-
MedGRPO proposes two key innovations to address the training collapse issue in multi-dataset reinforcement learning for medical videos: cross-dataset reward normalization (mapping median performance across different datasets to the same reward value via a logistic function) and Medical LLM Judge (comparative scoring across five clinical dimensions). Based on Qwen2.5-VL-7B, it outperforms GPT-4.1 and Gemini-2.5-Flash on MedVidBench (532K video-instruction pairs).
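The cross-dataset reward normalization in the entry above can be sketched as a logistic squashing centered on each dataset's median score, so that median performance maps to the same reward (0.5) regardless of how hard the dataset is. The function name, the steepness value, and the use of a plain median over observed scores are assumptions for illustration.

```python
import math
import statistics


def normalized_reward(score, dataset_scores, steepness=10.0):
    """Map a raw metric to (0, 1) so that the dataset median yields 0.5.

    dataset_scores: raw scores previously observed on the same dataset.
    steepness: hypothetical slope controlling how sharply rewards saturate.
    """
    median = statistics.median(dataset_scores)
    return 1.0 / (1.0 + math.exp(-steepness * (score - median)))
```

Under this sketch, a score of 0.3 on a hard dataset whose median is 0.3 earns the same reward as a score of 0.8 on an easy dataset whose median is 0.8, which keeps any single dataset from dominating the policy gradient.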
- MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration
-
MedKCO is proposed as a knowledge-driven cognitive orchestration strategy for medical vision-language pretraining. By utilizing a hierarchical curriculum (label-level sorting by diagnostic sensitivity and description-level sorting by sample representativeness) along with a self-paced asymmetric contrastive loss, the model learns progressively from simple to complex concepts. It significantly outperforms baselines in zero-shot and downstream tasks across three medical modalities.
- MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding
-
Addressing the sparse reward challenge that arises when GRPO is applied directly to medical visual grounding, where "fixed IoU threshold rewards \(\rightarrow\) early all-zero rewards \(\rightarrow\) gradient vanishing," this paper proposes MedLoc-R1. It utilizes a sliding window performance tracker and multi-condition update criteria to progressively tighten the IoU reward threshold from loose (dense rewards) to strict (fine-grained alignment) based on model capability, improving both accuracy and training stability across three medical grounding benchmarks without adding auxiliary networks.
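A minimal sketch of how a sliding-window tracker could drive the IoU threshold schedule described in the entry above. The class name, default values, and the specific promotion rule (window full, success rate above a target, cap not yet reached) are hypothetical stand-ins for the paper's multi-condition criteria.

```python
from collections import deque


class IoUThresholdScheduler:
    """Tighten a binary IoU reward threshold as recent success improves."""

    def __init__(self, start=0.3, end=0.75, step=0.05, window=50, promote_rate=0.6):
        self.threshold = start
        self.end = end
        self.step = step
        self.promote_rate = promote_rate
        self.history = deque(maxlen=window)  # recent binary rewards

    def reward(self, iou):
        """Return 1.0 if the prediction clears the current threshold, else 0.0."""
        r = 1.0 if iou >= self.threshold else 0.0
        self.history.append(r)
        self._maybe_tighten()
        return r

    def _maybe_tighten(self):
        window_full = len(self.history) == self.history.maxlen
        if not window_full or self.threshold >= self.end:
            return
        success_rate = sum(self.history) / len(self.history)
        if success_rate >= self.promote_rate:
            # Raise the bar and restart tracking at the new difficulty.
            self.threshold = min(self.end, round(self.threshold + self.step, 10))
            self.history.clear()
```

Starting loose keeps early rewards dense; the threshold only moves once the model reliably clears the current one, which avoids the all-zero reward phase the entry describes.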
- MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
-
MedMO utilizes Qwen3-VL as its base model and undergoes a four-stage post-training process using 26M+ cross-modal medical data: "General Medical SFT → High-resolution Grounding SFT → Instruction Tuning → GRPO Reinforcement Learning with Bounding Box Rewards." This approach unifies medical image understanding (VQA / QA / Report Generation) and fine-grained spatial localization (Bbox grounding) into an open-source VLM, outperforming existing open-source medical MLLMs across multiple clinical tasks.
- MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
-
MedTVT-R1 unifies three types of heterogeneous data from the same patient—ECG (time-series), chest X-ray (CXR, image), and lab results (LAB, table)—into a single MLLM. By utilizing a "modality-aware layer + chain-of-evidence instruction data + GRPO reinforcement fine-tuning," it achieves interpretable multi-disease diagnosis, outperforming both general and medical-specific MLLMs in clinical utility (F1, AUC) and long-text diagnosis generation.
- Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning
-
This paper reveals that in cross-domain few-shot fine-tuning of VLMs, enhancing visual discriminability actually harms cross-modal alignment (the "discriminability trap"). It proposes two plug-and-play modules, SVL and RA, to suppress visual learning shortcuts and guide cross-modal alignment, achieving SOTA on 4 CDFSL datasets and 11 FSL datasets.
- MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
-
To address the issue where existing computational pathology MLLMs compress entire Whole Slide Images (WSI) into a single vector—losing fine-grained spatial semantics—this paper proposes MLLM-HWSI. It decomposes the WSI into visual tokens across four scales: "cell = word, patch = phrase, region = sentence, WSI = paragraph." Using hierarchical contrastive alignment loss and cross-scale consistency loss, it aligns each scale with pathology reports before feeding them into an instruction-tuned LLM, achieving new SOTA results across 6 pathology tasks and 13 WSI-level benchmarks.
- Modeling the Brain's Grammar: ROI-Guided fMRI Pretraining for Transferable and Interpretable Vision Decoding
-
ROITok changes the basic unit of cross-subject fMRI pretraining from "whole-brain features" to "ROI tokens." By utilizing sparse ROI context fusion to learn functional synergies between brain regions and Matryoshka-style compression to rank tokens by information content, it achieves superior low-level reconstruction fidelity and few-shot transfer capabilities on NSD/GOD. It also provides quantifiable contributions for each brain region, enhancing model interpretability.
- Momentum Memory for Knowledge Distillation in Computational Pathology
-
MoMKD is proposed to replace traditional batch-local feature alignment with a momentum-updated class-conditional memory bank, achieving genomics \(\rightarrow\) pathology cross-modal knowledge distillation. This enables genome-level predictive capability using only H&E slides during inference.
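To make the idea of a momentum-updated class-conditional memory bank concrete, here is a small sketch that keeps one running prototype per class and blends in each batch's class mean with an exponential moving average. The class name, the momentum value, and the choice to initialize a slot from the first batch are assumptions; which features the paper stores (genomic, pathological, or both) is not specified in the entry above.

```python
class ClassMemoryBank:
    """One exponentially averaged feature prototype per class."""

    def __init__(self, num_classes, dim, momentum=0.9):
        self.momentum = momentum
        self.bank = [[0.0] * dim for _ in range(num_classes)]
        self.initialized = [False] * num_classes

    def update(self, features, labels):
        """Blend each class's batch mean into its stored prototype."""
        by_class = {}
        for feat, label in zip(features, labels):
            by_class.setdefault(label, []).append(feat)
        for label, feats in by_class.items():
            mean = [sum(col) / len(feats) for col in zip(*feats)]
            if not self.initialized[label]:
                # First observation of this class: store the batch mean directly.
                self.bank[label] = mean
                self.initialized[label] = True
            else:
                m = self.momentum
                self.bank[label] = [
                    m * old + (1.0 - m) * new
                    for old, new in zip(self.bank[label], mean)
                ]
```

Because the prototypes accumulate across batches, a student can be aligned against statistics from the whole training history rather than only the samples that happen to share its current batch.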
- MR-RAG: Multimodal Relevance-Aware Retrieval-Augmented Generation for Medical Visual Question Answering
-
MR-RAG optimizes both retrieval and generation stages of the medical VQA RAG pipeline: the retrieval stage utilizes a lightweight adapter to fuse image-text, image-image, and text-text similarities for multimodal relevance scoring, while the generation stage injects these scores into the LVLM's attention mechanism to amplify information from highly relevant documents and suppress noise, achieving up to a 6.4% accuracy improvement across three medical datasets.
- MRI Contrast Enhancement Kinetics World Model
-
The authors propose, for the first time, the MRI Contrast Enhancement Kinetics World Model (MRI CEKWorld), which employs Spatiotemporal Consistency Learning (STCL) to generate high-fidelity, continuous contrast-enhanced sequences from non-contrast MRI using sparsely sampled data, addressing the dual challenges of content distortion and temporal discontinuity.
- Multimodal Causality-Driven Representation Learning for Generalizable Medical Image Segmentation
-
To address domain drift caused by differences in equipment, lighting, and imaging modalities in medical images, this paper explicitly models these differences as "confounders." By constructing a confounder dictionary using CLIP text prompts and performing causal intervention through backdoor adjustment, the method improves the cross-domain average mDice by 2.0% over the strongest baseline in endoscopic segmentation.
- MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
-
MMPFN is proposed to extend the pretrained tabular foundation model TabPFN to multimodal (tabular + image/text) scenarios for the first time. It addresses non-tabular embedding over-compression and token count imbalance through a Multi-Head Gated MLP (MGM) and a Cross-Attention Pooler (CAP), outperforming SOTA on medical and general datasets.
- MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
-
This paper proposes the MUSE framework, which significantly enhances generalization performance in few-shot Whole Slide Image (WSI) classification through MoE-driven Sample-level Fine-grained Semantic Enhancement (SFSE) and LLM knowledge base-based Stochastic Multi-view Model Optimization (SMMO).
- MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
-
The MUST framework is proposed to explicitly decompose multimodal representations into modality-specific and cross-modal shared components via algebraic constraints. A conditional Latent Diffusion Model (LDM) is employed to generate specific information when modalities are missing. MUST achieves SOTA performance with a 0.742 C-index across five TCGA cancer datasets, with performance drops limited to approximately 0.4%-3.5% in missing modality scenarios.
- MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
-
This paper proposes MuViT, a multi-resolution Vision Transformer based on world-coordinate RoPE position encoding. It can jointly process crops of the same scene at different physical resolutions in a single encoder, significantly outperforming single-resolution baselines on microscopy image segmentation tasks.
- NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
-
NeuroFlow unifies "image-to-brain" (encoding) and "brain-to-image" (decoding) into a single flow model. It utilizes a variational backbone, NeuroVAE, to compress fMRI into a semantically structured latent space, followed by Cross-modal Flow Matching (XFM) to learn a reversible continuous flow between visual and neural latent distributions. By integrating forward (encoding) and backward (decoding) paths, it achieves SOTA or comparable performance on both tasks with only approximately 25% of the parameters of MindEye2.
- NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
-
NeurINO proposes to initialize 3D neuron segmentation models by inflating 2D kernels pre-trained by DINOv3 into 3D operators, while introducing a Topology-Aware Skeleton Loss (TASL) to explicitly supervise skeleton-level structural fidelity. It achieves average improvements of 2.9% in ESA, 2.8% in DSA, and 3.8% in PDS across four neuroimaging datasets.
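One common way to inflate a 2D kernel into a 3D operator is to repeat it along the new depth axis and divide by the depth, so that a volume whose slices are identical produces the same response as the original 2D filter. The sketch below shows that averaging-style inflation on nested lists; it is a swapped-in, generic technique used for illustration, and the paper's actual inflation scheme for DINOv3 weights may differ. The function name is hypothetical.

```python
def inflate_kernel_2d_to_3d(kernel_2d, depth):
    """Repeat a 2D kernel `depth` times along a new axis, scaled by 1/depth.

    kernel_2d: nested list of shape (H, W).
    Returns a nested list of shape (depth, H, W) whose slices sum to kernel_2d.
    """
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    return [
        [[weight / depth for weight in row] for row in kernel_2d]
        for _ in range(depth)
    ]
```

The 1/depth scaling keeps activation magnitudes roughly where the 2D pre-training left them, which is the usual motivation for this kind of initialization.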
- OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
-
Instead of relying on new architectures or larger backbones, this work systematically investigates "how to mix training data." By utilizing strong teacher distillation and rejection sampling, the authors filtered 8 million medical samples with structured reasoning chains (6.8 billion tokens). This process fine-tuned a 7B student model (Qwen2.5-VL-7B) into OctoMed, achieving open-source SOTA on multiple out-of-distribution (OOD) medical benchmarks. Notably, the model adaptively adjusts reasoning chain lengths without explicit supervision.
- OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
-
OmniBrainBench is the first multimodal VQA benchmark covering the complete clinical workflow of brain imaging analysis. It collects 15 imaging modalities from 30 validated data sources and constructs 9,527 radiologist-verified QA pairs (31,706 images). The benchmark is divided into 15 multi-stage tasks across five major clinical phases: "Anatomical Assessment → Lesion Localization → Diagnostic Reasoning → Prognostic Judgment → Treatment Management." Evaluating 24 SOTA models reveals that the strongest model, Gemini-2.5-Pro (66.58%), still lags significantly behind physicians (91.35%).
- OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging
-
OmniFM is proposed as a modality-robust and task-agnostic federated learning framework. Through three complementary components—Global Spectral Knowledge Retrieval (GSKR), Embedding-level Cross-Attention (ECA) fusion, and Prefix-Suffix Spectral Prompting (PSP)—it supports five medical imaging tasks (classification, segmentation, super-resolution, VQA, and multi-modal fusion) under a unified FL pipeline, significantly outperforming existing baselines in cross-modality heterogeneous scenarios.
- OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
-
OralGPT-Omni is the first dental-specific multimodal large language model. By constructing TRACE-CoT data that mimics the diagnostic workflow of radiologists and employing a four-stage progressive training regimen, it achieved a score of 51.84 on the MMOral-Uni unified benchmark (covering five modalities and five tasks), significantly outperforming GPT-5's 15.42.
- OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
-
OralGPT-Plus transforms dental panoramic X-ray diagnosis from a "single-forward" VLM into an agent capable of autonomously invoking "Zoom-In" and "Mirror-In" tools for iterative review like a dentist. Powered by expert trajectory instruction tuning and review-driven reinforcement learning, it consistently outperforms strong baselines such as GPT-5 on the self-constructed MMOral-X benchmark.
- OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
-
This paper proposes OraPO (Oracle-educated GRPO), which injects lightweight DPO supervision to transform failed rollouts into preference pairs when GRPO exploration fails. Combined with FactScore rewards, it achieves SOTA on CheXpert Plus and MIMIC-CXR (F1=0.341/0.357) using only 1K samples and a 3B model, reducing training data by 2-3 orders of magnitude compared to the previous SOTA.
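A rough sketch of the fallback described in the entry above: when every rollout in a group scores below some bar, the worst rollout is paired against a reference report and scored with the standard DPO loss. Treating the ground-truth report as the "chosen" answer, the failure threshold, and both function names are assumptions made for illustration; only the DPO loss formula itself is standard.

```python
import math


def build_preference_pair(rollouts, rewards, oracle_report, fail_threshold):
    """Return a chosen/rejected pair if the whole rollout group failed, else None."""
    if max(rewards) >= fail_threshold:
        return None  # exploration succeeded, so the usual GRPO update applies
    worst = rollouts[rewards.index(min(rewards))]
    return {"chosen": oracle_report, "rejected": worst}


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair.

    Inputs are sequence log-probabilities under the policy and a frozen
    reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Stable softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the policy raises the chosen report's likelihood relative to the rejected rollout, so even a group with no useful GRPO signal still contributes a gradient.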
- OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
-
OSA constrains the temporal memory updates of the left ventricle in echocardiography videos to the Stiefel manifold (Orthogonalized State Update) and incorporates a feature enhancement module that physically decouples anatomical structures from speckle noise. It achieves state-of-the-art segmentation accuracy and temporal stability at real-time speeds on CAMUS and EchoNet-Dynamic.
- PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection
-
This paper uses Grad-CAM analysis to reveal that discriminative activation maps, which are effective in industrial anomaly detection, fail in medical images. Consequently, PDD is proposed: features from two heterogeneous frozen teachers, VMamba-Tiny (global context prior) and wide-ResNet50 (local structural prior), are unified into a single high-dimensional manifold and distilled into two behaviorally complementary students. A diversity loss prevents representation collapse. On the HeadCT, BrainMRI, and ZhangLab datasets, AUROC exceeds the best baselines by 11.8, 8.5, and 2.9 percentage points, respectively.
- Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation
-
This paper proposes "Federated Temporal Adaptation" (FTA), a federated learning setting that treats temporal evolution as a first-class citizen. Using the FedTAR framework—featuring demographic-driven personalized LoRA and meta-learned temporal residual aggregation—it models longitudinal changes in patient follow-ups under privacy constraints. It improves linguistic accuracy, temporal coherence, and cross-institutional generalization on J-MID (~1 million examinations) and MIMIC-CXR.
- PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
-
Addressing three major challenges in 3D whole-body PET/CT report generation—extremely small lesions (<0.1% volume), scattered regions of interest, and the lack of mask-text aligned datasets—this paper introduces PETARSeg-11K, the first publicly available lesion-level aligned dataset, and PETAR-4B, a mask-aware 3D vision-language model. By utilizing "mask conditioning + focal prompts" to resolve fine-grained details in small lesions, the model significantly outperforms 2D/3D baselines across all automated metrics. Clinical utility was validated through the first human evaluation study for PET reports involving five nuclear medicine physicians.
- PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation
-
PGR-Net is an explicitly ROI-aware brain tumor MRI segmentation network. It combines a data-driven spatial prior template set built from the training data, a hierarchical Top-K ROI selection mechanism, and a Windowed Gaussian-Spatial decay guidance module (WinGS-ROI) to concentrate computational resources on lesion areas. It achieves SOTA performance on BraTS-2019/2023 and MSD Task01 with only 8.64M parameters.
- Phrase-grounded APO for Improving Chest X-ray Report Generation
-
This paper proposes "Phrase-grounded Automatic Preference Optimization (APO)": during the inference phase and without any additional ground truth, a fact-checking model + LLM correction converts the radiology report generator's own output into "preferred/rejected" pairs. A new APO loss, combining preference alignment and phrase-grounding terms, is used for lightweight weight updates. This improves report quality by an average of 30–40% across 7 SOTA report generators on multi-institutional chest X-ray datasets.
- PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
-
PMRNet does not rely on parameter stacking. Instead, it embeds three physics priors—symplectic geometry, renormalization group, and heat diffusion—into the network architecture. With only 0.87M parameters and 3.43 GFLOPs, it outperforms SOTA models with 10–100 times more parameters across 12 medical segmentation datasets, while maintaining real-time inference at 152 FPS.
- Post-training Feature Pruning for Fundus Images Classification
-
GFP is a post-training, architecture-agnostic feature pruning framework that freezes the backbone and performs "greedy + minimum retention ratio" subset selection only on the final flattened feature vector. By removing redundant dimensions, it frequently improves AUROC/AUPRC across 5 fundus datasets while cutting 4%–96% of feature dimensions and improving cross-dataset generalization.
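A minimal sketch of greedy subset selection with a minimum retention ratio is given below. It uses backward elimination driven by a caller-supplied score, which is an assumption: the summary does not state whether GFP adds or removes features, or which validation metric drives the search. `greedy_prune`, `toy_score`, and `min_keep_ratio` are hypothetical names.

```python
import math

def greedy_prune(num_features, score, min_keep_ratio):
    """Greedy backward elimination over feature indices.

    score: callable mapping a list of kept indices to a float (higher is better),
           for example the validation AUROC of a head trained on frozen features.
    min_keep_ratio: floor on the fraction of feature dimensions that must remain.
    """
    kept = list(range(num_features))
    best = score(kept)
    min_keep = max(1, math.ceil(min_keep_ratio * num_features))
    while len(kept) > min_keep:
        # Evaluate dropping each remaining feature and pick the best candidate.
        cand_score, drop = max((score([k for k in kept if k != f]), f) for f in kept)
        if cand_score < best:
            break  # every possible removal hurts, so stop
        kept.remove(drop)
        best = cand_score
    return kept, best

# Toy score: features 0 and 2 are informative; every kept feature costs 0.1.
def toy_score(kept):
    return sum(1.0 for k in kept if k in (0, 2)) - 0.1 * len(kept)

kept, best = greedy_prune(4, toy_score, min_keep_ratio=0.25)
```

In this toy run the two uninformative dimensions are dropped and the search stops once removing anything else would lower the score, even though the retention floor would allow one more removal.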
- Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement
-
PDMR learns a low-dimensional non-linear manifold of dynamic 3D MRI motion (Deformation Vector Field, DVF) offline. During online inference, it optimizes only a 12-dimensional latent vector using a single instantaneous k-space measurement. This enables real-time reconstruction of high-fidelity 3D images under ultra-sparse sampling for prospective applications such as MR-guided radiotherapy.
- R2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
-
R2-Seg is a training-free framework with no learnable parameters that uses a two-step "Reason-and-Reject" approach. It first employs an LLM for anatomical reasoning to plan Regions of Interest (ROIs), then applies two-sample statistical testing (MMD² + FDR control) to filter candidates generated by a frozen foundation model (BiomedParse) within the ROIs. This suppresses false positives in Out-of-Distribution (OOD) tumor segmentation, simultaneously improving Dice, specificity, and sensitivity across multi-center and multi-modal tumor datasets.
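The two statistical ingredients named above can be sketched generically: a biased estimate of the squared Maximum Mean Discrepancy (MMD²) between two feature samples, and the Benjamini-Hochberg procedure for controlling the false discovery rate over many tests. The RBF kernel, the bandwidth `gamma`, and the example p-values are assumptions for illustration; the summary does not specify how R2-Seg obtains p-values or which kernel it uses.

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased MMD^2 estimate: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which tests are significant under BH."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose sorted p-value is below its BH threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    significant = [False] * m
    for idx in order[:cutoff]:
        significant[idx] = True
    return significant

# Hypothetical feature samples: a reference set and a clearly shifted set.
reference = [[0.0, 0.0], [1.0, 1.0]]
shifted = [[5.0, 5.0], [6.0, 6.0]]
gap = mmd2(reference, shifted)

# Hypothetical p-values, one per candidate region.
flags = benjamini_hochberg([0.04, 0.001, 0.30, 0.02], alpha=0.05)
```

With these inputs the second and fourth tests are flagged: their sorted p-values (0.001 and 0.02) fall below the BH thresholds 0.0125 and 0.025, while 0.04 exceeds its threshold of 0.0375.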
- RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
-
This work constructs RDFace, a standardized benchmark dataset containing 456 pediatric facial images covering 103 rare genetic diseases. It systematically investigates the effectiveness of phenotype-aware synthetic data augmentation (DreamBooth/FastGAN) for diagnosis under ultra-low sample conditions, demonstrating that DreamBooth augmentation improves diagnostic accuracy by up to 13.7% in extreme low-data scenarios.
- Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery
-
To address the lack of ground truth depth data in fundus (posterior segment) microsurgery, the authors developed a Real2Sim2Real pipeline using Blender to construct RetinalDepth, the first synthetic depth dataset for posterior ophthalmic surgery (44,800 stereo pairs, 896 scenes, with pixel-level depth/normals/instrument segmentation/camera parameters). They also proposed the Temporal Depth Variance (TDV) metric to measure inter-frame stability of video depth, demonstrating that fine-tuning on this data significantly improves the generalization of monocular, stereo, and video depth models in real fundus surgical scenarios.
- Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning
-
This paper identifies "Lost Layers" in the CLIP text encoder—a phenomenon where removing certain intermediate layers actually improves performance in Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL). The authors demonstrate that these layers are not redundant but are underutilized due to visual domain shifts. To address this, the VtT model is proposed to reclaim this information at both the layer and encoder levels, achieving state-of-the-art results.
- Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification
-
By adding lightweight task plugins of only ~1M parameters (LoRA adaptation + permutation-invariant slice attention aggregation) to a frozen 2D Foundation Model (FM), a single framework achieves SOTA performance across 12 diverse 3D medical classification tasks (including 1st place in the VLM3D challenge). The study systematically reveals counter-intuitive conclusions, such as "2D methods outperform 3D architectures in 3D classification" and "General FMs can match Medical FMs after proper adaptation."
- SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
-
SAR2Net reformulates HE↔IHC Whole Slide Image (WSI) cross-stain alignment from "deformation transform estimation" to "region-level feature retrieval." By learning a "spatially anchored representation" for each point that depends only on coordinates and relative geometric encoding of anchors, it achieves robust region correspondence under severe tissue deformation and fragmentation without requiring any global coarse alignment. On a self-collected biopsy dataset, it improves mIoU from 0.691 (strongest baseline) to 0.899.
- SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
-
SAT-RRG utilizes a frozen LLM as a "judge" to mark semantic errors token-by-token in generated reports. It employs a pair of "push–pull" losses (depressing incorrect words and strengthening correct ones) combined with entropy-confidence self-adaptive weighting. This converts cross-entropy training into a self-correcting process, achieving new SOTA results on both linguistic and clinical metrics for MIMIC-CXR and IU-Xray with zero additional inference overhead.
- SPECTRE: Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
-
SPECTRE is a pure Transformer-based volumetric CT foundation model. It addresses three core challenges of volumetric CT—"cubic token explosion, geometric anisotropy, and weak/noisy clinical supervision"—through anisotropic 3D tokenization, a two-level (local/global) ViT, and 3D RoPE. Utilizing a two-stage pretraining pipeline of "DINOv3 SSL → SigLIP Vision-Language Alignment" with only public CT data, SPECTRE outperforms existing CT foundation models in biomarker classification, segmentation, and cross-modal retrieval.
- SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation
-
This paper proposes SD-FSMIS, a framework that adapts pre-trained Stable Diffusion to few-shot medical image segmentation. Through a support-query interaction module and a visual-to-text condition transformer, it achieves efficient adaptation and performs particularly well in cross-domain scenarios.
- SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation
-
SegMoTE freezes the entire SAM and embeds a set of learnable "expert tokens" and a Token-level MoE router (MoTE) only within the mask decoder. It dynamically selects experts based on the imaging modality and incorporates a Progressive Prompt Tokenization (PPT) module to achieve interaction-free segmentation. By training only 17M parameters using the MedSeg-HQ dataset (approx. 0.15M masks), which is less than 1% of the size of existing datasets, it achieves SOTA results in multi-modal medical segmentation.
- Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
-
EchoForge utilizes a set of learnable anchors to recalibrate noisy ultrasound regions and propagates anatomical semantic prototypes across frames. By employing a "progressive reforging" pseudo-label strategy, it fully exploits unlabeled frames, achieving real-time and precise echocardiography video segmentation under extremely sparse supervision where only ED/ES frames are annotated.
- SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation
-
SemiGDA shifts semi-supervised medical image segmentation from a "pixel-wise discriminative" to a "generative" paradigm. By utilizing two structurally distinct encoders to model and align the latent space prior distributions of images and masks, and leveraging a frozen Stable Diffusion VAE decoder equipped with lightweight skip adapters to "generate" masks, the method outperforms 11 SOTA semi-supervised approaches across four types of datasets (colonoscopy, dermoscopy, pathology, ultrasound) under 10%/30% label settings (e.g., surpassing the second-best by 10 points in Dice on BUSI with 10% labels).
- SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
-
SHAPE reframes Unsupervised Domain Adaptation (UDA) for cross-modal medical segmentation from "local pixel correctness" to "global anatomical plausibility." By performing class-aware Hierarchical Feature Modulation (HFM) on a frozen DINOv3 to generate high-fidelity features, evaluating pseudo-labels at both anatomical shape and layout levels via Hypergraph Plausibility Evaluation (HPE), and removing hallucinated categories through Structural Anomaly Pruning (SAP), the method uses only high-quality pseudo-labels that pass plausibility checks for self-training. It sets a new SOTA on cardiac and abdominal cross-modal benchmarks.
- SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
-
The authors construct a "biomechanics-aware" simulation annotation pipeline by concatenating off-the-shelf 2D spine detectors, multi-view geometric triangulation, and OpenSim musculoskeletal inverse kinematics. This pipeline automatically supplements the Human3.6M dataset with 15 anatomically consistent vertebral-level 3D keypoints and per-vertebra rotations, creating SIMSPINE—the first open 3D spine motion dataset (2.14 million frames). With accompanying 2D/3D baselines, the framework improves the AUC for indoor spine tracking from 0.63 to 0.80.
- Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
-
Sketch2CT enables users to use a single 2D sketch and a text description to first generate anatomically consistent 3D segmentation masks through dual-modal fusion, and then synthesize corresponding 3D CT volumes using segmentation-conditioned latent diffusion, achieving low-cost, controllable, and structure-preserved medical volume data augmentation.
- Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
-
This paper proposes InvTag, the first framework to combine an MR physics forward model with a pre-trained diffusion generative prior. It jointly solves three sub-tasks of 3D Tagged MRI (anatomical recovery, Cine synthesis, and motion estimation) without requiring any additional training data.
- STEPH: Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in WSI Prognosis
-
STEPH proposes a model merging scheme based on Task Vector Mixup (TVM) combined with hypernetwork-driven sparse aggregation. It efficiently integrates predictive knowledge from multiple cancer-specific models into a target cancer model. On 13 TCGA datasets, it achieves an average C-Index of 0.6949 (+5.14% vs. cancer-specific learning, +2.01% vs. ROUPKT). During inference, it requires only a single model forward pass, which is significantly more efficient than multi-model representation transfer schemes.
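For readers unfamiliar with task vectors, the sketch below shows the generic idea under stated assumptions: a task vector is the difference between fine-tuned and pre-trained weights, and a mixup is a convex combination of two such vectors added back to the pre-trained weights. The hypernetwork and the sparse aggregation described above are not modeled, and the mixing weight `lam`, the function names, and the toy weights are all hypothetical.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference between fine-tuned and pre-trained weights."""
    return {name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
            for name in pretrained}

def mixup(tv_target, tv_source, lam):
    """Convex combination of two task vectors with weight lam on the target."""
    return {name: [lam * a + (1.0 - lam) * b
                   for a, b in zip(tv_target[name], tv_source[name])]
            for name in tv_target}

def apply_task_vector(pretrained, tv):
    """Add a task vector back onto the pre-trained weights."""
    return {name: [p + d for p, d in zip(pretrained[name], tv[name])]
            for name in pretrained}

# Toy weights for one shared backbone and two cancer-specific fine-tunes.
pretrained = {"w": [0.0, 1.0]}
target_model = {"w": [1.0, 3.0]}
source_model = {"w": [2.0, 1.0]}

tv_target = task_vector(target_model, pretrained)
tv_source = task_vector(source_model, pretrained)
merged = apply_task_vector(pretrained, mixup(tv_target, tv_source, lam=0.5))
```

Because the merge happens in weight space, the result is a single set of weights, which is consistent with the single forward pass at inference noted above.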
- SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
-
The SPEGC framework is proposed to refine the raw similarity matrix into high-order structural representations using semantic-prompt-enhanced features and a differentiable graph clustering solver. This guides the adaptation of medical image segmentation models on continuously changing target domains, effectively alleviating error accumulation and catastrophic forgetting.
- Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
-
By compressing "energy-dependent material attenuation" into a scalar parameter per Gaussian (interpolating MAC along a Bézier curve) and embedding a differentiable polychromatic Beer–Lambert forward projection into Gaussian Splatting, this work jointly optimizes geometry and material without requiring metal masks. It achieves CBCT metal artifact reduction an order of magnitude faster than neural field methods like Polyner while maintaining higher structural fidelity.
- SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
-
This paper constructs SurgCoT, the first cross-specialty surgical video spatiotemporal reasoning benchmark (covering 7 surgical specialties, 35 procedures, 2,841 videos, 19,345 main questions + 59,177 sub-questions). By employing a "three-stage progressive reasoning + five-tuple annotation protocol (Question→Option→Knowledge→Clue→Answer)," surgical CoT reasoning is decomposed into a "Video-level → Clip-level → Frame-level" hierarchical chain. Evaluations of over ten mainstream MLLMs reveal significant gaps in fine-grained spatiotemporal reasoning, while the structured protocol consistently improves progressive reasoning accuracy.
- Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
-
The authors construct SurgBlood, the first annotated laparoscopic surgery dataset for both bleeding regions and bleeding points. They propose BlooDet, a SAM2-based dual-branch bidirectional guided online detector, achieving joint optimization of bleeding region segmentation and bleeding point localization through synergistic Mask/Point branches.
- TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning
-
TAlignDiff utilizes a Point Cloud Regression Network (PRN) to directly predict a \(4\times4\) transformation matrix for each tooth from preoperative point clouds. Simultaneously, a lightweight Diffusion Transformation Model Denoising (DTMD) module learns the latent distribution of "clinically valid transformation matrices." By employing a contrastive denoising loss to pull the regression output toward this distribution, the model constrains the statistical properties of the transformation matrices beyond mere geometric alignment, achieving TRE/AAE errors superior to existing methods.
- TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
-
TAMER treats Electrocardiogram (ECG) waveforms, STFT spectrograms, and clinical diagnostic reports as three complementary modalities for self-supervised pre-training. Through "time-frequency" global/local alignment and "report-anchored" diagnostic-level and waveform-level refinement, it achieves State-of-the-Art (SOTA) performance in zero-shot classification (81.2% average AUC) and cross-domain transfer (83.1%) across three public datasets.
- TANGO: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
-
TanGo distills the low-frequency generalization priors of Visual Foundation Models (SAM/DINOv2) into a lightweight source model during the training phase and employs learnable per-sample "decorators" to pull drifted test images back to the augmented source distribution during the testing phase, achieving SOTA in Continual Test-time Adaptation (CTTA) for medical image segmentation.
- Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model
-
Tell2Adapt is a unified framework that leverages the generalized knowledge of a Vision Foundation Model (BiomedParse) to achieve source-free unsupervised domain adaptation for medical image segmentation across 10 domain transfer directions and 22 anatomical targets. It generates high-quality pseudo labels through Context-Aware Prompt Regularization (CAPR) and removes anatomically implausible predictions via Visual Plausibility Refinement (VPR).
- Temporal Inversion for Learning Interval Change in Chest X-Rays
-
TILA utilizes "swapping the order of paired chest X-rays (temporal inversion)" as a supervisory signal. By incorporating inversion-aware objectives during pre-training, fine-tuning, and inference, it enables existing temporal vision-language models to genuinely distinguish whether lesions are "improving or worsening," rather than merely identifying their presence.
- The Invisible Gorilla Effect in Out-of-distribution Detection
-
This paper reveals a previously unreported bias in OOD detection, the "Invisible Gorilla Effect": detection performance is significantly better when OOD artifacts are visually similar to the model's region of interest (ROI) and drops drastically when they are dissimilar, particularly affecting feature-based OOD methods.
- TIM: Temporal Decoupling with Iterative Mutual-Refinement Model for Longitudinal Radiology Report Generation
-
TIM decomposes longitudinal radiology report generation into two decoupled branches: "Static Pathology Recognition" and "Dynamic Progression Modeling." It further employs an iterative refinement stage where previous and current reports perform mutual error correction, setting a new SOTA for both linguistic and clinical metrics on the Longitudinal-MIMIC dataset.
- TopoCL: Topological Contrastive Learning for Medical Imaging
-
TopoCL introduces "topology" to standard contrastive learning (CL). It designs controllable topology-aware augmentations using relative bottleneck distance, encodes persistent homology diagrams into features via a hierarchical topological encoder, and adaptively fuses visual and topological representations using a Mixture-of-Experts (MoE) module. It serves as a plug-and-play enhancement for SimCLR/MoCo-v3/BYOL/DINO/Barlow Twins, achieving an average linear probing accuracy gain of 3.26% across five medical datasets.
- TopoSlide: Topologically-Informed Histopathology Whole Slide Image Representation Learning
-
TopoSlide incorporates the diagnostic logic of pathologists—"observing local tissues first, then analyzing global spatial arrangement"—into a self-supervised objective. It first clusters millions of patches into histological clusters, then encodes the spatial arrangement of each cluster into topological descriptors using persistent homology. Finally, a ViT is trained to infer these topologies from slide-level representations under a conditional multi-task objective. Trained on only a few hundred slides, TopoSlide outperforms foundation models trained on hundreds of thousands of slides by up to ~15% in Macro F1 for histological pattern retrieval.
- TRCoRSurg: Temporal-Relational Co-Reasoning for Surgical Video Triplet Recognition
-
TRCoRSurg decomposes surgical video <instrument, verb, target> triplet recognition into two streams: "intra-frame label dependency" and "inter-frame temporal semantics." It constructs a label graph using GCN (where nodes fuse semantic priors with CAM visual evidence, and edges are adaptively learned via MS-CAMRE) and utilizes a Bidirectional Temporal-Relational Fusion Attention (BTRFA) for mutual correction. It achieves APIVT improvements of 5.1%/7.8% on CholecT45/ProstaTD respectively and introduces the TCER metric to specifically measure triplet compositional consistency.
- Turning Pre-Trained Vision Transformers into End-to-End Histopathology Whole Slide Image Models for Survival Prediction
-
The authors discovered that the cross-patch interaction priors learned by pre-trained ViTs on pathology images can extrapolate to longer token sequences. Consequently, they proposed E2E-ViT: by only modifying the input arrangement, adding a parameter-free patch merger, and replacing absolute position encodings with ALiBi, without adding any learnable parameters, a tile-level ViT is directly transformed into an end-to-end WSI model. It outperforms both two-stage MIL and Slide Foundation Models (SFMs) simultaneously across five survival prediction tasks.
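ALiBi replaces learned or absolute position encodings with a fixed, parameter-free penalty on attention logits that grows linearly with token distance, which is why it adds no learnable parameters and extrapolates to longer sequences. The sketch below shows the symmetric 1D form with the standard per-head slopes; how E2E-ViT measures distance between tiles of a slide is not stated in the summary, so the 1D distance here is an assumption.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence (assumes num_heads is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Symmetric bias matrix: -slope * |i - j|, added to attention logits."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)            # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])     # bias for the first head on 4 tokens
```

The same function produces a valid bias for any `seq_len`, so nothing has to be retrained or interpolated when the token sequence grows from one tile to a whole slide.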
- Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
-
The core contribution of this paper is not merely creating an "Ultrasound version of CLIP," but redefining the image-text alignment objective around the unique anatomical hierarchy and diagnostic attributes of ultrasound. The authors first construct the Ultrasonographic Diagnostic Taxonomy (UDT) and the large-scale US-365K dataset, then explicitly inject clinical relationships from the text into contrastive learning using semantic soft labels and attribute heterogeneous graphs to obtain more "ultrasound-literate" vision-language representations.
- Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
-
UniME addresses brain tumor segmentation with missing modalities using a two-stage heterogeneous "representation before fusion" design. Stage 1 employs a single ViT Uni-Encoder for masked self-supervised pre-training to learn unified representations that are robust to missing data. Stage 2 integrates multiple modality-specific CNN Multi-Encoders in parallel to recover high-resolution details. On BraTS 2023/2024, the average DSC across various missing modality combinations outperforms the previous SOTA (with an ET gain of 2.4%~2.9%).
- Uni-Hema: Unified Model for Digital Hematopathology
-
Uni-Hema utilizes a unified architecture (comprising a CNN+Transformer visual branch, a T5 text branch, and a multimodal fusion module named Hema-Former) trained in a single pass. It simultaneously performs six categories of tasks in hematopathology—detection, classification, segmentation, morphological prediction, visual question answering (VQA), and masked language modeling (MLM)—across six diseases including leukemia, malaria, anemia, and sickle cell disease. Its performance is comparable to or better than task-specific SOTA models trained on individual datasets.
- Universal-to-Specific: Dynamic Knowledge-Guided Multiple Instance Learning for Few-Shot Whole Slide Image Classification
-
DyKo replaces the "static universal text descriptions" used in pathological Vision-Language Models (VLMs) with "dynamically instantiated knowledge for each slide." By first clustering slide-specific visual prototypes and then using these prototypes to retrieve and synthesize knowledge features for each patch from a concept bank, DyKo anchors the synthesized knowledge back to visual evidence using a structural consistency loss. It consistently outperforms existing MIL and prompt-based methods in 4/8/16-shot settings across four real-world cancer datasets.
- Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
-
This paper transforms the prompt parameters of each instrument class from "isolated independent prompts" into a "tree structure where shared knowledge is decomposed layer by layer." New instruments can inherit existing knowledge for rapid learning, while new knowledge in turn gently refines old knowledge. As a result, performance improves across new, common, and old categories in incremental surgical instrument segmentation.
- URICA: A Uniformity Region Affine Identifier Capture Algorithm for Arbitrary Region Retrieval in Pathology Images
-
URICA redefines region retrieval in Whole Slide Images (WSI) as a "semantic optimal matching problem under arbitrary spatial transformations." By using semantic tessellation to organize patch features from foundation models into geometrically aware region descriptors and applying rotation/scale-invariant "affine identifiers" for consistency matching, it achieves a 98.38% slide-level retrieval accuracy on 24,811 TCGA WSIs and supports retrieval of regions with arbitrary orientations and sizes for the first time.
- VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
-
VesMamba adapts Mamba into a segmentation backbone capable of perceiving 3D vascular spatial anisotropy. By utilizing dynamic directional convolutions to compensate for Mamba's lack of spatial awareness, bidirectional scale filtering to suppress noise across encoder layers, and high-level mask constraints for low-level decoders, it outperforms various CNN/Transformer/Mamba SOTAs on Parse22 and an in-house Lung79 dataset with approximately 1/4 of the computational cost.
- Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantized Code
-
This paper proposes CodeBrain, which reformulates the any-to-any brain MRI modality completion problem as a region-level full-stack quantized code prediction task. Through a two-stage pipeline (scalar quantization reconstruction + grading loss code prediction), it achieves unified synthesis of missing modalities, surpassing five SOTA methods.
- Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
-
When translating H&E pathology images to virtual immunohistochemistry (IHC), paired images naturally suffer from spatial misalignment, and supervision from single auxiliary tasks is often too weak. This paper extracts multi-task features using a set of auxiliary task models, performing spatial alignment followed by task-gap alignment (dual alignment). These semantic features provide feature-level guidance to the virtual staining generator, which consistently outperforms 7 SOTA methods on FID/KID/LPIPS across the BCI and MIST datasets.
- Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
-
This work treats each MRI modality as a graph node and assigns a set of zero-initialized learnable "virtual nodes" to each. Fusion is performed by a graph attention network whose adjacency matrix is dynamically rewritten based on the available modalities. This single-stage training framework robustly handles brain tumor segmentation under arbitrary modality omissions, outperforming SOTA on almost all missing subsets of BraTS-2018/2020.
- VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
-
VoxTell is a 3D vision-language segmentation model that generates volumetric masks directly from a single sentence (ranging from single words to full clinical reports). By repeatedly injecting text guidance at every level of the UNet decoder (multi-stage fusion) combined with deep supervision, it achieves a zero-shot average Dice of 70.85 across 11 unseen datasets, significantly outperforming the previous state-of-the-art text-promptable method, SAT (51.23).
- X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
-
X-PCR decomposes ophthalmic diagnosis into six causally dependent reasoning stages: "Image Quality Assessment → Anatomical Localization → Lesion Characterization → Disease Diagnosis → Severity Grading → Clinical Decision-making." It performs semantic alignment across 6 ophthalmic imaging modalities, constructing a benchmark with 26,415 images and 177,868 expert-verified VQA pairs. Evaluation of 21 MLLMs shows they significantly lag behind specialists in chain-of-reasoning (the strongest, GPT-5, achieves a full-chain completion rate of only 24.47%) and cross-modality integration.
- X-WIN: Building Chest Radiograph World Model via Predictive Sensing
-
The X-WIN chest radiograph world model is proposed, integrating 3D CT spatial knowledge into CXR representation learning for the first time. By learning to predict 2D projections of CT scans at various rotation angles, the model internalizes 3D anatomical structures. Combined with affinity-guided contrastive alignment and structure-preserving domain adaptation, it achieves SOTA results via linear probing across 6 CXR benchmarks.