🏥 Medical Imaging
🔬 ICLR2026 · 88 paper notes
📌 Same area in other venues: 📷 CVPR2026 (163) · 🧪 ICML2026 (28) · 🤖 AAAI2026 (75) · 🧠 NeurIPS2025 (74) · 📹 ICCV2025 (31)
🔥 Top topics: Medical Imaging ×33 · Segmentation ×14 · Multimodal/VLM ×9 · Diffusion Models ×8 · Face & Gaze ×3
- A Brain Graph Foundation Model: Pre-Training and Prompt-Tuning across Broad Atlases and Disorders
-
BrainGFM models fMRI brain networks as graphs and employs "Graph Contrastive Learning + Graph Masked Autoencoding" for large-scale pre-training on 400,000 brain graphs across 27 datasets and 8 brain atlases. By using meta-learning optimized graph prompts for few-shot adaptation and BioClinicalBERT-encoded language prompts for zero-shot transfer, a frozen foundation model can perform direct diagnosis across diverse atlases, brain disorders, and task settings.
- A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
-
VCFLOW incorporates the "ventral-dorsal dual-stream" mechanism of the human visual cortex into a decoding model. It decomposes fMRI signals into early visual, ventral, and dorsal streams and aligns them with CLIP features at different hierarchical levels. By using a redistribution adapter to decouple "subject-agnostic semantics" from "subject identity," it is the first to achieve fMRI-to-video reconstruction on new subjects without retraining. Compared to subject-specific training, it loses only about 7% accuracy while cutting the cost of generating a single video from 12 hours of training to 10 seconds of inference.
- A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration
-
This paper proposes FFDP—a suite of IO-aware non-GEMM fused CUDA kernels combined with a distributed framework supporting convolution-aware tensor sharding. It accelerates traditional/deep image registration pipelines by 6–7×, reduces peak memory by 20–59%, and performs the first native-resolution multimodal registration of 100µm ex-vivo human brain MRI (over 11 billion transformation parameters, 570× larger than clinical data) on 8 A6000 GPUs in approximately one minute.
- A Structured, Tagged, and Localized Visual Question Answering Dataset with Full Sentence Answers and Scene Graphs for Chest X-ray Images
-
This paper automatically constructs CXR-QBA from MIMIC-CXR radiology reports—a large-scale chest X-ray VQA dataset featuring 42.2 million QA pairs. Each answer includes full sentences, bounding boxes, and structured labels (findings, regions, certainty, etc.). Produced via a three-stage pipeline ("Scene Graph Construction → Templated QA Generation → LLM-based Quality Assurance"), the dataset provides two subsets—a 31.2 million pre-training level and a 7.5 million fine-tuning level—along with a baseline model and evaluation metrics.
- AbdCTBench: Learning Clinical Biomarker Representations from Abdominal Surface Geometry
-
The authors extracted 2D body surface mesh images from 23,506 abdominal CT scans of 18,719 patients, paired them with 16 CT biomarkers and hundreds of disease/comorbidity labels to construct AbdCTBench—the first and largest "surface geometry \(\rightarrow\) internal body composition" dataset. Systematically evaluating 7 mainstream vision architectures, they demonstrated that external abdominal geometry alone can predict clinical indicators such as age (MAE 6.22 years), mortality (AUROC 0.839), and diabetes with chronic complications (AUROC 0.801), paving the way for radiation-free, low-cost consumer-grade health screening.
- Accelerating Benchmarking of Functional Connectivity Modeling via Structure-aware Core-set Selection
-
To make the expensive task of "comparing hundreds of functional connectivity (FC) modeling operators on large-scale fMRI data" affordable, this paper reformulates benchmarking as a "rank-preserving subset selection" problem. It proposes a self-supervised framework, SCLCS, which learns the connectivity structure of each sample using an adaptive Transformer, identifies stable "prototype" samples using the Structure Perturbation Score (SPS), and supplements diversity via density-equalized sampling. Using only 10% of the data, it maintains the true ranking of 130 FC operators from the full set, achieving a ranking consistency (nDCG@k) up to 23.2% higher than previous state-of-the-art core-set methods.
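As a rough illustration of the ranking-consistency metric mentioned above, nDCG@k can be computed as follows. This is a minimal sketch assuming the textbook linear-gain definition; how the paper turns operator performance into relevance scores is not stated here, so `true_scores`, `predicted_order`, and the function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(true_scores, predicted_order, k):
    """nDCG@k of a predicted ranking against ground-truth relevance scores.

    true_scores: relevance of each item (higher is better), indexed by item id.
    predicted_order: item ids sorted best-to-worst, e.g. by a subset-based benchmark.
    """
    gains = [true_scores[i] for i in predicted_order]
    ideal_dcg = dcg_at_k(sorted(true_scores, reverse=True), k)
    # Return 0 when every item has zero relevance to avoid dividing by zero.
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A value of 1.0 means the top-k ordering obtained on the subset matches the ordering obtained on the full data.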
- Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation
-
The CDTSDE framework is proposed, which embeds a learnable spatially adaptive domain mixing field \(\Lambda_t\) into the reverse SDE of diffusion models. This allows the cross-modality translation path to proceed along a low-energy manifold, achieving higher fidelity with fewer denoising steps in MRI modality conversion, SAR-to-optical, and industrial defect semantic mapping tasks.
- Anatomy-aware Representation Learning for Medical Ultrasound
-
Addressing three main characteristics of medical ultrasound (US), namely heavy speckle texture, grayscale-only appearance, and organ-specific features, this paper constructs a large-scale ultrasound dataset of 5.2 million images. It proposes an anatomy-aware A-ViT (centered on "Anatomy-Conditional Deformable Transformer", ACDT) coupled with a triple self-supervised objective of "masked reconstruction + adversarial + self-distillation." The method significantly outperforms general-purpose and medical SSL baselines across multiple US diagnostic tasks, including breast, thyroid, gallbladder, COVID-19 lung, and cardiac imaging.
- Are EEG Foundation Models Worth It? Comparative Evaluation with Traditional Decoders in Diverse BCI Tasks
-
The authors conduct a systematic comparative evaluation involving 5 mainstream EEG foundation models across 7 classification and 2 regression tasks using six evaluation protocols with statistical testing. They propose ST-EEGFormer, a simple ViT baseline pre-trained on 8 million raw EEG segments via Masked Autoencoding (MAE). Findings indicate that foundation models hold a significant advantage only in data-abundant population-level decoding; in data-scarce per-subject scenarios, they often fail to outperform compact CNNs or even traditional non-neural decoders. Linear probing is generally weak, and no clear scaling laws were observed.
- ASMIL: Attention-Stabilized Multiple Instance Learning for Whole-Slide Imaging
-
This paper identifies for the first time a failure mode called "Attention Dynamic Instability" in Attention-based MIL for Whole-Slide Imaging (WSI). It proposes ASMIL: a unified framework that stabilizes attention using an EMA-updated anchor model distillation, suppresses attention over-concentration with a normalized sigmoid, and mitigates overfitting via token random dropout, achieving up to 6.49% F1 improvement across multiple pathological datasets.
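Two of the ingredients named above can be sketched in a few lines. This is one plausible reading, not the paper's implementation: it assumes "normalized sigmoid" means per-instance sigmoid gates rescaled to sum to one, and that the anchor model is a standard exponential moving average of the online parameters. All function names and the `momentum` default are illustrative.

```python
import math

def normalized_sigmoid(scores):
    """Attention weights: sigmoid of each score, rescaled to sum to 1."""
    gates = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(gates)
    return [g / total for g in gates]

def softmax(scores):
    """Standard softmax, shown only for comparison."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ema_update(anchor_params, online_params, momentum=0.99):
    """Move the anchor parameters a small step toward the online parameters."""
    return [momentum * a + (1.0 - momentum) * p
            for a, p in zip(anchor_params, online_params)]
```

Because each sigmoid gate is bounded by 1, a single very large score cannot take almost all of the attention mass the way it can under softmax, which is the over-concentration effect the summary refers to.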
- Autoregressive Visual Decoding from EEG Signals
-
AVDE reformulates "decoding EEG signals into images" into a two-stage, autoregressive lightweight pipeline: first, it aligns EEG to the CLIP image space using the pre-trained EEG foundation model LaBraM combined with contrastive learning; then, it uses "Next-Scale Prediction" from the VAR framework to generate images progressively from EEG embeddings. With only 10% of the parameters, it outperforms previous SOTA models that rely on large diffusion models in both retrieval and reconstruction tasks.
- BioTamperNet: Affinity-Guided State-Space Model Detecting Tampered Biomedical Images
-
BioTamperNet constructs a Siamese network using "Affinity-Guided Attention" approximated by State-Space Models (SSM) to jointly localize tampered duplicate regions (source and copied target regions) in biomedical papers. It improves the MCC from the previous best of approximately 0.43 to 0.70 on the BioFors real retracted paper dataset, utilizing only 36.7M parameters and 29.6 GFLOPs.
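For readers unfamiliar with the metric quoted above, the Matthews correlation coefficient (MCC) is computed from the four confusion-matrix counts. The sketch below uses the standard definition; whether BioFors aggregates it per image or per pixel is not stated here.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return numerator / denominator if denominator > 0 else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction), and unlike accuracy it stays informative when tampered regions are a small fraction of the image.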
- BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
-
Instead of training a full student model for distillation, this work freezes two biosignal foundation models and trains only a lightweight "bridge network" to project intermediate representations of a new modality into the space of an old modality. This achieves unsupervised cross-modal knowledge transfer across ECG↔EEG↔PPG↔EMG with only 1%–12% of trainable parameters.
- Boosting Medical Visual Understanding From Multi-Granular Language Learning
-
This paper proposes Multi-Granular Language Learning (MGLL), a plug-and-play contrastive learning framework. By jointly optimizing soft CLIP loss, point-wise loss, and smooth KL divergence, MGLL aligns medical images with multi-label, multi-granular textual descriptions. It consistently outperforms SOTA methods on fundus and X-ray datasets and can be embedded into multimodal large language models (MLLMs) as a vision encoder, improving diagnostic accuracy by up to 34.1%.
- Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer
-
The Brain-IT framework is proposed, which utilizes a brain-inspired Brain Interaction Transformer (BIT) to cluster functionally similar brain voxels into cross-subject shared Brain Tokens. It predicts localized semantic and structural image features to achieve high-fidelity reconstruction from fMRI. With only 1 hour of data, it matches the performance of prior methods that use 40 hours.
- Bridging Radiology and Pathology Foundation Models via Concept-Based Multimodal Co-Adaptation
-
This paper proposes the CTF (Concept Tuning and Fusing) framework, which utilizes a set of clinical concepts as a "shared semantic interface" between radiology and pathology foundation models. It enables cross-domain co-adaptation of concept representations before fusion by conditioning them on each other. By training only 0.15% additional parameters, it surpasses various latent space fusion baselines in survival analysis and cancer grading while providing interpretable predictions.
- CardioComposer: Leveraging Differentiable Geometry for Compositional Control of Anatomical Diffusion Models
-
CardioComposer formulates "size, position, and shape" as differentiable losses based on voxel-based geometric moments. It applies energy guidance (gradient correction) during the sampling process of an unconditional 3D anatomical diffusion model to achieve decoupled and compositional geometric control of various anatomical sub-structures (e.g., in the heart) without retraining.
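The geometric quantities behind "size, position, and shape" can be illustrated with plain voxel moments. The sketch below is an assumption-laden toy: it computes the zeroth moment (mass), the centroid, and the second central moment (covariance) of a soft occupancy grid, and does not reproduce the paper's loss functions or its gradient guidance. The function name and the nested-list layout are hypothetical.

```python
def geometric_moments(volume):
    """Mass, centroid, and covariance of a (soft) 3D occupancy grid.

    volume: nested list indexed as volume[z][y][x] with values in [0, 1].
    Returns (mass, centroid, covariance); covariance is a 3x3 nested list
    in (z, y, x) axis order.
    """
    mass = 0.0
    first = [0.0, 0.0, 0.0]
    second = [[0.0] * 3 for _ in range(3)]
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, w in enumerate(row):
                p = (z, y, x)
                mass += w
                for a in range(3):
                    first[a] += w * p[a]
                    for b in range(3):
                        second[a][b] += w * p[a] * p[b]
    if mass == 0:
        raise ValueError("volume has no occupied voxels")
    centroid = [f / mass for f in first]
    covariance = [[second[a][b] / mass - centroid[a] * centroid[b]
                   for b in range(3)] for a in range(3)]
    return mass, centroid, covariance
```

Each quantity is a weighted sum over voxels, so with soft (continuous) occupancies it is differentiable with respect to the voxel values, which is what makes moment-based losses usable for guidance.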
- CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework
-
The CARE framework is proposed to decompose Medical VQA into a three-stage expert pipeline: "entity proposal → referring segmentation → evidence-grounded QA." By fine-tuning VLMs with RLVR and utilizing GPT-5 as a dynamic coordinator for tool planning and CoT review, CARE achieves a 77.54% average accuracy with 10B parameters, outperforming 32B end-to-end SOTA models (72.29%) across four medical VQA benchmarks.
- CerebraGloss: Instruction-Tuning a Large Vision-Language Model for Fine-Grained Clinical EEG Interpretation
-
This paper treats clinical electroencephalogram (EEG) waveforms as a "specialized visual language." By utilizing an automated data engine (including a custom YOLO waveform detector) to synthesize 94,000 EEG image-text instruction pairs, the authors perform two-stage instruction tuning on Qwen2.5-VL-3B. This results in CerebraGloss, the first generative EEG interpretation model capable of "description + multiple-choice questions + multi-turn dialogue." It outperforms GPT-5 on the self-developed open-ended benchmark CerebraGloss-Bench and achieves new SOTA results in seizure detection on TUSZ.
- Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space
-
The authors propose modeling human concept production as cumulative trajectories in Transformer embedding spaces, defining five kinematic metrics (distance, velocity, acceleration, entropy, and distance to centroid). This framework successfully distinguishes clinical groups and concept categories across four datasets (three languages; covering neurodegenerative diseases, swear word fluency, and property listing), with results showing high consistency across different embedding models.
- CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model
-
CodeBrain develops an EEG foundation model using a "time-frequency dual-codebook decoupled tokenizer + multi-scale architecture with parallel global structural convolutional SSM and sliding window attention." After pre-training on the largest public EEG corpus, it consistently outperforms existing EEG foundation models across 10 datasets in 8 task categories and provides codebook-level interpretability.
- COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics
-
COMPASS constructs conformal prediction intervals by applying linear perturbations in the intermediate feature space of segmentation networks along low-dimensional subspaces most sensitive to the target metric. It achieves significantly narrower prediction intervals than traditional CP methods across four medical segmentation tasks while maintaining valid coverage.
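For context, the kind of traditional conformal prediction (CP) baseline that such methods are compared against can be sketched as split conformal prediction on a scalar metric. This is the generic output-space procedure, not COMPASS's feature-space perturbation; the function and argument names are illustrative.

```python
import math

def conformal_interval(calib_preds, calib_truths, test_pred, alpha=0.1):
    """Split conformal interval for a scalar metric (e.g. a predicted Dice score).

    Uses absolute residuals on a held-out calibration set as nonconformity
    scores and returns an interval with marginal coverage of at least 1 - alpha,
    assuming calibration and test cases are exchangeable.
    """
    scores = sorted(abs(p - t) for p, t in zip(calib_preds, calib_truths))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected quantile rank
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (test_pred - q, test_pred + q)
```

The interval width here is the same for every test case, which is one reason methods that adapt the interval to the input can end up much narrower on average.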
- CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition
-
CONSIGN utilizes SVD to extract spatially correlated "principal uncertainty directions" from multiple samples of segmentation models, constructing a space-aware conformal prediction set that can vary jointly. This reduces the prediction set volume by several orders of magnitude compared to pixel-wise methods while maintaining statistical coverage guarantees.
- Contextual Similarity Distillation: Ensemble Uncertainties with a Single Model
-
This method estimates the predictive variance of an "infinite randomly initialized ensemble" using a single model and a single forward pass. By reformulating the ensemble variance as a supervised regression problem with kernel similarity labels, the method avoids training actual ensembles or inverting the Gram matrix. It provides uncertainty estimates comparable to or better than Deep Ensembles, validated on OOD detection and sparse reward RL exploration.
- CortiLife: A Unified Framework for Cortical Representation Learning across the Lifespan
-
CortiLife introduces CLIP-style vision-language pre-training to non-Euclidean cortical surfaces for the first time. By combining "icosahedral patching + tri-stream multi-level encoding + attention-guided masked self-distillation + metadata language prompting," it constructs a unified cortical representation spanning from infancy to old age, outperforming SOTA models like CLIP, ACLIP, and DetailCLIP in age prediction, cortical parcellation, and four types of brain disease diagnosis.
- CRONOS: Continuous time reconstruction for 4D medical longitudinal series
-
CRONOS reframes Flow Matching (FM) as a "sequence-to-image" transport problem. By utilizing a shared spatiotemporal velocity field to simultaneously transport multiple historical 3D scans toward a target volume, it supports both discrete grid-aligned and continuous real-valued timestamp voxel-level 4D medical image prediction within a single model. It outperforms existing spatiotemporal baselines and the strong LCI heuristic across three datasets: Cine-MRI, perfusion CT, and longitudinal glioma MRI.
- Cross-Timestep: 3D Diffusion Model with Trans-temporal Memory LSTM and Adaptive Priori Decoding Strategy for Medical Segmentation
-
To address the two major issues of "initial-stage collapse" at high-noise starting points and isolated timesteps when applying diffusion models to 3D medical segmentation, this paper proposes Cross-Timestep. It utilizes an "Adaptive Priori Decoding Strategy (APDS)" to inject time-decaying structural priors from conditional images to stabilize the initial stages of reverse diffusion, and a "Trans-temporal Memory LSTM (tLSTM)" to explicitly pass low-frequency structures and uncertainty saliency across timesteps. It comprehensively outperforms existing SOTA on two multi-center nasopharyngeal carcinoma datasets.
- CUPID: A Plug-in Framework for Joint Aleatoric and Epistemic Uncertainty Estimation with a Single Model
-
CUPID is a plug-and-play module that can be inserted into any intermediate layer of a pre-trained network without structural changes or retraining. It jointly estimates both aleatoric (data noise) and epistemic (model ignorance) uncertainties in a single forward pass.
- Detecting Invariant Manifolds in ReLU-Based RNNs
-
This paper proposes a semi-analytical algorithm for ReLU-based Piecewise Linear RNNs (PLRNNs) that directly computes the stable and unstable manifolds of saddle points or saddle periodic points. This allows for mapping the boundaries of different attractor basins in state space, identifying homoclinic/heteroclinic intersections, and proving the existence of chaos within the RNN—filling a long-standing gap in how to analyze the dynamical topology of discrete-time ReLU RNNs with discontinuous Jacobians.
- DISCO: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring
-
DISCO models densely-overlapping cell instance segmentation as a graph coloring problem and proposes a divide-and-conquer framework featuring "explicit marking of conflict nodes + implicit adjacency constraint disambiguation." By decomposing the cell adjacency graph via BFS and introducing five collaborative loss functions, it achieves a 7.08% PQ improvement on the high-density pathological dataset GBC-FS 2025 and attains SOTA results across four heterogeneous datasets.
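The graph-coloring view can be illustrated with a generic BFS two-coloring that flags nodes where two colors are not enough. This is only a sketch of the general idea, under the assumption that "conflict nodes" are cells lying on odd cycles of the adjacency graph; the paper's actual decomposition and its five losses are not reproduced, and the function name is hypothetical.

```python
from collections import deque

def bfs_two_coloring(adjacency):
    """Two-color an undirected graph by BFS and collect conflicting nodes.

    adjacency: dict mapping each node to an iterable of its neighbours.
    Returns (color, conflicts): color maps node -> 0 or 1, and conflicts holds
    nodes that share a color with at least one neighbour.
    """
    color = {}
    conflicts = set()
    for start in adjacency:
        if start in color:
            continue  # already reached from an earlier component
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in color:
                    color[neighbour] = 1 - color[node]
                    queue.append(neighbour)
                elif color[neighbour] == color[node]:
                    # Same color on both ends of an edge: mark both as conflicts.
                    conflicts.add(node)
                    conflicts.add(neighbour)
    return color, conflicts
```

On a chain of touching cells the conflict set is empty, while three mutually touching cells always produce a conflict, which is where an explicit extra label or constraint is needed.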
- Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems
-
The authors propose the Distributional Consistency (DC) loss, which replaces traditional pointwise data fidelity terms (e.g., MSE/NLL) with distribution-level calibration. This approach avoids overfitting to noise, significantly improving performance in DIP denoising and PET image reconstruction without requiring early stopping.
- DM4CT: Benchmarking Diffusion Models for Computed Tomography Reconstruction
-
This paper proposes DM4CT, the first systematic benchmark for diffusion models in CT reconstruction. It covers ten diffusion methods and seven baseline approaches, evaluated across medical, industrial, and synchrotron datasets, and reveals the strengths and limitations of diffusion models in CT.
- Dual-Kernel Adapter: Expanding Spatial Horizons for Data-Constrained Medical Image Analysis
-
The authors are the first to systematically demonstrate that in data-scarce medical imaging scenarios, standard Adapters are not only ineffective but can even perform worse than pure linear probing. The root cause is identified as the sharp contraction of the Effective Receptive Field (ERF) of the Adapter when training data is limited. Based on this, the Dual-Kernel Adapter (DKA) is proposed, which fuses in parallel a large-kernel (\(51 \times 51\)) depthwise convolution that expands the ERF and a small-kernel (\(5 \times 5\)) depthwise convolution that preserves local details. DKA achieves new SOTA results across classification and segmentation tasks using both natural-image and medical-image pre-trained backbones.
- Fetal-Gauge: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound
-
Fetal-Gauge integrates 13 public fetal ultrasound datasets to construct the first and largest vision-language question-answering benchmark for fetal ultrasound (42,000 images, 93,000 QA pairs, covering five clinical tasks). Systematic evaluation of 15 mainstream VLMs reveals that the strongest model, GPT-5, achieves only 55% accuracy, which is far below the threshold for clinical utility, exposing systematic shortcomings of current VLMs in the fetal ultrasound domain.
- Frequency-Balanced Retinal Representation Learning with Mutual Information Regularization
-
The authors analyze Masked Autoencoders (MAE) from a spatial frequency perspective, discovering a preference for low-frequency backgrounds and an under-encoding of diagnostically critical high-frequency details. They propose RetMAE: a framework that, without modifying the architecture, introduces a High-Frequency Mutual Information (HighFreqMI) regularization. This allows the retinal encoder to learn "frequency-balanced" representations, surpassing existing fundus foundation models using only ~25.6k unlabeled fundus images.
- Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification
-
DSFM converts fMRI BOLD time series into multi-scale time-frequency scalogram images via Discrete Wavelet Transform (DWT), compresses them into a low-frequency sparse domain via block DCT, and performs class-conditional generation using a "heat-diffusion-style" flow matching in the DCT domain. The synthesized signals are then transformed back to the time domain for data augmentation, enhancing downstream brain functional connectivity (FC) classification performance.
- Glance and Focus Reinforcement for Pan-cancer Screening
-
This paper proposes GF-Screen, a two-stage framework: a lightweight Glance model uses reinforcement learning to rapidly locate CT sub-volumes containing lesions, while the Focus model performs detailed segmentation only on the selected regions. By migrating the "intra-group relative comparison" concept of GRPO from NLP to groups of visual sub-volumes, this work is the first to achieve RL optimization without a value network in a pure vision task. It outperforms the winning solution of the FLARE25 pan-cancer challenge by \(+25.6\%\) DSC and is \(5.7\times\) faster at inference.
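The "intra-group relative comparison" idea can be sketched as follows: each sampled action in a group is scored against the group's own mean and spread, so no learned value network is needed as a baseline. This is a minimal sketch of the commonly described GRPO normalization; how GF-Screen defines rewards for sub-volumes is not given here, and the function name and `eps` are illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group to obtain per-sample advantages.

    rewards: scalar rewards for the samples in one group
             (assumed here to be one reward per candidate sub-volume).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples that beat their group average get a positive advantage and are reinforced; a group with identical rewards yields all-zero advantages and therefore no gradient signal.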
- HEEGNet: Hyperbolic Embeddings for EEG
-
This paper provides the first systematic validation of the hyperbolic nature (hierarchical structure) of EEG data. It proposes HEEGNet, a hybrid Euclidean-hyperbolic network architecture that combines a Euclidean encoder for spatio-temporal-spectral feature extraction with a hyperbolic encoder to capture hierarchical relationships. Coupled with a novel coarse-to-fine domain adaptation strategy (DSMDBN), it achieves SOTA results across cross-domain tasks in visual evoked potentials, emotion recognition, and intracranial EEG.
- HFSTI-Net: Hierarchical Frequency-spatial-temporal Interactions for Video Polyp Segmentation
-
HFSTI-Net integrates "frequency-spatial" dual-path interaction and "mask-guided recurrent memory propagation" into a single network. It addresses two persistent challenges in colonoscopy video polyp segmentation: shape collapse caused by low contrast in single frames and episodic amnesia due to target fluctuations in long sequences. It achieves SOTA performance on SUN-SEG and CVC-612 while maintaining real-time inference at 31 FPS.
- Histopathology-Genomics Multi-modal Structural Representation Learning for Data-Efficient Precision Oncology
-
MSRL uses Graph Structural Learning to pre-train a "pathology-genomics" cross-case association graph. During the fine-tuning stage, it utilizes a buffer storing real genomic features, allowing cases with only Whole Slide Images (WSI) during inference to "borrow" genomic information from diagnostically related cases. This enables the model to approach the accuracy of full multimodal fusion in scenarios where genomics are missing.
- Improving 2D Diffusion Models for 3D Medical Imaging with Inter-Slice Consistent Stochasticity
-
This paper proposes Inter-Slice Consistent Stochasticity (ISCS), which eliminates inter-slice discontinuity artifacts in 3D medical reconstruction from 2D diffusion priors by generating inter-slice correlated noise via Spherical Linear Interpolation (Slerp) during the re-noising step of diffusion sampling. It adds no extra computation, hyperparameters, or training overhead, and can be plugged into any 2D diffusion inverse solver, showing consistent improvements in sparse-view CT, limited-angle CT, and MRI super-resolution.
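Slerp-based correlated noise can be sketched in a few lines. The sketch assumes two independent Gaussian anchor vectors, one at the first slice and one at the last, with intermediate slices interpolated between them; how the paper actually places its anchors is not described here, and the function names are hypothetical.

```python
import math
import random

def slerp(z0, z1, t):
    """Spherical linear interpolation between two vectors, with t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    omega = math.acos(max(-1.0, min(1.0, dot / (n0 * n1))))
    s = math.sin(omega)
    if s < 1e-6:
        # Nearly parallel or antiparallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

def correlated_slice_noise(num_slices, dim, seed=0):
    """One noise vector per slice, varying smoothly between two Gaussian anchors."""
    rng = random.Random(seed)
    first = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    last = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    if num_slices == 1:
        return [first]
    return [slerp(first, last, i / (num_slices - 1)) for i in range(num_slices)]
```

Plain linear interpolation of two independent Gaussian samples shrinks the variance (to one half at the midpoint), whereas for nearly orthogonal high-dimensional vectors the Slerp weights satisfy \(w_0^2 + w_1^2 \approx 1\), so the interpolated noise keeps roughly unit variance.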
- Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification
-
This paper proposes DyMo, an inference-time dynamic modality selection framework. It theoretically derives a reward function, MTIR (Multimodal Task-Relevant Information Reward), from a classification loss reduction proxy, class prototype distance, and intra-class similarity calibration, and uses it to iteratively select and fuse reliable recovered modalities during inference. This systematically addresses the "discarding-imputation dilemma" (loss of information vs. introduction of noise).
- Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation
-
VeloxSeg utilizes a trio of "Paired Window Attention + JL Lemma-constrained lightweight convolution + Gram matrix-based texture knowledge distillation" to simultaneously achieve high accuracy (Dice +26%) and efficiency (11× GPU throughput, 48× CPU, 1/20 VRAM usage) in 3D medical segmentation, resolving the "efficiency/robustness" trade-off in lightweight models.
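For reference, the Johnson-Lindenstrauss lemma gives a target dimension that preserves pairwise distances among \(n\) points up to a factor \(1 \pm \varepsilon\). The sketch below evaluates the commonly cited bound \(k \ge 4 \ln n / (\varepsilon^2/2 - \varepsilon^3/3)\); how VeloxSeg uses the lemma to constrain its convolutions is not described here, and the function name is illustrative.

```python
import math

def jl_min_dim(n_points, eps):
    """Smallest integer k satisfying the bound k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if n_points < 1:
        raise ValueError("n_points must be positive")
    return math.ceil(4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))
```

The bound depends only logarithmically on the number of points and not at all on the original dimension, which is why it is attractive as a guide for how far a feature width can be reduced.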
- Joint Adaptation of Uni-modal Foundation Models for Multi-modal Alzheimer's Disease Diagnosis
-
This paper proposes a "modality-anchored interaction" framework that combines uni-modal foundation models from four domains—sMRI, fMRI, clinical text, and genetics—for Alzheimer's disease diagnosis. By rotating each modality as an anchor and freezing most of its parameters, a modality-aware Q-former selectively projects features from auxiliary modalities into the anchor's feature space. This achieves deep cross-modal interaction without destroying the integrity of the individual pre-trained representations.
- K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model
-
K-Prism unifies semantic priors, few-shot reference examples, and user interactive feedback into 1-D sparse and 2-D dense prompts, dynamically routed by a Mixture-of-Experts (MoE) decoder. It establishes new benchmarks across 18 medical image datasets for semantic, in-context, and interactive segmentation.
- LaVCa: LLM-assisted Visual Cortex Captioning
-
The LaVCa method is proposed to generate natural language descriptions (captions) for each voxel in the human visual cortex using LLMs. Through a four-step pipeline consisting of "encoding model → optimal image selection → MLLM caption generation → LLM keyword refinement + sentence composition," it reveals voxel-level visual selectivity more accurately and diversely than existing methods like BrainSCUBA.
- Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation
-
The study proposes the Δ-LFM framework, which uses an ArcRank loss to construct patient-specific, temporally aligned trajectories in latent space (consistent angle and monotonically increasing magnitude) and extends the flow matching time range from \([0, 1]\) to actual time intervals \([0, T]\) for prediction at arbitrary time points. It outperforms 8 baseline methods across three Alzheimer's longitudinal MRI benchmarks and introduces a progression-specific metric, Δ-RMAE.
- Learning Self-Critiquing Mechanisms for Region-Guided Chest X-Ray Report Generation
-
RadSCR encodes the "repeated self-questioning" diagnostic process of radiologists into the model architecture. By employing three self-critiquing mechanisms—substituting abnormality classes, swapping patient images, and checking for missed findings—it enables end-to-end learning. This significantly improves clinical accuracy and the reliability of abnormality grounding in chest X-ray reports without requiring LLM inference at test time.
- Lightweight Transformer for EEG Classification via Balanced Signed Graph Algorithm Unrolling
-
The paper unrolls a "spectral denoising algorithm on balanced signed graphs" into an interpretable Transformer-like network. It utilizes the reconstruction errors of two class-specific denoisers for binary epilepsy EEG classification, achieving an accuracy improvement from 85% to 97.6% while using less than 1% of the parameters compared to standard Transformers.
- M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
-
M3CoTBench is the first benchmark specifically designed to evaluate the quality of Chain-of-Thought (CoT) in MLLMs for medical image understanding. It goes beyond final answer accuracy by quantifying the reasoning paths across four dimensions: correctness, efficiency, impact, and consistency. The study reveals that current MLLMs are neither reliable nor interpretable in clinical reasoning, and accuracy often decreases when CoT is applied.
- MedGMAE: Gaussian Masked Autoencoders for Medical Volumetric Representation Learning
-
MedGMAE shifts the MIM pre-training target for 3D medical imaging from "reconstructing discrete voxel intensities" to "predicting a set of continuous 3D Gaussian primitives followed by volume rendering." This learns encoder representations that better align with anatomical continuity and transforms the decoder into a transferable "geometric prior" capable of providing zero-shot initialization for 3DGS-CT reconstruction.
- MedLesionVQA: A Multimodal Benchmark Emulating Clinical Visual Diagnosis for Body Surface Health
-
MedLesionVQA, developed by the ByteDance Xiaohe team in collaboration with Peking Union Medical College Hospital (Tsinghua Changgung Hospital), is the first multimodal benchmark for body surface health aligned with the "step-by-step visual diagnosis workflow" of physicians. It comprises 12K unreleased real-world hospital patient images and 19K expert-reviewed QAs, covering 94 lesion types, 110 body parts, and 96 diseases. Evaluations of 20+ mainstream MLLMs show a top accuracy of only 56.2%, significantly lower than junior doctors (61.4%) and senior experts (73.2%).
- MindMix: A Multimodal Foundation Model for Auditory Perception Decoding via Deep Neural-Acoustic Alignment
-
MindMix utilizes a two-stage strategy: first, a high-capacity EEG encoder is pre-trained on 3500+ hours of unlabeled EEG; second, a multimodal foundation model for auditory perception decoding is constructed by performing contrastive learning on 100+ hours of EEG-audio paired data via the CALRA cross-modal alignment module. It significantly outperforms existing single-modal EEG foundation models and task-specific SOTAs across three categories of tasks: auditory attention decoding, emotion recognition, and music retrieval (reaching 99.82% accuracy on KUL).
- Mixture of Mini Experts: Overcoming the Linear Layer Bottleneck in Multiple Instance Learning
-
This paper identifies the "task-specific linear layer," often overlooked in Multiple Instance Learning (MIL) pipelines, as the performance bottleneck. It proposes MAMMOTH, a plug-and-play multi-head soft-routing MoE module, to replace this layer. Without increasing the parameter count, MAMMOTH significantly improves the performance of any MIL model (including simple max/mean pooling).
- MnemoDyn: Learning Resting State Dynamics from 40K fMRI Sequences
-
MnemoDyn conceptualizes resting-state fMRI (rs-fMRI) as a trajectory in a latent space driven by a "learnable evolution operator." By replacing Transformer self-attention with wavelet-parameterized pseudo-differential operators, the authors pre-train a lightweight, long-sequence-friendly, and cross-dataset generalizable brain imaging foundation model on approximately 40K rs-fMRI sequences.
- Modeling the Density of Pixel-level Self-supervised Embeddings for Unsupervised Pathology Segmentation in Medical CT
-
This paper proposes Screener, which replaces ImageNet pre-trained features with dense self-supervised features and replaces manual sinusoidal positional encodings with "mask-invariant" learnable conditional variables. This approach renders the density-based Unsupervised Visual Anomaly Segmentation (UVAS) framework fully self-supervised. After training on 30,000 unlabeled CT scans, it significantly outperforms existing UVAS methods on a multi-pathology test set of 1,820 cases.
- Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction
-
MindHier shifts fMRI-to-image reconstruction from "diffusion models + single static guidance" to "next-scale autoregression + hierarchical neural guidance." By injecting brain signals into the generation process across scales following the "forest before trees" principle, it achieves SOTA semantic metrics on NSD while being 4.67× faster and more deterministic.
- NAB: Neural Adaptive Binning for Sparse-View CT Reconstruction
-
This work replaces Random Fourier Coding in Implicit Neural Representations (INR) with a set of differentiable "adaptive rectangular bins." By explicitly incorporating the rectangular shape priors common in industrial objects into the coordinate encoding, the position, size, rotation, and steepness of each bin are learned end-to-end from projection data. This approach significantly outperforms INR baselines in sparse-view CT reconstruction.
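One way to picture a differentiable rectangular bin is as a product of sigmoid edges evaluated in the bin's rotated local frame. The sketch below is an illustrative assumption, not the paper's formulation: `soft_rect_bin` and all of its parameter names (center, half-sizes, rotation angle, steepness) are hypothetical.

```python
import math


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_rect_bin(x, y, cx, cy, half_w, half_h, theta, steepness):
    """Smooth indicator of a rotated rectangle: near 1 inside, near 0 outside.

    All parameters are hypothetical stand-ins for learnable bin attributes
    (position, size, rotation, steepness).
    """
    # Express the query point in the bin's local (rotated) coordinate frame.
    dx, dy = x - cx, y - cy
    u = math.cos(theta) * dx + math.sin(theta) * dy
    v = -math.sin(theta) * dx + math.cos(theta) * dy
    # Each axis contributes a soft "inside" score; steepness sharpens the edge.
    inside_u = _sigmoid(steepness * (half_w - abs(u)))
    inside_v = _sigmoid(steepness * (half_h - abs(v)))
    return inside_u * inside_v
```

Because every operation is smooth in the bin parameters, gradients from a reconstruction loss could in principle flow back to position, size, rotation, and steepness.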
- Nef-Net v2: Adapting Electrocardio Panorama in the Wild
-
This work transfers "arbitrary-view ECG synthesis" from idealized laboratory assumptions to real-world clinical practice. By utilizing a Geometric View Transformer for direct view-to-view mapping, coupled with a three-stage (Pretraining → Device Calibration → On-the-fly Calibration) pipeline, the method addresses three major deployment challenges: long-duration signals, cross-device variance, and electrode displacement. It achieves a PSNR improvement of approximately 6 dB over the previous generation Nef-Net.
- Neuro-Symbolic Decoding of Neural Activity
-
The paper proposes NEURONA, a neuro-symbolic framework for fMRI decoding and conceptual grounding. By decomposing visual scenes into symbolic programs (logical combinations of concepts), it significantly outperforms end-to-end neural decoding and linear models in fMRI question-answering tasks.
- ODEBRAIN: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks
-
ODEBRAIN utilizes Neural Ordinary Differential Equations (NODE) to explicitly model multi-channel EEG as a "continuous-time dynamic system." By constructing noise-resistant initial states through a dual encoder, solving latent space trajectories via an adaptive spatio-temporal vector field, and employing a graph-structured multi-step prediction loss, it significantly outperforms discrete recurrent baselines on TUSZ/TUAB epilepsy and abnormal EEG tasks (F1 gain of 6.0%, ACC gain of 8.1%).
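The continuous-time view can be illustrated with a fixed-step Euler rollout of a latent state under a vector field. This is a minimal sketch under stated assumptions: `euler_rollout` and `decay_field` are hypothetical names, the vector field is a hand-written linear decay rather than a learned network, and real Neural ODE solvers typically use adaptive steps.

```python
from typing import Callable, List, Sequence


def euler_rollout(
    f: Callable[[Sequence[float], float], Sequence[float]],
    z0: Sequence[float],
    t0: float,
    t1: float,
    steps: int,
) -> List[float]:
    """Integrate dz/dt = f(z, t) from t0 to t1 with forward Euler."""
    dt = (t1 - t0) / steps
    z = list(z0)
    t = t0
    for _ in range(steps):
        dz = f(z, t)
        z = [zi + dt * dzi for zi, dzi in zip(z, dz)]
        t += dt
    return z


def decay_field(z: Sequence[float], t: float) -> List[float]:
    # Hypothetical vector field: each latent dimension decays exponentially.
    return [-zi for zi in z]
```

In a learned model, `decay_field` would be replaced by a parameterized network, and the rollout would be compared against future observations to form a multi-step prediction loss.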
- OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
-
OmniCT utilizes a Spatial Consistency Enhancement (SCE) module comprising "slice composition + tri-axial positional encoding + MoE hybrid projection" to unify 2D slices and 3D volumes into a single LVLM token space. It further incorporates Organ-level Semantic Enhancement (OSE) to explicitly inject anatomical region priors into representations. Combined with the MedEval-CT dataset of 1.7 million samples and a hybrid benchmark, OmniCT significantly outperforms existing medical and general LVLMs in both slice-driven and volume-driven CT tasks (7B model averages 81.45 for slices and 66.15 for volumes).
- OpenPros: A Large-Scale Dataset for Limited View Prostate Ultrasound Computed Tomography
-
This paper constructs OpenPros, the first large-scale dataset for limited-view prostate Ultrasound Computed Tomography (USCT). By synthesizing anatomically realistic 3D Speed-of-Sound (SOS) volumes based on clinical MRI/CT and ex-vivo measurements, the authors generate 280,000 pairs of 2D SOS maps and full-waveform ultrasound data. Accompanied by an open-source FDTD solver and a physics/deep learning inversion benchmark, the work reveals that while deep learning models are fast and accurate, they still fail to resolve internal prostate micro-structures and struggle with cross-patient generalization.
- PathChat-SegR1: Reasoning Segmentation in Pathology via SO-GRPO
-
To address the difficulty of segmenting rare or unseen morphologies in pathology, this paper develops a pathology-specific reasoning segmentation model, PathChat-SegR1. It employs stain-invariant self-distillation to train a pathological vision encoder and utilizes SO-GRPO reinforcement learning to enable the VLM to autonomously decide "when to output the <SEG> token to trigger segmentation." It achieves a 61% improvement in zero-shot Dice on unseen pathologies compared to the previous SOTA.
- Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation
-
PRDiT proposes a two-stage residual diffusion framework to generate high-resolution 3D CT volumes directly at the voxel level. It utilizes a lightweight MLP "Local Denoiser" to estimate low-frequency coarse structures from overlapping 3D patches, followed by a "Global Residual DiT" to recover high-frequency residuals using a global field of view. Combined with a hot predictor-corrector sampling scheme and a scaling strategy that reuses low-resolution backbones, the method surpasses HA-GAN, 3D LDM, and WDM-3D in 3D FID, MMD, and Wasserstein distance on LIDC-IDRI / RAD-ChestCT, while reducing \(256^3\) training costs to roughly 1/4 to 1/6 of those of competitors.
- Prior-aware and Context-guided Group Sampling for Active Probabilistic Subsampling
-
Building upon Active Deep Probabilistic Subsampling (A-DPS), the proposed method first acquires a batch of samples using a fixed prior mask learned from the training set, then performs context-guided grouped active sampling using DPS-top-k. Supported by a Lipschitz analysis proving that grouped sampling smooths optimization, the method outperforms A-DPS and DPS across MNIST/CIFAR-10 classification, fastMRI reconstruction, and AeroRIT hyperspectral segmentation.
- ProstaTD: Bridging Surgical Triplet from Classification to Fully Supervised Detection
-
This work constructs the first large-scale multi-center dataset for "fully supervised surgical triplet detection," named ProstaTD (21 robot-assisted radical prostatectomies, 71,775 frames, 196,490 instances with bounding boxes, 89 triplet classes). By employing clinically defined temporal boundaries and precise bounding boxes, the task is advanced from "frame-level weakly supervised classification" to "fully supervised detection with spatial localization." It is accompanied by two labeling tools, an evaluation suite, and TDnet—a baseline integrating multi-task learning and instance-level self-distillation.
- Reducing Semantic Mismatch in Brain-to-Text Decoding Through Personalized Multimodal Masking
-
This paper proposes Yo'Mind, which employs personalized multimodal semantic masking driven by Optimal Transport (OT). It identifies visual and textual semantics actually encoded by each subject's brain signals during image viewing and uses them for brain-to-text decoding, thereby alleviating semantic mismatch between brain and machine representations and achieving superior results in cross-subject brain-to-text reconstruction on the Natural Scenes Dataset (NSD).
- Reliable Evaluation of MRI Motion Correction: Dataset and Insights
-
Addressing the fundamental dilemma that "3D MRI motion correction methods cannot be reliably evaluated," this paper releases PMoC3D, a paired real-motion dataset, and proposes MoMRISim, a feature-space metric trained via self-supervision. By systematically auditing three evaluation paradigms—real-paired, simulated-motion, and no-reference—the study concludes that "Real-Paired + MoMRISim" is the most reliable despite being imperfect, whereas simulated motion systematically overestimates algorithms, and no-reference metrics favor over-smoothed deep learning outputs.
- Rethinking Model Calibration through Spectral Entropy Regularization in Medical Image Segmentation
-
This paper reframes the over-confidence calibration problem in medical image segmentation from a frequency domain perspective. It posits that low-frequency dominated spectral bias and confidence saturation (which suppresses total spectral energy in confidence maps) jointly lead to boundary uncertainty distortion. The authors introduce spectral entropy regularization with cross-batch power spectrum smoothing during training to improve calibration with minimal sacrifice to segmentation accuracy.
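For intuition, the spectral entropy of a confidence map can be computed by treating its normalized 2D power spectrum as a probability distribution. The sketch below is a generic definition, not the paper's regularizer: `spectral_entropy` is a hypothetical helper, and the cross-batch power spectrum smoothing is not shown.

```python
import numpy as np


def spectral_entropy(conf_map: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of the normalized 2D power spectrum of a map."""
    # Power spectrum: squared magnitude of the 2D discrete Fourier transform.
    power = np.abs(np.fft.fft2(conf_map)) ** 2
    # Normalize so the spectrum sums to (approximately) one.
    p = power / (power.sum() + eps)
    # Low entropy means energy concentrated in few frequencies.
    return float(-(p * np.log(p + eps)).sum())
```

A saturated, spatially constant confidence map puts all of its energy in the zero-frequency bin and therefore has near-zero spectral entropy, while a map with energy spread across frequencies scores higher.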
- Rethinking Radiology Report Generation: From Narrative Flow to Topic-Guided Findings
-
This paper points out that the "narrative flow imitation" paradigm in report generation causes VLMs to over-rely on language priors and weakens visual grounding. The authors propose LLaVA-TA: decomposing the report into independent topics organized by anatomical regions, where each topic generates a single finding sentence based on the full image and a corresponding anatomical mask. This approach significantly improves RadGraph F1 (report-level 29.4→34.3, topic-level up to 44.0) and CheXpert F1 on MIMIC-CXR.
- SE-Diff: Simulator and Experience Enhanced Diffusion Model for Comprehensive ECG Generation
-
SE-Diff integrates a lightweight ODE-based ECG simulator and LLM-based retrieval augmentation over EHR case experience into a conditional latent diffusion model. This allows for the generation of 12-lead, 10-second ECGs from clinical text that conform to both the physical mechanisms of cardiac electrical activity and real-world clinical experience, outperforming previous text-to-ECG methods in signal fidelity, text alignment, and downstream classification tasks.
- SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding
-
The authors propose SEED (Semantic Evaluation for Visual Brain Decoding), a composite evaluation metric combining Object F1, Cap-Sim, and EffNet, which significantly surpasses all existing metrics in alignment with human evaluation.
- Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
-
PRISM overturns the convention that "visual representations must be used to reconstruct visual images." The authors first prove via alignment metrics that fMRI signals most closely resemble the textual space of language models. Consequently, they project fMRI into a structured textual space as an intermediate bridge. By utilizing "automated search for brain-aligned keywords + object-centric diffusion" to synthesize images from text, they reduce the perceptual loss LPIPS by up to approximately 6% across the NSD, BOLD5000, and GOD datasets.
- Sequential Information Bottleneck Fusion: Towards Robust and Generalizable Multi-Modal Brain Tumor Segmentation
-
Addressing the common "missing modality" issue in multi-modal MRI brain tumor segmentation, this paper proposes Sequential Information Bottleneck Fusion to progressively compress information from various modalities into a shared latent representation. From an information-theoretic perspective, it is demonstrated that this approach is more robust and provides a tighter generalization upper bound than mainstream parallel fusion. Based on this, the SMSN network is designed, which comprehensively outperforms parallel fusion baselines on BRATS18/20 and generalizes from glioma to brain metastasis without fine-tuning.
- sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals
-
sleep2vec performs cross-modal contrastive pre-training on 42,249 nights across nine sleep physiological signals. It utilizes a DASH-InfoNCE objective that dynamically weights negative samples based on demographic and acquisition metadata to align heterogeneous signals into a unified representation space. This enables inference with arbitrary modality subsets, provides robustness to sensor loss, and characterizes the scaling laws of PSG signals relative to modality diversity and model scale.
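The idea of weighting negatives in a contrastive objective can be sketched with a generic weighted InfoNCE loss for a single anchor. This is an assumption-laden illustration, not the DASH-InfoNCE objective itself: `weighted_info_nce` is a hypothetical name, and how the weights would be derived from demographic and acquisition metadata is not specified here.

```python
import math
from typing import Sequence


def weighted_info_nce(
    sim_pos: float,
    sim_negs: Sequence[float],
    neg_weights: Sequence[float],
    temperature: float = 0.1,
) -> float:
    """InfoNCE loss for one anchor with per-negative weights.

    sim_pos and sim_negs are similarity scores (for example cosine
    similarities); neg_weights are hypothetical non-negative weights.
    """
    pos = math.exp(sim_pos / temperature)
    # Each negative contributes to the denominator in proportion to its weight.
    neg = sum(w * math.exp(s / temperature) for s, w in zip(sim_negs, neg_weights))
    return -math.log(pos / (pos + neg))
```

With all weights equal to one this reduces to standard InfoNCE; raising a negative's weight makes the loss push harder against that sample.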
- Spike-based Digital Brain: A Novel Fundamental Model for Brain Activity Analysis
-
This paper proposes Spike-DB, which introduces the spiking computation paradigm into fMRI time-series modeling. By using spiking neurons simulated with IIR filters to encode BOLD signals into spike trains, it employs an "anchor region \(\rightarrow\) target region" self-supervised prediction framework to learn temporal driving relationships between brain regions. It achieves high-precision brain activity prediction, disease classification, anomalous region identification, and effective connectivity inference on epilepsy and Alzheimer's (ADNI) datasets.
- SpineBench: A Clinically Significant, Segment-Aware Spinal Diagnosis and Treatment Evaluation Benchmark and SpineMed-450k Corpus
-
This paper constructs SpineMed-450k, a traceable multimodal spinal diagnosis and treatment instruction corpus with 450,000 entries, and an accompanying benchmark SpineBench through a clinician-in-the-loop approach. It reveals systematic weaknesses in current Large Vision-Language Models (LVLMs) regarding fine-grained reasoning for "locating specific vertebral segments." Using a fine-tuned 7B model, SpineGPT, the authors demonstrate that specialized instruction data enables small models to achieve clinical performance comparable to Gemini-2.5-Pro.
- Stochastic Optimal Control for Continuous-Time fMRI Representation Learning
-
BDO treats heterogeneous fMRI time series as continuous-time latent stochastic dynamical systems, utilizing stochastic optimal control to unify MAE reconstruction and JEPA latent variable prediction. This approach learns brain dynamic representations that are more robust to TR discrepancies and more computationally efficient across multiple datasets.
- MedVLSynther: Synthesizing High-Quality Medical Visual Question Answering from Biomedical Literature with Generator-Verifier LMMs
-
This paper proposes MedVLSynther, a rubric-driven and context-aware generator-verifier framework that synthesizes multiple-choice medical VQA data directly from open PubMed biomedical literature (figures, captions, and in-text references). After a multi-stage automated verification process, it produces 13,087 high-quality samples (MedSynVQA). Training open-source LMMs using this data via Reinforcement Learning from Verifiable Rewards (RLVR) achieves average accuracies of 55.85 (3B) and 58.15 (7B) across 6 medical VQA benchmarks, outperforming several strong medical LMM baselines.
- The Mind's Transformer: Computational Neuroanatomy of LLM-Brain Alignment
-
Pending further reading of the paper.
- Towards Interpretable Visual Decoding with Attention to Brain Representations
-
NeuroAdapter is proposed to segment fMRI signals into independent tokens by brain region and directly condition Stable Diffusion via cross-attention. By bypassing traditional CLIP/DINO intermediate embedding spaces, it achieves high-level semantic metrics superior to or on par with existing methods on datasets like NSD. Furthermore, the IBBI bi-directional interpretability framework is introduced, revealing for the first time how different cortical regions dynamically drive image generation during the denoising trajectory.
- Towards Text–Mask Consistency in Medical Image Segmentation
-
To address the "mask-text mismatch" in text-guided medical segmentation, C2Seg proposes a two-stage scheme. The pre-training stage uses Cluster-aware Contrastive Learning (CaCL) with text-similarity-based soft labels to resolve false-negative conflicts caused by templated clinical descriptions. The fusion stage employs a Bi-directional Complementary Attention Module (BCAM) to explicitly construct a "language-dominant" spatial feature path, complemented by KAN gating for fine-grained selection. Together, these improve both text-mask consistency and segmentation accuracy across four public medical datasets.
- U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
-
U2-BENCH is the first benchmark to systematically evaluate the ultrasound understanding capabilities of Large Vision-Language Models (LVLMs). It samples 7,241 cases across 15 anatomical sites from 40 authorized datasets and defines 8 clinical tasks in four categories (classification, detection, regression, and text generation). An evaluation of 20 open-source and closed-source models reveals that while models perform acceptably on image-level classification, they generally fail at spatial reasoning and clinical language generation.
- UltraGauss: Ultrafast Gaussian Reconstruction of 3D Ultrasound Volumes
-
UltraGauss transforms Gaussian Splatting from "camera projection + depth occlusion" into an ultrasound-specific rendering paradigm of "probe plane intersection + in-plane aggregation." Combined with triangular precision matrix parameterization and two-stage load-balanced CUDA rasterization, it reconstructs 2D ultrasound slices into 3D volumes in ~20 minutes on a single GPU, achieving 0.99 SSIM. Clinical experts generally consider its reconstructions more realistic than baselines.
- Unified Brain Surface and Volume Registration
-
NeurAlign trains a "volume registration network" and a "spherical registration network" simultaneously within a shared framework, coupled via a cortical consistency loss. This allows the cortical (surface) and subcortical (volume) structures of brain MRI to be aligned consistently in a single forward pass. At inference, it requires only a single MRI, with no meshes or segmentations. It leads in registration accuracy by up to 7 cortical Dice points and runs orders of magnitude faster than the standard CVS method.
- WavePolyp: Video Polyp Segmentation via Hierarchical Wavelet-based Feature Aggregation and Inter-frame Divergence Perception
-
WavePolyp employs wavelets to decompose per-frame features into high and low frequencies for separate enhancement and aggregation (HWFA). It also introduces a module that performs difference-based attention along the temporal dimension (IDP) to explicitly model polyp variations between adjacent frames. This enables the model to both extract highly camouflaged polyps and maintain stable cross-frame tracking in colonoscopy videos, outperforming previous SOTA on SUN-SEG and CVC-612 across all metrics while achieving near real-time performance (23 FPS).
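The split into low and high frequencies can be illustrated with a one-level Haar transform on a 1D signal. This is only a sketch: the wavelet family used by the paper is not stated here, `haar_1d` and `inverse_haar_1d` are hypothetical helpers, and real feature maps would need a 2D transform. The input length is assumed to be even.

```python
from typing import List, Sequence, Tuple

_SQRT2 = 2 ** 0.5


def haar_1d(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One-level orthonormal Haar transform of an even-length signal."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    low = [(a + b) / _SQRT2 for a, b in pairs]   # local averages (coarse content)
    high = [(a - b) / _SQRT2 for a, b in pairs]  # local differences (edges, detail)
    return low, high


def inverse_haar_1d(low: Sequence[float], high: Sequence[float]) -> List[float]:
    """Reconstruct the original signal from its Haar coefficients."""
    out: List[float] = []
    for l, h in zip(low, high):
        out.append((l + h) / _SQRT2)
        out.append((l - h) / _SQRT2)
    return out
```

Because the transform is invertible, the two bands can be enhanced separately and then recombined without losing information.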
- You Point, I Learn: Online Adaptation of Interactive Segmentation Models for Handling Distribution Shifts in Medical Imaging
-
To address the inconsistency between training and testing distributions after deploying medical image models, this paper treats the "final user-corrected prediction" in interactive segmentation as a pseudo-ground truth. It designs a streamlined online adaptation framework, OAIMS (Post-Interaction + Mid-Interaction updates + Click-Centered Gaussian loss), which consistently outperforms existing methods across 5 fundus and 4 brain MRI datasets, achieving over 10% Dice improvement on brain MRI.
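Several entries in this list report gains in Dice. As a reminder of what that number measures, here is a minimal sketch of the standard Dice coefficient for binary masks given as flat sequences of 0/1 values. The helper name `dice` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

For example, a prediction covering two pixels against a one-pixel target with one pixel of overlap scores 2·1 / (2 + 1) ≈ 0.667.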