MRI Contrast Enhancement Kinetics World Model

Conference: CVPR2026
arXiv: 2602.19285
Code: GitHub
Area: Medical Imaging
Keywords: MRI contrast enhancement, world model, spatiotemporal consistency learning, latent space alignment, diffusion model, DCE-MRI

TL;DR

This paper presents the first MRI Contrast Enhancement Kinetics World Model (MRI CEKWorld), which leverages spatiotemporal consistency learning (STCL) on sparsely sampled data to generate continuous, high-fidelity contrast-enhanced sequences from non-contrast MRI, addressing the dual challenges of content distortion and temporal discontinuity.

Background & Motivation

Low information efficiency of clinical contrast MRI: Contrast agent injection carries safety risks (deposition, allergic reactions) and significant costs, while the acquired sequences consist of fixed, sparse time points — a severe mismatch between information yield and cost.
World models can simulate contrast enhancement kinetics: World models excel at learning the dynamic evolution of physical systems; applied to MRI contrast enhancement, they enable continuous dynamic imaging without contrast agents, eliminating injection-related risks.
Extremely low temporal resolution of MRI: Because of reconstruction time and patient breath-holding constraints, MRI sequences are sparsely sampled, with intervals on the order of seconds between acquisitions rather than the millisecond-scale frame spacing of video, which severely limits model training.
Spatial content distortion from sparse training: Missing time points lack ground-truth supervision, causing models to overfit irrelevant features and produce structural deformations and organ misalignment.
Temporal discontinuity from sparse training: Without continuous sampling data, models cannot learn smooth contrast enhancement kinetics, resulting in temporal jumps and inter-frame inconsistencies.
Limitations of existing methods: Static generation methods synthesize only single-time-point images; dynamic sequence methods remain limited to image-to-image mapping without genuinely simulating contrast kinetics; prior regularization fails to preserve patient-specific details, and pixel-space smoothing introduces blurring.

Method

Overall Architecture

MRI CEKWorld is built upon the ControlNet architecture. The inputs are a non-contrast MRI image \(\mathcal{I}_{p,0}\) and a continuous time variable \(t\); the output is a contrast-enhanced image \(\hat{\mathcal{I}}_p(t)\) at any target time point. The model uses three encoders:

VAE encoder \(E_{gt}\): Encodes ground-truth contrast-enhanced images into latent space.
CLIP temporal encoder \(E_t\): Converts time text (HH:MM:SS format, representing elapsed time after contrast injection) into high-dimensional semantic features to guide time-specific enhancement generation.
Zero-convolution image encoder \(E_{img}\): Encodes the non-contrast image \(\mathcal{I}_{p,0}\) and injects features into each layer of the U-Net via zero convolutions as generation conditioning.
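The temporal encoder receives elapsed time as HH:MM:SS text. The following is a minimal sketch of how a continuous time value in seconds might be turned into such a string before text encoding. The helper name `format_elapsed` and the rounding to whole seconds are assumptions made for illustration; the CLIP tokenization and encoding steps are not shown.

```python
def format_elapsed(seconds: float) -> str:
    """Format elapsed time after injection as HH:MM:SS (hypothetical helper)."""
    if seconds < 0:
        raise ValueError("elapsed time must be non-negative")
    total = int(round(seconds))  # assumption: whole-second resolution
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Example time points inside the 300-second abdominal acquisition window.
prompts = [format_elapsed(t) for t in (0, 15, 72, 300)]
print(prompts)  # ['00:00:00', '00:00:15', '00:01:12', '00:05:00']
```

Because the time condition is plain text, any target time point can be requested at inference, including ones never acquired during training.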

Training proceeds in two stages, with the total loss comprising a diffusion loss plus spatial and temporal regularization terms. At inference, only the non-contrast image and target time point are required to generate the corresponding contrast-enhanced image.

Key Design 1: Latent Alignment Learning (LAL), Ensuring Spatial Fidelity

Design Motivation: During contrast enhancement, the anatomical structures of a given patient (organ contours, tissue boundaries) remain unchanged, constituting a spatial consistency prior.

Co-occurrence encoding: Latent representations \(\hat{x}_0\) are extracted from the diffusion reverse process, flattened and centered, and used to compute covariance matrices \(\Sigma_t\) at each time point, encoding spatial co-occurrence relationships among anatomical regions. Shrinkage regularization \(\tilde{\Sigma}_t = (1-\gamma)\Sigma_t + \gamma I + \varepsilon I\) is applied to ensure positive definiteness.
Patient-level template construction: Log-Cholesky parameterization maps each covariance matrix to a Euclidean vector \(z_t\); averaging across all time points yields a patient-specific template \(\bar{z} = \frac{1}{T}\sum_{t=1}^T z_t\) representing the time-invariant anatomy.
Isometry constraint: Each \(z_t\) is constrained to maintain a consistent distance from the template \(\bar{z}\) of the same patient via the spatial loss \(\mathcal{L}_{Spatial} = \frac{1}{P}\sum_p \frac{1}{T_p}\sum_t \|z_t - \bar{z}\|_2^2\), where \(P\) is the number of patients and \(T_p\) the number of time points for patient \(p\). This preserves content consistency while permitting legitimate dynamic variation.
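The three LAL steps above can be sketched in NumPy for a single patient. This is an illustrative sketch under stated assumptions, not the authors' implementation: the covariance is taken across latent channels with spatial positions as samples, the values of `gamma` and `eps` are placeholders, and all function names are hypothetical.

```python
import numpy as np


def shrunk_covariance(latent, gamma=0.1, eps=1e-5):
    """Covariance of a (channels, H, W) latent, shrunk toward the identity."""
    x = latent.reshape(latent.shape[0], -1)        # flatten spatial dimensions
    x = x - x.mean(axis=1, keepdims=True)          # center each channel
    sigma = x @ x.T / (x.shape[1] - 1)             # (C, C) co-occurrence matrix
    eye = np.eye(sigma.shape[0])
    return (1.0 - gamma) * sigma + gamma * eye + eps * eye


def log_cholesky(sigma):
    """Map an SPD matrix to a Euclidean vector: log-diagonal, then strict lower triangle."""
    chol = np.linalg.cholesky(sigma)
    rows, cols = np.tril_indices_from(chol, k=-1)
    return np.concatenate([np.log(np.diag(chol)), chol[rows, cols]])


def spatial_loss(latents):
    """Mean squared distance of each z_t to the patient template z_bar."""
    z = np.stack([log_cholesky(shrunk_covariance(lat)) for lat in latents])
    z_bar = z.mean(axis=0)                         # patient-specific template
    return float(np.mean(np.sum((z - z_bar) ** 2, axis=1)))


rng = np.random.default_rng(0)
latents = [rng.normal(size=(4, 8, 8)) for _ in range(5)]  # toy latents, T = 5
print(spatial_loss(latents))
```

Averaging this quantity over patients gives the form of \(\mathcal{L}_{Spatial}\) stated above. In actual training the same operations would need a differentiable framework so that gradients reach the diffusion model.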

Key Design 2: Latent Difference Learning (LDL), Ensuring Temporal Continuity

Design Motivation: Contrast enhancement sequences should follow smooth evolutionary trends without abrupt transitions.

Dense sequence interpolation: \(K_i\) intermediate virtual time points are uniformly inserted between original sparse acquisition time points to construct a dense temporal sequence \(T_{dense}\). Latent representations for acquired time points are recovered from the denoising process, while latent predictions for interpolated points are generated from Gaussian noise.
Second-order central difference smoothing: Discrete second-order central differences \(\mathbf{D}_2^k\) are computed over adjacent points in the dense sequence, incorporating adaptive weights \(w^k = \frac{1}{1+h_0^k+h_1^k}\) for non-uniform time steps (weaker penalty for larger intervals). The differences are penalized toward zero to suppress abrupt changes.
Temporal loss: \(\mathcal{L}_{Temporal} = \frac{1}{T-2}\sum_{k=1}^{T-2}\|\mathbf{D}_2^{(k)}\|_1\), using the L1 norm for robustness against outliers.
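The LDL steps above can be sketched as follows. The exact form of \(\mathbf{D}_2^k\) is not given in this summary, so the sketch assumes the standard three-point second difference for non-uniform grids and applies the weight \(w^k\) directly to it; the function names are hypothetical and the latents are toy arrays rather than diffusion outputs.

```python
import numpy as np


def dense_times(acquired, k_i=2):
    """Insert k_i uniformly spaced virtual time points between acquired times."""
    dense = []
    for t0, t1 in zip(acquired[:-1], acquired[1:]):
        # keep t0 and the k_i interior points; t1 starts the next interval
        dense.extend(np.linspace(t0, t1, k_i + 2)[:-1])
    dense.append(acquired[-1])
    return np.asarray(dense, dtype=float)


def temporal_loss(latents, times):
    """Mean L1 norm of weighted second differences over interior points."""
    latents = np.asarray(latents, dtype=float)     # shape (T, ...)
    times = np.asarray(times, dtype=float)
    h0 = times[1:-1] - times[:-2]                  # step to the previous point
    h1 = times[2:] - times[1:-1]                   # step to the next point
    shape = (-1,) + (1,) * (latents.ndim - 1)      # broadcast over latent dims
    h0b, h1b = h0.reshape(shape), h1.reshape(shape)
    slope_fwd = (latents[2:] - latents[1:-1]) / h1b
    slope_bwd = (latents[1:-1] - latents[:-2]) / h0b
    d2 = 2.0 * (slope_fwd - slope_bwd) / (h0b + h1b)
    w = (1.0 / (1.0 + h0 + h1)).reshape(shape)     # weaker penalty for larger gaps
    per_point = np.abs(w * d2).reshape(len(h0), -1).sum(axis=1)
    return float(per_point.mean())


t_dense = dense_times([0.0, 3.0, 9.0], k_i=2)
print(t_dense)  # [0. 1. 2. 3. 5. 7. 9.]
```

A latent trajectory that is linear in time gives a loss of zero under this sketch, while curvature or jumps between neighboring points are penalized. The first and last points have no second difference and are therefore unconstrained.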

Loss & Training

Two-stage training strategy:
- Stage 1 (diffusion warm-up + spatial consistency): \(\mathcal{L}_1 = \mathcal{L}_{Diffusion} + \lambda_{Spatial}\mathcal{L}_{Spatial}\)
- Stage 2 (temporal smoothing): \(\mathcal{L}_2 = \mathcal{L}_{Diffusion} + \lambda_{Temporal}\mathcal{L}_{Temporal}\)
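As a small illustration, the two stage objectives can be written as one function that selects the active regularizer. The function name `total_loss` is hypothetical, the default weights are the abdominal settings reported in the setup section, and the point at which training switches from stage 1 to stage 2 is not specified here, so it is left to the caller.

```python
def total_loss(stage, diffusion, spatial=0.0, temporal=0.0,
               lambda_spatial=6.0, lambda_temporal=1.0):
    """Combine the diffusion loss with the regularizer active in the given stage."""
    if stage == 1:   # diffusion warm-up + spatial consistency
        return diffusion + lambda_spatial * spatial
    if stage == 2:   # temporal smoothing
        return diffusion + lambda_temporal * temporal
    raise ValueError("stage must be 1 or 2")


# Toy scalar values standing in for batch losses.
print(total_loss(1, diffusion=0.5, spatial=0.1))
print(total_loss(2, diffusion=0.5, temporal=0.2))
```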

Key Experimental Results

Datasets & Setup

Abdominal DCE-MRI (private): 91 patients, each with 1 non-contrast and 15 contrast-enhanced images (6 arterial, 6 venous, 3 delayed phase) acquired within 300 seconds.
Breast DCE-MRI (Duke public dataset): 922 cases, 3–4 post-injection time points.
All images resized to 256×256, normalized to \([-1,1]\), 3-channel input; trained on an A100 40GB GPU.
Evaluation metrics: PSNR, SSIM, LPIPS, rMSE (spatial); cSSIM (mean structural similarity between adjacent frames, temporal).
Baselines: CustomDiff, T2I Adapter, CCNet, EditAR, ControlNet baseline.
Training hyperparameters: 14 epochs, batch size 4; \(\lambda_{Spatial}=6.0\) (abdominal) or \(4.0\) (breast), \(\lambda_{Temporal}=1.0\), \(K_i=2\).
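The cSSIM metric listed above (mean structural similarity between adjacent frames) can be sketched as follows. For brevity this uses a simplified global SSIM computed over the whole image instead of the usual sliding-window version, so its values will not match a standard SSIM implementation exactly; the function names are hypothetical and `data_range=2.0` assumes images normalized to \([-1,1]\).

```python
import numpy as np


def global_ssim(a, b, data_range=2.0):
    """Simplified SSIM from whole-image statistics (no sliding window)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def cssim(frames, data_range=2.0):
    """Mean SSIM over consecutive frame pairs of a generated sequence."""
    scores = [global_ssim(f0, f1, data_range)
              for f0, f1 in zip(frames[:-1], frames[1:])]
    return float(np.mean(scores))


rng = np.random.default_rng(0)
base = rng.uniform(-1, 1, size=(16, 16))
smooth = [base + 0.01 * k for k in range(4)]              # slowly varying sequence
jumpy = [rng.uniform(-1, 1, size=(16, 16)) for _ in range(4)]
print(cssim(smooth), cssim(jumpy))
```

A slowly varying sequence scores close to 1, while unrelated consecutive frames score close to 0, which is the behavior the temporal metric is meant to capture.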

Main Results

Method	Abd. PSNR↑	Abd. SSIM↑	Abd. cSSIM↑	Breast PSNR↑	Breast SSIM↑	Breast cSSIM↑
ControlNet baseline	23.61	0.7178	0.8286	19.79	0.5196	0.3370
+ LAL	23.92	0.7227	0.8439	20.86	0.5442	0.3879
+ LDL	24.05	0.7369	0.8411	20.21	0.5391	0.3392
MRI CEKWorld	24.06	0.7419	0.8451	21.09	0.5599	0.3900
CCNet	24.35	0.5794	0.7098	21.47	0.4043	0.3155
EditAR	22.65	0.5571	0.7536	19.85	0.4170	0.3886

MRI CEKWorld achieves the best SSIM and cSSIM averaged over the two datasets (0.6509 and 0.6176, respectively). Although CCNet achieves higher PSNR, insufficient convergence leads to over-smoothing with loss of structural detail, resulting in poor SSIM/LPIPS/rMSE. CustomDiff and T2I Adapter produce images that deviate substantially from ground truth, with blurred organ contours and distorted dynamic enhancement gradients. Visual results show that MRI CEKWorld achieves high spatial fidelity and natural contrast kinetics on both datasets, closely matching ground truth.
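The averaged figures quoted above can be reproduced from the table by taking the unweighted mean of the abdominal and breast values. The snippet below is a simple arithmetic check with values copied from the table rows; the equal weighting of the two datasets is an assumption that happens to match the reported numbers.

```python
# (abdominal SSIM, abdominal cSSIM, breast SSIM, breast cSSIM) per method
table = {
    "ControlNet baseline": (0.7178, 0.8286, 0.5196, 0.3370),
    "+ LAL": (0.7227, 0.8439, 0.5442, 0.3879),
    "+ LDL": (0.7369, 0.8411, 0.5391, 0.3392),
    "MRI CEKWorld": (0.7419, 0.8451, 0.5599, 0.3900),
    "CCNet": (0.5794, 0.7098, 0.4043, 0.3155),
    "EditAR": (0.5571, 0.7536, 0.4170, 0.3886),
}

# Unweighted mean over the two datasets.
avg_ssim = {name: (v[0] + v[2]) / 2 for name, v in table.items()}
avg_cssim = {name: (v[1] + v[3]) / 2 for name, v in table.items()}

best_ssim = max(avg_ssim, key=avg_ssim.get)
best_cssim = max(avg_cssim, key=avg_cssim.get)
print(best_ssim, avg_ssim[best_ssim])
print(best_cssim, avg_cssim[best_cssim])
```

Both maxima fall on the MRI CEKWorld row, at roughly 0.6509 for SSIM and 0.61755 for cSSIM, the latter rounding to the reported 0.6176.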

Ablation Study

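LAL alone: on breast data, +0.0246 SSIM (0.5196 → 0.5442), +1.07 PSNR (19.79 → 20.86), and +0.0509 cSSIM (0.3370 → 0.3879) over the ControlNet baseline.
LDL alone: +0.0191 SSIM (0.7178 → 0.7369) and +0.0125 cSSIM (0.8286 → 0.8411) on abdominal data, and +0.0195 SSIM (0.5196 → 0.5391) on breast data (a spatial benefit induced by temporal smoothing).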
The two components are complementary: LAL establishes spatial consistency as a foundation, and LDL further enhances both temporal smoothness and spatial coherence.
Hyperparameters: \(\lambda_{Spatial}=6.0\) (abdominal) / \(4.0\) (breast) is optimal; performance degrades with either larger or smaller values. \(K_i=2\) is optimal; excessive interpolation points introduce out-of-distribution noise.
Contrast enhancement kinetic curves: Uniform sampling across arterial (1–15 s), venous (55–72 s), and delayed (90–300 s) phases for renal regions of interest shows that CEKWorld's mean intensity curves are smooth and stable, accurately matching the physiological process of rapid contrast filling → accumulation → washout. CCNet and EditAR both exhibit pronounced fluctuations.
Latent space visualization: PCA projection shows that CEKWorld's feature points are distributed continuously in temporal order, whereas the baseline is disordered.
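The kinetic-curve analysis above reduces each frame to the mean intensity inside a region of interest. A minimal sketch of that computation follows; the function names and the roughness score (mean absolute second difference of the curve) are hypothetical illustrations rather than the evaluation protocol used in the paper.

```python
import numpy as np


def roi_mean_curve(frames, mask):
    """Mean intensity inside a 2D boolean mask for each frame of a (T, H, W) stack."""
    frames = np.asarray(frames, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("ROI mask is empty")
    return frames[:, mask].mean(axis=1)            # shape (T,)


def curve_roughness(curve):
    """Mean absolute second difference; larger values mean a more jagged curve."""
    return float(np.abs(np.diff(curve, n=2)).mean())


# Toy sequence: uniform frames whose intensity rises and then washes out.
levels = (0.0, 1.0, 3.0, 2.5, 2.0)
frames = np.stack([np.full((4, 4), v) for v in levels])
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                              # 2x2 region of interest

curve = roi_mean_curve(frames, mask)
print(curve, curve_roughness(curve))
```

Plotting such curves against acquisition time for each method is what reveals the smooth filling, accumulation, and washout pattern, or the fluctuations, described above.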

Highlights & Insights

Pioneering contribution: The first application of world models to MRI contrast enhancement kinetics simulation, enabling continuous dynamic imaging without contrast agents — with clear clinical value (elimination of injection risks, cost reduction, improved temporal resolution).
Physics-prior-driven design: LAL exploits the anatomical invariance prior; LDL exploits the kinetic smoothness prior. Both designs are elegant and theoretically grounded.
Log-Cholesky parameterization: Maps positive-definite covariance matrices to unconstrained Euclidean vectors, so templates can be averaged and distances penalized with ordinary gradient-based optimization while positive definiteness is preserved by construction.
Non-uniform difference weighting: The smoothing penalty is weighted by the local time step (weaker for larger gaps), which accommodates the non-uniform acquisition pattern of DCE-MRI.
Two-stage training strategy: Progressive learning — spatial consistency first, then temporal smoothing — avoids multi-objective conflicts; experiments confirm the superiority of the two-stage approach over using either loss alone.
Thorough visualization analysis: Contrast enhancement kinetic time curves and latent space PCA distributions intuitively demonstrate the effectiveness of the proposed method.

Limitations & Future Work

Validation is limited to the MRI modality; extension to other contrast-enhanced imaging modalities such as CT is not explored (explicitly identified by the authors as future work).
The second-order central difference cannot constrain the first and last time points (\(t=0\) s, \(t=1\) s), resulting in outliers in the visualization; one-sided differences or special boundary treatment could be considered.
The private abdominal dataset is relatively small (91 cases), leaving generalizability unverified; multi-center validation is absent.
The two-stage training requires manual switching of loss functions; end-to-end joint optimization or adaptive weight scheduling is not explored.
The relatively high rMSE on the breast dataset (attributed by the authors to the intensity range 0–4000) is not analyzed in depth for potential improvements.
Inference speed and real-time requirements for clinical deployment are not discussed.

vs. static virtual contrast methods (Chen et al., Cheng et al.): Static methods synthesize single-time-point contrast-enhanced images from multi-modal non-contrast sequences (T1w, T2w, ADC), focusing on final tumor enhancement patterns, and cannot simulate temporal kinetic evolution.
vs. dynamic sequence methods (CCNet, EditAR): These methods remain image-to-image mappings between sparse time points, constrained by physical acquisition; this work enables continuous-time modeling to generate contrast images at arbitrary time points.
vs. general-purpose world models (Dreamer, AdaWorld): Continuous-action models require persistent external control signals, and observation-driven models require densely sampled video — neither is feasible for MRI scenarios. This work addresses sparse training through STCL.
vs. spatiotemporal consistency methods (slow feature analysis, contrastive learning): Slow feature analysis relies on minimizing temporal derivatives between consecutive frames, and contrastive learning requires sufficient positive/negative sample pairs — neither is suitable for the extremely sparse DCE-MRI setting. This work redefines consistency through covariance statistics and difference smoothing.

Rating

Novelty: ⭐⭐⭐⭐⭐ — First introduction of world models to MRI contrast kinetics; both the problem formulation and methodological design are entirely novel.
Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets, multiple baselines, and complete ablation studies; however, larger-scale or multi-center validation is lacking.
Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clear, mathematical derivations are rigorous, and figures are informative; notation is somewhat heavy in places.
Value: ⭐⭐⭐⭐⭐ — Continuous dynamic imaging without contrast agents carries significant clinical importance; the technical approach offers reference value for other sparsely sampled temporal medical imaging tasks.