ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis¶

Metadata¶

Conference: ICCV 2025
arXiv: 2505.04963
Authors: Onkar Susladkar, Gayatri Deshmukh, Yalcin Tur, Gorkem Durak, Ulas Bagci (Northwestern University, Stanford University, UIUC)
Code: GitHub / Weights
Area: Medical Imaging / Medical Image Synthesis
Keywords: Medical image synthesis, Rectified Flow, Tweedie's Formula, pathology-aware, liver cirrhosis, data augmentation, LoRA

TL;DR¶

This paper proposes ViCTr, a two-stage framework that combines Rectified Flow with a Tweedie-corrected diffusion process to achieve high-fidelity pathology-aware medical image synthesis. The method reduces inference steps from 50 to 3–4 and, for the first time, enables graded-severity pathology synthesis for abdominal MRI.

Background & Motivation¶

Problem Definition¶

Medical image synthesis aims to generate anatomically realistic and pathologically diverse synthetic medical images for data augmentation, alleviating the scarcity of medical imaging data.

Existing Challenges¶

Data scarcity: Privacy regulations, inter-institutional data fragmentation, and interoperability constraints result in severe shortages of medical imaging data.

Insufficient anatomical fidelity: Existing methods struggle to maintain anatomical accuracy while simultaneously modeling pathological features.

Difficulty with diffuse pathologies: Diffuse lesions such as liver cirrhosis involve subtle tissue changes across multiple organ systems, far more complex than focal lesions such as tumors.

Low sampling efficiency: Conventional diffusion models typically require tens of sampling steps (25–50 for the vanilla backbones evaluated below), incurring substantial computational overhead.

Lack of pathology control: Existing methods cannot finely control the severity of synthesized pathologies.

Core Motivation¶

Rectified Flow provides near-linear sampling trajectories, reducing the required number of steps.
Tweedie's Formula corrects sampling bias and improves initialization accuracy.
The combination of both, together with two-stage training, yields efficient, high-fidelity, and pathology-controllable synthesis.

Method¶

Overall Architecture¶

ViCTr comprises two training stages:

1. Stage 1 (Pre-training): Establishes anatomical priors on the ATLAS-8k dataset.
2. Stage 2 (Fine-tuning): Adapts to downstream tasks (CT/MRI generation and pathology synthesis) via LoRA.

Key Designs¶

1. Rectified Flow Trajectory¶

Interpolation is defined as: $x_t = (1-t)x_0 + tx_1$, where $x_0 \sim p_0$ (noise) and $x_1 \sim p_{target}$ (target data).

Velocity model training: $$\hat{\theta} = \arg\min_\theta \mathbb{E}_{t \sim \text{Uniform}(0,1)} \left[ \|(x_1 - x_0) - v_\theta(x_t, t)\|^2 \right]$$

One-step distillation: $\hat{\mathcal{T}}(x_0) = x_0 + v(x_0, 0)$
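The interpolation, velocity target, and one-step map above can be sketched in a few lines of plain Python. This is a toy illustration only: the `oracle` velocity field is a hypothetical stand-in for a trained $v_\theta$ and knows the (noise, data) pair in advance, which a real model would not.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Regression target for v_theta: the constant displacement x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(v_theta, x0, x1, t):
    """Squared error between the target velocity and the prediction at (x_t, t)."""
    x_t = interpolate(x0, x1, t)
    pred = v_theta(x_t, t)
    return sum((g - p) ** 2 for g, p in zip(velocity_target(x0, x1), pred))

def euler_sample(v_theta, x0, num_steps):
    """Integrate dx/dt = v_theta(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        v = v_theta(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy setup (assumption): one noise sample, one stand-in "data" sample.
random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(4)]
x1 = [0.5, -1.0, 2.0, 0.0]
oracle = lambda x, t: velocity_target(x0, x1)  # hypothetical perfect velocity field

one_step = euler_sample(oracle, x0, num_steps=1)   # equals x0 + v(x0, 0)
four_step = euler_sample(oracle, x0, num_steps=4)
```

Because the oracle's trajectory is exactly straight, one Euler step and four Euler steps land on the same point. This is the intuition behind few-step sampling: the straighter the learned trajectories, the less is lost by taking large steps.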

2. Tweedie's Formula Correction¶

Tweedie correction is incorporated into the Rectified Flow ODE: $$dx_t = v_{\hat{\theta}}(x_t, t)dt + (1 - \bar{\alpha}_t)\nabla_{x_t} \log p(x_t) dt$$

The additional score term $(1 - \bar{\alpha}_t)\nabla_{x_t} \log p(x_t)$ corrects sampling bias, directing $x_t$ more accurately toward the target distribution.

One-step sampling extension: $$\hat{\mathcal{T}}(x_0) = x_0 + v(x_0, 0) + (1 - \bar{\alpha}_0)\nabla_{x_0} \log p(x_0)$$
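A minimal sketch of one Euler step of the corrected ODE follows, under strong assumptions: the score is taken from a known 1-D Gaussian (in practice it would come from a learned model), and `alpha_bar_linear` is a hypothetical schedule chosen so that the correction weight $(1 - \bar{\alpha}_t)$ is 1 at pure noise ($t=0$) and 0 at clean data ($t=1$). The summary does not specify the actual schedule.

```python
def gaussian_score(x, mean, var):
    """Score of N(mean, var): d/dx log p(x) = -(x - mean) / var."""
    return [-(xi - mean) / var for xi in x]

def alpha_bar_linear(t):
    """Hypothetical schedule (assumption): 0 at t=0 (noise), 1 at t=1 (data)."""
    return t

def corrected_euler_step(x, t, dt, velocity, score, alpha_bar):
    """One Euler step of dx = v(x, t) dt + (1 - alpha_bar(t)) * score(x, t) dt."""
    w = 1.0 - alpha_bar(t)
    v = velocity(x, t)
    s = score(x, t)
    return [xi + dt * (vi + w * si) for xi, vi, si in zip(x, v, s)]

# Toy check: with zero velocity, only the score correction moves the sample.
zero_velocity = lambda x, t: [0.0] * len(x)
unit_score = lambda x, t: gaussian_score(x, mean=0.0, var=1.0)

at_noise = corrected_euler_step([2.0], 0.0, 0.5, zero_velocity, unit_score, alpha_bar_linear)
at_data = corrected_euler_step([2.0], 1.0, 0.5, zero_velocity, unit_score, alpha_bar_linear)
```

In this toy case the correction pulls the sample toward the high-density region early in sampling (`at_noise` moves from 2.0 to 1.0) and leaves it unchanged once the weight has decayed to zero (`at_data` stays at 2.0).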

3. Stage 1 — ATLAS-8k Pre-training¶

Input: CT image $X_I$, segmentation mask $X_S$, text prompt $X_p$
Encoding: A frozen VAE encoder extracts latent representations $Z_o$, $Z_s$; a pre-trained text encoder extracts $Z_p$
Forward diffusion: $P(Z_t|Z_o) = (1-t) \cdot Z_o + t \cdot \epsilon_{true}$
Reverse diffusion: $P(Z_{t-1}|Z_t, Z_s, Z_p, t) = Z_t + \delta T \times \phi_\theta(Z_t, Z_s, Z_p, t)$
EWC (Elastic Weight Consolidation): Critical layers are selectively unfrozen, and updates to weights that matter for previously learned knowledge are penalized to maintain model stability.

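Loss function: $$L_{diff} = \|\phi_\theta(Z_t, Z_s, Z_p, t) - (\epsilon_{true} - Z_o)\|^2$$

The target $\epsilon_{true} - Z_o$ is the time derivative of the forward interpolation, so minimizing this squared error is the velocity-matching objective of Rectified Flow.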

Composite loss: $L_{diff} + L_2 + L_{SSIM}$
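The Stage 1 forward interpolation, velocity-matching term, and reverse update can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: latents are plain lists, `phi` is a hypothetical callable, the $L_2$ and $L_{SSIM}$ terms are omitted, and the step $\delta T$ is taken as negative (moving from $t=1$ toward $t=0$), which is one way to read the "+" in the reverse update.

```python
import random

def forward_interpolate(z_o, eps, t):
    """Z_t = (1 - t) * Z_o + t * eps: clean latent at t=0, pure noise at t=1."""
    return [(1.0 - t) * z + t * e for z, e in zip(z_o, eps)]

def l_diff(phi, z_o, eps, t, z_s=None, z_p=None):
    """Squared error between phi's prediction and the velocity eps - Z_o."""
    z_t = forward_interpolate(z_o, eps, t)
    pred = phi(z_t, z_s, z_p, t)
    target = [e - z for z, e in zip(z_o, eps)]
    return sum((p - g) ** 2 for p, g in zip(pred, target))

def reverse_sample(phi, eps, num_steps, z_s=None, z_p=None):
    """Start from noise at t=1 and apply Z <- Z + delta_t * phi with delta_t < 0."""
    z, delta_t = list(eps), -1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 + k * delta_t
        v = phi(z, z_s, z_p, t)
        z = [zi + delta_t * vi for zi, vi in zip(z, v)]
    return z

# Toy data (assumption): a 3-dim "latent" and an oracle that knows the pair.
random.seed(1)
z_o = [0.2, -0.7, 1.5]
eps = [random.gauss(0.0, 1.0) for _ in z_o]
oracle_phi = lambda z_t, z_s, z_p, t: [e - z for z, e in zip(z_o, eps)]
zero_phi = lambda z_t, z_s, z_p, t: [0.0] * len(z_t)

recovered = reverse_sample(oracle_phi, eps, num_steps=3)
```

With the oracle predictor the loss is zero and three reverse steps recover `z_o` up to floating-point error. A predictor that always outputs zero gives a strictly positive loss, which is what training would push down.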

4. Stage 2 — LoRA Fine-tuning¶

Dual network: $\phi_{base}$ (frozen, retains anatomical knowledge) + $\phi_{adapt}$ (LoRA trainable)
Consistency loss: $L_{consistency}$ aligns intermediate outputs between the two networks
Temporal consistency: $L_{temporal}$ ensures smooth transitions across reverse diffusion timesteps
Spatial consistency: $L_{spatial}$ enforces alignment in output reconstruction
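The dual-network idea can be illustrated with a single linear layer. The sketch below uses the standard LoRA parameterization (frozen weight $W$ plus a scaled low-rank update $BA$). The mean-squared form of `consistency_loss` is an assumption, since the summary does not give the exact definitions of $L_{consistency}$, $L_{temporal}$, or $L_{spatial}$.

```python
def matvec(W, x):
    """Dense matrix-vector product for row-major nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

def consistency_loss(base_out, adapt_out):
    """Assumed form: mean squared difference between the two networks' outputs."""
    return sum((a - b) ** 2 for a, b in zip(base_out, adapt_out)) / len(base_out)

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one adapted layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Toy layer (assumption): 3 inputs, 2 outputs, rank 2, B initialized to zero.
W = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
A = [[0.1, 0.2, 0.3], [-0.3, 0.0, 0.4]]
B = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0, 3.0]

base_out = matvec(W, x)
adapt_out = lora_forward(W, A, B, x, alpha=4.0, r=2)
init_gap = consistency_loss(base_out, adapt_out)
```

With $B$ initialized to zero, the adapted network starts out identical to the frozen base, so the consistency term is zero at the start of fine-tuning and only grows as the adapter departs from the base. `lora_param_count` also shows why rank matters for capacity: for a hypothetical 1024×1024 layer, r=8 adds 16,384 trainable parameters and r=64 adds 131,072.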

5. Pathology Generation¶

Dataset: CirrMRI600+ (T1/T2 MRI)
Liver segmentation masks combined with text prompts specifying severity: "low", "mild", "severe"
Progressive pathology control

Support for Multiple Diffusion Backbones¶

Diffusion Method	Denoiser	Text Encoder
Stable Diffusion	UNet	CLIP-B/16
Pixart-alpha	DiT	T5-XXL
SDXL	Dual UNet	CLIP-L/14
Flux	MultiModal Transformer	T5-XXL + CLIP
SD-3	MultiModal Transformer	T5-XXL + CLIP

Key Experimental Results¶

Main Results — Synthesis Quality (FID/MFID)¶

Backbone	BTCV (CT) FID/MFID, Vanilla → ViCTr	AMOS (MRI) FID/MFID, Vanilla → ViCTr	CirrMRI600+ FID/MFID, Vanilla → ViCTr	Steps, Vanilla → ViCTr
Stable Diffusion	25.44/19.67 → 21.98/19.02	25.43/21.76 → 20.37/19.11	28.34/23.43 → 25.57/21.46	40→4
SDXL	23.47/18.21 → 20.33/17.44	24.11/20.23 → 19.44/18.45	27.34/22.11 → 24.02/20.76	30→4
SD-3	19.07/16.22 → 17.37/16.02	22.32/19.76 → 18.02/19.08	24.49/21.78 → 21.28/19.34	50→3
Pixart-alpha	21.32/17.09 → 19.22/16.96	23.78/20.04 → 18.76/18.56	26.06/20.07 → 23.04/18.92	25→3
Flux	15.52/15.01 → 13.28/14.08	19.02/18.28 → 15.55/16.58	22.46/18.88 → 19.96/17.01	30→3

ViCTr + Flux achieves MFID 17.01 on CirrMRI600+, 28% lower than the previous best.

Segmentation Performance Improvement (mDSC %, higher is better)¶

Segmentation Model	Real Data Only	+Augmentation	+30% Vanilla Synthetic	+30% ViCTr Synthetic
BTCV
UNet	76.72	78.45	79.32	81.22
TransUNet	85.52	87.01	87.54	89.78
nnUNet	80.48	82.54	83.37	85.19
MedSegDiff	87.91	88.65	89.78	91.92
CirrMRI600+
UNet	68.74	69.38	70.12	73.39
nnUNet	71.02	72.49	73.56	78.89
MedSegDiff	76.92	77.11	78.03	81.37

On CirrMRI600+, nnUNet trained with ViCTr synthetic data gains 7.87 mDSC points over real data only (71.02 → 78.89) and 5.33 points over vanilla synthetic data (73.56 → 78.89).
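The per-model gains on CirrMRI600+ can be recomputed directly from the table. The snippet below copies the mDSC values from the rows above and reports differences in percentage points; the dictionary keys are illustrative names, not identifiers from the paper.

```python
# mDSC (%) on CirrMRI600+, copied from the table above.
cirr_mdsc = {
    "UNet":       {"real": 68.74, "aug": 69.38, "vanilla": 70.12, "victr": 73.39},
    "nnUNet":     {"real": 71.02, "aug": 72.49, "vanilla": 73.56, "victr": 78.89},
    "MedSegDiff": {"real": 76.92, "aug": 77.11, "vanilla": 78.03, "victr": 81.37},
}

def gain(row, baseline):
    """Percentage-point gain of ViCTr-augmented training over a baseline column."""
    return round(row["victr"] - row[baseline], 2)

# (gain over real data only, gain over vanilla synthetic data) per model
gains = {model: (gain(row, "real"), gain(row, "vanilla"))
         for model, row in cirr_mdsc.items()}

for model, (vs_real, vs_vanilla) in gains.items():
    print(f"{model}: +{vs_real} vs real only, +{vs_vanilla} vs vanilla synthetic")
```

nnUNet shows the largest improvement of the three models on both comparisons.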

Ablation Study¶

Ablation	FID
w/o Stage-1 pre-training	17.33
w/o Tweedie correction	18.78
Using Reflow	20.19
Using Flow Straight and Fast	21.37
Using Distribution Matching Distillation	22.33
$L_{diff}$ only	18.77
$L_{diff} + L_{spatial}$	18.21
$L_{diff} + L_{consistency}$	17.02
$L_{diff} + L_{spatial} + L_{consistency}$	15.55
LoRA r=8	18.46
LoRA r=16	17.52
LoRA r=32	16.66
LoRA r=64 (final)	15.55

Key Findings¶

Tweedie correction is critical: Removing it raises FID from 15.55 to 18.78, indicating a significant degradation in distribution alignment.
Pre-training is indispensable: Omitting Stage 1 raises FID from 15.55 to 17.33.
Consistency loss contributes most: Adding $L_{consistency}$ to $L_{diff}$ lowers FID from 18.77 to 17.02, compared with 18.21 when adding $L_{spatial}$ alone; combining both reaches 15.55.
Higher LoRA rank is better: r=64 performs best, as a higher-dimensional adaptation space provides finer-grained parameter updates.
Inference efficiency: Sampling steps are reduced from 50 to 3–4, and inference time drops from 18.98s to 2.78s (SD-3).
Radiologist validation: Three radiologists were unable to distinguish synthesized from real cirrhotic MRI images in a Visual Turing Test.
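The efficiency figures for SD-3 can be turned into ratios with a short calculation, using only the step counts and timings reported above.

```python
# SD-3 figures as reported above.
steps_vanilla, steps_victr = 50, 3
time_vanilla_s, time_victr_s = 18.98, 2.78

step_ratio = steps_vanilla / steps_victr
time_ratio = time_vanilla_s / time_victr_s

print(f"{step_ratio:.1f}x fewer steps, {time_ratio:.1f}x lower inference time")
```

The wall-clock ratio (about 6.8×) is smaller than the step ratio (about 16.7×). A plausible reading, not stated in the summary, is that per-image costs that do not scale with step count, such as text encoding and VAE decoding, account for part of the remaining time.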

Highlights & Insights¶

First abdominal MRI pathology synthesis: According to the authors, this is the first method to achieve graded-severity control in abdominal MRI pathology synthesis, filling an important gap in the field.
Theoretical innovation: Embedding Tweedie's Formula into the Rectified Flow framework is an original theoretical contribution, not a trivial combination of existing techniques.
Extreme efficiency: Cutting sampling from 50 to 3 steps (about 17× fewer steps, and roughly 6.8× lower inference time for SD-3: 18.98s → 2.78s) makes clinical deployment more practical.
Broad compatibility: Support for 5 mainstream diffusion backbones demonstrates the generality of the approach.
Dual validation: The method is evaluated through both quantitative metrics (FID/MFID/mDSC) and qualitative radiologist assessment.
Elegant two-stage paradigm: Stage 1 establishes anatomical priors while Stage 2 applies LoRA-based pathology adaptation, balancing generality and task specificity.

Limitations & Future Work¶

2D slice-level synthesis: The current method operates on 2D slices only; 3D volumetric consistency is not guaranteed.
Limited pathology types: Only liver cirrhosis is validated; other diffuse conditions (e.g., hepatic steatosis, fibrosis) remain untested.
Resolution constraint: A resolution of 256×256 may be insufficient to capture certain subtle pathological features.
Training overhead: Pre-training requires approximately 52 hours on 8×8 A100 GPUs, imposing substantial resource demands.
VAE bottleneck: Reliance on a frozen VAE encoder/decoder means that information loss therein propagates to the final synthesis quality.

Related Work¶

Medical diffusion: MedSegDiff, EMIT-Diff, DiNO-Diffusion
Rectified Flow: ReFlow, Flow Straight and Fast, Distribution Matching Distillation
Consistency models: Consistency Models
Medical data augmentation: DiffuseMix, DreamDA, ControlPolypNet

Insights¶

The combination of Rectified Flow and Tweedie correction is generalizable to other settings requiring high-fidelity few-step sampling.
The two-stage paradigm of "general pre-training + LoRA task-specific adaptation" is particularly effective in data-scarce domains.
Evaluation of medical image synthesis should jointly consider generation quality (FID) and downstream task utility (segmentation mDSC).

Rating¶

Dimension	Score
Novelty	⭐⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐
Value	⭐⭐⭐⭐⭐