SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

Metadata

  • Conference: ICCV 2025
  • arXiv: 2503.09641
  • Code: GitHub | HuggingFace Models
  • Institution: NVIDIA, MIT, Tsinghua University, HuggingFace
  • Area: Image Generation / Diffusion Model Distillation
  • Keywords: One-step generation, continuous-time consistency models (sCM), latent adversarial diffusion distillation (LADD), Flow Matching → TrigFlow conversion, real-time interactive generation

TL;DR

SANA-Sprint proposes a hybrid distillation framework combining continuous-time consistency models (sCM) and latent adversarial diffusion distillation (LADD). It losslessly converts a pretrained Flow Matching model to TrigFlow and jointly trains with the sCM and LADD objectives, yielding a single step-adaptive model that produces high-quality text-to-image results in 1–4 steps, with a single-step latency of only 0.1 seconds on an H100.

Background & Motivation

Diffusion models typically require 50–100 iterative denoising steps, incurring substantial computational cost. Existing step-distillation methods exhibit notable limitations:

GAN-based methods (e.g., ADD, LADD): Training instability with mode collapse and adversarial dynamic oscillation; hyperparameter tuning is difficult.

VSD-based methods: Require jointly training an additional diffusion model, placing heavy demands on GPU memory.

Consistency models (CM): Quality degrades significantly at very low step counts (<4 steps), particularly in text-to-image tasks where truncation errors cause semantic alignment deterioration.

These challenges motivate the design of a unified distillation framework that balances efficiency, flexibility, and quality. SANA-Sprint builds upon the pretrained SANA model and incorporates recent advances in continuous-time consistency models (sCM) to eliminate the discretization errors inherent in discrete-time consistency models.

Method

Overall Architecture

SANA-Sprint adopts a three-stage design:

  1. Training-free conversion: Mathematically equivalent conversion of a Flow Matching model to a TrigFlow model.
  2. Stability optimization: Addressing training instability via QK normalization and dense time embeddings.
  3. Hybrid distillation: The sCM loss provides teacher alignment and diversity preservation, while the LADD adversarial loss enhances single-step fidelity.

Key Design 1: Lossless Flow Matching → TrigFlow Conversion

Flow Matching and TrigFlow are misaligned in three aspects: time domain ([0,1] vs. [0,π/2]), noise schedule (inconsistent data scaling), and prediction target (static vs. time-varying coefficients).

The authors derive a lossless mathematical conversion:

\[t_{\text{FM}} = \frac{\sin(t_{\text{Trig}})}{\sin(t_{\text{Trig}}) + \cos(t_{\text{Trig}})}\]
\[x_{t,\text{FM}} = \frac{x_{t,\text{Trig}}}{\sigma_d} \cdot \sqrt{t_{\text{FM}}^2 + (1-t_{\text{FM}})^2}\]

The TrigFlow model output is obtained via linear combination:

\[\hat{F_\theta} = \frac{1}{\sqrt{t_{\text{FM}}^2 + (1-t_{\text{FM}})^2}} \left[ (1-2t_{\text{FM}}) x_{t,\text{FM}} + (1-2t_{\text{FM}}+2t_{\text{FM}}^2) v_\theta \right]\]

Experiments confirm that FID remains nearly unchanged before and after conversion (5.81 vs. 5.73), demonstrating losslessness in both theory and practice.
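As a sanity check, the time and scale mappings above can be verified numerically. The sketch below uses my own variable names, not the paper's code; it only confirms that the formulas are self-consistent at the endpoints and midpoint:

```python
import math

def t_fm(t_trig):
    """Map a TrigFlow time in [0, pi/2] to a Flow Matching time in [0, 1]."""
    s, c = math.sin(t_trig), math.cos(t_trig)
    return s / (s + c)

def scale(t):
    """Scale factor sqrt(t^2 + (1-t)^2) relating the two noisy latents."""
    return math.sqrt(t**2 + (1 - t)**2)

# Endpoints map correctly: t=0 -> 0 (clean data), t=pi/2 -> 1 (pure noise).
assert t_fm(0.0) == 0.0
assert abs(t_fm(math.pi / 2) - 1.0) < 1e-12

# The midpoint t=pi/4 maps to t_FM = 0.5, where sqrt(t^2 + (1-t)^2)
# equals 1/(sin t + cos t) = 1/sqrt(2), matching the TrigFlow latent norm.
t = t_fm(math.pi / 4)
assert abs(t - 0.5) < 1e-12
assert abs(scale(t) - 1 / (math.sin(math.pi / 4) + math.cos(math.pi / 4))) < 1e-12
print("conversion checks passed")
```

Because both mappings are smooth and differentiable, the converted model remains compatible with automatic differentiation during distillation.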

Key Design 2: Stable Continuous-Time Distillation

  • Dense time embeddings: The noise conditioning \(c_{\text{noise}}(t)\) is changed from \(1000t\) to \(t\), preventing the time derivative \(\partial_t F_{\theta^{-}}\) from being amplified by a factor of 1000, which would otherwise cause gradient explosion. PCA visualization shows that embeddings over the compressed 0–1 range are more densely clustered, so adjacent timesteps map to similar embeddings.
  • QK normalization: RMS Normalization is applied to the Query and Key in both self-attention and cross-attention, resolving training collapse caused by gradient norms exceeding \(10^3\) when scaling the model from 0.6B to 1.6B parameters. Only 5,000 iterations of teacher model fine-tuning are required.
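The QK-Norm fix amounts to RMS-normalizing queries and keys before the attention logits are formed, which bounds the logit magnitude regardless of how large the activations grow. A minimal NumPy sketch (shapes and scales here are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS-normalize the last (head) dimension, as applied to Q and K."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms

def attention_logits(q, k, qk_norm=True):
    """Scaled dot-product logits; QK-Norm keeps their magnitude O(1)."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    d = q.shape[-1]
    return q @ k.swapaxes(-1, -2) / np.sqrt(d)

rng = np.random.default_rng(0)
# Simulate activations that have blown up during distillation training.
q = 100.0 * rng.standard_normal((4, 8, 64))   # (tokens, heads, dim), illustrative
k = 100.0 * rng.standard_normal((4, 8, 64))

raw = attention_logits(q, k, qk_norm=False)
normed = attention_logits(q, k, qk_norm=True)
# With QK-Norm the logit scale is independent of the activation scale.
assert np.abs(normed).max() < np.abs(raw).max()
```

The design point is that the normalization is applied per head before the dot product, so exploding Q/K norms can no longer push the softmax into saturation or the gradients past \(10^3\).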

Key Design 3: sCM + LADD Hybrid Loss

sCM distills teacher knowledge locally (learning consistency between adjacent timesteps) and therefore converges slowly. The LADD adversarial loss is introduced to provide global supervision across timesteps:

\[\mathcal{L} = \mathcal{L}_{\text{sCM}} + \lambda \mathcal{L}_{\text{adv}}, \quad \lambda = 0.5\]
  • LADD uses the frozen teacher model as a feature extractor and trains multiple discriminator heads in the latent space.
  • Discriminators employ hinge loss to distinguish real noisy samples from generated noisy samples.
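The adversarial side of the hybrid objective can be sketched with the standard hinge formulation. This is a minimal illustration of the loss terms only (the discriminator logits stand in for the multi-head outputs computed on teacher features):

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss: push real logits above +1, fake below -1."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + \
           np.mean(np.maximum(0.0, 1.0 + d_fake))

def g_adv_loss(d_fake):
    """Generator (student) adversarial term: raise discriminator logits on fakes."""
    return -np.mean(d_fake)

def hybrid_loss(l_scm, d_fake, lam=0.5):
    """sCM + LADD hybrid objective with lambda = 0.5 as in the paper."""
    return l_scm + lam * g_adv_loss(d_fake)

# A discriminator that already separates real/fake by a margin of 1
# incurs zero hinge loss, so its gradients vanish on easy examples.
assert d_hinge_loss(np.array([2.0, 3.0]), np.array([-2.0, -1.5])) == 0.0
```

The hinge loss saturates once the margin is satisfied, which is part of why it is a common stabilizing choice for latent-space discriminators.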

Additional Maximum Timestep Weighting

With probability \(p\), the training timestep is set to \(t = \pi/2\) (pure noise), reinforcing single-step generation capability. Experiments find that 50% probability is optimal.
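The timestep sampler this implies is simple: with probability \(p\) the training timestep is pinned at \(t = \pi/2\), otherwise it is drawn from the usual training distribution. A sketch (the uniform fallback is an illustrative stand-in for the paper's actual \(t\)-distribution):

```python
import math
import random

def sample_timestep(p_max=0.5, rng=random):
    """With probability p_max, train at t = pi/2 (pure noise); else sample t freely."""
    if rng.random() < p_max:
        return math.pi / 2
    return rng.uniform(0.0, math.pi / 2)   # stand-in for the real t-distribution

random.seed(0)
ts = [sample_timestep() for _ in range(10_000)]
frac_max = sum(t == math.pi / 2 for t in ts) / len(ts)
assert abs(frac_max - 0.5) < 0.05   # roughly half the steps hit pure noise
```

Pinning half the batch at pure noise directly exercises the one-step generation path, which is exactly the regime the distilled model must master.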

Loss & Training

sCM loss (continuous-time consistency):

\[\mathcal{L}_{\text{sCM}} = \mathbb{E}_{x_t, t} \left[ \frac{e^{w_\phi(t)}}{D} \left\| \hat{F}_\theta - \hat{F}_{\theta^{-}} - \cos(t)\, \frac{d\hat{F}_{\theta^{-}}}{dt} \right\|_2^2 - w_\phi(t) \right]\]

where \(w_\phi(t)\) is an adaptive weighting function that minimizes variance across different timesteps.
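The role of the adaptive weight is easy to see in a scalar sketch: \(e^{w}\) rescales the squared error while the \(-w\) term keeps \(w\) from growing without bound, so the optimum tracks the (log) error magnitude per timestep. Nothing below is the paper's training code; the target array is a synthetic placeholder:

```python
import numpy as np

def scm_loss(f_student, f_target, w):
    """Per-sample sCM loss with adaptive log-weight w = w_phi(t)."""
    D = f_student.size                     # dimensionality of the prediction
    err = np.sum((f_student - f_target) ** 2)
    return (np.exp(w) / D) * err - w

rng = np.random.default_rng(0)
f_s = rng.standard_normal(16)
f_t = f_s + 0.1 * rng.standard_normal(16)  # small synthetic consistency error

# Setting d/dw [e^w * m - w] = 0 with m = err/D gives e^w = 1/m,
# so the minimizing weight is w* = -log(m): large errors get small weights.
m = np.sum((f_s - f_t) ** 2) / f_s.size
w_star = -np.log(m)
assert scm_loss(f_s, f_t, w_star) <= scm_loss(f_s, f_t, w_star + 1.0)
assert scm_loss(f_s, f_t, w_star) <= scm_loss(f_s, f_t, w_star - 1.0)
```

This is what "minimizes variance across different timesteps" means in practice: timesteps with intrinsically large errors are automatically down-weighted so no single region of \(t\) dominates the gradient.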

Key Experimental Results

Main Results: Efficiency and Performance Comparison with SOTA Methods

| Method | Steps | Throughput (samples/s) | Latency (s) | Params (B) | FID ↓ | CLIP ↑ | GenEval ↑ |
|---|---|---|---|---|---|---|---|
| FLUX-schnell | 4 | 0.5 | 2.10 | 12.0 | 7.94 | 28.14 | 0.71 |
| SDXL-DMD2† | 4 | 2.27 | 0.54 | 0.9 | 6.82 | 28.84 | 0.60 |
| SD3.5-Turbo | 4 | 0.94 | 1.15 | 8.0 | 11.97 | 27.35 | 0.72 |
| SANA-Sprint 0.6B | 4 | 5.34 | 0.32 | 0.6 | 6.48 | 28.45 | 0.76 |
| SANA-Sprint 1.6B | 4 | 5.20 | 0.31 | 1.6 | 6.54 | 28.45 | 0.77 |
| FLUX-schnell | 1 | 1.58 | 0.68 | 12.0 | 7.26 | 28.49 | 0.69 |
| SDXL-DMD2† | 1 | 3.36 | 0.32 | 0.9 | 7.10 | 28.93 | 0.59 |
| SANA-Sprint 0.6B | 1 | 7.22 | 0.21 | 0.6 | 7.04 | 28.04 | 0.72 |
| SANA-Sprint 1.6B | 1 | 6.71 | 0.21 | 1.6 | 7.69 | 28.27 | 0.76 |

Key findings:

  • 4-step SANA-Sprint 0.6B achieves 10.7× the throughput of FLUX-schnell with better FID (6.48 vs. 7.94).
  • Single-step SANA-Sprint 0.6B achieves a latency of only 0.21 s and a GenEval score of 0.72, surpassing FLUX-schnell at the same step count.

Ablation Study: Loss Combinations and Training Strategies

| Configuration | FID ↓ | CLIP ↑ |
|---|---|---|
| sCM only | 8.93 | 27.51 |
| LADD only | 12.20 | 27.00 |
| sCM + LADD | 8.11 | 28.02 |
| w/o CFG Embed | 9.23 | 27.15 |
| w/ CFG Embed | 8.72 | 28.09 |
| 0% maxT | 9.44 | 27.65 |
| 50% maxT | 8.32 | 27.94 |
| 70% maxT | 8.11 | 28.02 |
| sCM:LADD = 1.0:1.0 | 8.81 | 27.93 |
| sCM:LADD = 1.0:0.5 | 8.43 | 27.85 |
| sCM:LADD = 1.0:0.1 | 8.90 | 27.76 |

Key findings:

  • The complementary effect of sCM + LADD is significant: FID drops from 12.20 (LADD only) to 8.11.
  • CFG embedding improves the CLIP score by 0.94 (27.15 → 28.09).
  • Raising the maximum-timestep weighting probability from 0% to 50% reduces FID from 9.44 to 8.32.

Real-Time Interactive Generation

  • With ControlNet integration, 1024×1024 image generation achieves a latency of 250ms on H100.
  • Single-step generation takes 0.31s on a consumer-grade RTX 4090 GPU.

Highlights & Insights

  1. Elegant training-free conversion: The mathematically equivalent Flow Matching → TrigFlow transformation avoids the enormous cost of pretraining the teacher model anew, and because the mappings are differentiable it remains compatible with automatic differentiation.
  2. Unified step-adaptive model: A single model supports 1–4 steps without requiring step-specific training, unlike many competitors (e.g., SDXL-DMD2, PCM) that train separate models per step count.
  3. Complementarity of hybrid distillation: sCM ensures alignment with the teacher distribution and diversity preservation, while LADD enhances single-step fidelity; the two objectives are orthogonally complementary.
  4. Engineering insights: Dense time embeddings (\(c_{\text{noise}}=t\) rather than \(1000t\)) and QK-Norm are identified as critical stabilization techniques for training large-scale continuous-time consistency models.

Limitations & Future Work

  1. Validation is primarily conducted on the SANA architecture; although generalizability to Flow Matching models such as FLUX and SD3 is claimed, no experimental evidence is provided.
  2. Generation quality has not fully reached the level of the 20-step teacher model, particularly in complex semantic scenes.
  3. The LADD discriminator increases training complexity; although lighter than VSD, there remains room for further optimization.
  4. ControlNet is validated only under HED edge conditioning; the effectiveness of other condition types (depth, segmentation, etc.) remains unknown.
Related Work & Connections

  • sCM [Lu et al., 2024]: Theoretical foundation for continuous-time consistency models; SANA-Sprint directly adopts its training framework.
  • LADD [Sauer et al., 2024]: Key reference for latent adversarial diffusion distillation, using teacher feature space to train discriminators.
  • SANA [Xie et al., 2024]: Base model architecture enabling efficient image generation via linear attention transformers.
  • Inspiration: For other generative tasks requiring few-step inference (video, 3D), the paradigm of Flow → TrigFlow conversion + hybrid distillation holds broad general value.

Rating ⭐⭐⭐⭐⭐

This work delivers exceptional engineering value. The lossless Flow → TrigFlow conversion eliminates sCM's dependency on dedicated pretraining, significantly lowering the barrier to continuous-time consistency distillation. The innovative combination of hybrid distillation achieves a notable breakthrough on the speed–quality Pareto frontier. The results of 1-step FID 7.04 and GenEval 0.72 with only 0.6B parameters are impressive and have direct practical implications for deployment on consumer-grade GPUs.