SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

Conference: ICCV 2025
arXiv: 2503.09641
Code: https://github.com/NVlabs/Sana
Area: Image Generation / Diffusion Model Acceleration
Keywords: one-step diffusion, consistency distillation, flow matching, adversarial distillation, real-time generation

TL;DR

This work converts a pretrained SANA flow matching model into TrigFlow form via a lossless mathematical transformation, then combines continuous-time consistency distillation (sCM) with latent adversarial diffusion distillation (LADD) in a hybrid training strategy, yielding a single model that generates high-quality images in anywhere from 1 to 4 steps. One-step generation of a 1024×1024 image takes only 0.1s on an H100, surpassing FLUX-schnell with an FID of 7.59 and a GenEval of 0.74 at 10× the speed.

Background & Motivation

Diffusion models typically require 20–100 sampling steps, severely limiting real-time applications. Existing acceleration methods fall into two categories: trajectory distillation (e.g., consistency models, CM) and distribution distillation (e.g., GAN/VSD), each with distinct drawbacks: GAN training is unstable and prone to mode collapse; VSD requires an additional diffusion model, increasing memory overhead; discrete-time CMs suffer quality degradation at very few steps (<4). More critically, continuous-time consistency models (sCM) require models in TrigFlow format, whereas mainstream models (SANA/FLUX/SD3) use flow matching, making training a TrigFlow model from scratch prohibitively expensive.

Core Problem

How to efficiently convert an existing flow matching model into a consistency model capable of one-step generation—without training from scratch—while preserving generation quality and multi-step flexibility.

Method

Overall Architecture

SANA-Sprint builds upon the pretrained SANA model in three stages: (1) a lossless mathematical transformation converts the flow matching model to TrigFlow format; (2) sCM-based continuous-time consistency distillation maintains alignment with the teacher, while LADD adversarial distillation enhances single-step fidelity; (3) unified training produces a step-adaptive model shared across 1–4 steps.

Key Designs

  1. Training-Free Flow→TrigFlow Conversion: The core contribution. Via a rigorous derivation (Proposition 3.1), the inputs and outputs of a flow matching model are mapped to TrigFlow format through a differentiable transformation consisting of time remapping (\(t_{FM} = \sin(t_{Trig})/(\sin(t_{Trig})+\cos(t_{Trig}))\)), input rescaling, and a linear combination of the outputs. The transformation is lossless in theory, and experiments confirm it in practice (FID 5.81→5.73 at 50 steps). This eliminates the need to pretrain TrigFlow models from scratch and allows the sCM framework to be applied directly to any flow matching model.

  2. Hybrid sCM + LADD Distillation: sCM learns self-consistent mappings along ODE trajectories via a continuous-time consistency loss, maintaining distributional alignment and diversity with the teacher; LADD performs adversarial learning in latent space using teacher model features as the discriminator, enhancing single-step fidelity. The two are complementary: sCM alone achieves FID=8.93, LADD alone achieves FID=12.20, and their combination yields FID=8.11.

  3. Training Stabilization Techniques: (a) Dense Time Embedding: feeding the time embedding \(t\) instead of \(1000t\) stabilizes the time derivatives in the consistency loss and accelerates convergence; (b) QK-Normalization: applying RMS norm to queries and keys in self- and cross-attention stabilizes gradients at the 1.6B scale and prevents training collapse; (c) Max-Time Weighting: in LADD, setting the training timestep to \(t = \pi/2\) (i.e., pure noise) with 50% probability significantly improves one-step generation quality.
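The conversion in (1) can be sketched in a few lines. This is an illustrative reading of Proposition 3.1 with \(\sigma_d = 1\) and scalar inputs, not the paper's implementation; `flow_model` stands in for the pretrained flow matching velocity predictor \(v_\phi(x_s, s)\).

```python
import math

def flow_to_trigflow(flow_model, x_t, t):
    """Wrap a flow-matching velocity model v(x, s) as a TrigFlow model F(x, t).

    Sketch of the training-free conversion (sigma_d = 1):
      1. time remapping:  s = sin(t) / (sin(t) + cos(t))
      2. input rescaling: x_s = x_t / (sin(t) + cos(t))
      3. output combine:  F = ((cos(t) - sin(t)) * x_t + v) / (sin(t) + cos(t))
    """
    c = math.sin(t) + math.cos(t)
    s = math.sin(t) / c                 # FM time corresponding to TrigFlow time t
    v = flow_model(x_t / c, s)          # query the pretrained FM model
    return ((math.cos(t) - math.sin(t)) * x_t + v) / c
```

With this wrapper, the TrigFlow denoiser \( \hat{x}_0 = \cos(t)\,x_t - \sin(t)\,F \) recovers \(x_0\) exactly when the FM model returns the true velocity \( \epsilon - x_0 \), which is the sense in which the transformation is lossless.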

Loss & Training

  • Total loss: \(\mathcal{L} = \mathcal{L}_{sCM} + 0.5 \cdot \mathcal{L}_{adv}\)
  • The teacher is obtained by pruning and fine-tuning SANA-1.5 4.8B: it is first fine-tuned for 5K steps to adapt to the dense time embedding and QK-Norm, after which the student is distilled for 20K steps.
  • Training uses 32× A100 GPUs with a global batch size of 512.
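A minimal sketch of how the training objective and the max-time weighting fit together. The function names (`sample_adv_timestep`, `total_loss`) and the uniform fallback draw are illustrative assumptions; the paper uses its own timestep schedule.

```python
import math
import random

def sample_adv_timestep(p_max=0.5):
    """Max-time weighting for the adversarial (LADD) branch: with probability
    p_max, train at t = pi/2, i.e. pure noise, which is exactly the one-step
    setting. The fallback draw is uniform here purely for illustration."""
    if random.random() < p_max:
        return math.pi / 2
    return random.uniform(0.0, math.pi / 2)

def total_loss(loss_scm, loss_adv, lambda_adv=0.5):
    """Hybrid objective: L = L_sCM + 0.5 * L_adv."""
    return loss_scm + lambda_adv * loss_adv
```

Setting `p_max=0.5` reproduces the 50% max-time weighting the ablation found optimal.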

Key Experimental Results

| Method | Steps | Params | FID ↓ | GenEval ↑ | Latency (A100) |
|---|---|---|---|---|---|
| FLUX-dev | 50 | 12B | 10.15 | 0.67 | 23.0s |
| SANA-1.6B | 20 | 1.6B | 5.76 | 0.66 | 1.2s |
| FLUX-schnell | 4 | 12B | 7.94 | 0.71 | 2.10s |
| SD3.5-Turbo | 4 | 8B | 11.97 | 0.72 | 1.15s |
| SANA-Sprint 0.6B | 4 | 0.6B | 6.48 | 0.76 | 0.32s |
| SANA-Sprint 1.6B | 4 | 1.6B | 6.54 | 0.77 | 0.31s |
| FLUX-schnell | 1 | 12B | 7.26 | 0.69 | 0.68s |
| SDXL-DMD2 | 1 | 0.9B | 7.10 | 0.59 | 0.32s |
| SANA-Sprint 0.6B | 1 | 0.6B | 7.04 | 0.72 | 0.21s |
| SANA-Sprint 1.6B | 1 | 1.6B | 7.69 | 0.76 | 0.21s |
  • One-step generation: FID=7.04 and GenEval=0.72, surpassing FLUX-schnell (7.26/0.69) at 3.2× the speed.
  • Generates 1024×1024 images in 0.31s on RTX 4090 and 0.1s on H100—genuinely AIPC-class real-time performance.
  • ControlNet integration: 1024×1024 in 0.25s (H100), enabling real-time sketch-to-image interaction.
  • Unified step-adaptive design: a single model performs well across 1–4 steps without step-specific training.
  • 8.4× faster than the teacher SANA and 64.7× faster than FLUX-schnell (Transformer computation only).

Ablation Study

  • sCM+LADD ≫ sCM alone ≫ LADD alone.
  • Dense time embedding (\(t\) vs. \(1000t\)): eliminates 1000× gradient amplification, significantly improving stability.
  • QK-Norm is critical for the 1.6B model (training collapses without it).
  • Max-time weighting at 50% is optimal (0%→FID=9.44, 50%→FID=8.32).
  • CFG embedding improves CLIP score by +0.94.
  • The Flow→TrigFlow transformation is fully lossless (FID difference <0.1).
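To make the QK-Norm ablation concrete, here is a minimal sketch of attention with RMS-normalized queries and keys. This is a generic single-head illustration, not SANA's attention code; learnable per-head scales and multi-head reshaping are omitted.

```python
import numpy as np

def qk_norm_attention(q, k, v, eps=1e-6):
    """Softmax attention with RMS-normalized queries and keys (QK-Norm).

    Normalizing q and k to unit RMS bounds the attention logits regardless of
    activation magnitude, which is what keeps gradients stable at large scale."""
    def rms_norm(x):
        return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    q, k = rms_norm(q), rms_norm(k)
    logits = q @ k.swapaxes(-2, -1) / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # standard softmax stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Even with very large query/key activations, the logits stay bounded by \(\sqrt{d}\) after normalization, so the softmax never saturates the way it can in an unnormalized attention layer.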

Highlights & Insights

  • Lossless Flow→TrigFlow transformation is a breakthrough contribution: it enables any flow matching model (FLUX/SD3, etc.) to adopt the sCM distillation framework directly, without pretraining a TrigFlow model from scratch.
  • Hybrid distillation (sCM+LADD) offers strong complementarity: sCM preserves alignment and diversity, while LADD ensures single-step fidelity.
  • Genuine AIPC-class performance: 0.31s on RTX 4090 and 0.1s on H100—a milestone for real-time text-to-image generation on consumer GPUs.
  • Unified step-adaptive design: a single model delivers high quality across 1–4 steps, greatly simplifying deployment.
  • General-purpose stabilization techniques: dense time embedding and QK-Norm are transferable to other distillation methods.

Limitations & Future Work

  • SANA-Sprint 1.6B achieves slightly worse FID than 0.6B at one step (7.69 vs. 7.04), suggesting larger models may require more distillation iterations for single-step generation.
  • The JVP computation in sCM currently does not support Flash Attention, requiring Linear Attention as a substitute.
  • Validation is limited to the SANA architecture; applicability to FLUX/SD3 is claimed but not empirically demonstrated.
  • ControlNet is only evaluated with HED edge conditioning; other condition types remain untested.
Comparison with Related Methods

  • vs. FLUX-schnell: Both are distilled models; SANA-Sprint with 0.6B parameters surpasses the 12B FLUX-schnell on both FID and GenEval at over 10× the speed.
  • vs. LCM/PCM: Discrete-time CMs suffer severe quality degradation below 4 steps (SDXL-LCM FID=50.51 at 1 step); SANA-Sprint eliminates discretization error via sCM.
  • vs. DMD2: DMD2 is pure distribution distillation (VSD) requiring step-specific models; SANA-Sprint uses hybrid distillation with a step-adaptive model.
  • vs. Dense2MoE: Dense2MoE accelerates DiT inference via MoE; SANA-Sprint reduces step count via distillation—the two acceleration paradigms are orthogonal.

The Flow→TrigFlow transformation elevates sCM from a theoretical tool to a practical one, potentially catalyzing a wave of work accelerating existing flow matching models to one-step generation. The sCM+GAN hybrid strategy is also transferable to video diffusion model acceleration, where real-time performance is even more critical. This work is orthogonal to MoE-based methods such as Dynamic-DINO—combining MoE parameter reduction with sCM step reduction could yield even more extreme acceleration.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The lossless Flow→TrigFlow transformation is a theoretical innovation; the sCM+LADD hybrid framework is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparison of 6 methods across 1/2/4 steps; per-component ablation of stabilization techniques; detailed timestep scheduling analysis.
  • Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous (with complete proofs); Figures 1 and 2 are highly persuasive.
  • Value: ⭐⭐⭐⭐⭐ A milestone for real-time text-to-image generation on consumer GPUs; open-sourced code and models with significant community impact.