WaDi: Weight Direction-aware Distillation for One-step Image Synthesis

Conference: CVPR 2026
arXiv: 2603.08258
Code: https://github.com/gudaochangsheng/WaDi
Area: Image Generation
Keywords: Diffusion Distillation, Weight Direction, Low-Rank Rotation, One-Step Generation, Parameter Efficiency

TL;DR

By decomposing the weight changes that occur during distillation into norm and direction components, this work finds that directional change is the primary driver of distillation: its mean is 22× larger than that of norm change. It proposes LoRaD (Low-Rank Weight Direction Rotation) adapters and integrates them into the VSD framework to form WaDi, which achieves state-of-the-art one-step FID on COCO with only ~10% trainable parameters.

Background & Motivation

Background: Diffusion distillation methods compress multi-step diffusion into one-step generators. Mainstream approaches are divided into full fine-tuning (FT) and LoRA-based fine-tuning, both built upon the VSD (Variational Score Distillation) framework.

Limitations of Prior Work: Both FT and LoRA update parameters directly and therefore optimize weight norm and direction jointly, yet the two quantities change at vastly different scales: the mean and standard deviation of directional change are 22× and 10× larger than those of norm change, respectively. Coupling the two makes optimization harder.
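
The norm and direction statistics above rest on splitting each weight column into a length and a unit vector. The following is a minimal sketch of such a measurement, not the authors' code: the function names are hypothetical, and using `1 - cosine similarity` as the direction-change metric is an assumption, since the exact metric is not stated here.

```python
import math

def column_norm_and_direction(weight, col):
    """Split one column of a row-major matrix into (norm, unit direction)."""
    column = [row[col] for row in weight]
    norm = math.sqrt(sum(v * v for v in column))
    return norm, [v / norm for v in column]

def relative_changes(w_teacher, w_student):
    """Return (mean relative norm change, mean direction change) over columns."""
    n_cols = len(w_teacher[0])
    norm_changes, dir_changes = [], []
    for j in range(n_cols):
        n_t, d_t = column_norm_and_direction(w_teacher, j)
        n_s, d_s = column_norm_and_direction(w_student, j)
        norm_changes.append(abs(n_s - n_t) / n_t)
        cosine = sum(a * b for a, b in zip(d_t, d_s))
        dir_changes.append(1.0 - cosine)  # assumed metric: 1 - cosine similarity
    return sum(norm_changes) / n_cols, sum(dir_changes) / n_cols

# Toy 2x2 weights: column 0 is rotated by 90 degrees, column 1 is scaled by 1.1.
w_teacher = [[1.0, 0.0],
             [0.0, 2.0]]
w_student = [[0.0, 0.0],
             [1.0, 2.2]]
mean_norm_change, mean_dir_change = relative_changes(w_teacher, w_student)
```

In this toy case the rotated column contributes only to the direction statistic and the rescaled column only to the norm statistic, which is the separation the analysis relies on.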

Key Challenge: Distillation signals are primarily conveyed through directional adjustments, yet existing adapters (LoRA/DoRA) are not specifically designed for direction optimization, leading to slow convergence, instability, and susceptibility to overfitting.

Key Validation: Replacing the one-step model's directions with teacher directions degrades FID by 241, whereas replacing the norms changes FID by only 0.7. The direction residual matrix retains 93% of its information when truncated to 30% of its rank, which indicates a low-rank structure.

Core Idea: Since the essence of distillation is weight direction rotation, directly learning a low-rank rotation matrix to adjust directions is more principled than indirect influence via LoRA.

Method

Overall Architecture

WaDi is built upon the VSD framework, which combines a frozen teacher \(\epsilon_\psi\) (a multi-step diffusion model), a one-step student generator \(G_{\lambda}\), and a fake model \(\epsilon_\phi\) that tracks the student distribution. The key innovation is replacing LoRA/FT with LoRaD as the adapter for both the student and the fake model.

Key Designs

LoRaD (Low-Rank Weight Direction Rotation):
- Function: Adjusts only the direction of pretrained weights via learnable rotation matrices, leaving their norms unchanged.
- Mechanism: Inspired by RoPE, each column of the weight \(W \in \mathbb{R}^{d \times k}\) is partitioned into \(d/2\) odd-even paired 2D subspaces, and an independent rotation is applied in each subspace: \(W_{ro} = R_{AB}W = \begin{bmatrix} \cos\Theta & -\sin\Theta \\ \sin\Theta & \cos\Theta \end{bmatrix} \begin{bmatrix} W_{\text{odd}} \\ W_{\text{even}} \end{bmatrix}\), where \(W_{\text{odd}}, W_{\text{even}} \in \mathbb{R}^{d/2 \times k}\) hold the odd and even rows, \(\cos\) and \(\sin\) act element-wise, and the block products are element-wise. The rotation angle matrix is \(\Theta = AB\) with \(A \in \mathbb{R}^{d/2 \times r}\) and \(B \in \mathbb{R}^{r \times k}\), which gives the low-rank parameterization.
- Design Motivation: Rotations are orthogonal transformations and therefore preserve norms, which matches the finding that direction is critical while norm is negligible. The low-rank factorization exploits the low-rank structure of the direction residuals and substantially reduces the parameter count.
- Implementation Efficiency: By leveraging the sparse block-diagonal structure of rotation matrices, computation requires only element-wise multiplication with no additional matrix multiplication overhead.
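
The rotation described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the released implementation: `lorad_rotate`, `matmul`, and `column_norms` are hypothetical names, rows `2p` and `2p + 1` (0-indexed) are assumed to form the odd-even pair, and the tiny shapes (`d = 4`, `k = 2`, `r = 1`) are chosen only for readability.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product for small row-major matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lorad_rotate(weight, a, b):
    """Rotate paired rows of `weight` (d x k) by the angles theta = a @ b.

    a has shape (d/2 x r) and b has shape (r x k), so theta is (d/2 x k):
    one angle per 2D pair and per column.
    """
    theta = matmul(a, b)
    rotated = [[0.0] * len(weight[0]) for _ in weight]
    for p, angles in enumerate(theta):
        first, second = weight[2 * p], weight[2 * p + 1]
        for j, angle in enumerate(angles):
            c, s = math.cos(angle), math.sin(angle)
            # Element-wise 2D rotation; no extra matrix product is needed.
            rotated[2 * p][j] = c * first[j] - s * second[j]
            rotated[2 * p + 1][j] = s * first[j] + c * second[j]
    return rotated

def column_norms(m):
    """Euclidean norm of every column of a row-major matrix."""
    return [math.sqrt(sum(row[j] ** 2 for row in m)) for j in range(len(m[0]))]

weight = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [4.0, 0.0]]  # d=4, k=2
a = [[0.3], [-0.7]]                                        # (d/2 x r), r=1
b = [[1.0, 2.0]]                                           # (r x k)
rotated = lorad_rotate(weight, a, b)
```

Because every pair is rotated independently, `column_norms(rotated)` matches `column_norms(weight)` up to floating-point error, and setting `b` to zeros leaves the weight unchanged.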

WaDi Training Framework:
- Function: Integrates LoRaD into the VSD distillation framework.
- Mechanism: The student \(G_{\lambda_{\Theta^l}}\) uses high-rank LoRaD (rank=256); the fake model \(\epsilon_{\phi_{\Theta^s}}\) uses low-rank LoRaD (rank=32). Both are optimized alternately.
- Student loss gradient: \(\nabla_{\lambda_{\Theta^l}} \mathcal{L}_{\text{wadi}} = \mathbb{E}[\omega(t)(\epsilon_\psi - \epsilon_{\phi_{\Theta^s}}) \frac{\partial G_{\lambda_{\Theta^l}}}{\partial \lambda_{\Theta^l}}]\)
- Design Motivation: The student requires higher capacity (rank=256) to fully fit the teacher distribution; the fake model only needs to track the student's evolution (rank=32).

Loss & Training

Image-free training: no real images required; only 1.4M JourneyDB text prompts are used.
Student LR=1e-4, fake model LR=1e-2, AdamW optimizer, batch=128, CFG=1.5.
Trained for 2 epochs; supports SD1.5, SD2.1, and PixArt-α backbones.
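
The hyperparameters listed above can be gathered into a single configuration object. The sketch below is only a convenience illustration: `WaDiTrainConfig` and its field names are hypothetical and not taken from the released code, and the step counts assume exactly 1.4M prompts, one optimizer step per batch, and no gradient accumulation.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class WaDiTrainConfig:
    """Hypothetical container for the settings reported in this section."""
    num_prompts: int = 1_400_000   # JourneyDB text prompts (image-free training)
    batch_size: int = 128
    epochs: int = 2
    student_lr: float = 1e-4
    fake_lr: float = 1e-2
    cfg_scale: float = 1.5
    student_rank: int = 256        # high-rank LoRaD for the student
    fake_rank: int = 32            # low-rank LoRaD for the fake model
    optimizer: str = "AdamW"

    def steps_per_epoch(self) -> int:
        # Assumes the last partial batch is kept.
        return math.ceil(self.num_prompts / self.batch_size)

    def total_steps(self) -> int:
        return self.epochs * self.steps_per_epoch()

config = WaDiTrainConfig()
```

Under these assumptions one epoch is roughly 10.9k steps, so the full two-epoch schedule is on the order of 22k student updates.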

Key Experimental Results

Main Results — COCO 2014 Zero-Shot FID

Method	Backbone	NFE	Trainable Params	FID↓	CLIP↑
SD 1.5	U-Net	25	860M	8.78	0.30
DMD2	U-Net	1	860M	12.96	0.30
SiD-LSG	U-Net	1	860M	14.27	0.30
WaDi	U-Net	1	83.8M (9.7%)	10.79	0.31
PixArt-α	DiT	20	610M	8.75	0.32
WaDi	DiT	1	81.2M (13.3%)	18.99	0.30

Ablation Study — Adapter Type Comparison

Adapter	Params	FID↓	Direction Mean Change
LoRA	120.9M	25.27	0.83%
DoRA	121.2M	26.56	0.55%
DoRA (frozen norm)	120.9M	24.52	0.92%
FT (DMD2)	860.0M	23.30	2.21%
LoRaD	83.8M	20.86	2.89%

Ablation Study — Effect of Rank Configuration (COCO 2014)

Setting	Student Rank	Student Params	Fake Model Rank	FID↓	CLIP↑
A	64	20.95M	32	13.64	0.30
B	128	41.90M	32	13.16	0.29
C	256	83.80M	32	10.79	0.31
D	512	167.59M	32	12.75	0.30
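
The student parameter counts in this table grow almost exactly linearly with rank, which is what the factorization \(\Theta = AB\) predicts: an adapted layer with \(A \in \mathbb{R}^{d/2 \times r}\) and \(B \in \mathbb{R}^{r \times k}\) contributes \(r(d/2 + k)\) parameters. A quick check using only the numbers from the table (the variable names are illustrative):

```python
# Student rank -> (trainable params in millions, FID), copied from the table above.
rank_ablation = {
    64: (20.95, 13.64),
    128: (41.90, 13.16),
    256: (83.80, 10.79),
    512: (167.59, 12.75),
}

# Parameters per unit of rank should be constant if the count is linear in r.
params_per_rank = {rank: params / rank for rank, (params, _) in rank_ablation.items()}
spread = max(params_per_rank.values()) - min(params_per_rank.values())

# Rank with the lowest FID among the four settings.
best_rank = min(rank_ablation, key=lambda rank: rank_ablation[rank][1])
```

Each setting works out to about 0.327M parameters per unit of rank, with the small spread attributable to rounding in the reported figures, and `best_rank` is 256.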

Key Findings

LoRaD achieves the largest directional change (2.89%) and the best FID (20.86) with the fewest trainable parameters (83.8M, versus about 121M for LoRA/DoRA and 860M for full fine-tuning), which supports the hypothesis that direction is the key variable in distillation.
Rank=256 is the optimal student configuration; rank=512 leads to overfitting (FID degrades from 10.79 to 12.75).
Fake model rank primarily affects fidelity (FID) with minimal impact on semantic alignment (CLIP).
WaDi can be directly applied to downstream tasks including ControlNet (86% inference speedup), ReVersion (89% speedup), and DreamBooth.
In a user study with 57 participants, WaDi was consistently rated superior to existing baselines in both image quality and text-image alignment.

Highlights & Insights

Weight Norm–Direction Decomposition Analysis: This work is the first to systematically study the structure of weight changes during distillation — directional change far exceeds norm change, and direction residuals exhibit a low-rank structure. This offers a novel theoretical perspective on distillation.
Rotation Instead of Addition: LoRA updates weights via addition \(W + \Delta W\) (jointly modifying both norm and direction), whereas LoRaD applies rotation \(R_{\Theta}W\) to modify direction exclusively — more targeted and more efficient.
Parameter Efficiency: Approximately 10% of trainable parameters suffice to surpass full fine-tuning, offering significant value in resource-constrained settings.
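
The contrast drawn in "Rotation Instead of Addition" can be seen on a single 2D weight vector. The values below are arbitrary toy numbers chosen for illustration, not quantities from the paper.

```python
import math

def norm(v):
    """Euclidean length of a 2D vector."""
    return math.hypot(v[0], v[1])

def angle_between(u, v):
    """Angle in radians between two 2D vectors."""
    cosine = (u[0] * v[0] + u[1] * v[1]) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cosine)))

w = (3.0, 4.0)          # original weight pair, norm 5
delta = (0.5, -0.2)     # additive update, as in W + delta_W
theta = 0.1             # rotation angle in radians, as in R_theta W

added = (w[0] + delta[0], w[1] + delta[1])
rotated = (math.cos(theta) * w[0] - math.sin(theta) * w[1],
           math.sin(theta) * w[0] + math.cos(theta) * w[1])

norm_shift_added = norm(added) - norm(w)      # nonzero: addition changes the norm
norm_shift_rotated = norm(rotated) - norm(w)  # zero up to rounding
turn_rotated = angle_between(w, rotated)      # equals theta
```

The additive update moves both the length and the direction of `w`, whereas the rotation changes the direction by exactly `theta` and leaves the length at 5.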

Limitations & Future Work

The 2D subspace pairing in LoRaD follows a fixed odd-even scheme, which may not be the optimal grouping strategy.
Although FID is better than DMD2, the difference in CLIP scores is marginal, suggesting that directional rotation primarily improves image fidelity rather than semantic alignment.
The FID gap on PixArt-α (DiT architecture) remains relatively large: the one-step student reaches 18.99 versus 8.75 for the 20-step teacher, which may call for DiT-specific designs.
Ablation studies are conducted only on COCO; validation on additional datasets is lacking.

Comparison with Related Work

vs. DMD2: DMD2 employs full fine-tuning for distillation; WaDi achieves better FID with only 10% of the parameters by precisely targeting the key variable in distillation (direction).
vs. LoRA/DoRA: LoRA's additive updates modify both norm and direction but yield insufficient directional change (0.83%); DoRA decouples the norm but still updates direction via LoRA; LoRaD directly rotates directions, achieving the largest directional change (2.89%).
vs. SwiftBrush: SwiftBrush also builds on VSD but uses full fine-tuning; WaDi combines VSD with LoRaD for far superior parameter efficiency.

Rating

Novelty: ⭐⭐⭐⭐⭐ The weight norm–direction analysis perspective is original, and the LoRaD design is elegant with well-grounded theoretical motivation.
Experimental Thoroughness: ⭐⭐⭐⭐ Coverage across three backbones, downstream tasks, and a user study is comprehensive; ablation is detailed, though broader dataset coverage would strengthen the work.
Writing Quality: ⭐⭐⭐⭐⭐ The motivation analysis is highly convincing (replacement experiments + SVD analysis), with rigorous logical argumentation throughout.
Value: ⭐⭐⭐⭐⭐ Sets a new standard for parameter-efficient distillation; LoRaD is transferable to other fine-tuning scenarios.