Visual Generation Without Guidance
Conference: ICML 2025
arXiv: 2501.15420
Code: https://github.com/thu-ml/GFT
Area: Diffusion Models / Image Generation
Keywords: Classifier-Free Guidance, guidance-free generation, sampling efficiency, pseudo-temperature parameterization, distillation alternative
TL;DR
This work proposes Guidance-Free Training (GFT), which parameterizes the conditional model as a linear interpolation between a sampling network and an unconditional network. This enables direct training of guidance-free visual generative models from data, halving the sampling computation while matching the performance of CFG on five models (DiT, VAR, LlamaGen, MAR, and LDM).
Background & Motivation
Background: Classifier-Free Guidance (CFG) is a standard technique in visual generation. It improves generation quality by evaluating both a conditional and an unconditional model at every sampling step, which doubles the inference computation.
Limitations of Prior Work: (a) Doubled inference costs; (b) complicated post-training processes (distillation and RLHF require special handling of unconditional models); (c) inconsistency with the simple temperature sampling used in LLMs.
Key Challenge: The sampling distribution of CFG, \(p^s(\boldsymbol{x}|c) \propto p(\boldsymbol{x}|c)[p(\boldsymbol{x}|c)/p(\boldsymbol{x})]^s\), does not have a corresponding real-world dataset, precluding direct maximum likelihood training.
Goal: Can a single model achieve the quality-diversity trade-off of CFG?
Key Insight: Rearranging the CFG formulation expresses the conditional model as a weighted combination of a sampling model and an unconditional model, so the sampling model can be learned directly.
Core Idea: Rewrite CFG as \(\epsilon^c = \frac{1}{1+s}\epsilon^s + \frac{s}{1+s}\epsilon^u\) and optimize \(\epsilon^s\) directly through this identity, so that sampling requires no guidance.
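The rearrangement behind the core idea can be written out step by step. This is a sketch in the notation of these notes, assuming the usual CFG combination rule for noise predictions with guidance scale \(s\); it is not copied from the paper.

\[
\begin{aligned}
\log p^s(\boldsymbol{x}|c) &= (1+s)\log p(\boldsymbol{x}|c) - s\log p(\boldsymbol{x}) + \text{const} \\
\Rightarrow\quad \epsilon^s &= (1+s)\,\epsilon^c - s\,\epsilon^u \\
\Rightarrow\quad \epsilon^c &= \frac{1}{1+s}\,\epsilon^s + \frac{s}{1+s}\,\epsilon^u \\
&= \beta\,\epsilon^s + (1-\beta)\,\epsilon^u, \qquad \beta = \frac{1}{1+s}.
\end{aligned}
\]

The second line follows because the noise prediction is proportional to the score \(\nabla_{\boldsymbol{x}} \log p\), so a linear combination of log-densities becomes the same linear combination of noise predictions.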
Method
Overall Architecture
GFT maintains the same maximum likelihood training objective as CFG but parameterizes the conditional model differently: the conditional model is defined as an implicit linear combination of the sampling network \(\epsilon_\theta^s\) and the unconditional network \(\epsilon_\theta^u\). A pseudo-temperature \(\beta = 1/(1+s)\) is introduced as an additional input to the model, allowing flexible adjustment during inference.
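As a quick numerical check of the relation \(\beta = 1/(1+s)\), the sketch below converts between guidance scale and pseudo-temperature and verifies that plugging the CFG-guided prediction into the implicit parameterization recovers the conditional prediction. All function names are hypothetical helpers, and predictions are plain Python lists rather than tensors.

```python
def beta_from_scale(s):
    """Pseudo-temperature for a CFG guidance scale s (s = 0 means no guidance)."""
    return 1.0 / (1.0 + s)


def scale_from_beta(beta):
    """Inverse mapping; beta must lie in (0, 1]."""
    return 1.0 / beta - 1.0


def cfg_output(eps_c, eps_u, s):
    """Guided prediction eps^s = (1 + s) * eps^c - s * eps^u, elementwise."""
    return [(1.0 + s) * c - s * u for c, u in zip(eps_c, eps_u)]


def implicit_conditional(eps_s, eps_u, beta):
    """Implicit conditional prediction eps^c = beta * eps^s + (1 - beta) * eps^u."""
    return [beta * es + (1.0 - beta) * u for es, u in zip(eps_s, eps_u)]
```

With `s = 3` the pseudo-temperature is `0.25`, and `implicit_conditional(cfg_output(eps_c, eps_u, 3.0), eps_u, 0.25)` returns `eps_c` again, which is why fitting the implicit conditional model to data pushes the sampling network toward the CFG-guided prediction.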
Key Designs
- Implicit Conditional Parameterization:
  - Function: Enables the training to directly optimize the sampling model \(\epsilon_\theta^s\) instead of the conditional model \(\epsilon_\theta^c\)
  - Mechanism: \(\epsilon_\theta^c(\boldsymbol{x}_t|\boldsymbol{c},\beta) = \beta \epsilon_\theta^s(\boldsymbol{x}_t|\boldsymbol{c},\beta) + (1-\beta) \epsilon_\theta^u(\boldsymbol{x}_t)\), where this implicit representation is trained using the standard conditional loss
  - Design Motivation: Although \(\epsilon^s\) cannot be learned directly without a dataset for \(p^s\), \(\epsilon^c\) is learnable, which allows \(\epsilon^s\) to be optimized indirectly through it
- Stop-Gradient Trick:
  - Function: Improves training efficiency and stability
  - Mechanism: The unconditional model \(\epsilon_\theta^u\) is executed in eval mode with stop-gradient, propagating gradients only to \(\epsilon_\theta^s\)
  - Design Motivation: (a) Highly aligned with CFG training, differing only by one unconditional inference; (b) introduces negligible GPU memory overhead; (c) increases training time by only 19%
- Pseudo-Temperature \(\beta\) Input:
  - Function: Allows a single model to support sampling at different temperatures
  - Mechanism: \(\beta \sim U(0,1)\) is randomly sampled, passed through a Fourier embedding and an MLP, and added to the time/class embeddings
  - Design Motivation: \(\beta=1\) is equivalent to standard conditional generation, while \(\beta \to 0\) approaches low-temperature high-quality sampling
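The three designs above can be combined into a single training step. The following is a minimal PyTorch sketch under several assumptions, not the authors' implementation: `ToyDenoiser` is a hypothetical stand-in network, timestep conditioning and label dropout are omitted, a plain MLP replaces the Fourier embedding of \(\beta\), and the loss is averaged rather than summed.

```python
import torch
import torch.nn as nn


class ToyDenoiser(nn.Module):
    """Hypothetical stand-in for eps_theta(x_t | c, beta); not a real DiT."""

    def __init__(self, dim=8, num_classes=10):
        super().__init__()
        # Index `num_classes` is reserved for the null (unconditional) label.
        self.null_label = num_classes
        self.label_emb = nn.Embedding(num_classes + 1, dim)
        # Assumption: a small MLP on the scalar beta instead of Fourier features.
        self.beta_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x_t, labels, beta):
        cond = self.label_emb(labels) + self.beta_mlp(beta[:, None])
        return self.net(x_t + cond)


def gft_diffusion_loss(model, x_t, labels, noise):
    """One GFT-style loss evaluation on a batch of noisy inputs x_t."""
    beta = torch.rand(x_t.shape[0], device=x_t.device)  # beta ~ U(0, 1)

    # Sampling branch: receives gradients.
    eps_s = model(x_t, labels, beta)

    # Unconditional branch: eval mode, no gradients, null label, beta = 1.
    was_training = model.training
    model.eval()
    with torch.no_grad():
        null = torch.full_like(labels, model.null_label)
        eps_u = model(x_t, null, torch.ones_like(beta))
    model.train(was_training)

    # Implicit conditional prediction, trained with the usual noise-matching loss.
    b = beta[:, None]
    eps_c = b * eps_s + (1.0 - b) * eps_u
    return ((eps_c - noise) ** 2).mean()
```

At inference time only the sampling branch would be called, for example `model(x_t, labels, beta)` with a fixed `beta`, so each step costs one forward pass instead of the two that CFG needs.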
Loss & Training
- Diffusion version: \(\mathcal{L} = \|\beta\epsilon_\theta^s(\boldsymbol{x}_t|\boldsymbol{c}_\varnothing,\beta) + (1-\beta)\mathbf{sg}[\epsilon_\theta^u(\boldsymbol{x}_t|\varnothing,1)] - \boldsymbol{\epsilon}\|_2^2\), where \(\mathbf{sg}[\cdot]\) denotes the stop-gradient operator
- AR/Masked version: Conditional logits \(\ell_\theta^c = \beta \ell_\theta^s + (1-\beta)\mathbf{sg}[\ell_\theta^u]\), followed by computing the standard cross-entropy
- Fine-tuning a pretrained CFG model requires only 1-5% of the pretraining epochs; the final MLP layer for \(\beta\) is zero-initialized so that the new input does not change the model's outputs at the start of fine-tuning
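The AR/masked variant can be illustrated for a single token using only the standard library. This is a sketch with hypothetical function names; in real training the unconditional logits would come from a stop-gradient forward pass, which this plain-Python version does not model because it only evaluates the loss value.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def gft_token_loss(logits_s, logits_u, beta, target):
    """Cross-entropy on the implicit conditional logits for one token.

    logits_c = beta * logits_s + (1 - beta) * logits_u, with logits_u
    treated as a constant (stop-gradient) in an autograd framework.
    """
    logits_c = [beta * ls + (1.0 - beta) * lu for ls, lu in zip(logits_s, logits_u)]
    return -log_softmax(logits_c)[target]
```

With `beta = 1.0` the unconditional logits drop out and the function reduces to ordinary cross-entropy on the sampling logits, matching the statement that \(\beta = 1\) corresponds to standard conditional training.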
Key Experimental Results
Main Results
| Model | CFG FID ↓ | GFT FID ↓ | GFT Training Setup |
|---|---|---|---|
| DiT-XL/2 | 2.11 | 1.99 | Fine-tune 2% epochs |
| DiT-XL/2 (Distillation) | 2.11 | - | - |
| VAR-d30 | - | Matches CFG | Train from scratch |
| LlamaGen | - | Matches CFG | Train from scratch |
| MAR | - | Matches CFG | Train from scratch |
| LDM (T2I) | - | Matches CFG | Fine-tune |
Ablation Study
| Method | Domain | Relative Training Time | Relative VRAM | Trainable from Scratch |
|---|---|---|---|---|
| Guidance Distillation | Diffusion only | ×1.19 | ×1.15 | ✗ |
| Contrastive Alignment | AR/Masked only | ×1.69 | ×1.39 | ✗ |
| GFT (Ours) | All | ×1.00 | ×1.00 | ✓ |
Key Findings
- GFT fine-tuning on DiT-XL reaches an FID of 1.99 using only 2% of the pretraining epochs, outperforming the CFG baseline of 2.11.
- GFT is the only method that supports training from scratch and is compatible across all three model types: diffusion, AR, and masked models.
- Adjusting \(\beta\) enables a diversity-fidelity trade-off comparable to CFG.
Highlights & Insights
- Extremely Simple Implementation: Requires modifying only a few lines of existing CFG code, directly inheriting most hyperparameters.
- Unity: Unifies guidance-free training across diffusion, AR, and masked models for the first time.
- Theoretical Guarantee: Theorem 1 proves that the optimal solution of GFT is consistent with the target CFG sampling distribution.
Limitations & Future Work
- The authors acknowledge that the sampling distribution of \(\beta\) (uniform distribution) may not be optimal.
- Validation on large-scale T2I generation (e.g., SDXL-scale models) remains limited.
- The integration with acceleration methods like consistency models has not been explored.
Rating
- Novelty: ⭐⭐⭐⭐ The parameterization perspective is ingenious, though the core is mathematically a reformulation of CFG.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers five models in both class-conditional and text-conditional settings.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear derivation and fair experimental comparisons.
- Value: ⭐⭐⭐⭐⭐ Highly practical, with strong potential to become a standard method replacing CFG.