# Enhancing Diffusion Model Guidance through Calibration and Regularization
- Conference: NeurIPS 2025 (SPIGM Workshop)
- arXiv: 2511.05844
- Code: ajavid34/guided-info-diffusion
- Area: Image Generation
- Keywords: classifier guidance, diffusion model, f-divergence, calibration, conditional generation
## TL;DR
To address the vanishing gradient problem caused by overconfident classifiers in classifier-guided diffusion models, this paper proposes two complementary approaches: (1) a Smooth ECE calibration loss for fine-tuning classifiers, yielding ~3% FID improvement; and (2) regularized sampling guidance based on f-divergences (RKL/FKL/JS) that requires no retraining, achieving FID 2.13 on ImageNet 128×128.
## Background & Motivation
Classifier-guided DDPM steers the reverse diffusion process toward a target class using gradients \(\nabla_x \log p(y|x)\) from an external classifier, making it a core technique for conditional image generation. However, a critical failure mode exists:
Vanishing Gradient Problem: Classifiers assign excessively high confidence (near one-hot distributions) to partially denoised images in the early stages of denoising, causing \(\nabla_x \log p(y|x) \to 0\). Subsequent denoising steps then effectively degrade to unconditional generation, severely degrading conditional generation quality.
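To make the failure mode concrete, the following is a minimal PyTorch sketch (not from the paper) with a toy one-dimensional, two-class classifier: as the logit scale grows, the softmax saturates and \(\nabla_x \log p(y|x)\) collapses toward zero.

```python
import torch

def guidance_gradient(logits_fn, x, y):
    """Gradient of log p(y|x) w.r.t. x for a differentiable classifier."""
    x = x.detach().requires_grad_(True)
    log_probs = torch.log_softmax(logits_fn(x), dim=-1)
    log_probs[torch.arange(len(y)), y].sum().backward()
    return x.grad

# Toy two-class classifier whose logits scale with a sharpness factor.
# As sharpness grows, the classifier becomes overconfident and the guidance gradient vanishes.
for sharpness in (1.0, 10.0, 100.0):
    logits_fn = lambda x, s=sharpness: s * torch.cat([x, -x], dim=-1)
    grad = guidance_gradient(logits_fn, torch.ones(1, 1), torch.tensor([0]))
    print(f"sharpness={sharpness:6.1f}  ||grad log p(y|x)|| = {grad.norm().item():.2e}")
```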
Existing remedies (e.g., entropy-constrained training) require training classifiers from scratch and cannot be applied to off-the-shelf classifiers.
## Core Problem
How can the vanishing gradient problem in classifier guidance be mitigated—without retraining the diffusion model or classifier—to improve the quality and diversity of conditional generation?
## Method
### 1. Smooth ECE Calibration Loss (Requires Fine-tuning)
A differentiable, Huber-style calibration loss is defined on the predicted confidence \(\hat{p}^{(i)} = \max_y p_\phi(y|x^{(i)})\) and the correctness indicator \(a^{(i)} = \mathbb{I}[\hat{y}^{(i)} = y^{(i)}]\), with a parameter \(\beta > 0\) that ensures differentiability.
This loss can be incorporated as a regularization term during fine-tuning to improve classifier calibration and, consequently, guidance quality.
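The paper's exact Smooth ECE expression is not reproduced above, so the following is only an illustrative sketch: it assumes a Huber penalty on the gap between confidence \(\hat{p}^{(i)}\) and correctness \(a^{(i)}\), added to the usual cross-entropy during fine-tuning; the function names and the weight `lam` are placeholders.

```python
import torch
import torch.nn.functional as F

def huber(residual, beta):
    """Huber function: quadratic near zero, linear in the tails, differentiable everywhere."""
    abs_r = residual.abs()
    return torch.where(abs_r <= beta, 0.5 * residual ** 2 / beta, abs_r - 0.5 * beta)

def smooth_calibration_penalty(logits, labels, beta=0.1):
    """Illustrative differentiable calibration term (not the paper's exact Smooth ECE):
    penalizes the gap between predicted confidence p_hat and correctness a."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)          # \hat{p}^{(i)}
    correct = (pred == labels).float()      # a^{(i)}
    return huber(conf - correct, beta).mean()

def finetune_loss(logits, labels, lam=1.0):
    """Cross-entropy plus the calibration regularizer used while fine-tuning the classifier."""
    return F.cross_entropy(logits, labels) + lam * smooth_calibration_penalty(logits, labels)
```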
### 2. f-Divergence Regularized Sampling Guidance (No Retraining Required)
Core Idea: An f-divergence between the classifier's output distribution \(p(\cdot|x)\) and a target distribution \(q_y(\cdot)\) is added to the guidance score as a regularization term, preventing the class distribution from collapsing prematurely. Here \(q_y(i) = (1-\epsilon)\frac{1}{N} + \epsilon\, \mathbb{I}_{i=y}\) is a label-smoothed target distribution over the \(N\) classes, and \(\tau_1, \tau_2\) are temperatures applied to the joint and marginal terms, respectively.
The gradient of the regularizer takes the general form \(\nabla_x D_f\big(q_y \,\|\, p(\cdot|x)\big) = \sum_i w_f\big(q_y(i), p(i|x)\big)\, g_i(x)\), with weights \(w_f(q, p) = p f'(p/q)\) and per-class directions \(g_i(x) = \nabla_x f_i(x) - \sum_j p(j|x) \nabla_x f_j(x)\); with \(f_i\) the classifier logits, \(g_i(x)\) is exactly \(\nabla_x \log p(i|x)\).
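A minimal autograd sketch of the regularized guidance, assuming the convention \(D_f(q_y \| p) = \sum_i q_y(i)\, f(p(i|x)/q_y(i))\) (consistent with the stated weight \(w_f(q,p) = p f'(p/q)\)) and collapsing the paper's two temperatures into a single softmax temperature for brevity; `classifier`, `alpha`, and the default values are placeholders.

```python
import torch

def smoothed_target(y, num_classes, eps):
    """Label-smoothed target q_y(i) = (1 - eps)/N + eps * 1[i = y]."""
    q = torch.full((num_classes,), (1.0 - eps) / num_classes)
    q[y] += eps
    return q

def f_divergence(q, p, kind):
    """D_f(q || p) = sum_i q_i * f(p_i / q_i) for the three generators used here."""
    t = p / q
    if kind == "rkl":    # f(t) = -log t
        return -(q * t.log()).sum()
    if kind == "fkl":    # f(t) = t log t
        return (q * t * t.log()).sum()
    if kind == "js":     # Jensen-Shannon via the mixture m = (q + p) / 2
        m = 0.5 * (q + p)
        return 0.5 * ((q * (q / m).log()).sum() + (p * (p / m).log()).sum())
    raise ValueError(kind)

def regularized_guidance(classifier, x, y, kind="js", tau=1.0, alpha=1.0, eps=0.9):
    """Sketch of the guidance gradient: nabla_x [ log p(y|x) - alpha * D_f(q_y || p(.|x)) ]
    for a single sample x (batch size 1)."""
    x = x.detach().requires_grad_(True)
    probs = (classifier(x) / tau).softmax(dim=-1).squeeze(0)
    q = smoothed_target(y, probs.numel(), eps)
    score = probs[y].log() - alpha * f_divergence(q, probs, kind)
    score.backward()
    return x.grad
```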
### 3. Three f-Divergence Instantiations
#### Reverse KL (Mode Covering)
\(f(t) = -\log(t)\), giving weight \(w_f(q,p) = -q\). The gradient decomposes into a target-directed component and a diversity-preserving component. The mode-covering property of RKL keeps probability on every class where \(q_y\) has support, preventing mode dropping. Under a Gaussian-mixture analysis, the guidance force comprises an enhanced target direction \((\tau_1 + \alpha\epsilon)(\mu_y - x)\) and a diversity term \(\alpha\frac{1-\epsilon}{K}\sum_{k \neq y}(\mu_k - x)\).
#### Forward KL (Mode Seeking)
\(f(t) = t\log(t)\), with weights containing a \(\log(p(i|x)/q_y(i))\) term. This strongly penalizes probability mass that \(p\) places outside the support of \(q_y\), producing sharper but less diverse samples: the highest precision and the lowest recall among the three divergences.
#### Jensen-Shannon (Balanced Guidance)
Balances mode covering and mode seeking through an implicit mixture distribution \(m = \frac{1}{2}(q_y + p)\). The weight \((q_y(i) - p(i|x))/m(i)\) is bounded and approaches zero when \(q_y \approx p\), providing smooth gradient dynamics. Achieves the best empirical performance.
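For intuition, the per-class weights of the three divergences can be compared numerically. The RKL and JS weights are as stated above; the FKL weight is derived from \(w_f(q,p) = p f'(p/q)\) with \(f(t) = t\log t\). The toy distributions below are illustrative only.

```python
import numpy as np

def wf_rkl(q, p):
    """RKL: f(t) = -log t, so w_f(q, p) = p * f'(p/q) = -q (independent of p)."""
    return -q

def wf_fkl(q, p):
    """FKL: f(t) = t log t, so w_f(q, p) = p * (log(p/q) + 1); grows where p >> q."""
    return p * (np.log(p / q) + 1.0)

def wf_js(q, p):
    """JS: weight (q - p) / m with m = (q + p)/2; bounded, and -> 0 as p -> q."""
    m = 0.5 * (q + p)
    return (q - p) / m

q = np.array([0.90, 0.05, 0.05])   # illustrative smoothed target for class 0
for name, p in [("calibrated   ", np.array([0.60, 0.30, 0.10])),
                ("overconfident", np.array([0.98, 0.01, 0.01]))]:
    print(name, "RKL:", wf_rkl(q, p), "FKL:", wf_fkl(q, p).round(3), "JS:", wf_js(q, p).round(3))
```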
### 4. Tilted Sampling
Tilted sampling adjusts per-sample guidance weights using within-batch information, controlled by a tilt parameter \(t\): \(t > 0\) emphasizes high-confidence samples (improving quality), while \(t < 0\) emphasizes low-confidence samples (improving diversity).
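The paper's exact tilting rule is not reproduced above, so the following is an assumed illustration: a simple exponential tilting of per-sample weights by classifier confidence, which reproduces the qualitative behavior described (\(t > 0\) favors high-confidence samples, \(t < 0\) favors low-confidence ones).

```python
import torch

def tilted_weights(confidences, t):
    """Illustrative exponential tilting of per-sample guidance weights within a batch:
    w_i proportional to exp(t * c_i), rescaled so the average guidance strength is unchanged."""
    w = torch.softmax(t * confidences, dim=0)
    return w * len(confidences)

conf = torch.tensor([0.99, 0.70, 0.40, 0.10])
print(tilted_weights(conf, t=-0.2))   # low-confidence samples weighted slightly higher (diversity)
print(tilted_weights(conf, t=+0.2))   # high-confidence samples weighted slightly higher (quality)
```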
## Key Experimental Results
### Smooth ECE Fine-tuning (10K samples, ImageNet 128×128)
| Method | FID↓ | Precision↑ | Recall↑ |
|---|---|---|---|
| Standard fine-tuned classifier | 6.15 | 0.77 | 0.68 |
| +Smooth ECE | 5.94 | 0.79 | 0.66 |
~3% FID improvement with minimal fine-tuning.
### Sampling Guidance Comparison (10K samples, ResNet-50)
| Method | FID↓ | Precision↑ | Recall↑ |
|---|---|---|---|
| Baseline (ma2023) | 5.34 | 0.78 | 0.67 |
| Tilted sampling (t=-0.2) | 5.28 | 0.77 | 0.68 |
| Entropy regularization | 5.30 | 0.77 | 0.69 |
| RKL guidance | 5.12 | 0.78 | 0.68 |
### Comparison with SOTA (50K samples, ImageNet 128×128)
| Method | Classifier | FID↓ | Precision↑ | Recall↑ |
|---|---|---|---|---|
| Dhariwal et al. | Fine-tuned | 2.97 | 0.78 | 0.59 |
| Entropy-aware classifier | Specialized | 2.68 | 0.80 | 0.56 |
| Classifier-free | - | 2.43 | - | - |
| ma2023 | ResNet-50 | 2.37 | 0.77 | 0.60 |
| FKL (ours) | ResNet-101 | 2.17 | 0.80 | 0.59 |
| RKL (ours) | ResNet-101 | 2.14 | 0.79 | 0.59 |
| JS (ours) | ResNet-101 | 2.13 | 0.79 | 0.60 |
JS divergence achieves the best FID of 2.13 without retraining the diffusion model or classifier.
## Divergence Characteristics Summary
| Divergence | FID↓ | Precision↑ | Recall↑ | Characteristic |
|---|---|---|---|---|
| FKL | Highest of the three (2.17) | Highest | Lowest | Mode seeking, high sharpness |
| RKL | Middle (2.14) | Mid | Mid | Mode covering, diversity preservation |
| JS | Best (2.13) | Mid | Highest | Balanced covering and seeking |
## Highlights & Insights
- Plug-and-play: The f-divergence guidance method can be directly applied to off-the-shelf classifiers and diffusion models with zero retraining cost.
- Rigorous theoretical analysis: Complete gradient derivations for all three f-divergences are provided (Proposition 2), along with closed-form analysis under a Gaussian mixture scenario (Proposition 3), establishing a solid mathematical foundation.
- Novel insight: The superiority of JS divergence over both RKL and FKL challenges the conventional belief that mode-covering (RKL) guidance is optimal for generation.
- Smooth ECE loss is simple yet effective, improving calibration and FID with only minimal fine-tuning.
## Limitations & Future Work
- As a workshop paper, the experimental scope is limited—evaluation is restricted to ImageNet 128×128 without testing higher resolutions.
- The f-divergence guidance hyperparameters (\(\alpha, \epsilon, \tau_1, \tau_2\)) require tuning, and sensitivity analysis is not sufficiently discussed.
- The improvement from tilted sampling is marginal (FID 5.34 → 5.28), limiting its practical utility.
- No comparison is made against more recent conditional generation methods (e.g., DiT + classifier-free guidance).
- Smooth ECE fine-tuning and f-divergence sampling are not jointly evaluated.
## Related Work & Insights
| Dimension | Dhariwal et al. | Entropy-constrained | ma2023 | Ours |
|---|---|---|---|---|
| Requires classifier retraining | ✓ | ✓ | ✗ | ✗ |
| Requires diffusion model retraining | ✗ | ✗ | ✗ | ✗ |
| Theoretical analysis | None | Limited | Energy perspective | f-divergence framework |
| FID (ResNet-101) | 2.97 | 2.68 | 2.19 | 2.13 |
| Diversity preservation | Weak | Moderate | Moderate | Strong (JS/RKL) |
Additional insights:
- Classifier confidence ≠ guidance strength: Overconfident classifiers paradoxically provide weaker guidance, highlighting the importance of calibration for conditional generation.
- The f-divergence family defines a regularization design space: Different divergences correspond to different precision-recall trade-offs, enabling application-specific selection.
- JS divergence's "symmetric penalty" mechanism: Adaptive correction via the mixture distribution \(m\) strengthens when predictions deviate from the target and diminishes at alignment—a mechanism potentially transferable to other guidance settings.
- Tilted sampling's adaptive, batch-level reweighting has potential connections to ensemble methods.
## Rating
- Novelty: ⭐⭐⭐⭐ — The f-divergence framework for diffusion guidance regularization is novel with solid theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐ — Workshop-scale experiments limited to 128×128 resolution; large-scale validation is absent.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear, the Proposition chain is complete, and intuitive explanations are well-presented.
- Value: ⭐⭐⭐⭐ — Zero-cost plug-and-play guidance improvement has strong practical utility.