Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models

Conference: ICLR 2026
arXiv: 2508.12880
Code: Project Page
Area: Image Generation
Keywords: Diffusion Models, Classifier-Free Guidance, Subnetwork, Stochastic Block-Dropping, Self-Guidance, Text-to-Image, Text-to-Video

TL;DR

This paper proposes S²-Guidance, which constructs a weak model by randomly dropping transformer block activations during denoising to perform self-guidance, correcting the suboptimal predictions of CFG without additional training. The method consistently outperforms CFG and other advanced guidance strategies on text-to-image and text-to-video tasks.

Background & Motivation

CFG is the cornerstone of conditional generation: Classifier-Free Guidance enhances generation quality by extrapolating between conditional and unconditional predictions, and has become the standard practice for diffusion models.
CFG has inherent deficiencies: Empirical analysis shows that CFG-generated results deviate from the true distribution, leading to semantic inconsistency and loss of detail.
Weak model guidance is promising: Works such as Autoguidance find that guiding with a degraded model improves over CFG, but require training a separate weak model, which is infeasible for large-scale pretrained models.
Manual network modification generalizes poorly: Methods like SEG simulate weak models by modifying attention regions, but rely on empirical hyperparameter tuning and are designed for specific tasks.
Transformer blocks exhibit significant redundancy: In mainstream architectures such as DiT, the outputs of different blocks are highly similar, suggesting that a subnetwork with a few blocks removed can still stand in for the full model as a slightly weaker predictor.
A universal training-free improvement is needed: Existing methods either require training a weak model or depend on task-specific modifications, lacking a simple and general solution.

Method

Step 1: Analyzing the Suboptimality of CFG

The limitations of CFG are first verified on a Gaussian mixture toy example: conditional generation improves, but the modes shift, and in the 2D case samples spread into non-target regions. A t-SNE analysis on CIFAR-10 further shows severe distributional collapse under CFG.
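The mode shift can be reproduced with a very small calculation. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses a hypothetical 1D mixture of two unit-variance Gaussians, ignores the noise schedule, and looks only at the density implied by the CFG score, which is proportional to p(x)^(1-λ) p(x|c)^λ. All names and parameter values are chosen here for illustration.

```python
import numpy as np

def log_gauss(x, mu, sigma=1.0):
    """Log-density of a 1D Gaussian N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def cfg_log_density(x, mu_target, mu_other, lam, sigma=1.0):
    """Unnormalized log-density implied by CFG: (1 - lam) * log p(x) + lam * log p(x | c)."""
    log_cond = log_gauss(x, mu_target, sigma)
    # Unconditional density: equal-weight mixture of the two class Gaussians.
    log_uncond = np.logaddexp(log_gauss(x, mu_target, sigma),
                              log_gauss(x, mu_other, sigma)) + np.log(0.5)
    return (1.0 - lam) * log_uncond + lam * log_cond

# Hypothetical setup: target class centered at +1, the other class at -1.
x = np.linspace(-6.0, 8.0, 14001)
mu_target, mu_other = 1.0, -1.0

# lam = 1 is plain conditional sampling; lam = 3 is a typical "guided" setting.
mode_plain = x[np.argmax(cfg_log_density(x, mu_target, mu_other, lam=1.0))]
mode_cfg = x[np.argmax(cfg_log_density(x, mu_target, mu_other, lam=3.0))]
print(f"mode without guidance: {mode_plain:.3f}, mode with CFG: {mode_cfg:.3f}")
```

With guidance enabled, the mode of the implied density moves past the true class mean and away from the other class, which is the kind of shift the toy analysis describes.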

Step 2: Naive S²-Guidance

The core idea is to use the model's own subnetwork as the weak model:

\[\tilde{D}_\theta^\lambda(x_t|c) = D_\theta(x_t|\phi) + \lambda(D_\theta(x_t|c) - D_\theta(x_t|\phi)) - \frac{\omega}{N}\sum_{i=1}^N(\hat{D}_\theta(x_t|c, \mathbf{m}_i) - D_\theta(x_t|c))\]

A binary mask \(\mathbf{m}\) randomly drops a subset of transformer blocks to construct subnetwork predictions \(\hat{D}_\theta\).
The deviation between the subnetwork prediction and the full model prediction serves as the self-guidance signal.
\(N\) different masks are sampled per step and the guidance signals are averaged.
\(\omega\) controls the self-guidance strength (S²Scale).
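The naive rule above can be sketched in a few lines of NumPy. This is a toy illustration, not the authors' implementation: `make_blocks` and `denoise` are hypothetical stand-ins for a transformer denoiser (small random residual maps), and the conditioning is simply added to the input. Only `naive_s2_guidance` mirrors the equation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_blocks(num_blocks, dim):
    """Hypothetical stand-in for transformer blocks: small random residual maps."""
    return [0.1 * rng.standard_normal((dim, dim)) for _ in range(num_blocks)]

def denoise(x, cond, blocks, keep=None):
    """Toy residual 'denoiser'. Where keep[i] is False, block i is skipped (identity)."""
    h = x + cond
    for i, w in enumerate(blocks):
        if keep is None or keep[i]:
            h = h + np.tanh(h @ w)
    return h

def naive_s2_guidance(x, cond, null_cond, blocks, lam, omega, masks):
    """CFG output minus omega times the mean subnetwork deviation over the given masks."""
    d_cond = denoise(x, cond, blocks)
    d_uncond = denoise(x, null_cond, blocks)
    cfg = d_uncond + lam * (d_cond - d_uncond)
    # Deviation of each subnetwork prediction from the full conditional prediction.
    deviation = np.mean(
        [denoise(x, cond, blocks, keep=m) - d_cond for m in masks], axis=0
    )
    return cfg - omega * deviation

# Usage with illustrative sizes: 8 blocks, 4-dimensional state, N = 3 masks.
num_blocks, dim = 8, 4
blocks = make_blocks(num_blocks, dim)
x = rng.standard_normal(dim)
cond, null_cond = rng.standard_normal(dim), np.zeros(dim)

masks = []
for dropped in (2, 4, 6):          # each mask drops one non-first block
    m = np.ones(num_blocks, dtype=bool)
    m[dropped] = False
    masks.append(m)

guided = naive_s2_guidance(x, cond, null_cond, blocks, lam=5.0, omega=0.25, masks=masks)
```

Each mask costs one extra forward pass through the subnetwork, so the per-step cost of this naive form grows linearly with N.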

Step 3: Simplification to S²-Guidance

A key finding is that within a reasonable drop range, dropping different blocks consistently steers the model toward the desired distribution. This motivates a simplification to a single random block-dropping per timestep:

\[\tilde{D}_\theta^\lambda(x_t|c) = D_\theta(x_t|\phi) + \lambda(D_\theta(x_t|c) - D_\theta(x_t|\phi)) - \omega(\hat{D}_\theta(x_t|c, \mathbf{m}_t) - D_\theta(x_t|c))\]
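A short rearrangement, using only terms already present in the equation, makes the direction of the correction explicit. The shorthand \(D_\theta^{\mathrm{CFG}}\) is introduced here for the standard CFG output and is not notation taken from the paper:

\[D_\theta^{\mathrm{CFG}}(x_t|c) := D_\theta(x_t|\phi) + \lambda(D_\theta(x_t|c) - D_\theta(x_t|\phi))\]

\[\tilde{D}_\theta^\lambda(x_t|c) = D_\theta^{\mathrm{CFG}}(x_t|c) + \omega(D_\theta(x_t|c) - \hat{D}_\theta(x_t|c, \mathbf{m}_t))\]

Read this way, the update moves the CFG output away from the subnetwork prediction and toward the full-model conditional prediction, scaled by \(\omega\). Setting \(\omega = 0\) recovers plain CFG, and the expression coincides with the Step 2 rule at \(N = 1\).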

Key Design Choices

Protecting critical blocks: Structurally important blocks (e.g., the first block) are excluded; random dropping is applied only among non-critical blocks.
Drop ratio of ~10%: Experiments confirm that dropping approximately 10% of blocks yields optimal performance.
Application interval: The method is most effective when applied within the middle 80% of the noise level range during denoising.
Dynamic diversity: Independently sampling masks at different timesteps is more robust than fixing the dropped block throughout denoising.
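The design choices above can be combined into a small scheduling helper. The sketch below is an assumption-laden illustration: the function names, the 24-block model, and the use of the step index as a proxy for noise level are all hypothetical choices made here, not details from the paper.

```python
import numpy as np

def sample_block_mask(num_blocks, rng, drop_ratio=0.1, protected=(0,)):
    """Return a boolean keep-mask; False marks a dropped block.

    Blocks listed in `protected` (here the first block) are never dropped.
    """
    keep = np.ones(num_blocks, dtype=bool)
    candidates = np.array(
        [i for i in range(num_blocks) if i not in protected], dtype=int
    )
    num_drop = min(max(1, round(drop_ratio * num_blocks)), len(candidates))
    dropped = rng.choice(candidates, size=num_drop, replace=False)
    keep[dropped] = False
    return keep

def in_guidance_interval(step, num_steps, active_fraction=0.8):
    """True for the middle `active_fraction` of steps (step index as a noise-level proxy)."""
    margin = (1.0 - active_fraction) / 2.0
    progress = (step + 0.5) / num_steps
    return margin <= progress <= 1.0 - margin

# Usage: a fresh mask is drawn at every active step; None means plain CFG at that step.
rng = np.random.default_rng(0)
num_steps, num_blocks = 20, 24   # illustrative sizes
schedule = [
    sample_block_mask(num_blocks, rng) if in_guidance_interval(s, num_steps) else None
    for s in range(num_steps)
]
```

Drawing the mask inside the loop, rather than once before it, is what gives the per-timestep diversity described in the last item.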

Key Experimental Results

Table 1: Text-to-Image HPSv2.1 and T2I-CompBench Comparison

Model	Method	HPSv2.1 Avg↑	Color↑	Shape↑	Texture↑	Qalign(HPSv2.1)↑
SD3	CFG	30.48	53.61	51.20	52.45	4.66
SD3	CFG-Zero	30.78	52.70	52.84	53.37	4.66
SD3	SEG	30.39	58.20	57.68	57.17	4.33
SD3	S²-Guidance	31.09	59.63	58.71	56.77	4.65
SD3.5	CFG	30.82	51.29	47.71	47.39	4.63
SD3.5	S²-Guidance	31.56	57.57	51.23	50.13	4.70

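S²-Guidance achieves the highest HPSv2.1 average on both SD3 and SD3.5 and leads on Color and Shape in T2I-CompBench, with large margins over CFG. It is not best in every column: on SD3, SEG is slightly higher on Texture (57.17 vs. 56.77), and CFG and CFG-Zero are marginally higher on Qalign (4.66 vs. 4.65).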

Table 2: ImageNet 256×256 Class-Conditional Generation

Method	IS↑	FID↓
Baseline	125.13	9.41
CFG	258.09	2.15
CFG-Zero	258.87	2.10
S²-Guidance	259.12	2.03

Table 3: VBench Text-to-Video Comparison (Wan Model)

Model	Method	Total↑	Quality↑	Semantic↑
Wan-1.3B	CFG	80.29	84.32	64.16
Wan-1.3B	CFG-Zero	80.71	84.51	65.53
Wan-1.3B	S²-Guidance	80.93	84.74	65.70
Wan-14B	CFG	82.65	84.88	73.76
Wan-14B	S²-Guidance	82.84	84.89	74.65

The method achieves the highest Total score on both the 1.3B and 14B models (+0.64 and +0.19 over CFG, respectively), supporting its generality across model scales.

Computational Overhead

Runtime: approximately 40% increase over CFG (29.2 s → 40.2 s, or about 38% by these timings).
Peak memory: unchanged, as the subnetwork and full model run sequentially.
S²-Guidance with 20 steps achieves a higher HPS Score than CFG with 60 steps, yielding a superior performance–efficiency frontier.
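The overhead figures can be checked with simple arithmetic. The first part below uses only the timings reported above. The second part is a back-of-envelope estimate under assumptions made here, not stated in the source: that cost is proportional to the number of blocks executed, that CFG costs two full passes per step, and that the reported timing used a 10% drop ratio within the middle 80% of steps.

```python
# Measured overhead from the reported timings.
cfg_seconds, s2_seconds = 29.2, 40.2
measured_overhead = s2_seconds / cfg_seconds - 1.0

# Back-of-envelope estimate (assumptions: equal-cost passes, cost ~ blocks executed).
drop_ratio, active_fraction = 0.10, 0.80
passes_cfg = 2.0                                        # conditional + unconditional
extra_passes = (1.0 - drop_ratio) * active_fraction     # subnetwork pass on active steps
estimated_overhead = extra_passes / passes_cfg

# Relative cost of 20 S²-Guidance steps versus 60 CFG steps, in pass-equivalents.
cost_s2_20 = 20 * (passes_cfg + extra_passes)
cost_cfg_60 = 60 * passes_cfg
relative_cost = cost_s2_20 / cost_cfg_60

print(f"measured overhead:  {measured_overhead:.1%}")
print(f"estimated overhead: {estimated_overhead:.1%}")
print(f"20-step S2 cost relative to 60-step CFG: {relative_cost:.2f}")
```

Under these assumptions the estimate lands close to the measured figure, and the 20-step configuration costs less than half of the 60-step CFG run it is compared against.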

Highlights & Insights

Training-free and plug-and-play: No additional weak model training is required; the method directly exploits the inherent redundancy of the model's own blocks and can be applied to DiT-style architectures without modification to the weights.
Clear theoretical intuition: Starting from a closed-form analysis of Gaussian mixtures and progressively extending to real data, the argumentative chain is complete and rigorous.
Minimal and efficient design: Only one additional forward pass, with ~10% of blocks dropped, is needed per step, with no increase in peak memory usage.
Multi-task coverage: Consistent improvements are demonstrated across class-conditional image generation, T2I, and T2V tasks, validated on multiple models including SD3, SD3.5, and Wan.
Dynamic diversity outperforms fixed strategies: The time-varying diversity of stochastic dropping naturally avoids the limitation of using a fixed weak model throughout the entire denoising process.

Limitations & Future Work

40% computational overhead: Although memory usage is unchanged, the additional forward pass per step incurs non-trivial costs in large-scale deployment.
Manual tuning of \(\omega\): The optimal S²Scale may vary across models and tasks, and excessively large \(\omega\) can cause over-correction.
Heuristic block-dropping design: Excluding critical blocks and determining the drop range still relies on empirical analysis, lacking an automated selection mechanism.
Applicability to non-DiT architectures is unverified: The method is primarily tested on Transformer-based diffusion models; its suitability for architectures such as UNet remains uncertain.
Diminishing gains on stronger models: The Total gain on Wan-14B (+0.19 over CFG) is smaller than on Wan-1.3B (+0.64), suggesting diminishing marginal returns as the model approaches the state of the art.

Comparison with Related Methods

Method	Requires Training?	Generality	Core Mechanism	Comparison with S²-Guidance
CFG	×	High	Conditional–unconditional extrapolation	Suffers from mode shift and distributional collapse
Autoguidance	✓	Low	Training a degraded weak model	Requires additional training; weak model selection is difficult
SEG	×	Medium	Modifying attention regions	Task-specific, hyperparameter-sensitive, reduced aesthetic scores
CFG++	×	High	Manifold constraint	Some metrics fall below vanilla CFG
CFG-Zero	×	High	Zero-initialization correction	Competitive but does not leverage weak model guidance direction
S²-Guidance	×	High	Stochastic block-dropping self-guidance	Universal, training-free, best overall performance

Rating

Novelty: ⭐⭐⭐⭐ — The insight of using stochastic block-dropping as a weak model is both novel and natural.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage from toy examples to ImageNet, T2I, and T2V, with thorough ablations.
Writing Quality: ⭐⭐⭐⭐ — The argument progresses logically from toy to real settings, with intuitive illustrations.
Value: ⭐⭐⭐⭐ — A plug-and-play universal enhancement for diffusion models with strong practical utility.