Steer Away From Mode Collisions: Improving Composition In Diffusion Models

Conference: ICLR 2026
arXiv: 2509.25940
Code: https://github.com/debottam-dutta7/co3
Area: Diffusion Models / Compositional Generation
Keywords: Compositional Generation, Mode Collision, Tweedie Mean Composition, Gradient-Free Correction, Plug-and-Play
TL;DR

To address missing and colliding concepts in multi-concept prompts for diffusion models, this paper proposes the "mode collision" hypothesis — that problematic modes of the joint distribution overlap with modes of the individual concept distributions — and introduces CO3 (Concept Contrasting Corrector). CO3 corrects sampling via a contrasting distribution \(\tilde{p}(x|C) \propto p(x|C) / \prod_i p(x|c_i)\) in Tweedie mean space, steering away from degenerate modes and yielding a plug-and-play, gradient-free, model-agnostic improvement to compositional generation.
Background & Motivation

Background: Diffusion models have achieved remarkable progress in text-to-image generation, yet even simple multi-concept prompts (e.g., "a cat and a dog") frequently suffer from missing concepts, blurring, or unnatural merging.
Limitations of Prior Work:

- Optimization-based correction methods (Attend-Excite, SynGen, ToMe) require gradient computation through the model and are therefore model-specific.
- Composable diffusion methods are model-agnostic but perform poorly, as linear score composition does not correspond to the correct forward distribution for \(t > 0\).
- Each family has its own trade-offs, and no unified framework exists.
Key Challenge: The joint distribution \(p(x|C)\) under multi-concept prompts contains "problematic modes" that overlap heavily with the individual concept distributions \(p(x|c_i)\), causing generation to collapse toward a dominant concept.
Goal: (i) Theoretically analyze and unify existing approaches (correction vs. composable diffusion); (ii) Design a training-free, gradient-free, and model-agnostic sampling correction strategy to improve multi-concept composition.
Key Insight: The paper introduces the "mode collision" hypothesis — when certain modes of \(p(x|C)\) overlap with \(p(x|c_i)\), sampling tends to collapse to that single concept. The solution is to design a correcting distribution that suppresses these overlapping regions.
Core Idea: Steer sampling away from single-concept-dominated degenerate modes via \(\tilde{p}(x|C) \propto p(x|C) / \prod_i p(x|c_i)\), and show that Tweedie mean composition constitutes a unifying framework.
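The score of the contrasting distribution makes this steering concrete. As a sketch with unit exponents on each factor (the paper's weighted variants generalize these coefficients):

\[
\nabla_x \log \tilde{p}(x \mid C)
  = \nabla_x \log p(x \mid C) - \sum_i \nabla_x \log p(x \mid c_i).
\]

Since the noise prediction relates to the score by \(\epsilon_t^{c} = -\sigma_t \nabla_{x_t} \log p(x_t \mid c)\), subtracting concept scores corresponds to combining noise estimates with a positive weight on the joint prompt and negative weights on the individual concepts; CO3's sum-to-1 and sum-to-0 weight constraints amount to particular rescalings of these per-factor exponents.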
Method

Overall Architecture
CO3 is a sampling-stage corrector embedded within the DDIM sampling process. Corrections are applied during the first 20% of denoising steps, divided into two phases: (1) CO3-resampler: noise resampling for the first 3 steps (weights sum to 0); (2) CO3-corrector: latent correction for subsequent steps (weights sum to 1). No model parameters are modified and no gradient computation is required.
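The two-phase schedule can be sketched as a small helper deciding which correction (if any) applies at each DDIM step. This is a minimal illustration of the schedule described above, not the authors' code; `co3_phase` and the constants are hypothetical names matching the paper's stated settings (50 steps, corrections in the first 20%, resampler for the first 3).

```python
# Sketch of the CO3 correction schedule inside a 50-step DDIM loop.
# Names are illustrative placeholders, not the authors' API.

NUM_STEPS = 50
CORRECTION_FRAC = 0.20   # corrections applied in the first 20% of steps
RESAMPLE_STEPS = 3       # CO3-resampler for the first 3 steps

def co3_phase(step: int) -> str:
    """Return which CO3 phase (if any) applies at a given DDIM step."""
    if step < RESAMPLE_STEPS:
        return "resampler"   # weights sum to 0: replace the noise
    if step < int(NUM_STEPS * CORRECTION_FRAC):
        return "corrector"   # weights sum to 1: correct the Tweedie mean
    return "plain"           # ordinary DDIM step, no correction

schedule = [co3_phase(s) for s in range(NUM_STEPS)]
```

With these settings, steps 0-2 use the resampler, steps 3-9 the corrector, and the remaining 40 steps run unmodified.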
Key Designs
- Tweedie Mean Composition Framework:
  - Function: Unifies correction-based and composable diffusion methods under a common theoretical foundation.
  - Mechanism: Moves distribution composition from score space to Tweedie mean space: \(\tilde{x}_{\text{tweedie}} = w_0 \hat{x}_{\text{tweedie}}[\epsilon_t^{\lambda,C}] + \sum_{k=1}^K w_k \hat{x}_{\text{tweedie}}[\epsilon_t^{\lambda,c_k}]\), where \(\hat{x}_{\text{tweedie}}[\epsilon_t^{\lambda,c}] = x_t - \sigma_t \epsilon_t^{\lambda,c}\).
  - Design Motivation: Proposition 1 proves that when \(\sum_k w_k = 1\), the composed Tweedie mean remains a valid Tweedie mean (CO3-corrector); when \(\sum_k w_k = 0\), it reduces to weighted noise (CO3-resampler, theoretically valid only at \(t = T\)).
- CO3-resampler (weights sum to 0):
  - Function: Replaces the initial noise at early high-noise steps.
  - Mechanism: Replaces the current \(x_t\) with a weighted combination of concept-conditioned noise estimates, effectively resampling the starting noise from a concept-suppressed distribution.
  - Design Motivation: Experiments show that resampling is most effective at high \(t\), helping to suppress concept dominance at initialization.
- CO3-corrector (weights sum to 1):
  - Function: Corrects the Tweedie mean at intermediate denoising steps.
  - Mechanism: \(w_0 > 0\) (joint prompt weight); \(w_1, \ldots, w_K < 0\) (negative weights for individual concepts, acting as suppression). Crucially, the composition retains the CFG form: \(\tilde{\epsilon}_t^{\tilde{\lambda},C} = \epsilon_t^\phi + \lambda(\sum_k w_k \epsilon_t^{c_k} - \epsilon_t^\phi)\).
  - Design Motivation: Unlike composable diffusion, which uses arbitrary \(\lambda_i\), CO3-corrector preserves the unconditional-to-conditional ratio of CFG, preventing out-of-distribution samples.
- Closeness-Aware Concept Weight Modulation:
  - Function: Adaptively adjusts the suppression weight for each concept.
  - Mechanism: Computes the distance \(d_k\) between the current noise prediction \(\epsilon^C\) and each concept noise \(\epsilon^{c_k}\), converts distances to affinities via an exponential kernel \(a_k = \exp(-\beta d_k)\), normalizes them, and uses the result as negative weights \(w_k = -a_k / \sum_j a_j\).
  - Design Motivation: Concepts closer to the current sample should be suppressed more strongly, enabling dynamic balancing across concepts.
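The weight modulation and the sum-to-1 Tweedie composition above can be sketched together. This is a hedged illustration, not the authors' implementation: `eps_joint` / `eps_concepts` stand in for the model's noise predictions under the joint and per-concept prompts, distances are taken as plain Euclidean norms, and \(w_0\) is set so that all weights total 1 (since the concept weights sum to \(-1\) by construction, this gives \(w_0 = 2\)).

```python
# Sketch of closeness-aware weight modulation plus the sum-to-1
# Tweedie mean composition. Names and shapes are illustrative.
import numpy as np

def concept_weights(eps_joint, eps_concepts, beta=0.8):
    """Negative suppression weights: closer concepts are suppressed more."""
    d = np.array([np.linalg.norm(eps_joint - e) for e in eps_concepts])
    a = np.exp(-beta * d)        # distance -> affinity
    return -a / a.sum()          # each w_k < 0, weights sum to -1

def co3_corrected_tweedie(x_t, sigma_t, eps_joint, eps_concepts, beta=0.8):
    """Compose Tweedie means with all weights summing to 1 (Proposition 1)."""
    w = concept_weights(eps_joint, eps_concepts, beta)
    w0 = 1.0 - w.sum()           # enforce w0 + sum_k w_k = 1
    tweedie = w0 * (x_t - sigma_t * eps_joint)
    for wk, eps_k in zip(w, eps_concepts):
        tweedie += wk * (x_t - sigma_t * eps_k)
    return tweedie

rng = np.random.default_rng(0)
x_t = rng.normal(size=16)
eps_joint = rng.normal(size=16)
eps_concepts = [rng.normal(size=16) for _ in range(2)]
x0_hat = co3_corrected_tweedie(x_t, 1.0, eps_joint, eps_concepts)
```

Because the weights total 1, the composed result can equivalently be read as \(x_t - \sigma_t \tilde{\epsilon}_t\) for a single composed noise estimate, which is what makes it a valid Tweedie mean.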
Loss & Training
This is a training-free method. It is built on SDXL with 50-step DDIM sampling and guidance scale \(\lambda = 5.0\). Corrections are applied in the first 20% of steps: the resampler is used for the first 3 steps and the corrector for subsequent steps. \(\beta = 0.8\).
Key Experimental Results

Main Results (Two-concept prompts, Attend-Excite benchmark)
| Method | Training-Free | Gradient-Free | Model-Agnostic | ImageReward (A-A) | ImageReward (A-O) | ImageReward (O-O) |
|---|---|---|---|---|---|---|
| SDXL | ✓ | ✓ | - | 0.782 | 1.547 | 0.679 |
| Attend-Excite | ✓ | ✗ | ✗ | 0.824 | 1.238 | 0.874 |
| InitNO | ✓ | ✗ | ✓ | 1.008 | 1.393 | 1.138 |
| Tweediemix | ✓ | ✓ | ✓ | 1.002 | 1.313 | 0.796 |
| CO3 (Ours) | ✓ | ✓ | ✓ | 1.234 | 1.674 | 1.016 |
Ablation Study (Incremental component addition)

| Configuration | ImageReward Avg |
|---|---|
| SDXL base | 0.843 |
| + Resampling | 0.944 |
| + Corrector | 0.946 |
| + Weight modulation | 1.012 |
Key Findings
- CO3 is the only method that simultaneously satisfies all three properties — training-free, gradient-free, and model-agnostic — while matching or surpassing gradient-based methods across all metrics.
- The weight-sum-to-1 constraint is critical: it preserves the CFG form, whereas composable diffusion with arbitrary weights readily produces out-of-distribution samples.
- The resampler and corrector play complementary roles: the resampler is effective in the high-noise regime, while the corrector operates in the intermediate denoising phase.
- Weight modulation contributes substantially (Avg: 0.946 → 1.012), demonstrating that adaptive suppression is more effective than fixed weights.
- CO3 also outperforms methods specifically designed for complex prompts (T2ICompBench) and rare concepts (RareBench), including specialized approaches such as R2F.
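The sum-to-1 finding can be checked numerically: composing per-concept CFG estimates \(\epsilon_t^{\lambda,c_k} = \epsilon_t^\phi + \lambda(\epsilon_t^{c_k} - \epsilon_t^\phi)\) with weights summing to 1 leaves the result in CFG form with the same \(\lambda\). A small self-contained check with illustrative values (not the paper's code):

```python
# Check: a sum-to-1 combination of CFG noise estimates stays in CFG form
# with the same guidance scale. Values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
lam = 5.0                                   # guidance scale used in the paper
eps_uncond = rng.normal(size=8)             # eps_t^phi
eps_cond = [rng.normal(size=8) for _ in range(3)]
w = np.array([2.0, -0.4, -0.6])             # any weights with sum == 1

# Per-concept CFG estimates, then their weighted combination:
cfg = [eps_uncond + lam * (e - eps_uncond) for e in eps_cond]
combined = sum(wk * c for wk, c in zip(w, cfg))

# The same quantity written directly in CFG form:
direct = eps_uncond + lam * (sum(wk * e for wk, e in zip(w, eps_cond)) - eps_uncond)
```

With arbitrary weight sums the unconditional term no longer cancels to a single \(\epsilon_t^\phi\), which is the out-of-distribution failure mode attributed to composable diffusion.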
Highlights & Insights
- Strong Theoretical Unification: Proposition 1 unifies existing correction and composable diffusion methods under the Tweedie mean composition framework, revealing the critical role of weight constraints (\(\sum w_k = 1\) preserves CFG form; \(\sum w_k = 0\) enables resampling). This unified perspective is itself a valuable theoretical contribution.
- Mode Collision Hypothesis: Suppressing degenerate modes overlapping with individual concept distributions via \(p(x|C) / \prod_i p(x|c_i)\) is intuitive and well-supported by experimental evidence, which indirectly validates the hypothesis.
- Fully Plug-and-Play: No model modification, no gradient computation, and no additional training are required, making CO3 directly applicable to any diffusion model.
Limitations & Future Work
- Each denoising step requires \(K+1\) separate noise predictions, so inference cost scales linearly with the number of concepts.
- Evaluation is conducted solely on SDXL; although model-agnosticism is claimed, the method has not been tested on DiT-based architectures (e.g., Flux/SD3).
- CO3-resampler is not theoretically rigorous for \(t < T\); while empirically effective up to approximately \(t \approx 0.9T\), formal guarantees are lacking.
- The \(\beta\) parameter in weight modulation and the proportion of correction steps require manual tuning.
- Concept decomposition relies on prompt parsing; the quality of decomposition for complex natural-language prompts is not discussed.
Related Work & Insights
- vs. Attend-Excite: Attend-Excite optimizes attention maps for attribute binding, requiring gradients and being architecture-dependent; CO3 addresses the problem at the distribution mode level, offering a theoretically higher-level perspective.
- vs. Composable Diffusion: Both are model-agnostic score composition approaches, but composable diffusion uses unconstrained linear combinations; CO3 demonstrates that the weight constraint is the key factor.
- vs. Tweediemix: Both operate in Tweedie space, but Tweediemix requires a LoRA fine-tuning stage, whereas CO3 is a purely sampling-time correction.
- The mode collision hypothesis may inspire deeper investigation into the training dynamics of diffusion models.
Rating
- Novelty: ⭐⭐⭐⭐ The mode collision hypothesis is novel, and the unifying theory based on Tweedie mean composition is insightful.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks (simple/complex/rare prompts), complete ablations, and human evaluation.
- Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear, though the heavy notation makes for dense reading.
- Value: ⭐⭐⭐⭐ Plug-and-play compositional generation improvement carries high practical utility.