
Steer Away From Mode Collisions: Improving Composition In Diffusion Models

Conference: ICLR 2026
arXiv: 2509.25940
Code: https://github.com/debottam-dutta7/co3
Area: Diffusion Models / Compositional Generation
Keywords: Compositional Generation, Mode Collision, Tweedie Mean Composition, Gradient-Free Correction, Plug-and-Play

TL;DR

To address concept omission and unnatural merging in multi-concept prompts for diffusion models, this paper proposes the "mode collision" hypothesis: modes of the joint distribution \(p(x|C)\) overlap with modes of the individual concept distributions \(p(x|c_i)\), pulling samples toward a single dominant concept. It then introduces CO3 (Concept Contrasting Corrector), which corrects sampling via a contrasting distribution \(\tilde{p}(x|C) \propto p(x|C) / \prod_i p(x|c_i)\) applied in Tweedie mean space to steer away from degenerate modes, yielding a plug-and-play, gradient-free, and model-agnostic improvement to compositional generation.

Background & Motivation

Background: Diffusion models have achieved remarkable progress in text-to-image generation, yet even simple multi-concept prompts (e.g., "a cat and a dog") frequently omit one of the concepts, blur them, or merge them unnaturally.

Limitations of Prior Work:

  • Optimization-based correction methods (Attend-Excite, SynGen, ToMe) require gradient computation through the model and are therefore model-specific.
  • Composable diffusion methods are model-agnostic but perform poorly, as linear score composition does not correspond to the correct forward distribution for \(t > 0\).
  • Each family has its own trade-offs, and no unified framework exists.

Key Challenge: The joint distribution \(p(x|C)\) under multi-concept prompts contains "problematic modes" that overlap heavily with the individual concept distributions \(p(x|c_i)\), causing generation to collapse toward a dominant concept.

Goal: (i) Theoretically analyze and unify existing approaches (correction vs. composable diffusion); (ii) Design a training-free, gradient-free, and model-agnostic sampling correction strategy to improve multi-concept composition.

Key Insight: The paper introduces the "mode collision" hypothesis — when certain modes of \(p(x|C)\) overlap with \(p(x|c_i)\), sampling tends to collapse to that single concept. The solution is to design a correcting distribution that suppresses these overlapping regions.

Core Idea: Steer sampling away from single-concept-dominated degenerate modes via \(\tilde{p}(x|C) \propto p(x|C) / \prod_i p(x|c_i)\), and show that Tweedie mean composition constitutes a unifying framework.
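One way to see how the contrasting distribution connects to weighted compositions is through its score. Consider a tempered variant with per-concept exponents \(\gamma_i > 0\) (the exponents are introduced here for illustration and are not notation from the paper):

\[
\tilde{p}_\gamma(x|C) \propto p(x|C) \prod_i \left[\frac{p(x|C)}{p(x|c_i)}\right]^{\gamma_i},
\qquad
\nabla_x \log \tilde{p}_\gamma(x|C) = \Big(1 + \sum_i \gamma_i\Big)\, \nabla_x \log p(x|C) - \sum_i \gamma_i\, \nabla_x \log p(x|c_i).
\]

Since the noise prediction at each noise level is proportional to the negative score, this corresponds to composition weights \(w_0 = 1 + \sum_i \gamma_i > 0\) for the joint prompt and \(w_i = -\gamma_i < 0\) for each concept, which sum to 1, precisely the constraint the CO3-corrector enforces.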

Method

Overall Architecture

CO3 is a sampling-stage corrector embedded within the DDIM sampling process. Corrections are applied during the first 20% of denoising steps, divided into two phases: (1) CO3-resampler: noise resampling for the first 3 steps (weights sum to 0); (2) CO3-corrector: latent correction for subsequent steps (weights sum to 1). No model parameters are modified and no gradient computation is required.

Key Designs

  1. Tweedie Mean Composition Framework:

    • Function: Unifies correction-based and composable diffusion methods under a common theoretical foundation.
    • Mechanism: Moves distribution composition from score space to Tweedie mean space: \(\tilde{x}_{\text{tweedie}} = w_0 \hat{x}_{\text{tweedie}}[\epsilon_t^{\lambda,C}] + \sum_{k=1}^K w_k \hat{x}_{\text{tweedie}}[\epsilon_t^{\lambda,c_k}]\), where \(\hat{x}_{\text{tweedie}}[\epsilon_t^{\lambda,c}] = x_t - \sigma_t \epsilon_t^{\lambda,c}\).
    • Design Motivation: Proposition 1 proves that when \(\sum_k w_k = 1\), the composed Tweedie mean remains a valid Tweedie mean (CO3-corrector); when \(\sum_k w_k = 0\), it reduces to weighted noise (CO3-resampler, theoretically valid only at \(t = T\)).
  2. CO3-resampler (weights sum to 0):

    • Function: Replaces the initial noise at early high-noise steps.
    • Mechanism: Replaces the current \(x_t\) with a weighted combination of concept-conditioned noise estimates, effectively resampling the starting noise from a concept-suppressed distribution.
    • Design Motivation: Experiments show that resampling is most effective at high \(t\), helping to suppress concept dominance at initialization.
  3. CO3-corrector (weights sum to 1):

    • Function: Corrects the Tweedie mean at intermediate denoising steps.
    • Mechanism: \(w_0 > 0\) (joint prompt weight); \(w_1, \ldots, w_K < 0\) (negative weights for individual concepts, acting as suppression). Crucially, the composition retains the CFG form: \(\tilde{\epsilon}_t^{\tilde{\lambda},C} = \epsilon_t^\phi + \lambda(\sum_k w_k \epsilon_t^{c_k} - \epsilon_t^\phi)\).
    • Design Motivation: Unlike composable diffusion, which uses arbitrary \(\lambda_i\), CO3-corrector preserves the unconditional-to-conditional ratio of CFG, preventing out-of-distribution samples.
  4. Closeness-Aware Concept Weight Modulation:

    • Function: Adaptively adjusts the suppression weight for each concept.
    • Mechanism: Computes the distance \(d_k\) between the current noise prediction \(\epsilon^C\) and each concept noise \(\epsilon^{c_k}\), converts distances to affinities via an exponential kernel \(a_k = \exp(-\beta d_k)\), normalizes them, and uses the result as negative weights \(w_k = -a_k / \sum_j a_j\).
    • Design Motivation: Concepts closer to the current sample should be suppressed more strongly, enabling dynamic balancing across concepts.
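The designs above can be combined into a single correction step. The following is a minimal numpy sketch, not the authors' implementation: function and variable names are hypothetical, and the choice \(w_0 = 1 - \sum_k w_k = 2\) is one consistent way to satisfy the sum-to-1 constraint given the normalized negative concept weights.

```python
import numpy as np

def co3_corrector_step(x_t, eps_joint, eps_concepts, eps_uncond,
                       sigma_t, guidance=5.0, beta=0.8):
    """Hypothetical sketch of one CO3-corrector step.

    eps_joint: noise prediction for the joint prompt C.
    eps_concepts: list of per-concept noise predictions eps^{c_k}.
    eps_uncond: unconditional noise prediction eps^phi.
    """
    # Closeness-aware weight modulation: concepts whose prediction is
    # closer to the joint prediction are suppressed more strongly.
    dists = np.array([np.linalg.norm(eps_joint - e) for e in eps_concepts])
    affinities = np.exp(-beta * dists)
    w_neg = -affinities / affinities.sum()   # negative weights, summing to -1
    w_joint = 1.0 - w_neg.sum()              # = 2 here, so all weights sum to 1

    # Compose conditional predictions. Because the weights sum to 1,
    # composing Tweedie means x_t - sigma_t * eps is equivalent to
    # composing the eps predictions directly.
    eps_comp = w_joint * eps_joint + sum(w * e for w, e in zip(w_neg, eps_concepts))

    # The CFG form is preserved: eps^phi + lambda * (composed - eps^phi).
    eps_cfg = eps_uncond + guidance * (eps_comp - eps_uncond)
    x0_tweedie = x_t - sigma_t * eps_cfg     # corrected Tweedie mean
    return x0_tweedie, eps_cfg
```

In the degenerate case where every concept prediction equals the joint prediction, the negative weights cancel the excess joint weight and the step reduces to ordinary CFG, which is the expected no-op behavior.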

Loss & Training

This is a training-free method. It is built on SDXL with 50-step DDIM sampling and guidance scale \(\lambda = 5.0\). Corrections are applied in the first 20% of steps: the resampler is used for the first 3 steps and the corrector for subsequent steps. \(\beta = 0.8\).
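The step schedule described above can be sketched as follows; the mapping of "first 20% of 50 DDIM steps" to 0-indexed step indices is an assumption, and the function name is illustrative.

```python
def co3_phase(step, total_steps=50, correction_frac=0.2, resample_steps=3):
    """Return which CO3 phase applies at a given denoising step
    (0-indexed; step 0 is the highest-noise step)."""
    if step < resample_steps:
        return "resampler"   # weights sum to 0: resample the noise
    if step < int(correction_frac * total_steps):
        return "corrector"   # weights sum to 1: correct the Tweedie mean
    return "plain"           # standard CFG sampling, no correction

schedule = [co3_phase(s) for s in range(50)]
```

With the paper's defaults this yields 3 resampler steps, 7 corrector steps, and 40 uncorrected steps.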

Key Experimental Results

Main Results (Two-concept prompts, Attend-Excite benchmark)

| Method | Training-Free | Gradient-Free | Model-Agnostic | ImageReward (A-A) | ImageReward (A-O) | ImageReward (O-O) |
|---|---|---|---|---|---|---|
| SDXL | – | – | – | 0.782 | 1.547 | 0.679 |
| Attend-Excite | ✓ | ✗ | ✗ | 0.824 | 1.238 | 0.874 |
| InitNO | ✓ | ✗ | ✗ | 1.008 | 1.393 | 1.138 |
| Tweediemix | ✗ | ✓ | ✓ | 1.002 | 1.313 | 0.796 |
| CO3 (Ours) | ✓ | ✓ | ✓ | 1.234 | 1.674 | 1.016 |

Ablation Study (Incremental component addition)

| Configuration | ImageReward Avg | BLIP-VQA Avg |
|---|---|---|
| SDXL base | 0.843 | – |
| + Resampling | 0.944 | – |
| + Corrector | 0.946 | – |
| + Weight modulation | 1.012 | – |

Key Findings

  • CO3 is the only method that simultaneously satisfies all three properties — training-free, gradient-free, and model-agnostic — while matching or surpassing gradient-based methods across all metrics.
  • The weight-sum-to-1 constraint is critical: it preserves the CFG form, whereas composable diffusion with arbitrary weights readily produces out-of-distribution samples.
  • The resampler and corrector play complementary roles: the resampler is effective in the high-noise regime, while the corrector operates in the intermediate denoising phase.
  • Weight modulation contributes substantially (Avg: 0.946 → 1.012), demonstrating that adaptive suppression is more effective than fixed weights.
  • CO3 also outperforms methods specifically designed for complex prompts (T2ICompBench) and rare concepts (RareBench), including specialized approaches such as R2F.

Highlights & Insights

  • Strong Theoretical Unification: Proposition 1 unifies existing correction and composable diffusion methods under the Tweedie mean composition framework, revealing the critical role of weight constraints (\(\sum w_k = 1\) preserves CFG form; \(\sum w_k = 0\) enables resampling). This unified perspective is itself a valuable theoretical contribution.
  • Mode Collision Hypothesis: Suppressing degenerate modes overlapping with individual concept distributions via \(p(x|C) / \prod_i p(x|c_i)\) is intuitive and well-supported by experimental evidence, which indirectly validates the hypothesis.
  • Fully Plug-and-Play: No model modification, no gradient computation, and no additional training are required, making CO3 directly applicable to any diffusion model.

Limitations & Future Work

  • Each denoising step requires \(K+1\) separate noise predictions, so inference cost scales linearly with the number of concepts.
  • Evaluation is conducted solely on SDXL; although model-agnosticism is claimed, the method has not been tested on DiT-based architectures (e.g., Flux/SD3).
  • CO3-resampler is not theoretically rigorous for \(t < T\); it remains empirically effective down to roughly \(t \approx 0.9T\), but formal guarantees are lacking.
  • The \(\beta\) parameter in weight modulation and the proportion of correction steps require manual tuning.
  • Concept decomposition relies on prompt parsing; the quality of decomposition for complex natural-language prompts is not discussed.
  • The mode collision hypothesis may inspire deeper investigation into the training dynamics of diffusion models.

Comparison with Related Methods

  • vs. Attend-Excite: Attend-Excite optimizes attention maps for attribute binding, which requires gradients and ties it to a specific architecture; CO3 addresses the problem at the level of distribution modes, a higher-level, architecture-independent perspective.
  • vs. Composable Diffusion: Both are model-agnostic score-composition approaches, but composable diffusion uses unconstrained linear combinations; CO3 shows that the weight constraint is the decisive factor.
  • vs. Tweediemix: Both operate in Tweedie space, but Tweediemix requires a LoRA fine-tuning stage, whereas CO3 is a purely sampling-time correction.

Rating

  • Novelty: ⭐⭐⭐⭐ The mode collision hypothesis is novel, and the unifying theory based on Tweedie mean composition is insightful.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks (simple/complex/rare prompts), complete ablations, and human evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear, though the heavy notation raises the reading barrier.
  • Value: ⭐⭐⭐⭐ Plug-and-play compositional generation improvement carries high practical utility.