Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

Conference: CVPR 2025
arXiv: 2411.18936
Code: None
Area: Image Generation
Keywords: Diffusion guidance, subject blending, self-cross attention, training-free inference, similar subject generation

TL;DR

This paper proposes Self-Cross Diffusion Guidance, which effectively addresses the subject mixing problem when generating similar subjects with diffusion models by penalizing the overlap between the aggregated self-attention map of one subject and the cross-attention map of another. It represents the first training-free method to simultaneously leverage the interactions between self-attention and cross-attention.

Background & Motivation

  • While diffusion models have made remarkable progress in text-to-image generation, subject mixing remains a critical unsolved issue.
  • This problem is particularly severe when generating multiple visually similar subjects (e.g., a leopard and a tiger), where the features of different subjects tend to bleed into each other.
  • Existing methods (Attend&Excite, INITNO, CONFORM) perform guidance based on either cross-attention or self-attention individually, neglecting the interaction between the two.
  • Focusing only on the most discriminative patches (e.g., a bird's beak) is insufficient—other foreground patches can also lead to subject mixing.
  • Existing evaluation benchmarks lack challenging prompts for similar subject scenarios, and CLIP scores correlate poorly with human judgment.

Method

Overall Architecture

Self-Cross Guidance is a training-free, inference-time optimization method. During the first half of the reverse-diffusion denoising steps, Otsu thresholding is applied to each subject's cross-attention map to select that subject's high-response patches. The self-attention maps of these patches are aggregated into a single map per subject, and the overlap between one subject's aggregated self-attention and another subject's cross-attention is penalized. The penalty is minimized through a combination of initial noise optimization and iterative latent refinement.
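
The control flow described above can be sketched roughly as follows. This is only an illustration of the schedule (guidance in the first half of the steps, plain denoising afterwards), not the authors' implementation. The names `refine_latent`, `denoise_with_guidance`, the step size, the iteration cap, and the tolerance are all hypothetical, and a toy quadratic loss with an identity "denoiser" stands in for the real attention-based loss and diffusion model. Initial noise optimization is not covered here.

```python
import numpy as np

def refine_latent(latent, loss_and_grad, step_size=0.1, max_iters=10, tol=1e-3):
    """Plain gradient descent on the latent; stops early once the loss is below tol.

    step_size, max_iters and tol are illustrative values, not taken from the paper.
    """
    for _ in range(max_iters):
        loss, grad = loss_and_grad(latent)
        if loss < tol:
            break
        latent = latent - step_size * grad
    return latent

def denoise_with_guidance(latent, num_steps, loss_and_grad, denoise_step):
    """Refine the latent only in the first half of the steps, then denoise as usual."""
    guided = []
    for t in range(num_steps):
        if t < num_steps // 2:
            latent = refine_latent(latent, loss_and_grad)
            guided.append(t)
        latent = denoise_step(latent, t)
    return latent, guided

# Toy stand-ins so the sketch runs without a diffusion model (assumptions).
def toy_loss_and_grad(z):
    # In a real pipeline the loss comes from attention maps and the gradient from autograd.
    return float(np.sum(z ** 2)), 2.0 * z

def toy_denoise_step(z, t):
    return z  # identity: a real sampler would call the denoiser here

z0 = np.ones(4)
z_final, guided_steps = denoise_with_guidance(z0, 10, toy_loss_and_grad, toy_denoise_step)
```

With 10 steps, only steps 0 through 4 trigger refinement; the remaining steps run the sampler unchanged.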

Key Designs

Design 1: Self-Attention Map Aggregation

  • Function: To obtain a self-attention representation covering the entire subject region.
  • Mechanism: Apply Otsu thresholding to the cross-attention map \(A_i^c\) of subject \(i\) to select high-response patches. The self-attention maps of the selected patches are combined as a weighted average, with each patch weighted by its cross-attention value: \(A_i^s = \frac{\sum_{x_m,y_n}(A_i^c[x_m,y_n] \times A_{x_m,y_n}^s)}{\sum_{x_m,y_n} A_i^c[x_m,y_n]}\), where the sums run over the selected patches \((x_m, y_n)\) and \(A_{x_m,y_n}^s\) is the self-attention map of patch \((x_m, y_n)\).
  • Design Motivation: Self-attention maps of different patches vary significantly; relying solely on the single most discriminative patch cannot cover the entire area of the subject. Aggregating the self-attention maps of multiple patches provides a more comprehensive representation of the subject's attended region.
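
A minimal NumPy sketch of the aggregation step, under stated assumptions: the cross-attention map is an `(H, W)` array, the self-attention maps are stored as an `(H, W, H, W)` array in which `self_maps[m, n]` is the map of patch `(m, n)`, and Otsu's method is implemented from a 256-bin histogram. The function names and the array layout are illustrative choices, not the paper's code.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu threshold: the histogram split that maximizes between-class variance."""
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    counts = hist.astype(np.float64)
    w0 = np.cumsum(counts)            # number of values in bins 0..k
    w1 = w0[-1] - w0                  # number of values in bins k+1..end
    cum_mean = np.cumsum(counts * centers)
    valid = (w0 > 0) & (w1 > 0)
    mu0 = cum_mean / np.maximum(w0, 1e-12)
    mu1 = (cum_mean[-1] - cum_mean) / np.maximum(w1, 1e-12)
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, -1.0)
    # Use the upper edge of the best bin as the threshold.
    return float(edges[1:][np.argmax(between)])

def aggregate_self_attention(cross_map, self_maps):
    """Cross-attention-weighted average of the self-attention maps of selected patches."""
    mask = cross_map > otsu_threshold(cross_map)
    weights = np.where(mask, cross_map, 0.0)
    # Sum over patch indices (m, n): result has shape (H, W).
    agg = np.tensordot(weights, self_maps, axes=([0, 1], [0, 1]))
    return agg / max(weights.sum(), 1e-12)

# Toy example: the subject occupies the top-left 2x2 block of a 4x4 grid.
rng = np.random.default_rng(0)
H = W = 4
cross = np.full((H, W), 0.05)
cross[:2, :2] = 0.9
self_maps = rng.random((H, W, H, W))
self_maps /= self_maps.sum(axis=(2, 3), keepdims=True)  # each patch's map sums to 1
agg = aggregate_self_attention(cross, self_maps)
```

Because the four selected patches have equal cross-attention values in this toy case, the aggregated map reduces to the plain mean of their self-attention maps.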

Design 2: Self-Cross Guidance Loss

  • Function: To penalize the overlap between the self-attention region of one subject and the cross-attention region of another, thereby eliminating subject mixing.
  • Mechanism: For a pair of subjects \((i, j)\), the overlap is computed as \(g(i,j) = \sum_{x,y} \min(A_i^s[x,y], A_j^c[x,y]) + \sum_{x,y} \min(A_i^c[x,y], A_j^s[x,y])\). For \(N\) similar subjects, the self-cross score \(S_{self-cross}\) is the average of \(g(i,j)\) over all \(C_N^2 = N(N-1)/2\) subject pairs. The total loss is defined as \(\mathcal{L}_{total} = S_{self-cross} + \lambda \cdot S_{cross-attn}\).
  • Design Motivation: The essence of subject mixing is that the self-attention of one subject invades the region of another subject. The overlap between the aggregated self-attention map and the cross-attention map captures this intrusion more precisely than using either attention type in isolation.
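
The pairwise overlap and its average over subject pairs can be sketched as below. This is an illustration that assumes each subject's aggregated self-attention map and cross-attention map are same-shaped arrays on a comparable scale; the function names `pair_overlap` and `self_cross_score` are hypothetical.

```python
import itertools
import numpy as np

def pair_overlap(self_i, cross_i, self_j, cross_j):
    """g(i, j): element-wise min overlap in both directions, summed over positions."""
    return float(np.minimum(self_i, cross_j).sum() + np.minimum(cross_i, self_j).sum())

def self_cross_score(self_maps, cross_maps):
    """Average of g(i, j) over all unordered subject pairs."""
    pairs = list(itertools.combinations(range(len(cross_maps)), 2))
    total = sum(
        pair_overlap(self_maps[i], cross_maps[i], self_maps[j], cross_maps[j])
        for i, j in pairs
    )
    return total / len(pairs)

# Toy maps: `top` attends to the upper row, `bottom` to the lower row.
top = np.array([[0.5, 0.5], [0.0, 0.0]])
bottom = np.array([[0.0, 0.0], [0.5, 0.5]])

separated = self_cross_score([top, bottom], [top, bottom])  # disjoint regions
mixed = self_cross_score([top, top], [top, top])            # fully overlapping regions
```

Disjoint subjects give a score of zero, while fully overlapping subjects give the largest score, which is the behavior the guidance loss penalizes.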

Design 3: SSD Benchmark and GPT-4o Evaluation

  • Function: To provide a challenging evaluation benchmark for similar subject generation.
  • Mechanism: Release the Similar-Subject Dataset (SSD), which contains text prompts featuring two or three similar subjects. Utilize GPT-4o to automatically evaluate subject presence, identifiability, and attribute binding in the generated images via visual question answering.
  • Design Motivation: CLIP scores cannot effectively differentiate subject mixing issues. GPT-4o evaluation exhibits higher consistency with human judgment.

Loss & Training

\[\mathcal{L}_{total} = S_{self-cross} + \lambda \cdot S_{cross-attn}\]

where \(S_{cross-attn}\) follows the cross-attention response score of Attend&Excite, and \(\lambda\) is a balancing coefficient. The loss is applied only during the first half of the denoising steps and is computed only on attention maps from the intermediate layers.
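
As a rough illustration of how the two terms combine, the sketch below uses the Attend&Excite form of the cross-attention score: one minus the peak cross-attention response of each subject, taking the worst (largest) value over subjects. The Gaussian smoothing that Attend&Excite applies before taking the peak is omitted, the default `lam=1.0` is a placeholder rather than the paper's setting, and the function names are hypothetical.

```python
import numpy as np

def cross_attn_score(cross_maps):
    """Attend&Excite-style score: 1 - peak response of the most neglected subject."""
    return max(1.0 - float(m.max()) for m in cross_maps)

def total_loss(s_self_cross, cross_maps, lam=1.0):
    """L_total = S_self-cross + lambda * S_cross-attn (lam is an assumed value)."""
    return s_self_cross + lam * cross_attn_score(cross_maps)

# Toy cross-attention maps with values in [0, 1]; the second subject is weakly attended.
cross_a = np.array([[0.9, 0.1], [0.0, 0.0]])
cross_b = np.array([[0.0, 0.0], [0.4, 0.2]])

s_cross = cross_attn_score([cross_a, cross_b])
loss = total_loss(0.2, [cross_a, cross_b], lam=0.5)
```

Here the weakly attended subject dominates the cross-attention term, so the combined loss pushes both toward separating the subjects and toward keeping each one present.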

Key Experimental Results

Quantitative Results on SSD Benchmark

Method | Presence ↑ | Identifiability ↑ | Attribute Binding ↑ | FID ↓
Stable Diffusion | Baseline | Baseline | Baseline | Baseline
Attend&Excite | Improved | Limited Improvement | Limited Improvement | -
INITNO | Improved | Partial Improvement | Partial Improvement | -
CONFORM | Improved | Partial Improvement | Partial Improvement | -
Self-Cross (Ours) | Best | Best | Best | Maintained

Ablation Study

Configuration | Effect
Cross-attn guidance only | Fails to eliminate subject mixing
Self-attn guidance only | Partial improvement
Single-patch self-attn + cross-attn | Limited improvement
Aggregated self-attn + cross-attn | Significant elimination of subject mixing

Key Findings

  • Self-Cross guidance substantially outperforms methods that rely on a single attention map, such as INITNO, in eliminating subject mixing.
  • Aggregating multi-patch self-attention yields significantly better results than single-patch approaches.
  • The method is compatible with both UNet-based (SD 1.x/2.x) and Transformer-based (SD3) diffusion models.
  • As a side benefit, subject omission issues are also alleviated.
  • Overall image quality, as measured by FID, is not noticeably affected.

Highlights & Insights

  1. First exploration of the interaction between self-attention and cross-attention: Provides a new understanding of the causes of subject mixing—self-attention intrusion into other regions leading to feature copying.
  2. Necessity of multi-patch aggregation: Demonstrates that focusing solely on the most discriminative patch is insufficient to eliminate subject mixing.
  3. GPT-4o evaluation scheme: Provides a more reliable automated evaluation instrument for the diffusion model community.

Limitations & Future Work

  • Initial noise optimization and iterative refinement increase inference time.
  • The method primarily targets mixing between similar subjects, offering limited improvement for attribute binding of non-similar subjects.
  • It requires the user to specify which subjects are "similar," lacking an automatic detection mechanism.
  • Future work can explore incorporating Self-Cross guidance into the training phase to scale to more scenarios.

Related Work

  • Attend&Excite [Chefer et al.] prevents subject omission by maximizing cross-attention.
  • INITNO [Guo et al.] optimizes initial noise by combining self-attention conflict scores.
  • CONFORM [Meral et al.] uses a contrastive loss for subject separation.
  • This work is the first to reveal the crucial role of the interaction between self-attention and cross-attention in subject mixing.

Rating

⭐⭐⭐⭐: The paper analyzes the cause of subject mixing (self-attention intrusion) in depth, and the Self-Cross guidance loss is intuitive yet effective. The clear advantage of multi-patch aggregation over the single-patch scheme is convincing, and the SSD benchmark and GPT-4o evaluation give the community useful tools.