Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

Conference: CVPR 2025
arXiv: 2411.18936
Code: None
Area: Image Generation
Keywords: Diffusion guidance, subject blending, self-cross attention, training-free inference, similar subject generation

TL;DR

This paper proposes Self-Cross Diffusion Guidance, which reduces subject mixing when diffusion models generate similar subjects by penalizing the overlap between the aggregated self-attention map of one subject and the cross-attention map of another. The authors present it as the first training-free method to exploit the interaction between self-attention and cross-attention.

Background & Motivation

While diffusion models have made remarkable progress in text-to-image generation, subject mixing remains a critical unsolved issue.
This problem is particularly severe when generating multiple visually similar subjects (e.g., a leopard and a tiger), where the features of different subjects tend to bleed into each other.
Existing methods (Attend&Excite, INITNO, CONFORM) perform guidance based on either cross-attention or self-attention individually, neglecting the interaction between the two.
Focusing only on the most discriminative patches (e.g., a bird's beak) is insufficient, because other foreground patches can also cause subject mixing.
Existing evaluation benchmarks lack challenging prompts for similar subject scenarios, and CLIP scores correlate poorly with human judgment.

Method

Overall Architecture

Self-Cross Guidance is a training-free inference-time optimization method. During the first half of the denoising steps in the diffusion reverse process, corresponding patches are selected from the cross-attention map of each subject using Otsu thresholding. The self-attention maps of these patches are then aggregated to penalize the overlap between the aggregated self-attention and the cross-attention of other subjects. This is implemented through a combination of initial noise optimization and iterative latent refinement.
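The patch-selection step can be illustrated with a small, dependency-free sketch of Otsu thresholding on a cross-attention map. This is an illustration under stated assumptions, not the authors' code: the function names `otsu_threshold` and `select_patches`, the histogram size `bins=64`, and the nested-list map layout are all hypothetical, and a practical implementation would operate on tensors.

```python
def otsu_threshold(values, bins=64):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # flat map: there is no meaningful split
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    total = len(values)
    sum_all = sum(c * h for c, h in zip(centers, hist))
    best_var, best_k = -1.0, 0
    w0, sum0 = 0, 0.0
    for k in range(bins - 1):
        w0 += hist[k]                  # number of values in the low class
        sum0 += centers[k] * hist[k]
        w1 = total - w0                # number of values in the high class
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_k = var_between, k
    return lo + (best_k + 1) * width   # upper edge of the last low-class bin


def select_patches(cross_attn):
    """Return (row, col) positions whose cross-attention exceeds the Otsu threshold."""
    flat = [v for row in cross_attn for v in row]
    thr = otsu_threshold(flat)
    return [(r, c) for r, row in enumerate(cross_attn)
            for c, v in enumerate(row) if v > thr]


# Toy 4x4 cross-attention map (assumed values) with a bright 2x2 subject region.
toy_map = [
    [0.02, 0.02, 0.03, 0.02],
    [0.02, 0.80, 0.90, 0.02],
    [0.01, 0.85, 0.80, 0.05],
    [0.02, 0.02, 0.02, 0.02],
]
print(select_patches(toy_map))
```

Otsu's method needs no hand-tuned cutoff, so the selected region adapts to each subject's cross-attention statistics at each step.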

Key Designs

Design 1: Self-Attention Map Aggregation

Function: To obtain a self-attention representation covering the entire subject region.
Mechanism: Apply Otsu thresholding to the cross-attention map \(A_i^c\) of subject \(i\) to select high-response patches \((x_m, y_n)\). The self-attention maps \(A_{x_m,y_n}^s\) of the selected patches are then averaged, weighted by their cross-attention values: \(A_i^s = \frac{\sum_{x_m,y_n}(A_i^c[x_m,y_n] \times A_{x_m,y_n}^s)}{\sum_{x_m,y_n} A_i^c[x_m,y_n]}\), where both sums run over the selected patches only.
Design Motivation: Self-attention maps of different patches vary significantly; relying solely on the single most discriminative patch cannot cover the entire area of the subject. Aggregating the self-attention maps of multiple patches provides a more comprehensive representation of the subject's attended region.
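A minimal sketch of the weighted aggregation in Design 1 follows. The function name `aggregate_self_attention` and the nested-list layout (where `self_attn[r][c]` holds the self-attention map of patch `(r, c)`) are assumptions made for illustration; the selected patches are taken as given, for example from a thresholding step.

```python
def aggregate_self_attention(cross_attn, self_attn, selected):
    """Cross-attention-weighted average of the self-attention maps of selected patches.

    cross_attn: H x W nested list, the subject's cross-attention map A_i^c
    self_attn:  self_attn[r][c] is the H x W self-attention map of patch (r, c)
    selected:   list of (r, c) patches kept after thresholding
    """
    h, w = len(cross_attn), len(cross_attn[0])
    weight_sum = sum(cross_attn[r][c] for r, c in selected)
    if weight_sum == 0:
        raise ValueError("selected patches carry no cross-attention mass")
    agg = [[0.0] * w for _ in range(h)]
    for r, c in selected:
        weight = cross_attn[r][c]          # A_i^c[x_m, y_n]
        patch_map = self_attn[r][c]        # A^s_{x_m, y_n}
        for x in range(h):
            for y in range(w):
                agg[x][y] += weight * patch_map[x][y]
    # Normalize by the total cross-attention weight of the selected patches.
    return [[v / weight_sum for v in row] for row in agg]


# Toy 2x2 example with assumed values: two selected patches, each attending
# entirely to itself.
cross = [[0.6, 0.2], [0.1, 0.1]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
self_maps = [
    [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]],
    [zeros, zeros],
]
print(aggregate_self_attention(cross, self_maps, [(0, 0), (0, 1)]))
```

Because the weights are normalized, the aggregated map sums to 1 whenever each contributing self-attention map does.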

Design 2: Self-Cross Guidance Loss

Function: To penalize the overlap between the self-attention region of one subject and the cross-attention region of another, thereby eliminating subject mixing.
Mechanism: For a pair of subjects \((i, j)\), the overlap is computed as \(g(i,j) = \sum_{x,y} \min(A_i^s[x,y], A_j^c[x,y]) + \sum_{x,y} \min(A_i^c[x,y], A_j^s[x,y])\). For \(N\) similar subjects, \(S_{\text{self-cross}}\) is the average of \(g(i,j)\) over all \(\binom{N}{2}\) pairs. The total loss is defined as \(\mathcal{L}_{\text{total}} = S_{\text{self-cross}} + \lambda \cdot S_{\text{cross-attn}}\).
Design Motivation: The essence of subject mixing is that the self-attention of one subject invades the region of another subject. The overlap between the aggregated self-attention map and the cross-attention map captures this intrusion more precisely than using either attention type in isolation.
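The pairwise overlap \(g(i,j)\) and its average over all subject pairs can be sketched as below. The function names are hypothetical, and the maps are plain nested lists of equal shape; the sketch only evaluates the score and leaves out the gradient step that would update the latent.

```python
from itertools import combinations


def overlap(a, b):
    """Sum of element-wise minima of two equally shaped 2D maps."""
    return sum(min(u, v) for row_a, row_b in zip(a, b)
               for u, v in zip(row_a, row_b))


def pair_score(self_i, cross_i, self_j, cross_j):
    """g(i, j): overlap of A_i^s with A_j^c plus overlap of A_i^c with A_j^s."""
    return overlap(self_i, cross_j) + overlap(cross_i, self_j)


def self_cross_score(self_maps, cross_maps):
    """Average g(i, j) over all unordered pairs of subjects."""
    pairs = list(combinations(range(len(self_maps)), 2))
    if not pairs:
        return 0.0  # fewer than two subjects: nothing can mix
    total = sum(pair_score(self_maps[i], cross_maps[i],
                           self_maps[j], cross_maps[j]) for i, j in pairs)
    return total / len(pairs)


# Toy maps with assumed values: two subjects occupying opposite corners.
left = [[1.0, 0.0], [0.0, 0.0]]
right = [[0.0, 0.0], [0.0, 1.0]]
print(self_cross_score([left, right], [left, right]))  # well separated
print(self_cross_score([left, left], [left, left]))    # fully overlapping
```

The element-wise minimum is zero wherever either map is zero, so the score vanishes only when each subject's aggregated self-attention stays out of the other subject's cross-attention region.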

Design 3: SSD Benchmark and GPT-4o Evaluation

Function: To provide a challenging evaluation benchmark for similar subject generation.
Mechanism: Release the Similar-Subject Dataset (SSD), which contains text prompts featuring two or three similar subjects. Utilize GPT-4o to automatically evaluate subject presence, identifiability, and attribute binding in the generated images via visual question answering.
Design Motivation: CLIP scores cannot effectively differentiate subject mixing issues. GPT-4o evaluation exhibits higher consistency with human judgment.

Loss & Training

\[\mathcal{L}_{\text{total}} = S_{\text{self-cross}} + \lambda \cdot S_{\text{cross-attn}}\]

where \(S_{\text{cross-attn}}\) follows the cross-attention response score of Attend&Excite, and \(\lambda\) is a balancing coefficient. This loss is only applied during the first half of the denoising steps and to the intermediate layers.
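The combination of the two terms and the step gating can be sketched as follows. Several details here are assumptions rather than facts from the paper: the cross-attention term is written in the commonly cited Attend&Excite form (one minus the peak response of the most neglected subject) without its Gaussian smoothing, the default `lam=1.0` is a placeholder because the summary does not state a value for \(\lambda\), and "first half" is read as integer division of a zero-based step index.

```python
def cross_attn_score(cross_maps):
    """Attend&Excite-style term: 1 minus the peak response of the weakest subject.

    Simplified sketch; the Gaussian smoothing of the maps is omitted here.
    """
    return max(1.0 - max(v for row in m for v in row) for m in cross_maps)


def total_loss(s_self_cross, s_cross_attn, lam=1.0):
    """L_total = S_self-cross + lambda * S_cross-attn (lam is a placeholder value)."""
    return s_self_cross + lam * s_cross_attn


def guidance_active(step, num_steps):
    """True during the first half of the denoising steps (zero-based step index)."""
    return step < num_steps // 2


# Toy usage with assumed numbers: one strong subject and one weak subject.
maps = [[[0.9, 0.1]], [[0.4, 0.2]]]
for step in (0, 24, 25):
    if guidance_active(step, 50):
        print(step, total_loss(0.5, cross_attn_score(maps), lam=2.0))
    else:
        print(step, "guidance off")
```

Restricting guidance to the early steps matches the description above and bounds the extra inference cost.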

Key Experimental Results

Quantitative Results on SSD Benchmark

Method	Presence ↑	Identifiability ↑	Attribute Binding ↑	FID ↓
Stable Diffusion	Baseline	Baseline	Baseline	Baseline
Attend&Excite	Improved	Limited Improvement	Limited Improvement	—
INITNO	Improved	Partial Improvement	Partial Improvement	—
CONFORM	Improved	Partial Improvement	Partial Improvement	—
Self-Cross (Ours)	Best	Best	Best	Maintained

Ablation Study

Configuration	Effect
Cross-attn guidance only	Fails to eliminate subject mixing
Self-attn guidance only	Partial improvement
Single-patch self-attn + cross-attn	Limited improvement
Aggregated self-attn + cross-attn	Significant elimination of subject mixing

Key Findings

Self-Cross guidance substantially outperforms methods that rely on a single attention map, such as INITNO, in eliminating subject mixing.
Aggregating multi-patch self-attention yields significantly better results than single-patch approaches.
The method is compatible with both UNet-based (SD 1.x/2.x) and Transformer-based (SD3) diffusion models.
As a side benefit, subject omission issues are also alleviated.
Overall image quality, as measured by FID, is not noticeably affected.

Highlights & Insights

First exploration of the interaction between self-attention and cross-attention: Provides a new understanding of the causes of subject mixing—self-attention intrusion into other regions leading to feature copying.
Necessity of multi-patch aggregation: Demonstrates that focusing solely on the most discriminative patch is insufficient to eliminate subject mixing.
GPT-4o evaluation scheme: Provides a more reliable automated evaluation instrument for the diffusion model community.

Limitations & Future Work

Initial noise optimization and iterative refinement increase inference time.
The method primarily targets mixing between similar subjects, offering limited improvement for attribute binding of non-similar subjects.
It requires the user to specify which subjects are "similar"; there is no automatic detection mechanism.
Future work can explore incorporating Self-Cross guidance into the training phase to scale to more scenarios.

Related Work

Attend&Excite [Chefer et al.] prevents subject omission by maximizing cross-attention responses.
INITNO [Guo et al.] optimizes the initial noise using a self-attention conflict score.
CONFORM [Meral et al.] uses a contrastive loss for subject separation.
This work is the first to reveal the crucial role of the interaction between self-attention and cross-attention in subject mixing.

Rating

⭐⭐⭐⭐: The analysis of the causes of subject mixing (self-attention intrusion) is thorough, and the Self-Cross guidance loss is intuitive yet effective. The clear advantage of multi-patch aggregation over the single-patch scheme is convincing. The SSD benchmark and GPT-4o evaluation are valuable tools for the community.