VSC: Visual Search Compositional Text-to-Image Diffusion Model¶
Conference: ICCV 2025 arXiv: 2505.01104 Code: Not released Area: Text-to-Image Generation Keywords: Compositional text-to-image generation, attribute-object binding, diffusion models, visual embedding fusion, cross-attention localization
TL;DR¶
This paper proposes VSC, a visual-search-based compositional text-to-image diffusion method that substantially improves the accuracy and scalability of multi-attribute-object binding: it generates reference images for each attribute-object pair independently, fuses the resulting visual prototype embeddings into the text embeddings, and trains with segmentation-guided cross-attention localization.
Background & Motivation¶
Text-to-image diffusion models excel at generating realistic images but frequently fail on complex prompts containing multiple attribute-object binding pairs. For example, given "a red car and a yellow bicycle," the model may confuse or misassign colors.
The root cause lies in the limitations of text encoders such as CLIP:
Insufficient linguistic structure understanding: CLIP tends to represent text as a "bag-of-concepts," ignoring hierarchical relationships and modifiers.
Erroneous cross-attention maps: The cross-attention maps in the diffusion model UNet are inherently inaccurate, and the attention distributions of attribute tokens are misaligned with their corresponding objects.
Poor scalability: Existing inference-time attention control methods (e.g., Attend-and-Excite) degrade sharply as the number of binding pairs increases.
Serial Search Theory from cognitive psychology holds that, when confronted with multi-feature objects, the brain processes and binds features one at a time. Inspired by this, and observing that CLIP and diffusion models already perform well in the single-binding-pair setting (SD2.1 reaches 0.95/0.94/0.85 on single-pair color/texture/shape), the key insight is: first generate reference images for each attribute-object pair independently, then fuse their visual information into the full generation.
Method¶
Overall Architecture¶
VSC builds upon Stable Diffusion with the following core pipeline (a code sketch of the flow follows the list):
1. Decomposition: Decompose the complex prompt into multiple attribute-object sub-prompts.
2. Search: Use SD to generate \(m\) reference images for each sub-prompt.
3. Fusion: Extract visual prototypes via an image encoder and fuse them with text embeddings through an MLP.
4. Generation: Drive the frozen SD to generate the final compositional image using the enhanced text embeddings.
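A minimal sketch of this four-stage flow. The authors' code is not released, so every function and argument name here is a hypothetical placeholder illustrating the structure rather than the actual implementation:

```python
from typing import Callable, List, Sequence

import torch


def vsc_generate(
    prompt: str,
    binding_pairs: Sequence[str],              # e.g. ["a red car", "a yellow bicycle"]
    pair_token_indices: Sequence[List[int]],   # token positions of each pair in `prompt`
    sd_generate: Callable[[str, int], torch.Tensor],        # sub-prompt -> (m, C, H, W) references
    image_encoder: Callable[[torch.Tensor], torch.Tensor],  # images -> (m, d) embeddings
    text_encoder: Callable[[str], torch.Tensor],            # prompt -> (T, d) token embeddings
    fusion_mlp: torch.nn.Module,               # maps concat(text, visual) -> d
    sd_denoise: Callable[[torch.Tensor], torch.Tensor],     # frozen SD conditioned on embeddings
    m: int = 4,
) -> torch.Tensor:
    # 1. Decomposition is assumed to have produced `binding_pairs` already.
    text_emb = text_encoder(prompt)            # (T, d)
    enhanced = text_emb.clone()

    for sub_prompt, token_idx in zip(binding_pairs, pair_token_indices):
        # 2. Search: generate m reference images for this attribute-object pair.
        refs = sd_generate(sub_prompt, m)      # (m, C, H, W)

        # 3. Fusion: average the reference embeddings into a visual prototype,
        #    then fuse it into the tokens of this binding pair via the MLP.
        prototype = image_encoder(refs).mean(dim=0)   # (d,)
        for t in token_idx:
            enhanced[t] = fusion_mlp(torch.cat([text_emb[t], prototype]))

    # 4. Generation: the frozen SD UNet is driven by the enhanced embeddings.
    return sd_denoise(enhanced)
```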
Key Design 1: Visual Embedding Fusion¶
For each binding pair \([a_n, o_n]\), \(m\) reference images are generated, and the image encoder \(\phi\) extracts embeddings whose mean serves as the visual prototype:
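A plausible formalization, assuming \(I_n^{i}\) denotes the \(i\)-th reference image generated for pair \(n\) (the paper's exact notation may differ):

\[
v_n = \frac{1}{m} \sum_{i=1}^{m} \phi\!\left(I_n^{i}\right)
\]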
A trainable MLP then fuses the visual prototype with the corresponding text embedding:
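One plausible form, assuming the MLP takes the concatenation of a token's text embedding \(e_k\) and the pair's visual prototype \(v_n\) (the paper may use a different combination):

\[
\tilde{e}_k = \mathrm{MLP}\!\big(\left[\, e_k \,;\, v_n \,\right]\big), \qquad k \in \text{tokens of } [a_n, o_n]
\]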
Only tokens belonging to binding pair indices are augmented with visual information; all other tokens remain unchanged.
Key Design 2: Segmentation-Guided Cross-Attention Localization Training¶
Since the cross-attention maps in SD are erroneous and would lead to attribute misassignment if used directly, a localization loss is introduced to align the attention maps of both attribute and object tokens with their corresponding segmentation masks:
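The paper's precise loss is not reproduced here; a common formulation consistent with this description encourages the cross-attention mass of each pair's tokens to fall inside the pair's segmentation mask \(M_n\):

\[
L_{\text{loc}} = \sum_{n}\left(1 - \frac{\sum_{p \in M_n} A_{p,n}}{\sum_{p} A_{p,n}}\right),
\]

where \(A_{p,n}\) denotes the cross-attention weight between spatial location \(p\) and the attribute/object tokens of pair \(n\).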
The total training objective is: \(L = L_{\text{noise}} + \lambda L_{\text{loc}}\)
Key Design 3: Synthetic Dataset Construction¶
Three models—SD3.5, Flux, and SynGen—are used to generate 300 images per prompt from T2I-CompBench. Instance segmentation is performed with Mask2Former, and alignment scores are computed using OpenCLIP. The top 45 images per prompt by VQA score are selected, yielding a final dataset of 90,000 images with precise segmentation annotations.
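A schematic of this filtering pipeline. The `generators`, `segment`, and `vqa_score` callables are hypothetical placeholders standing in for the SD3.5/Flux/SynGen wrappers, Mask2Former, and the alignment scorer; they are not APIs from the paper:

```python
from typing import Callable, List, Tuple


def build_synthetic_dataset(
    prompts: List[str],
    generators: List[Callable[[str, int], list]],   # SD3.5 / Flux / SynGen wrappers
    segment: Callable[[object], object],            # instance segmentation (e.g. Mask2Former)
    vqa_score: Callable[[object, str], float],      # prompt-image alignment score
    images_per_prompt: int = 300,
    keep_top_k: int = 45,
) -> List[Tuple[object, object, str]]:
    dataset = []
    for prompt in prompts:
        # Generate ~300 candidates per prompt, split across the three generators.
        per_gen = images_per_prompt // len(generators)
        candidates = [img for gen in generators for img in gen(prompt, per_gen)]

        # Rank candidates by alignment score and keep the top 45 per prompt.
        scored = sorted(candidates, key=lambda img: vqa_score(img, prompt), reverse=True)
        for img in scored[:keep_top_k]:
            dataset.append((img, segment(img), prompt))   # image, instance masks, prompt
    return dataset
```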
Training Details¶
- Only the MLP and the last 2–3 layers of the image encoder are trained; the SD UNet is fully frozen throughout (a configuration sketch follows this list).
- Training runs for 60,000 steps with a learning rate of 1e-5 and a batch size of 8.
- Compatible with SD 1.4, 2.1, and 3.5.
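A minimal sketch of how such selective freezing might be configured in PyTorch. The module names (`unet`, `image_encoder`, `fusion_mlp`) and the assumption that the encoder exposes its blocks as `.layers` are illustrative, not the authors' code:

```python
import torch


def configure_trainable_params(unet, image_encoder, fusion_mlp, n_unfrozen_layers=2):
    # The SD UNet stays fully frozen throughout training.
    for p in unet.parameters():
        p.requires_grad_(False)

    # Freeze the image encoder except for its last few layers
    # (assumes the encoder exposes its blocks as `image_encoder.layers`).
    for p in image_encoder.parameters():
        p.requires_grad_(False)
    for layer in list(image_encoder.layers)[-n_unfrozen_layers:]:
        for p in layer.parameters():
            p.requires_grad_(True)

    # The fusion MLP is trained from scratch alongside the unfrozen encoder layers.
    trainable = list(fusion_mlp.parameters())
    trainable += [p for p in image_encoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5)
```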
Key Experimental Results¶
Main Results (T2I-CompBench)¶
| Model | Color | Texture | Shape | HM (harmonic mean) |
|---|---|---|---|---|
| **SD 2.1** | | | | |
| Baseline | 0.50 | 0.49 | 0.42 | 0.467 |
| Attn-Excite | 0.64 | 0.59 | 0.45 | 0.547 |
| SynGen | 0.71 | 0.61 | 0.50 | 0.594 |
| VSC | 0.74 | 0.64 | 0.53 | 0.608 |
| **SD 3.5** | | | | |
| Baseline | 0.76 | 0.67 | 0.53 | 0.639 |
| SynGen | 0.82 | 0.74 | 0.59 | 0.703 |
| VSC | 0.85 | 0.79 | 0.63 | 0.727 |
VSC achieves the best performance across all SD versions, with SD 3.5 + VSC reaching the highest HM of 0.727.
Scalability (Varying Number of Binding Pairs)¶
| Method | 3 pairs | 4 pairs | 5 pairs |
|---|---|---|---|
| SynGen (SD2.1) | 0.678 | 0.485 | 0.119 |
| VSC (SD2.1) | 0.688 | 0.548 | 0.190 |
| SynGen (SD3.5) | 0.808 | 0.566 | 0.182 |
| VSC (SD3.5) | 0.813 | 0.618 | 0.246 |
Key findings:
- The performance advantage of VSC grows as the number of binding pairs increases; at 5 pairs on SD2.1, VSC outperforms SynGen by roughly 60% relative (0.190 vs. 0.119).
- This confirms that the segmentation-guided cross-attention training effectively corrects erroneous attention maps.
Ablation Study¶
- Data scale: Performance improves steadily from 5K to 30K images, with diminishing gains from 30K to 90K.
- Transferability: A model trained only on the Color category outperforms the baseline on Texture and Shape, demonstrating that compositional ability transfers across attribute types.
- Image encoder fine-tuning: Freezing the image encoder prevents the localization loss from converging; fine-tuning the last 2–3 layers is necessary.
- Human evaluation: VSC receives 30.01% of majority votes, substantially outperforming SynGen (25.18%) and SD3.5 (12.61%).
Inference Efficiency¶
VSC achieves efficient scaling through parallel reference image generation: inference time is approximately 2.26× that of the baseline and stays roughly constant as the number of binding pairs grows, whereas the runtime of SynGen and Attend-and-Excite increases significantly with the number of pairs. At 5 pairs, VSC is already the fastest method aside from the baseline.
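As an illustration of why the reference-generation step parallelizes well, a diffusers-style call can batch all sub-prompts in a single pass. This is a generic Stable Diffusion usage sketch with example prompts, not code from the paper:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# All attribute-object sub-prompts are generated in one batched call, so wall-clock
# time stays roughly flat as the number of binding pairs grows (at the cost of
# additional VRAM per extra sub-prompt).
sub_prompts = ["a red car", "a yellow bicycle", "a blue backpack"]
references = pipe(sub_prompts, num_images_per_prompt=4).images  # 3 x 4 reference images
```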
Highlights & Insights¶
- Cognitive science inspiration: An elegant application of Serial Search Theory to generative models—first "searching" each attribute-object pair individually, then compositing them together.
- Extremely lightweight training: Only the MLP and the last few layers of the image encoder require training; the UNet is fully frozen.
- Scalability advantage: Unlike inference-time attention control methods, VSC corrects attention maps during training via the localization loss, incurring no additional overhead at inference.
- Fully synthetic training data: No human-annotated data is required.
Limitations & Future Work¶
- Reference image generation is required for each binding pair, consuming approximately 1.1 GB of additional VRAM per extra prompt.
- The quality of the synthetic data is constrained by the capabilities of the reference models.
- Gains on the Shape attribute consistently lag behind those on Color and Texture.
- Relies on a segmentation model (Mask2Former) for training annotations.
Related Work & Insights¶
- Compositional generation: Attend-and-Excite controls attention at inference time; SynGen optimizes cross-attention loss during training.
- Subject-driven generation: DreamBooth and FastComposer enable reference image-based personalized generation.
- Layout-guided generation: GLIGEN and SpaText leverage bounding boxes or segmentation maps to control spatial information.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The visual search-and-fusion paradigm is original, though the overall framework builds on existing components.
- Technical Quality: ⭐⭐⭐⭐ — Comprehensive experiments including human evaluation and scalability analysis with thorough ablations.
- Practicality: ⭐⭐⭐⭐ — Applicable to multiple SD versions, though inference requires additional reference image generation.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with a natural and well-motivated introduction of the psychological theory.