VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness¶

Conference: CVPR 2025
arXiv: 2503.16406
Code: None
Area: Object Detection / Image Generation
Keywords: Human-Object Interaction Generation, Verb Semantic Understanding, Relation Decoupled Guidance, Interaction Region Localization, Text-to-Image Diffusion

TL;DR¶

This paper proposes VerbDiff, a text-to-image diffusion model that generates accurate human-object interaction images without requiring extra conditions (such as bounding boxes). It eliminates interaction verb bias using Relation Decoupled Guidance (RDG) and extracts local interaction regions from cross-attention maps via an Interaction Region Module (IR Module) for directional guidance.

Background & Motivation¶

Text-to-image diffusion models underperform in depicting human-object interactions, struggling to distinguish semantically distinct interaction verbs (e.g., "walking a bicycle" vs. "riding a bicycle").
CLIP exhibits a strong object bias, tending to focus on object nouns in the prompt while ignoring semantic differences in verbs.
Existing methods rely on additional conditions (e.g., LLM layouts, bounding boxes) to provide explicit relation information, yet they still lack a true understanding of interaction semantics.
Even with bounding box assistance, InteractDiffusion still fails when semantically similar interaction verbs (e.g., "walking" vs. "riding") share similar bounding boxes.
The interactions in generated images suffer from bias, favoring verbs with the highest frequency in the data distribution (e.g., generating "wearing" instead of "holding" for a backpack).
The distribution of interaction verbs in the HICO-DET dataset is extremely long-tailed, where common verbs dominate the generation results.
The CLIP similarity metric is insensitive to subtle semantic differences in interaction verbs.
Existing methods have difficulty localizing specific interaction regions in the generated images without extra annotations.

Method¶

Overall Architecture¶

VerbDiff is built upon Stable Diffusion v1.4, where only the cross-attention layers are trained. It contains two core modules: Relation Decoupled Guidance (RDG), which eliminates interaction biases using frequency anchor texts and a triplet loss, and Interaction Directional Guidance (IDG), which extracts interaction regions from cross-attention maps via an IR Module to guide the model towards local interaction details. Training is conducted on HICO-DET using only text inputs and requiring no bounding boxes.

Key Designs¶

Design 1: Relation Decoupled Guidance (RDG)
- Function: Eliminates interaction bias in generated images and sharpens the model's understanding of semantic differences between verbs.
- Mechanism: For each human-object pair, a frequency anchor verb is defined as \(r^{anc} = \arg\max_{r \in R_o} \mathcal{C}(r|o)\) (the most frequent verb for that pair) and is used to construct the anchor text \(T^{anc}\). A triplet loss pulls the generated image feature \(f^{gen}\) toward the ground-truth interaction text embedding \(e^{gt}\) and pushes it away from the anchor text embedding \(e^{anc}\): \(\mathcal{L}_{\text{triple}} = \max(0, m + \text{sim}(f^{gen}, e^{anc}) - \text{sim}(f^{gen}, e^{gt}))\), where \(m\) is the margin and \(\text{sim}\) is a similarity score, so the loss reaches zero once the ground-truth text is more similar than the anchor text by at least \(m\). Meanwhile, masked regions are used to extract real image features for an image alignment loss, which is multiplied by an effective coefficient \(\alpha(k)\) to balance the long-tail distribution.
- Design Motivation: The biased interactions in generated images often correspond to the most frequent verbs in the data; semantic differentiation is therefore achieved by explicitly pushing the generated features away from the anchor text features.
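As a rough illustration of the two ingredients of RDG, the sketch below picks a frequency anchor verb from toy (verb, object) annotations and evaluates a margin-based triplet loss on toy vectors. It is written with cosine similarity, so the loss falls as the feature moves toward the ground-truth text and away from the anchor text. The function names, the toy data, and the margin of 0.2 are assumptions for illustration, not values from the paper.

```python
import math
from collections import Counter

def anchor_verb(annotations, obj):
    """Return the most frequent verb for `obj`, i.e. argmax_r C(r | o)."""
    counts = Counter(verb for verb, o in annotations if o == obj)
    return counts.most_common(1)[0][0]

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(f_gen, e_gt, e_anc, margin=0.2):
    """Hinge loss: zero once sim(f, e_gt) exceeds sim(f, e_anc) by `margin`."""
    return max(0.0, margin + cosine(f_gen, e_anc) - cosine(f_gen, e_gt))

# Toy annotations (hypothetical counts, not HICO-DET statistics).
annotations = (
    [("ride", "bicycle")] * 5
    + [("walk", "bicycle")] * 2
    + [("hold", "backpack")]
)

print(anchor_verb(annotations, "bicycle"))
# Feature aligned with the ground-truth text: no penalty.
print(triplet_loss([1.0, 0.0], e_gt=[1.0, 0.0], e_anc=[0.0, 1.0]))
# Feature aligned with the anchor text: penalized.
print(triplet_loss([0.0, 1.0], e_gt=[1.0, 0.0], e_anc=[0.0, 1.0]))
```

In a real training loop the three vectors would come from CLIP image and text encoders; plain lists are used here only to keep the example self-contained.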

Design 2: Interaction Region Module (IR Module)
- Function: Automatically extracts the interaction region from the cross-attention maps of the generated image, without requiring bounding boxes.
- Mechanism: The module takes the cross-attention maps \(\mathcal{A}_h, \mathcal{A}_r, \mathcal{A}_o\) corresponding to the human, relation (verb), and object tokens, and computes a center point \(c_h, c_r, c_o\) for each token through a centroid extraction mechanism. The interaction center \(c_{rel}\) is defined as the centroid of these three points, and the interaction region is computed as \(B_{rel}^{gen} = c_{rel} \pm \|c_h - c_o\|_2^2\).
- Design Motivation: Global image-level feature alignment is not fine-grained enough. Focusing on the local region where the interaction occurs captures interaction details better.
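The region extraction can be sketched in a few lines. The example below assumes that "centroid extraction" means an attention-weighted mean of pixel coordinates and that coordinates are normalized to [0, 1]; both are assumptions, as are the function names and the clamping of the box to the image. The half-extent follows the formula for \(B_{rel}^{gen}\) as written above.

```python
def centroid(attn):
    """Attention-weighted centroid (x, y) of a 2D map, in [0, 1] coordinates."""
    height, width = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    cx = sum(v * (x + 0.5) / width for row in attn for x, v in enumerate(row))
    cy = sum(v * (y + 0.5) / height for y, row in enumerate(attn) for v in row)
    return cx / total, cy / total

def interaction_region(c_h, c_r, c_o):
    """Return (c_rel, box), where box is (x_min, y_min, x_max, y_max)."""
    # Interaction center: centroid of the three token centers.
    c_rel = tuple(sum(coords) / 3 for coords in zip(c_h, c_r, c_o))
    # Half-extent from the human-object distance, as in the formula for B_rel.
    half = (c_h[0] - c_o[0]) ** 2 + (c_h[1] - c_o[1]) ** 2
    box = (c_rel[0] - half, c_rel[1] - half, c_rel[0] + half, c_rel[1] + half)
    # Clamp to the image (an assumption; the source does not mention clamping).
    return c_rel, tuple(min(1.0, max(0.0, b)) for b in box)

# Toy 2x2 attention maps: human top-left, object bottom-right, verb uniform.
attn_h = [[1.0, 0.0], [0.0, 0.0]]
attn_r = [[1.0, 1.0], [1.0, 1.0]]
attn_o = [[0.0, 0.0], [0.0, 1.0]]

c_h, c_r, c_o = centroid(attn_h), centroid(attn_r), centroid(attn_o)
print(interaction_region(c_h, c_r, c_o))
```

Real cross-attention maps would be averaged over heads and layers before this step; how that aggregation is done is not specified in the summary above.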

Design 3: Interaction Directional Guidance (IDG)
- Function: Guides the model to modify image features within the local interaction region so that they move closer to the ground-truth interaction.
- Mechanism: The features \(f_{rel}^{gt}\) and \(f_{rel}^{gen}\) are extracted from the interaction region, and the bias feature is computed as \(f_{rel}^{bias} = f_{rel}^{gt} - f_{rel}^{gen}\). A direction guidance loss then aligns the global-level modification direction with the local-level interaction region modification direction: \(\mathcal{L}_{\text{IDG}} = 1 - \frac{(f_{\mathcal{M}}^{gt} - f^{gen}) \cdot f_{rel}^{bias}}{\|f_{\mathcal{M}}^{gt} - f^{gen}\| \, \|f_{rel}^{bias}\|}\).
- Design Motivation: Ensures that the model's modifications to the image are concentrated in the interaction region rather than in the global layout, achieving fine-grained alignment of interaction semantics.
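The IDG loss is one minus the cosine between two difference vectors, which the following sketch makes explicit. The function name, the `eps` guard against zero-length vectors, and the toy inputs are assumptions for illustration.

```python
import math

def idg_loss(f_gt_masked, f_gen, f_rel_gt, f_rel_gen, eps=1e-8):
    """1 - cosine between the global and local modification directions."""
    global_dir = [a - b for a, b in zip(f_gt_masked, f_gen)]
    local_dir = [a - b for a, b in zip(f_rel_gt, f_rel_gen)]  # f_rel^bias
    dot = sum(g * l for g, l in zip(global_dir, local_dir))
    norm = (math.sqrt(sum(g * g for g in global_dir))
            * math.sqrt(sum(l * l for l in local_dir)))
    return 1.0 - dot / max(norm, eps)

origin = [0.0, 0.0]
# Global and local directions agree: loss is 0.
print(idg_loss([1.0, 0.0], origin, [2.0, 0.0], origin))
# Directions are orthogonal: loss is 1.
print(idg_loss([1.0, 0.0], origin, [0.0, 3.0], origin))
# Directions are opposite: loss is 2.
print(idg_loss([1.0, 0.0], origin, [-1.0, 0.0], origin))
```

Because only the direction matters, the loss is unaffected by how large the global or local correction is; it penalizes only disagreement between the two.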

Loss & Training¶

The total loss is formulated as \(\mathcal{L}_{\text{total}} = \lambda_1 \cdot \mathcal{L}_{\text{rec}} + \lambda_2 \cdot \mathcal{L}_{\text{RDG}} + \lambda_3 \cdot \mathcal{L}_{\text{IDG}}\), where \(\lambda_1=1.0, \lambda_2=10, \lambda_3=0.8\). The reconstruction loss uses a mask constraint to focus only on the corresponding interaction region, and the RDG loss is multiplied by a long-tail balancing factor.
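The weighting can be checked with a short sketch. The weights are the ones stated above; the function name and the example loss values are hypothetical.

```python
# Loss weights as stated in the text.
LAMBDA_REC, LAMBDA_RDG, LAMBDA_IDG = 1.0, 10.0, 0.8

def total_loss(l_rec, l_rdg, l_idg):
    """Weighted sum of the reconstruction, RDG, and IDG losses."""
    return LAMBDA_REC * l_rec + LAMBDA_RDG * l_rdg + LAMBDA_IDG * l_idg

# Hypothetical per-batch loss values, chosen only to show the scaling.
print(total_loss(l_rec=0.5, l_rdg=0.02, l_idg=0.25))
```

With these example values the RDG and IDG terms each contribute 0.2 after weighting, which shows how a weight of 10 lets a numerically small RDG loss carry the same influence as a much larger IDG loss.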

Key Experimental Results¶

Main Results¶

Model	CLIP T2T	S-BERT T2T	HOI Acc Def. (Full)	HOI Acc KO. (Full)
SD	0.725	0.620	16.09 / 20.08	18.22 / 21.69
GLIGEN	0.683	0.554	15.88 / 17.83	17.91 / 19.35
InteractDiffusion	0.703	0.575	19.67 / 23.53	21.31 / 24.86
VerbDiff	0.733	0.633	22.59 / 27.05	24.79 / 28.43

Ablation Study¶

Setting	CLIP T2T	S-BERT T2T	HOI Acc Def.	HOI Acc KO.
\(\mathcal{L}_{rec}\) only	0.691	0.582	19.38	20.89
+\(\mathcal{L}_{triple}\)+\(\mathcal{L}_{align}\)	0.700	0.589	20.32	21.87
+\(\mathcal{L}_{triple}\)+\(\mathcal{L}_{IDG}\)	0.710	0.610	23.39	24.51
All (Full)	0.733	0.633	22.59	24.79

Key Findings¶

VerbDiff consistently outperforms InteractDiffusion, which requires explicit bounding boxes, across all evaluation settings.
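The S-BERT metric reflects interaction semantic differences better than CLIP: VerbDiff's gain over InteractDiffusion is larger on S-BERT T2T (0.575 to 0.633) than on CLIP T2T (0.703 to 0.733).
IDG provides the largest contribution: in the ablation, pairing the triplet loss with IDG instead of the alignment loss raises HOI Acc (Def.) by about 3 points (20.32 to 23.39), indicating that focusing on the local interaction region is crucial.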
Under complex prompts with multiple interactions, VerbDiff can still accurately distinguish different interaction verbs, achieving performance close to DALL-E 3.
The HOI accuracy approaches that of the real HICO-DET data (Def. 22.59 vs. 26.52, a gap of about 3.9 points).
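The differences quoted in these findings can be recomputed from the two tables. The sketch below uses the first value of each HOI Acc Def. cell in the main table and the Def. column of the ablation table; the dictionary and variable names are illustrative.

```python
# HOI Acc Def. values copied from the tables above (first number in each cell).
main_def = {"SD": 16.09, "GLIGEN": 15.88, "InteractDiffusion": 19.67, "VerbDiff": 22.59}
ablation_def = {"rec only": 19.38, "+triple+align": 20.32, "+triple+IDG": 23.39, "All (Full)": 22.59}
REAL_DATA_DEF = 26.52  # real HICO-DET images, as quoted in the findings

gain_over_interactdiffusion = round(main_def["VerbDiff"] - main_def["InteractDiffusion"], 2)
gap_to_real_data = round(REAL_DATA_DEF - main_def["VerbDiff"], 2)
idg_vs_align = round(ablation_def["+triple+IDG"] - ablation_def["+triple+align"], 2)

print("VerbDiff vs. InteractDiffusion:", gain_over_interactdiffusion)
print("Gap to real data:", gap_to_real_data)
print("IDG vs. alignment loss (with triplet loss):", idg_vs_align)
```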

Highlights & Insights¶

Discovery of frequency anchor text: The generation bias corresponds to the most frequent verbs, an observation that provides a clear motivation for the decoupling design.
Localizing interaction regions from cross-attention maps: Localizing where interactions occur without requiring extra annotations, leading to a simple yet effective design.
Introduction of the S-BERT evaluation metric: Compensating for CLIP's insufficiency in capturing subtle distinctions in interaction semantics.
Training only cross-attention layers: Lightweight and efficient, completing training in only 17 hours.

Limitations & Future Work¶

The base model, constructed upon SD v1.4, has limited capacity, leaving a performance gap between it and larger models like DALL-E 3.
The model is only trained and evaluated on HICO-DET; generalization to other HOI datasets remains unverified.
The IR Module relies on simple geometric centroid calculations of cross-attention, which might not be robust enough for complex multi-person scenarios.
Future work can combine stronger VLMs and larger-scale datasets to further enhance interaction understanding.

Related Work & Insights¶

Unlike InteractDiffusion, which requires explicit bounding boxes, VerbDiff relies solely on text to achieve better interaction understanding.
The decoupling concept using frequency anchor texts can be extended to other generative tasks suffering from class imbalance bias.
The semantic localization capability of cross-attention maps offers a new perspective for annotation-free region extraction.

Rating¶

⭐⭐⭐⭐ — Precise problem definition; the decoupling concept based on frequency bias is novel. The experimental comparisons are thorough, and the achievement of outperforming bounding-box-required methods under text-only conditions is impressive.