Penalizing Boundary Activation for Object Completeness in Diffusion Models¶
- Conference: ICCV 2025
- arXiv: 2509.16968
- Code: https://github.com/HaoyXu7/Object_Completeness
- Area: Diffusion Models / Image Generation
- Keywords: Object Completeness, RandomCrop, Attention Constraint, Training-Free, Boundary Penalty
TL;DR¶
This paper investigates the root cause of incomplete object generation in diffusion models — the RandomCrop data augmentation used during training — and proposes a training-free boundary activation penalty method. By applying cross-attention and self-attention constraints during early denoising steps, the method suppresses object generation near image boundaries, reducing the object incompleteness rate of SDv2.1 from 45.7% to 17.3%.
Background & Motivation¶
- Background: Images generated by diffusion models frequently exhibit incomplete objects cropped by image boundaries — e.g., "a red car" showing only its front half, or "a bathtub" missing its right portion. The incompleteness rate reaches 47.3% in DALL-E 2, 45.7% in SDv2.1, and 18% even in SDXL.
- Limitations of Prior Work: Most research treats this as an inherent artifact of generation stochasticity or a simple generation failure, with little rigorous causal analysis or targeted solution.
- Key Challenge: locating the root cause of the incompleteness.
    - Dataset? ✗ Manual inspection reveals only ~4% incompleteness in training data, far below the 45.7% observed in generated images.
    - Data Augmentation? ✓ Fine-tuning with RandomCrop causes the incompleteness rate to increase monotonically with epochs, while fine-tuning on unaugmented images causes it to decrease monotonically — consistently across both seen and unseen object categories.
- Goal: Although RandomCrop is the culprit, it is indispensable for model diversity and generalization, and retraining is prohibitively expensive. A training-free, inference-time solution is therefore required.
Method¶
Core Idea¶
During the early denoising steps — the critical phase that determines coarse structure — the latent representation is optimized via gradient updates to suppress the probability of objects appearing near image boundaries.
Key Design 1: Cross-Attention Constraint¶
The cross-attention map \(M^{cross}_x\) corresponding to the object category tokens in the prompt is extracted; this map reflects the semantic "footprint" of the object in the image.
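A minimal sketch of how such a map could be assembled from attention probabilities cached by UNet hooks. The function name, the 16×16 map resolution, and the token-index bookkeeping are assumptions for illustration, not the paper's exact implementation:

```python
import torch

def object_cross_attention_map(attn_probs, object_token_ids, h=16, w=16):
    """Aggregate cross-attention for the object-category tokens into one spatial map.

    attn_probs:       (heads, h*w, num_text_tokens) softmaxed cross-attention
                      from one UNet layer (assumed collected via attention hooks).
    object_token_ids: indices of the object-category tokens in the prompt.
    """
    probs = attn_probs.mean(dim=0)                      # average over heads -> (h*w, tokens)
    m_cross = probs[:, object_token_ids].sum(dim=-1)    # keep only object-token columns
    m_cross = m_cross.reshape(h, w)
    # Normalize to [0, 1] so later thresholds and losses are scale-free.
    m_cross = (m_cross - m_cross.min()) / (m_cross.max() - m_cross.min() + 1e-8)
    return m_cross
```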
Key Design 2: Self-Attention Constraint¶
Cross-attention primarily encodes semantic information and lacks spatial structural detail. Self-attention maps are therefore incorporated as follows:

- Gaussian smoothing is applied to \(M^{cross}_x\).
- A clustering algorithm selects \(K\) keypoints \(p_1, \dots, p_K\).
- Self-attention maps at each keypoint are extracted and averaged to yield \(M^{self}_{avg}\).
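A sketch of this aggregation, assuming a head-averaged self-attention matrix from the same layer. The Gaussian kernel size and the use of simple top-K peak picking in place of the paper's clustering algorithm are assumptions:

```python
import torch
import torch.nn.functional as F

def averaged_self_attention(m_cross, self_attn, k=3, kernel_size=3, sigma=1.0):
    """Pick K keypoints from the smoothed cross-attention map and average
    the self-attention maps queried at those locations.

    m_cross:   (h, w) object cross-attention map.
    self_attn: (h*w, h*w) head-averaged self-attention from the same layer.
    """
    h, w = m_cross.shape
    # Build a small Gaussian kernel and smooth the cross-attention map.
    coords = torch.arange(kernel_size) - kernel_size // 2
    g = torch.exp(-coords.float() ** 2 / (2 * sigma ** 2))
    kernel = (g[:, None] * g[None, :]) / (g.sum() ** 2)
    smoothed = F.conv2d(m_cross[None, None], kernel[None, None],
                        padding=kernel_size // 2)[0, 0]

    # Stand-in for the paper's clustering: take the K strongest responses as keypoints.
    flat_idx = smoothed.flatten().topk(k).indices
    # Each row of self_attn is the attention map of one query position.
    maps = self_attn[flat_idx]                       # (K, h*w)
    m_self_avg = maps.mean(dim=0).reshape(h, w)
    return m_self_avg
```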
Dispelling Loss¶
- \(A_{sur}\): attention activation in the boundary region (to be suppressed).
- \(A_{inter}\): attention activation in a randomly sampled interior region (to be reinforced).
- Effect: implicitly guides the subject toward the image center.
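The paper's exact loss is not reproduced in this note; one plausible minimal form consistent with the two terms above (the ratio normalization is an assumption) is

\[
\mathcal{L}_{dispel} = \frac{A_{sur}}{A_{sur} + A_{inter}},
\]

so that descending this loss simultaneously suppresses boundary activation and rewards interior activation.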
Latent Optimization¶
This update is applied only during early timesteps \(t > T_1\), where coarse structure is determined; subsequent steps proceed with standard denoising to preserve image quality.
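A sketch of where the gradient updates sit in the sampling loop, assuming a diffusers-style UNet and scheduler. The helper `dispelling_loss` is hypothetical (it would read the attention maps cached by UNet hooks during the forward pass), and `alpha` / `n_grad_steps` are illustrative values, not the paper's settings:

```python
import torch

def denoise_with_boundary_penalty(latents, scheduler, unet, text_emb,
                                  timesteps, t1, alpha=0.1, n_grad_steps=1):
    """Denoising loop with latent gradient updates on early timesteps (t > T1)."""
    for t in timesteps:
        if t > t1:  # early phase: coarse structure is being decided
            for _ in range(n_grad_steps):
                latents = latents.detach().requires_grad_(True)
                # Forward pass only to populate the attention hooks used by the loss.
                _ = unet(latents, t, encoder_hidden_states=text_emb).sample
                loss = dispelling_loss()  # hypothetical helper over cached attention maps
                grad = torch.autograd.grad(loss, latents)[0]
                latents = (latents - alpha * grad).detach()
        # Standard denoising step (classifier-free guidance omitted for brevity).
        with torch.no_grad():
            noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```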
Key Experimental Results¶
Main Results¶
| Method | Requires LLM | HOIR↓ | LOIR↓ | Time(s)↓ | CLIP-IQA↑ | PickScore↑ | ImageReward↑ |
|---|---|---|---|---|---|---|---|
| SD v2.1 | ✗ | 45.7% | 32.0% | 5.51 | 0.714 | 20.03 | 0.221 |
| GLIGEN | ✗ | 34.2% | 27.9% | 8.74 | 0.672 | 20.59 | 0.177 |
| LayoutGPT | ✓ | 30.3% | 31.5% | 14.36 | 0.631 | 21.94 | 0.175 |
| SLD | ✓ | 27.1% | 23.1% | 9.54 | 0.694 | 23.07 | 0.253 |
| Ours | ✗ | 17.3% | 11.7% | 5.75 | 0.703 | 23.41 | 0.327 |
Incompleteness Rate Across Models¶
| Model | HOIR |
|---|---|
| DALL-E 2 | 47.3% |
| SDv2 | 45.5% |
| SDv3 | 30.1% |
| SDXL | 18.0% |
Key Findings¶
- The causal analysis experiments constitute a core contribution: controlled comparisons (with/without RandomCrop fine-tuning; seen/unseen categories) rigorously establish RandomCrop as the primary cause.
- The proposed method reduces SDv2.1's incompleteness rate to 17.3% — approaching SDXL's 18% — without any training.
- Computational overhead is minimal: only 0.24 s additional latency (5.51 → 5.75 s), far below LLM-based methods.
- Image quality metrics (PickScore, ImageReward) improve rather than degrade, confirming that completeness gains do not come at the expense of quality.
- Jointly constraining both cross-attention and self-attention outperforms using either in isolation.
Highlights & Insights¶
- Rigorous root cause analysis: The paper not only proposes a method but also conducts principled investigation into the problem's origin, offering scientific value beyond the technical contribution.
- Minimalist design: No training, no LLM, no auxiliary networks — only a few gradient steps inserted into the inference process.
- Orthogonal to SDXL: SDXL mitigates the issue from the training side; the proposed method addresses it from the inference side, and the two approaches can be combined.
Limitations & Future Work¶
- Penalizing boundary activation may over-constrain layout in certain scenes (e.g., compositions intentionally spanning image boundaries).
- Hyperparameters \(\alpha, \beta, T_1, K\) require tuning.
- Effectiveness on multi-object scenes is not thoroughly explored.
Related Work & Insights¶
- Controlled generation: Attend-and-Excite, P2P, MasaCtrl
- Layout control: GLIGEN, LayoutDiffusion, BoxDiff
- Seed selection: SeedSelect, S2ST
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Root cause analysis paired with an elegant solution.
- Technical Depth: ⭐⭐⭐⭐ — Rigorous experimental design.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple models, multiple metrics, and causal validation.
- Value: ⭐⭐⭐⭐⭐ — Training-free, plug-and-play, with negligible overhead.