Penalizing Boundary Activation for Object Completeness in Diffusion Models¶
- Conference: ICCV 2025
- arXiv: 2509.16968
- Code: https://github.com/HaoyXu7/Object_Completeness
- Area: Diffusion Models / Image Generation
- Keywords: Object Completeness, RandomCrop, Attention Constraint, Training-Free, Boundary Penalty
TL;DR¶
This paper investigates the root cause of incomplete object generation in diffusion models — the RandomCrop data augmentation used during training — and proposes a training-free boundary activation penalty method. By applying cross-attention and self-attention constraints during early denoising steps, the method suppresses object generation near image boundaries, reducing the object incompleteness rate of SDv2.1 from 45.7% to 17.3%.
Background & Motivation¶
- Background: Images generated by diffusion models frequently exhibit incomplete objects cropped by image boundaries — e.g., "a red car" showing only its front half, or "a bathtub" missing its right portion. The incompleteness rate reaches 47.3% in DALL-E 2, 45.7% in SDv2.1, and 18% even in SDXL.
- Limitations of Prior Work: Most research treats this as an inherent artifact of generation stochasticity or a simple generation failure, with little rigorous causal analysis or targeted solution.
- Key Challenge: locating the root cause of the incompleteness.
    - Dataset? ✗ Manual inspection reveals only ~4% incompleteness in training data, far below the 45.7% observed in generated images.
    - Data Augmentation? ✓ Fine-tuning with RandomCrop causes the incompleteness rate to increase monotonically with epochs, while fine-tuning on unaugmented images causes it to decrease monotonically — consistently across both seen and unseen object categories.
- Goal: Although RandomCrop is the culprit, it is indispensable for model diversity and generalization, and retraining is prohibitively expensive. A training-free, inference-time solution is therefore required.
Method¶
Core Idea¶
During the early denoising steps — the critical phase that determines coarse structure — the latent representation is optimized via gradient updates to suppress the probability of objects appearing near image boundaries.
Key Design 1: Cross-Attention Constraint¶
The cross-attention map \(M^{cross}_x\) corresponding to the object category tokens in the prompt is extracted; this map reflects the semantic "footprint" of the object in the image.
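A minimal sketch of how such a map could be assembled from attention probabilities cached by UNet hooks. The function name, the 16×16 map resolution, and the token-index bookkeeping are assumptions for illustration, not the paper's exact implementation:

```python
import torch

def object_cross_attention_map(attn_probs, object_token_ids, h=16, w=16):
    """Aggregate cross-attention for the object-category tokens into one spatial map.

    attn_probs:       (heads, h*w, num_text_tokens) softmaxed cross-attention
                      from one UNet layer (assumed collected via attention hooks).
    object_token_ids: indices of the object-category tokens in the prompt.
    """
    probs = attn_probs.mean(dim=0)                      # average over heads -> (h*w, tokens)
    m_cross = probs[:, object_token_ids].sum(dim=-1)    # keep only object-token columns
    m_cross = m_cross.reshape(h, w)
    # Normalize to [0, 1] so later thresholds and losses are scale-free.
    m_cross = (m_cross - m_cross.min()) / (m_cross.max() - m_cross.min() + 1e-8)
    return m_cross
```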
Key Design 2: Self-Attention Constraint¶
Cross-attention primarily encodes semantic information and lacks spatial structural detail. Self-attention maps are therefore incorporated as follows:

- Gaussian smoothing is applied to \(M^{cross}_x\).
- A clustering algorithm selects \(K\) keypoints \(p_1, \dots, p_K\).
- Self-attention maps at each keypoint are extracted and averaged to yield \(M^{self}_{avg}\).
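A sketch of this aggregation, assuming a head-averaged self-attention matrix from the same layer. The Gaussian kernel size and the use of simple top-K peak picking in place of the paper's clustering algorithm are assumptions:

```python
import torch
import torch.nn.functional as F

def averaged_self_attention(m_cross, self_attn, k=3, kernel_size=3, sigma=1.0):
    """Pick K keypoints from the smoothed cross-attention map and average
    the self-attention maps queried at those locations.

    m_cross:   (h, w) object cross-attention map.
    self_attn: (h*w, h*w) head-averaged self-attention from the same layer.
    """
    h, w = m_cross.shape
    # Build a small Gaussian kernel and smooth the cross-attention map.
    coords = torch.arange(kernel_size) - kernel_size // 2
    g = torch.exp(-coords.float() ** 2 / (2 * sigma ** 2))
    kernel = (g[:, None] * g[None, :]) / (g.sum() ** 2)
    smoothed = F.conv2d(m_cross[None, None], kernel[None, None],
                        padding=kernel_size // 2)[0, 0]

    # Stand-in for the paper's clustering: take the K strongest responses as keypoints.
    flat_idx = smoothed.flatten().topk(k).indices
    # Each row of self_attn is the attention map of one query position.
    maps = self_attn[flat_idx]                       # (K, h*w)
    m_self_avg = maps.mean(dim=0).reshape(h, w)
    return m_self_avg
```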
Dispelling Loss¶
- \(A_{sur}\): attention activation in the boundary region (to be suppressed).
- \(A_{inter}\): attention activation in a randomly sampled interior region (to be reinforced).
- Effect: implicitly guides the subject toward the image center.
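The paper's exact loss is not reproduced in this note; one plausible minimal form consistent with the two terms above (the ratio normalization is an assumption) is

\[
\mathcal{L}_{dispel} = \frac{A_{sur}}{A_{sur} + A_{inter}},
\]

so that descending this loss simultaneously suppresses boundary activation and rewards interior activation.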
Latent Optimization¶
This update is applied only during early timesteps \(t > T_1\), where coarse structure is determined; subsequent steps proceed with standard denoising to preserve image quality.
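A sketch of where the gradient updates sit in the sampling loop, assuming a diffusers-style UNet and scheduler. The helper `dispelling_loss` is hypothetical (it would read the attention maps cached by UNet hooks during the forward pass), and `alpha` / `n_grad_steps` are illustrative values, not the paper's settings:

```python
import torch

def denoise_with_boundary_penalty(latents, scheduler, unet, text_emb,
                                  timesteps, t1, alpha=0.1, n_grad_steps=1):
    """Denoising loop with latent gradient updates on early timesteps (t > T1)."""
    for t in timesteps:
        if t > t1:  # early phase: coarse structure is being decided
            for _ in range(n_grad_steps):
                latents = latents.detach().requires_grad_(True)
                # Forward pass only to populate the attention hooks used by the loss.
                _ = unet(latents, t, encoder_hidden_states=text_emb).sample
                loss = dispelling_loss()  # hypothetical helper over cached attention maps
                grad = torch.autograd.grad(loss, latents)[0]
                latents = (latents - alpha * grad).detach()
        # Standard denoising step (classifier-free guidance omitted for brevity).
        with torch.no_grad():
            noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```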
Key Experimental Results¶
Main Results¶
| Method | Requires LLM | HOIR↓ | LOIR↓ | Time(s)↓ | CLIP-IQA↑ | PickScore↑ | ImageReward↑ |
|---|---|---|---|---|---|---|---|
| SD v2.1 | ✗ | 45.7% | 32.0% | 5.51 | 0.714 | 20.03 | 0.221 |
| GLIGEN | ✗ | 34.2% | 27.9% | 8.74 | 0.672 | 20.59 | 0.177 |
| LayoutGPT | ✓ | 30.3% | 31.5% | 14.36 | 0.631 | 21.94 | 0.175 |
| SLD | ✓ | 27.1% | 23.1% | 9.54 | 0.694 | 23.07 | 0.253 |
| Ours | ✗ | 17.3% | 11.7% | 5.75 | 0.703 | 23.41 | 0.327 |
Incompleteness Rate Across Models¶
| Model | HOIR |
|---|---|
| DALL-E 2 | 47.3% |
| SDv2 | 45.5% |
| SDv3 | 30.1% |
| SDXL | 18.0% |
Key Findings¶
- The causal analysis experiments constitute a core contribution: controlled comparisons (with/without RandomCrop fine-tuning; seen/unseen categories) rigorously establish RandomCrop as the primary cause.
- The proposed method reduces SDv2.1's incompleteness rate to 17.3% — approaching SDXL's 18% — without any training.
- Computational overhead is minimal: only 0.24 s additional latency (5.51 → 5.75 s), far below LLM-based methods.
- Image quality metrics (PickScore, ImageReward) improve rather than degrade, confirming that completeness gains do not come at the expense of quality.
- Jointly constraining both cross-attention and self-attention outperforms using either in isolation.
Highlights & Insights¶
- Rigorous root cause analysis: The paper not only proposes a method but also conducts principled investigation into the problem's origin, offering scientific value beyond the technical contribution.
- Minimalist design: No training, no LLM, no auxiliary networks — only a few gradient steps inserted into the inference process.
- Orthogonal to SDXL: SDXL mitigates the issue from the training side; the proposed method addresses it from the inference side, and the two approaches can be combined.
Limitations & Future Work¶
- Penalizing boundary activation may over-constrain layout in certain scenes (e.g., compositions intentionally spanning image boundaries).
- Hyperparameters \(\alpha, \beta, T_1, K\) require tuning.
- Effectiveness on multi-object scenes is not thoroughly explored.
Related Work & Insights¶
- Controlled generation: Attend-and-Excite, P2P, MasaCtrl
- Layout control: GLIGEN, LayoutDiffusion, BoxDiff
- Seed selection: SeedSelect, S2ST
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Root cause analysis paired with an elegant solution.
- Technical Depth: ⭐⭐⭐⭐ — Rigorous experimental design.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple models, multiple metrics, and causal validation.
- Value: ⭐⭐⭐⭐⭐ — Training-free, plug-and-play, with negligible overhead.