Penalizing Boundary Activation for Object Completeness in Diffusion Models

Conference: ICCV 2025 arXiv: 2509.16968 Code: https://github.com/HaoyXu7/Object_Completeness Area: Diffusion Models / Image Generation Keywords: Object Completeness, RandomCrop, Attention Constraint, Training-Free, Boundary Penalty

TL;DR

This paper investigates the root cause of incomplete object generation in diffusion models — the RandomCrop data augmentation used during training — and proposes a training-free boundary activation penalty method. By applying cross-attention and self-attention constraints during early denoising steps, the method suppresses object generation near image boundaries, reducing the object incompleteness rate of SDv2.1 from 45.7% to 17.3%.

Background & Motivation

  • Background: Images generated by diffusion models frequently exhibit incomplete objects cropped by image boundaries — e.g., "a red car" showing only its front half, or "a bathtub" missing its right portion. The incompleteness rate reaches 47.3% in DALL-E 2, 45.7% in SDv2.1, and 18% even in SDXL.

  • Limitations of Prior Work: Most research treats this as an inherent artifact of generation stochasticity or a simple generation failure, with little rigorous causal analysis or targeted solution.

  • Key Challenge:

  • Dataset? Manual inspection reveals only ~4% incompleteness in training data, far below the 45.7% observed in generated images.
  • Data Augmentation? ✓ Fine-tuning with RandomCrop causes the incompleteness rate to increase monotonically with epochs, while fine-tuning on unaugmented images causes it to decrease monotonically — consistently across both seen and unseen object categories.

  • Goal: Although RandomCrop is the culprit, it is indispensable for model diversity and generalization, and retraining is prohibitively expensive. A training-free, inference-time solution is therefore required.

Method

Core Idea

During the early denoising steps — the critical phase that determines coarse structure — the latent representation is optimized via gradient updates to suppress the probability of objects appearing near image boundaries.

Key Design 1: Cross-Attention Constraint

The cross-attention map \(M^{cross}_x\) corresponding to the object category tokens in the prompt is extracted; this map reflects the semantic "footprint" of the object in the image, and the boundary penalty defined below is applied directly to it.
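As a concrete sketch of this extraction step — the tensor shapes, head-averaging, and min-max normalization are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def object_cross_attention_map(attn, token_idx, res=16):
    """Average a cross-attention tensor over heads and pull out the
    map for one prompt token.

    attn: (heads, res*res, num_tokens) attention probabilities from one
          cross-attention layer (shapes are illustrative).
    Returns a (res, res) map M_x^{cross}, min-max normalized to [0, 1].
    """
    m = attn.mean(axis=0)[:, token_idx]              # (res*res,)
    m = m.reshape(res, res)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)   # normalize to [0, 1]
    return m

# toy usage: random "attention" over 77 prompt tokens at 16x16 resolution
rng = np.random.default_rng(0)
attn = rng.random((8, 16 * 16, 77))
M_cross = object_cross_attention_map(attn, token_idx=5)
```

In a real pipeline the `attn` tensor would come from hooks on the UNet's cross-attention layers; only the per-token slicing and normalization are shown here.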

Key Design 2: Self-Attention Constraint

Cross-attention primarily encodes semantic information and lacks spatial structural detail. Self-attention maps are therefore incorporated as follows:

  • Gaussian smoothing is applied to \(M^{cross}_x\).
  • A clustering algorithm selects \(K\) keypoints \(p_1, \dots, p_K\).
  • Self-attention maps at each keypoint are extracted and averaged to yield \(M^{self}_{avg}\).
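The steps above can be sketched in numpy. Note that picking the \(K\) strongest smoothed activations here is a simplification standing in for the paper's clustering step, and the kernel size and \(\sigma\) are assumed values:

```python
import numpy as np

def smooth(m, k=5, sigma=1.0):
    """Separable Gaussian blur with edge padding (same-size output)."""
    x = np.arange(k) - k // 2
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    pad = k // 2
    m = np.pad(m, pad, mode="edge")
    m = np.apply_along_axis(lambda r: np.convolve(r, g, "valid"), 1, m)
    m = np.apply_along_axis(lambda c: np.convolve(c, g, "valid"), 0, m)
    return m

def avg_self_attention(M_cross, self_attn, K=3, res=16):
    """Select K keypoints from the smoothed cross-attention map (here the
    K strongest activations, standing in for the paper's clustering) and
    average the self-attention maps at those points.

    self_attn: (res*res, res*res) self-attention probabilities.
    Returns the (res, res) averaged map M_self_avg.
    """
    sm = smooth(M_cross)
    keypoints = np.argsort(sm.ravel())[::-1][:K]      # flat keypoint indices
    maps = self_attn[keypoints].reshape(K, res, res)  # one map per keypoint
    return maps.mean(axis=0)

# toy usage with random maps
rng = np.random.default_rng(1)
M_cross = rng.random((16, 16))
self_attn = rng.random((16 * 16, 16 * 16))
M_self_avg = avg_self_attention(M_cross, self_attn)
```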

Dispelling Loss

\[\mathcal{L} = \alpha \cdot A_{sur} - \beta \cdot A_{inter}\]
  • \(A_{sur}\): attention activation in the boundary region (to be suppressed).
  • \(A_{inter}\): attention activation in a randomly sampled interior region (to be reinforced).
  • Effect: implicitly guides the subject toward the image center.
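A minimal numpy sketch of this loss on a single 2-D attention map — the border width, interior patch size, and default \(\alpha, \beta\) values are illustrative assumptions:

```python
import numpy as np

def dispelling_loss(M, border=2, alpha=1.0, beta=0.5, rng=None):
    """L = alpha * A_sur - beta * A_inter on a (H, W) attention map.

    A_sur:   mean activation in a `border`-pixel frame around the image
             (to be suppressed).
    A_inter: mean activation in a randomly placed interior patch
             (to be reinforced).
    """
    if rng is None:
        rng = np.random.default_rng()
    H, W = M.shape
    sur = np.ones((H, W), dtype=bool)
    sur[border:H - border, border:W - border] = False   # boundary frame mask
    A_sur = M[sur].mean()
    # randomly sampled interior patch (quarter-size, an assumed choice)
    ph, pw = H // 4, W // 4
    y = rng.integers(border, H - border - ph)
    x = rng.integers(border, W - border - pw)
    A_inter = M[y:y + ph, x:x + pw].mean()
    return alpha * A_sur - beta * A_inter
```

Minimizing this value pushes attention mass out of the boundary frame and into the interior, which is what implicitly centers the subject.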

Latent Optimization

\[z_t' = z_t - \alpha_t \cdot \nabla_{z_t}(\mathcal{L}^{cross} + \mathcal{L}^{self})\]

This update is applied only during early timesteps \(t > T_1\), where coarse structure is determined; subsequent steps proceed with standard denoising to preserve image quality.
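The overall schedule can be sketched as the loop below, with `denoise_step` and `loss_grad` as placeholders for the model's denoiser and the autograd-computed gradient of \(\mathcal{L}^{cross} + \mathcal{L}^{self}\) with respect to \(z_t\) (both hypothetical callables, since the real ones require the full diffusion model):

```python
import numpy as np

def guided_denoise(z_T, T, T1, step_sizes, denoise_step, loss_grad):
    """Schematic sampling loop: for early steps t > T1, nudge the latent
    against the dispelling-loss gradient before the usual denoising
    update; later steps run standard denoising untouched.
    """
    z = z_T
    for t in range(T, 0, -1):
        if t > T1:
            z = z - step_sizes[t] * loss_grad(z, t)  # latent optimization
        z = denoise_step(z, t)                       # standard denoising
    return z

# toy usage with stand-in callables (not a real denoiser)
z_out = guided_denoise(
    np.ones(4), T=4, T1=2,
    step_sizes={t: 0.1 for t in range(1, 5)},
    denoise_step=lambda z, t: 0.9 * z,
    loss_grad=lambda z, t: np.sign(z),
)
```

The gating on `t > T1` mirrors the design choice above: the penalty only acts while coarse structure is being decided, so later steps recover standard image quality.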

Key Experimental Results

Main Results

| Method    | Requires LLM | HOIR↓ | LOIR↓ | Time(s)↓ | CLIP-IQA↑ | PickScore↑ | ImageReward↑ |
|-----------|--------------|-------|-------|----------|-----------|------------|--------------|
| SD v2.1   | ✗            | 45.7% | 32.0% | 5.51     | 0.714     | 20.03      | 0.221        |
| GLIGEN    | ✗            | 34.2% | 27.9% | 8.74     | 0.672     | 20.59      | 0.177        |
| LayoutGPT | ✓            | 30.3% | 31.5% | 14.36    | 0.631     | 21.94      | 0.175        |
| SLD       | ✓            | 27.1% | 23.1% | 9.54     | 0.694     | 23.07      | 0.253        |
| Ours      | ✗            | 17.3% | 11.7% | 5.75     | 0.703     | 23.41      | 0.327        |

Incompleteness Rate Across Models

| Model    | HOIR  |
|----------|-------|
| DALL-E 2 | 47.3% |
| SDv2     | 45.5% |
| SDv3     | 30.1% |
| SDXL     | 18.0% |

Key Findings

  • The causal analysis experiments constitute a core contribution: controlled comparisons (with/without RandomCrop fine-tuning; seen/unseen categories) rigorously establish RandomCrop as the primary cause.
  • The proposed method reduces SDv2.1's incompleteness rate to 17.3% — approaching SDXL's 18% — without any training.
  • Computational overhead is minimal: only 0.24 s additional latency (5.51 → 5.75 s), far below LLM-based methods.
  • Image quality metrics (PickScore, ImageReward) improve rather than degrade, confirming that completeness gains do not come at the expense of quality.
  • Jointly constraining both cross-attention and self-attention outperforms using either in isolation.

Highlights & Insights

  1. Rigorous root cause analysis: The paper not only proposes a method but also conducts principled investigation into the problem's origin, offering scientific value beyond the technical contribution.
  2. Minimalist design: No training, no LLM, no auxiliary networks — only a few gradient steps inserted into the inference process.
  3. Orthogonal to SDXL: SDXL mitigates the issue from the training side; the proposed method addresses it from the inference side, and the two approaches can be combined.

Limitations & Future Work

  • Penalizing boundary activation may over-constrain layout in certain scenes (e.g., compositions intentionally spanning image boundaries).
  • Hyperparameters \(\alpha, \beta, T_1, K\) require tuning.
  • Effectiveness on multi-object scenes is not thoroughly explored.
Related Work

  • Controlled generation: Attend-and-Excite, P2P, MasaCtrl
  • Layout control: GLIGEN, LayoutDiffusion, BoxDiff
  • Seed selection: SeedSelect, S2ST

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Root cause analysis paired with an elegant solution.
  • Technical Depth: ⭐⭐⭐⭐ — Rigorous experimental design.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple models, multiple metrics, and causal validation.
  • Value: ⭐⭐⭐⭐⭐ — Training-free, plug-and-play, with negligible overhead.