Generative Photomontage

Conference: CVPR 2025
arXiv: 2408.07116
Code: https://github.com/seanjliu/generative_photomontage
Area: Diffusion Models
Keywords: Image Synthesis, Diffusion Features, Graph Cut Optimization, Self-Attention Injection, ControlNet

TL;DR

The paper proposes the Generative Photomontage framework, allowing users to select different regions from multiple images generated by ControlNet. It achieves seamless blending through multi-label graph cut segmentation in the diffusion feature space and self-attention feature injection, enabling fine-grained compositional control over generated images.

Background & Motivation

Background: Text-to-image models can generate high-quality images from simple conditions (e.g., text, sketches), but the generation process is inherently a "dice roll": users rarely obtain a completely satisfactory result in a single run. Conditional control methods like ControlNet restrict the output space, but given the same conditions and different random seeds, they still produce diverse results that vary significantly in lighting, appearance, and background.

Limitations of Prior Work: Users might prefer a specific object in the first image, the background of the second, and a detail in the third, but there is no mechanism to combine them into an ideal image. Existing image editing methods either edit high-level style (such as MasaCtrl), cause color distortion when blending in pixel space (such as gradient-domain fusion in Interactive Digital Photomontage), or suffer from structural changes due to operations in noise space (such as Blended Latent Diffusion).

Key Challenge: How can one naturally and harmoniously blend multiple image regions while preserving the appearance of local regions selected by the user? Graph cuts in pixel space are sensitive to low-level color, which often places seams in undesirable locations, while existing diffusion model editing methods often fail to simultaneously ensure local fidelity and global consistency.

Goal: Design a framework that allows users to first generate a large batch of candidate images, then "select" their preferred local regions from them and automatically blend them together, treating the output of generative models as intermediate products rather than final results.

Key Insight: Since ControlNet images share the same input conditions (such as edge maps or depth maps), the generated images have a consistent spatial structure. This provides a natural foundation for region segmentation and composition across images. In addition, the Key (K) features of the self-attention layers in diffusion models carry rich semantic and appearance information, making them suitable as a feature space for graph cut optimization.

Core Idea: Perform multi-label graph cut segmentation in the diffusion feature space instead of the pixel space, followed by seamless blending via self-attention Q/K/V feature injection.

Method

Overall Architecture

The input is a stack of images generated by ControlNet (using the same input conditions but different random seeds). The user marks the regions they want to keep in different images with simple brush strokes. The system first performs multi-label graph cut optimization in the diffusion feature space to obtain the segmentation result, and then injects the blended self-attention features during the denoising process to generate the final composite image.

Key Designs

  1. Diffusion Feature Space Graph Cut Segmentation:

    • Function: Find the optimal region segmentation in the generated image stack based on user strokes, such that the seams align with semantic boundaries as much as possible.
    • Mechanism: Use the Key features \(K \in \mathbb{R}^{w \times h \times d}\) of the self-attention layer as the feature space for graph cuts (at 1/8 resolution of the original image). The energy function consists of a unary term (penalizing label assignments that violate user strokes, with cost \(C=10^6\)) and a binary (pairwise) term (encouraging seams to lie where feature gradients are large across all images). The binary term computes feature distances using the top-10 PCA components of the Key features of each image, and encourages seams to align with semantic edges through an exponential decay function \(\lambda e^{-|f_i(p)-f_i(q)|/(2\sigma)}\), where \(f_i(p)\) is the reduced feature of image \(i\) at location \(p\) and \(p, q\) are neighboring locations. The cost of cutting between \(p\) and \(q\) is therefore small where the features differ strongly and large where they are similar.
    • Design Motivation: Graph cuts in pixel space are sensitive to low-level color variations, often placing seams along undesirable edges. Diffusion features capture higher-level semantic information, making the segmentation boundary semantically more reasonable; simultaneously, performing composition in the feature space during subsequent steps avoids the extra post-processing required by pixel-space composition.
  2. Self-Attention Feature Injection Synthesis:

    • Function: Combine the features of each segmented region into a consistent blended image.
    • Mechanism: Based on the label assignment map obtained from the graph cut, compute composite features \(Q^{comp}\), \(K^{comp}\), and \(V^{comp}\) for each self-attention layer. For K and V, each location takes the stored feature of the image that the mask assigns to it, so the composite is assembled region by region from the stored features. For Q, the base image region uses the model's currently generated \(Q^{model}\) instead of the stored \(Q^B\), allowing the model to adaptively adjust the structure for better seam blending.
    • Design Motivation: \(Q\) controls image structure, while \(K/V\) control appearance. If \(Q^B\) of the base image is forcibly injected, the model cannot adjust the structure near the seams, leading to issues like residual shadows. Using \(Q^{model}\) lets the model perform high-resolution semantic alignment near the low-resolution graph cut boundaries.
  3. Multi-Image User Interaction Workflow:

    • Function: Allow users to select regions from multiple images via simple brush strokes, designating one image as the base image.
    • Mechanism: During the generation phase, save the Q/K/V features for every timestep and self-attention layer of all images. After the user annotates, the system automatically performs segmentation and blending, using the base image's seed and prompt for the final denoising. Compared to segmenting image-by-image with SAM, multi-label graph cut automatically handles overlapping and coverage issues between images.
    • Design Motivation: Treating the output of the generative model as intermediate products rather than final results converts the generation process from a "dice roll" to "roll the dice first, then choose the best combination," which raises the chance of reaching a satisfactory image through composition.
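
The compositing rule described under Key Design 2 can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the authors' implementation: the names `composite_qkv` and `self_attention`, the flattened per-token label array, and the single-head attention are hypothetical simplifications, and the label map is assumed to have already been resized to the token resolution of the layer in question.

```python
import numpy as np


def composite_qkv(labels, q_stored, k_stored, v_stored, q_model, base_idx):
    """Assemble composite Q/K/V for one self-attention layer (hypothetical helper).

    labels:   (n_tokens,) int array, graph cut label (image index) per token
    *_stored: (n_images, n_tokens, d) features saved during generation
    q_model:  (n_tokens, d) queries produced by the model in the current pass
    base_idx: index of the image the user designated as the base image
    """
    tokens = np.arange(labels.shape[0])
    # K and V: every token takes the stored feature of the image it is assigned to.
    k_comp = k_stored[labels, tokens]
    v_comp = v_stored[labels, tokens]
    # Q: stored queries for non-base regions, live model queries in the base region.
    q_comp = q_stored[labels, tokens]
    in_base = (labels == base_idx)[:, None]
    q_comp = np.where(in_base, q_model, q_comp)
    return q_comp, k_comp, v_comp


def self_attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because only the base region keeps the live queries, the base image can still adapt its structure near seams, while the appearance of every region is pinned by the stored K and V.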

Loss & Training

This method is a plug-and-play inference-time framework requiring no additional training. The graph cut optimization is solved using the \(\alpha\)-expansion algorithm for multi-label problems. The hyperparameters \(\lambda=100\) and \(\sigma=10\) are fixed in all experiments.
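
To make the energy concrete, the sketch below evaluates the unary and pairwise terms for a given labeling with the stated values \(\lambda=100\), \(\sigma=10\), and \(C=10^6\). It is a hedged illustration only: the helper names, the 4-connected neighborhood, the Euclidean norm for the feature distance, and the choice to sum the pairwise weight over all images are assumptions, and the \(\alpha\)-expansion solver itself is omitted (the code scores a labeling; it does not optimize one).

```python
import numpy as np

LAMBDA, SIGMA, STROKE_COST = 100.0, 10.0, 1e6  # values quoted in the text


def pca_reduce(tokens, n_components=10):
    """Project (n_tokens, d) features onto their top principal components."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def pairwise_weights(features, axis):
    """Cost of cutting between neighbors along `axis` (0 = vertical, 1 = horizontal).

    features: (n_images, h, w, d) reduced features.
    Assumption: per-image weights are summed over images.
    """
    diff = np.diff(features, axis=axis + 1)
    dist = np.linalg.norm(diff, axis=-1)
    return (LAMBDA * np.exp(-dist / (2.0 * SIGMA))).sum(axis=0)


def energy(labels, strokes, features):
    """Total energy of a labeling.

    labels:  (h, w) int array of image indices
    strokes: (h, w) int array, image index under a user stroke, -1 elsewhere
    """
    # Unary term: a large penalty for every stroked location with the wrong label.
    unary = STROKE_COST * np.count_nonzero((strokes >= 0) & (strokes != labels))
    # Pairwise term: pay the edge weight wherever neighboring labels differ.
    cut_v = labels[1:, :] != labels[:-1, :]
    cut_h = labels[:, 1:] != labels[:, :-1]
    pairwise = (pairwise_weights(features, 0)[cut_v].sum()
                + pairwise_weights(features, 1)[cut_h].sum())
    return unary + pairwise
```

Under these assumptions, a seam that crosses a strong feature edge pays a small exponential weight, while a seam through a uniform region pays close to \(\lambda\) per image, which is what pushes the optimizer toward semantic boundaries.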

Key Experimental Results

Main Results

| Method | Masked LPIPS ↓ | Masked SSIM ↑ | PSNR ↑ | Seam Gradient Score |
|---|---|---|---|---|
| Ours | 0.123 | 0.815 | 22.46 | 0.339 (within range) |
| IDP [1] | 0.104 | 0.888 | 20.13 | 0.306 |
| BLD [4] | 0.222 | 0.772 | 20.27 | 0.393 |
| MasaCtrl+CtrlNet | 0.230 | 0.680 | 18.34 | 0.341 |
| CollageDiffusion | 0.243 | 0.605 | 20.57 | 0.559* (out of range) |

Ablation Study

| Configuration | Masked LPIPS ↓ | PSNR ↑ | Description |
|---|---|---|---|
| Full model (Ours) | 0.123 | 22.46 | Full model |
| w/ K^concat, V^concat | 0.243 | 18.37 | Using concatenated K/V instead of composite features seriously degrades local fidelity |
| w/ K^model, V^model | 0.268 | 18.85 | Using model-generated K/V instead of stored features loses the original appearance |

Key Findings

  • The injection scheme of K/V is key to preserving local appearance: utilizing composite masks to inject stored K/V achieves a PSNR that is 3.6-4.1 dB higher than alternative schemes.
  • The handling of Q is highly subtle: using \(Q^{model}\) instead of \(Q^B\) for the base region avoids structural rigidity at seams, aligning the low-resolution graph cut boundary with high-resolution semantic features.
  • User studies (324 responses) indicate that this method significantly outperforms all baselines in blending quality and is comparable to MasaCtrl+CtrlNet in realism.
  • Compared to single-image segmentation with SAM, the graph-cut-based multi-image segmentation is significantly better at label consistency and user stroke adherence.
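
The 3.6-4.1 dB range quoted above follows directly from the ablation numbers, and the short check below reproduces it. The conversion to a mean-squared-error ratio uses the standard definition \(\mathrm{PSNR} = 10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})\) and is only indicative, since the reported PSNR values are presumably averages over many images; the dictionary keys are illustrative names.

```python
# PSNR values copied from the ablation results (dB).
psnr = {"ours": 22.46, "k_concat_v_concat": 18.37, "k_model_v_model": 18.85}

# Gap between the full model and each alternative K/V injection scheme.
gaps_db = {name: psnr["ours"] - value
           for name, value in psnr.items() if name != "ours"}

# A PSNR gap of g dB corresponds to an MSE ratio of 10 ** (g / 10).
mse_ratio = {name: 10 ** (gap / 10) for name, gap in gaps_db.items()}

for name in gaps_db:
    print(f"{name}: +{gaps_db[name]:.2f} dB, ~{mse_ratio[name]:.2f}x lower MSE")
```

Read this way, the full model's error in the measured regions is a bit more than half that of either alternative.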

Highlights & Insights

  • Clever use of diffusion features as the graph cut space: Migrating classic graph cut optimization from pixel space to the self-attention feature space of the diffusion model simultaneously solves two problems: seam placement and feature blending. This idea can be generalized to any task requiring cross-image semantic alignment.
  • Insight into \(Q^{model}\) vs \(Q^B\): The paper shows concretely how the decoupling of Q (structure) from K/V (appearance) plays out in a synthesis task. This provides theoretical and experimental support for retaining the model's adaptive Q in the base region.
  • Shift in the interactive generation paradigm: Moving from "asking the model to generate the correct result in one shot" to "asking the model to generate multiple candidates and then combining them." This change in mindset is a useful model for interactive generative content creation.

Limitations & Future Work

  • Storing Q/K/V features for every timestep and every self-attention layer of all images beforehand incurs large memory overhead.
  • It relies on the shared spatial structure of ControlNet images, making it unable to handle mixtures of images generated under different conditions.
  • It is implemented based on Stable Diffusion 1.5, and its generalizability to newer diffusion architectures (e.g., DiT, SD3) has not been verified.
  • When the user-selected regions exhibit severe semantic conflicts, the synthesis result may not be natural enough.

Comparison with Related Work

  • vs Interactive Digital Photomontage: The classic work performs graph cuts + gradient-domain blending in pixel space. This paper conducts graph cuts + self-attention injection in the diffusion feature space, solving color distortion and sensitivity to low-level noise.
  • vs MasaCtrl / Cross-Image Attention: These methods prioritize high-level style transfer, which can alter local appearances. This paper emphasizes the precise preservation of the local appearance of user-selected regions.
  • vs Blended Latent Diffusion: Blending in noise space can lead to changes in structure and appearance. This paper blends in the attention feature space, yielding higher fidelity.
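
Returning to the memory overhead noted among the limitations, a rough back-of-the-envelope estimate is sketched below. Every number in it is an assumption rather than a figure from the paper: a 512×512 image (64×64 latent), the commonly described Stable Diffusion 1.5 UNet layout with 16 self-attention layers, fp16 storage, and 50 denoising steps. Classifier-free guidance batches and any attention layers inside the ControlNet branch are ignored, and the names `AttnLevel` and `qkv_cache_bytes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnLevel:
    side: int       # spatial side length of the feature map (tokens = side ** 2)
    channels: int   # feature dimension of Q, K, and V at this level
    n_layers: int   # number of self-attention layers at this level


# Assumed SD 1.5 UNet layout for a 64x64 latent (not taken from the paper).
ASSUMED_SD15_LEVELS = (
    AttnLevel(side=64, channels=320, n_layers=5),
    AttnLevel(side=32, channels=640, n_layers=5),
    AttnLevel(side=16, channels=1280, n_layers=5),
    AttnLevel(side=8, channels=1280, n_layers=1),
)


def elements_per_tensor(levels):
    """Values in one of Q, K, or V, summed over all layers, for one timestep."""
    return sum(lv.side ** 2 * lv.channels * lv.n_layers for lv in levels)


def qkv_cache_bytes(levels, n_steps, n_images, bytes_per_value=2):
    """Bytes needed to store Q, K, and V for every layer, step, and image."""
    return 3 * elements_per_tensor(levels) * bytes_per_value * n_steps * n_images


if __name__ == "__main__":
    total = qkv_cache_bytes(ASSUMED_SD15_LEVELS, n_steps=50, n_images=1)
    print(f"{total / 2**30:.2f} GiB per image under the stated assumptions")
```

Under these assumptions the cache reaches the order of gigabytes per candidate image, so it grows quickly with the size of the image stack.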

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of migrating classic graph cuts to the diffusion feature space is novel, but the core components (graph cuts, attention injection) have foundations in prior work.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely thorough quantitative evaluation, user studies, ablation studies, and comparisons with various baselines.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear motivations, detailed methodology description, and high-quality figures.
  • Value: ⭐⭐⭐⭐ Proposes a practical interactive framework for image creation, which offers direct value to creative workflows.