OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models¶

Conference: ECCV 2024
arXiv: 2403.10983
Area: Image Generation

TL;DR¶

OMG is an occlusion-friendly framework for personalized multi-concept image generation. Its two-stage sampling (layout generation followed by concept noise blending) yields strong identity preservation and natural lighting harmonization. It works out of the box with various single-concept models (such as LoRA and InstantID) without additional training.

Background & Motivation¶

Multi-concept customization in personalized text-to-image generation is an important yet challenging task. Existing methods face four major challenges:

Identity Degradation: When multiple concepts are generated simultaneously in a single image, the identity preservation of each concept drops significantly (experiments show that Identity Alignment is markedly lower when two concepts are generated together than when each is generated individually).

Failure in Occlusion Handling: When concept regions overlap, methods like Mix-of-Show merge cross-attention results through simple linear addition, which leads to layout conflicts and identity degradation.

Incoherence between Foreground and Background: The lighting of the foreground and background in multi-concept images is often unnatural.

Additional Training Overhead: Existing methods require additional network optimization to merge multiple concept models, resulting in high computational overhead.

Method¶

Overall Architecture¶

OMG adopts a two-stage sampling scheme:

Stage 1 (Layout Generation + Visual Understanding Information Collection): Generates a non-customized image with a reasonable layout using a text prompt without personalized identifiers, while simultaneously collecting cross-attention maps and concept masks.
Stage 2 (Multi-concept Personalized Denoising): Integrates multiple concepts under occlusion considerations by utilizing the collected visual understanding information and the proposed concept noise blending strategy.

Key Designs¶

Visual Understanding Information Preparation (Stage 1):

A text prompt \(p\) containing only category names (e.g., "man", "woman") rather than personalized identifiers is input to generate a non-customized image \(x_{ncus} = T2I(p)\) via SDXL. During the denoising process, the cross-attention maps \(A_t\) of each step are stored. A visual understanding model is then utilized to extract concept masks \(\{M_1, M_2, \cdots, M_k\}\) from the generated image and the category names.

Concept Noise Blending:

Core formula:

\[z_{t-1} = (1 - \bigcup_{i=1}^{k} M_i) * z'_{t-1} + \sum_{i=1}^{k} M_i * C_{t-1}^i\]

where \(z'_{t-1}\) represents the global non-customized noise, and \(C_{t-1}^i = T2I_c^i(z_t, p^i, t)\) is the concept noise generated by the \(i\)-th single-concept model under its exclusive text prompt \(p^i\). Each single-concept model is solely responsible for generating its specific region, effectively mitigating identity degradation.
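The blending rule above can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: latents are reduced to small 2D grids of floats, the masks are assumed to be binary and non-overlapping, and the names `union_mask` and `blend_concept_noise` are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[float]]


def union_mask(masks: Sequence[Grid]) -> Grid:
    """Pixel-wise union of binary masks: 1.0 wherever any concept mask is 1.0."""
    height, width = len(masks[0]), len(masks[0][0])
    return [[max(m[r][c] for m in masks) for c in range(width)]
            for r in range(height)]


def blend_concept_noise(z_global: Grid,
                        concept_latents: Sequence[Grid],
                        masks: Sequence[Grid]) -> Grid:
    """Compute z_{t-1} = (1 - union(M_i)) * z'_{t-1} + sum_i M_i * C^i_{t-1}.

    Assumes binary, non-overlapping masks; overlapping masks would add up
    the contributions of several concepts in the shared region.
    """
    if len(concept_latents) != len(masks):
        raise ValueError("one mask is required per concept latent")
    union = union_mask(masks)
    height, width = len(z_global), len(z_global[0])
    # Background: keep the global (non-customized) prediction outside all masks.
    out = [[(1.0 - union[r][c]) * z_global[r][c] for c in range(width)]
           for r in range(height)]
    # Foreground: paste each concept's prediction into its own region.
    for latent, mask in zip(concept_latents, masks):
        for r in range(height):
            for c in range(width):
                out[r][c] += mask[r][c] * latent[r][c]
    return out


# Toy 1x4 "latent": two concepts, one background pixel.
z_global = [[9.0, 9.0, 9.0, 9.0]]
c1 = [[1.0, 1.0, 1.0, 1.0]]
c2 = [[2.0, 2.0, 2.0, 2.0]]
m1 = [[1.0, 1.0, 0.0, 0.0]]
m2 = [[0.0, 0.0, 1.0, 0.0]]
blended = blend_concept_noise(z_global, [c1, c2], [m1, m2])
```

In the toy example, the first two positions take the first concept's values, the third takes the second concept's value, and the last keeps the global value.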

Occlusion Layout Preservation:

At each timestep in the second stage, the attention map currently generated by the global model, \(A_t^g\), is overwritten by the corresponding attention map \(A_t\) stored from Stage 1, so that the layout stays consistent with the non-customized image:

\[z'_{t-1} = T2I(z_t, p, t)\{A_t^g \leftarrow A_t\}\]
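The overwrite \(A_t^g \leftarrow A_t\) can be sketched with a toy cross-attention function. This is a simplified illustration under assumptions, not the OMG code: a single head, tiny hand-written matrices, and a hypothetical `override` argument standing in for whatever hook a real UNet implementation would use.

```python
import math
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries: Matrix, keys: Matrix) -> Matrix:
    """Scaled dot-product attention probabilities, one row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
            for q in queries]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix,
                    override: Optional[Matrix] = None) -> Tuple[Matrix, Matrix]:
    """Return (output, attention map used).

    If `override` is given, it replaces the freshly computed map, which
    mimics overwriting A_t^g with the stored Stage 1 map A_t.
    """
    probs = override if override is not None else attention_map(queries, keys)
    dim = len(values[0])
    out = [[sum(p * v[d] for p, v in zip(row, values)) for d in range(dim)]
           for row in probs]
    return out, probs


keys = [[1.0, 0.0], [0.0, 1.0]]      # two text tokens
values = [[1.0, 0.0], [0.0, 1.0]]
t = 10                               # arbitrary timestep label

# Stage 1: run normally and store the attention map for this timestep.
store: Dict[int, Matrix] = {}
_, store[t] = cross_attention([[2.0, 0.0]], keys, values)

# Stage 2: the latent (and hence the query) has changed.
free_out, free_probs = cross_attention([[0.0, 2.0]], keys, values)
kept_out, kept_probs = cross_attention([[0.0, 2.0]], keys, values,
                                       override=store[t])
```

Without the override, the Stage 2 query attends mostly to the second token; with it, the Stage 1 attention pattern (and therefore the Stage 1 layout) is reused.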

Influence of the Starting Denoising Timestep:

Experiments show that the starting timestep of concept noise blending exerts a crucial impact on identity preservation and layout. The early steps control the image layout, while the later steps display the concept identity. The optimal starting step is around step 35 (out of 50 total DDIM steps). Lighting incoherence is related to the image layout information and gradually harmonizes as the starting step shifts later.
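How the starting step gates the blending can be sketched as a toy Stage 2 loop. Several things here are assumptions made only for illustration: the step index counts denoising iterations from the start of sampling (the text above does not fix this convention), latents are flat lists of floats, and `global_step` and `concept_steps` are hypothetical stand-ins for the global and single-concept denoisers.

```python
from typing import Callable, List, Sequence, Tuple

Latent = List[float]
StepFn = Callable[[Latent, int], Latent]


def stage2_sampling(z_start: Latent,
                    total_steps: int,
                    start_step: int,
                    global_step: StepFn,
                    concept_steps: Sequence[StepFn],
                    masks: Sequence[Latent]) -> Tuple[Latent, List[int]]:
    """Run a toy denoising loop that blends concept predictions from
    `start_step` onward and uses only the global prediction before it."""
    z = list(z_start)
    blended_at: List[int] = []
    for step in range(total_steps):
        z_global = global_step(z, step)
        if step < start_step:
            # Early steps: only the global model runs, so layout is set here.
            z = z_global
            continue
        # Later steps: inject each concept inside its mask.
        union = [max(m[j] for m in masks) for j in range(len(z))]
        z_next = [(1.0 - union[j]) * z_global[j] for j in range(len(z))]
        for concept_step, mask in zip(concept_steps, masks):
            c = concept_step(z, step)
            z_next = [z_next[j] + mask[j] * c[j] for j in range(len(z))]
        blended_at.append(step)
        z = z_next
    return z, blended_at


# Toy denoisers: the global model halves the latent, the concept model
# always predicts 1.0. Neither is a real diffusion model.
halve: StepFn = lambda z, step: [0.5 * v for v in z]
ones: StepFn = lambda z, step: [1.0 for _ in z]

final_z, blended_at = stage2_sampling(
    z_start=[4.0, 4.0], total_steps=50, start_step=35,
    global_step=halve, concept_steps=[ones], masks=[[1.0, 0.0]],
)
```

Under this convention, a starting step of 35 with 50 total steps leaves 15 steps in which concept identities are injected.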

Loss & Training¶

OMG itself requires no training: it modifies only the sampling process at inference time. The single-concept LoRA models it relies on are trained with the standard diffusion loss:

\[\mathcal{L} = E_{z_0, \epsilon, t} \|\epsilon - \varepsilon_\theta(z_t, t, c)\|_2^2\]

The LoRA rank is set to 256, with a text encoder learning rate of \(3 \times 10^{-5}\) and a UNet learning rate of \(3 \times 10^{-3}\).
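The loss above can be estimated by Monte Carlo sampling, as in the following sketch. It makes assumptions that are not stated in this summary: the usual DDPM forward process \(z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon\), a noise predictor that receives \(\bar\alpha_t\) in place of \(t\), and no conditioning \(c\). All function names are hypothetical.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def add_noise(z0: Vector, eps: Vector, alpha_bar: float) -> Vector:
    """Forward diffusion: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(z0, eps)]


def squared_error(eps: Vector, eps_pred: Vector) -> float:
    """Squared L2 norm ||eps - eps_pred||_2^2."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def estimate_loss(predict_noise: Callable[[Vector, float], Vector],
                  z0: Vector,
                  alpha_bars: List[float],
                  num_samples: int,
                  seed: int = 0) -> float:
    """Monte Carlo estimate of the expectation over noise and timestep."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        alpha_bar = rng.choice(alpha_bars)          # stands in for sampling t
        eps = [rng.gauss(0.0, 1.0) for _ in z0]     # eps ~ N(0, I)
        z_t = add_noise(z0, eps, alpha_bar)
        total += squared_error(eps, predict_noise(z_t, alpha_bar))
    return total / num_samples


z0 = [0.3, -1.2, 0.7]
alpha_bars = [0.9, 0.5, 0.1]


def oracle(z_t: Vector, alpha_bar: float) -> Vector:
    """Cheating predictor that knows z0 and inverts the forward process."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(zt - a * x) / b for zt, x in zip(z_t, z0)]


def zero_predictor(z_t: Vector, alpha_bar: float) -> Vector:
    """Untrained baseline that always predicts zero noise."""
    return [0.0 for _ in z_t]


oracle_loss = estimate_loss(oracle, z0, alpha_bars, num_samples=200)
baseline_loss = estimate_loss(zero_predictor, z0, alpha_bars, num_samples=200)
```

The oracle drives the loss to numerical zero, while the zero predictor's loss is the average squared norm of the sampled noise; training a LoRA moves the real predictor from the second regime toward the first.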

Key Experimental Results¶

Main Results¶

Quantitative Comparison of Single-Concept & Multi-Concept Personalization:

Method	Character ID↑ (Single/Multi)	Object ImgAlign↑ (Single/Multi)
DreamBooth	0.456 / 0.480	0.805 / 0.800
Textual Inversion	0.292 / 0.294	0.784 / 0.781
Custom Diffusion	0.370 / 0.322	0.840 / 0.778
Mix-of-show	0.422 / 0.436	0.791 / 0.780
OMG	0.514 / 0.510	0.842 / 0.810

Multi-Concept Compositional Generation Performance:

Method	man+woman	man+man	woman+woman	object+object	Average
DreamBooth	0.302	0.258	0.192	0.784	0.384
Textual Inversion	0.122	0.131	0.064	0.675	0.248
Custom Diffusion	0.265	0.210	0.212	0.757	0.361
Mix-of-show	0.361	0.219	0.143	0.736	0.365
OMG	0.487	0.377	0.293	0.763	0.480

Ablation Study¶

Ablation studies validate the contribution of each component:

Layout Preservation: After its inclusion, the image structure becomes more reasonable, avoiding generative distortions.
Concept Noise Blending vs. Region-Controlled Sampling: The region-controlled sampling of Mix-of-Show results in missing concepts and cluttered layouts in occluded areas, whereas concept noise blending effectively handles occlusion while achieving harmonized foreground-background lighting.
Varying Number of Concepts (1 \(\to\) 5): Even when the number of concepts increases to 5, the identity consistency of each concept is still maintained.

Key Findings¶

The Identity Alignment gap \(\Delta\) between single-concept and multi-concept generation is only -0.004 for OMG, far smaller in magnitude than Custom Diffusion's -0.048.
OMG's average multi-concept score of 0.480 significantly exceeds that of the runner-up, DreamBooth (0.384), a 25% relative improvement.
When combined with InstantID, OMG requires no additional training and generates more natural and realistic colors.
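The arithmetic behind these findings can be checked directly from the two tables in the Main Results. The short script below only recomputes the averages, the single-to-multi gaps, and the relative improvement from the reported numbers; the variable names are illustrative.

```python
from statistics import mean

# Per-setting multi-concept scores: man+woman, man+man, woman+woman, object+object.
multi_concept_scores = {
    "DreamBooth": [0.302, 0.258, 0.192, 0.784],
    "Textual Inversion": [0.122, 0.131, 0.064, 0.675],
    "Custom Diffusion": [0.265, 0.210, 0.212, 0.757],
    "Mix-of-show": [0.361, 0.219, 0.143, 0.736],
    "OMG": [0.487, 0.377, 0.293, 0.763],
}
averages = {name: mean(scores) for name, scores in multi_concept_scores.items()}

# Best baseline by average score, and OMG's relative gain over it.
runner_up_name = max((n for n in averages if n != "OMG"), key=averages.get)
relative_gain = averages["OMG"] / averages[runner_up_name] - 1.0

# Character ID scores as (single-concept, multi-concept) pairs.
character_id = {
    "Custom Diffusion": (0.370, 0.322),
    "OMG": (0.514, 0.510),
}
deltas = {name: multi - single for name, (single, multi) in character_id.items()}

print(f"runner-up: {runner_up_name}, relative gain: {relative_gain:.1%}")
for name, delta in deltas.items():
    print(f"{name}: delta = {delta:+.3f}")
```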

Highlights & Insights¶

Plug-and-play Architecture: It can directly use LoRA models from the civitai.com community without additional training or model merging, which greatly lowers the barrier to multi-concept customization.
Elegant Occlusion Handling: By decoupling layout and identity injection in two stages, it cleverly addresses the long-standing challenge of multi-concept occlusion.
Discovery of Critical Hyperparameters: The starting step of noise blending simultaneously controls identity preservation and lighting harmonization, offering an intuitive means for adjustment.
Flexible Compatibility with Single-Concept Models: It supports both training-based (LoRA) and training-free (InstantID) single-concept methods.

Limitations & Future Work¶

It relies on visual understanding models (e.g., segmentation models) to extract concept masks, making the final quality dependent on the mask quality.
Each concept requires individual model inference, causing the computational overhead to scale linearly as the number of concepts increases.
There is still room for improvement in distinguishing between intra-class concepts (e.g., two women; the Identity Alignment score for woman+woman is only 0.293).

Rating¶

⭐⭐⭐⭐ (4/5)

Novelty: ★★★★ — The design of the two-stage sampling and concept noise blending is novel and intuitive.
Technical Quality: ★★★★ — A training-free, inference-time method with strong engineering practicality.
Experimental Thoroughness: ★★★★★ — Comprehensive comparisons including single-concept/multi-concept/ablation, complete with user studies and quantitative metrics.
Practicality: ★★★★★ — Direct compatibility with community models, offering high practical value.