MagicEraser: Erasing Any Objects via Semantics-Aware Control

Conference: ECCV 2024
arXiv: 2410.10207
Area: Image Generation

TL;DR

Proposes MagicEraser, an object erasure framework based on diffusion models. It combines three components (content initialization, prompt tuning, and semantics-aware attention refocusing) to achieve high-quality object erasure and harmonious background generation without requiring text input from the user.

Background & Motivation

The object erasure task requires removing specified objects from an image and generating a background that harmonizes with the surrounding environment, which is a specific subtask of image inpainting. Existing methods face two major challenges:

Limitations of GAN Methods: Methods like LaMa and MAT perform well on simple repetitive textures (sky, grass), but generate blurry and inconsistent content when facing complex textures or backgrounds with inconsistent illumination.

Dilemma of Diffusion Model Methods: Stable Diffusion Inpainting requires high-quality text prompts to generate reasonable results. Short prompts (e.g., "boat on the lake") tend to generate new, undesired objects, whereas writing precise long descriptions is impractical for typical users.

The underlying reasons are: (1) traditional inpainting training uses random masks, where the model learns to "restore missing regions" rather than "generate harmonious backgrounds"; (2) there is a semantic misalignment between the global text conditions and local erased regions in diffusion models.

Method

Overall Architecture

MagicEraser is built upon Stable Diffusion Inpainting and consists of two stages:

Stage 1 (Content Initialization): employs a pre-trained GAN model (Big-LaMa) to roughly fill the erased region.

Stage 2 (Controllable Generation): uses two plug-and-play modules, prompt tuning and semantics-aware attention refocusing, to control the diffusion generation process.

Key Designs

1. Content Initialization

Generating directly from random noise (denoising strength s=1) tends to deviate from the original image. Instead, a pre-trained inpainting model (Big-LaMa) first fills the erased region; the result is encoded into a VAE latent and used as the diffusion starting point with s=0.9. This balances texture harmony with the surrounding image and helps prevent the generation of unwanted objects.
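
The effect of the denoising strength can be illustrated with a minimal sketch. It assumes the standard DDPM forward process and the scaled-linear beta schedule commonly used with Stable Diffusion (1000 steps, beta from 0.00085 to 0.012); neither detail is stated in these notes, and `alphas_cumprod` and `noise_latent` are hypothetical helper names. A flat list of floats stands in for the VAE latent of the LaMa-filled image.

```python
import math
import random

def alphas_cumprod(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear schedule (assumed)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_latent(z0, strength, abar, rng):
    """Noise the initial latent z0 up to the timestep implied by `strength`.

    Returns (t_start, z_t, signal_scale), where signal_scale = sqrt(abar[t_start])
    is the factor by which the initial content survives in z_t.
    """
    t_start = max(1, round(strength * len(abar))) - 1
    signal = math.sqrt(abar[t_start])
    sigma = math.sqrt(1.0 - abar[t_start])
    z_t = [signal * z + sigma * rng.gauss(0.0, 1.0) for z in z0]
    return t_start, z_t, signal

abar = alphas_cumprod()
latent = [0.5, -0.2, 0.1, 0.8]  # toy stand-in for the encoded LaMa result
t_09, z_09, keep_09 = noise_latent(latent, 0.9, abar, random.Random(0))
t_10, z_10, keep_10 = noise_latent(latent, 1.0, abar, random.Random(0))
```

Under these assumptions, s=0.9 starts denoising 100 steps earlier than s=1 and keeps a larger share of the initialized content in the starting latent, which is one way to read the trade-off described above.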

2. Prompt Tuning

Combines Textual Inversion and LoRA to achieve automatic prompting without user input:

  • Defines a placeholder "R∗" representing the concept of "background completion" and trains its token embedding v∗.
  • Automatically obtains background labels (e.g., "sky", "beach") using panoptic segmentation to construct the prompt "A photo of R∗ sky".
  • Trains with a 50% probability of using short prompts and a 50% probability of using long descriptions generated by LLaVA.
  • Trains only the LoRA parameters and v∗ to avoid damaging the generation capability of the pre-trained model.
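
The prompt construction described above can be sketched as follows. This is an illustration only: `build_prompt` is a hypothetical helper, the ASCII string "R*" stands in for the learned placeholder token, and joining several background labels with spaces is an assumption the notes do not confirm.

```python
import random

PLACEHOLDER = "R*"  # stands in for the learned placeholder token R∗

def build_prompt(background_labels, long_caption, rng, p_short=0.5):
    """Return a short placeholder prompt with probability p_short, else the long caption.

    Falls back to the short prompt when no long caption is available.
    """
    if rng.random() < p_short or not long_caption:
        labels = " ".join(background_labels)
        return f"A photo of {PLACEHOLDER} {labels}".strip()
    return long_caption

rng = random.Random(0)
short_prompt = build_prompt(["sky"], "", rng)
training_prompt = build_prompt(["sky"], "A calm lake under a cloudy sky.", rng)
```

At inference time only the short form is needed, since the background labels come from segmentation rather than from the user.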

3. Semantics-Aware Attention Refocusing

Uses Mask2Former panoptic segmentation results to classify pixels into three categories:

  • Masked region (m): The region that needs to be filled.
  • Positive region (p): Regions whose semantics belong to the background.
  • Negative region (n): Regions whose semantics are similar to the erased object.
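
A minimal sketch of this three-way split is given below. It makes two simplifying assumptions that are not from the source: "similar to the erased object" is approximated as "same class label as the erased object", and every other unmasked pixel counts as positive. `classify_regions` is a hypothetical name, and nested lists stand in for the segmentation map.

```python
def classify_regions(class_map, erase_mask, object_class):
    """Label each pixel 'm' (masked), 'n' (negative), or 'p' (positive)."""
    regions = []
    for class_row, mask_row in zip(class_map, erase_mask):
        row = []
        for cls, masked in zip(class_row, mask_row):
            if masked:
                row.append("m")  # pixel to be filled
            elif cls == object_class:
                row.append("n")  # same class as the erased object (assumed rule)
            else:
                row.append("p")  # treated as background
        regions.append(row)
    return regions

class_map = [["sky", "boat", "boat"],
             ["water", "water", "boat"]]
erase_mask = [[0, 1, 0],
              [0, 0, 0]]
regions = classify_regions(class_map, erase_mask, "boat")
```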

Modulates self-attention: strengthens the attention from the masked region to positive regions, while suppressing its attention to negative regions and to the masked region itself:

\[M = W_{pos} \cdot Mask_{pos} - W_{neg} \cdot Mask_{neg}\]
\[A' = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d}}\right)\]
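
The modulation above can be sketched on a toy token sequence. This is not the authors' implementation: it assumes the bias M is applied only to rows whose query lies in the masked region, that keys in the masked region are suppressed with the same weight as negative keys, and that W_pos and W_neg are scalars. The function name and the weight values are illustrative.

```python
import math

def refocused_attention(q, k, regions, w_pos=1.0, w_neg=1.0):
    """Row-wise softmax((q.k + M) / sqrt(d)) with M acting on masked queries only.

    regions[i] is 'm', 'p', or 'n' for token i.
    """
    d = len(q[0])
    attn = []
    for i, qi in enumerate(q):
        logits = []
        for j, kj in enumerate(k):
            score = sum(a * b for a, b in zip(qi, kj))
            if regions[i] == "m":
                if regions[j] == "p":
                    score += w_pos  # encourage attention to background
                else:
                    score -= w_neg  # suppress negative keys and the mask itself
            logits.append(score / math.sqrt(d))
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
regions = ["m", "p", "n", "p"]
plain = refocused_attention(tokens, tokens, regions, w_pos=0.0, w_neg=0.0)
refocused = refocused_attention(tokens, tokens, regions, w_pos=2.0, w_neg=2.0)
```

In this toy setup, the masked token's attention shifts toward the two positive tokens, while rows belonging to unmasked tokens are left unchanged.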

4. Training Data Construction (OLRD)

An innovative data construction strategy: objects are selected from the original image, translated to regions annotated as background by panoptic segmentation, and blended to construct "object-mask-clean background" triplets, directly teaching the model the concept of "erasing objects to restore background".
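
A rough sketch of how such a triplet might be assembled is shown below. It uses a hard paste instead of blending and a fixed translation offset, both simplifications; `paste_object` is a hypothetical helper, and nested lists of integers stand in for image pixels and masks.

```python
def paste_object(image, object_mask, background_mask, dy, dx):
    """Copy the masked object shifted by (dy, dx).

    Returns (composite, new_mask), or None if any shifted pixel leaves the
    image or lands outside the region annotated as background.
    """
    h, w = len(image), len(image[0])
    composite = [row[:] for row in image]
    new_mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not object_mask[y][x]:
                continue
            ty, tx = y + dy, x + dx
            if not (0 <= ty < h and 0 <= tx < w) or not background_mask[ty][tx]:
                return None
            composite[ty][tx] = image[y][x]  # hard paste; real blending omitted
            new_mask[ty][tx] = 1
    return composite, new_mask

image = [[9, 1, 1, 1],
         [1, 1, 1, 1],
         [1, 1, 1, 1]]
object_mask = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
background_mask = [[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
composite, new_mask = paste_object(image, object_mask, background_mask, 1, 2)
triplet = (composite, new_mask, image)  # input, erase mask, clean target
```

The key property is that the ground truth under the new mask is real background from the original image, so the training target never contains an object.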

Loss & Training

Joint optimization of LoRA fine-tuning and Textual Inversion:

\[L = \mathbb{E}_{z_0, z_{masked}, m, t, y, \epsilon_t}\left[\|\epsilon_t - \epsilon_{\theta, \phi}(z'_t, t, \tau(y), m)\|^2\right]\]

Key Experimental Results

Main Results

| Method | Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ |
| --- | --- | --- | --- | --- | --- |
| MAT | OpenImages | 26.994 | 0.949 | 0.030 | 31.30 |
| Co-Mod | OpenImages | 26.446 | 0.941 | 0.033 | 30.40 |
| LaMa | OpenImages | 21.618 | 0.936 | 0.055 | 37.10 |
| SD Inpainting | OpenImages | 26.096 | 0.942 | 0.036 | 31.10 |
| MagicEraser | OpenImages | 28.123 | 0.947 | 0.032 | 30.02 |
| MAT | RealHM | 21.484 | 0.843 | 0.107 | 51.73 |
| SD Inpainting | RealHM | 21.758 | 0.846 | 0.116 | 45.05 |
| MagicEraser | RealHM | 23.620 | 0.861 | 0.101 | 46.56 |

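MagicEraser leads in PSNR on both datasets listed here, outperforming MAT by 1.1 dB on OpenImages and SD Inpainting by 1.9 dB on RealHM. The lead is not uniform across metrics: MAT retains slightly better SSIM and LPIPS on OpenImages, and SD Inpainting has a lower FID on RealHM.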

Ablation Study

| Model Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ |
| --- | --- | --- | --- | --- |
| Baseline + Random Mask Training | 21.331 | 0.815 | 0.134 | 52.10 |
| Baseline + OLRD | 22.130 | 0.834 | 0.119 | 50.73 |
| + Content Initialization | 22.891 | 0.840 | 0.109 | 48.91 |
| + Attention Refocusing | 23.277 | 0.844 | 0.110 | 48.93 |
| + Prompt Tuning | 23.311 | 0.858 | 0.104 | 47.94 |
| MagicEraser (All) | 23.620 | 0.861 | 0.101 | 46.56 |

Each component makes a positive contribution to PSNR: OLRD data construction is the foundation (+0.80 dB over random mask training), content initialization provides a good starting point (+0.76 dB), and prompt tuning contributes slightly more than attention refocusing (23.311 vs. 23.277 PSNR, with clearer gains in SSIM and FID).
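
The PSNR gains can be recomputed from the ablation table with a few lines of arithmetic. The snippet assumes that the "+ Attention Refocusing" and "+ Prompt Tuning" rows are each added on top of content initialization rather than stacked on one another; the table does not state this explicitly.

```python
ABLATION_PSNR = {
    "Baseline + Random Mask Training": 21.331,
    "Baseline + OLRD": 22.130,
    "+ Content Initialization": 22.891,
    "+ Attention Refocusing": 23.277,
    "+ Prompt Tuning": 23.311,
    "MagicEraser (All)": 23.620,
}

def gain(later, earlier, table=ABLATION_PSNR):
    """PSNR difference in dB between two configurations, rounded to 3 decimals."""
    return round(table[later] - table[earlier], 3)

olrd_gain = gain("Baseline + OLRD", "Baseline + Random Mask Training")
init_gain = gain("+ Content Initialization", "Baseline + OLRD")
refocus_gain = gain("+ Attention Refocusing", "+ Content Initialization")
prompt_gain = gain("+ Prompt Tuning", "+ Content Initialization")
total_gain = gain("MagicEraser (All)", "Baseline + Random Mask Training")
```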

Key Findings

  • MagicEraser outperforms the commercial products Adobe Photoshop Generative Fill (23.620 vs. 22.913 PSNR) and Google Photos Eraser (23.620 vs. 20.310 PSNR).
  • Prompt tuning (global semantic control) is more important than attention refocusing (local spatial control), and they exhibit good complementarity.
  • OLRD data construction brings a 0.8dB PSNR improvement compared to traditional random mask training.
  • The attention refocusing module requires no training and is applied only in the early denoising steps (t = 1 to 0.7).

Highlights & Insights

  1. User-Friendly Design: Eliminates the need for manual text prompts by using automatic panoptic segmentation and the learned R∗ concept, significantly lowering the user barrier.
  2. Clever Data Construction Strategy: Translating objects into background regions avoids the "foreground restoration" bias of traditional inpainting training, directly training the model for "background completion".
  3. Semantics-Aware Attention Modulation: Unlike search-based loss optimization for attention maps, directly modifying attention scores is more efficient and training-free.
  4. High Practical Value: Performance outperforming commercial products demonstrates strong potential for practical deployment.

Limitations & Future Work

  • Heavy reliance on the panoptic/semantic segmentation model, where segmentation quality directly affects the accuracy of positive/negative region division.
  • Experiments only conducted at 512×512 resolution; performance in high-resolution scenarios remains to be validated.
  • Hyperparameters such as the attention modulation weights W_pos and W_neg need to be set manually.
  • Performance may degrade when the erased region occupies an extremely large proportion of the image.

Rating

  • Novelty: ⭐⭐⭐⭐ — Semantics-aware attention refocusing and OLRD data construction strategies are novel.
  • Technical Depth: ⭐⭐⭐⭐ — The multi-stage cooperative design is complete, and each component has a clear motivation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets + commercial product comparison + comprehensive ablation.
  • Writing Quality: ⭐⭐⭐⭐ — The problem definition is clear, and the distinction from traditional inpainting is well articulated.