MagicEraser: Erasing Any Objects via Semantics-Aware Control¶

Conference: ECCV 2024
arXiv: 2410.10207
Area: Image Generation

TL;DR¶

Proposes MagicEraser, an object erasure framework based on diffusion models. It combines three components (content initialization, prompt tuning, and semantics-aware attention refocusing) to achieve high-quality object erasure and harmonious background generation without requiring user text input.

Background & Motivation¶

The object erasure task requires removing specified objects from an image and generating a background that harmonizes with the surrounding environment, which is a specific subtask of image inpainting. Existing methods face two major challenges:

Limitations of GAN Methods: Methods like LaMa and MAT perform well on simple repetitive textures (sky, grass), but generate blurry and inconsistent content when facing complex textures or backgrounds with inconsistent illumination.

Dilemma of Diffusion Model Methods: Stable Diffusion Inpainting requires high-quality text prompts to generate reasonable results. Short prompts (e.g., "boat on the lake") are prone to generating new undesired objects, whereas obtaining precise long descriptions is highly unfriendly to average users.

The underlying reasons are: (1) traditional inpainting training uses random masks, where the model learns to "restore missing regions" rather than "generate harmonious backgrounds"; (2) there is a semantic misalignment between the global text conditions and local erased regions in diffusion models.

Method¶

Overall Architecture¶

MagicEraser is built upon Stable Diffusion Inpainting and consists of two stages:

Stage 1 (Content Initialization): Employs a pre-trained GAN model (Big-LaMa) to roughly fill the erased region.

Stage 2 (Controllable Generation): Utilizes two plug-and-play modules to control the diffusion generation process.

Key Designs¶

1. Content Initialization

Generating directly from random noise (denoising strength s=1) tends to deviate from the original image. Instead, a pre-trained inpainting model (LaMa) first fills the erased region; the result is encoded into the VAE latent space and used as the diffusion starting point with s=0.9. This keeps textures consistent with the surroundings and discourages the generation of unwanted objects.
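
The role of the denoising strength can be illustrated with a minimal sketch. It assumes a standard DDPM forward process with a linear beta schedule and 1000 training steps; the schedule, the function names, and the random stand-in latent are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def start_timestep(strength, num_train_steps=1000):
    """Map a denoising strength s in (0, 1] to the timestep where sampling starts."""
    return min(int(round(strength * num_train_steps)), num_train_steps) - 1

def noise_initial_latent(z_init, strength, alpha_bar, rng):
    """Forward-diffuse the initialized latent to the starting timestep."""
    t = start_timestep(strength, len(alpha_bar))
    eps = rng.standard_normal(z_init.shape)
    z_t = np.sqrt(alpha_bar[t]) * z_init + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, t

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
# Random stand-in for the VAE latent of the LaMa-filled image (hypothetical shape).
z_init = rng.standard_normal((4, 64, 64))
z_t, t = noise_initial_latent(z_init, strength=0.9, alpha_bar=alpha_bar, rng=rng)
```

Under these assumptions, s=0.9 starts sampling at an earlier timestep than s=1, so the starting latent keeps a larger share of the initialized content and a smaller share of pure noise.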

2. Prompt Tuning

Combines Textual Inversion and LoRA to achieve automatic prompting without user input:
- Defines a placeholder "R∗" representing the concept of "background completion" and trains its token embedding v∗.
- Automatically obtains background labels (e.g., "sky", "beach") using panoptic segmentation to construct the prompt "A photo of R∗ sky".
- Trains with a 50% probability of using short prompts and a 50% probability of using long descriptions generated by LLaVA.
- Trains only the LoRA parameters and v∗ to avoid damaging the generation capability of the pre-trained model.
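
The prompt construction described above can be sketched as follows. The helper names are hypothetical, and two details are assumptions: that multiple background labels are joined with spaces, and that the long LLaVA caption is used as-is.

```python
import random

PLACEHOLDER = "R∗"  # learned concept token; its embedding v∗ is trained

def short_prompt(background_labels):
    """Build the automatic prompt from panoptic background labels."""
    # Assumption: duplicate labels are dropped and the rest are space-joined.
    labels = " ".join(dict.fromkeys(background_labels))
    return f"A photo of {PLACEHOLDER} {labels}".strip()

def training_prompt(background_labels, long_caption, rng, p_short=0.5):
    """Choose the short prompt or the long caption with equal probability."""
    if rng.random() < p_short:
        return short_prompt(background_labels)
    return long_caption

rng = random.Random(0)
example = short_prompt(["sky"])
samples = [training_prompt(["sky"], "a long caption", rng) for _ in range(1000)]
num_short = sum(s == example for s in samples)
```

At inference time only `short_prompt` would be needed, since the labels come from segmentation rather than from the user.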

3. Semantics-Aware Attention Refocusing

Uses Mask2Former panoptic segmentation results to classify pixels into three categories:
- Masked region (m): The region that needs to be filled.
- Positive region (p): Regions whose semantics belong to the background.
- Negative region (n): Regions whose semantics are similar to the erased object.
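
The three-way split can be sketched with a per-pixel label map. This is an illustration under assumptions: "similar semantics" is simplified to "same class as a class found inside the erase mask", and the function and argument names are hypothetical.

```python
import numpy as np

def split_regions(seg_labels, erase_mask, background_classes):
    """Return boolean maps (m, p, n) for masked, positive, and negative pixels."""
    m = erase_mask.astype(bool)
    bg = np.array(sorted(background_classes))
    # Classes found inside the mask that are not background are treated as the object.
    erased_classes = np.setdiff1d(np.unique(seg_labels[m]), bg)
    p = np.isin(seg_labels, bg) & ~m
    n = np.isin(seg_labels, erased_classes) & ~m
    return m, p, n

# Toy label map: 0 = sky, 1 = grass, 2 = person.
seg = np.array([[0, 0, 0, 0],
                [0, 2, 2, 0],
                [1, 2, 1, 2],
                [1, 1, 1, 2]])
erase = np.zeros_like(seg, dtype=bool)
erase[1, 1] = erase[1, 2] = erase[2, 1] = True  # the person to erase
m, p, n = split_regions(seg, erase, background_classes={0, 1})
```

In the toy example, the second person in the right-hand column falls into the negative region because it shares a class with the erased object.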

Modulates self-attention by enhancing the interaction between the masked region and positive regions while suppressing its interaction with negative regions and with the masked region itself:

\[M = W_{pos} \cdot Mask_{pos} - W_{neg} \cdot Mask_{neg}\]

\[A' = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d}}\right)\]
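
A minimal NumPy sketch of the two equations above. The mask layout is an assumption: rows index queries in the masked region, columns index keys, and Mask_neg is taken to cover both the negative regions and the masked region itself. Token vectors m, p, n are flattened boolean indicators.

```python
import numpy as np

def refocused_attention(Q, K, m, p, n, w_pos=1.0, w_neg=1.0):
    """A' = softmax((Q K^T + M) / sqrt(d)), M = W_pos*Mask_pos - W_neg*Mask_neg."""
    d = Q.shape[-1]
    # Only rows belonging to masked-region queries receive a bias (assumed layout).
    mask_pos = np.outer(m, p).astype(float)
    mask_neg = np.outer(m, n | m).astype(float)
    M = w_pos * mask_pos - w_neg * mask_neg
    logits = (Q @ K.T + M) / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
m = np.array([True, False, False, False, False, False])
p = np.array([False, True, True, False, False, False])
n = np.array([False, False, False, True, True, False])

plain = refocused_attention(Q, K, m, p, n, w_pos=0.0, w_neg=0.0)
refocused = refocused_attention(Q, K, m, p, n, w_pos=5.0, w_neg=5.0)
```

With positive weights, the masked query's attention mass shifts toward the positive tokens and away from the negative ones, while rows for unmasked queries are unchanged.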

4. Training Data Construction (OLRD)

An innovative data construction strategy: objects are selected from the original image, translated to regions annotated as background by panoptic segmentation, and blended to construct "object-mask-clean background" triplets, directly teaching the model the concept of "erasing objects to restore background".
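
The triplet construction can be sketched as a copy-and-translate operation. This is a simplification under assumptions: the object is hard-pasted rather than blended, placements that leave the image or touch non-background pixels are rejected, and the function name is hypothetical.

```python
import numpy as np

def relocate_object(image, object_mask, background_mask, shift):
    """Paste the object at an offset; return (composite, erase_mask, clean_target)."""
    dy, dx = shift
    h, w = object_mask.shape
    ys, xs = np.nonzero(object_mask)
    ty, tx = ys + dy, xs + dx
    # Reject placements that fall outside the image.
    if (ty < 0).any() or (ty >= h).any() or (tx < 0).any() or (tx >= w).any():
        return None
    # Reject placements that would cover non-background pixels.
    if not background_mask[ty, tx].all():
        return None
    composite = image.copy()
    composite[ty, tx] = image[ys, xs]
    erase_mask = np.zeros_like(object_mask, dtype=bool)
    erase_mask[ty, tx] = True
    return composite, erase_mask, image

image = np.zeros((6, 6))
image[1:3, 1:3] = 9.0                      # toy "object"
object_mask = image == 9.0
background_mask = np.zeros((6, 6), dtype=bool)
background_mask[:, 3:] = True              # right half annotated as background
composite, erase_mask, target = relocate_object(image, object_mask, background_mask, (0, 3))
rejected = relocate_object(image, object_mask, background_mask, (0, 1))
```

The original image serves as the clean target because the pasted region was pure background before the object was placed there.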

Loss & Training¶

Joint optimization of LoRA fine-tuning and Textual Inversion:

\[L = \mathbb{E}_{z_0, z_{masked}, m, t, y, \epsilon_t}\left[\|\epsilon_t - \epsilon_{\theta, \phi}(z'_t, t, \tau(y), m)\|^2\right]\]

Key Experimental Results¶

Main Results¶

Method	Dataset	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓
MAT	OpenImages	26.994	0.949	0.030	31.30
Co-Mod	OpenImages	26.446	0.941	0.033	30.40
LaMa	OpenImages	21.618	0.936	0.055	37.10
SD Inpainting	OpenImages	26.096	0.942	0.036	31.10
MagicEraser	OpenImages	28.123	0.947	0.032	30.02
MAT	RealHM	21.484	0.843	0.107	51.73
SD Inpainting	RealHM	21.758	0.846	0.116	45.05
MagicEraser	RealHM	23.620	0.861	0.101	46.56

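MagicEraser achieves the highest PSNR on both datasets tabulated above, outperforming MAT by 1.1dB on OpenImages and SD Inpainting by 1.9dB on RealHM. It is not the best on every metric: MAT is slightly ahead on SSIM and LPIPS on OpenImages, and SD Inpainting has a lower FID on RealHM.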
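
To put the PSNR gaps in perspective, the sketch below uses the standard definition PSNR = 10·log10(MAX² / MSE). The 8-bit range (MAX = 255) and the helper names are assumptions; the paper's exact evaluation protocol is not given here.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB for one image pair."""
    diff = np.asarray(reference, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain on a single image pair."""
    return 10.0 ** (-gain_db / 10.0)

reference = np.zeros((8, 8))
estimate = np.full((8, 8), 25.5)           # constant error of 25.5 gray levels
example_psnr = psnr(reference, estimate)
ratio = mse_ratio_for_gain(28.123 - 26.994)  # MagicEraser vs MAT on OpenImages
```

For a single image pair, the roughly 1.1dB gap corresponds to a mean squared error about 23% lower; averaged dataset PSNR does not translate to MSE this exactly.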

Ablation Study¶

Model Configuration	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓
Baseline + Random Mask Training	21.331	0.815	0.134	52.10
Baseline + OLRD	22.130	0.834	0.119	50.73
+ Content Initialization	22.891	0.840	0.109	48.91
+ Attention Refocusing	23.277	0.844	0.110	48.93
+ Prompt Tuning	23.311	0.858	0.104	47.94
MagicEraser (All)	23.620	0.861	0.101	46.56

Each component makes a positive contribution: OLRD data construction is the foundation (+0.8dB PSNR over random-mask training), content initialization provides a good starting point (a further +0.8dB), and prompt tuning contributes slightly more than attention refocusing (23.311 vs 23.277 PSNR, 0.858 vs 0.844 SSIM).
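
The per-component gains can be recomputed directly from the ablation table. One assumption is made: the "+ Attention Refocusing" and "+ Prompt Tuning" rows are each read as additions on top of the content-initialization row rather than as cumulative steps.

```python
# PSNR values copied from the ablation table above.
psnr_by_config = {
    "Baseline + Random Mask Training": 21.331,
    "Baseline + OLRD": 22.130,
    "+ Content Initialization": 22.891,
    "+ Attention Refocusing": 23.277,
    "+ Prompt Tuning": 23.311,
    "MagicEraser (All)": 23.620,
}

def gain(after, before):
    """PSNR difference in dB between two configurations, rounded to 3 decimals."""
    return round(psnr_by_config[after] - psnr_by_config[before], 3)

olrd_gain = gain("Baseline + OLRD", "Baseline + Random Mask Training")
init_gain = gain("+ Content Initialization", "Baseline + OLRD")
refocus_gain = gain("+ Attention Refocusing", "+ Content Initialization")
prompt_gain = gain("+ Prompt Tuning", "+ Content Initialization")
total_gain = gain("MagicEraser (All)", "Baseline + Random Mask Training")
```

Read this way, the two plug-and-play modules add similar amounts individually, and the full model gains more than either module alone.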

Key Findings¶

MagicEraser outperforms the commercial products Adobe Photoshop Generative Fill (23.620 vs 22.913 PSNR) and Google Photos Eraser (23.620 vs 20.310 PSNR).
Prompt tuning (global semantic control) contributes slightly more than attention refocusing (local spatial control) in the ablation, and the two are complementary: combining them gives the best result on every metric.
OLRD data construction brings a 0.8dB PSNR improvement compared to traditional random mask training.
The attention refocusing module requires no training and is applied only in the early denoising steps (t=1~0.7).

Highlights & Insights¶

User-Friendly Design: Eliminates the need for manual text prompts by using automatic panoptic segmentation and the learned R∗ concept, significantly lowering the user barrier.
Clever Data Construction Strategy: Translating objects into background regions avoids the "foreground restoration" bias of traditional inpainting training, directly training the model for "background completion".
Semantics-Aware Attention Modulation: Directly modifying attention scores is more efficient than search-based loss optimization of attention maps, and it requires no training.
High Practical Value: Outperforming commercial products demonstrates strong potential for practical deployment.

Limitations & Future Work¶

Heavy reliance on the panoptic/semantic segmentation model, where segmentation quality directly affects the accuracy of positive/negative region division.
Experiments only conducted at 512×512 resolution; performance in high-resolution scenarios remains to be validated.
Hyperparameters such as the attention modulation weights W_pos and W_neg need to be set manually.
Performance may degrade when the erased region occupies an extremely large proportion of the image.

Rating¶

Novelty: ⭐⭐⭐⭐ — Semantics-aware attention refocusing and OLRD data construction strategies are novel.
Technical Depth: ⭐⭐⭐⭐ — The multi-stage cooperative design is complete, and each component has a clear motivation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets + commercial product comparison + comprehensive ablation.
Writing Quality: ⭐⭐⭐⭐ — The problem definition is clear, and the distinction from traditional inpainting is well articulated.