Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways (EraDiff)
Conference: CVPR 2025
arXiv: 2503.07026
Code: None
Area: Diffusion Models
Keywords: Object Removal, Image Inpainting, Diffusion Pathway Calibration, Self-Rectifying Attention, Chain-Rectifying Optimization
TL;DR
This paper proposes EraDiff, which establishes a progressive diffusion pathway from "object-containing" to "pure background" through the Chain-Rectifying Optimization (CRO) paradigm, and suppresses artifacts during sampling using the Self-Rectifying Attention (SRA) mechanism. Together these let the diffusion model learn the "erasure intention" instead of regenerating the object. On OpenImages V5, EraDiff achieves the best Local FID (3.799), ahead of LaMa (6.091) and SD2-Inpaint (8.852).
Background & Motivation
- Background: LDM-based image inpainting methods have made great progress in generating natural content, and models like SD2-Inpaint can generate high-quality inpainting results.
- Limitations of Prior Work: Existing methods perform poorly when applied to object removal (erase inpainting): the model tends to regenerate objects within the masked region instead of removing them. For instance, if a user wants to erase a slice of pizza, the model might generate another slice of pizza instead of a clean plate.
- Key Challenge: Standard diffusion training only establishes a denoising pathway from "noise \(\to\) clean image," where the training objective is to reconstruct the original content (including objects) covered by the mask, rather than understanding the intention to "remove objects." Consequently, the diffusion pathway of the model inherently leads from noise to an "object-containing image." Furthermore, artifacts generated by the mask shape and noise levels at early denoising stages are amplified by the self-attention mechanism.
- Goal: (1) How to enable the diffusion model to learn an erasing pathway from "object to background"? (2) How to rectify pathway deviations caused by artifacts during the sampling process?
- Key Insight: The erase task requires a dedicated diffusion pathway: the model should denoise along a trajectory where the "object gradually disappears," rather than the standard "noise \(\to\) original image" pathway. Meanwhile, in self-attention, the masked region should not attend to itself (which might contain artifacts), but to the background region.
- Core Idea: Construct a progressive erasing diffusion pathway using dynamic mixup images (CRO), and force the masked region to attend only to the background using an attention mask (SRA).
Method
Overall Architecture
Based on the SD2-Inpaint architecture, EraDiff introduces two core improvements: using CRO during training to establish an erase-specific diffusion pathway, and replacing standard self-attention with SRA during inference to rectify sampling deviations. The inputs are the noisy image, binary mask, and masked image, while the output is a clean background with the object removed.
Key Designs
- Chain-Rectifying Optimization (CRO):
  - Function: Establishes a progressive diffusion transition pathway from "object-containing noisy image" to "clean background."
  - Mechanism: First, a matting model segments the main object from the original image. The object is transformed (e.g., rotated and scaled) and pasted onto the background, yielding a synthesized image \(x_0^{obj}\). Then, for each timestep \(t\), a dynamic mixed image is constructed as \(\tilde{x}_t^{mix} = (1-\lambda_t) x_0^{ori} + \lambda_t x_0^{obj}\), where \(\lambda_t\) follows the noise schedule \(1-\bar{\alpha}_t\) and therefore increases monotonically with \(t\): the object is prominent at large \(t\) and fades as \(t\) decreases toward 0 during denoising. Adding noise to the mixed image yields the latent state \(x_t^{mix}\) of the new pathway. The training objective minimizes the distance between the model's predicted earlier state and the ground-truth earlier mixed state: \(\min_\theta \|x_{t-\gamma}^{mix} - p_\theta(\hat{x}_{t-\gamma}^{mix} | x_t^{mix})\|^2\), where \(\gamma\) is randomly sampled to control the step-skipping magnitude.
  - Design Motivation: The endpoint of the standard training diffusion pathway is the original image (containing the object), whereas the endpoint of the CRO pathway is the pure background. By witnessing the process where "the object gradually disappears" at each timestep, the model learns the erase intention. The progressive change of the mixup strategy also stabilizes the training process, avoiding the difficulty of predicting the entire masked region in one step at low-noise stages.
- Self-Rectifying Attention (SRA):
  - Function: Prevents artifact information from spreading during sampling, guiding the masked region to extract features from the background rather than from itself.
  - Mechanism: The image mask is downsampled and flattened into a vector \(m\), from which an extended attention mask matrix \(m'\) is built: \(m'_{i,j} = 1\) when \(m_i=0\) or \(m_j=0\) (i.e., either the Query or the Key comes from the background), and \(-\infty\) otherwise. The mask is applied to the scaled attention logits before the softmax: \(\text{SRA}(Q,K,V) = \text{Softmax}(\frac{QK^T}{\sqrt{d}} \cdot m') V\). As a result, Queries in the masked region attend only to background Keys/Values (not to the masked region itself), while the attention of background Queries is left unchanged.
  - Design Motivation: In standard self-attention, the masked region attends to itself, but during early denoising stages, the masked region is filled with noise/artifacts. These artifact features are mistakenly treated as key information and amplified in subsequent steps, ultimately causing object regeneration. SRA forces the masked region to look only at the background, cutting off the propagation path of artifacts.
- Dynamic Image Synthesis:
  - Function: Constructs "object-containing vs. pure background" training pairs without requiring extra paired data.
  - Mechanism: A foreground object is extracted from the original image using a matting model, undergoes random transformations (rotation, scaling), and is pasted back onto the background region of the original image, generating a new object-containing image \(x_0^{obj}\). The original image \(x_0^{ori}\) serves as the background target. Through timestep-dependent mixup, the two are progressively blended to simulate the object fading process. This self-synthesis strategy is highly cost-effective and does not require real "object/no-object" paired data.
  - Design Motivation: The ideal training data for the erasing task consists of "with-object/without-object" versions of the same scene, but such paired data is virtually non-existent. The self-synthesis strategy uses the foreground objects of the image itself to create approximate pairs. Although the position and pose of the synthesized object differ from the original, it is sufficient for the model to learn the concept of "object disappearance."
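As a rough illustration of the Self-Rectifying Attention item above, the following single-head sketch builds the mask from a flattened binary vector and applies it before the softmax. It is not the authors' implementation. It assumes the common additive convention (0 for allowed pairs, \(-\infty\) for blocked pairs) rather than the 1 / \(-\infty\) values quoted in the text, and the names `build_sra_mask` and `self_rectifying_attention` are hypothetical. Plain Python lists stand in for the tensors a real model would use.

```python
import math


def build_sra_mask(m):
    """Additive attention mask for a flattened binary mask m (1 = erase, 0 = background).

    Entry (i, j) is 0.0 when query i or key j is background,
    and -inf when both lie inside the erase region.
    """
    if all(v == 1 for v in m):
        raise ValueError("SRA needs at least one background token to attend to")
    n = len(m)
    return [
        [0.0 if (m[i] == 0 or m[j] == 0) else -math.inf for j in range(n)]
        for i in range(n)
    ]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def self_rectifying_attention(Q, K, V, m):
    """Single-head attention over token lists Q, K, V with the SRA mask applied."""
    d = len(Q[0])
    bias = build_sra_mask(m)
    weights, outputs = [], []
    for i, q in enumerate(Q):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + bias[i][j]
            for j, k in enumerate(K)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


# Four tokens; tokens 1 and 2 fall inside the erase mask.
mask = [0, 1, 1, 0]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
_, attn = self_rectifying_attention(tokens, tokens, tokens, mask)
print([round(a, 3) for a in attn[1]])  # masked query: zero weight on tokens 1 and 2
print([round(a, 3) for a in attn[0]])  # background query: weight on every token
```

In this sketch, rows belonging to masked queries place all of their probability mass on background keys, while rows belonging to background queries are the same as in ordinary attention. No parameters are added, which matches the zero-parameter character of SRA described in the source.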
Loss & Training
CRO optimization objective: \(\min_\theta \mathbb{E}_{\gamma, t} \|x_{t-\gamma}^{mix} - p_\theta(\hat{x}_{t-\gamma}^{mix} | x_t^{mix})\|_2^2\), where \(\gamma \in (0, \gamma_m)\) and \(\gamma_m = 100\). \(\lambda_t\) shares the same schedule as \(1 - \bar{\alpha}_t\). The model is fine-tuned from SD2-Inpaint using the Adam optimizer with a learning rate of \(3 \times 10^{-6}\) on Nvidia A100 GPUs.
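To make the CRO objective more concrete, here is a minimal sketch of how one training pair \((x_t^{mix}, x_{t-\gamma}^{mix})\) could be built. Several details are assumptions not stated in the summary: a 1000-step linear beta schedule, the standard DDPM forward-noising form \(\sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon\), integer \(\gamma\), and reuse of the same noise sample for both states. All function names are hypothetical, and flat lists of floats stand in for latents.

```python
import math
import random

T = 1000         # number of diffusion steps (assumed)
GAMMA_MAX = 100  # gamma_m from the text


def alpha_bar_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


ALPHA_BAR = alpha_bar_schedule(T)


def mix_weight(t):
    """lambda_t, tied to the noise level 1 - alpha_bar_t."""
    return 1.0 - ALPHA_BAR[t]


def mixed_image(x0_ori, x0_obj, t):
    """(1 - lambda_t) * x0_ori + lambda_t * x0_obj, element by element."""
    lam = mix_weight(t)
    return [(1.0 - lam) * a + lam * b for a, b in zip(x0_ori, x0_obj)]


def noisy_mixed_state(x0_ori, x0_obj, t, eps):
    """Forward-noise the mixed image to timestep t with the given noise sample."""
    signal = math.sqrt(ALPHA_BAR[t])
    noise = math.sqrt(1.0 - ALPHA_BAR[t])
    mix = mixed_image(x0_ori, x0_obj, t)
    return [signal * x + noise * e for x, e in zip(mix, eps)]


def cro_training_pair(x0_ori, x0_obj, rng):
    """Sample t and gamma, then return the model input and its regression target."""
    t = rng.randrange(1, T)
    gamma = rng.randrange(1, min(GAMMA_MAX, t + 1))  # 0 < gamma < gamma_m, t - gamma >= 0
    eps = [rng.gauss(0.0, 1.0) for _ in x0_ori]
    x_t = noisy_mixed_state(x0_ori, x0_obj, t, eps)
    x_target = noisy_mixed_state(x0_ori, x0_obj, t - gamma, eps)
    return t, gamma, x_t, x_target


def squared_error(pred, target):
    """Squared L2 distance used as the per-sample loss."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))


# The object (value 1.0) is strong at large t and fades as t approaches 0.
background = [0.0, 0.0, 0.0, 0.0]
with_object = [0.0, 1.0, 1.0, 0.0]
for t in (999, 500, 100, 0):
    print(t, [round(v, 3) for v in mixed_image(background, with_object, t)])
```

In training, a model prediction made from `x_t` would be compared against `x_target` with `squared_error`. The model itself is omitted here, since the summary does not describe its interface.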
Key Experimental Results
Main Results
| Method | FID↓ | LPIPS↓ | Local FID↓ |
|---|---|---|---|
| SD2-Inpaint | 3.805 | 0.301 | 8.852 |
| SD2-Inpaint* (with prompt) | 4.019 | 0.308 | 7.194 |
| PowerPaint | 6.027 | 0.289 | 10.021 |
| Inst-Inpaint | 11.423 | 0.410 | 43.472 |
| LaMa | 7.533 | 0.219 | 6.091 |
| EraDiff (ours) | 6.540 | 0.192 | 3.799 |
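Note: EraDiff achieves the best Local FID (3.799 vs. 6.091 for LaMa, the next best) and the best LPIPS (0.192 vs. 0.219 for LaMa), while SD2-Inpaint has the lowest global FID (3.805). Local FID reflects the generation quality of the erased region, which is the most central metric for the erasing task.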
Ablation Study
| Configuration | Local FID↓ | GPT-4o Erasing Success Rate↑ |
|---|---|---|
| Full EraDiff | 3.799 | 83.43% |
| w/o CRO | 5.713 | 72.96% |
| w/o SRA | 4.950 | 78.54% |
| w/o CRO + SRA | 8.852 | 27.80% |
| w/o mix-up | NaN | NaN |
Key Findings
- CRO contributes the most—without CRO, Local FID rises from 3.799 to 5.713, and the GPT score drops from 83.43% to 72.96%.
- SRA also plays a significant role—without SRA, Local FID rises to 4.950, validating that artifact propagation is indeed a critical issue.
- w/o CRO+SRA is equivalent to standard SD2-Inpaint, which reaches a GPT-4o erasing success rate of only 27.80%, indicating that the standard diffusion pathway is poorly suited to the erasing task.
- Without mix-up, the training fails to converge (NaN), verifying the necessity of progressive mixup for stabilizing CRO training.
- In the GPT-4o pairwise evaluation, EraDiff outperforms SD2-Inpaint in 80.90% of cases, and outperforms LaMa in 51.54% of cases.
- In user studies, EraDiff receives the highest scores in both erasing effectiveness and visual coherence.
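The relative size of each ablation effect can be checked with a few lines of arithmetic. The snippet below only reuses the numbers from the ablation table (the non-converging "w/o mix-up" row is left out); the helper name `change_vs_full` is hypothetical.

```python
# (Local FID, GPT-4o erasing success rate in %) copied from the ablation table.
ablation = {
    "Full EraDiff": (3.799, 83.43),
    "w/o CRO": (5.713, 72.96),
    "w/o SRA": (4.950, 78.54),
    "w/o CRO + SRA": (8.852, 27.80),
}

full_fid, full_rate = ablation["Full EraDiff"]


def change_vs_full(name):
    """Return (absolute Local FID increase, relative increase in %, success-rate change in points)."""
    fid, rate = ablation[name]
    fid_delta = fid - full_fid
    return fid_delta, 100.0 * fid_delta / full_fid, rate - full_rate


for name in ablation:
    if name == "Full EraDiff":
        continue
    fid_delta, fid_pct, rate_delta = change_vs_full(name)
    print(f"{name}: Local FID +{fid_delta:.3f} ({fid_pct:.1f}%), success rate {rate_delta:+.2f} pts")
```

Removing CRO raises Local FID by roughly 50% and removing SRA by roughly 30%, while removing both raises it by about 133%, which is consistent with the ranking given in the findings above.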
Highlights & Insights
- Redefining Diffusion Pathways: The key insight of CRO is that the erase task requires a dedicated diffusion pathway rather than a generic denoising pathway. By constructing a sequence of intermediate states where the "object gradually fades" via mixup, the erase intention is elegantly encoded into the diffusion pathway. This concept can be transferred to any diffusion task requiring a "directional concept shift" (e.g., style morphing, seasonal transition).
- SRA: Zero-Parameter Attention Rectification: Without adding any extra parameters, modifying the attention mask alone effectively suppresses artifact propagation. This trick is simple yet powerful, making it directly applicable to any attention-based inpainting models.
- GPT-4o as an Erasing Quality Evaluator: Traditional FID/LPIPS metrics fail to evaluate "whether the object was successfully removed." This paper utilizes GPT-4o for pairwise comparison, establishing a more reasonable evaluation paradigm for the erasing task.
Limitations & Future Work
- The position and pose of objects in synthesized training pairs do not perfectly align with the original scene, which may introduce domain gaps.
- SRA may cause the masked region to over-rely on distant background context, potentially lacking semantic rationality for large-area masks.
- The model is fine-tuned only on SD2-Inpaint; its performance has not been verified on newer diffusion architectures (e.g., SDXL, Flux).
- While Local FID is the best among the compared methods, the global FID (6.540) trails SD2-Inpaint (3.805) and PowerPaint (6.027), suggesting a possible trade-off in global image quality.
- No comparison was made with the latest instruction-based editing methods (e.g., follow-up work to InstructPix2Pix).
Related Work & Insights
- vs SD2-Inpaint: SD2-Inpaint utilizes a standard denoising pathway whose endpoint is "reconstructing the original image" rather than the "background"; EraDiff redefines the endpoint of the pathway through CRO.
- vs LaMa: LaMa, based on FFCs+GAN, is highly skilled at texture completion and offers good visual coherence but lacks understanding of the erasing intention; EraDiff is superior in erasing effectiveness (Local FID 3.799 vs. 6.091) and, in the results table, also attains a lower global FID (6.540 vs. 7.533).
- vs Inst-Inpaint/PowerPaint: These methods rely on text prompts to locate erasure targets, which suffer from unstable text instruction-following; EraDiff uses only a mask without text, making it more suitable for large-scale applications.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Redefines the erasing task from the perspective of diffusion pathways; both CRO and SRA designs display profound methodological thinking.
- Experimental Thoroughness: ⭐⭐⭐⭐ GPT-4o evaluation + user study + ablation + attention visualization + denoising process visualization.
- Writing Quality: ⭐⭐⭐⭐ Deep problem analysis and clear methodological motivation; however, the abundance of mathematical formulas requires careful reading.
- Value: ⭐⭐⭐⭐⭐ Directly addresses the practical pain point of "object regeneration" in object removal, with ready-to-use value for photo editing, advertising, and social media.