Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
Conference: ECCV 2024
arXiv: 2403.11105
Area: Image Generation
TL;DR
Proposes SPDInv—a source prompt disentangled inversion method. By modeling the inversion process as a fixed-point search problem and solving it using a pretrained diffusion model, the inverted noise map is disentangled from the source prompt, significantly boosting text-driven image editing quality.
Background & Motivation
- Core pipeline of text-driven image editing: given a source image \(\rightarrow\) obtain a latent noise map via inversion \(\rightarrow\) edit using a target prompt.
- DDIM inversion is the most commonly used method but possesses a fundamental drawback: the inverted noise map is tightly coupled with the source prompt.
- Existing improvement methods (such as NTI, NPI, and DirectInv) focus heavily on minimizing reconstruction error \(D_{Rec}\) while neglecting noise error \(D_{Noi}\).
- Because the inverted noise map is coupled with the source prompt, it conflicts with the target prompt during editing, leading to editing artifacts and content inconsistency.
- Key Insight: An ideal inversion should satisfy the fixed-point constraint \(z_t = C_{t,1} \cdot z_{t-1} + C_{t,2} \cdot \epsilon_\theta(z_t, t, c)\), but DDIM inversion substitutes \((z_t, t)\) with \((z_{t-1}, t-1)\) as network inputs, introducing the source prompt prior.
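The source does not define \(C_{t,1}\) and \(C_{t,2}\) explicitly. As a sketch, assuming the standard deterministic DDIM update with cumulative noise schedule \(\bar{\alpha}_t\) (notation introduced here, not taken from the source), the coefficients follow from solving one denoising step for \(z_t\):

$$
\begin{aligned}
z_{t-1} &= \sqrt{\tfrac{\bar{\alpha}_{t-1}}{\bar{\alpha}_t}}\, z_t + \left(\sqrt{1-\bar{\alpha}_{t-1}} - \sqrt{\tfrac{\bar{\alpha}_{t-1}}{\bar{\alpha}_t}}\,\sqrt{1-\bar{\alpha}_t}\right)\epsilon_\theta(z_t, t, c) \\
\Rightarrow\quad z_t &= C_{t,1}\, z_{t-1} + C_{t,2}\,\epsilon_\theta(z_t, t, c), \\
C_{t,1} &= \sqrt{\tfrac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}, \qquad
C_{t,2} = \sqrt{1-\bar{\alpha}_t} - \sqrt{\tfrac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}\,\sqrt{1-\bar{\alpha}_{t-1}}
\end{aligned}
$$

The unknown \(z_t\) appears on both sides of the second line, which is why exact inversion is implicit and DDIM inversion has to approximate the network input.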
Method
Overall Architecture
The core idea of SPDInv is to minimize the noise error \(D_{Noi}\) (instead of the reconstruction error \(D_{Rec}\)), making the inverted noise map as independent of the source prompt as possible.
- Obtain an initial approximation \(z_t\) using standard DDIM inversion.
- Convert the fixed-point constraint into an optimization problem at each inversion step.
- Solve it via gradient descent using the pretrained diffusion model.
Key Designs
1. Fixed-Point Constraint Analysis
The ideal inversion equation (Eq.1) requires \((z_t, t)\) as network input, but DDIM inversion practically uses \((z_{t-1}, t-1)\). Since \(z_{t-1}\) is obtained by conditional denoising of \(z_t\) under the source prompt, this injects the source prompt information into \(z_t\).
Reformulating the ideal inversion as a fixed-point problem: $$x = f_\theta(x), \quad x = z_t, \quad f_\theta(x) = C_{t,2} \cdot \epsilon_\theta(x, t, c) + C_{t,1} \cdot z_{t-1}$$
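To make the gap concrete, the following minimal sketch evaluates the fixed-point residual \(f_\theta(z_t) - z_t\) left behind by a plain DDIM inversion step. Everything numeric here is an assumption for illustration: `toy_eps` is a hypothetical scalar stand-in for \(\epsilon_\theta\) (not a diffusion model), the cumulative alphas are made-up values, and the coefficients use the standard deterministic DDIM form, which the source does not spell out.

```python
import math

def ddim_coeffs(abar_t, abar_prev):
    """Coefficients of z_t = C1 * z_{t-1} + C2 * eps for a deterministic DDIM step."""
    c1 = math.sqrt(abar_t / abar_prev)
    c2 = math.sqrt(1.0 - abar_t) - c1 * math.sqrt(1.0 - abar_prev)
    return c1, c2

def toy_eps(z, t):
    """Hypothetical stand-in for eps_theta(z, t, c); NOT a real network."""
    return math.tanh(z) + 0.01 * t

# Assumed cumulative alphas at t-1 and t, and an assumed previous latent.
abar_prev, abar_t = 0.90, 0.80
z_prev, t = 0.5, 10

c1, c2 = ddim_coeffs(abar_t, abar_prev)

# DDIM inversion: the network is evaluated at (z_{t-1}, t-1) instead of (z_t, t).
z_t_ddim = c1 * z_prev + c2 * toy_eps(z_prev, t - 1)

# Fixed-point residual f(z_t) - z_t; it vanishes only for an exact inversion.
residual = c1 * z_prev + c2 * toy_eps(z_t_ddim, t) - z_t_ddim
print(f"C1={c1:.4f}  C2={c2:.4f}  z_t={z_t_ddim:.4f}  residual={residual:.2e}")
```

The residual is nonzero because the noise prediction used to build `z_t_ddim` was taken at a different input than the one the constraint asks for.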
2. Gradient Descent-Based Fixed-Point Search
- Unlike AIDI's fixed-loop iterations (which are unstable and sub-optimal), SPDInv reformulates the fixed-point constraint as a loss function: $$L = \|f_\theta(z_t) - z_t\|_2$$
- Perform gradient descent on \(z_t\) using the pretrained diffusion model with frozen parameters: $$z_t := z_t - \eta \nabla L$$
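- Introduce a threshold \(\delta\) for early stopping: the loop at a given inversion step ends once \(L\) falls below \(\delta\) or after at most \(K\) iterations. Early inversion steps require more optimization loops, while late steps (\(t > T/2\)) converge in just a few iterations.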
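The search loop described above can be sketched on a toy problem. This is not the authors' implementation: `toy_eps` is a hypothetical elementwise stand-in for \(\epsilon_\theta\) with a hand-written derivative in place of autograd, the cumulative alphas are assumed values, and the loop minimizes the squared residual (a deliberate swap, since plain fixed-step descent behaves better on the squared form). The values of `eta`, `delta`, and `max_iters` are toy-scale and should not be read as the paper's settings.

```python
import math

def toy_eps(z, t):
    """Hypothetical elementwise stand-in for eps_theta(z, t, c); not a real model."""
    return [math.tanh(v) + 0.01 * t for v in z]

def toy_eps_grad(z):
    """d toy_eps / dz per element (the toy Jacobian is diagonal)."""
    return [1.0 - math.tanh(v) ** 2 for v in z]

def residual(z, z_prev, t, c1, c2):
    """r = f(z_t) - z_t, with f(z_t) = c2 * eps(z_t, t) + c1 * z_{t-1}."""
    return [c1 * p + c2 * e - v for p, e, v in zip(z_prev, toy_eps(z, t), z)]

def l2_norm(r):
    return math.sqrt(sum(x * x for x in r))

def fixed_point_search(z_prev, t, c1, c2, eta=0.5, delta=1e-6, max_iters=25):
    """Refine the DDIM initial guess by gradient descent on ||f(z) - z||^2."""
    # Initial guess: plain DDIM inversion, network evaluated at (z_{t-1}, t-1).
    z = [c1 * p + c2 * e for p, e in zip(z_prev, toy_eps(z_prev, t - 1))]
    for k in range(max_iters):
        r = residual(z, z_prev, t, c1, c2)
        if l2_norm(r) < delta:  # early stopping once the constraint is met
            return z, l2_norm(r), k
        # Analytic gradient of the squared residual: 2 * (c2 * eps' - 1) * r.
        grad = [2.0 * (c2 * g - 1.0) * x for g, x in zip(toy_eps_grad(z), r)]
        z = [v - eta * g for v, g in zip(z, grad)]
    return z, l2_norm(residual(z, z_prev, t, c1, c2)), max_iters

# Assumed cumulative alphas at t-1 and t (illustrative values only).
abar_prev, abar_t = 0.90, 0.80
c1 = math.sqrt(abar_t / abar_prev)
c2 = math.sqrt(1.0 - abar_t) - c1 * math.sqrt(1.0 - abar_prev)

z_prev = [0.5, -1.2, 2.0]
z_t, final_norm, iters = fixed_point_search(z_prev, t=10, c1=c1, c2=c2)
print(f"stopped after {iters} iterations, residual norm {final_norm:.2e}")
```

With a real diffusion model, the hand-written derivative would be replaced by backpropagation through the frozen network, and only \(z_t\) would receive updates.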
3. Extension to Customized Image Generation
Integrating SPDInv into customized image generation methods like ELITE:
1. Convert the given image to a text embedding space ("S") using ELITE.
2. Invert the image to obtain the noise map using SPDInv (preserving layout and background).
3. Generate the edited results using a new text prompt (e.g., "a white S").
Loss & Training
The optimization objective for each inversion step: $$\arg\min_{z_t} L = \|f_\theta(z_t) - z_t\|_2$$
Where \(f_\theta(z_t) = C_{t,2} \cdot \epsilon_\theta(z_t, t, c) + C_{t,1} \cdot z_{t-1}\). The pretrained network parameters are frozen, and only the latent features \(z_t\) are updated.
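As a worked step (a sketch; \(r\) and \(J_\epsilon\) are notation introduced here, not from the source), write \(r = f_\theta(z_t) - z_t\) and let \(J_\epsilon = \partial \epsilon_\theta(z_t, t, c) / \partial z_t\). Treating \(z_{t-1}\) as a constant, the chain rule gives, for \(r \neq 0\):

$$
\nabla_{z_t} L = \frac{\left(C_{t,2}\, J_\epsilon - I\right)^{\top} r}{\|r\|_2}
$$

In practice this vector-Jacobian product would be obtained by automatic differentiation through the frozen network rather than by forming \(J_\epsilon\) explicitly.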
Key Experimental Results
Main Results
Comparison of different inversion methods under three editing engines on PIE-Bench:
| Inversion Method | Editing Engine | DINO↓(×10³) | PSNR↑ | LPIPS↓(×10³) | MSE↓(×10⁴) | SSIM↑(×10²) | CLIP↑ | Time (s) |
|---|---|---|---|---|---|---|---|---|
| DDIM | P2P | 69.43 | 17.87 | 208.80 | 219.88 | 71.14 | 25.01 | 11.55 |
| NTI | P2P | 13.44 | 27.03 | 60.67 | 35.86 | 84.11 | 24.75 | 137.54 |
| DirectInv | P2P | 11.65 | 27.22 | 54.55 | 32.86 | 84.76 | 25.02 | 19.94 |
| AIDI | P2P | 12.16 | 27.01 | 56.39 | 36.90 | 84.27 | 24.92 | 87.21 |
| SPDInv | P2P | 8.81 | 28.60 | 36.01 | 24.54 | 86.23 | 25.26 | 27.04 |
| DDIM | MasaCtrl | 28.38 | 22.17 | 106.62 | 86.97 | 79.67 | 23.96 | 11.55 |
| DirectInv | MasaCtrl | 24.70 | 22.64 | 87.94 | 81.09 | 81.33 | 24.38 | 19.94 |
| SPDInv | MasaCtrl | 20.48 | 24.12 | 71.74 | 64.77 | 82.54 | 24.61 | 27.04 |
| DDIM | PNP | 28.22 | 22.28 | 113.33 | 83.51 | 79.00 | 24.95 | 11.55 |
| SPDInv | PNP | 15.58 | 26.72 | 91.55 | 34.69 | 82.04 | 25.14 | 27.04 |
Compared to the second-best method under the P2P engine (DirectInv), SPDInv lowers DINO by about 24% (11.65 to 8.81), LPIPS by about 34% (54.55 to 36.01), and MSE by about 25% (32.86 to 24.54), as computed from the table rows above.
Ablation Study
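Influence of hyperparameters on performance (subset of PIE-Bench), where K is the maximum number of optimization iterations per inversion step, δ is the early-stopping threshold, and η is the step size of the gradient update: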
| Hyperparameter | DINO↓(×10³) | PSNR↑ | LPIPS↓(×10³) | MSE↓(×10⁴) | SSIM↑(×10²) | CLIP↑ |
|---|---|---|---|---|---|---|
| K=5 | 8.52 | 31.49 | 22.31 | 10.42 | 90.21 | 26.70 |
| K=25 | 8.43 | 31.61 | 21.70 | 10.12 | 90.28 | — |
| K=50 | Better | Better | Better | Better | Better | Slightly lower |
| δ=5e-4 | Worse | Worse | Worse | Worse | Worse | — |
| δ=5e-6 | Best | Best | Best | Best | Best | — |
| η=0.005 | Slightly worse | Slightly worse | Slightly worse | Slightly worse | Slightly worse | Slightly higher |
| η=0.001 | Best | Best | Best | Best | Best | — |
Customized image editing results (SPDInv-ELITE vs. original ELITE):
| Method | DINO↓(×10³) | PSNR↑ | LPIPS↓(×10³) | MSE↓(×10⁴) | SSIM↑(×10²) | CLIP↑ |
|---|---|---|---|---|---|---|
| ELITE (Original) | 148.37 | 14.83 | 201.94 | 359.58 | 67.62 | 15.72 |
| BlendDM | 59.21 | 15.51 | 244.07 | 306.75 | 67.45 | 20.21 |
| InstructP2P | 155.49 | 18.19 | 161.20 | 362.01 | 78.05 | 20.06 |
| SPDInv-ELITE | 21.23 | 24.14 | 74.36 | 48.73 | 88.90 | 19.18 |
Compared to the original ELITE, SPDInv-ELITE: improves DINO by 85%, improves PSNR by 62%, and decreases MSE by 86%.
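The percentages quoted above can be rechecked directly from the table rows. A small sketch (the helper name `relative_change` is hypothetical; the numbers are copied from the ELITE (Original) and SPDInv-ELITE rows):

```python
def relative_change(baseline, new):
    """Signed relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# (ELITE original, SPDInv-ELITE) pairs copied from the table above.
rows = {
    "DINO": (148.37, 21.23),   # lower is better
    "PSNR": (14.83, 24.14),    # higher is better
    "MSE": (359.58, 48.73),    # lower is better
}

changes = {name: relative_change(a, b) for name, (a, b) in rows.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

A negative value means the metric went down, which is an improvement for DINO and MSE; a positive value for PSNR is an improvement.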
Key Findings
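- Gradient-based fixed-point search (SPDInv) outperforms direct fixed-point iteration (AIDI) on every reported metric under P2P (e.g., DINO 8.81 vs. 12.16, PSNR 28.60 vs. 27.01) while also running faster (27.04 s vs. 87.21 s), supporting the gradient-based search strategy.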
- The early phase of the inversion process (\(t < T/2\)) requires more optimization loops to satisfy the fixed-point constraint, whereas the later phase converges extremely fast.
- SPDInv is plug-and-play, requiring only about 10 lines of code changes to integrate into existing editing pipelines.
- It shows consistent advantages under three different editing engines (P2P, MasaCtrl, PNP), which supports the generality of the method.
Highlights & Insights
- Precise problem definition: The paradigm shift from "reconstruction error" to "noise error" is the most significant insight of this paper.
- Elegant and solid theoretical derivation: A complete logical chain established from ideal inversion to fixed-point constraints to the optimization problem.
- Extremely high engineering practicality: only 10 lines of code modification, support for multiple editing engines, and extensibility to customized generation.
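- Inversion time (27.04 s) is far lower than NTI (137.54 s) and AIDI (87.21 s), at the cost of a modest increase over DDIM (11.55 s) and DirectInv (19.94 s), while achieving the best scores among the compared methods.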
- The extension direction of empowering customized generation methods (such as ELITE) with local editing capabilities is highly valuable.
Limitations & Future Work
- Based on Stable Diffusion v1.4; compatibility with newer models (e.g., SD-XL, SD3) remains unverified.
- Each inversion step can require up to K=25 optimization iterations, each involving a forward and a backward pass through the frozen denoising network, which still imposes computational overhead for real-time editing scenarios.
- The convergence of fixed-point search lacks rigorous theoretical guarantees and relies on empirically-set hyperparameters.
- The effectiveness of source prompt disentanglement is yet to be verified for extremely large-scale edits (e.g., completely changing the scene).
Rating
- Novelty: ⭐⭐⭐⭐⭐ — Redefines the problem from the perspective of noise error; the fixed-point search idea is highly novel.
- Value: ⭐⭐⭐⭐⭐ — Plug-and-play, 10 lines of code, multi-engine compatibility, and highly extensible.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three engines × two benchmark datasets, thorough ablation, and extensions to customized editing.
- Writing Quality: ⭐⭐⭐⭐ — The motivation is clearly articulated, but there are many mathematical symbols, and some derivations could be more concise.