# Addressing Text Embedding Leakage in Diffusion-based Image Editing
- Conference: ICCV 2025
- arXiv: N/A
- Code: GitHub
- Area: Image Editing / Diffusion Models
- Keywords: attribute leakage, diffusion image editing, EOS embedding, cross-attention masking, multi-object editing
## TL;DR
This paper proposes the ALE framework, which systematically addresses attribute leakage in diffusion-based text-guided image editing through three components: Object-Restricted Embedding (ORE) to decouple semantic entanglement in EOS tokens, Region-Guided Blended Cross-Attention Masking (RGB-CAM) to constrain spatial attention, and Background Blending (BB) to preserve unedited regions. A new evaluation benchmark, ALE-Bench, is also introduced.
## Background & Motivation
- Background: Diffusion-based text-guided image editing has advanced substantially, yet attribute leakage—where editing a target object unintentionally affects unrelated regions—remains a critical unsolved problem.
- Limitations of Prior Work: Existing methods attempt to constrain editing effects by manipulating attention maps but fail to address the root cause of EOS embedding entanglement. Two types of leakage are identified: (1) Target-External Leakage (TEL), where attributes of a target object spread to non-target regions; and (2) Target-Internal Leakage (TIL), where attributes of one target object affect another.
- Key Challenge: The EOS (End-of-Sequence) token in autoregressive text encoders (e.g., CLIP) aggregates information from every token in the prompt, so its embedding carries all attributes at once and spreads them indiscriminately across space through the cross-attention layers (see the sketch after this list).
- Goal: To systematically eliminate both TEL and TIL, preserve background integrity, and establish a dedicated evaluation benchmark for multi-object editing.
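This aggregation is easy to observe directly: in CLIP's text encoder, the hidden state at the EOS position is exactly what the model exposes as its pooled, prompt-wide summary, and the same state sits inside the per-token sequence that diffusion backbones feed to cross-attention. A minimal sketch using Hugging Face transformers (the checkpoint name and prompt are illustrative):

```python
# Minimal sketch: the EOS position in CLIP's text encoder carries a prompt-wide
# summary. Assumes Hugging Face transformers; checkpoint and prompt are illustrative.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

name = "openai/clip-vit-large-patch14"  # the text encoder used by SD 1.x
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModel.from_pretrained(name)

prompt = "a red apple and a blue cup on a table"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = text_encoder(**tokens)

# Per-token embeddings: this sequence (EOS position included) is what the
# diffusion UNet consumes as cross-attention keys/values.
per_token = out.last_hidden_state          # (1, seq_len, 768)

# The final token is EOS; CLIP's pooled output is exactly this hidden state,
# i.e., a single vector that has aggregated the entire prompt.
eos_state = per_token[:, -1]
print(torch.allclose(eos_state, out.pooler_output))  # True

# Because this one vector mixes "red", "apple", "blue", and "cup", any spatial
# location that attends to the EOS position receives all attributes at once.
```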
## Method
### Overall Architecture
ALE is a tuning-free image editing framework built upon a dual-branch diffusion model architecture. It consists of three complementary components: ORE addresses entanglement at the text embedding level, RGB-CAM constrains spatial attention leakage, and BB handles background preservation. All three components are necessary—BB alone prevents TEL but not TIL; ORE + RGB-CAM reduces TIL but cannot protect the background.
### Key Designs
- Object-Restricted Embedding (ORE): Assigns each target object in the prompt an independent, semantically isolated text embedding by encoding each object separately rather than placing all objects in a single prompt. This severs the semantic entanglement in the EOS token at its source.
- Region-Guided Blended Cross-Attention Masking (RGB-CAM): Uses segmentation masks to constrain cross-attention so that each object's attention is restricted to its designated region. By blending segmentation masks with the attention maps, it prevents improper spatial diffusion of attributes and enables precise region-level editing.
- Background Blending (BB): Blends the source image latent with the edited latent in non-target regions throughout the denoising process, preserving the structural integrity and appearance consistency of unedited areas to effectively prevent TEL. (A combined sketch of the three components follows this list.)
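The three components compose naturally within a single denoising step. The sketch below is a schematic reconstruction in plain PyTorch, not the authors' code: `text_encoder`, `unet`, and `scheduler_step` are stand-ins for whatever the host pipeline provides, and the masks are assumed to be binary and given at latent resolution.

```python
# Schematic reconstruction (not the authors' code) of one ALE-style denoising
# step, showing how ORE, RGB-CAM, and BB compose. `text_encoder`, `unet`, and
# `scheduler_step` are stand-ins for whatever the host pipeline provides;
# masks are assumed to be binary and given at latent resolution.
import torch


def ale_denoising_step(z_edit, z_src_next, t, object_prompts, masks,
                       text_encoder, unet, scheduler_step):
    """
    z_edit:         edited-branch latent at timestep t, shape (1, C, H, W)
    z_src_next:     source-branch latent at the next timestep (from the
                    dual-branch reconstruction / stored inversion trajectory)
    object_prompts: one short prompt per target object,
                    e.g. ["a red apple", "a blue cup"]
    masks:          (num_objects, H, W) binary segmentation masks
    """
    # ORE: encode each object separately, so each EOS token aggregates only
    # that object's attributes instead of the whole multi-object prompt.
    obj_embs = [text_encoder(p) for p in object_prompts]       # each (T_i, d)

    # RGB-CAM: build a (pixels x tokens) visibility mask that pairs object i's
    # tokens with segmentation mask i; the UNet's cross-attention layers are
    # assumed to consume it (see the processor sketch in the next subsection).
    cols = [m.flatten()[:, None].expand(-1, e.shape[0])
            for e, m in zip(obj_embs, masks)]
    token_region_mask = torch.cat(cols, dim=1)                  # (H*W, sum_i T_i)
    text_embs = torch.cat(obj_embs, dim=0)[None]                # (1, sum_i T_i, d)

    # Denoise the edited branch under region-restricted conditioning.
    noise_pred = unet(z_edit, t, text_embs, token_region_mask)
    z_edit_next = scheduler_step(noise_pred, t, z_edit)

    # BB: outside the union of target regions, keep the source branch's latent
    # so the background cannot drift (this is what prevents TEL).
    edit_region = (masks > 0).any(dim=0).float()[None, None]    # (1, 1, H, W)
    return edit_region * z_edit_next + (1.0 - edit_region) * z_src_next
```

In the actual method the visibility mask has to be resampled to each cross-attention layer's resolution, and classifier-free guidance adds an unconditional pass; both are omitted here for brevity.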
### Loss & Training
ALE is a training-free method that operates entirely at inference time by modifying the diffusion sampling process. It can be layered on top of existing editing pipelines, with methods such as Prompt-to-Prompt serving as the underlying backbone.
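Because the changes live entirely in the sampling loop and the cross-attention layers, region masking of the RGB-CAM kind can in principle be attached to a stock Stable Diffusion pipeline by swapping in a custom attention processor. The sketch below shows one plausible way to do this with diffusers; it is not the authors' code, and `token_region_mask` (a per-token spatial visibility mask) is an assumed input.

```python
# Sketch of plugging region-masked cross-attention into an existing Stable
# Diffusion pipeline via a custom diffusers attention processor. Not the
# authors' implementation; `token_region_mask` is an assumed (tokens x H x W)
# binary tensor supplied by the caller, and the checkpoint name is illustrative.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline


class RegionMaskedAttnProcessor:
    def __init__(self, token_region_mask):
        self.token_region_mask = token_region_mask   # (num_text_tokens, H, W)

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, temb=None):
        is_cross = encoder_hidden_states is not None
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states    # plain self-attention

        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
        value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))

        probs = attn.get_attention_scores(query, key, attention_mask)

        if is_cross:
            # Resize the per-token region masks to this layer's (square) spatial
            # resolution, then suppress attention from pixels outside each
            # token's region and renormalize.
            h = w = int(probs.shape[1] ** 0.5)
            m = F.interpolate(self.token_region_mask[None].float(),
                              size=(h, w), mode="nearest")[0]    # (T, h, w)
            m = m.flatten(1).t().to(probs)                       # (h*w, T)
            probs = probs * m[None]
            probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-8)

        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        out = attn.to_out[0](out)                    # output projection
        return attn.to_out[1](out)                   # dropout


# Illustrative wiring: an all-ones mask is equivalent to unmasked attention.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
mask = torch.ones(pipe.tokenizer.model_max_length, 64, 64)
pipe.unet.set_attn_processor(RegionMaskedAttnProcessor(mask))
```

A production version would skip the unconditional half of the classifier-free-guidance batch and track separate masks per object and per branch; the sketch keeps a single mask for brevity.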
## Key Experimental Results
### Main Results
| Method | TELS↓ | TILS↓ | Structure Dist↓ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|---|
| P2P | 21.52 | 17.26 | 0.1514 | 11.15 | 0.5589 |
| MasaCtrl | 20.18 | 16.74 | 0.0929 | 14.99 | 0.7346 |
| InfEdit | 19.59 | 16.69 | 0.0484 | 16.74 | 0.7709 |
| ALE | 16.03 | 15.28 | 0.0167 | 30.04 | 0.9228 |
ALE substantially outperforms all baselines on every metric, raising PSNR from the best baseline's 16.74 (InfEdit) to 30.04 and SSIM from 0.77 to 0.92.
### Ablation Study
- BB alone: addresses TEL but cannot resolve TIL.
- ORE + RGB-CAM: reduces TIL but fails to protect the background (TEL persists).
- Full ALE (BB + ORE + RGB-CAM): simultaneously eliminates both TEL and TIL.
- Performance remains robust across varying numbers of edited objects (1/2/3) and editing types (color/material/object).
## Key Findings
- The EOS token is the fundamental cause of attribute leakage, not merely an attention map issue.
- Multi-object editing is more prone to leakage than single-object editing; nevertheless, ALE achieves even better performance on 3-object editing than on 1-object editing.
- Combined editing types (e.g., color + object) exhibit the most severe leakage, yet ALE still effectively suppresses it.
## Highlights & Insights
- The analysis is thorough and precise—correctly identifying EOS embedding entanglement as the root cause of attribute leakage rather than a surface-level attention artifact.
- ALE-Bench and the TELS/TILS metrics fill a gap in evaluation protocols for multi-object editing scenarios.
- The training-free design allows plug-and-play integration into various diffusion-based editing frameworks.
- Background preservation quality is exceptional (PSNR 30.04 vs. the prior best of 16.74).
## Limitations & Future Work
- Currently limited to local, relatively simple transformations (color/material/object replacement); more complex edits such as style transfer and non-rigid pose modification are not supported.
- ALE-Bench contains only 20 carefully curated images, and its generalizability remains to be validated.
- RGB-CAM requires segmentation masks as input, which raises the barrier to practical use.
- ORE encodes each object separately, leading to linearly increasing computational overhead as the number of target objects grows.
## Related Work & Insights
- Prompt-to-Prompt, MasaCtrl, and InfEdit serve as the primary tuning-free editing baselines for comparison.
- EOS token semantic entanglement may also manifest in other generative tasks that employ CLIP text encoders.
- The region-level editing control paradigm introduced here is transferable to video editing, 3D editing, and related domains.
## Rating
- Novelty: ⭐⭐⭐⭐ — The analysis of EOS embedding leakage is insightful and original.
- Technical Depth: ⭐⭐⭐⭐ — The three-component design is logically coherent and mutually complementary.
- Experimental Thoroughness: ⭐⭐⭐⭐ — New benchmark, new metrics, comprehensive comparisons, and ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ — Problem formulation is clear and visualizations are excellent.
- Value: ⭐⭐⭐⭐ — Training-free, immediately applicable, and demonstrably effective.