ALE: Attribute-Leakage-free Editing for Text-based Image Editing¶
Conference: ICCV 2025 arXiv: 2412.04715 Code: https://mtablo.github.io/ALE_Edit_page/ Area: Image Generation Keywords: Text-guided image editing, attribute leakage, EOS embedding, cross-attention masking, multi-target editing
TL;DR¶
This paper identifies semantic entanglement in the EOS embeddings of autoregressive text encoders as the root cause of attribute leakage in text-guided image editing, and proposes the ALE framework to eliminate such leakage via three components: Object-Restricted Embedding (ORE), Region-Guided Cross-Attention Masking (RGB-CAM), and Background Blending (BB). A dedicated benchmark, ALE-Bench, is also introduced for evaluation.
Background & Motivation¶
Background: Text-guided image editing enables image manipulation through natural language, but multi-target editing frequently suffers from attribute leakage.
Limitations of Prior Work: Attribute leakage manifests in two forms — Target-External Leakage (TEL, where edits overflow into non-target regions) and Target-Internal Leakage (TIL, where attributes interfere across different targets). Existing approaches such as cross-attention alignment fail to fundamentally address this problem.
Key Challenge: The EOS embeddings of autoregressive encoders (e.g., CLIP) inevitably aggregate the semantics of all preceding tokens, and during cross-attention all spatial regions attend to them indiscriminately. Simply removing the EOS embeddings, however, severely degrades image quality.
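The aggregation mechanism can be seen in a toy causal self-attention layer (plain NumPy, not the paper's code or CLIP itself): under a causal mask, token i attends only to tokens 0..i, so the final EOS position necessarily mixes information from every token in the prompt.

```python
import numpy as np

def causal_attention_weights(x):
    """Toy causal self-attention: token i may only attend to tokens 0..i,
    so the last (EOS) position aggregates every token's semantics."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)            # (n, n) similarity scores
    future = np.triu(np.ones((n, n)), k=1)   # 1 strictly above the diagonal
    scores = np.where(future == 1, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))  # 6 stand-in token embeddings, dim 8
w = causal_attention_weights(tokens)

# The first token only sees itself...
print(np.count_nonzero(w[0]))   # -> 1
# ...while the last (EOS) position gets nonzero weight from ALL 6 tokens,
# so its embedding entangles the semantics of every editing target.
print(np.count_nonzero(w[-1]))  # -> 6
```

This is why masking cross-attention alone cannot fix leakage: the EOS embedding already contains every target's attributes before any spatial masking is applied.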
Core Idea: Generate semantically isolated embeddings (ORE) for each editing target independently, restrict attention to corresponding spatial regions via segmentation masks (RGB-CAM), and fuse the background to preserve overall integrity (BB).
Method¶
Key Designs¶
- Object-Restricted Embedding (ORE): Each target is encoded independently so that its EOS embedding captures only the semantics of that target, completely eliminating cross-target semantic entanglement.
- Region-Guided Cross-Attention Masking (RGB-CAM): Segmentation masks are used to strictly confine the attention of each target embedding to its corresponding spatial region.
- Background Blending (BB): At each denoising step, the background latent of the source image is fused with the edited target latent to protect non-edited regions.
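A minimal NumPy sketch of how the three components could compose, under loose assumptions (random vectors stand in for ORE text embeddings and image latents; a null embedding keeps background pixels well-defined; none of this is the authors' implementation):

```python
import numpy as np

def masked_cross_attention(img_q, txt_kv, region_masks):
    """RGB-CAM-style masking (illustrative): pixel i may attend to text
    embedding t only where region_masks[t, i] == 1."""
    d = img_q.shape[1]
    scores = img_q @ txt_kv.T / np.sqrt(d)        # (n_pix, n_emb)
    scores = np.where(region_masks.T == 1, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ txt_kv, w

def background_blend(z_edit, z_src, union_mask):
    """BB-style fusion at each denoising step: edited latents inside the
    target regions, source-image latents everywhere else."""
    m = union_mask[:, None]
    return m * z_edit + (1.0 - m) * z_src

rng = np.random.default_rng(1)
n_pix, d = 8, 4
img_q = rng.normal(size=(n_pix, d))

# ORE stand-ins: each target prompt would be encoded in isolation, so its
# EOS embedding carries only that target's semantics (random vectors here).
txt_kv = np.vstack([
    rng.normal(size=d),   # target 0, e.g. "orange cat"
    rng.normal(size=d),   # target 1, e.g. "black dog"
    np.zeros(d),          # null embedding, reachable from every pixel
])

masks = np.zeros((3, n_pix))
masks[0, :3] = 1    # pixels 0-2: target-0 region
masks[1, 3:6] = 1   # pixels 3-5: target-1 region
masks[2, :] = 1     # null row keeps every attention row well-defined

z_edit, w = masked_cross_attention(img_q, txt_kv, masks)
z_src = rng.normal(size=(n_pix, d))
z_out = background_blend(z_edit, z_src, masks[0] + masks[1])
```

By construction, a pixel in target 0's region gets exactly zero attention weight on target 1's embedding (and vice versa), which is the point of combining ORE's isolated embeddings with region masking; the blend then copies source latents verbatim outside the union of target masks.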
Key Experimental Results¶
| Method | TELS↓ | TILS↓ | Editing Quality |
|---|---|---|---|
| MasaCtrl | High | High | Medium |
| P2P+ETS | Medium | Medium | Medium |
| ALE (Ours) | Lowest | Lowest | Highest |
Key Findings¶
- EOS embedding is the root cause of attribute leakage; cross-attention masking alone is insufficient because EOS embeddings lack spatial specificity.
- Replacing EOS with zero vectors or null-prompt embeddings severely degrades image quality, confirming that diffusion models rely on the semantic content carried by EOS embeddings.
ALE-Bench Results¶
| Method | TELS↓ | TILS↓ | FID↓ | CLIP-Sim↑ |
|---|---|---|---|---|
| P2P | 0.42 | 0.38 | 24.5 | 0.28 |
| MasaCtrl | 0.45 | 0.41 | 22.1 | 0.30 |
| P2P+ETS | 0.31 | 0.29 | 23.8 | 0.29 |
| ALE | 0.12 | 0.11 | 19.3 | 0.33 |
Ablation Study¶
| Configuration | TELS↓ | TILS↓ |
|---|---|---|
| Full ALE | 0.12 | 0.11 |
| w/o ORE | 0.28 | 0.25 |
| w/o RGB-CAM | 0.18 | 0.16 |
| w/o BB | 0.15 | 0.13 |
Highlights & Insights¶
- Identifying EOS entanglement as the root cause uncovers a widely overlooked technical bottleneck in text-guided editing.
- ALE-Bench and the TELS/TILS metrics fill a gap in evaluation for multi-target image editing.
Limitations & Future Work¶
- The method depends on segmentation masks, requiring an additional segmentation model whose quality directly affects editing performance.
- Only scenarios with \(K \leq 3\) targets are evaluated; generalization to a larger number of targets remains unverified.
- ORE encodes each target independently, increasing inference time linearly with the number of targets.
- The EOS entanglement issue is specific to autoregressive text encoders (e.g., CLIP) and may not apply to models using bidirectional encoders.
- ALE-Bench is relatively small in scale and does not cover complex multi-target scenarios (e.g., more than 5 targets).
- Background Blending (BB) relies on precise latent-space alignment and may introduce artifacts in complex backgrounds.
- Applicability to flow-based next-generation models (e.g., Flux) has not been explored.
- Generalization to non-object attributes such as style or lighting editing has not been validated.
Related Work & Insights¶
- vs. Prompt-to-Prompt (P2P): P2P performs edits via cross-attention alignment but cannot resolve attribute leakage caused by EOS entanglement.
- vs. MasaCtrl: MasaCtrl applies self-attention replacement but does not address cross-target interference; ALE's ORE resolves semantic entanglement at the source.
- vs. InstructPix2Pix: A training-based method that requires no inversion but also lacks precise control over multi-target scenarios.
Supplementary Discussion¶
- The core analytical move is reframing attribute leakage from a single vague phenomenon into two measurable failure modes (TEL and TIL), which makes each component's contribution attributable.
- The experimental design covers diverse scenarios and baseline comparisons, with statistically significant results.
- The modular design of the method facilitates extension to related tasks and new datasets.
- Open-sourcing the code and data has significant value for community reproduction and follow-up research.
- Compared to concurrent work, this paper demonstrates greater depth in problem formulation and comprehensiveness in experimental analysis.
- The paper's logic is clear, forming a complete loop from problem definition to method design to experimental validation.
- The computational overhead is reasonable, making the method deployable in practical applications.
- Future work may consider integration with additional modalities such as audio or 3D point clouds.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The discovery of EOS entanglement and the ORE solution are highly original
- Experimental Thoroughness: ⭐⭐⭐⭐ New benchmark + new metrics + comprehensive comparisons
- Writing Quality: ⭐⭐⭐⭐⭐ Problem analysis proceeds in a progressively deepening manner
- Value: ⭐⭐⭐⭐ Provides a solution to a fundamental problem in image editing