Lazy Diffusion Transformer for Interactive Image Editing

Conference: ECCV 2024
arXiv: 2404.12382
Area: Image Generation

TL;DR

Proposes LazyDiffusion, an asymmetric encoder-decoder Transformer architecture that compresses global information via a context encoder and executes diffusion denoising only on the masked region, achieving a 10× speedup with image quality comparable to full-image generation methods during interactive image editing.

Background & Motivation

Current diffusion-based inpainting methods waste a large share of their computation:

RegenerateImage methods (e.g., SDXL, SD Inpaint): Generate pixels for the entire image but discard everything except the masked region.

RegenerateCrop methods (e.g., cropping schemes commonly used in practical software): Only process a small cropped region around the mask. Although faster, this approach discards global context, leading to semantic inconsistency.

In interactive editing scenarios, each user modification typically covers only 10-20% of the image, so regenerating the full image is highly wasteful. This paper introduces a "lazy" generation strategy that generates only the required pixels.

Method

Overall Architecture

LazyDiffusion decouples the generation process into two steps:

Global Context Encoder \(E\) (ViT): Processes the entire canvas and the mask to extract \(N=4096\) tokens, then retains only the \(N_{hole}\) tokens corresponding to the masked region as the compressed context. The encoder runs only once outside the diffusion loop.
Incremental Diffusion Decoder \(D\) (a variant of PixArt-α): At each denoising step, it only processes the tokens corresponding to the masked region, conditioned on both the compressed context and text prompts.

Key Idea: The self-attention mechanism in the encoder allows each token to encode global information; thus, discarding non-masked tokens still preserves complete semantic context.
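The control flow described above can be sketched with toy stand-ins. This is a minimal illustration, not the paper's implementation: `toy_encoder`, `toy_decoder`, `lazy_edit`, and the token width of 8 are hypothetical, and the "denoising" update is a placeholder chosen only so the loop runs. What it shows is that the encoder is called once, outside the loop, while the decoder touches only the \(N_{hole}\) tokens at every step.

```python
import numpy as np

def toy_encoder(canvas_tokens, mask_tokens):
    # Hypothetical stand-in for the ViT context encoder E: every token is
    # mixed with a global summary, so each output carries global information.
    global_summary = canvas_tokens.mean(axis=0, keepdims=True)
    context = canvas_tokens + global_summary
    return context[mask_tokens]  # keep only the N_hole masked tokens

def toy_decoder(x_hole, context_hole, t):
    # Hypothetical stand-in for the diffusion decoder D: a placeholder update
    # that only ever sees the masked tokens and their compressed context.
    return x_hole + (context_hole - x_hole) / (t + 1)

def lazy_edit(canvas_tokens, mask_tokens, num_steps, rng):
    context_hole = toy_encoder(canvas_tokens, mask_tokens)  # runs once
    x_hole = rng.standard_normal(context_hole.shape)        # noise for the hole only
    for t in reversed(range(num_steps)):                    # diffusion loop
        x_hole = toy_decoder(x_hole, context_hole, t)
    return x_hole

rng = np.random.default_rng(0)
canvas = rng.standard_normal((4096, 8))  # N = 4096 tokens, toy width 8
mask = np.zeros(4096, dtype=bool)
mask[:410] = True                        # roughly a 10% mask
out = lazy_edit(canvas, mask, num_steps=20, rng=rng)
print(out.shape)  # (410, 8)
```

The per-step cost depends on the 410 hole tokens rather than on all 4096, which is where the savings come from.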

Key Designs

Token Dropping: The mask is max-pooled down to the token grid and used to filter the \(N\) tokens output by the encoder, retaining only the \(N_{hole}\) tokens that overlap the masked region. This creates an information bottleneck, encouraging the encoder to compress global context into the tokens at the masked locations.
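Token dropping as described above can be sketched in a few lines of NumPy. The helper names `pool_mask_to_tokens` and `drop_tokens`, the 8×8 mask, and the patch size of 2 are assumptions made for illustration. The rule that a token counts as "hole" when any pixel in its patch is masked is one reading of the max-pooling step.

```python
import numpy as np

def pool_mask_to_tokens(pixel_mask, patch):
    # Max-pool a binary (H, W) mask over non-overlapping patch x patch windows:
    # a token is marked as "hole" if any pixel inside its patch is masked.
    h, w = pixel_mask.shape
    blocks = pixel_mask.reshape(h // patch, patch, w // patch, patch)
    return blocks.max(axis=(1, 3)).astype(bool)

def drop_tokens(tokens, token_mask):
    # tokens: (N, C) encoder outputs in row-major grid order.
    return tokens[token_mask.reshape(-1)]

pixel_mask = np.zeros((8, 8), dtype=np.uint8)
pixel_mask[1:3, 1:4] = 1                               # small rectangular hole
token_mask = pool_mask_to_tokens(pixel_mask, patch=2)  # 4x4 token grid
tokens = np.arange(16 * 3, dtype=np.float32).reshape(16, 3)
hole_tokens = drop_tokens(tokens, token_mask)
print(int(token_mask.sum()), hole_tokens.shape)        # 4 (4, 3)
```

Max-pooling errs on the side of keeping a token: a patch that is only partially covered by the mask is still treated as part of the hole.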

Context Conditioning: The decoder is conditioned by concatenating the noise tokens \(\mathcal{X}_{hole}^t\) and the context tokens \(\mathcal{T}_{hole}\) along the hidden dimension: \(\mathcal{X}_{hole}^{t-1} = D(\mathcal{X}_{hole}^t \oplus \mathcal{T}_{hole}; t, \mathbf{c})\)
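A shape-level sketch of this conditioning, with hypothetical sizes (410 hole tokens, widths 16 and 32), shows why concatenating along the hidden dimension keeps the decoder's sequence length at \(N_{hole}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hole, d_noise, d_ctx = 410, 16, 32               # hypothetical sizes
x_hole_t = rng.standard_normal((n_hole, d_noise))  # noisy hole tokens X_hole^t
t_hole = rng.standard_normal((n_hole, d_ctx))      # context tokens T_hole

# Concatenate along the hidden (last) dimension: X_hole^t ⊕ T_hole.
decoder_input = np.concatenate([x_hole_t, t_hole], axis=-1)
print(decoder_input.shape)  # (410, 48)
```

The decoder still attends over 410 tokens. Concatenating along the sequence axis instead would double the sequence length and require both inputs to share a width.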

Latent Space Operation: The model operates in the 4-channel latent space of Stable Diffusion's pretrained VAE (8× downsampling). The final output is Poisson-blended into the original image to eliminate seams.
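The token counts follow from simple arithmetic. The sketch below assumes a patch size of 2 on the latent grid, which is not stated in this summary but is consistent with \(N=4096\) at 1024×1024. The helper name `token_grid` is hypothetical.

```python
def token_grid(image_size, vae_factor=8, patch_size=2):
    # patch_size=2 is an assumption; vae_factor=8 is the VAE downsampling above.
    latent = image_size // vae_factor  # latent side length
    side = latent // patch_size        # token grid side length
    return latent, side, side * side

latent, side, n_tokens = token_grid(1024)
n_hole = round(0.10 * n_tokens)          # tokens in a 10% mask
print(latent, side, n_tokens, n_hole)    # 128 64 4096 410
```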

Training: The encoder is trained from scratch, while the decoder is initialized from PixArt-α pretrained weights. They are jointly trained for 100K iterations on 56 A100 GPUs with a batch size of 224.

Loss & Training

Trained using the Improved DDPM objective function to denoise and reconstruct latent pixels in the masked region.
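
The summary does not spell the objective out. Assuming the standard hybrid loss from Improved DDPM, restricted to the masked tokens, and writing \(\epsilon_\theta\) (an assumed name) for the decoder's noise prediction, it would read roughly as:

\[
\mathcal{L}_{hybrid} = \mathbb{E}_{t,\epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\left(\mathcal{X}_{hole}^t \oplus \mathcal{T}_{hole}; t, \mathbf{c}\right) \right\rVert^2\right] + \lambda\,\mathcal{L}_{vlb}
\]

Here \(\mathcal{L}_{vlb}\) is the variational lower bound term that trains the learned variance. Improved DDPM uses \(\lambda = 0.001\); whether this paper keeps that value is not confirmed here.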

Key Experimental Results

Main Results

Quantitative comparison on the OpenImages 10K dataset (at 1024×1024 resolution):

Method	CLIP Score ↑	FID ↓	Remarks
SD2-crop	0.21	6.95	Reference, different architecture
SDXL	0.21	6.88	Reference, different architecture
RegenerateCrop (PixArt)	0.19	9.35	Cropping scheme
RegenerateImage (PixArt)	0.19	7.38	Full-image generation
Ours	0.19	7.70	Masked region only

Runtime comparison (10% mask, single A100):

Method	Execution Mechanism	Latency
RegenerateImage	Full-image 4096 tokens	~374ms/step
RegenerateCrop	Fixed-size crop around the mask	Constant (independent of mask size)
Ours	Masked tokens only	~28ms/step (10% mask)

User study (1778 responses, 48 users):

Alternative Method	User Preference for Ours
vs RegenerateCrop	81.0%
vs SD2-crop	82.5%
vs RegenerateImage	46.1%
vs SDXL	48.5%

Ablation Study

Encoder overhead analysis:

Component	Time
Context Encoder	73ms (runs only once)
Latent Space Encoder	97ms
Latent Space Decoder	176ms
T5 Text Encoder	21ms
Diffusion Decoder (10% mask, single step)	28ms
Diffusion Decoder (Full-image, single step)	374ms
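
The component timings above can be combined into a rough end-to-end estimate. This is a back-of-the-envelope sketch under stated assumptions: each fixed-cost component runs once per edit, the full-image baseline skips the context encoder, and the step counts of 20 and 50 are hypothetical. The function names are illustrative only.

```python
# Component timings in ms, copied from the table above.
CONTEXT_ENCODER = 73
VAE_ENCODE = 97
VAE_DECODE = 176
T5_ENCODE = 21
STEP_LAZY_10PCT = 28
STEP_FULL = 374

def lazy_latency(steps):
    # Assumption: every fixed-cost component runs once, then `steps` decoder passes.
    fixed = CONTEXT_ENCODER + VAE_ENCODE + VAE_DECODE + T5_ENCODE
    return fixed + steps * STEP_LAZY_10PCT

def full_latency(steps):
    # Assumption: the full-image baseline does not need the context encoder.
    fixed = VAE_ENCODE + VAE_DECODE + T5_ENCODE
    return fixed + steps * STEP_FULL

for steps in (20, 50):
    ours, base = lazy_latency(steps), full_latency(steps)
    print(f"{steps} steps: {ours} ms vs {base} ms, speedup {base / ours:.1f}x")
# 20 steps: 927 ms vs 7774 ms, speedup 8.4x
# 50 steps: 1767 ms vs 18994 ms, speedup 10.7x
```

Under these assumptions the fixed costs are amortized as the step count grows, so the end-to-end speedup approaches the per-step ratio of 374/28 ≈ 13.4×.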

Key Findings

Achieves a 10× speedup at a 10% mask size; at a 25% mask size, its speed matches that of RegenerateCrop.
FID increases by only about 4% compared to full-image generation (7.70 vs 7.38) and is about 18% lower than that of the cropping method (7.70 vs 9.35).
Compressed context preserves critical semantic information—performing on par with full-image methods in scenarios demanding high semantic consistency (e.g., generating bread of the same style on a tray).
The speedup advantage is more pronounced with many diffusion steps and at high resolutions.

Highlights & Insights

Precise Problem Definition: Shifting from "generating full images and then cropping" to "generating only the required pixels," making computation proportional to the size of the edited region.
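Reverse MAE Design: The roles are the opposite of a Masked Autoencoder. In MAE the encoder sees only the visible tokens and the decoder processes the full set; here the encoder processes all tokens, whereas the decoder processes only the masked tokens.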
Progressive Interaction: Distributes image generation costs over multiple user interactions, enabling true interactive use of diffusion models.
Orthogonal Contribution: Orthogonal to fast sampling, distillation, and other diffusion acceleration methods, allowing for cumulative speedups.
Support for Multimodal Conditioning: In addition to text, it supports local conditioning such as sketch guidance (similar to SDEdit).

Limitations & Future Work

The encoder still needs to process the full image, and its self-attention cost grows quadratically with the number of tokens, which may become a bottleneck for ultra-high-resolution images.
Subtle color shifts occasionally appear in the generated regions, requiring Poisson blending post-processing.
The training data consists of an internal dataset (220M images), making it difficult to fully replicate.
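
To give a feel for the first limitation, the sketch below counts self-attention token pairs in the encoder at several resolutions. It assumes the same 8× VAE downsampling and a patch size of 2 (the latter is an assumption, consistent with 4096 tokens at 1024×1024). The function names are hypothetical, and the pair count is only a proxy for actual runtime.

```python
def num_tokens(image_size, vae_factor=8, patch_size=2):
    # patch_size=2 is an assumption; see the lead-in.
    side = image_size // (vae_factor * patch_size)
    return side * side

def attention_pairs(n):
    # Self-attention compares every token with every other token.
    return n * n

base = attention_pairs(num_tokens(1024))
for size in (1024, 2048, 4096):
    n = num_tokens(size)
    print(size, n, attention_pairs(n) // base)
# 1024 4096 1
# 2048 16384 16
# 4096 65536 256
```

Doubling the image side quadruples the token count and multiplies the pairwise attention work by 16, which is why the one-off encoder pass could dominate at very high resolutions.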

Rating

Novelty: ⭐⭐⭐⭐⭐ — The asymmetric encoder-decoder design combined with token dropping is both elegant and effective.
Value: ⭐⭐⭐⭐⭐ — The 10× speedup makes diffusion models practical for interactive editing pipelines for the first time.
Experimental Thoroughness: ⭐⭐⭐⭐ — Quantitative evaluation, user studies, and progressive generation demos are provided, though comparisons against open-source baselines are lacking.
Writing Quality: ⭐⭐⭐⭐⭐ — Clear writing, rich illustrations, with rigorous logic connecting motivation, methodology, and experiments.