Stable Flow: Vital Layers for Training-Free Image Editing

Conference: CVPR 2025
arXiv: 2411.14430
Code: Project Page
Area: Image Generation / Image Editing
Keywords: Image Editing, DiT, Flow Matching, Attention Injection, Training-Free Editing

TL;DR

Stable Flow proposes to automatically detect "vital layers" in DiT (FLUX) and inject attention features of the reference image only into these layers to achieve various training-free image editing operations, while introducing a latent nudging technique to improve the quality of flow model inversion for real images.

Background & Motivation

  • Next-generation T2I models like FLUX and SD3 adopt the DiT architecture combined with flow matching, significantly improving generation quality.
  • Flow-based models sample along straight trajectories, which reduces output diversity; this work treats that limitation as an advantage for editing.
  • Editing methods in the UNet era (such as P2P, MasaCtrl) leveraged the coarse-to-fine-to-coarse hierarchical structure of UNet for attention injection.
  • The DiT architecture lacks the hierarchical structure of UNet, and the role of each layer is unclear, so earlier injection strategies cannot be transferred directly.
  • Standard inversion with the inverse Euler ODE solver performs poorly on FLUX and fails to reconstruct the input image.
  • Two core problems need to be solved: (1) Which layers in DiT are suitable for injecting features? (2) How can real images be effectively inverted?

Method

Overall Architecture

A three-step framework:

  • Vital Layers Detection: bypass each DiT layer in turn and measure the perceptual impact on the generated image; layers with high impact are defined as vital layers.
  • Vital Layers Injection Editing: generate the reference and edited images in parallel, replacing the image embeddings of the edited image with those of the reference image only within the vital layers.
  • Latent Nudging: multiply the clean latent by a coefficient slightly above 1 (\(\lambda=1.15\)) before real-image inversion to improve reconstruction quality.

Key Designs

Design 1: Automatic Vital Layers Detection

  • Function: Identify the subset of DiT layers that are critical for image formation.
  • Mechanism: A set of reference images \(G_{ref}\) is generated with the full model \(M_{full}\) from 64 diverse text prompts. For each layer \(\ell\), the layer is bypassed through its residual connection (model \(M_{-\ell}\)) to generate \(G_\ell\). DINOv2 measures the perceptual similarity \(d\) between \(G_{ref}\) and \(G_\ell\), and the vitality score is defined as \(vitality(\ell) = 1 - \frac{1}{k} \sum d(M_{full}, M_{-\ell})\), where the sum runs over the \(k\) prompts. Layers whose score exceeds a threshold \(\tau_{vit}\) are defined as vital layers.
  • Design Motivation: DiT lacks the structured hierarchy of UNet. Experiments show that vital layers are scattered throughout the Transformer rather than concentrated in one region, and the bypass experiments quantify the actual contribution of each layer to the image content.
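
The scoring step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: plain Python lists stand in for DINOv2 features, cosine similarity stands in for the perceptual similarity \(d\), and the names `vitality_scores` and `select_vital_layers` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vitality_scores(ref_features, bypassed_features):
    """Compute vitality(l) = 1 - mean similarity for every layer l.

    ref_features: list of k feature vectors, one per prompt (full model).
    bypassed_features: {layer: list of k feature vectors}, generated with
        that layer bypassed (assumed to use the same prompts, in order).
    """
    k = len(ref_features)
    scores = {}
    for layer, feats in bypassed_features.items():
        mean_sim = sum(
            cosine_similarity(r, g) for r, g in zip(ref_features, feats)
        ) / k
        scores[layer] = 1.0 - mean_sim
    return scores

def select_vital_layers(scores, tau_vit):
    """Keep the layers whose vitality exceeds the threshold tau_vit."""
    return sorted(layer for layer, s in scores.items() if s > tau_vit)

# Toy data: bypassing layer 0 changes nothing, bypassing layer 1 changes everything.
ref = [[1.0, 0.0], [0.0, 1.0]]
bypassed = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[0.0, 1.0], [1.0, 0.0]]}
scores = vitality_scores(ref, bypassed)
vital = select_vital_layers(scores, tau_vit=0.5)
```

In the toy data, layer 0 gets a vitality of 0 (its removal leaves the images unchanged) and layer 1 gets a vitality of 1, so only layer 1 is selected.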

Design 2: Vital Layers Attention Injection Editing

  • Function: Inject the reference image embeddings only in the vital layers to enable diverse editing operations (non-rigid deformation, object addition, global modification).
  • Mechanism: The source image \(x\) (source prompt + seed) and the edited image \(\hat{x}\) (edit prompt + same seed) are generated in parallel. In the vital layers \(V\), the image embeddings of \(\hat{x}\) are replaced with those of \(x\); in non-vital layers, \(\hat{x}\) keeps its own embeddings.
  • Design Motivation: Analysis shows that multimodal attention in vital layers strikes a favorable balance between preserving source content and responding to text edits: unchanged regions attend mainly to visual features, while edited regions attend to text tokens (e.g., "avocado"). Non-vital layers attend almost entirely to the image, which is unfavorable for editing.
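
The control flow of the parallel generation can be illustrated with a toy. The sketch below treats every layer as an opaque callable `layer(image_tokens, text_tokens)`; how the replacement is wired into FLUX's attention blocks is not shown, and `run_with_injection` and `add_text` are hypothetical names chosen for the illustration.

```python
def run_with_injection(layers, vital, ref_image, edit_image, ref_text, edit_text):
    """Run the reference and edited branches side by side.

    In layers whose index is in `vital`, the edited branch reads the
    reference branch's image tokens; elsewhere it reads its own.
    """
    for idx, layer in enumerate(layers):
        edit_input = ref_image if idx in vital else edit_image
        new_ref = layer(ref_image, ref_text)
        new_edit = layer(edit_input, edit_text)
        ref_image, edit_image = new_ref, new_edit
    return ref_image, edit_image

def add_text(image_tokens, text_token):
    """Toy layer: shifts every image token by a scalar 'text' signal."""
    return [v + text_token for v in image_tokens]

layers = [add_text, add_text, add_text]

# Injection at layer 1 only: the edited branch is pulled back to the reference there.
ref_out, edit_out = run_with_injection(
    layers, {1}, [0.0], [0.0], ref_text=1.0, edit_text=10.0
)

# No injection: the two branches evolve independently.
_, free_out = run_with_injection(
    layers, set(), [0.0], [0.0], ref_text=1.0, edit_text=10.0
)
```

With injection at layer 1, the edited branch ends at 21.0 instead of 30.0: it still follows the edit prompt, but its state is anchored to the reference at the injected layer.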

Design 3: Latent Nudging Real Image Inversion

  • Function: Improve the reconstruction quality of real-image inversion in the FLUX model.
  • Mechanism: Before inversion, the clean latent \(z_0\) is multiplied by \(\lambda=1.15\) to shift it slightly out of the training distribution; the standard inverse Euler ODE is then applied: \(z_t = z_{t-1} + (\sigma_t - \sigma_{t+1}) \cdot u_t(z_{t-1})\).
  • Design Motivation: The exact inverse step would need the velocity at the still-unknown \(z_t\), so standard inversion assumes that \(u(z_t) \approx u(z_{t-1})\), but the straight trajectories in FLUX make this assumption invalid. The slight shift makes the model less likely to alter the image content during the forward process.
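
A minimal sketch of the nudged inversion loop follows. It makes several assumptions: the latent is a plain list of floats, `velocity` is a hypothetical stand-in for the FLUX velocity prediction, and the noise schedule is written as increasing levels from 0 (clean) to 1 (noise), so each step is a positive increment. Indexing and sign conventions for \(\sigma\) differ between write-ups, so treat this as an illustration of the structure only.

```python
def invert_with_nudging(z0, velocity, sigmas, nudge=1.15):
    """Inverse Euler inversion with latent nudging.

    z0: clean latent (list of floats).
    velocity: callable (latent, sigma) -> list of floats; a stand-in
        for the model's velocity prediction.
    sigmas: increasing noise levels, e.g. [0.0, ..., 1.0].
    nudge: the lambda coefficient applied to z0 before inversion.
    """
    z = [nudge * v for v in z0]  # latent nudging
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # The velocity is conditioned on the target step but evaluated on
        # the known latent, i.e. the u(z_t) ~ u(z_{t-1}) approximation.
        u = velocity(z, sigma_next)
        step = sigma_next - sigma_cur
        z = [zi + step * ui for zi, ui in zip(z, u)]
    return z

# Toy velocity field: constant, so the result is easy to check by hand.
def constant_velocity(z, sigma):
    return [0.5 for _ in z]

noisy = invert_with_nudging([1.0, 2.0], constant_velocity, [0.0, 0.5, 1.0])
plain = invert_with_nudging([1.0, 2.0], constant_velocity, [0.0, 0.5, 1.0], nudge=1.0)
```

With a constant velocity the approximation is exact, so the toy only demonstrates where the nudge enters: it scales the starting latent once and leaves the solver loop untouched.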

Loss & Training

  • A training-free method requiring no loss function.
  • DINOv2 perceptual similarity is used only during the vital layers detection stage.

Key Experimental Results

Main Results: Quantitative Comparison

| Method | CLIP_txt↑ | CLIP_img↑ | CLIP_dir↑ |
|---|---|---|---|
| SDEdit | 0.24 | 0.71 | 0.07 |
| P2P+NTI | 0.21 | 0.76 | 0.08 |
| Instruct-P2P | 0.22 | 0.87 | 0.07 |
| MagicBrush | 0.24 | 0.88 | 0.11 |
| MasaCtrl | 0.20 | 0.76 | 0.03 |
| Stable Flow | 0.23 | 0.92 | 0.14 |

Ablation Study

| Configuration | CLIP_txt↑ | CLIP_img↑ | CLIP_dir↑ |
|---|---|---|---|
| Stable Flow | 0.23 | 0.92 | 0.14 |
| All-layer injection | 0.17 | 0.98 | 0.00 |
| Non-vital layer injection | 0.25 | 0.72 | 0.09 |
| No latent nudging | 0.22 | 0.62 | 0.05 |

User Study (Two-Alternative Forced Choice, Win Rate)

| Stable Flow vs. | Prompt Alignment | Image Preservation | Realism | Overall |
|---|---|---|---|---|
| SDEdit | 69% | 68% | 64% | 71% |
| MasaCtrl | 82% | 80% | 80% | 72% |
| MagicBrush | 61% | 67% | 77% | 74% |

Key Findings

  • All-layer injection: CLIP_dir drops to 0 because the source image is copied and no edit takes place.
  • Non-vital layer injection: image preservation drops significantly (CLIP_img 0.72), which indicates over-editing.
  • Approximately 20 vital layers are selected (out of 57 layers in FLUX), scattered across the Transformer.
  • Latent nudging improves CLIP_img from 0.62 to 0.92.
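
The CLIP_dir value of 0 for all-layer injection is easier to read with the metric in mind. The summary does not define CLIP_dir; the sketch below assumes the common directional CLIP similarity, the cosine between the change in image embedding and the change in text embedding. Plain lists stand in for CLIP embeddings, and returning 0.0 for a zero-length direction is a convention chosen for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def clip_directional_similarity(src_img, edit_img, src_txt, edit_txt):
    """Cosine between the image-embedding change and the text-embedding change."""
    img_dir = [e - s for s, e in zip(src_img, edit_img)]
    txt_dir = [e - s for s, e in zip(src_txt, edit_txt)]
    return cosine(img_dir, txt_dir)

# An output identical to the source has no image direction at all.
copied = clip_directional_similarity([1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 2.0])

# An output that moves the same way as the text change scores 1.
aligned = clip_directional_similarity([1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 2.0])
```

Under this definition, an edited image that barely differs from the source carries almost no directional signal, which is consistent with all-layer injection scoring high on CLIP_img and near zero on CLIP_dir.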

Highlights & Insights

  • Turning the low diversity drawback of flow models into an editing advantage is a very clever perspective.
  • The automatic detection method for vital layers is highly generalizable and can be applied to the analysis of other DiT models.
  • The same injection mechanism handles non-rigid editing, object addition, and scene modification within a single framework.
  • Latent nudging is extremely simple (multiplying by 1.15) yet highly effective.

Limitations & Future Work

  • Validated only on the FLUX model; applicability to other DiT models such as SD3 requires testing.
  • The set of vital layers is fixed across all editing types, which may not be optimal for every edit.
  • The coefficient \(\lambda=1.15\) for latent nudging is empirical, lacking a theoretical explanation.
  • Future directions could explore adaptive vital layer selection, controllable editing strength, and extension to video editing.

Related Work & Insights

  • P2P/MasaCtrl: Attention injection editing methods from the UNet era.
  • FLUX/SD3: Latest DiT + flow matching T2I models.
  • SDEdit: A classic method to achieve image editing through adding noise and denoising.
  • Insight: Understanding the internal functional division of layers is a key prerequisite for designing training-free editing methods.

Rating

⭐⭐⭐⭐: The perspective of turning the low diversity of flow models into an editing advantage is highly novel; the vital layers detection method is generalizable and insightful; latent nudging is simple yet highly effective; and the quantitative results and user study together make a convincing validation.