Stable Flow: Vital Layers for Training-Free Image Editing¶
Conference: CVPR 2025
arXiv: 2411.14430
Code: Project Page
Area: Image Generation / Image Editing
Keywords: Image Editing, DiT, Flow Matching, Attention Injection, Training-Free Editing
TL;DR¶
Stable Flow proposes to automatically detect "vital layers" in DiT (FLUX) and inject attention features of the reference image only into these layers to achieve various training-free image editing operations, while introducing a latent nudging technique to improve the quality of flow model inversion for real images.
Background & Motivation¶
- Next-generation T2I models like FLUX and SD3 adopt the DiT architecture combined with flow matching, significantly improving generation quality.
- Flow-based models sample along straight trajectories, leading to reduced diversity—however, this work views this limitation as an editing advantage.
- UNet-era editing methods (such as P2P and MasaCtrl) relied on the fine-to-coarse-to-fine hierarchical structure of UNet to decide where to inject attention.
- The DiT architecture lacks this hierarchy and the role of each layer is unclear, so earlier injection strategies do not transfer directly.
- Standard inverse Euler ODE inversion performs poorly on FLUX, leading to reconstruction failures.
- Two core problems need to be solved: (1) Which layers in DiT are suitable for injecting features? (2) How can real images be effectively inverted?
Method¶
Overall Architecture¶
A three-step framework:
- Vital Layers Detection: bypasses each DiT layer in turn and measures the perceptual impact on the generated image; high-impact layers are defined as vital layers.
- Vital Layers Injection Editing: generates the reference and edited images in parallel and replaces the image embeddings of the edited image with those of the reference image, only within the vital layers.
- Latent Nudging: multiplies the clean latent \(z_0\) by a small coefficient \(\lambda=1.15\) before real-image inversion to improve reconstruction quality.
Key Designs¶
Design 1: Automatic Vital Layers Detection
- Function: Identifies the subset of DiT layers that are critical for image formation.
- Mechanism: Reference images \(G_{ref}\) are generated by the full model \(M_{full}\) from \(k\) diverse text prompts (64 here). For each layer \(\ell\), the model \(M_{-\ell}\) bypasses that layer through its residual connection and generates \(G_\ell\) from the same prompts. DINOv2 provides the perceptual similarity \(d\) between each pair, and the vitality score is \(vitality(\ell) = 1 - \frac{1}{k} \sum_{i=1}^{k} d(G_{ref}^{(i)}, G_{\ell}^{(i)})\). Layers whose vitality exceeds a threshold \(\tau_{vit}\) are defined as vital layers.
- Design Motivation: DiT lacks the structured hierarchy of UNet. Experiments show that vital layers are scattered throughout the Transformer rather than concentrated in one region, and the bypass experiments quantify the actual contribution of each layer to the image content.
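The detection loop can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the "model" is a list of residual layers over small vectors, plain cosine similarity stands in for the DINOv2 perceptual similarity, and the names `run_model`, `vitality_scores`, `vital_layers`, `toy_layers`, and `toy_inputs` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; a stand-in (assumption) for the DINOv2 similarity d."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def run_model(layers, x, skip=None):
    """Apply residual layers x <- x + f(x); a skipped layer keeps only the residual path."""
    for idx, f in enumerate(layers):
        if idx == skip:
            continue  # bypass: the input passes through unchanged
        x = [xi + fi for xi, fi in zip(x, f(x))]
    return x


def vitality_scores(layers, inputs, similarity=cosine):
    """vitality(l) = 1 - mean similarity between full outputs and layer-l-bypassed outputs."""
    references = [run_model(layers, x) for x in inputs]
    scores = []
    for l in range(len(layers)):
        sims = [similarity(ref, run_model(layers, x, skip=l))
                for ref, x in zip(references, inputs)]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


def vital_layers(scores, tau):
    """Indices of layers whose vitality exceeds the threshold tau."""
    return [l for l, s in enumerate(scores) if s > tau]


# Hypothetical toy layers: a rescaling, a rotation-like update, and a no-op.
toy_layers = [
    lambda x: [0.01 * v for v in x],
    lambda x: [x[1], -x[0]],
    lambda x: [0.0 for _ in x],
]
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = vitality_scores(toy_layers, toy_inputs)
print(vital_layers(scores, tau=0.1))
```

In this toy setup only the middle layer changes the direction of the output, so it is the only one whose removal lowers the similarity noticeably. In a real setting the inputs would be prompt and seed pairs and the similarity would be computed on generated images.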
Design 2: Vital Layers Attention Injection Editing
- Function: Injects reference-image embeddings only in the vital layers to enable diverse editing operations (non-rigid deformation, object addition, global modification).
- Mechanism: The source image \(x\) (source prompt + seed) and the edited image \(\hat{x}\) (edit prompt + same seed) are generated in parallel. In the vital layers \(V\), the image embeddings of \(\hat{x}\) are replaced with those of \(x\); non-vital layers keep the original embeddings of \(\hat{x}\).
- Design Motivation: Analysis shows that multimodal attention in vital layers balances preserving source content against responding to text edits. Unchanged regions attend primarily to visual features, while edited regions attend to text tokens (e.g., "avocado"). Non-vital layers attend almost entirely to the image, which is unfavorable for editing.
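The control flow of the parallel generation can be sketched as follows. This is a minimal sketch under strong assumptions: image tokens and text conditioning are plain floats, each "layer" is a hypothetical function of (image tokens, text embedding), and `make_layer` and `generate_pair` are illustrative names rather than anything from the FLUX codebase.

```python
def make_layer(weight):
    """Hypothetical layer: shifts every image token by weight * text embedding."""
    def layer(image_tokens, text_embedding):
        return [tok + weight * text_embedding for tok in image_tokens]
    return layer


def generate_pair(layers, init_tokens, src_text, edit_text, vital):
    """Run the reference and edited streams in parallel from the same start.

    In layers listed in `vital`, the edited stream consumes the reference
    stream's image embeddings instead of its own; elsewhere it keeps its own.
    """
    ref = list(init_tokens)
    edit = list(init_tokens)  # same seed, so both streams start identically
    for idx, layer in enumerate(layers):
        edit_input = ref if idx in vital else edit
        new_edit = layer(edit_input, edit_text)
        ref = layer(ref, src_text)
        edit = new_edit
    return ref, edit


layers = [make_layer(1.0) for _ in range(3)]
start = [0.0, 10.0]

ref, free_edit = generate_pair(layers, start, 1.0, 2.0, vital=set())
_, partial_edit = generate_pair(layers, start, 1.0, 2.0, vital={1})
_, full_inject = generate_pair(layers, start, 1.0, 2.0, vital={0, 1, 2})
print(ref, free_edit, partial_edit, full_inject)
```

Even in this toy, the more layers receive injected embeddings, the closer the edited stream stays to the reference, which mirrors the trade-off between preservation and edit strength described above.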
Design 3: Latent Nudging Real Image Inversion
- Function: Improves the inversion reconstruction quality of real images in the FLUX model.
- Mechanism: Before inversion, the clean latent \(z_0\) is multiplied by \(\lambda=1.15\) to shift it slightly out of the training distribution. The standard inverse Euler ODE is then applied. Rearranging the sampling step \(z_{t-1} = z_t + (\sigma_{t-1} - \sigma_t) \cdot u_t(z_t)\) and approximating \(u_t(z_t)\) by \(u_t(z_{t-1})\) gives \(z_t = z_{t-1} + (\sigma_t - \sigma_{t-1}) \cdot u_t(z_{t-1})\).
- Design Motivation: Standard inversion assumes that \(u(z_t) \approx u(z_{t-1})\), but this assumption reportedly fails for the straight trajectories in FLUX. The slight shift makes the model less likely to alter the image content during the forward process.
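The sketch below shows where the nudging factor enters an inverse Euler loop. It is an illustration under assumptions: the noise levels `sigmas` increase from 0 to 1, each step adds the difference between consecutive noise levels times the velocity, and `toy_velocity` is a hypothetical linear field rather than a trained flow model. It therefore does not reproduce the quality gain reported for FLUX.

```python
def invert(z0, velocity, sigmas, nudge=1.15):
    """Inverse Euler: scale the clean latent by `nudge`, then step toward noise.

    Each step evaluates the velocity at the previous latent, which is the
    approximation u_t(z_t) ~ u_t(z_{t-1}) that makes the inversion explicit.
    """
    z = [nudge * v for v in z0]
    for prev, cur in zip(sigmas[:-1], sigmas[1:]):
        u = velocity(z, cur)
        z = [zi + (cur - prev) * ui for zi, ui in zip(z, u)]
    return z


def sample(z_noisy, velocity, sigmas):
    """Forward Euler sampling from the noisy latent back to sigma = sigmas[0]."""
    z = list(z_noisy)
    for cur, prev in zip(reversed(sigmas[1:]), reversed(sigmas[:-1])):
        u = velocity(z, cur)
        z = [zi + (prev - cur) * ui for zi, ui in zip(z, u)]
    return z


def toy_velocity(z, sigma):
    """Hypothetical velocity field used only to make the sketch runnable."""
    return [0.5 * v for v in z]


sigmas = [i / 50 for i in range(51)]
z0 = [1.0, -2.0]
z_noisy = invert(z0, toy_velocity, sigmas)         # nudged inversion
reconstruction = sample(z_noisy, toy_velocity, sigmas)
print(reconstruction)
```

Setting `nudge=1.0` recovers plain inverse Euler, which makes it easy to compare reconstructions with and without the scaling on a real model.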
Loss & Training¶
- A training-free method requiring no loss function.
- DINOv2 perceptual similarity is used only during the vital layers detection stage.
Key Experimental Results¶
Main Results: Quantitative Comparison¶
| Method | CLIP_txt↑ | CLIP_img↑ | CLIP_dir↑ |
|---|---|---|---|
| SDEdit | 0.24 | 0.71 | 0.07 |
| P2P+NTI | 0.21 | 0.76 | 0.08 |
| Instruct-P2P | 0.22 | 0.87 | 0.07 |
| MagicBrush | 0.24 | 0.88 | 0.11 |
| MasaCtrl | 0.20 | 0.76 | 0.03 |
| Stable Flow | 0.23 | 0.92 | 0.14 |
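The table does not define its metrics. Assuming CLIP_dir denotes the commonly used directional CLIP similarity (the cosine between the change in image embeddings and the change in text embeddings), it could be computed roughly as below. The embeddings are taken as precomputed vectors, the function name is hypothetical, and returning 0.0 for an unchanged image is a convention chosen for this sketch.

```python
import math


def clip_directional_similarity(img_src, img_edit, txt_src, txt_edit):
    """Cosine between (edited - source) image and text embedding directions."""
    d_img = [e - s for e, s in zip(img_edit, img_src)]
    d_txt = [e - s for e, s in zip(txt_edit, txt_src)]
    norm_img = math.sqrt(sum(v * v for v in d_img))
    norm_txt = math.sqrt(sum(v * v for v in d_txt))
    if norm_img == 0.0 or norm_txt == 0.0:
        return 0.0  # no change in image or text: no direction to compare
    dot = sum(a * b for a, b in zip(d_img, d_txt))
    return dot / (norm_img * norm_txt)


# Toy 2-D embeddings: the image moves in the same direction as the text.
aligned = clip_directional_similarity([1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 2.0])
# An output identical to the source has no edit direction at all.
unchanged = clip_directional_similarity([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 2.0])
print(aligned, unchanged)
```

Under this reading, a method that copies the source image scores near zero on CLIP_dir regardless of how high its CLIP_img is.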
Ablation Study¶
| Configuration | CLIP_txt↑ | CLIP_img↑ | CLIP_dir↑ |
|---|---|---|---|
| Stable Flow | 0.23 | 0.92 | 0.14 |
| All-layer injection | 0.17 | 0.98 | 0.00 |
| Non-vital layer injection | 0.25 | 0.72 | 0.09 |
| No latent nudging | 0.22 | 0.62 | 0.05 |
User Study (Two-Alternative Forced Choice, Win Rate)¶
| vs Method | Prompt Alignment | Image Preservation | Realism | Overall |
|---|---|---|---|---|
| vs SDEdit | 69% | 68% | 64% | 71% |
| vs MasaCtrl | 82% | 80% | 80% | 72% |
| vs MagicBrush | 61% | 67% | 77% | 74% |
Key Findings¶
- All-layer injection -> CLIP_dir drops to 0 (completely copying the source image, resulting in no editing effect).
- Non-vital layer injection -> significant drop in image preservation (CLIP_img 0.72), resulting in over-editing.
- Approximately 20 vital layers are selected (out of 57 layers in FLUX), scattered across the Transformer.
- Latent nudging improves CLIP_img from 0.62 to 0.92.
Highlights & Insights¶
- Turning the low diversity drawback of flow models into an editing advantage is a very clever perspective.
- The automatic vital layer detection does not depend on FLUX-specific structure, so in principle it can be applied to the analysis of other DiT models.
- A single injection mechanism handles non-rigid editing, object addition, and scene modification, unifying edit types that usually require separate methods.
- Latent nudging is extremely simple (multiplying by 1.15) yet highly effective.
Limitations & Future Work¶
- Validated only on the FLUX model; applicability to other DiT models such as SD3 requires testing.
- The set of vital layers is fixed across all editing types, which may not be optimal for every edit.
- The coefficient \(\lambda=1.15\) for latent nudging is empirical, lacking a theoretical explanation.
- Future directions could explore adaptive vital layer selection, controllable editing strength, and extension to video editing.
Related Work & Insights¶
- P2P/MasaCtrl: Attention injection editing methods from the UNet era.
- FLUX/SD3: Latest DiT + flow matching T2I models.
- SDEdit: A classic method to achieve image editing through adding noise and denoising.
- Insight: Understanding the internal functional division of layers is a key prerequisite for designing training-free editing methods.
Rating¶
⭐⭐⭐⭐: Converting the low diversity of flow models into an editing advantage is a novel perspective; the vital layers detection method is insightful and likely generalizable; latent nudging is simple yet effective; the quantitative comparison, ablation, and user study together make a convincing case.