FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing

Conference: CVPR 2026
Paper: CVF Open Access
Code: To be confirmed
Area: Diffusion Models / Image Editing
Keywords: Complex Image Editing, Flow Matching, Velocity Decoupling, Orthogonal Decay, Rectified Flow

TL;DR

FlowDC decomposes complex target prompts with multiple editing goals into a sequence of progressive sub-prompts. It calculates "editing directions" for each goal along parallel trajectories and orthogonalizes them into a basis. By projecting the original editing velocity onto this basis, it retains components within the subspace and decays components orthogonal to the editing directions, achieving multi-target semantic alignment and source image consistency in a single round.

Background & Motivation

Background: Pre-trained text-to-image Flow Matching (FM) models (e.g., FLUX) have significantly advanced text-driven image editing. However, the majority of existing methods excel only at simple editing, where the target prompt contains only one editing objective (e.g., modifying a single object or attribute).

Limitations of Prior Work: Real-world demands often involve complex editing, requiring simultaneous changes to color, shape, objects, and styles. Existing approaches fall short in two ways:

  • Single-round editing: The complex prompt is fed directly into the FM model. Since pre-trained FM models have limited ability to process long-text semantics, multiple targets often lead to omitted edits or entangled effects. Attention manipulation or prompt decomposition methods suffer from mask overlap and poor generalization.
  • Multi-round editing: The long prompt is decomposed into multiple single-target short prompts that are applied sequentially. However, the computational cost grows linearly with the number of rounds, and source inconsistency accumulates across rounds (cf. Fig 2(b)).

Key Challenge: Effectively balancing semantic alignment (faithfully reflecting all targets without omissions or entanglement) and source image consistency (maintaining editing-unrelated areas) in complex editing.

Key Insight: It is observed that in inversion-free flow editing, the component of the editing velocity \(v^{edit}(t)\) orthogonal to the "editing displacement direction" often corresponds to unstable structural perturbations unrelated to the edit. This component is primarily responsible for destroying the source image structure. Rather than decomposing at the prompt level, it is more effective to operate directly on the velocity field.

Core Idea: Decouple complex editing into a parallel superposition of sub-editing effects (instead of sequential rounds) and decay the orthogonal components during velocity reconstruction. In short: "Keep components along editing directions; weaken those perpendicular to them."

Method

Overall Architecture

FlowDC is built upon inversion-free flow editing (FlowEdit style). Given a source image \(X^{src}\), source prompt \(P^{src}\), and complex target prompt \(P^{tar}\), the goal is to construct a stable editing trajectory \(Z^{edit}_t\) to complete complex editing in one round. Since the original velocity \(v^{edit}(t)\) from the complex prompt lacks precision, FlowDC "purifies" it into \(v'^{edit}(t)\) via two steps:

  1. Use an LLM to decompose the complex prompt into a sequence of progressive intermediate prompts, calculate multiple editing trajectories in parallel, and orthogonalize their directions into a basis (PSO).
  2. Project the original editing velocity onto the time-varying subspace spanned by this basis, retaining subspace components and strongly decaying orthogonal ones (VOD).

Reviewing the baseline (Rectified Flow + inversion-free editing): the editing velocity is the difference between the target and source velocity fields:

\[v^{edit}_\theta(t) = v_\theta(Z^{tar}_t, t, P^{tar}) - v_\theta(Z^{src}_t, t, P^{src})\]

The trajectory is updated via forward Euler from \(t=1\) to \(t=0\):

\[Z^{edit}_{t-\Delta t} = Z^{edit}_t - v^{edit}_\theta(t)\,\Delta t\]

FlowDC modifies this \(v^{edit}(t)\).
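The baseline update can be sketched on a toy problem. Everything here is illustrative: `toy_velocity` is a hypothetical stand-in for \(v_\theta\) (the exact rectified-flow velocity when a prompt's "data distribution" is a single point `mu`), and the coupling \(Z^{tar}_t = Z^{edit}_t + Z^{src}_t - X^{src}\) is an assumption borrowed from FlowEdit-style editing that is not stated in this summary.

```python
import random


def toy_velocity(z, t, mu):
    """Hypothetical stand-in for v_theta(z, t, P): rectified-flow velocity
    when the data for prompt P collapses to the single point `mu`."""
    return [(zi - mi) / t for zi, mi in zip(z, mu)]


def flow_edit(x_src, mu_src, mu_tar, num_steps=28, seed=0):
    """Inversion-free editing loop: Euler steps from t = 1 down to t = 0."""
    rng = random.Random(seed)
    z_edit = list(x_src)              # the edit trajectory starts at the source
    dt = 1.0 / num_steps
    for k in range(num_steps, 0, -1):
        t = k / num_steps
        noise = [rng.gauss(0.0, 1.0) for _ in x_src]
        # Noisy source sample: Z_src = t * X_1 + (1 - t) * X_src
        z_src = [t * n + (1 - t) * x for n, x in zip(noise, x_src)]
        # Assumed FlowEdit-style coupling between the trajectories
        z_tar = [ze + zs - x for ze, zs, x in zip(z_edit, z_src, x_src)]
        v_tar = toy_velocity(z_tar, t, mu_tar)
        v_src = toy_velocity(z_src, t, mu_src)
        v_edit = [a - b for a, b in zip(v_tar, v_src)]
        # Z_edit(t - dt) = Z_edit(t) - v_edit(t) * dt
        z_edit = [z - v * dt for z, v in zip(z_edit, v_edit)]
    return z_edit


edited = flow_edit(x_src=[1.0, 2.0], mu_src=[1.0, 2.0], mu_tar=[3.0, -1.0])
print(edited)
```

In this toy setting the noise terms cancel in the velocity difference, so the edit trajectory moves from the source point to the target point. A real model offers no such guarantee, which is the gap FlowDC's velocity purification targets.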

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Source Image + Source/Target Prompts"] --> B["LLM Progressive Semantic Decoupling<br/>Complex Prompt → n Progressive Sub-prompts"]
    B --> C["Progressive Semantic Orthogonalization (PSO)<br/>Parallel trajectories compute n velocities<br/>→ Orthogonal basis {u_i}"]
    C --> D["Velocity Orthogonal Decay (VOD)<br/>Project original velocity to basis<br/>Retain subspace / Decay orthogonal components"]
    D -->|Reconstruct Velocity v′(t) to update main trajectory| E["Edited Image Z_0"]

Key Designs

1. LLM Progressive Semantic Decoupling: From one hard long prompt to stackable short prompts

The failure of complex editing often stems from the FM model's inability to follow long prompts. FlowDC uses an LLM to decouple a complex prompt \(P^{tar}\) containing \(n\) targets \(\{e_i\}_{i=1}^n\) into a sequence of ordered, cumulative intermediate prompts \(\{P^{tar_i}\}_{i=1}^n\). Each \(P^{tar_i}\) contains the first \(i\) targets. This progressive accumulation ensures that each step adds exactly one goal, allowing the subsequent orthogonalization to isolate the contribution of each target.
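The cumulative structure of the intermediate prompts can be illustrated with a small sketch. In FlowDC the decomposition is produced by an LLM; the string replacement below is only a hypothetical stand-in, and the function name, the example prompt, and the `(old, new)` edit format are all assumptions made for illustration.

```python
def build_progressive_prompts(source_prompt, edits):
    """Return prompts P^{tar_1..n}, where the i-th prompt applies the first i edits.

    `edits` is an ordered list of (old_phrase, new_phrase) pairs. A real
    pipeline would ask an LLM for this decomposition instead.
    """
    prompts = []
    current = source_prompt
    for old, new in edits:
        if old not in current:
            raise ValueError(f"phrase {old!r} not found in prompt")
        current = current.replace(old, new, 1)   # add exactly one new target
        prompts.append(current)
    return prompts


prompts = build_progressive_prompts(
    "a brown dog sitting on green grass",
    [("brown dog", "white cat"), ("green grass", "golden grass")],
)
for i, p in enumerate(prompts, start=1):
    print(f"P^tar_{i}: {p}")
# P^tar_1: a white cat sitting on green grass
# P^tar_2: a white cat sitting on golden grass
```

Because consecutive prompts differ by exactly one target, the difference between their editing velocities can be attributed to that target alone.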

2. Progressive Semantic Orthogonalization (PSO): Isolating independent directions for each target

PSO extracts the specific editing direction for each goal. It shares a single source trajectory \(Z^{src}_t = tX_1 + (1-t)X^{src}\) and performs Parallel Velocities Generation (PVG) across the \(n\) intermediate prompts:

\[v_i(t) = v_\theta(Z^{tar_i}_t, t, P^{tar_i}) - v_{src}(t), \quad v_{src}(t)=v_\theta(Z^{src}_t,t,P^{src})\]

Then Progressive Vectors Orthogonalization (PVO), essentially Gram–Schmidt, is applied to these vectors. Each \(u_i(t)\) is initialized to \(v_i(t)\), and its components along the earlier basis vectors are subtracted in turn:

\[u_i(t) \leftarrow u_i(t) - \frac{\langle u_i(t),u_j(t)\rangle}{\lVert u_j(t)\rVert_2^2}\,u_j(t),\quad j=1,\dots,i-1\]

This yields an orthogonal basis \(\{u_i(t)\}\). Since the prompts are cumulative, orthogonalization isolates the new semantic contribution of the \(i\)-th target in \(u_i\). Heatmaps (Fig 6) show that specific basis vectors highlight specific object regions.
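A minimal sketch of the orthogonalization step, written for flat Python lists rather than latent tensors. The function names are hypothetical, and dropping near-zero vectors (the `eps` check) is an added safeguard that this summary does not describe.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def progressive_orthogonalize(velocities, eps=1e-12):
    """Gram-Schmidt over the ordered velocities v_1..v_n.

    Each u_i starts as v_i and has its components along the earlier
    basis vectors u_1..u_{i-1} removed one at a time.
    """
    basis = []
    for v in velocities:
        u = list(v)
        for u_j in basis:
            coeff = dot(u, u_j) / dot(u_j, u_j)
            u = [ui - coeff * uj for ui, uj in zip(u, u_j)]
        if dot(u, u) > eps:          # assumption: skip degenerate directions
            basis.append(u)
    return basis


# v_2 shares the first target's direction with v_1 and adds a new one.
v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 1.0, 0.0]
u1, u2 = progressive_orthogonalize([v1, v2])
print(u1, u2, dot(u1, u2))
```

In the example, `v2` contains everything in `v1` plus one new component, mirroring the cumulative prompts; after orthogonalization `u2` keeps only the new component.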

3. Velocity Orthogonal Decay (VOD): Retaining editing directions, weakening the perpendicular

VOD purifies the original editing velocity \(v(t)\) using the basis \(U(t)=\{u_j(t)\}\). The subspace component is:

\[v_{sub}(t)=\text{Proj}(v(t),U(t))=\sum_{u_j(t)\in U(t)}\frac{\langle v(t),u_j(t)\rangle}{\lVert u_j(t)\rVert_2^2}\,u_j(t)\]

The orthogonal component is \(v_{orth}(t)=v(t)-v_{sub}(t)\). Reconstruction uses selective decay:

\[v'(t)=\lambda_{sub}(t)\,v_{sub}(t)+\lambda_{orth}(t)\,v_{orth}(t)\]

To preserve semantics, \(\lambda_{sub}(t)=1\). The orthogonal coefficient \(\lambda_{orth}(t)\) follows a piecewise schedule:

\[\lambda_{orth}(t)=\begin{cases}\lambda_d+\dfrac{(\lambda_1-\lambda_d)(t-t_d)}{t_1-t_d}, & t\ge t_d\\[2mm] 1, & t<t_d\end{cases}\]

With \(\lambda_1=0.1, \lambda_d=0.64\), the orthogonal component is scaled by 0.1 at the first step (\(t = t_1\)), where structural disruption is most likely. The coefficient then rises linearly to 0.64 as \(t\) decreases to \(t_d\), and for \(t < t_d\) it is set to 1, so the orthogonal component passes through undecayed.
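The projection and reconstruction can be sketched as follows, again on flat lists. The function names are hypothetical, the basis is assumed to be already orthogonal (otherwise the sum of per-vector projections is not the subspace projection), and \(\lambda_{orth}\) is passed in as a plain number.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def velocity_orthogonal_decay(v, basis, lam_orth, lam_sub=1.0):
    """Split v into subspace and orthogonal parts, then recombine.

    v_sub  = sum_j <v, u_j> / ||u_j||^2 * u_j   (basis assumed orthogonal)
    v_orth = v - v_sub
    v'     = lam_sub * v_sub + lam_orth * v_orth
    """
    v_sub = [0.0] * len(v)
    for u in basis:
        coeff = dot(v, u) / dot(u, u)
        v_sub = [s + coeff * ui for s, ui in zip(v_sub, u)]
    v_orth = [vi - si for vi, si in zip(v, v_sub)]
    return [lam_sub * s + lam_orth * o for s, o in zip(v_sub, v_orth)]


basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # editing directions
v = [2.0, 3.0, 4.0]                          # raw editing velocity
print(velocity_orthogonal_decay(v, basis, lam_orth=0.1))
```

In the example the first two coordinates lie in the editing subspace and are kept, while the third coordinate is scaled by `lam_orth`. With `lam_orth=1.0` the function returns `v` unchanged, which matches the \(t<t_d\) branch of the schedule.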

Implementation Details: To save computation, PVO is applied to displacements \(\{d_i\}\) (where \(d_i = Z^i_t - X^{src}\)) only during early steps (\(t \ge t_o\)). At the initial step (\(t_1\)), reference velocities \(\{v_i(t_g)\}\) at a guidance step \(t_g\) are used as PVO inputs to prevent subspace collapse.

Loss & Training

FlowDC is a training-free, inference-time method. No weights are updated. The base model is FLUX.1 dev. Guidance scales: 1.5/5.5. Total steps \(T=28\). Hyperparameters: \(t_1=27/28, t_g=22/28, t_o=27/28, \lambda_1=0.1, \lambda_d=0.64, t_d=20/28\).
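To make the reported hyperparameters concrete, the sketch below evaluates \(\lambda_{orth}(t)\) on the grid \(t=k/28\) around the decay window. The function name is hypothetical, and which discrete steps the sampler actually visits is not specified in this summary, so the grid is only an illustration.

```python
def lambda_orth(t, t1=27 / 28, td=20 / 28, lam1=0.1, lamd=0.64):
    """Orthogonal decay coefficient: linear between t1 and td, 1 below td."""
    if t < td:
        return 1.0
    return lamd + (lam1 - lamd) * (t - td) / (t1 - td)


T = 28
schedule = {k: lambda_orth(k / T) for k in range(27, 18, -1)}
for k, lam in schedule.items():
    print(f"t = {k}/{T}: lambda_orth = {lam:.4f}")
```

The coefficient starts near 0.10 at \(t=27/28\), increases by about 0.077 per step (0.54 spread over 7 steps), reaches 0.64 at \(t=20/28\), and is 1 from \(t=19/28\) onward.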

Key Experimental Results

Evaluation uses PIE-Bench++ and the newly constructed Complex-PIE-Bench (4 targets per sample). Metrics: CLIP-T (alignment), CLIP-I/DINO/LPIPS (consistency).

Main Results

| Dataset | Method | CLIP-T↑ | CLIP-I↑ | DINO↑ | LPIPS↓ |
|---|---|---|---|---|---|
| Complex-PIE-Bench | FlowEdit | 26.91 | 87.63 | 70.84 | 23.86 |
| Complex-PIE-Bench | RF-Edit | 28.79 | 78.47 | 47.62 | 50.80 |
| Complex-PIE-Bench | Ours | 27.69 | 87.72 | 71.69 | 23.72 |
| PIE-Bench++ | FlowEdit | 25.12 | 88.27 | 70.26 | 22.44 |
| PIE-Bench++ | RF-Edit | 26.88 | 78.04 | 45.35 | 49.12 |
| PIE-Bench++ | Ours | 25.13 | 88.76 | 72.33 | 22.09 |

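Interpretation: RF-Edit achieves the highest CLIP-T on both benchmarks, but its DINO/LPIPS scores are poor (47.62/50.80 on Complex-PIE-Bench), indicating over-editing and structural destruction. FlowDC achieves the best CLIP-I, DINO, and LPIPS on both benchmarks while keeping CLIP-T above FlowEdit (27.69 vs 26.91 on Complex-PIE-Bench, though only 25.13 vs 25.12 on PIE-Bench++), giving the better balance between alignment and consistency.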

Ablation Study

| Configuration | CLIP-T↑ | CLIP-I↑ | DINO↑ | LPIPS↓ | Note |
|---|---|---|---|---|---|
| Ours (Full) | 27.69 | 87.72 | 71.69 | 23.72 | PSO + VOD |
| w/o PSO | 27.30 | 87.76 | 71.60 | 24.19 | CLIP-T drops; complex targets are missed. |
| w/o VOD | 29.23 | 79.63 | 47.58 | 49.89 | Consistency collapses (DINO 71.69 → 47.58). |

Key Findings

  • VOD is crucial for source consistency: Without it, CLIP-T rises from 27.69 to 29.23 (more "aggressive" editing), but DINO collapses from 71.69 to 47.58 and LPIPS roughly doubles from 23.72 to 49.89. Orthogonal components must be decayed to prevent structural disruption.
  • PSO ensures semantic alignment: Without it, CLIP-T drops from 27.69 to 27.30 while the consistency metrics stay nearly unchanged, as the model fails to capture all specific targets (e.g., failing to change grass to gold).
  • Clear division of labor: PSO ensures "editing everything correctly," while VOD ensures "not touching what shouldn't be moved."

Highlights & Insights

  • From Prompt Decomposition to Velocity-Space Orthogonalization: Unlike sequential multi-round methods, FlowDC uses Gram–Schmidt in the velocity subspace to isolate target directions. This allows parallel multi-target execution in a single round, avoiding cumulative errors.
  • "Orthogonal = Structural Destroyer": The direct diagnostic approach of identifying structural noise in the orthogonal component and using piecewise decay is a clean, effective design.
  • Engineering Optimization: Using displacements instead of velocities in later stages and reference velocities for initialization demonstrates practical maturity for deployment.

Limitations & Future Work

  • LLM Dependency: Editing quality depends on the quality of the LLM's progressive decomposition.
  • Hyperparameter Sensitivity: Parameters like \(t_g, \lambda_1, t_d\) may require tuning for different base models or step counts.
  • Orthogonality Boundaries: If targets are inherently highly correlated semantically, pure orthogonal bases might struggle to separate them perfectly.

Comparison with Related Methods

  • vs FlowEdit: FlowEdit serves as the baseline for inversion-free editing. FlowDC builds on it by adding PSO and VOD and improves all four metrics on both benchmarks.
  • vs Multi-round methods: Sequential methods incur linear costs and cumulative inconsistency; FlowDC's single-round approach is more efficient and consistent.
  • vs RF-Edit: RF-Edit represents the "over-editing" extreme; FlowDC uses orthogonal decay to mitigate this structural damage.

Rating

  • Novelty: ⭐⭐⭐⭐ Innovative shift from prompt-level to velocity-subspace decomposition.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Strong benchmarks and ablation; however, primarily validated on FLUX.1.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation, well-defined algorithms, and supportive qualitative visualization.
  • Value: ⭐⭐⭐⭐ Training-free and plug-and-play for flow editing; addresses a highly practical need for complex multi-target editing.