Shifting the Breaking Point of Flow Matching for Multi-Instance Editing

Conference: ICML2026
arXiv: 2602.08749
Code: https://github.com/Blowing-Up-Groundhogs/IDAttn
Area: Image Generation / Image Editing / Flow Matching / Multi-Instance Editing
Keywords: Flow Matching, MMDiT, Multi-Instance Editing, Attention Disentanglement, Infographic Text Editing

TL;DR

This paper targets the persistent problem of "attribute leakage" during simultaneous multi-instance editing in MMDiT-based Rectified Flow Matching models (such as FLUX.1 Kontext) and proposes Instance-Disentangled Attention (IDAttn). By applying structured masks to joint attention, it binds each editing instruction to its corresponding bounding box. Together with a hierarchical disentanglement/harmonization schedule and efficient independent multi-prompt encoding, the model completes \(N\) non-interfering edits in a single forward pass. It significantly outperforms multi-turn and concatenation baselines on the proposed infographic text editing benchmark.

Background & Motivation

Background: Text-driven image editing has long been dominated by U-Net diffusion models. Recently, the community has been migrating toward MMDiT + Rectified Flow Matching (e.g., Stable Diffusion 3, FLUX.1 Kontext), where the ODE formulation offers higher visual quality and faster inference. Editing typically involves concatenating reference image tokens with noise latents and feeding them into the same joint attention.

Limitations of Prior Work: Existing FM editors almost exclusively support a single instruction applied to the whole image, or at most a small number of edits. In multi-instance scenarios (e.g., dozens of text boxes in a single infographic), they either fail to edit large areas or suffer from attribute leakage, where semantics from box A seep into box B. Although multi-turn inference mitigates this, it multiplies the inference cost by \(N\) (one full pass per edit) and repeatedly damages background consistency.

Key Challenge: Flow matching learns a global velocity field \(v_\theta(x, t \mid c)\), where condition \(c\) is also injected globally. Joint attention allows prompt, latent, and context tokens to attend to each other freely. This leads to interference between query/key pairs of different instances in the shared vector field; instance-level isolation is not architecturally enforced.

Goal: Without modifying backbone weights or disrupting the global FM training objective, given \(N\) local instructions and bboxes \(\{(s_n, b_n)\}_{n=1}^N\), achieve (i) editing disentanglement (instructions do not interfere), (ii) locality (non-edited areas remain unchanged), (iii) global coherence (the final image remains harmonious), and (iv) single-forward-pass completion to maintain sub-linear inference costs.

Key Insight: The authors observe that the root cause of attribute leakage is structural—any two unrelated tokens are allowed to communicate in joint attention. Rather than iteratively optimizing attention maps during inference (as in P2P-like methods), it is better to directly modify the attention connectivity graph and map specific connectivity strategies to different depths of the MMDiT.

Core Idea: Use an additive \(\{0, -\infty\}\) mask to partition joint attention into "per-instance" subgraphs. Middle layers employ strict disentanglement to keep each instance's prompt/latent/context self-contained, while early and late layers employ harmonization to let global tokens reassemble the fragments into a coherent image.

Method

Overall Architecture

Input: Reference image \(I^{\mathrm{ref}}\), \(N\) bboxes \(b_n\), \(N\) text instructions \(s_n\), and a global/empty prompt \(s_g\). The backbone is FLUX.1 Kontext (MMDiT + Rectified Flow Matching), kept frozen. The workflow is as follows:

  1. Independent Multi-Prompt Encoding: \(s_g\) and each \(s_n\) are separately fed into text encoders to obtain variable-length embeddings, which are concatenated into the final text token sequence \(Z^{\text{text}}\). This prevents conceptual crosstalk during the encoding phase.
  2. Token Space Partitioning: Based on "modality \(\times\) instance attribution", the joint token sequence \(Z = Z^{\text{text}} \| Z^{\text{latent}} \| Z^{\text{context}}\) is partitioned into global prompt \(\mathbb{T}_g\), instance prompts \(\mathbb{T}_n\), background latent \(\mathbb{L}_u\), instance latents \(\mathbb{L}_n\), background context \(\mathbb{C}_u\), and instance contexts \(\mathbb{C}_n\) (a token can belong to multiple \(n\) if bboxes overlap).
  3. Hierarchical Attention Mask Scheduling: Early layers \(L_{\text{early}}\) and late layers \(L_{\text{late}}\) of the MMDiT use the harmonious mask \(M^{\mathrm{har}}\), while middle layers \(L_{\text{mid}}\) use the disentangled mask \(M^{\mathrm{dis}}\). The resulting velocity field satisfies both instance isolation and global coherence.
  4. Single ODE Integration: Using the constrained \(v_\theta\) described above, the ODE is integrated once from \(t=0\) to \(t=1\) to obtain \(I^{\text{edit}}\), completing \(N\) edits simultaneously.
  5. Optional LoRA Fine-tuning: A LoRA adapter (\(r=32\)) is applied to MMDiT on the Crello Edit training subset. Using the same masking strategy, \(\mathcal{L}_{\mathrm{FM}}\) is minimized to alleviate the "reluctance to edit" problem for small regions or short instructions in the original model.
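
Step 2 of the workflow above can be illustrated with a minimal sketch that assigns each latent token to the instances whose bounding box covers it. The function name `partition_latent_tokens`, the pixel-space half-open box format, and the 16-pixel patch size used in the example are assumptions made for illustration; they are not taken from the paper or its released code.

```python
def partition_latent_tokens(height, width, patch, boxes):
    """Map every latent patch to the set of instance ids covering it.

    boxes: list of (x0, y0, x1, y1) in pixels, half-open on the right/bottom
           (an assumed convention for this sketch).
    Returns a row-major list of frozensets over the patch grid.
    An empty set marks a background token; a set with several ids marks
    a token shared by overlapping boxes.
    """
    rows, cols = height // patch, width // patch
    owners = []
    for r in range(rows):
        for c in range(cols):
            # Pixel extent of this patch.
            px0, py0 = c * patch, r * patch
            px1, py1 = px0 + patch, py0 + patch
            ids = frozenset(
                n for n, (x0, y0, x1, y1) in enumerate(boxes)
                if px0 < x1 and x0 < px1 and py0 < y1 and y0 < py1
            )
            owners.append(ids)
    return owners


# Two overlapping boxes on a 64x64 image with 16-pixel patches (4x4 grid).
owners = partition_latent_tokens(64, 64, 16, [(0, 0, 32, 32), (16, 16, 48, 48)])
background = [i for i, ids in enumerate(owners) if not ids]
```

In this toy layout the patch at grid position (1, 1) belongs to both instances, which mirrors the remark that a token can belong to multiple \(n\) when boxes overlap.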

Key Designs

  1. Instance-Disentangled Attention (IDAttn) Core Operator:

    • Function: Rewrites standard joint attention \(\mathrm{Attn}(Q,K,V) = \mathrm{softmax}(QK^\top/\sqrt{d})V\) as \(\mathrm{IDAttn}(Q,K,V,M) = \mathrm{softmax}(QK^\top/\sqrt{d} + M)V\), using an additive \(\{0, -\infty\}\) mask to cut off token pairs that should not connect.
    • Mechanism: Defines two complementary masks. \(M^{\mathrm{dis}}\) (Disentangled) allows only three types of connectivity: internal communication within an instance \(\mathbb{T}_n \cup \mathbb{L}_n \cup \mathbb{C}_n\), global prompt \(\mathbb{T}_g\) unidirectionally attending to all latent/context, and background latent/context attending only to global and non-instance prompt tokens. Any cross-instance \(\mathbb{T}_n \leftrightarrow \mathbb{T}_m\) (\(n \neq m\)) is blocked by \(-\infty\). \(M^{\mathrm{har}}\) (Harmonious) relaxes this to allow instance latents/contexts to attend to each other and all image tokens, maintaining isolation only between prompts.
    • Design Motivation: Move attribute leakage prevention from the loss/optimization level to the architectural level, strictly prohibiting cross-instance prompt-prompt and prompt-latent coupling. Retaining the unidirectional prompt→latent global channel ensures global style tokens can still influence all regions without fragmenting the background.
  2. Hierarchical Mask Scheduling (early/mid/late layer scheduling):

    • Function: Determines whether each MMDiT layer uses \(M^{\mathrm{har}}\) or \(M^{\mathrm{dis}}\), balancing "instance separation" and "global coherence."
    • Mechanism: Based on observations that Transformers extract coarse features in early layers, perform semantic binding in middle layers, and provide global coordination in late layers, the authors set the schedule as \((L_{\text{early}}, L_{\text{mid}}, L_{\text{late}}) = (M^{\mathrm{har}}, M^{\mathrm{dis}}, M^{\mathrm{har}})\). In the Table 1 ablation over 8 combinations, this three-stage schedule gives the best Tgt CLIP, Bg LPIPS, and Loc CLIP while keeping AR competitive (92.19%, against 94.27% for \(M^{\mathrm{dis}}\) throughout). Using \(M^{\mathrm{dis}}\) throughout sacrifices background consistency, while \(M^{\mathrm{har}}\) throughout degrades instruction following (AR only about 80%).
    • Design Motivation: Pure disentanglement fragments the background into patches, while pure harmonization muddles multi-instance semantics. Disentanglement is placed in the sensitive semantic-binding middle section, with harmonious layers used for "layout initialization" and "final merging."
  3. Efficient Independent Multi-Prompt Encoding:

    • Function: Keeps the total text token length proportional to the actual length of the instructions rather than \(N \times L_{\text{pad}}\), while ensuring instance text representations do not pollute each other.
    • Mechanism: Original instructions are split into \(s_g\) (empty prompt in practice) and \(\{s_n\}_{n=1}^N\). Each is individually passed through the text encoder to obtain variable-length embeddings before being concatenated into the final \(Z^{\text{text}}\). Compared to (i) single-prompt backend masking (where semantics are already polluted during encoding) and (ii) multi-prompt padding to equal length (where the text token count grows as \(N \times L_{\text{pad}}\) regardless of how short the instructions are), this design achieves both isolation by construction and cost-efficiency.
    • Design Motivation: In extreme scenarios like InfoEdit where \(N\) can reach 285, padding to 77 tokens \(\times\) 285 would cause attention \(O((NL)^2)\) to exceed memory limits. Inference time curves in Figure 3 confirm that this strategy makes wall-clock time grow almost linearly rather than quadratically with \(N\).
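
The masked operator and the layer schedule described in the designs above can be sketched in a few lines of plain Python. This is a simplified reading of \(M^{\mathrm{dis}}\) and \(M^{\mathrm{har}}\), not the paper's exact definition: the token representation as `(kind, owners)` pairs, the rule set in `allowed`, and the 25% / 25% layer boundaries in `mask_mode` are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def allowed(q, k, mode):
    """Return True if query token q may attend to key token k.

    Tokens are (kind, owners) pairs: kind is "text", "latent" or "context";
    owners is a frozenset of instance ids, {"g"} for the global prompt,
    or empty for background image tokens (assumed encoding).
    """
    (q_kind, q_own), (k_kind, k_own) = q, k
    q_img, k_img = q_kind != "text", k_kind != "text"
    if mode == "har" and q_img and k_img:
        return True   # harmonious: all image tokens see each other
    if q_own & k_own:
        return True   # same instance, or global prompt with itself
    if q_img and k_own == {"g"}:
        return True   # one-way channel: image tokens read the global prompt
    if q_img and k_img and not q_own and not k_own:
        return True   # background image tokens see each other
    return False

def build_mask(tokens, mode):
    """Additive mask: 0.0 where attention is allowed, -inf where it is cut."""
    return [[0.0 if allowed(q, k, mode) else NEG_INF for k in tokens]
            for q in tokens]

def id_attn(Q, K, V, M):
    """softmax(Q K^T / sqrt(d) + M) V on plain Python lists."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + M[i][j]
                  for j, k in enumerate(K)]
        top = max(logits)  # finite here: every token may attend to itself
        w = [math.exp(x - top) for x in logits]  # exp(-inf) evaluates to 0.0
        z = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) / z
                    for c in range(len(V[0]))])
    return out

def mask_mode(layer, n_layers, early=0.25, late=0.25):
    """Pick "har" for early/late layers and "dis" for the middle ones.
    The default boundaries are hypothetical placeholders."""
    if layer < early * n_layers or layer >= (1 - late) * n_layers:
        return "har"
    return "dis"

tokens = [
    ("text", frozenset({"g"})),   # global prompt
    ("text", frozenset({0})),     # instruction 0
    ("text", frozenset({1})),     # instruction 1
    ("latent", frozenset({0})),   # latent inside box 0
    ("latent", frozenset({1})),   # latent inside box 1
    ("latent", frozenset()),      # background latent
]
M_dis = build_mask(tokens, "dis")
M_har = build_mask(tokens, "har")
```

Under these assumed rules, a latent token in box 0 is unaffected by the value vector of a latent token in box 1 when `M_dis` is used, but does depend on it under `M_har`; the two instruction prompts never attend to each other in either mode.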

Loss & Training

No additional training is required for the inference phase. Optional domain-specific fine-tuning reuses the conditional rectified flow matching loss \(\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t, x_1, x_0}[\|v_\theta(x_t, t \mid c) - (x_1 - x_0)\|^2]\), where \(x_t = (1-t)x_0 + t x_1\). After integrating IDAttn and multi-prompt encoding, LoRA (\(r=32\)) is applied to all MMDiT layers and fine-tuned on 1,512 samples from the Crello Edit training set to specifically address underfitting in small regions and short instructions.
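
The loss and the single ODE integration can be made concrete with a small sketch on plain Python lists. The helper names (`interpolate`, `fm_loss`, `euler_integrate`), the fixed-step Euler solver, and the toy "oracle" velocity are assumptions for illustration; the source does not state which solver or step count is used.

```python
def interpolate(x0, x1, t):
    """x_t = (1 - t) * x0 + t * x1, the straight path between noise and data."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def fm_loss(v_pred, x0, x1):
    """Squared error between a predicted velocity and the target x1 - x0."""
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - g) ** 2 for p, g in zip(v_pred, target))

def euler_integrate(v_fn, x0, steps):
    """Integrate dx/dt = v_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = v_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: an oracle that returns the exact target velocity has zero loss,
# and integrating it carries x0 to x1.
x0, x1 = [0.0, 1.0], [2.0, -1.0]
oracle = lambda x, t: [b - a for a, b in zip(x0, x1)]
x_mid = interpolate(x0, x1, 0.5)
x_end = euler_integrate(oracle, x0, steps=4)
```

In the actual method, `v_fn` would be the frozen (or LoRA-adapted) MMDiT evaluated with the IDAttn masks, so all \(N\) edits share this one integration.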

Key Experimental Results

Main Results

On the natural image multi-instance editing benchmark LoMOE-Bench (80 images, 2-7 edits per image):

| Method | Tgt CLIP↑ | LPIPS\(_\text{B}\)↓ | SSIM\(_\text{B}\)↑ | Loc CLIP↑ | HPS↑ | AR%↑ |
|---|---|---|---|---|---|---|
| LoMOE | 26.00 | 0.090 | 0.834 | 29.40 | 0.546 | 98.96 |
| LayerEdit | 25.61 | 0.147 | 0.864 | 29.07 | 0.186 | 100.00 |
| FLUX (Vanilla) | 24.71 | 0.206 | 0.830 | 27.58 | -0.059 | 94.79 |
| FLUX μT (Multi-turn) | 25.71 | 0.150 | 0.873 | 28.27 | 0.550 | 94.27 |
| FLUX w/ v.c. | 24.49 | 0.170 | 0.893 | 27.60 | 0.265 | 92.71 |
| Ours | 25.60 | 0.099 | 0.919 | 29.08 | 0.574 | 89.06 |

Ours achieves the best SSIM\(_\text{B}\) and human preference (HPS) and the second-best LPIPS\(_\text{B}\) (0.099, behind LoMOE's 0.090), while nearly matching LoMOE in Tgt/Loc CLIP; its AR (89.06%) is the lowest in the table. LoMOE, however, relies on multi-diffusion stitching, which has high inference costs and cannot scale to large \(N\).

On the Infographic Editing Benchmark (Crello Edit + InfoEdit, Table 4 summary):

| Method | Crello FID↓ | Crello CER↓ | Crello AR%↑ | InfoEdit FID↓ | InfoEdit CER↓ | InfoEdit AR%↑ |
|---|---|---|---|---|---|---|
| FLUX (Vanilla) | 10.06 | 0.65 | 68.72 | 4.36 | 0.77 | 39.94 |
| FLUX μT | 12.10 | 0.63 | 90.44 | 65.73 | 0.90 | 99.81 |
| FLUX st. (Stitch) | 15.69 | 0.59 | 73.49 | 10.48 | 0.66 | 63.25 |
| Calligrapher μT | 10.15 | 0.73 | 51.21 | 113.23 | 0.92 | 99.98 |
| Ours | 9.45 | 0.61 | 52.00 | 2.41 | 0.64 | 52.61 |
| Ours + ft | 10.85 | 0.52 | 92.16 | 2.80 | 0.56 | 80.90 |

On InfoEdit (up to 285 instructions per image), FID drops from 4.36 (vanilla FLUX) to 2.41 (Ours) and 2.80 (Ours + ft), and CER falls from 0.77 to 0.56 with fine-tuning. The fine-tuned (ft) version keeps FID low while raising AR from 52.61% to 80.90%, indicating that LoRA adaptation compensates for the model's reluctance to edit small areas with short instructions.
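
For readers unfamiliar with the CER column, the sketch below computes character error rate in its standard form: Levenshtein edit distance between the reference string and the recognized string, divided by the reference length. How the paper obtains the recognized text (which OCR system, which normalization) is not stated here, so the function names and the example strings are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical example: the target text was "SALE" but OCR reads "SOLE".
example = cer("SALE", "SOLE")
```

Because the distance is normalized by the reference length only, CER can exceed 1 when the rendered text is much longer than the target.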

Ablation Study

| Config | Tgt CLIP↑ | LPIPS\(_\text{B}\)↓ | Loc CLIP↑ | AR%↑ |
|---|---|---|---|---|
| Full \(M^{\mathrm{dis}}\) | 25.51 | 0.108 | 29.14 | 94.27 |
| Full \(M^{\mathrm{har}}\) | 25.05 | 0.103 | 28.54 | 80.21 |
| \((\mathrm{dis}, \mathrm{har}, \mathrm{har})\) | 25.01 | 0.099 | 28.53 | 82.29 |
| \((\mathrm{dis}, \mathrm{dis}, \mathrm{har})\) | 25.63 | 0.100 | 29.21 | 91.15 |
| \((\mathrm{har}, \mathrm{dis}, \mathrm{dis})\) | 25.59 | 0.103 | 29.23 | 93.23 |
| \((\mathrm{har}, \mathrm{dis}, \mathrm{har})\) (Final) | 25.67 | 0.091 | 29.26 | 92.19 |

Another set of results (Table 2) shows that enabling efficient prompt encoding alone, without IDAttn, yields performance similar to the vanilla baseline, whereas adding IDAttn drives Tgt CLIP to 25.67 and reduces LPIPS to 0.091. IDAttn is the source of quality and efficient prompt encoding is the source of efficiency; the two are orthogonal.

User studies and Gemini 3 Flash LLM-as-Judge (Table 5) Elo ratings show the proposed method leading significantly in both LoMOE and Infographics categories (User: 1589 vs FLUX: 1331 / FLUX μT: 680).

Key Findings

  • Layer placement for disentanglement/harmonization is critical: Placing \(M^{\mathrm{dis}}\) in the middle and \(M^{\mathrm{har}}\) at the early/late stages is Pareto optimal. In the ablation table, the configurations that use \(M^{\mathrm{har}}\) in the middle layers (full \(M^{\mathrm{har}}\) and \((\mathrm{dis}, \mathrm{har}, \mathrm{har})\)) are the ones whose AR drops to the 80% range, indicating that cutting cross-instance communication matters most where semantic binding takes place.
  • IDAttn is robust to imprecise bboxes: Excessively loose boxes (e.g., cotton balls in Fig 5a) are handled well as the backbone retains internal localization capabilities. For nested or overlapping boxes (e.g., giraffes in Fig 5b), the softmax distribution becomes sharper on smaller boxes, "automatically favoring small-box instructions" and resolving conflicts robustly.
  • Larger N highlights the method's advantage: CER and AR curves vs. \(N\) show that FLUX baselines effectively "give up" on most instructions when \(N \geq 10\), whereas the proposed method remains stable in the tens-of-edits range.
  • Dependency on external bboxes is the primary limitation: The method does not perform localization itself and relies on OCR/detectors for \(b_n\). Entirely incorrect bboxes lead to editing failure; the authors leave end-to-end localize+edit for agentic future work.

Highlights & Insights

  • Attribute leakage redefined as an "architectural problem" rather than a "loss problem": Unlike inference-time optimization routes like P2P or Attend-and-Excite, IDAttn uses a mask to cut off cross-instance token paths. It requires no iterative optimization, has near-zero deployment cost, and is plug-and-play for any MMDiT.
  • The "early/late harmonization + mid disentanglement" pattern is highly transferable: This aligns with recent research on ViT hierarchical representations (early=texture, mid=semantics, late=global). This logic could naturally extend to scenarios like multi-signal ControlNet, multi-subject generation, or multi-trajectory video editing.
  • Infographic text editing established as a new benchmark category: Unlike natural image multi-instance editing (often \(N \le 7\)), infographics reach \(N\) in the hundreds with box areas as small as 0.6%. This provides a real pressure test for "high-density small regions + strict layout," offering significant reference value for future text-rendering work.
  • Variable-length independent encoding is an underrated engineering trick: When instance counts explode, replacing fixed-length padding with a token budget proportional to each instruction's actual length ensures isolation while controlling costs. This could theoretically be integrated into any multi-condition generation model with cross-attention.
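
The cost argument behind variable-length encoding can be checked with simple arithmetic. The sketch below counts text tokens only (image tokens are ignored), and the 8-token instruction length is a made-up figure for illustration; only \(N = 285\) and the 77-token padding length come from the text above.

```python
def text_tokens(instruction_lengths, pad_to=None):
    """Total text tokens for N instructions.

    With pad_to set, every instruction occupies pad_to tokens;
    otherwise each contributes its actual length.
    """
    if pad_to is None:
        return sum(instruction_lengths)
    return pad_to * len(instruction_lengths)

lengths = [8] * 285                          # hypothetical short instructions
padded = text_tokens(lengths, pad_to=77)     # fixed-length padding
variable = text_tokens(lengths)              # variable-length concatenation
# Dense attention over the text tokens scales with the square of the length.
pair_ratio = (padded / variable) ** 2
```

With these assumed lengths, padding yields 21,945 text tokens against 2,280 for variable-length concatenation, which is where the quadratic attention cost diverges.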

Limitations & Future Work

  • Dependency on external localization: \(b_n\) must be provided by OCR, LayoutParser, or humans; incorrect boxes are fatal. End-to-end localize-and-edit is a clear gap left for future agentic pipelines.
  • Fine-tuning limited to Crello distributions: Generalization across complex hand-drawn posters, tables, or comic panels has not been systematically evaluated.
  • Binary \(\{0, -\infty\}\) masks: There remains a risk of unnatural transitions at adjacent instance boundaries; soft masks (learned continuous attention bias) or integration with attention rollout methods could be considered.
  • Validation limited to FLUX.1 Kontext: While mask logic should apply to any MMDiT-based FM model, actual gains on SD3, OmniGen, or Lumina remain to be verified.
  • Metrics biased toward OCR/CLIP: CER is not sensitive to font style preservation. Finer metrics (e.g., glyph similarity, layout IoU) are needed for scenarios like "translating language while preserving font."

Related Work & Insights

  • vs. P2P / Attend-and-Excite (Hertz 2023, Chefer 2023): These perform inference-time attention map optimization on U-Net diffusion + cross-attention. Ours uses architecture-level hard masks on MMDiT + joint attention, eliminating per-sample optimization and remaining naturally compatible with Rectified FM.
  • vs. LoMOE (Chakrabarty 2024) / LayerEdit (Fu 2026): They handle multi-instance editing via multi-diffusion stitching or layer-wise learning during discrete denoising steps. Ours completes all edits in a single forward pass and pushes benchmarks from a few boxes to the hundreds found in InfoEdit.
  • vs. FLUX.1 Kontext (Labs 2025) Native Editing: FLUX drops instructions when \(N > 5\), and multi-turn μT suffers image degradation when \(N > 20\). Ours acts as a "multi-instruction parallel" plugin for FLUX, positioning itself similarly to how ControlNet relates to SD.
  • vs. Calligrapher (Ma 2025): Calligrapher is fine-tuned for text editing but handles only one box at a time; multi-turn use ruins the background. Ours proves that "architectural disentanglement" is faster and more accurate for text rendering than "specialized fine-tuning + serial processing."
  • vs. Multi-prompt Encoding (Zhou 2025): Zhou et al. split instructions for independent encoding but use equal-length padding, causing costs to explode linearly with \(N\). Ours uses variable-length concatenation, preserving isolation while linking attention costs to actual semantic volume.

Rating

  • Novelty: ⭐⭐⭐⭐ Redefining attribute leakage from "optimization level" to "attention connectivity level" is a clean perspective shift; the hierarchical mask scheduling is concrete and non-trivial.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage with natural image + infographic benchmarks, 8-way mask schedule ablation, user study + LLM-as-Judge, and scalability curves; the only slight drawback is that a single backbone (FLUX) is tested.
  • Writing Quality: ⭐⭐⭐⭐ Formula definitions, token partitioning, and mask matrices are clear. The mask visualization in Figure 1 is a key anchor for understanding.
  • Value: ⭐⭐⭐⭐ Infographic editing + multi-instruction parallelism are real industrial needs (e.g., localizing infographics). The open-sourced dataset will likely become a standard benchmark for text-aware editing; IDAttn itself is plug-and-play with a low barrier for community reuse.