VINCIE: Unlocking In-context Image Editing from Video
Conference: ICLR 2026 arXiv: 2506.10941 Code: vincie2025.github.io Area: Image Editing Keywords: in-context editing, video learning, multi-turn editing, DiT, segmentation prediction
TL;DR
VINCIE is the first framework to demonstrate that in-context image editing models can be learned entirely from native video data. By annotating videos as interleaved multimodal sequences and training on three proxy tasks (NIP/CSP/NSP), it achieves state-of-the-art performance on multi-turn editing benchmarks, improving the 5-turn editing success rate from less than 2% (baseline) to 25%.
Background & Motivation
Background: In-context image editing enables users to iteratively modify images through multi-turn interactions. Existing methods rely on task-specific pipelines and expert models (e.g., segmentation, inpainting) to construct paired training data.
Limitations of Prior Work: (1) Constructing paired data for multi-turn editing is extremely difficult; existing methods can only mine single-turn editing pairs. (2) Dependence on task-specific pipelines limits the generality and scalability of data. (3) Consistency degradation and error accumulation in multi-turn editing remain serious issues.
Key Challenge: A fundamental tension exists between the scarcity of high-quality multi-turn editing training data and the need for models to learn long-range contextual dependencies.
Goal: To investigate whether a meaningful in-context image editing model can be learned solely from video data, without any independently curated image pairs.
Key Insight: Videos naturally contain rich visual dynamics—object appearance and disappearance, pose changes, camera motion—which implicitly provide learning signals for editing operations.
Core Idea: Native video data is used to construct interleaved multimodal sequences (frames + transition descriptions + segmentation masks), and a DiT model is trained via three proxy tasks to learn context-aware image editing.
Method
Overall Architecture
(1) \(K\) frames are sparsely sampled from a video, and a VLM annotates inter-frame visual transition descriptions \(T_i\); (2) GroundingDINO + SAM2 generate segmentation masks for regions of editing interest (RoE); (3) an interleaved multimodal sequence \((I_0, T_0, M_{00}, M_{01}, I_1, \ldots, I_K)\) is constructed; (4) a DiT model is jointly trained via three proxy tasks.
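As a concrete illustration of step (3), here is a minimal sketch of packing one annotated clip into the interleaved order described above. The function name and container types are hypothetical, not from the paper's codebase.

```python
from typing import Any, List, Tuple

def build_interleaved_sequence(
    frames: List[Any],           # sampled frames I_0 .. I_K
    transitions: List[str],      # VLM transition descriptions T_0 .. T_{K-1}
    roe_masks: List[List[Any]],  # per-transition RoE masks, e.g. [M_i0, M_i1]
) -> List[Tuple[str, Any]]:
    """Pack one annotated clip into the interleaved order
    (I_0, T_0, M_00, M_01, I_1, ..., I_K) used as a training session."""
    assert len(frames) == len(transitions) + 1 == len(roe_masks) + 1
    sequence: List[Tuple[str, Any]] = []
    for i, frame in enumerate(frames):
        sequence.append(("image", frame))
        if i < len(transitions):  # the last frame has no outgoing transition
            sequence.append(("text", transitions[i]))
            for mask in roe_masks[i]:
                sequence.append(("mask", mask))
    return sequence
```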
Key Designs
- Scalable Video Data Annotation Pipeline:
    - Function: Converts native video into interleaved multimodal sequences suitable for training in-context editing models.
    - Mechanism: A hybrid sampling strategy is adopted (uniform interval sampling + fixed-frame-count sampling; see the sketch after this item). Adjacent frames are annotated by a VLM using chain-of-thought (CoT) reasoning to describe visual transitions (enumerating differences across aspects → summarizing them into an editing instruction \(T_i\)), followed by GroundingDINO + SAM2 to extract RoE segmentation masks.
    - Design Motivation: Uniform interval sampling captures fine-grained object-level changes, while fixed-frame-count sampling covers large-scale scene-level changes. RoE masks provide explicit spatial localization signals.
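A minimal sketch of the hybrid sampling strategy referenced above; the stride, frame count, and the 50/50 mixing ratio are illustrative assumptions, not values reported in the paper.

```python
import random
from typing import List

def uniform_interval_indices(num_frames: int, stride: int) -> List[int]:
    """Every `stride`-th frame: captures fine-grained, object-level changes."""
    return list(range(0, num_frames, stride))

def fixed_count_indices(num_frames: int, k: int) -> List[int]:
    """Exactly k frames spread over the whole clip: covers scene-level changes."""
    if k <= 1:
        return [0]
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]

def hybrid_sample(num_frames: int, stride: int = 30, k: int = 6) -> List[int]:
    """Pick one strategy per clip at random (mixing ratio is an assumption)."""
    if random.random() < 0.5:
        return uniform_interval_indices(num_frames, stride)
    return fixed_count_indices(num_frames, k)
```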
- DiT Architecture and In-Context Compositional Learning:
    - Function: Learns context-conditioned image generation within a Diffusion Transformer framework.
    - Mechanism: The modeling objective is \(\log p(S) = \sum_{i=1}^{M} \log p(I_i \mid I_0, T_0, \ldots, I_{i-1}, T_{i-1})\). Learnable <TURN> tokens separate turns; 1D RoPE is applied to text tokens and 3D RoPE to image tokens. Both full-attention and block-level causal attention variants are provided (see the mask sketch after this item). Random dropout is applied to contextual inputs (frames and masks) to improve generalization.
    - Design Motivation: Pre-trained weights from video foundation models provide strong initialization; context dropout encourages the model to flexibly leverage varying combinations of contextual information.
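For the block-level causal attention variant mentioned above, a minimal sketch of the attention mask; the per-token turn-index bookkeeping is an assumed implementation detail, not taken from the paper.

```python
import torch

def block_causal_mask(turn_ids: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask for block-level causal attention.

    turn_ids: (seq_len,) integer turn index for every token (text, image,
    mask, and <TURN> separator tokens all carry the index of their turn).
    A query may attend to any key whose turn index is <= its own, so
    attention is full within a turn and causal across turns.
    """
    q = turn_ids.unsqueeze(1)  # (seq_len, 1)
    k = turn_ids.unsqueeze(0)  # (1, seq_len)
    return k <= q              # (seq_len, seq_len), True = attend

# Example: a 6-token sequence spanning three turns.
mask = block_causal_mask(torch.tensor([0, 0, 1, 1, 1, 2]))
```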
- Three-Proxy-Task Learning Framework:
    - Function: Enhances editing capability through three complementary tasks.
    - Mechanism: (i) Next-frame Image Prediction (NIP), the primary task, learns context-aware editing; (ii) Current Segmentation Prediction (CSP) improves grounding by identifying the regions to be edited; (iii) Next-frame Segmentation Prediction (NSP) enhances controllable generation by assisting dynamic layout adjustment. All three tasks are jointly trained in a unified generative framework with the MSE diffusion loss under flow matching (see the target sketch after this item).
    - Design Motivation: CSP helps the model understand "what changed," NSP helps it anticipate "what will change," and together they enhance the editing quality of NIP.
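A minimal sketch of how the three proxy-task targets could be read off one annotated transition; treating \(M_{i0}\) as the current-frame RoE mask and \(M_{i1}\) as the next-frame RoE mask is my reading of the sequence notation, and the function is illustrative rather than the paper's implementation.

```python
from typing import Any, Dict, List

def proxy_targets(i: int, frames: List[Any], roe_masks: List[List[Any]]) -> Dict[str, Any]:
    """Targets for the i-th transition, all generated by the same DiT:
      NIP: the next frame I_{i+1}
      CSP: the RoE mask on the current frame I_i   (assumed M_{i0})
      NSP: the RoE mask laid out on the next frame (assumed M_{i1})
    """
    return {
        "NIP": frames[i + 1],
        "CSP": roe_masks[i][0],
        "NSP": roe_masks[i][1],
    }
```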
Loss & Training
The MSE diffusion loss under flow matching is used. RoE masks are included in training with 80% probability. Context dropout rates: 20% for the current frame, 70% for the current RoE mask, and 70% for the next-frame RoE mask. Inference uses 50 sampling steps with a CFG scale of 10. The 3B model is trained for 15k steps on 256 H100 GPUs for approximately 30 hours; the 7B model is trained for 40k steps for approximately 150 hours. The training data comprises approximately 10M session instances.
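A minimal rectified-flow-style training step consistent with "MSE diffusion loss under flow matching"; the linear interpolation path, velocity parameterization, and the `model(x_t, t, context)` signature are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1: torch.Tensor, context) -> torch.Tensor:
    """One training step: regress the velocity of a linear noise-to-data path.

    x1: clean latent(s) of the prediction target (next frame or RoE mask).
    model: stand-in for the DiT, called as model(x_t, t, context).
    """
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # one timestep per sample
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over spatial dims
    x_t = (1.0 - t_) * x0 + t_ * x1                # point on the linear path
    v_target = x1 - x0                             # path velocity (constant in t)
    v_pred = model(x_t, t, context)
    return F.mse_loss(v_pred, v_target)
```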
Key Experimental Results
Main Results
| Dataset | Metric | Ours (7B+SFT) | Prev. SOTA / Baseline | Notes |
|---|---|---|---|---|
| MagicBrush Turn-1 | DINO | 0.891 | 0.886 (Nano Banana) | Surpasses all academic methods |
| MagicBrush Turn-3 | DINO | 0.775 | 0.773 (Nano Banana) | Multi-turn advantage more pronounced |
| MSE-Bench Turn-1 | Success Rate | 0.950 | 0.937 (Step1X-Edit) | Second only to Bagel |
| MSE-Bench Turn-5 | Success Rate | 0.487 | 0.413 (Bagel) | Substantially outperforms open-source methods |
| MSE-Bench Turn-5 | Success Rate | 0.487 (7B+SFT) | 0.210 (ours, 3B) | Clear gain from model scaling (3B → 7B+SFT) |
Ablation Study
| Configuration | Metric | Result | Notes |
|---|---|---|---|
| w/o Seg. → w/ Seg. tasks | MSE-Bench Turn-1 | 0.847 → 0.887 | Segmentation prediction tasks improve editing performance |
| CS→I inference | MagicBrush DINO Turn-1 | 0.797 | Predicting the current segmentation first improves consistency |
| CS→NS→I inference | MagicBrush DINO Turn-3 | 0.679 | The chained editing strategy is optimal |
| Pairwise → sequence data | MSE-Bench Turn-5 | 0.010 → 0.220 | Sequence data substantially outperforms paired data |
| Data scale 0.25M → 10M | MSE-Bench Turn-5 | 5% → 22% | Approximately log-linear scaling |
Key Findings
- Training solely on video data is sufficient to match SOTA methods that use paired editing data; video pre-training followed by SFT yields the best results.
- In-context editing effectively mitigates artifact accumulation across multi-turn editing.
- The model exhibits emergent capabilities not explicitly trained for: multi-concept composition, story generation, and chained editing.
Highlights & Insights
- This work is the first to demonstrate the feasibility of learning in-context image editing from purely video data, opening a new paradigm for data sourcing. The massive scale of available video data confers a natural scalability advantage.
- The three proxy tasks are elegantly designed: CSP addresses "where things changed," NSP anticipates "where things will change," and their synergy enhances the editing quality of NIP—a decomposition strategy worth adopting broadly.
Limitations & Future Work
- Video training introduces subject position shifts, which are partially mitigated by segmentation prediction but not fully resolved.
- A substantial gap remains compared to commercial models (GPT-4o: 62.7%, Nano Banana: 64.3%) on MSE-Bench Turn-5.
- The evaluation relies solely on GPT-4o automatic assessment, without user satisfaction or preference studies.
Related Work & Insights
- vs. InstructPix2Pix: IP2P depends on GPT-3 + SD to generate paired data and supports only single-turn editing; VINCIE leverages video to naturally support multi-turn interactions.
- vs. OmniGen/OmniGen2: OmniGen uses paired editing data and suffers a sharp drop in multi-turn success rate (only 8.3% at MSE-Bench Turn-5); VINCIE's contextual modeling is substantially more robust.
- vs. UES/RealGeneral: These methods exploit only two frames from a video, ignoring long-range context; VINCIE uses complete multi-frame sequences.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to learn in-context editing from video data; a paradigm-level innovation with emergent capabilities.
- Experimental Thoroughness: ⭐⭐⭐⭐ Introduces the MSE-Bench benchmark; ablations are comprehensive; scaling analysis is thorough.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear; methodology is described systematically; experimental presentation is polished.
- Value: ⭐⭐⭐⭐⭐ Establishes the video→editing paradigm; data scalability addresses a core bottleneck in the field.