VINCIE: Unlocking In-context Image Editing from Video¶

Conference: ICLR 2026
arXiv: 2506.10941
Code: vincie2025.github.io
Area: Image Segmentation
Keywords: in-context editing, video learning, multi-turn editing, DiT, segmentation prediction

TL;DR¶

VINCIE is the first framework to demonstrate that in-context image editing models can be learned entirely from native video data. It annotates videos as interleaved multimodal sequences and trains on three proxy tasks: next-frame image prediction (NIP), current segmentation prediction (CSP), and next-frame segmentation prediction (NSP). It achieves state-of-the-art performance on multi-turn editing benchmarks, raising the 5-turn editing success rate from under 2% (baseline) to 25%.

Background & Motivation¶

Background: In-context image editing enables users to iteratively modify images through multi-turn interactions. Existing methods rely on task-specific pipelines and expert models (e.g., segmentation, inpainting) to construct paired training data.

Limitations of Prior Work: (1) Constructing paired data for multi-turn editing is extremely difficult; existing methods can only mine single-turn editing pairs. (2) Dependence on task-specific pipelines limits the generality and scalability of data. (3) Consistency degradation and error accumulation in multi-turn editing remain serious issues.

Key Challenge: A fundamental tension exists between the scarcity of high-quality multi-turn editing training data and the need for models to learn long-range contextual dependencies.

Goal: To investigate whether a meaningful in-context image editing model can be learned solely from video data, without any independently curated image pairs.

Key Insight: Videos naturally contain rich visual dynamics—object appearance and disappearance, pose changes, camera motion—which implicitly provide learning signals for editing operations.

Core Idea: Native video data is used to construct interleaved multimodal sequences (frames + transition descriptions + segmentation masks), and a DiT model is trained via three proxy tasks to learn context-aware image editing.

Method¶

Overall Architecture¶

(1) \(K+1\) frames \(I_0, \ldots, I_K\) are sparsely sampled from a video, and a VLM annotates the visual transition between each pair of adjacent sampled frames as a description \(T_i\); (2) GroundingDINO + SAM2 generate segmentation masks for regions of editing interest (RoE); (3) an interleaved multimodal sequence \((I_0, T_0, M_{00}, M_{01}, I_1, \ldots, I_K)\) is constructed; (4) a DiT model is jointly trained via three proxy tasks.
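
The sequence layout in step (3) can be sketched symbolically as follows. This is a minimal illustration, not the authors' code: the `Element` class and the `build_interleaved_sequence` function are hypothetical names, and reading \(M_{i0}\) as the RoE mask on the current frame and \(M_{i1}\) as the RoE mask on the next frame is an assumption inferred from the names of the segmentation proxy tasks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Element:
    kind: str  # "image", "text", or "mask"
    name: str  # symbolic label such as "I_0", "T_0", "M_00"


def build_interleaved_sequence(num_turns: int) -> List[Element]:
    """Symbolic layout (I_0, T_0, M_00, M_01, I_1, ..., I_K) for K turns."""
    if num_turns < 1:
        raise ValueError("need at least one turn")
    seq: List[Element] = []
    for i in range(num_turns):
        seq.append(Element("image", f"I_{i}"))
        seq.append(Element("text", f"T_{i}"))
        # Assumption: mask 0 is the RoE on the current frame,
        # mask 1 is the RoE on the next frame.
        seq.append(Element("mask", f"M_{i}0"))
        seq.append(Element("mask", f"M_{i}1"))
    seq.append(Element("image", f"I_{num_turns}"))
    return seq


names = [e.name for e in build_interleaved_sequence(2)]
```

Under this layout a session with \(K\) turns contains \(4K + 1\) elements: four per turn plus the final frame.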

Key Designs¶

Scalable Video Data Annotation Pipeline:
- Function: Converts native video into interleaved multimodal sequences suitable for training in-context editing models.
- Mechanism: A hybrid sampling strategy combines uniform interval sampling with fixed-frame-count sampling. A VLM annotates each pair of adjacent sampled frames using chain-of-thought (CoT) reasoning: it first enumerates the differences across aspects, then summarizes them into an editing instruction \(T_i\). GroundingDINO + SAM2 then extract the RoE segmentation masks.
- Design Motivation: Uniform interval sampling captures fine-grained object-level changes, while fixed-frame-count sampling covers large-scale scene-level changes. RoE masks provide explicit spatial localization signals.
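
The two sampling modes can be sketched as index selection over a video's frames. This is an illustrative sketch under assumptions: the function names, the `interval` and `count` parameters, and the 50/50 mixing probability in `sample_hybrid` are hypothetical and are not taken from the paper.

```python
import random
from typing import List


def sample_fixed_interval(num_frames: int, interval: int, count: int) -> List[int]:
    """Take up to `count` frames spaced `interval` frames apart, from frame 0."""
    if interval < 1 or count < 1:
        raise ValueError("interval and count must be positive")
    return list(range(0, num_frames, interval))[:count]


def sample_fixed_count(num_frames: int, count: int) -> List[int]:
    """Spread exactly `count` frames evenly over the whole video."""
    if count < 2 or num_frames < count:
        raise ValueError("need 2 <= count <= num_frames")
    last = num_frames - 1
    return [i * last // (count - 1) for i in range(count)]


def sample_hybrid(num_frames: int, count: int, interval: int,
                  rng: random.Random, p_interval: float = 0.5) -> List[int]:
    """Pick one of the two modes at random (mixing ratio is an assumption)."""
    if rng.random() < p_interval:
        return sample_fixed_interval(num_frames, interval, count)
    return sample_fixed_count(num_frames, count)


short_range = sample_fixed_interval(300, 10, 4)  # closely spaced frames
long_range = sample_fixed_count(300, 4)          # frames spanning the video
```

In this sketch the interval mode yields frames that are close in time (small, object-level changes), while the fixed-count mode stretches the same number of frames across the whole clip (larger, scene-level changes).
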

DiT Architecture and In-Context Compositional Learning:
- Function: Learns context-conditioned image generation within a Diffusion Transformer framework.
- Mechanism: The modeling objective is \(\log p(S) = \sum_{i=1}^{K} \log p(I_i \mid I_0, T_0, \ldots, I_{i-1}, T_{i-1})\), i.e., each frame is predicted from all preceding frames and transition descriptions. Learnable <TURN> tokens separate turns; 1D RoPE is applied to text and 3D RoPE to images. Both full attention and block-level causal attention variants are provided. Random dropout is applied to contextual inputs (frames and masks) to improve generalization.
- Design Motivation: Pre-trained weights from video foundation models provide strong initialization; context dropout encourages the model to flexibly leverage varying combinations of contextual information.
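
The block-level causal variant can be illustrated with a boolean attention mask in which every token sees all tokens in its own block and in earlier blocks. The sketch assumes one block per sequence element (a frame, a text description, or a mask); the block granularity actually used by the authors is not stated here, and `block_causal_mask` is a hypothetical helper.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Tokens attend bidirectionally inside their own block and causally
    across blocks (only to blocks that come earlier in the sequence).
    """
    if any(size < 1 for size in block_sizes):
        raise ValueError("every block needs at least one token")
    # Map each token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]


# Three blocks: a 2-token frame, a 1-token text, and another 2-token frame.
mask = block_causal_mask([2, 1, 2])
```

The full-attention variant corresponds to a mask that is `True` everywhere, so later elements can also influence the representation of earlier ones.
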

Three-Proxy-Task Learning Framework:
- Function: Enhances editing capability through three complementary tasks.
- Mechanism: (i) Next-frame Image Prediction (NIP)—the primary task, learning context-aware editing; (ii) Current Segmentation Prediction (CSP)—improves grounding ability by identifying regions to be edited; (iii) Next-frame Segmentation Prediction (NSP)—enhances controllable generation by assisting dynamic layout adjustment. All three tasks are jointly trained in a unified generative framework using the MSE diffusion loss under flow matching.
- Design Motivation: CSP helps the model understand "what changed," while NSP helps the model anticipate "what will change," and both jointly enhance the editing quality of NIP.
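
One way to picture the three tasks is as different choices of prediction target within the same interleaved sequence, with everything before the target serving as context. The sketch below works on symbolic names only. The `layout` and `proxy_example` helpers are hypothetical, and both the mask naming (`M_i0` for the current-frame RoE, `M_i1` for the next-frame RoE) and the "everything before the target is context" rule are assumptions made for illustration.

```python
from typing import List, Tuple


def layout(num_turns: int) -> List[str]:
    """Symbolic sequence I_0, T_0, M_00, M_01, I_1, ..., I_K."""
    seq: List[str] = []
    for i in range(num_turns):
        seq += [f"I_{i}", f"T_{i}", f"M_{i}0", f"M_{i}1"]
    seq.append(f"I_{num_turns}")
    return seq


def proxy_example(num_turns: int, turn: int, task: str) -> Tuple[List[str], str]:
    """Return (context, target) for one proxy task at one turn."""
    if not 0 <= turn < num_turns:
        raise ValueError("turn out of range")
    if task == "CSP":    # current segmentation prediction
        target = f"M_{turn}0"
    elif task == "NSP":  # next-frame segmentation prediction
        target = f"M_{turn}1"
    elif task == "NIP":  # next-frame image prediction
        target = f"I_{turn + 1}"
    else:
        raise ValueError(f"unknown task: {task}")
    seq = layout(num_turns)
    return seq[: seq.index(target)], target


csp = proxy_example(2, 0, "CSP")
nsp = proxy_example(2, 0, "NSP")
nip = proxy_example(2, 1, "NIP")
```

Read this way, predicting the masks before the image corresponds to a chained "current segmentation, then next segmentation, then image" order at inference time.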

Loss & Training¶

The MSE diffusion loss under flow matching is used. RoE masks are included in training with an 80% probability. Context dropout rates: 20% for the current frame, 70% for the current RoE, and 70% for the next-frame RoE. Inference uses 50 sampling steps with CFG scale = 10. The 3B model is trained for 15k steps on 256 × H100 GPUs for approximately 30 hours; the 7B model for 40k steps for approximately 150 hours. The training data comprises approximately 10M session instances.
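
The ingredients named above can be sketched in plain Python. Several details are assumptions rather than facts from the paper: the straight-line path \(x_t = (1 - t)\,\epsilon + t\,x\) with velocity target \(x - \epsilon\) is one common flow-matching convention, the dropout draws are treated as independent per element, the 80% mask-inclusion probability is not modeled, and all function names are hypothetical. The CFG rule shown is the standard extrapolation from the unconditional prediction toward the conditional one.

```python
import random
from typing import Callable, Dict, List, Sequence

# Drop probabilities quoted in the text; independence between the
# three draws is an assumption of this sketch.
DROP_PROB = {"current_frame": 0.20, "current_roe": 0.70, "next_roe": 0.70}


def sample_context_keep(rng: random.Random) -> Dict[str, bool]:
    """Decide which context elements survive dropout for one sample."""
    return {name: rng.random() >= p for name, p in DROP_PROB.items()}


def flow_matching_loss(data: Sequence[float], noise: Sequence[float], t: float,
                       predict_velocity: Callable[[List[float], float], List[float]]) -> float:
    """MSE between predicted and target velocity at time t in [0, 1]."""
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    target = [d - n for d, n in zip(data, noise)]  # velocity of the straight path
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def cfg_combine(v_cond: Sequence[float], v_uncond: Sequence[float],
                scale: float = 10.0) -> List[float]:
    """Classifier-free guidance: move from v_uncond toward v_cond by `scale`."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


# A predictor that always outputs zeros misses the target velocity [1, 2].
loss_zero_model = flow_matching_loss([1.0, 2.0], [0.0, 0.0], 0.5,
                                     lambda x, t: [0.0, 0.0])
```

With `scale = 1.0` the guidance rule returns the conditional prediction unchanged; larger values such as the reported 10 push the sample further toward the conditioning signal.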

Key Experimental Results¶

Main Results¶

| Dataset | Metric | Ours (7B+SFT) | Prev. SOTA / Baseline | Notes |
| --- | --- | --- | --- | --- |
| MagicBrush Turn-1 | DINO | 0.891 | 0.886 (Nano Banana) | Surpasses all academic methods |
| MagicBrush Turn-3 | DINO | 0.775 | 0.773 (Nano Banana) | Multi-turn advantage more pronounced |
| MSE-Bench Turn-1 | Success Rate | 0.950 | 0.937 (Step1X-Edit) | Second only to Bagel |
| MSE-Bench Turn-5 | Success Rate | 0.487 | 0.413 (Bagel) | Substantially outperforms open-source methods |
| MSE-Bench Turn-5 | Success Rate | 0.210→0.487 | 3B→7B+SFT | Significant scaling effect |

Ablation Study¶

| Configuration | Metric and Value | Notes |
| --- | --- | --- |
| w/o Seg. vs. w/ Seg. | MSE Turn-1: 0.847→0.887 | Segmentation prediction tasks improve editing performance |
| CS→I | MagicBrush DINO Turn-1: 0.797 | Current segmentation improves consistency |
| CS→NS→I | MagicBrush DINO Turn-3: 0.679 | Chained editing strategy is optimal |
| Pairwise vs. sequence | MSE Turn-5: 0.010→0.220 | Sequence data substantially outperforms paired data |
| Data scale 0.25M→10M | MSE Turn-5: 5%→22% | Approximately log-linear scaling |
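
As a rough reading of the data-scale row, the two reported endpoints imply a slope in success rate per tenfold increase in data. The calculation below uses only those two endpoints, so it cannot confirm the log-linear shape on its own; `points_per_decade` is a hypothetical helper.

```python
import math


def points_per_decade(n_small: float, rate_small: float,
                      n_large: float, rate_large: float) -> float:
    """Change in success rate (percentage points) per 10x more data."""
    return (rate_large - rate_small) / (math.log10(n_large) - math.log10(n_small))


# 0.25M sessions at 5% and 10M sessions at 22% on MSE-Bench Turn-5.
slope = points_per_decade(0.25e6, 5.0, 10e6, 22.0)
# roughly 10.6 percentage points per tenfold increase in data
```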

Key Findings¶

Training solely on video data is sufficient to match SOTA methods that use paired editing data; video pre-training followed by SFT yields the best results.
In-context editing effectively mitigates artifact accumulation across multi-turn editing.
The model exhibits emergent capabilities not explicitly trained for: multi-concept composition, story generation, and chained editing.

Highlights & Insights¶

This work is the first to demonstrate that in-context image editing can be learned purely from video data, opening a new paradigm for data sourcing. Because video is available at massive scale, the approach has a natural scalability advantage.
The three proxy tasks are elegantly designed: CSP addresses "where things changed," NSP anticipates "where things will change," and their synergy enhances the editing quality of NIP—a decomposition strategy worth adopting broadly.

Limitations & Future Work¶

Video training introduces subject position shifts, which are partially mitigated by segmentation prediction but not fully resolved.
A substantial gap remains compared to commercial models (GPT-4o: 62.7%, Nano Banana: 64.3%) on MSE-Bench Turn-5.
The evaluation relies solely on GPT-4o automatic assessment, without user satisfaction or preference studies.

Comparison with related methods:
vs. InstructPix2Pix: IP2P depends on GPT-3 + SD to generate paired data and supports only single-turn editing; VINCIE leverages video to naturally support multi-turn interactions.
vs. OmniGen/OmniGen2: OmniGen uses paired editing data and suffers a sharp drop in multi-turn success rate (only 8.3% at MSE Turn-5); VINCIE's contextual modeling is substantially more robust.
vs. UES/RealGeneral: These methods exploit only two frames from a video, ignoring long-range context; VINCIE uses complete multi-frame sequences.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First to learn in-context editing from video data; a paradigm-level innovation with emergent capabilities.
Experimental Thoroughness: ⭐⭐⭐⭐ Introduces the MSE-Bench benchmark; ablations are comprehensive; scaling analysis is thorough.
Writing Quality: ⭐⭐⭐⭐ Motivation is clear; methodology is described systematically; experimental presentation is polished.
Value: ⭐⭐⭐⭐⭐ Establishes the video→editing paradigm; data scalability addresses a core bottleneck in the field.