Instruction-based Image Manipulation by Watching How Things Move

Conference: CVPR 2025
arXiv: 2412.12087
Code: https://github.com/mingdeng-cao/InstructMove
Area: Diffusion Models / Image Editing
Keywords: Instructed Image Editing, Video Frame Pairs, Spatial Conditioning, Non-rigid Transformation, Multimodal Large Language Models

TL;DR

This paper proposes InstructMove, which constructs a large-scale real-world image editing dataset by sampling frame pairs from videos and generating editing instructions with multimodal large language models (MLLMs). A text-to-image (T2I) model fine-tuned on this data with a spatial conditioning strategy achieves SOTA performance on non-rigid editing tasks such as pose adjustment and viewpoint transformation.

Background & Motivation

Background: In the field of text-driven image editing, methods like InstructPix2Pix construct training datasets by generating editing instructions using language models and then generating target images using tuning-free methods, but the target images are synthetic and their quality is limited.
Limitations of Prior Work: Target images in existing datasets are synthesized by methods like Prompt-to-Prompt and suffer from severe appearance deviations and artifacts. Consequently, models trained on them struggle with complex non-rigid edits (such as pose changes and viewpoint adjustments) and with fine-grained content preservation.
Key Challenge: The quality ceiling of synthetic target images caps the performance of editing models, because synthetic images fail to capture the natural deformations and motions of the real world.
Goal: (a) How to construct a large-scale editing dataset containing real target images? (b) How to flexibly alter the structure while preserving the original image content during editing?
Key Insight: Video frames naturally capture object motion (pose changes, element shifts, camera movement) while maintaining identity consistency of the subjects across frames, making them ideal data sources for editing.
Core Idea: Use video frame pairs as real source-target image pairs, employ MLLMs to generate editing instructions, and condition the reference image via spatial concatenation (instead of channel concatenation) to achieve flexible and high-fidelity instruction-driven image editing.

Method

Overall Architecture

The method has two parts: data construction and model training. Data construction samples frame pairs from internet videos, filters them using optical flow, and generates editing instructions using an MLLM, yielding a dataset of 6 million (source, target, instruction) triplets. Model training builds on a pre-trained Stable Diffusion model: the reference image latent is concatenated with the noisy target latent along the spatial dimension (width direction) as input, and the model is trained with a denoising loss. At inference, the model takes the original image and an editing instruction and generates the edited result.

Key Designs

Video Frame Pair Sampling and Filtering Pipeline:
- Function: Extract high-quality frame pairs with moderate transformations from videos.
- Mechanism: First, caption the videos and filter out those unsuitable for editing (landscapes, static scenes, etc.). Then sample two frames at a fixed 3-second interval. Compute optical flow with RAFT for motion filtering, retaining frame pairs with moderate motion and discarding those with excessive or insufficient motion. Finally, compute background occlusion masks via backward warping and discard frame pairs with excessive background changes.
- Design Motivation: Ensure sufficient but not excessive transformation between the frame pairs, avoiding static frames (which contain no editing information) and drastic motion frames (where correspondences are lost).
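The motion and occlusion filters described above can be sketched as follows. This is a minimal illustration, not the authors' code: the flow field is assumed to have been computed already (for example by RAFT), and the function names and all thresholds (`min_motion`, `max_motion`, `max_occluded`) are hypothetical values chosen for the example.

```python
import numpy as np


def motion_score(flow: np.ndarray) -> float:
    """Mean optical-flow magnitude (in pixels) of an (H, W, 2) flow field."""
    return float(np.linalg.norm(flow, axis=-1).mean())


def keep_pair(flow: np.ndarray, occlusion_mask: np.ndarray,
              min_motion: float = 2.0, max_motion: float = 40.0,
              max_occluded: float = 0.3) -> bool:
    """Keep a frame pair only if motion is moderate and little background is occluded.

    All thresholds are hypothetical; the source does not state the values used.
    """
    score = motion_score(flow)
    occluded_fraction = float(np.mean(occlusion_mask))
    moderate_motion = min_motion <= score <= max_motion
    return moderate_motion and occluded_fraction <= max_occluded


# Synthetic stand-ins for real flow fields and occlusion masks.
h, w = 64, 64
static_flow = np.zeros((h, w, 2))          # no motion at all
moderate_flow = np.full((h, w, 2), 5.0)    # about 7 px of motion everywhere
drastic_flow = np.full((h, w, 2), 100.0)   # about 141 px of motion everywhere
no_occlusion = np.zeros((h, w), dtype=bool)
full_occlusion = np.ones((h, w), dtype=bool)

print(keep_pair(static_flow, no_occlusion))      # False: too little motion
print(keep_pair(moderate_flow, no_occlusion))    # True
print(keep_pair(drastic_flow, no_occlusion))     # False: too much motion
print(keep_pair(moderate_flow, full_occlusion))  # False: background changed too much
```

Using the mean flow magnitude as the motion score is one simple choice; a percentile or a foreground-only statistic would fit the same interface.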
MLLM-Driven Instruction Generation:
- Function: Automatically generate high-quality editing instructions for frame pairs.
- Mechanism: Feed the source and target frames into GPT-4o or Pixtral-12B and prompt the model to analyze the differences between them (subject changes, relative positions, camera angles, background changes). The model then writes an instruction that starts with an action verb and describes the result in absolute rather than relative terms (e.g., "move the bee to the center of the flower"). The MLLM is allowed to reject frame pairs that are difficult to describe accurately.
- Design Motivation: Compared to methods that generate instructions using text-only LLMs (such as InstructPix2Pix), multimodal LLMs can directly observe image differences, resulting in more diverse and accurate instructions that cover non-rigid transformations and viewpoint changes.
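The prompting constraints listed above could be encoded roughly as below. This is only a sketch: the prompt wording, the `REJECT` sentinel, and the helper names are hypothetical and not taken from the paper, and the actual MLLM call (sending both frames plus the prompt) is deliberately left out.

```python
from typing import Optional

REJECT_TOKEN = "REJECT"  # hypothetical sentinel the MLLM returns to reject a pair

# Aspects the summary says the MLLM is asked to compare.
ASPECTS = ("subject changes", "relative positions",
           "camera angles", "background changes")


def build_instruction_prompt() -> str:
    """Build a text prompt to accompany the source and target frames."""
    return (
        "You are given two frames from the same video: a source and a target.\n"
        f"Compare them with respect to: {', '.join(ASPECTS)}.\n"
        "Write one editing instruction that turns the source into the target.\n"
        "Start with an action verb and describe the result in absolute terms.\n"
        f"If the change cannot be described accurately, answer exactly {REJECT_TOKEN}."
    )


def parse_response(text: str) -> Optional[str]:
    """Return the instruction, or None if the model rejected the frame pair."""
    cleaned = text.strip()
    if not cleaned or cleaned.upper() == REJECT_TOKEN:
        return None
    return cleaned


print(build_instruction_prompt())
print(parse_response("Move the bee to the center of the flower."))
print(parse_response("REJECT"))
```

Rejected pairs (where `parse_response` returns `None`) would simply be dropped from the dataset.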
Spatial Conditioning Strategy:
- Function: Allow the model to flexibly alter structures while preserving the original image content during editing.
- Mechanism: Concatenate the reference image latent \(z^s\) and the noisy target latent \(z^e_t\) along the width dimension to form a double-width input \(z_t = \text{Concat}_{width}([z^s, z^e_t])\), which is fed into the U-Net for denoising. The loss is computed on the cropped right half: \(\mathcal{L}_{\text{Edit}} = \mathbb{E}[\|\epsilon - \text{Crop}(\epsilon_\theta(z_t, C, t))\|^2]\). Unlike traditional channel concatenation, spatial concatenation does not force spatial alignment between the reference and target images: because both halves share one feature map, the target half can attend to any region of the reference half through the U-Net's self-attention layers.
- Design Motivation: Channel concatenation forces spatial alignment, limiting the flexibility of non-rigid transformations. Spatial concatenation preserves the original architecture of the pre-trained T2I model, allowing seamless integration with additional control mechanisms like ControlNet.
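The width-wise concatenation and the cropped loss can be illustrated with array shapes alone. The sketch below uses NumPy instead of a real U-Net: the helper names, the `alpha_bar` value, and the stand-in "prediction" are assumptions made for the example, and the mean squared error used here equals the squared norm in \(\mathcal{L}_{\text{Edit}}\) up to a constant factor.

```python
import numpy as np


def spatial_concat(z_src: np.ndarray, z_noisy: np.ndarray) -> np.ndarray:
    """Concatenate two (C, H, W) latents along width: reference left, target right."""
    return np.concatenate([z_src, z_noisy], axis=-1)


def crop_right(x: np.ndarray) -> np.ndarray:
    """Keep the right half along width, i.e. the part aligned with the target."""
    half = x.shape[-1] // 2
    return x[..., half:]


def edit_loss(eps: np.ndarray, eps_pred_full: np.ndarray) -> float:
    """MSE between the true noise and the right half of the double-width prediction."""
    return float(np.mean((eps - crop_right(eps_pred_full)) ** 2))


rng = np.random.default_rng(0)
shape = (4, 64, 64)  # latent shape of SD 1.5 for a 512x512 image
z_src = rng.standard_normal(shape)   # reference latent z^s
z_tgt = rng.standard_normal(shape)   # clean target latent
eps = rng.standard_normal(shape)     # sampled noise
alpha_bar = 0.5                      # hypothetical noise-schedule value at step t
z_noisy = np.sqrt(alpha_bar) * z_tgt + np.sqrt(1.0 - alpha_bar) * eps

z_t = spatial_concat(z_src, z_noisy)
print(z_t.shape)  # (4, 64, 128)

# Stand-in for the U-Net output: the left half is ignored by the loss,
# the right half happens to predict the noise perfectly.
eps_pred = spatial_concat(np.zeros(shape), eps)
print(edit_loss(eps, eps_pred))  # 0.0
```

For contrast, channel concatenation would use `axis=0` and produce an `(8, 64, 64)` input, which requires changing the first convolution of the U-Net and ties each target location to the same location in the reference.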

Loss & Training

The standard diffusion denoising loss is used, calculated as the MSE loss solely on the right half of the output (corresponding to the target image). Training is based on SD 1.5 with a resolution of 512×512, learning rate of \(1 \times 10^{-4}\) using the Adam optimizer, trained for 100K steps on 8 A100 GPUs, with a total batch size of 256. Inference uses DDIM with 50 sampling steps. It supports mask-guided local editing (via latent blending) and ControlNet integration.
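Mask-guided local editing via latent blending can be sketched as below. The source only names the technique, so this follows the commonly used blending rule (keep the edited latent inside the mask and the source latent, noised to the same timestep, outside it); whether the authors apply it at every denoising step, and the function name itself, are assumptions.

```python
import numpy as np


def blend_latents(z_edit_t: np.ndarray, z_src_t: np.ndarray,
                  mask: np.ndarray) -> np.ndarray:
    """Blend two (C, H, W) latents with a (1, H, W) mask (1 = editable region)."""
    return mask * z_edit_t + (1.0 - mask) * z_src_t


# Toy latents: the "edited" latent is all ones, the source latent all zeros.
z_edit_t = np.ones((4, 8, 8))
z_src_t = np.zeros((4, 8, 8))

# Allow edits only in the left half of the image.
mask = np.zeros((1, 8, 8))
mask[..., :4] = 1.0

blended = blend_latents(z_edit_t, z_src_t, mask)
print(blended[0, 0])  # [1. 1. 1. 1. 0. 0. 0. 0.]
```

In a sampling loop this blend would typically be applied after each denoising step, so that the region outside the mask stays tied to the source image.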

Key Experimental Results

Main Results

Method	CLIP-D ↑	CLIP-Inst ↑	CLIP-I ↑
NullTextInversion	0.0660	0.7648	0.9063
MasaCtrl	0.0436	0.8527	0.9160
InstructPix2Pix	0.0887	0.8569	0.9380
MagicBrush	0.0972	0.8648	0.9318
UltraEdit	0.0824	0.8571	0.9184
InstructMove (Ours)	0.1361	0.8724	0.9275

Method	Human Preference Rate
Imagic	5.0%
InstructPix2Pix	3.25%
MagicBrush	4.13%
InstructMove (Ours)	87.62%

Ablation Study

Configuration	CLIP-D ↑	CLIP-Inst ↑	CLIP-I ↑	Description
SC + IP2P data	0.1277	0.8414	0.9094	Trained with synthetic dataset
CC + Our data	0.0853	0.8679	0.8552	Channel conditioning
SC + Our data	0.1361	0.8724	0.9275	Full model

Key Findings

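The real-video dataset improves every metric: with spatial conditioning fixed, replacing IP2P synthetic data with real video frame pairs raises CLIP-D from 0.1277 to 0.1361 (about 6.6% relative), CLIP-Inst from 0.8414 to 0.8724 (about 3.7% relative), and CLIP-I from 0.9094 to 0.9275 (about 2.0% relative).
Spatial conditioning (SC) vs. channel conditioning (CC): with the dataset fixed, SC beats CC on all three metrics (CLIP-D 0.1361 vs. 0.0853, CLIP-Inst 0.8724 vs. 0.8679, CLIP-I 0.9275 vs. 0.8552). The CLIP-D and CLIP-I gaps are the largest in the ablation, and CC's CLIP-I of only 0.8552 suggests that channel concatenation severely damages content preservation.
In human evaluation, the model is preferred in 87.62% of cases, while no baseline exceeds 5.0%, indicating a large advantage on non-rigid edits.
Existing methods can obtain high CLIP-I scores on non-rigid edits because they barely modify the original image: InstructPix2Pix (0.9380) and MagicBrush (0.9318) exceed InstructMove (0.9275) on CLIP-I while trailing on CLIP-D, so a high CLIP-I can reflect a failed edit rather than better preservation.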
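The size of each ablation effect can be recomputed directly from the table. The short script below reports both absolute and relative differences against the full model, since the two are easy to conflate; the scores are copied from the ablation table, while the helper name `relative_gain` is introduced only for this example.

```python
# Scores copied from the ablation table.
ablation = {
    "SC + IP2P data": {"CLIP-D": 0.1277, "CLIP-Inst": 0.8414, "CLIP-I": 0.9094},
    "CC + Our data": {"CLIP-D": 0.0853, "CLIP-Inst": 0.8679, "CLIP-I": 0.8552},
    "SC + Our data": {"CLIP-D": 0.1361, "CLIP-Inst": 0.8724, "CLIP-I": 0.9275},
}


def relative_gain(full: float, baseline: float) -> float:
    """Relative improvement of `full` over `baseline`, in percent."""
    return 100.0 * (full - baseline) / baseline


full = ablation["SC + Our data"]
for name in ("SC + IP2P data", "CC + Our data"):
    for metric, baseline in ablation[name].items():
        absolute = full[metric] - baseline
        relative = relative_gain(full[metric], baseline)
        print(f"full vs. {name:15s} {metric:9s} "
              f"abs +{absolute:.4f}  rel +{relative:.1f}%")
```

The first block of output isolates the effect of the training data, the second the effect of the conditioning strategy.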

Highlights & Insights

Using video as an editing data source is the key idea: video frames naturally provide identity-consistent source-target pairs and cover rich, natural dynamic transformations, while avoiding the artifacts of synthetically generated targets. The idea may transfer to other tasks that require paired data.
Replacing channel concatenation with spatial concatenation is a minimalist yet highly efficient design: it unlocks non-rigid editing capabilities without modifying any network architecture, while maintaining compatibility with existing tools like ControlNet.
The pipeline of using MLLM as an automatic annotator exhibits excellent generalizability and can be extended to other tasks requiring image-pair descriptions.

Limitations & Future Work

MLLMs sometimes generate inaccurate instructions or miss subtle changes between frames, which may cause the editing model to introduce unintended viewpoint shifts.
The data covers only transformations that occur in real videos, so the model alone cannot handle edits such as style transfer or object replacement (the authors mitigate this by mixing in other datasets).
The text understanding capability of the pre-trained T2I model limits the execution of complex editing instructions.
Future directions: extending the method to video editing (leveraging continuous frames) or to 3D-consistent editing.

Comparison with Related Work

vs. InstructPix2Pix: IP2P uses GPT-3 to generate text instructions + P2P to generate synthetic target images, where data quality is limited by the synthesis method. This paper uses real video frames + MLLMs, vastly improving data quality.
vs. UltraEdit/EmuEdit: These methods also attempt to improve the quality of editing datasets but still rely on synthetic target images. This work presents the first large-scale editing dataset using real images.
vs. Zero-shot Editing Methods (NullText, MasaCtrl): These methods require no training data but are slow and unstable. This work achieves better performance and efficiency through training.

Rating

Novelty: ⭐⭐⭐⭐⭐ The dataset construction pipeline using video frame pairs + MLLM instruction generation is highly innovative, and the spatial conditioning strategy is simple yet effective.
Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative metrics, human evaluation, and ablation studies are comprehensive, though the self-constructed test set of 50 images is relatively small.
Writing Quality: ⭐⭐⭐⭐⭐ Clear logic, high-quality figures and tables, and well-articulated motivation.
Value: ⭐⭐⭐⭐⭐ Opens up a new paradigm for using video data to train image editing models, and both the dataset and the method will have a widespread impact.