ObjectMover: Generative Object Movement with Video Prior

Conference: CVPR 2025
arXiv: 2503.08037
Code: None (Todo)
Area: Diffusion Models / Image Editing
Keywords: Object Movement, Video Generation Prior, Sequence-to-Sequence, Game Engine Synthetic Data, Multi-Task Learning

TL;DR

ObjectMover models the task of object movement in images as a sequence-to-sequence problem. By fine-tuning a video generation model, it leverages cross-frame object consistency priors. Combined with high-quality synthetic data pairs generated by a game engine and a multi-task learning strategy, it achieves realistic relighting, occlusion completion, and synchronized shadow/reflection editing in complex real-world scenes.

Background & Motivation

Background: "Moving an object to another location" in image editing seems straightforward but is in fact one of the most challenging editing tasks. Existing methods are mostly based on image diffusion models (such as Stable Diffusion) for inpainting or editing, but these methods often break object movement down into a two-step "removal + pasting" operation.

Limitations of Prior Work: Simple "copy-paste" approaches cannot handle the cascade of effects that follow an object's relocation: the object must be relit for its new position, its pose must adjust to the new perspective, the area it previously occluded must be plausibly filled in, and shadows and reflections must be updated to match the new location, all while preserving the identity and appearance of the object. Existing image editing models lack an understanding of these physical consistency constraints.

Key Challenge: Image editing models only see single-frame information and lack prior knowledge of "how the same object should change under different lighting or viewpoints." Video models, however, have already learned exactly this knowledge from massive amounts of real-world video data: the way lighting, shadows, and reflections naturally vary as an object moves between frames is implicitly encoded by the video generation model.

Goal: To build a generative model capable of reliably performing object movement in complex scenes while simultaneously handling sub-problems such as relighting, perspective adjustment, occlusion filling, and effect synchronization.

Key Insight: Object movement can inherently be viewed as a "two-frame video"—where the first frame is the original image and the second frame is the image after the object has moved. Video generation models naturally learn cross-frame object consistency and dynamic effect evolution, which is exactly the prior required for object movement.

Core Idea: Reframe object movement as a sequence-to-sequence prediction problem, fine-tune a pre-trained image-to-video model to execute this task, and use a game engine to synthesize high-quality training data.

Method

Overall Architecture

The pipeline of ObjectMover contains three core components: (1) modeling object movement as a sequence-to-sequence problem, where the input sequence is the original image (containing object masks and target location information) and the output sequence is the resulting image after the object moves; (2) fine-tuning based on a pre-trained image-to-video diffusion model; and (3) constructing synthetic training data using a game engine, while incorporating real video data through multi-task learning to enhance generalization.
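
The input/output pairing in component (1) can be pictured as a two-frame sequence. The sketch below is only an illustration, not the authors' code: the helper names `box_mask` and `build_sequence`, the dictionary layout of a frame, and the use of rectangular masks are all assumptions. The source states only that the input carries the original image, an object mask, and target location information, and that the output is the image after the move.

```python
from typing import Dict, List

Grid = List[List[float]]  # a single-channel H x W map stored as nested lists


def box_mask(height: int, width: int, top: int, left: int, h: int, w: int) -> Grid:
    """Return an H x W binary mask that is 1.0 inside the given box (hypothetical helper)."""
    return [
        [1.0 if top <= r < top + h and left <= c < left + w else 0.0 for c in range(width)]
        for r in range(height)
    ]


def build_sequence(image: Grid, object_mask: Grid, target_mask: Grid, result: Grid) -> List[Dict[str, Grid]]:
    """Pack one training pair as a two-frame sequence (assumed layout).

    Frame 0 is the conditioning frame: original image plus both control masks.
    Frame 1 is the frame the model is trained to predict.
    """
    cond_frame = {"image": image, "object_mask": object_mask, "target_mask": target_mask}
    target_frame = {"image": result}
    return [cond_frame, target_frame]


# Toy 4x6 grayscale example: a bright 2x2 object moves from the left edge to the right edge.
H, W = 4, 6
src = box_mask(H, W, top=1, left=0, h=2, w=2)  # where the object is now
dst = box_mask(H, W, top=1, left=4, h=2, w=2)  # where it should end up
sequence = build_sequence(image=src, object_mask=src, target_mask=dst, result=dst)
```

In this toy case the image and the object mask coincide because the scene is just a bright square on a black background; in a real pair the second frame would also differ in lighting, shadows, and the filled-in region behind the object.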

Key Designs

  1. Sequence-to-Sequence Modeling:

    • Function: Transform object movement into a format that video generation models can process.
    • Mechanism: Treat the original image and the target result as a two-frame "video." The input includes the original image, an object mask (marking which object to move), and a target location mask (marking where to move it). The model is trained to predict the second frame (the moved result) given the first frame (original image + control signals). In this way, relighting in the new position, shadow changes, and reflection updates are naturally processed by the inter-frame consistency prior of the video model.
    • Design Motivation: Video generation models are trained on large amounts of real-world multi-frame footage and have implicitly learned how the appearance of the same object varies across positions in space and time. By aligning the task format with video generation, this prior knowledge can be directly transferred.
  2. Game Engine Synthetic Data Pipeline:

    • Function: Generate high-quality object movement data pairs to fine-tune the video model.
    • Mechanism: Since large-scale datasets of "before-and-after object movement" image pairs do not exist in the real world, the authors utilize a modern game engine (such as Unreal Engine) to synthesize data. Within the game engine, object positions can be precisely controlled, rendering pixel-accurate before-and-after comparisons. Simultaneously, the game engine correctly simulates physical effects such as lighting changes, shadow movements, and reflection updates. The synthetic scenes cover various indoor and outdoor environments, containing diverse materials and lighting conditions.
    • Design Motivation: The game engine provides physically correct rendering, enabling the generation of an arbitrary number of training pairs. Compared to extracting pairs from videos, synthetic data annotations are perfectly accurate, require no manual labeling, and allow control over difficulty and diversity.
  3. Multi-Task Learning Strategy:

    • Function: Introduce real-world video data on top of synthetic data to bridge the domain gap.
    • Mechanism: In addition to the core "object movement" task, related auxiliary tasks—object removal and object insertion—are trained simultaneously. These auxiliary tasks can be trained on real-world video data (extracting clips where objects appear or disappear), thereby utilizing both synthetic and real data under a unified framework. All tasks share the same network and are distinguished by different control signals.
    • Design Motivation: Models trained purely on synthetic data often generalize poorly to real scenes. Multi-task learning exposes the model to real-world textures, lighting, and scene distributions, significantly improving performance in real scenarios. Meanwhile, the auxiliary tasks themselves are valuable capabilities (ObjectMover simultaneously supports object movement, removal, and insertion).
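
The routing described in the multi-task strategy (movement pairs from synthetic data, removal/insertion pairs from real video, one shared network told apart by control signals) can be sketched as a small task sampler. Everything concrete here is an assumption for illustration: the task table `TASKS`, the mixing ratio in `TASK_WEIGHTS`, and the specific mask assigned to each task are hypothetical, since the source does not give the sampling ratio or the exact control-signal encoding.

```python
import random
from collections import Counter

# Hypothetical task table: each task names its data source and the control
# signals the shared network would receive. Names are illustrative only.
TASKS = {
    "move":   {"source": "synthetic",  "controls": ("object_mask", "target_mask")},
    "remove": {"source": "real_video", "controls": ("object_mask",)},
    "insert": {"source": "real_video", "controls": ("target_mask",)},
}
TASK_WEIGHTS = {"move": 0.5, "remove": 0.25, "insert": 0.25}  # assumed mixing ratio


def sample_task(rng: random.Random) -> dict:
    """Draw one task according to TASK_WEIGHTS and return its routing info."""
    names = list(TASK_WEIGHTS)
    name = rng.choices(names, weights=[TASK_WEIGHTS[n] for n in names], k=1)[0]
    return {"task": name, **TASKS[name]}


rng = random.Random(0)  # fixed seed so the draw is reproducible
batch = [sample_task(rng) for _ in range(1000)]
counts = Counter(item["task"] for item in batch)
```

Each sampled item tells a data loader which source to read from and which masks to attach, so a single training loop can interleave synthetic and real examples.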

Loss & Training

The model is fine-tuned based on a pre-trained image-to-video diffusion model using standard diffusion training loss. The training data includes object movement pairs synthesized by the game engine and object removal/insertion pairs extracted from real-world videos. Task control signals (e.g., mask types) are used to allow the model to distinguish between different tasks.
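
As a rough illustration of what a "standard diffusion training loss" can look like, the sketch below uses the DDPM-style noise-prediction objective: noise the target frame, then penalize the squared error between predicted and true noise. This is an assumption; the source does not say which parameterization (noise prediction, velocity prediction, or another variant) the underlying video model uses. The function names and the toy values are hypothetical, and applying noise only to the target frame is likewise an assumed reading of "predict the second frame given the first."

```python
import math
import random
from typing import List


def add_noise(x0: List[float], eps: List[float], alpha_bar: float) -> List[float]:
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def denoising_loss(eps_pred: List[float], eps: List[float]) -> float:
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


rng = random.Random(0)
target_frame = [0.2, -0.5, 0.9, 0.0]               # flattened toy "second frame"
eps = [rng.gauss(0.0, 1.0) for _ in target_frame]  # noise for the target frame only
noisy_target = add_noise(target_frame, eps, alpha_bar=0.5)

# Stand-ins for the network: a perfect predictor recovers eps exactly,
# while a predictor that always outputs zeros pays the full noise energy.
loss_perfect = denoising_loss(eps, eps)
loss_zero = denoising_loss([0.0] * len(eps), eps)
```

In actual training the predictor would be the fine-tuned video model, conditioned on the clean first frame and the task's control masks.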

Key Experimental Results

Main Results

| Method | Movement Quality | Lighting Consistency | Occlusion Completion | Identity Preservation |
|---|---|---|---|---|
| Copy-Paste | Low | No processing | None | Perfect |
| Paint-by-Example | Medium | Partial | Yes | Poor |
| AnyDoor | Medium | Partial | Yes | Medium |
| ObjectMover | High | Good | Good | Good |

Ablation Study

| Configuration | Generation Quality | Description |
|---|---|---|
| Full model (Synthetic + Real + Multi-task) | Optimal | Complete solution |
| Synthetic data only | Obvious drop | Poor generalization to real-world scenes |
| Image model only (No video prior) | Significant drop | Lacks cross-frame consistency knowledge |
| No multi-task learning | Moderate drop | Real video data cannot be utilized |

Key Findings

  • Video generation prior is key: Removing the video prior (replacing it with an image diffusion model) leads to a significant degradation in performance under extreme lighting changes and synchronized shadow/reflection adjustments.
  • The combination of synthetic data and real-world video is far superior to a single data source: synthetic data provides precise supervision signals, while real-world videos supply domain knowledge.
  • Multi-task learning simultaneously improves the performance of all three tasks, with object removal and insertion tasks providing complementary gradient signals for the primary task.
  • The model performs exceptionally well in extreme lighting scenes (e.g., strong indoor lighting/shadows) and on reflective surfaces (e.g., water surfaces, mirrors).

Highlights & Insights

  • Video Models as Physical Priors: Utilizing the inter-frame consistency of video generation models to encode the variations in physical effects (lighting, shadows, reflections) is a highly ingenious insight. Without explicitly modeling physical processes, the video model has implicitly learned them.
  • Collaborative Training of Synthetic + Real: Unifying synthetic and real data through a multi-task learning framework addresses the classic problem of the synthetic data domain gap. This paradigm can be transferred to many image editing tasks lacking labeled data.
  • Three Capabilities in One Model: Object movement, removal, and insertion share the same network, demonstrating the potential of multi-task learning in image editing.

Limitations & Future Work

  • The code and models have not yet been released, limiting reproducibility.
  • High inference cost due to reliance on video diffusion models.
  • Performance on extremely small objects or highly complex occlusion relationships remains to be verified.
  • Scene diversity of synthetic data may be constrained by the asset library of the game engine.
  • Future work could consider integrating 3D perception to handle perspective changes more accurately, or combining with LLMs to achieve language-guided object editing.

Comparison with Related Work

  • vs AnyDoor/Paint-by-Example: These methods follow the image editing paradigm, pasting objects into new positions, and have no built-in understanding of how lighting and effects should change. ObjectMover addresses this gap through the video prior.
  • vs OmniEraser: OmniEraser focuses on object removal. ObjectMover incorporates removal as an auxiliary task through multi-task learning while also acquiring movement capability.
  • vs Traditional Image Harmonization: Traditional methods harmonize lighting as a post-processing step after pasting, which amounts to patching the result. ObjectMover instead handles lighting within a single end-to-end generation step.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of modeling object movement as sequence-to-sequence and leveraging video priors is highly creative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ The experiments cover various scenes and baseline methods, and the qualitative results are impressive.
  • Writing Quality: ⭐⭐⭐⭐ The problem definition is clear and the methodology description is smooth.
  • Value: ⭐⭐⭐⭐ It provides a new paradigm for object editing tasks, and the idea of exploiting video priors is likely to carry over to other editing problems.