Visual Prompting for One-Shot Controllable Video Editing Without Inversion

Conference: CVPR 2025
arXiv: 2504.14335
Code: https://vp4video-editing.github.io/
Area: Image/Video Generation / Video Editing
Keywords: Video Editing, Visual Prompting, DDIM Inversion, Consistency Models, Stein Variational Gradient Descent

TL;DR

This work tackles One-Shot Controllable Video Editing (OCVE) from the perspective of visual prompting. An image inpainting diffusion model propagates the edit, and two sampling schemes, Content Consistency Sampling (CCS) and Temporal-Content Consistency Sampling (TCS), keep the result consistent with the source content and across frames, so the method achieves high-quality controllable video editing without DDIM inversion.

Background & Motivation

Background: One-Shot Controllable Video Editing (OCVE) is an efficient video editing paradigm where users only need to edit the first frame of a video using any image editing tool, and the system automatically propagates the edits to subsequent frames. Existing OCVE methods (such as AnyV2V, Videoshop) generally rely on DDIM inversion to map the source video into noise latent variables, followed by injecting editing guidance to generate the edited video.

Limitations of Prior Work: (1) DDIM inversion introduces an approximation error at each step; these errors accumulate and degrade reconstruction quality, which weakens the content consistency of the edited video. Specifically, each inversion step needs \(\epsilon_\theta(z_t, t)\) at the latent \(z_t\) that is not yet known, so it substitutes \(\epsilon_\theta(z_{t-1}, t)\), which is inaccurate. (2) To maintain temporal consistency, some approaches bring in video diffusion models as temporal priors; however, open-source video datasets are of limited quality, and video diffusion models are computationally very expensive.
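
To make the source of this error concrete, the deterministic DDIM step can be solved for \(z_t\). This is a sketch in standard DDIM notation, assuming \(\alpha_t\) denotes the cumulative noise-schedule product (as in the other formulas of this summary); it is not copied from the paper.

\[
z_{t-1} = \sqrt{\alpha_{t-1}}\,\frac{z_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}}\,\epsilon_\theta(z_t, t)
\]

\[
\Longrightarrow\quad z_t = \sqrt{\frac{\alpha_t}{\alpha_{t-1}}}\,z_{t-1} + \left(\sqrt{1-\alpha_t} - \sqrt{\frac{\alpha_t\,(1-\alpha_{t-1})}{\alpha_{t-1}}}\right)\epsilon_\theta(z_t, t)
\]

The right-hand side still depends on the unknown \(z_t\) through \(\epsilon_\theta(z_t, t)\), so inversion evaluates the network at \(z_{t-1}\) instead, and the resulting mismatch is incurred again at every step.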

Key Challenge: DDIM inversion is both the foundation and the bottleneck of existing OCVE methods—it is the key means to reconstruct source video information, yet its cumulative error is the root cause of content inconsistency.

Goal: To completely bypass DDIM inversion, establishing a video editing framework that works without inverting source frames into noise while ensuring both content consistency and temporal consistency.

Key Insight: The authors found that OCVE and Visual Prompting essentially perform the same task—propagating a specific modification across images. In visual prompting, exemplar image pairs demonstrate the transformation rules, which are then applied to a query image; in OCVE, the edits on the first frame must be propagated to subsequent frames.

Core Idea: Redefine OCVE as a visual prompting task, leveraging the visual reasoning capabilities of a pre-trained image inpainting diffusion model to perform edit propagation, using multi-step consistency sampling from consistency models to ensure content consistency, and utilizing Stein Variational Gradient Descent (SVGD) to guarantee temporal consistency.

Method

Overall Architecture

The pipeline has three steps: (1) build a \(2\times 2\) grid that holds the first source frame and the edited first frame as the exemplar pair, the current source frame as the query, and a blank cell for the edited frame to be generated, then feed it to an image inpainting diffusion model; (2) generate edited frames that stay content-consistent with the source frames via CCS; (3) adjust all edited frames for temporal consistency via TCS. The input is the source video plus the edited first frame, and the output is the fully edited video.

Key Designs

Combining Visual Prompting with Inpainting Diffusion Models:
- Function: Convert the OCVE task into a visual prompting format compatible with inpainting diffusion models, bypassing DDIM inversion.
- Mechanism: Organize the input \(G(i)\) as a \(2\times 2\) grid: the top-left cell holds the first source frame \(I^s(1)\), the top-right cell holds the edited first frame \(I^e(1)\) (together they form the visual prompting exemplar pair), the bottom-left cell holds the \(i\)-th source frame (the query), and the bottom-right cell is left blank for the model to generate (the edit output). A mask \(M\) designates the bottom-right region as the generation target. The text prompt \(p\) is replaced by the scaled difference of CLIP embeddings, which encodes the edit direction: \(p = \lambda_1 \cdot \left(E_{CLIP}(I^e(1)) - E_{CLIP}(I^s(1))\right)\). Crucially, source frames are encoded directly into feature inputs by the encoder rather than being inverted to noise, so no inversion error is introduced.
- Design Motivation: Inpainting diffusion models naturally excel at filling missing regions while maintaining contextual consistency, which highly aligns with the OCVE requirement of "inferring edits from exemplars." Expressing the edit direction with a CLIP embedding difference captures editing intentions more precisely than text descriptions.
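
The grid and mask construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `build_prompt_grid` and `edit_direction`, the zero-filled blank cell, and the convention that mask value 1 marks the region to generate are all assumptions, and the embeddings passed to `edit_direction` stand in for whatever \(E_{CLIP}\) returns.

```python
import numpy as np


def build_prompt_grid(src_first, edit_first, src_query):
    """Assemble the 2x2 visual-prompt canvas and its inpainting mask.

    All inputs are H x W x C arrays of identical shape.
    Returns (grid, mask): grid is 2H x 2W x C; mask is 2H x 2W, where 1 marks
    the bottom-right cell to be generated (assumed convention).
    """
    if not (src_first.shape == edit_first.shape == src_query.shape):
        raise ValueError("all frames must share one shape")
    h, w, _ = src_query.shape
    blank = np.zeros_like(src_query)  # placeholder for the edited frame
    top = np.concatenate([src_first, edit_first], axis=1)   # exemplar pair
    bottom = np.concatenate([src_query, blank], axis=1)     # query + target
    grid = np.concatenate([top, bottom], axis=0)
    mask = np.zeros((2 * h, 2 * w), dtype=np.uint8)
    mask[h:, w:] = 1
    return grid, mask


def edit_direction(emb_edit, emb_src, lambda_1=0.7):
    """p = lambda_1 * (E(I^e(1)) - E(I^s(1))) for precomputed embeddings."""
    return lambda_1 * (emb_edit - emb_src)
```

Each frame occupies one quarter of the canvas, so a 512 x 512 canvas (a hypothetical size) would leave 256 x 256 pixels per cell.
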

Content Consistency Sampling (CCS):
- Function: Multi-step consistency sampling based on consistency models to ensure that the generated edited frames maintain content consistency with the source frames.
- Mechanism: First, modify the sampling equations of the inpainting diffusion model: setting \(\sigma_t = \sqrt{1-\alpha_{t-1}}\) eliminates the adjustment term and turns sampling into a non-Markovian process. Then replace the parameterized noise with a consistency noise \(\epsilon^c\), which gives the consistency model \(\hat{f}(z_t, t, \epsilon^c(t)) = (z_t - \sqrt{1-\alpha_t}\,\epsilon^c(t))/\sqrt{\alpha_t}\). CCS has two key operations: (a) enforcing source frame generation at the first timestep by computing the corresponding consistency noise \(\epsilon^c(t; z_0^s) = (\hat{z}_t - \sqrt{\alpha_t}\,z_0^s)/\sqrt{1-\alpha_t}\); (b) applying a noise calibration mechanism that moves the generation progressively from the source frame toward the target edited frame, using the denoising difference \(\Delta\epsilon_t = \epsilon_\theta(z_t(I^e), t) - \epsilon_\theta(z_t(I^s), t)\) as the edit direction signal.
- Design Motivation: The Markovian nature of standard diffusion sampling makes images generated at different timesteps independent, failing to guarantee content consistency. A core property of consistency models is that "all points on the same trajectory map to the same initial state," which is leveraged here to guarantee consistency between the generated edited frames and the source frames.
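
The two CCS formulas above are algebraic inverses of each other, which a short NumPy check makes visible: plugging \(\epsilon^c(t; z_0^s)\) into \(\hat{f}\) returns the source latent \(z_0^s\) for any \(z_t\). The sketch below is an illustration under stated assumptions, not the authors' implementation: the functions take \(\alpha_t\) directly instead of a timestep, \(\hat{z}_t\) and \(z_t\) are treated as the same latent, and `multistep_step` is the standard DDIM update with \(\sigma_t = \sqrt{1-\alpha_{t-1}}\), where the direction term vanishes.

```python
import numpy as np


def consistency_noise(z_t, z0_src, alpha_t):
    """eps^c(t; z_0^s) = (z_t - sqrt(alpha_t) * z_0^s) / sqrt(1 - alpha_t)."""
    return (z_t - np.sqrt(alpha_t) * z0_src) / np.sqrt(1.0 - alpha_t)


def f_hat(z_t, alpha_t, eps_c):
    """f_hat(z_t, t, eps^c) = (z_t - sqrt(1 - alpha_t) * eps^c) / sqrt(alpha_t)."""
    return (z_t - np.sqrt(1.0 - alpha_t) * eps_c) / np.sqrt(alpha_t)


def multistep_step(z0_pred, alpha_prev, noise):
    """DDIM update with sigma_t = sqrt(1 - alpha_{t-1}).

    The sqrt(1 - alpha_{t-1} - sigma_t^2) * eps_theta term is zero, leaving
    z_{t-1} = sqrt(alpha_{t-1}) * z0_pred + sqrt(1 - alpha_{t-1}) * noise.
    """
    return np.sqrt(alpha_prev) * z0_pred + np.sqrt(1.0 - alpha_prev) * noise


def demo(seed=0):
    """Return the largest reconstruction error over a few noise levels."""
    rng = np.random.default_rng(seed)
    z0_src = rng.normal(size=(4, 8, 8))   # stand-in for a source latent
    z_t = rng.normal(size=(4, 8, 8))      # arbitrary noisy latent
    errors = []
    for alpha_t in (0.1, 0.5, 0.9):
        eps_c = consistency_noise(z_t, z0_src, alpha_t)
        errors.append(np.abs(f_hat(z_t, alpha_t, eps_c) - z0_src).max())
    return max(errors)
```

Calling `demo()` returns a value at floating-point round-off level, which is the sense in which every point on the trajectory maps back to the same source frame.
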

Temporal-Content Consistency Sampling (TCS):
- Function: Treat all edited frames as empirical samples and optimize them via SVGD to approximate the temporal distribution of the source frames, ensuring temporal consistency across frames.
- Mechanism: The source frames are treated as \(N\) samples \(\{z(i)\}_{i=1}^N\) from a distribution, and the edited frames \(\{\hat{z}^{(0)}(i)\}_{i=1}^N\) generated by CCS are adjusted to approximate this distribution. Deterministic updates are performed using SVGD: \(\hat{z}_{\ell-1}^{(0)}(i) = \hat{z}_\ell^{(0)}(i) - \eta \cdot \hat{\phi}(\hat{z}_\ell^{(0)}(i))\), where \(\hat{\phi}\) contains two terms: a driving term based on the mean gradient of all samples (ensuring optimization stability) and a repulsive term based on an RBF kernel (preventing mode collapse).
- Design Motivation: CCS processes each frame independently and does not explicitly model temporal relations between frames. TCS uses deterministic SVGD updates, which are faster than video diffusion models and do not rely on video training data. Treating video frames as empirical distribution samples and optimizing them using SVGD is highly ingenious.
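
For readers unfamiliar with SVGD, the driving and repulsive terms can be sketched generically in NumPy. This is textbook SVGD on a toy target, not the paper's \(\hat{\phi}\): the Gaussian score function, the bandwidth, the step size, and the `+ step` sign convention are all assumptions made for illustration (the summary's update is written with a minus sign, which presumably folds the sign into \(\hat{\phi}\)).

```python
import numpy as np


def rbf_kernel(x, bandwidth):
    """Return k[j, i] = k(x_j, x_i) and grad_k[j, i] = d k(x_j, x_i) / d x_j."""
    diff = x[:, None, :] - x[None, :, :]            # diff[j, i] = x_j - x_i
    k = np.exp(-np.sum(diff ** 2, axis=-1) / bandwidth)
    grad_k = (-2.0 / bandwidth) * diff * k[..., None]
    return k, grad_k


def svgd_direction(x, score_fn, bandwidth=1.0):
    """Average of a kernel-weighted score term and a kernel-gradient term."""
    n = x.shape[0]
    k, grad_k = rbf_kernel(x, bandwidth)
    driving = k.T @ score_fn(x)       # sum_j k(x_j, x_i) * score(x_j)
    repulsive = grad_k.sum(axis=0)    # sum_j grad_{x_j} k(x_j, x_i)
    return (driving + repulsive) / n


def run_svgd(x0, score_fn, step=0.1, n_iters=500, bandwidth=1.0):
    """Deterministically move all particles jointly toward the target."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x + step * svgd_direction(x, score_fn, bandwidth)
    return x


def toy_example(seed=0):
    """Eight 2-D particles pulled toward a unit Gaussian centred at `target`."""
    rng = np.random.default_rng(seed)
    target = np.array([1.0, -2.0])
    x0 = 5.0 + 0.1 * rng.normal(size=(8, 2))        # tight cluster far away
    x = run_svgd(x0, lambda p: -(p - target))       # score of N(target, I)
    return x0, x, target
```

In this toy run the driving term carries the tightly clustered particles to the target, while the repulsive term spreads them out instead of letting them collapse onto one point. In TCS the particles would be the edited frame latents and the target would be derived from the source frames.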

Loss & Training

Both CCS and TCS are inference-time sampling strategies requiring no additional training. CCS uses 30 timesteps, and TCS uses 50 timesteps. Hyperparameters are set as \(\lambda_1 = 0.7\) (the scale on the CLIP edit direction), \(\lambda_2 = 1.2\), and \(\eta = 2.0\) (the SVGD step size). Stable Diffusion Inpainting 1.5 is adopted as the base model.

Key Experimental Results

Main Results

Quantitative comparison on the MagicBrush-derived dataset (10,388 editing tuples):

Method	CLIPtar↑	TIFA↑	CLIPsrc↑	Flow↓	FVD↓	CLIPTC↑	Time (s)↓
AnyV2V	87.1	67.0	91.3	24.6	17.1	93.9	149
Videoshop	88.8	64.4	91.0	19.0	14.8	95.2	32
Ours	90.1	69.1	93.2	21.9	15.2	97.1	19

Ablation Study

Configuration	CLIPtar↑	CLIPsrc↑	CLIPTC↑	Time (s)↓
w/o CCS	80.3	81.3	95.8	19
w/o TCS	89.8	92.8	89.8	18
Full model	90.1	93.2	97.1	19

Key Findings

Without CCS, source frame fidelity (CLIPsrc) drops sharply from 93.2 to 81.3, indicating that CCS is crucial for maintaining content consistency.
Without TCS, temporal consistency (CLIPTC) drops from 97.1 to 89.8 (a decrease of 7.3 points), indicating that TCS is what provides temporal smoothness.
Ours requires only 19 seconds to process a video clip, which is approximately \(8\times\) faster than AnyV2V (149s) and \(1.7\times\) faster than Videoshop (32s).
Replacing video diffusion models with a lightweight image diffusion model is the key source of the speed advantage.
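
The ratios quoted above follow directly from the table values. The snippet below recomputes them as a quick sanity check; the variable names are illustrative only.

```python
# Runtimes (seconds) and CLIPTC scores copied from the tables above.
runtimes_s = {"AnyV2V": 149, "Videoshop": 32, "Ours": 19}
clip_tc = {"full": 97.1, "without_tcs": 89.8}

# Speedup of "Ours" relative to each baseline.
speedups = {
    name: seconds / runtimes_s["Ours"]
    for name, seconds in runtimes_s.items()
    if name != "Ours"
}
tc_drop = round(clip_tc["full"] - clip_tc["without_tcs"], 1)

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x slower than Ours")
print(f"CLIPTC drop without TCS: {tc_drop} points")
```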

Highlights & Insights

Ingenious Perspective Shift: Redefining OCVE as a visual prompting task is the central innovation of this work. This shift removes the need for DDIM inversion and, with it, the cumulative inversion error.
Training-free Sampling Strategies: Both CCS and TCS are modifications to the sampling methods during inference, requiring no extra training. This means they can, in principle, be plugged into any compatible diffusion model.
SVGD for Temporal Consistency: The concept of treating video frames as empirical distribution samples and using SVGD to match distributions is highly novel. Compared to video diffusion models, the deterministic updates of SVGD are both efficient and independent of video training data.
CLIP Embedding Difference as Edit Direction: It captures editing intentions more accurately than text descriptions and is naturally compatible with diffusion models based on CLIP text encoders.

Limitations & Future Work

Based on SD Inpainting 1.5, the resolution is limited, and it hasn't been adapted to more recent diffusion model architectures.
The \(2\times 2\) grid layout leaves each frame only one quarter of the canvas area (half the width and half the height), potentially affecting detail fidelity.
The SVGD updates in TCS require processing all frames concurrently, which poses higher memory and computational demands for long videos.
The scope of edits is limited by the visual reasoning capabilities of the inpainting diffusion model; it may struggle with large-scale structural edits.
The potential combination with advanced video generation models (e.g., Sora) remains unexplored.

vs AnyV2V: AnyV2V uses DDIM inversion plus a video diffusion model, which is computationally expensive (149s) and affected by inversion errors. Ours bypasses inversion and employs an image diffusion model, running roughly \(8\times\) faster (19s) with better scores on every reported metric.
vs Videoshop: Videoshop also relies on DDIM inversion but optimizes feature injection; ours eliminates the need for inversion and outperforms it in edit fidelity (CLIPtar, TIFA) and source frame fidelity (CLIPsrc), although Videoshop retains lower Flow (19.0 vs 21.9) and FVD (14.8 vs 15.2).
vs Consistency Models: This work creatively adapts the multi-step consistency sampling properties of consistency models to an inpainting diffusion model. This eliminates the need to train a consistency model, requiring only modifications to the sampling equations.

Rating

Novelty: ⭐⭐⭐⭐⭐ The perspective of utilizing visual prompting to solve OCVE is highly novel, and both CCS and TCS designs are supported by solid theoretical foundations.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on large-scale datasets, with solid ablation designs, but lacking a detailed report on human evaluations.
Writing Quality: ⭐⭐⭐⭐ The motivation is clearly discussed, and the methods are fully derived, though the mathematical notation is relatively dense.
Value: ⭐⭐⭐⭐⭐ A new paradigm that bypasses DDIM inversion; the dual improvement in speed and quality makes it highly practical.