ReCapture: Generative Video Camera Controls for User-Provided Videos Using Masked Video Fine-Tuning

Conference: CVPR 2025
arXiv: 2411.05003
Code: https://generative-video-camera-controls.github.io
Area: 3D Vision / Video Generation
Keywords: Video Camera Control, Novel View Synthesis, Diffusion Model Fine-tuning, LoRA, Point Cloud Rendering

TL;DR

ReCapture enables camera trajectory control for user-provided videos through a two-stage approach. It first generates a rough anchor video with the new camera trajectory using depth-based point cloud rendering or a multi-view image diffusion model, and then repairs and completes it using masked video fine-tuning (spatial and temporal LoRAs). This approach maintains the original scene motion while enabling the video to be viewed from completely new perspectives.

Background & Motivation

Background: Video diffusion models can already control camera trajectories during the generation process (e.g., CameraCtrl, MotionCtrl). However, these methods are restricted to videos generated by the models themselves and cannot be applied to existing real-world videos provided by users.

Limitations of Prior Work: (1) 4D reconstruction methods (NeRF, 4D Gaussian Splatting) require multi-view inputs or distinct camera motion cues and cannot extrapolate beyond the original field of view; (2) Generative Camera Dolly requires paired 4D video training data (obtained via simulators), limiting its generalization ability to the training domain; (3) Monocular videos are inherently under-constrained—a single-view video cannot provide definitive information about what the scene looks like from other angles.

Key Challenge: Changing the camera trajectory of a user's video requires synthesizing content for unobserved viewpoints out of thin air. This is a severely under-constrained problem, yet users expect plausible and temporally consistent outputs.

Goal: Given a user-provided video and new camera trajectory parameters (translation, rotation, scaling, etc.), generate a novel-view video that preserves the original scene motion.

Key Insight: The problem is decomposed into two steps: first, a geometric method is used to obtain an "imperfect but directionally correct" anchor video; second, a video diffusion model prior is leveraged to repair and complete it.

Core Idea: Masked video fine-tuning. A temporal LoRA is trained on the known pixels of the anchor video with a masked loss to learn motion patterns, and a spatial LoRA is trained on the source video to learn the scene appearance. During inference, the diffusion model automatically completes the missing regions.

Method

Overall Architecture

A two-stage pipeline. Stage 1 (Anchor Video Generation): Independently estimate depth for each frame \(\rightarrow\) render point clouds to the new camera pose, or use a multi-view diffusion model (CAT3D) to generate novel views frame-by-frame. This yields a rough anchor video with artifacts and missing regions, along with its valid pixel mask. Stage 2 (Masked Video Fine-Tuning): Train two LoRAs on SVD (Stable Video Diffusion)—a temporal LoRA to learn the motion patterns of the anchor video, and a spatial LoRA to learn the appearance of the source video. During inference, the diffusion model generates a clean, temporally consistent final output with missing regions plausibly completed.

Key Designs

Point Cloud Sequence Rendering:
- Function: To generate anchor videos for simple camera motions (translation, tilt, zoom).
- Mechanism: For each frame \(\mathbf{I}_i\), estimate depth \(\mathbf{D}_i\) using a monocular depth estimator (ZoeDepth), combine the two into an RGBD image, and lift it into a 3D point cloud \(\mathcal{P}_i = \phi([\mathbf{I}_i, \mathbf{D}_i], \mathbf{K})\), where \(\mathbf{K}\) is the camera intrinsic matrix. Based on the user-specified camera extrinsic matrices \(\{\mathbf{P}_1, ..., \mathbf{P}_{N-1}\}\), project each point cloud to its new view: \(\mathbf{I}_i^a = \psi(\mathcal{P}_i, \mathbf{K}, \mathbf{P}_i)\). Concurrently, generate a binary valid-pixel mask \(\mathbf{M}^a\) that is 1 where a rendered point lands and 0 in the empty regions exposed by the viewpoint change.
- Design Motivation: Point cloud rendering is geometrically accurate under small-angle motions, making it a reliable anchor signal. However, processing frame-by-frame independently leads to temporal inconsistency and black empty regions, which require repairing in the subsequent stage.
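
The lift-and-reproject step above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `unproject` and `render`, the convention that `P` maps source-camera coordinates to target-camera coordinates, and the nearest-point z-buffer are all assumptions made for illustration, and a constant depth map stands in for a ZoeDepth prediction.

```python
import numpy as np

def unproject(rgb, depth, K):
    """Lift an RGBD frame to a camera-space point cloud (the role of phi)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T       # back-projected pixel rays with z = 1
    return rays * depth.reshape(-1, 1), rgb.reshape(-1, 3)

def render(points, colors, K, P, h, w):
    """Project points into the target camera (the role of psi).

    P is assumed to be a 4x4 matrix mapping source-camera coordinates to
    target-camera coordinates. Returns the rendered image and a binary mask
    that is 1 wherever at least one point landed.
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (homo @ P.T)[:, :3]
    keep = cam[:, 2] > 1e-6               # drop points behind the camera
    cam, colors = cam[keep], colors[keep]
    proj = cam @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, colors = u[inside], v[inside], cam[inside, 2], colors[inside]
    # Z-buffer: sort near-to-far, then keep the first (nearest) point per pixel.
    order = np.argsort(z, kind="stable")
    flat = (v * w + u)[order]
    pixels, first = np.unique(flat, return_index=True)
    image = np.zeros((h * w, 3), dtype=colors.dtype)
    mask = np.zeros(h * w, dtype=np.uint8)
    image[pixels] = colors[order][first]
    mask[pixels] = 1
    return image.reshape(h, w, 3), mask.reshape(h, w)

# Toy example: a fronto-parallel plane at depth 2 and a sideways camera move.
K = np.array([[8.0, 0.0, 3.5], [0.0, 8.0, 3.5], [0.0, 0.0, 1.0]])
rgb = np.random.default_rng(0).random((8, 8, 3))
depth = np.full((8, 8), 2.0)              # stand-in for an estimated depth map
P = np.eye(4)
P[0, 3] = 0.5                             # points shift right in the target frame
points, colors = unproject(rgb, depth, K)
anchor, mask = render(points, colors, K, P, 8, 8)
print(mask.sum(), "of", mask.size, "pixels are valid")
```

In the toy example the content shifts sideways and a band of pixels along one edge receives no points; those zeros in `mask` are the kind of hole that Stage 2 is meant to fill.
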
Multi-View Image Diffusion:
- Function: To generate anchor videos for large-angle rotations (e.g., orbit shots).
- Mechanism: Employ the CAT3D multi-view image diffusion model to sample novel-view images from \(p(\mathbf{I}_i^a \mid \mathbf{I}_i, \mathbf{P}_{cond}, \mathbf{P}_i)\) independently for each frame, conditioned on the source frame \(\mathbf{I}_i\), the conditioning-view pose \(\mathbf{P}_{cond}\), and the target camera pose \(\mathbf{P}_i\). Camera poses are encoded as a raymap of relative poses, which avoids the difficulties of absolute pose estimation.
- Design Motivation: Point cloud rendering fails at large angles (exhibiting too many occluded regions and severe geometric distortions). Multi-view diffusion models can fill in these regions plausibly, but frame-by-frame independent generation lacks temporal consistency. These two methods are complementary.
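
To make the raymap idea concrete, the sketch below builds a per-pixel map of ray origins and directions for a target camera, expressed relative to the conditioning camera. The summary does not specify CAT3D's exact encoding, so the 6-channel layout, the camera-to-world pose convention, and the function name `relative_raymap` are assumptions for illustration only.

```python
import numpy as np

def relative_raymap(K, c2w_cond, c2w_target, h, w):
    """Per-pixel ray origins and unit directions of the target camera,
    expressed in the conditioning camera's frame. Returns shape (h, w, 6)."""
    rel = np.linalg.inv(c2w_cond) @ c2w_target          # target-to-cond transform
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u + 0.5, v + 0.5, np.ones((h, w))], axis=-1)  # pixel centers
    dirs = pix @ np.linalg.inv(K).T @ rel[:3, :3].T     # rotate rays into cond frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origins = np.broadcast_to(rel[:3, 3], (h, w, 3))    # same origin for every pixel
    return np.concatenate([origins, dirs], axis=-1)

# Toy example: target camera moved one unit along x relative to the source view.
K = np.array([[5.0, 0.0, 2.5], [0.0, 5.0, 2.5], [0.0, 0.0, 1.0]])
c2w_cond = np.eye(4)
c2w_target = np.eye(4)
c2w_target[0, 3] = 1.0
raymap = relative_raymap(K, c2w_cond, c2w_target, 5, 5)
print(raymap.shape)
```

Because only the transform between the two cameras enters the computation, applying the same rigid motion to both poses leaves the raymap unchanged, which is the sense in which a relative representation sidesteps absolute pose estimation.
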
Masked Video Fine-Tuning:
- Function: To repair artifacts in the anchor video, complete missing regions, and enhance temporal consistency.
- Mechanism: Train two distinct LoRAs on SVD:
  - Temporal LoRA: Injected into temporal transformer layers, trained on the anchor video using a masked diffusion loss: \(\mathcal{L}_{temp} = \mathbb{E}_{\epsilon,t}\left[\|\mathbf{M}^a \odot (\epsilon - \epsilon_\theta(\mathbf{V}_t^a, t, y))\|\right]\), where \(\odot\) denotes the element-wise product. The mask zeroes out the error at invalid pixels, so the model learns motion patterns only from meaningful pixels.
  - Spatial LoRA: Injected into spatial self-attention layers, trained on random frames of the source video (with temporal layers disabled): \(\mathcal{L}_{spatial} = \mathbb{E}_{\epsilon,t,i}\left[\|\epsilon - \epsilon_\theta(\mathbf{I}_{i,t}, t, y)\|\right]\), where \(\mathbf{I}_{i,t}\) is the noised version of source frame \(i\) at diffusion step \(t\). This allows the model to learn the appearance and context of the source video.
  - During inference, both LoRAs are activated simultaneously, and the diffusion model automatically uses its prior to complete the invalid regions.
- Design Motivation: (a) The low-rank nature of LoRA limits overfitting to the artifacts of the anchor video; (b) freezing the spatial layers while training the temporal LoRA keeps it focused on motion, while the spatial LoRA handles appearance, which decouples the two responsibilities; (c) SDEdit post-processing is applied afterwards to further reduce blurriness.
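
The effect of the mask in the temporal loss can be checked with a small NumPy sketch. It uses a squared error (the usual denoising objective) and normalizes by the number of valid entries; both choices, along with the function names, are assumptions for illustration rather than details confirmed by the source.

```python
import numpy as np

def masked_temporal_loss(eps, eps_pred, mask):
    """Squared noise-prediction error averaged over valid pixels only.

    eps, eps_pred: (T, H, W, C) arrays; mask: (T, H, W) with 1 = valid, 0 = hole.
    """
    sq_err = (eps - eps_pred) ** 2 * mask[..., None]   # zero the error inside holes
    n_valid = mask.sum() * eps.shape[-1]
    return sq_err.sum() / max(n_valid, 1)

def spatial_loss(eps, eps_pred):
    """Unmasked squared error on a single source frame of shape (H, W, C)."""
    return np.mean((eps - eps_pred) ** 2)

# Toy check: corrupting predictions inside the holes leaves the loss unchanged.
rng = np.random.default_rng(0)
eps = rng.normal(size=(4, 8, 8, 3))
eps_pred = rng.normal(size=(4, 8, 8, 3))
mask = np.ones((4, 8, 8))
mask[:, :, :3] = 0                                     # left columns are holes
corrupted = eps_pred.copy()
corrupted[:, :, :3] += 100.0
print(masked_temporal_loss(eps, eps_pred, mask))
print(masked_temporal_loss(eps, corrupted, mask))
```

Both printed values are the same, so no gradient would flow from the hole regions: the temporal LoRA is never asked to reproduce the black pixels of the anchor video.
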

Loss & Training

The final loss is \(\mathcal{L} = \mathcal{L}_{temp} + \mathcal{L}_{spatial}\).
Features during temporal LoRA training also pass through the spatial LoRA (without updating its parameters) to ensure compatibility.
LoRA rank = 16, learning rate = \(5 \times 10^{-4}\), 400 fine-tuning steps in total, taking only 5 minutes on a single A100 GPU.
Post-processing with SDEdit using only the spatial LoRA is performed after inference to remove blurriness.
Video diffusion model: SVD (I2V model), using the first frame of the anchor video as the image prompt.
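
For readers unfamiliar with LoRA, the following NumPy sketch shows what a rank-16 adapter adds to a single frozen linear layer. The 1024-dimensional layer, the scaling `alpha / rank`, and the class name `LoRALinear` are illustrative assumptions; the summary only states the rank, not the layer sizes or the scaling used.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen base weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

# Hypothetical 1024 x 1024 attention projection.
W = np.random.default_rng(1).normal(size=(1024, 1024))
layer = LoRALinear(W, rank=16)
ratio = layer.num_trainable() / W.size
print(f"trainable parameters: {layer.num_trainable()} ({ratio:.2%} of the base layer)")
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained model exactly, and only a small fraction of the layer's parameters is updated. That small capacity is consistent with the stated motivation that a low-rank update is less prone to memorizing anchor-video artifacts.
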

Key Experimental Results

Main Results

Method	PSNR(all)↑	SSIM(all)↑	LPIPS(all)↓	PSNR(occ)↑
ReCapture (Ours)	20.92	0.596	0.402	18.92
Gen. Camera Dolly	20.30	0.587	0.408	18.60
ZeroNVS	15.68	0.396	0.508	14.18
4D-GS	14.92	0.388	0.584	14.55
HexPlane	15.38	0.428	0.568	14.71
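
The difference between the "all" and "occ" columns can be sketched as PSNR computed over every pixel versus only over pixels flagged as occluded in the input view. How the benchmark defines the occlusion mask is not described in this summary, so the mask below and the function name `psnr` are assumptions for illustration.

```python
import numpy as np

def psnr(pred, target, mask=None, max_val=1.0):
    """PSNR in dB; if mask (H, W) is given, only pixels where mask != 0 count."""
    err = (np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)) ** 2
    if mask is not None:
        err = err[np.asarray(mask, dtype=bool)]
    if err.size == 0:
        return float("nan")                      # no pixels selected
    mse = err.mean()
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

# Toy example: errors concentrated in the (hypothetical) occluded half of the frame.
target = np.zeros((4, 4, 3))
pred = target.copy()
pred[:, 2:] += 0.1
occluded = np.zeros((4, 4))
occluded[:, 2:] = 1
print("PSNR(all):", psnr(pred, target))
print("PSNR(occ):", psnr(pred, target, occluded))
```

When errors concentrate in regions the input never observed, PSNR(occ) is lower than PSNR(all), which matches the pattern in every row of the table.
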

VBench Metric	ReCapture	Gen. Camera Dolly
Subject Consistency	88.53%	83.02%
Background Consistency	92.02%	80.42%
Temporal Flickering	91.12%	74.64%
Motion Smoothness	98.24%	82.33%
Aesthetic Quality	57.35%	38.67%
Imaging Quality	64.75%	58.62%

Ablation Study

Component	Subject Cons.	BG Cons.	Flicker	Aesthetic
Anchor Video (Stage 1 only)	82.41%	77.45%	64.50%	34.94%
+ Temporal LoRA w/ Masks	85.24%	90.88%	89.60%	40.41%
+ Spatial LoRA	86.02%	91.24%	90.02%	49.18%
+ SDEdit (Full Method)	88.53%	92.02%	91.12%	57.35%
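
The per-stage gains implied by the ablation table can be recomputed with a few lines of Python. The numbers are copied from the table above; the dictionary layout is just one convenient way to organize them.

```python
# Scores copied from the ablation table, in percent.
metrics = ("Subject Cons.", "BG Cons.", "Flicker", "Aesthetic")
ablation = {
    "Anchor Video (Stage 1 only)": (82.41, 77.45, 64.50, 34.94),
    "+ Temporal LoRA w/ Masks": (85.24, 90.88, 89.60, 40.41),
    "+ Spatial LoRA": (86.02, 91.24, 90.02, 49.18),
    "+ SDEdit (Full Method)": (88.53, 92.02, 91.12, 57.35),
}

# Gain of each stage over the previous row, in percentage points.
stages = list(ablation)
gains = {}
for prev, cur in zip(stages, stages[1:]):
    gains[cur] = {
        m: round(b - a, 2) for m, a, b in zip(metrics, ablation[prev], ablation[cur])
    }

for stage, delta in gains.items():
    top = max(delta, key=delta.get)
    print(f"{stage}: largest gain is {top} (+{delta[top]:.2f} points)")
```

The largest single jump comes from the masked temporal LoRA on the flicker metric, while the spatial LoRA and SDEdit stages each contribute most to aesthetic quality.
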

Key Findings

The masked temporal LoRA is the most critical component: the temporal flickering score rises from 64.50% to 89.60% (higher means less flicker) and background consistency rises from 77.45% to 90.88%, indicating that the masked loss ignores artifacts while learning motion patterns.
The Spatial LoRA mainly improves aesthetic quality (40.41% \(\rightarrow\) 49.18%) because it learns the correct appearance context from the source video to fill in empty regions.
SDEdit post-processing brings overall improvements, especially in aesthetic quality (49.18% \(\rightarrow\) 57.35%).
ReCapture outperforms Gen. Camera Dolly (which requires 4D training data) on the Kubric quantitative evaluation, indicating that strong performance can be achieved without paired 4D data.
High-level VBench metrics reflect the actual difference in visual quality better than low-level PSNR (Dolly has a comparable PSNR but is visually much blurrier).

Highlights & Insights

Masked fine-tuning is an elegant design: It does not require clean training data. The mask naturally informs the model of which areas are trustworthy and which require prior-based infilling. This loosely resembles curriculum learning: the model learns the known motion first, then infers the unobserved regions.
The decoupled spatial-temporal LoRA design is clean and effective: letting the spatial LoRA focus solely on appearance (trained on static frames) and the temporal LoRA focus solely on motion (trained on videos) prevents mutual interference.
The two Stage 1 methods are complementary in practice: point cloud rendering handles simple motions efficiently, while multi-view diffusion covers large-angle rotations.
The entire method requires no paired training data (in contrast to Camera Dolly, which requires simulators to generate 4D data).

Limitations & Future Work

The method depends on the quality of monocular depth estimation; depth errors in complex scenes propagate to the anchor video.
Point cloud rendering is ineffective for large rotations, whereas multi-view diffusion models are computationally expensive and introduce temporal inconsistency that must be repaired in the next stage.
Each video requires individual LoRA training (400 steps / 5 minutes), which prevents real-time interaction.
For scenes with fast-moving objects or dramatic occlusion changes, the completed regions may appear unnatural.
Future directions: Leveraging stronger video diffusion models as backbones and exploring training-free camera control schemes.

vs Generative Camera Dolly: Generative Camera Dolly requires paired 4D training data (generated by a simulator) and shows good in-domain performance but poor generalization. ReCapture requires no paired data and relies on diffusion model priors to generalize.
vs 4D-GS / HexPlane: 4D reconstruction methods require multi-view cues and cannot extrapolate the field of view; ReCapture can plausibly 'hallucinate' unobserved regions.
vs CameraCtrl / MotionCtrl: These insert camera control during the generation process, which is only applicable to videos generated by the model itself. ReCapture processes existing, user-provided videos.
vs Still-Moving / Dreamix: Video personalization methods target different goals (subject/style-driven generation), but share common ground in LoRA and video fine-tuning techniques.

Rating

Novelty: ⭐⭐⭐⭐ The masked video fine-tuning concept is novel and generic, and the design of the two-stage divide-and-conquer strategy is clever.
Experimental Thoroughness: ⭐⭐⭐⭐ Extensive evaluation covering Kubric quantitative results, high-level VBench metrics, ablation studies, and qualitative comparisons.
Writing Quality: ⭐⭐⭐⭐ The methodology is clearly described, and the diagrams are highly informative.
Value: ⭐⭐⭐⭐ Direct application value for video editing and content creation; masked fine-tuning is transferable to other video-to-video tasks.