AnyPortal: Zero-Shot Consistent Video Background Replacement

Conference: ICCV 2025 · arXiv: 2509.07472 · Code: To be released
Area: Diffusion Models / Video Editing
Keywords: video background replacement, foreground relighting, zero-shot, diffusion models, temporal consistency

TL;DR

AnyPortal is a zero-shot, training-free video background replacement framework. It synergistically combines IC-Light's relighting prior with the temporal prior of a video diffusion model (CogVideoX), and introduces a Refinement Projection Algorithm (RPA) for pixel-level foreground preservation; the full pipeline runs efficiently on a single 24 GB GPU.

Background & Motivation

Background: Video background replacement ("virtual transportation") in the film industry relies on green screens and complex post-production pipelines, entailing high cost and a steep barrier to entry. Rapid advances in AIGC have brought image-level background replacement (e.g., IC-Light) to a high level of quality, yet video-level replacement remains challenging.

Limitations of Prior Work:

  • IC-Light supports images only; per-frame processing causes severe inter-frame inconsistency.
  • Existing video diffusion models (CogVideoX, OpenSora) offer only coarse-grained control (edges, pose) and lack pixel-level precision.
  • Adapting video models to background replacement would require large amounts of paired training video, which is extremely scarce.

Key Challenge: IC-Light possesses a strong lighting prior but lacks video temporal modeling; video diffusion models have temporal priors but cannot precisely preserve foreground details. Naively combining the two leads to foreground consistency problems — existing DDIM inversion and latent manipulation solutions perform poorly in the highly compressed 3D latent space of video models.

Goal: Achieve video background replacement with natural foreground relighting, inter-frame temporal consistency, and pixel-level foreground preservation, without requiring any training.

Key Insight: Large pre-trained diffusion models already encode rich prior knowledge; the key lies in how to synergistically exploit them in a zero-shot setting.

Core Idea: A three-stage pipeline (background generation → lighting harmonization → consistency enhancement), combined with the newly proposed RPA, achieves high-quality video background replacement without any training.

Method

Overall Architecture

AnyPortal is a three-stage pipeline: the inputs are a foreground video \(\mathbf{I}\) and a text prompt \(p\) describing the target background (or a background image), and the output is a replacement video \(\mathbf{I}'\) with preserved foreground, harmonized lighting, and temporal consistency. All models are frozen; no training or test-time optimization is performed.
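
As a mental model, the three stages compose as a plain orchestration loop. Below is a minimal sketch assuming hypothetical callables around the frozen models; all names and signatures are illustrative, not the authors' API:

```python
def anyportal(frames, prompt, models, T0, T1):
    """Schematic three-stage AnyPortal pipeline (not the authors' code).

    `models` bundles the frozen pre-trained components; every attribute
    used below is a hypothetical callable standing in for the module
    named in the comment.
    """
    # Stage 1: motion-aware background generation
    first_bg = models.ic_light_text(frames[0], prompt)        # IC-Light delta_p on frame 1
    raw_bg = models.das(first_frame=first_bg, motion=frames)  # DAS follows the 3D point motion
    clean_bg = models.propainter(raw_bg)                      # inpaint away residual foreground

    # Stage 2: two-step light harmonization
    composite = models.ic_light_image(frames, clean_bg)       # image-guided delta_I composite
    relit = models.sdedit(composite, prompt, steps=T0)        # text-guided delta_p, T0 SDEdit steps

    # Stage 3: consistency enhancement with RPA inside the denoising loop
    return models.video_denoise_rpa(relit, frames, clean_bg, steps=T1)
```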

Key Designs

  1. Stage 1: Motion-Aware Background Generation

    • Function: Generate a pure background video \(\mathbf{I}_b\) whose camera motion is consistent with the input video.
    • Mechanism: IC-Light \(\delta_p\) processes the first frame to obtain \(I_1'\); the Diffusion-As-Shader (DAS) framework then uses \(I_1'\) as the first frame and the 3D point motion of the original video as guidance to generate a preliminary video \(\bar{\mathbf{I}}_b\); finally, ProPainter removes foreground objects to yield the clean background \(\mathbf{I}_b\).
    • Design Motivation: The background camera motion must match the input video; however, the foreground generated by DAS may differ from the original, necessitating inpainting to remove it.
  2. Stage 2: Two-Step Light Harmonization

    • Function: Composite the foreground with the new background and achieve natural lighting fusion.
    • Mechanism: An image-guided model \(\delta_I(I_f, I_b)\) first produces a base composite, which is then noised and denoised for \(T_0\) steps by a text-guided model \(\delta_p\) in SDEdit fashion to enhance the lighting. Cross-frame attention is introduced into both IC-Light models so that every frame aggregates the key/value representations of the first frame, maintaining stylistic consistency (a minimal sketch of this rewiring follows this list).
    • Design Motivation: Using \(\delta_I\) alone yields insufficient lighting effects (e.g., back-lighting is absent); using \(\delta_p\) alone produces inconsistent backgrounds and lacks image guidance. The two-step combination exploits the strengths of both, and \(T_0\) allows adjustment of lighting intensity.
  3. Stage 3: Consistency Enhancement + Refinement Projection Algorithm (RPA)

    • Function: Enhance inter-frame temporal consistency via a video diffusion model while preserving pixel-level foreground details with RPA.
    • Mechanism:
      • The Stage-2 output \(\mathbf{I}_L\) is noised for \(T_1\) steps via SDEdit and then denoised by the video model \(\epsilon_\theta\) (with an edge ControlNet to maintain coarse structure).
      • At each denoising step, RPA: ① decodes \(x_0^t\) to the pixel domain; ② separates high- and low-frequency components, replacing foreground high frequencies with those of the original video and retaining the denoised low frequencies (lighting), while inpainted results are used in the background region; ③ re-encodes the modified \(\tilde{\mathbf{I}}_0^t\) back to the latent space.
      • Key Innovation: To prevent the reconstruction error and stochasticity of VAE encode–decode cycles from accumulating and blurring the background, RPA computes a deterministic sampling direction \(\hat{\epsilon} = (x_0^t - \mu) / \sigma\), ensuring that \(\hat{x}_0^t\) exactly equals \(x_0^t\) in unmodified regions (zero-error projection; see the sketch after this list).
    • Design Motivation: The 3D latent space of video models is highly compressed; conventional pixel-domain operations followed by re-encoding introduce errors. The zero-error projection property of RPA guarantees that only foreground details are altered while background regions remain entirely unaffected.
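
For Stage 2, here is a minimal sketch of the cross-frame attention rewiring, assuming a PyTorch-style attention layer whose query/key/value projections are already computed; the tensor layout is an assumption of this sketch:

```python
import torch.nn.functional as F

def cross_frame_attention(q, k, v):
    """All frames attend to the first frame's keys/values.

    q, k, v: (frames, tokens, dim) projections taken from a frozen
    IC-Light attention layer; only the K/V source is rewired.
    """
    f = q.shape[0]
    k0 = k[:1].expand(f, -1, -1)  # broadcast frame-0 keys to every frame
    v0 = v[:1].expand(f, -1, -1)  # broadcast frame-0 values to every frame
    return F.scaled_dot_product_attention(q, k0, v0)
```

For Stage 3, a schematic of one RPA projection step. The VAE interface, the frequency-split function, and the latent-space mask are assumptions; the deterministic direction follows the formula quoted above:

```python
import torch

def rpa_step(x0_t, mu, sigma, vae, fg_mask, latent_mask,
             orig_frames, inpainted_bg, split):
    """One Refinement Projection step (schematic).

    x0_t: predicted clean latent at step t; mu, sigma parameterize the
    sampler's Gaussian transition x_{t-1} = mu + sigma * eps.
    `split` is an assumed frequency-split function (e.g. a Gaussian
    low-pass returning (low, high) components).
    """
    # ① decode the predicted latent to the pixel domain
    frames = vae.decode(x0_t)

    # ② keep the denoised low frequencies (lighting), restore the
    # original foreground high frequencies, paste the inpainted background
    low, high = split(frames)
    _, orig_high = split(orig_frames)
    edited = low + torch.where(fg_mask, orig_high, high)
    edited = torch.where(fg_mask, edited, inpainted_bg)

    # ③ re-encode the edited frames
    x0_tilde = vae.encode(edited)

    # Zero-error projection: target the re-encoded latent only where the
    # foreground was edited (latent_mask is an assumed downsampled mask)
    # and keep x0_t elsewhere, so VAE round-trip error cannot leak into
    # unmodified regions; the deterministic direction lands exactly on
    # the target instead of injecting fresh noise.
    x0_target = torch.where(latent_mask, x0_tilde, x0_t)
    eps_hat = (x0_target - mu) / sigma
    return mu + sigma * eps_hat  # equals x0_target exactly
```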

Loss & Training

Completely training-free — all models are frozen, and no loss functions or optimization are required. This is one of the paper's core advantages.

Key Experimental Results

Main Results

Comparison with zero-shot baselines on a test set of 30 samples:

| Metric | IC-Light | TokenFlow | DAS | AnyPortal |
| --- | --- | --- | --- | --- |
| Fram-Acc ↑ | 0.983 | 0.541 | 0.937 | 0.973 |
| Tem-Con ↑ | 0.945 | 0.981 | 0.986 | 0.993 |
| ID-Psrv ↓ | 0.578 | 0.632 | 0.364 | 0.313 |
| Mtn-Psrv ↑ | 0.844 | 0.985 | 0.878 | 0.987 |
| User-Pmt ↑ | 1.11% | 1.11% | 29.72% | 68.06% |
| User-Tem ↑ | 0.56% | 5.56% | 28.61% | 65.28% |

Ablation Study

| Configuration | Fram-Acc ↑ | Tem-Con ↑ | ID-Psrv ↓ | Mtn-Psrv ↑ |
| --- | --- | --- | --- | --- |
| Full model | 0.973 | 0.993 | 0.313 | 0.987 |
| w/o \(\delta_p\) | 0.966 | 0.989 | 0.329 | 0.987 |
| w/o Cst-Enh | 0.970 | 0.961 | 0.353 | 0.973 |
| w/o RPA | 0.970 | 0.987 | 0.371 | 0.984 |

Key Findings

  • The consistency enhancement stage contributes most: removing it drops Tem-Con from 0.993 to 0.961, demonstrating the critical role of the video diffusion model's temporal prior.
  • RPA is essential for foreground preservation: removing it raises ID-Psrv from 0.313 to 0.371, and visually causes background blurring.
  • AnyPortal achieves a dominant preference rate of over 65% on both user-study metrics.
  • On a single RTX 4090 GPU, inference takes approximately 12 minutes per video (49 frames, 480×720).

Highlights & Insights

  • The zero-error projection in RPA is the most elegant design: by computing \(\hat{\epsilon} = (x_0^t - \mu)/\sigma\) as a deterministic sampling direction, it avoids the accumulation of VAE encode–decode errors (the one-line algebra is spelled out after this list). The idea transfers to any scenario requiring pixel-level operations in a highly compressed 3D latent space.
  • The modular design allows any component to be replaced with a newer pre-trained model at any time, making the framework naturally compatible with future advances in AIGC.
  • The two-step IC-Light harmonization cleverly leverages \(\delta_I\) for spatial consistency and \(\delta_p\) for enhanced lighting effects, with cross-frame attention providing additional stylistic consistency.
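
To make the zero-error property explicit: reading the sampler's transition as the reparameterized Gaussian step \(x_{t-1} = \mu + \sigma\epsilon\) (an assumption consistent with the formula above), substituting the deterministic direction gives

```latex
% One-line algebra behind the zero-error projection, assuming the
% reparameterized sampling step x_{t-1} = \mu + \sigma\,\epsilon:
\hat{x}_0^t
  \;=\; \mu + \sigma\,\hat{\epsilon}
  \;=\; \mu + \sigma \cdot \frac{x_0^t - \mu}{\sigma}
  \;=\; x_0^t
```

so unmodified regions come through the step exactly, with no fresh sampling noise or accumulated VAE error.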

Limitations & Future Work

  • Low-quality or low-resolution inputs lead to poor high-frequency detail transfer (blurring in hair regions).
  • Unclear foreground–background boundaries cause blurring artifacts around the subject.
  • Video diffusion models still produce artifacts in fast-motion scenes.
  • Inference time of approximately 12 minutes per video is far from real-time.
  • Fixed resolution and length (480×720, 49 frames) due to CogVideoX constraints.

Comparison with Related Methods

  • vs. IC-Light (per-frame): severe inter-frame inconsistency and foreground color alteration; AnyPortal substantially improves both via the video model and RPA.
  • vs. TokenFlow: limited editing capability and insufficient foreground control; AnyPortal outperforms it across all metrics.
  • vs. DAS: unable to preserve foreground motion dynamics; AnyPortal's RPA provides pixel-level guarantees.
  • vs. RelightVid: requires fine-tuning AnimateDiff; AnyPortal is entirely training-free and benefits from the stronger CogVideoX backbone.

Rating

  • Novelty: ⭐⭐⭐⭐ The zero-error projection in RPA is a novel contribution, though the overall framework is a combination of existing modules.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative evaluation, user study, and complete ablation are provided, but the test set of only 30 samples is relatively small.
  • Writing Quality: ⭐⭐⭐⭐⭐ Logic is clear, figures are intuitive, and the three-stage structure is presented in a well-motivated, progressive manner.
  • Value: ⭐⭐⭐⭐ The first training-free video background replacement framework; highly practical, with a forward-looking modular design.