AnyPortal: Zero-Shot Consistent Video Background Replacement¶
Conference: ICCV 2025 | arXiv: 2509.07472 | Code: To be released | Area: Diffusion Models / Video Editing | Keywords: video background replacement, foreground relighting, zero-shot, diffusion models, temporal consistency
TL;DR¶
AnyPortal is a zero-shot, training-free video background replacement framework that synergistically combines IC-Light's relighting capability with the temporal prior of a video diffusion model (CogVideoX), together with a newly proposed Refinement Projection Algorithm (RPA) for pixel-level foreground preservation. It runs efficiently on a single 24 GB GPU.
Background & Motivation¶
Background: Video background replacement ("virtual transportation") in the film industry relies on green screens and complex post-production pipelines, entailing high cost and a steep barrier to entry. Rapid advances in AIGC have brought image-level background replacement (e.g., IC-Light) to a high level of quality, yet video-level replacement remains challenging.
Limitations of Prior Work:
- IC-Light supports images only; per-frame processing causes severe inter-frame inconsistency.
- Existing video diffusion models (CogVideoX, OpenSora) offer only coarse-grained control (edges, pose) and lack pixel-level precision.
- Adapting video models to background replacement requires large amounts of paired video training data, which is extremely scarce.
Key Challenge: IC-Light possesses a strong lighting prior but lacks video temporal modeling; video diffusion models have temporal priors but cannot precisely preserve foreground details. Naively combining the two leads to foreground consistency problems — existing DDIM inversion and latent manipulation solutions perform poorly in the highly compressed 3D latent space of video models.
Goal: Achieve video background replacement with natural foreground relighting, inter-frame temporal consistency, and pixel-level foreground preservation, without requiring any training.
Key Insight: Large pre-trained diffusion models already encode rich prior knowledge; the key lies in how to synergistically exploit them in a zero-shot setting.
Core Idea: A three-stage pipeline (background generation → lighting harmonization → consistency enhancement) combined with the newly proposed RPA algorithm achieves high-quality video background replacement without any training.
Method¶
Overall Architecture¶
AnyPortal is a three-stage pipeline: the inputs are a foreground video \(\mathbf{I}\) and a text prompt \(p\) describing the target background (or a background image), and the output is a replacement video \(\mathbf{I}'\) with preserved foreground, harmonized lighting, and temporal consistency. All models are frozen; no training or test-time optimization is performed.
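At a high level, the three stages compose as follows. This is a minimal numpy sketch with placeholder stand-ins for the real components (IC-Light, DAS/ProPainter, CogVideoX); all function names here are illustrative, not the authors' API.

```python
import numpy as np

def generate_background(frames: np.ndarray, prompt: str) -> np.ndarray:
    """Stage 1 stand-in: motion-aligned background video I_b
    (placeholder for IC-Light + DAS + ProPainter)."""
    return np.zeros_like(frames)

def harmonize_lighting(fg: np.ndarray, bg: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: composite foreground onto the new background
    (placeholder for the two-step IC-Light harmonization)."""
    return mask * fg + (1.0 - mask) * bg

def enhance_consistency(video: np.ndarray) -> np.ndarray:
    """Stage 3 stand-in: temporal smoothing in place of the video
    diffusion prior + RPA."""
    smoothed = video.copy()
    smoothed[1:-1] = (video[:-2] + video[1:-1] + video[2:]) / 3.0
    return smoothed

def anyportal(frames: np.ndarray, mask: np.ndarray, prompt: str) -> np.ndarray:
    bg = generate_background(frames, prompt)
    composite = harmonize_lighting(frames, bg, mask)
    return enhance_consistency(composite)

frames = np.random.rand(49, 8, 8, 3)                   # (T, H, W, C) toy video
mask = (np.random.rand(49, 8, 8, 1) > 0.5).astype(float)
out = anyportal(frames, mask, "a sunny beach")
print(out.shape)  # (49, 8, 8, 3)
```

Since every stage is a frozen, swappable model behind a functional interface, any component can be upgraded independently — which is exactly the modularity the paper emphasizes.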
Key Designs¶
- Stage 1: Motion-Aware Background Generation
- Function: Generate a pure background video \(\mathbf{I}_b\) whose camera motion is consistent with the input video.
- Mechanism: IC-Light \(\delta_p\) processes the first frame to obtain \(I_1'\); the Diffusion-As-Shader (DAS) framework then uses \(I_1'\) as the first frame and the 3D point motion of the original video as guidance to generate a preliminary video \(\bar{\mathbf{I}}_b\); finally, ProPainter removes foreground objects to yield the clean background \(\mathbf{I}_b\).
- Design Motivation: The background camera motion must match the input video; however, the foreground generated by DAS may differ from the original, necessitating inpainting to remove it.
- Stage 2: Two-Step Light Harmonization
- Function: Composite the foreground with the new background and achieve natural lighting fusion.
- Mechanism: An image-guided model \(\delta_I(I_f, I_b)\) first produces a base compositing result, which is then noised and denoised for \(T_0\) steps using a text-guided model \(\delta_p\) in an SDEdit fashion to enhance the lighting. Cross-frame attention is introduced into both IC-Light models so that all frames aggregate the key/value representations of the first frame to maintain stylistic consistency.
- Design Motivation: Using \(\delta_I\) alone yields insufficient lighting effects (e.g., back-lighting is absent); using \(\delta_p\) alone produces inconsistent backgrounds and lacks image guidance. The two-step combination exploits the strengths of both, and \(T_0\) allows adjustment of lighting intensity.
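The cross-frame attention trick can be sketched as below: every frame's queries attend to the keys/values of the first frame only, so IC-Light's per-frame edits inherit a shared style anchor. This is a toy numpy version under assumed tensor shapes, not the paper's implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q, k, v: (T, N, d). All frames share the keys/values of frame 0,
    so their outputs are drawn from the same style anchor."""
    k0, v0 = k[0], v[0]                       # (N, d) from the first frame
    d = q.shape[-1]
    attn = softmax(q @ k0.T / np.sqrt(d))     # (T, N, N) attention weights
    return attn @ v0                          # (T, N, d)

T, N, d = 4, 16, 8
q = np.random.rand(T, N, d)
k = np.random.rand(T, N, d)
v = np.random.rand(T, N, d)
out = cross_frame_attention(q, k, v)
print(out.shape)  # (4, 16, 8)
```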
- Stage 3: Consistency Enhancement + Refinement Projection Algorithm (RPA)
- Function: Enhance inter-frame temporal consistency via a video diffusion model while preserving pixel-level foreground details with RPA.
- Mechanism:
- \(\mathbf{I}_L\) is noised for \(T_1\) steps via SDEdit and then denoised by the video model \(\epsilon_\theta\) (with edge ControlNet to maintain coarse structure).
- At each denoising step, RPA: ① decodes \(x_0^t\) to the pixel domain; ② splits it into high- and low-frequency components, replacing the foreground high frequencies with those of the original video while keeping the denoised low frequencies (lighting), and filling the background region with the inpainted result; ③ re-encodes the modified \(\tilde{\mathbf{I}}_0^t\) back into the latent space.
- Key Innovation: To prevent reconstruction errors and stochasticity from VAE encode–decode cycles from accumulating and blurring the background, RPA computes a deterministic sampling direction \(\hat{\epsilon} = (x_0^t - \mu) / \sigma\), ensuring that \(\hat{x}_0^t\) exactly equals \(x_0^t\) in unmodified regions (zero-error projection).
- Design Motivation: The 3D latent space of video models is highly compressed; conventional pixel-domain operations followed by re-encoding introduce errors. The zero-error projection property of RPA guarantees that only foreground details are altered while background regions remain entirely unaffected.
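The two mechanisms above can be illustrated with a toy numpy sketch in my own notation (a DDPM-style noising with cumulative alpha \(\bar\alpha\), standing in for the paper's \(\mu, \sigma\) parameterization): the frequency recomposition of step ②, and the reason a deterministic noise direction yields zero projection error — solving the forward equation for the noise and re-deriving \(x_0\) from it reproduces the edited estimate exactly, so unmodified regions are untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def lowpass(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Box blur as a cheap low-frequency extractor (toy stand-in)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

# -- step 2: frequency recomposition in the foreground --------------------
orig = rng.standard_normal((8, 8))        # original video frame (pixels)
denoised = rng.standard_normal((8, 8))    # current denoised estimate
mask = np.zeros((8, 8)); mask[:4] = 1.0   # toy foreground mask
recomposed = np.where(
    mask > 0,
    lowpass(denoised) + (orig - lowpass(orig)),  # denoised lighting + original detail
    denoised)                                    # background: keep denoised result

# -- step 3: zero-error projection ----------------------------------------
abar = 0.7                                # cumulative alpha at step t
x_t = np.sqrt(abar) * denoised + np.sqrt(1 - abar) * rng.standard_normal((8, 8))
# Solve x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps for eps, given the edit:
eps_hat = (x_t - np.sqrt(abar) * recomposed) / np.sqrt(1 - abar)
# Re-deriving x0 from (x_t, eps_hat) recovers the edit exactly:
x0_back = (x_t - np.sqrt(1 - abar) * eps_hat) / np.sqrt(abar)
print(np.allclose(x0_back, recomposed))   # True: zero-error round trip
```

The round trip is exact by construction, which is the heart of the argument: no VAE-style reconstruction noise is re-injected into regions RPA did not modify.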
Loss & Training¶
Completely training-free — all models are frozen, and no loss functions or optimization are required. This is one of the paper's core advantages.
Key Experimental Results¶
Main Results¶
Comparison with zero-shot baselines on a test set of 30 samples:
| Metric | IC-Light | TokenFlow | DAS | AnyPortal |
|---|---|---|---|---|
| Fram-Acc ↑ | 0.983 | 0.541 | 0.937 | 0.973 |
| Tem-Con ↑ | 0.945 | 0.981 | 0.986 | 0.993 |
| ID-Psrv ↓ | 0.578 | 0.632 | 0.364 | 0.313 |
| Mtn-Psrv ↑ | 0.844 | 0.985 | 0.878 | 0.987 |
| User-Pmt | 1.11% | 1.11% | 29.72% | 68.06% |
| User-Tem | 0.56% | 5.56% | 28.61% | 65.28% |
Ablation Study¶
| Configuration | Fram-Acc ↑ | Tem-Con ↑ | ID-Psrv ↓ | Mtn-Psrv ↑ |
|---|---|---|---|---|
| Full model | 0.973 | 0.993 | 0.313 | 0.987 |
| w/o \(\delta_p\) | 0.966 | 0.989 | 0.329 | 0.987 |
| w/o Cst-Enh | 0.970 | 0.961 | 0.353 | 0.973 |
| w/o RPA | 0.970 | 0.987 | 0.371 | 0.984 |
Key Findings¶
- The consistency enhancement stage contributes most: removing it drops Tem-Con from 0.993 to 0.961, demonstrating the critical role of the video diffusion model's temporal prior.
- RPA is essential for foreground preservation: removing it raises ID-Psrv from 0.313 to 0.371, and visually causes background blurring.
- AnyPortal achieves a dominant preference rate of over 65% on both user study metrics (68.06% User-Pmt, 65.28% User-Tem).
- On a single RTX 4090 GPU, inference takes approximately 12 minutes per video (49 frames, 480×720).
Highlights & Insights¶
- The zero-error projection in RPA is the most elegant design: by computing \(\hat{\epsilon} = (x_0^t - \mu)/\sigma\) as a deterministic sampling direction, it avoids the accumulation of VAE encode–decode errors. This idea is transferable to any scenario requiring pixel-level operations in a 3D latent space.
- The modular design allows any component to be replaced with a newer pre-trained model at any time, making the framework naturally compatible with future advances in AIGC.
- The two-step IC-Light harmonization cleverly leverages \(\delta_I\) for spatial consistency and \(\delta_p\) for enhanced lighting effects, with cross-frame attention providing additional stylistic consistency.
Limitations & Future Work¶
- Low-quality or low-resolution inputs lead to poor high-frequency detail transfer (blurring in hair regions).
- Unclear foreground–background boundaries cause blurring artifacts around the subject.
- Video diffusion models still produce artifacts in fast-motion scenes.
- Inference time of approximately 12 minutes per video is far from real-time.
- Fixed resolution and length (480×720, 49 frames) due to CogVideoX constraints.
Related Work & Insights¶
- vs. IC-Light (per-frame): Severe inter-frame inconsistency and foreground color alteration; AnyPortal substantially improves both via the video model and RPA.
- vs. TokenFlow: Limited editing capability and insufficient foreground control; AnyPortal outperforms it comprehensively across all metrics.
- vs. DAS: Unable to preserve foreground motion dynamics; AnyPortal's RPA provides pixel-level guarantees.
- vs. RelightVid: Requires fine-tuning AnimateDiff; AnyPortal is entirely training-free and benefits from the stronger CogVideoX backbone.
Rating¶
- Novelty: ⭐⭐⭐⭐ The zero-error projection in RPA is a novel contribution, though the overall framework is a combination of existing modules.
- Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative evaluation, user study, and complete ablation are provided, but the test set of only 30 samples is relatively small.
- Writing Quality: ⭐⭐⭐⭐⭐ Logic is clear, figures are intuitive, and the three-stage structure is presented in a well-motivated, progressive manner.
- Value: ⭐⭐⭐⭐ The first training-free video background replacement framework; highly practical, with a forward-looking modular design.