AnyPortal: Zero-Shot Consistent Video Background Replacement¶
Conference: ICCV 2025 | arXiv: 2509.07472 | Code: To be released | Area: Diffusion Models / Video Editing | Keywords: video background replacement, foreground relighting, zero-shot, diffusion models, temporal consistency
TL;DR¶
AnyPortal is a zero-shot, training-free video background replacement framework that synergistically combines IC-Light's relighting capability with the temporal prior of a video diffusion model (CogVideoX), together with a newly proposed Refinement Projection Algorithm (RPA) for pixel-level foreground preservation. It runs efficiently on a single 24 GB GPU.
Background & Motivation¶
Background: Video background replacement ("virtual transportation") in the film industry relies on green screens and complex post-production pipelines, entailing high cost and a steep barrier to entry. Rapid advances in AIGC have brought image-level background replacement (e.g., IC-Light) to a high level of quality, yet video-level replacement remains challenging.
Limitations of Prior Work:
- IC-Light supports images only; per-frame processing causes severe inter-frame inconsistency.
- Existing video diffusion models (CogVideoX, OpenSora) offer only coarse-grained control (edges, pose) and lack pixel-level precision.
- Adapting video models to background replacement requires large amounts of paired video training data, which is extremely scarce.
Key Challenge: IC-Light possesses a strong lighting prior but lacks video temporal modeling; video diffusion models have temporal priors but cannot precisely preserve foreground details. Naively combining the two leads to foreground consistency problems — existing DDIM inversion and latent manipulation solutions perform poorly in the highly compressed 3D latent space of video models.
Goal: Achieve video background replacement with natural foreground relighting, inter-frame temporal consistency, and pixel-level foreground preservation, without requiring any training.
Key Insight: Large pre-trained diffusion models already encode rich prior knowledge; the key lies in how to synergistically exploit them in a zero-shot setting.
Core Idea: A three-stage pipeline (background generation → lighting harmonization → consistency enhancement) combined with the newly proposed RPA algorithm achieves high-quality video background replacement without any training.
Method¶
Overall Architecture¶
AnyPortal is a three-stage pipeline: the inputs are a foreground video \(\mathbf{I}\) and a text prompt \(p\) describing the target background (or a background image), and the output is a replacement video \(\mathbf{I}'\) with preserved foreground, harmonized lighting, and temporal consistency. All models are frozen; no training or test-time optimization is performed.
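At a high level, the three stages compose as follows. This is a minimal numpy sketch with placeholder stand-ins for the real components (IC-Light, DAS/ProPainter, CogVideoX); all function names here are illustrative, not the authors' API.

```python
import numpy as np

def generate_background(frames: np.ndarray, prompt: str) -> np.ndarray:
    """Stage 1 stand-in: motion-aligned background video I_b
    (placeholder for IC-Light + DAS + ProPainter)."""
    return np.zeros_like(frames)

def harmonize_lighting(fg: np.ndarray, bg: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: composite foreground onto the new background
    (placeholder for the two-step IC-Light harmonization)."""
    return mask * fg + (1.0 - mask) * bg

def enhance_consistency(video: np.ndarray) -> np.ndarray:
    """Stage 3 stand-in: temporal smoothing in place of the video
    diffusion prior + RPA."""
    smoothed = video.copy()
    smoothed[1:-1] = (video[:-2] + video[1:-1] + video[2:]) / 3.0
    return smoothed

def anyportal(frames: np.ndarray, mask: np.ndarray, prompt: str) -> np.ndarray:
    bg = generate_background(frames, prompt)
    composite = harmonize_lighting(frames, bg, mask)
    return enhance_consistency(composite)

frames = np.random.rand(49, 8, 8, 3)                   # (T, H, W, C) toy video
mask = (np.random.rand(49, 8, 8, 1) > 0.5).astype(float)
out = anyportal(frames, mask, "a sunny beach")
print(out.shape)  # (49, 8, 8, 3)
```

Since every stage is a frozen, swappable model behind a functional interface, any component can be upgraded independently — which is exactly the modularity the paper emphasizes.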
Key Designs¶
- Stage 1: Motion-Aware Background Generation
- Function: Generate a pure background video \(\mathbf{I}_b\) whose camera motion is consistent with the input video.
- Mechanism: IC-Light \(\delta_p\) processes the first frame to obtain \(I_1'\); the Diffusion-As-Shader (DAS) framework then uses \(I_1'\) as the first frame and the 3D point motion of the original video as guidance to generate a preliminary video \(\bar{\mathbf{I}}_b\); finally, ProPainter removes foreground objects to yield the clean background \(\mathbf{I}_b\).
- Design Motivation: The background camera motion must match the input video; however, the foreground generated by DAS may differ from the original, necessitating inpainting to remove it.
- Stage 2: Two-Step Light Harmonization
- Function: Composite the foreground with the new background and achieve natural lighting fusion.
- Mechanism: An image-guided model \(\delta_I(I_f, I_b)\) first produces a base compositing result, which is then noised and denoised for \(T_0\) steps using a text-guided model \(\delta_p\) in an SDEdit fashion to enhance the lighting. Cross-frame attention is introduced into both IC-Light models so that all frames aggregate the key/value representations of the first frame to maintain stylistic consistency.
- Design Motivation: Using \(\delta_I\) alone yields insufficient lighting effects (e.g., back-lighting is absent); using \(\delta_p\) alone produces inconsistent backgrounds and lacks image guidance. The two-step combination exploits the strengths of both, and \(T_0\) allows adjustment of lighting intensity.
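The cross-frame attention trick can be sketched as below: every frame's queries attend to the keys/values of the first frame only, so IC-Light's per-frame edits inherit a shared style anchor. This is a toy numpy version under assumed tensor shapes, not the paper's implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q, k, v: (T, N, d). All frames share the keys/values of frame 0,
    so their outputs are drawn from the same style anchor."""
    k0, v0 = k[0], v[0]                       # (N, d) from the first frame
    d = q.shape[-1]
    attn = softmax(q @ k0.T / np.sqrt(d))     # (T, N, N) attention weights
    return attn @ v0                          # (T, N, d)

T, N, d = 4, 16, 8
q = np.random.rand(T, N, d)
k = np.random.rand(T, N, d)
v = np.random.rand(T, N, d)
out = cross_frame_attention(q, k, v)
print(out.shape)  # (4, 16, 8)
```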
- Stage 3: Consistency Enhancement + Refinement Projection Algorithm (RPA)
- Function: Enhance inter-frame temporal consistency via a video diffusion model while preserving pixel-level foreground details with RPA.
- Mechanism:
- \(\mathbf{I}_L\) is noised for \(T_1\) steps via SDEdit and then denoised by the video model \(\epsilon_\theta\) (with edge ControlNet to maintain coarse structure).
- At each denoising step, RPA: ① decodes \(x_0^t\) to the pixel domain; ② splits it into high- and low-frequency components, replacing the foreground high frequencies with those of the original video while keeping the denoised low frequencies (lighting), and filling the background region with the inpainted result; ③ re-encodes the modified \(\tilde{\mathbf{I}}_0^t\) back into the latent space.
- Key Innovation: To prevent reconstruction errors and stochasticity from VAE encode–decode cycles from accumulating and blurring the background, RPA computes a deterministic sampling direction \(\hat{\epsilon} = (x_0^t - \mu) / \sigma\), ensuring that \(\hat{x}_0^t\) exactly equals \(x_0^t\) in unmodified regions (zero-error projection).
- Design Motivation: The 3D latent space of video models is highly compressed; conventional pixel-domain operations followed by re-encoding introduce errors. The zero-error projection property of RPA guarantees that only foreground details are altered while background regions remain entirely unaffected.
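The two mechanisms above can be illustrated with a toy numpy sketch in my own notation (a DDPM-style noising with cumulative alpha \(\bar\alpha\), standing in for the paper's \(\mu, \sigma\) parameterization): the frequency recomposition of step ②, and the reason a deterministic noise direction yields zero projection error — solving the forward equation for the noise and re-deriving \(x_0\) from it reproduces the edited estimate exactly, so unmodified regions are untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def lowpass(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Box blur as a cheap low-frequency extractor (toy stand-in)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

# -- step 2: frequency recomposition in the foreground --------------------
orig = rng.standard_normal((8, 8))        # original video frame (pixels)
denoised = rng.standard_normal((8, 8))    # current denoised estimate
mask = np.zeros((8, 8)); mask[:4] = 1.0   # toy foreground mask
recomposed = np.where(
    mask > 0,
    lowpass(denoised) + (orig - lowpass(orig)),  # denoised lighting + original detail
    denoised)                                    # background: keep denoised result

# -- step 3: zero-error projection ----------------------------------------
abar = 0.7                                # cumulative alpha at step t
x_t = np.sqrt(abar) * denoised + np.sqrt(1 - abar) * rng.standard_normal((8, 8))
# Solve x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps for eps, given the edit:
eps_hat = (x_t - np.sqrt(abar) * recomposed) / np.sqrt(1 - abar)
# Re-deriving x0 from (x_t, eps_hat) recovers the edit exactly:
x0_back = (x_t - np.sqrt(1 - abar) * eps_hat) / np.sqrt(abar)
print(np.allclose(x0_back, recomposed))   # True: zero-error round trip
```

The round trip is exact by construction, which is the heart of the argument: no VAE-style reconstruction noise is re-injected into regions RPA did not modify.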
Loss & Training¶
Completely training-free — all models are frozen, and no loss functions or optimization are required. This is one of the paper's core advantages.
Key Experimental Results¶
Main Results¶
Comparison with zero-shot baselines on a test set of 30 samples:
| Metric | IC-Light | TokenFlow | DAS | AnyPortal |
|---|---|---|---|---|
| Fram-Acc ↑ | 0.983 | 0.541 | 0.937 | 0.973 |
| Tem-Con ↑ | 0.945 | 0.981 | 0.986 | 0.993 |
| ID-Psrv ↓ | 0.578 | 0.632 | 0.364 | 0.313 |
| Mtn-Psrv ↑ | 0.844 | 0.985 | 0.878 | 0.987 |
| User-Pmt | 1.11% | 1.11% | 29.72% | 68.06% |
| User-Tem | 0.56% | 5.56% | 28.61% | 65.28% |
Ablation Study¶
| Configuration | Fram-Acc ↑ | Tem-Con ↑ | ID-Psrv ↓ | Mtn-Psrv ↑ |
|---|---|---|---|---|
| Full model | 0.973 | 0.993 | 0.313 | 0.987 |
| w/o \(\delta_p\) | 0.966 | 0.989 | 0.329 | 0.987 |
| w/o Cst-Enh | 0.970 | 0.961 | 0.353 | 0.973 |
| w/o RPA | 0.970 | 0.987 | 0.371 | 0.984 |
Key Findings¶
- The consistency enhancement stage contributes most: removing it drops Tem-Con from 0.993 to 0.961, demonstrating the critical role of the video diffusion model's temporal prior.
- RPA is essential for foreground preservation: removing it raises ID-Psrv from 0.313 to 0.371, and visually causes background blurring.
- AnyPortal achieves a dominant preference rate of over 65% on both user study metrics (68.06% User-Pmt, 65.28% User-Tem).
- On a single RTX 4090 GPU, inference takes approximately 12 minutes per video (49 frames, 480×720).
Highlights & Insights¶
- The zero-error projection in RPA is the most elegant design: by computing \(\hat{\epsilon} = (x_0^t - \mu)/\sigma\) as a deterministic sampling direction, it avoids the accumulation of VAE encode–decode errors. This idea is transferable to any scenario requiring pixel-level operations in a 3D latent space.
- The modular design allows any component to be replaced with a newer pre-trained model at any time, making the framework naturally compatible with future advances in AIGC.
- The two-step IC-Light harmonization cleverly leverages \(\delta_I\) for spatial consistency and \(\delta_p\) for enhanced lighting effects, with cross-frame attention providing additional stylistic consistency.
Limitations & Future Work¶
- Low-quality or low-resolution inputs lead to poor high-frequency detail transfer (blurring in hair regions).
- Unclear foreground–background boundaries cause blurring artifacts around the subject.
- Video diffusion models still produce artifacts in fast-motion scenes.
- Inference time of approximately 12 minutes per video is far from real-time.
- Fixed resolution and length (480×720, 49 frames) due to CogVideoX constraints.
Related Work & Insights¶
- vs. IC-Light (per-frame): Severe inter-frame inconsistency and foreground color alteration; AnyPortal substantially improves both via the video model and RPA.
- vs. TokenFlow: Limited editing capability and insufficient foreground control; AnyPortal outperforms it comprehensively across all metrics.
- vs. DAS: Unable to preserve foreground motion dynamics; AnyPortal's RPA provides pixel-level guarantees.
- vs. RelightVid: Requires fine-tuning AnimateDiff; AnyPortal is entirely training-free and benefits from the stronger CogVideoX backbone.
Rating¶
- Novelty: ⭐⭐⭐⭐ The zero-error projection in RPA is a novel contribution, though the overall framework is a combination of existing modules.
- Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative evaluation, user study, and complete ablation are provided, but the test set of only 30 samples is relatively small.
- Writing Quality: ⭐⭐⭐⭐⭐ Logic is clear, figures are intuitive, and the three-stage structure is presented in a well-motivated, progressive manner.
- Value: ⭐⭐⭐⭐ The first training-free video background replacement framework; highly practical, with a forward-looking modular design.