Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-View Synthesis from Monocular Videos¶
Conference: NeurIPS 2025 arXiv: 2507.12646 Code: https://cog-nvs.github.io/ Area: 3D Vision Keywords: dynamic novel-view synthesis, monocular video, video inpainting, test-time finetuning, diffusion models
TL;DR¶
This paper proposes CogNVS, which decomposes dynamic scene novel-view synthesis into a three-stage pipeline — 3D reconstruction (recovering visible pixels) → video diffusion inpainting (generating occluded regions) → test-time finetuning (adapting to the target video distribution) — training the inpainting model with purely 2D video self-supervision to achieve zero-shot generalization to new test videos.
Background & Motivation¶
Background: Dynamic novel-view synthesis from monocular video is an extremely challenging problem. Two main directions exist: (i) test-time optimization of 4D representations (e.g., 4DGS, Shape-of-Motion) — geometrically accurate but requiring hours of computation and failing when novel viewpoints deviate significantly from training views; (ii) large-scale feed-forward video models (GCD, TrajectoryCrafter) — fast but lacking 3D consistency.
Limitations of Prior Work: - 4D optimization methods cannot handle occluded regions — they can only synthesize "co-visible" pixels - Data-driven methods suffer from over-hallucination — objects appear or disappear abruptly - Multi-view training data for dynamic scenes is scarce — the best available data source is monocular 2D video
Key Challenge: 3D reconstruction yields geometrically precise but incomplete renderings; diffusion-based generation yields complete but geometrically inconsistent results.
Key Insight: Disentanglement — co-visible pixels are rendered via 3D reconstruction (accurate), while occluded pixels are generated via video diffusion inpainting (creative).
Core Idea: CogNVS is a conditional video inpainting diffusion model that generates only occluded regions while preserving known regions. During training, 2D video self-supervision is used (randomly sampled camera trajectories construct co-visibility masks as training pairs); at test time, test-time finetuning (TTF) adapts the model to the target video.
Method¶
Overall Architecture¶
A three-stage pipeline: (1) Reconstruct: MegaSAM is used to obtain a dynamic 3D reconstruction \(\mathcal{G}_{src}\) and camera odometry \(\mathbf{c}_{src}\) from the monocular video. The scene is rendered to novel viewpoints to produce partially visible pixels \(\mathbf{V}_{nvs}^{cov}\). (2) Inpaint: CogNVS (built on CogVideoX-5B) conditions on the partially visible rendering to generate the complete novel-view video \(\mathbf{V}_{nvs}\) — inpainting occluded regions while also allowing updates to visible-region appearance (e.g., view-dependent lighting). (3) Test-Time Finetune: CogNVS is finetuned for 200–400 steps using AdamW on self-supervised pairs constructed from the target video, reducing the train-test domain gap.
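The data flow of the three stages can be sketched in a few lines. This is a toy stand-in, not the paper's implementation: `reconstruct_and_render` fakes MegaSAM reconstruction plus novel-view rendering with a one-parameter "pose" that occludes a band of columns, and `inpaint` fakes the diffusion model with a mean-fill. The point is the contract — visible pixels pass through untouched, only occluded pixels are synthesized.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct_and_render(src_video, novel_pose):
    """Stage 1 sketch (stand-in for MegaSAM + rendering): return the
    partially visible novel-view video and its co-visibility mask
    (True where a source pixel remains visible from the novel view)."""
    T, H, W, C = src_video.shape
    shift = int(novel_pose * W)          # toy: pose shift hides `shift` columns
    mask = np.zeros((T, H, W), dtype=bool)
    mask[:, :, shift:] = True
    partial = np.where(mask[..., None], src_video, 0.0)
    return partial, mask

def inpaint(partial, mask):
    """Stage 2 sketch (stand-in for CogNVS): fill occluded pixels with the
    mean of the visible ones while preserving visible pixels exactly."""
    fill = partial[mask].mean() if mask.any() else 0.0
    out = partial.copy()
    out[~mask] = fill
    return out

def cognvs_pipeline(src_video, novel_pose):
    partial, mask = reconstruct_and_render(src_video, novel_pose)
    return inpaint(partial, mask)

video = rng.random((4, 8, 8, 3))          # (frames, H, W, RGB)
nvs = cognvs_pipeline(video, novel_pose=0.25)
assert nvs.shape == video.shape
assert np.allclose(nvs[:, :, 2:], video[:, :, 2:])   # co-visible pixels preserved
```

Stage 3 (test-time finetuning) would wrap this pipeline in a short optimization loop over pairs built from the test video itself; it is omitted here to keep the data-flow sketch minimal.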
Key Designs¶
- Self-Supervised Data Construction:
- Function: Constructs inpainting training pairs from arbitrary 2D monocular videos without requiring multi-view ground truth.
- Mechanism: Given a source video → reconstruct → randomly sample \(N\) novel camera trajectories → find co-visible 3D points \(\mathcal{G}_{src,n}^{cov}\) between source and novel viewpoints → render to the source viewpoint to obtain a "partially visible source video" \(\mathbf{V}_{src,n}^{cov}\). Training pair = (partially visible source video, complete source video).
- Design Motivation: 3D-aware masks (as opposed to random 2D masks) better simulate real 3D visibility, as occlusion patterns are directly tied to viewpoint changes.
- CogNVS Architecture:
- Built on CogVideoX-5B (Transformer-based video diffusion model) with self-attention and 3D-RoPE.
- Originally an image-to-video model, adapted to video-to-video — the conditioning video and target video are shape-aligned, eliminating the need for padding.
- Conditioning input: VAE-encoded partially visible novel-view rendering \(\mathbf{z}_{cond}\).
- Target output: complete novel-view video.
    - Training objective: the standard denoising score-matching (noise-prediction) loss \(\|\epsilon_\theta(\mathbf{z}_k, k, \mathbf{z}_{cond}) - \epsilon\|_2^2\).
- Test-Time Finetuning (TTF):
- Function: Reduces the domain gap between pretraining data and the target test video.
- Mechanism: Self-supervised pairs are constructed from the target test video itself (using the same reconstruction + random trajectory approach), and CogNVS is finetuned for 200–400 steps with AdamW.
- Design Motivation: Lighting, appearance, and motion patterns vary substantially across videos — the general priors of the pretrained model are insufficient. TTF allows the model to internalize the characteristics of the target video while retaining general inpainting capability.
- Key Findings: TTF is the critical component separating "competitive" from "state-of-the-art" performance — without TTF, performance is on par with competing methods; with TTF, it surpasses all prior approaches.
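The self-supervised pair construction above can be illustrated with a toy sketch. All names and the `covisibility_mask` geometry are hypothetical stand-ins for the paper's reconstruct-then-reproject procedure; the invariant shown is the real one — each pair consists of a masked copy of the source video (input) and the untouched source video (target), with one pair per sampled novel trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def covisibility_mask(T, H, W, pose):
    """Toy stand-in for 'project reconstructed points into the novel view
    and keep those still visible': a pose offset occludes a column band."""
    shift = int(abs(pose) * W)
    mask = np.ones((T, H, W), dtype=bool)
    if shift > 0:
        mask[:, :, :shift] = False
    return mask

def make_training_pairs(src_video, n_trajectories=4):
    """Build (partially visible source video, complete source video) pairs
    from a single 2D video — no multi-view ground truth required."""
    T, H, W, C = src_video.shape
    pairs = []
    for _ in range(n_trajectories):
        pose = rng.uniform(-0.5, 0.5)        # random novel camera trajectory
        mask = covisibility_mask(T, H, W, pose)
        partial = np.where(mask[..., None], src_video, 0.0)
        pairs.append((partial, src_video))   # (model input, training target)
    return pairs

video = rng.random((4, 8, 8, 3))
pairs = make_training_pairs(video)
assert len(pairs) == 4
```

Because the mask comes from a 3D visibility test rather than random 2D scribbles, the occlusion pattern of each pair mimics what the model will actually see at test time.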
Loss & Training¶
- Pretraining: 10K 2D videos (SA-V, TAO, YouTube-VOS, DAVIS); full fine-tuning of all 42 Transformer layers; 12K steps; 3 days × 8 A6000 GPUs.
- TTF: 200–400 AdamW steps per test video, learning rate 2e-5.
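The pretraining objective referenced above is the usual noise-prediction loss: corrupt the clean latent, then regress the injected noise conditioned on the timestep and the partially visible rendering. Below is a minimal numpy sketch with a hypothetical `eps_model` signature and an illustrative `alpha_bar` schedule; it is not tied to CogVideoX's actual scheduler.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(eps_model, z0, z_cond, k, alpha_bar):
    """Noise-prediction loss: form z_k from the clean latent z0, then
    score ||eps_theta(z_k, k, z_cond) - eps||^2."""
    eps = rng.standard_normal(z0.shape)
    z_k = np.sqrt(alpha_bar[k]) * z0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    pred = eps_model(z_k, k, z_cond)
    return np.mean((pred - eps) ** 2)

# A trivial 'model' that predicts zero noise; the loss then reduces to
# E[eps^2], roughly 1 for standard-normal noise.
zero_model = lambda z_k, k, z_cond: np.zeros_like(z_k)
alpha_bar = np.linspace(0.999, 0.01, 1000)   # illustrative schedule
z0 = rng.standard_normal((2, 4, 4))          # toy VAE latent
loss = diffusion_loss(zero_model, z0, z_cond=z0, k=500, alpha_bar=alpha_bar)
assert loss > 0.0
```

In CogNVS the conditioning latent `z_cond` is the VAE encoding of the partially visible novel-view rendering, so the same loss drives both pretraining (on the 10K-video corpus) and TTF (on pairs from the test video).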
Key Experimental Results¶
Main Results — Kubric-4D (Synthetic Dynamic Scenes, Zero-Shot)¶
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | Training Data |
|---|---|---|---|---|---|
| GCD | 20.1 | 0.734 | 0.186 | 85.3 | Kubric (in-domain) |
| TrajectoryCrafter | 18.5 | 0.701 | 0.228 | 102.4 | Large-scale |
| Gen3C | 19.3 | 0.715 | 0.210 | 93.5 | Large-scale |
| CogNVS (zero-shot+TTF) | 20.8 | 0.745 | 0.172 | 78.6 | 2D video (OOD) |
Main Results — DyCheck (Real Dynamic Scenes)¶
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | Type |
|---|---|---|---|---|
| Shape-of-Motion | 21.2 | 0.665 | 0.341 | Test-time optimization (hours) |
| MoSca | 20.8 | 0.651 | 0.356 | Test-time optimization |
| CogNVS (TTF) | 21.9 | 0.683 | 0.312 | Feed-forward + TTF (minutes) |
Ablation Study — Impact of TTF¶
| Configuration | Kubric PSNR↑ | DyCheck PSNR↑ | Notes |
|---|---|---|---|
| CogNVS (w/o TTF) | 19.2 | 20.4 | Pretrained model only |
| CogNVS (w/ TTF) | 20.8 | 21.9 | +1.6/+1.5 PSNR |
| Random 2D mask training (w/o TTF) | 18.1 | 19.2 | 3D mask > 2D mask |
Key Findings¶
- TTF is the single most critical component — it contributes 1.5+ PSNR improvement, lifting performance from "competitive" to "surpassing all prior methods."
- Zero-shot performance exceeds in-domain-trained GCD — CogNVS has never seen Kubric data yet outperforms GCD, demonstrating that the 2D video self-supervision + TTF paradigm generalizes more effectively than dataset-specific training.
- 3D-aware masks outperform random 2D masks by 1+ PSNR — occlusion patterns must be aligned with 3D visibility.
- CogNVS produces sharp dynamic objects (whereas other methods are blurry in dynamic regions) — because the inpainting model specializes in occluded regions and does not need to fit dynamics from limited viewpoints as 4D representations do.
Highlights & Insights¶
- The "reconstruct + inpaint + finetune" three-stage disentanglement is conceptually clean and each stage can be improved independently — best SLAM for reconstruction, best diffusion model for inpainting, standard TTF for adaptation.
- Self-supervised training data construction is the core contribution — it transforms "NVS training requiring multi-view GT" into "training with arbitrary 2D video," unlocking vast quantities of internet video data.
- TTF represents the best of both worlds between test-time optimization and feed-forward inference — retaining the robustness of data-driven methods (from large-scale pretraining) while achieving optimization-level accuracy (from test-time adaptation).
- The method is the first to achieve "feed-forward speed + optimization-level accuracy" for dynamic scene NVS.
Limitations & Future Work¶
- Inference still requires ~5 minutes per video due to the size of CogVideoX — smaller, faster diffusion models could accelerate this.
- The method depends on MegaSAM reconstruction quality — failures in reconstruction propagate incorrect visibility masks.
- Real-scene evaluation (DyCheck) covers only 5 videos — validation on larger real-world benchmarks remains to be conducted.
- When the novel viewpoint diverges substantially from the source (e.g., 180° rotation), the reconstructed co-visible region becomes too sparse, effectively degrading to near-pure inpainting.
Related Work & Insights¶
- vs. GCD (ECCV'24): GCD is trained on Kubric and is domain-specific; CogNVS trains on 2D videos and generalizes in a zero-shot manner.
- vs. TrajectoryCrafter (concurrent): TrajectoryCrafter over-hallucinates occluded regions (without preserving geometry); CogNVS's reconstruction prior ensures accuracy in co-visible regions.
- vs. CAT4D (concurrent): CAT4D employs score distillation and has limited generalization; CogNVS achieves stronger generalization via TTF.
- The "inpaint rather than reconstruct" paradigm can be extended to other 3D tasks such as 3D completion and scene editing.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Reframes dynamic NVS as a video inpainting problem; the combination of self-supervised training and TTF is well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Zero-shot evaluation on 3 datasets with detailed ablations (TTF / mask type / training data / step count).
- Writing Quality: ⭐⭐⭐⭐⭐ Stage-by-stage presentation is clear; the explanation of self-supervised data construction is intuitive.
- Value: ⭐⭐⭐⭐⭐ Advances the state of the art in monocular dynamic NVS, with methodological insights applicable to broader 3D vision tasks.