
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

Conference: CVPR 2026 · arXiv: 2603.26481 · Code: https://inspatio.github.io/sparse-cam4d/ · Area: 4D Reconstruction · Keywords: sparse-camera 4D reconstruction, spatio-temporal distortion field, 4D Gaussian splatting, video diffusion models, dynamic scenes

TL;DR

This paper proposes SparseCam4D, the first method to achieve sparse-camera (2–3 views) 4D reconstruction on standard multi-camera dynamic scene benchmarks. The core innovation is the Spatio-Temporal Distortion Field (STDF), which explicitly models spatio-temporal inconsistencies in generative observations and decouples them from the underlying 4D Gaussian representation, enabling high-fidelity, spatio-temporally consistent rendering of dynamic scenes.

Background & Motivation

Background: High-quality 4D reconstruction relies on dense camera arrays (typically 18–21 synchronized cameras) to achieve photorealistic rendering. However, such costly laboratory-grade equipment severely limits practical applicability.

Limitations of Prior Work: Sparse-view 4D reconstruction faces two major challenges: (1) Geometry regularization methods (e.g., MonoFusion, which uses monocular depth and 3D tracking) provide only structural constraints without guaranteeing appearance quality — rendering quality degrades rapidly under viewpoint shifts; (2) Camera-controlled video diffusion models can generate high-quality multi-view data as auxiliary observations, but the generated frames suffer from severe spatio-temporal inconsistencies — spatial inconsistencies manifest as appearance/geometry discrepancies across views at the same timestep, while temporal inconsistencies manifest as flickering and motion instability within a single view across time. Using such data directly for reconstruction introduces significant blurring and artifacts.

Key Challenge: Although observations generated by diffusion models appear photorealistic, they exhibit systematic deviations from the true scene being reconstructed — deviations that span both spatial and temporal dimensions and cannot be simply ignored or handled independently.

Goal: To leverage rich but inconsistent generative observations to assist sparse-camera 4D reconstruction, extracting useful information while disentangling inconsistencies.

Key Insight: Explicitly model inconsistencies as a learnable spatio-temporal distortion field used during training and discarded at inference — with zero additional computational overhead.

Core Idea: A spatio-temporal distortion field decomposed via Ennea-planes jointly models spatial and temporal inconsistencies in generative observations, enabling 4D Gaussian splatting to learn correct scene representations from inconsistent diffusion-generated data.

Method

Overall Architecture

Input: \(N\) sparse-camera videos (\(N = 2\)–\(3\), uncalibrated). Pipeline: (1) a video diffusion model generates auxiliary observations from novel viewpoints; (2) COLMAP provides coarse pose initialization; (3) a 4D Gaussian splatting scene representation and the STDF are constructed and jointly optimized with pose, rendering, and regularization objectives. Real views are rendered via standard 4D Gaussians; generated views are rendered via STDF-warped 4D Gaussians. After training, the STDF is discarded and only the canonical 4D Gaussians are retained.

Key Designs

  1. Spatio-Temporal Distortion Field (STDF):

    • Function: Models per-viewpoint, per-timestep inconsistencies for each generated view, producing distortion values for 4D Gaussian attributes.
    • Mechanism: Decomposes the 5D volume \((x, y, z, t, s)\) (spatial coordinates + time index + pose index) into 9 two-dimensional feature planes (excluding the semantically trivial \(t\)-\(s\) plane), termed Ennea-planes. Features are obtained by projecting a given coordinate onto each plane, applying bilinear interpolation, performing element-wise multiplication across planes, concatenating multi-resolution features, and decoding via a multi-head MLP into distortion offsets \(\Delta\mu, \Delta q_l, \Delta q_r, \Delta s\) for position, rotation, and scale (a PyTorch sketch follows this list). The distorted Gaussians \(\mathcal{G}'_{4D} = \mathcal{G}_{4D} + \Delta\mathcal{G}_{4D}\) are used to render generated views, while the original Gaussians render real views.
    • Design Motivation: K-planes decomposition factorizes high-dimensional problems into products of low-dimensional planes, yielding a compact yet expressive representation. The separate introduction of the \(s\) (pose index) and \(t\) (time index) dimensions enables joint modeling of spatial and temporal inconsistencies — ablations confirm that removing either dimension significantly degrades performance.
  2. Joint Pose Optimization:

    • Function: Corrects inaccurate camera extrinsics under sparse input.
    • Mechanism: Camera extrinsics (translation \(T\) and rotation quaternion \(q\)) are treated as learnable variables and optimized jointly with 4D Gaussian attributes, with a pose regularization loss \(\mathcal{L}_\text{pose} = \lambda_p(\|T - \hat{T}\| + \|q - \hat{q}\|)\) to prevent excessive deviation from COLMAP initialization. The first 3,000 iterations serve as a warm-up phase (standard training); pose and STDF are jointly optimized thereafter, with pose optimization stopping at 7,000 iterations.
    • Design Motivation: Spatio-temporal inconsistencies in generated frames can severely compromise COLMAP pose estimation accuracy; joint optimization progressively corrects this during training.
  3. Progressive Regularization Strategy:

    • Function: Ensures smoothness of STDF and training stability.
    • Mechanism: Generated views are supervised with a perceptual loss \(\mathcal{L}_\text{lpips}\) in place of pure pixel loss to capture texture and structural similarity. Total variation regularization \(\mathcal{L}_{TV}\) is applied to spatial planes. A second-order smoothness regularization is applied along the pose axis: \(\mathcal{L}_\text{smooth} = \sum_{i,s}\|(P^{i,s-1} - P^{i,s}) - (P^{i,s} - P^{i,s+1})\|_2^2\), acting only on the \(xs, ys, zs\) planes.
    • Design Motivation: Diffusion-generated distortions vary continuously along the pose axis but may exhibit abrupt changes along the time axis; pose-axis smoothness regularization encodes this prior. Perceptual loss is more robust to local inconsistencies in generated images.
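
To make the Ennea-plane query concrete, below is a minimal PyTorch sketch. Only the structure comes from the description above: nine planes covering every 2D subspace of \((x, y, z, t, s)\) except \(t\)-\(s\), bilinear sampling, element-wise fusion across planes, multi-resolution concatenation, and a multi-head decoder. The class name, feature dimension, resolutions, MLP widths, and output dimensionalities are illustrative assumptions, not the authors' implementation.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnneaPlanes(nn.Module):
    """Sketch of the Ennea-plane distortion field: all 2-subsets of
    (x, y, z, t, s) except (t, s), at several resolutions."""

    def __init__(self, feat_dim=32, resolutions=(64, 128)):
        super().__init__()
        # 10 candidate planes minus the semantically trivial t-s plane = 9
        self.pairs = [p for p in itertools.combinations(range(5), 2)
                      if p != (3, 4)]
        self.resolutions = resolutions
        self.planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, feat_dim, res, res))
             for res in resolutions for _ in self.pairs])
        # multi-head MLP decoder: one head per distorted Gaussian attribute
        in_dim = feat_dim * len(resolutions)
        def head(out_dim):
            return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))
        self.heads = nn.ModuleDict({
            'd_mu': head(3),  # Δμ: position offset
            'd_ql': head(4),  # Δq_l: left rotor quaternion offset
            'd_qr': head(4),  # Δq_r: right rotor quaternion offset
            'd_s':  head(3),  # Δs: scale offset (dim depends on the 4DGS parameterization)
        })

    def forward(self, coords):
        """coords: (N, 5) = (x, y, z, t, s), each normalized to [-1, 1]."""
        idx, feats = 0, []
        for _ in self.resolutions:
            fused = None
            for (i, j) in self.pairs:
                # project onto the (i, j) plane and sample bilinearly
                grid = coords[:, [i, j]].view(1, -1, 1, 2)    # (1, N, 1, 2)
                f = F.grid_sample(self.planes[idx], grid,
                                  align_corners=True)         # (1, C, N, 1)
                f = f.squeeze(0).squeeze(-1).T                # (N, C)
                fused = f if fused is None else fused * f     # element-wise product
                idx += 1
            feats.append(fused)
        h = torch.cat(feats, dim=-1)                          # multi-resolution concat
        return {name: mlp(h) for name, mlp in self.heads.items()}
```

Calling `field = EnneaPlanes()` and `deltas = field(coords)` with `coords` of shape `(N, 5)` yields the per-Gaussian attribute offsets; these are added to the canonical Gaussians only when rendering generated views and are never used at inference.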

Loss & Training

Total loss: \(\mathcal{L} = \mathcal{L}_\text{input} + \mathcal{L}_\text{gen} + \mathcal{L}_\text{pose} + \mathcal{L}_{TV} + \mathcal{L}_\text{smooth}\)

  • Input views: \((1-\lambda)\mathcal{L}_1 + \lambda\mathcal{L}_\text{D-SSIM}\), \(\lambda = 0.2\)
  • Generated views: \(\lambda_1\mathcal{L}_1 + \lambda_2\mathcal{L}_\text{lpips}\), \(\lambda_1 = 0.02\), \(\lambda_2 = 0.2\)
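
A minimal sketch of the two photometric objectives, assuming images of shape \((B, 3, H, W)\) in \([0, 1]\), the common 3DGS convention \(\mathcal{L}_\text{D-SSIM} = (1 - \text{SSIM})/2\), and the off-the-shelf `lpips` and `pytorch_msssim` packages (the VGG backbone is an assumption):

```python
import torch.nn.functional as F
import lpips                      # pip install lpips
from pytorch_msssim import ssim   # pip install pytorch-msssim

lpips_vgg = lpips.LPIPS(net='vgg')  # backbone choice is an assumption

def input_view_loss(pred, gt, lam=0.2):
    """Real input views: (1 - λ)·L1 + λ·D-SSIM."""
    d_ssim = (1.0 - ssim(pred, gt, data_range=1.0)) / 2.0
    return (1 - lam) * F.l1_loss(pred, gt) + lam * d_ssim

def gen_view_loss(pred, gt, lam1=0.02, lam2=0.2):
    """Generated views: λ1·L1 + λ2·LPIPS (LPIPS expects inputs in [-1, 1])."""
    perceptual = lpips_vgg(pred * 2 - 1, gt * 2 - 1).mean()
    return lam1 * F.l1_loss(pred, gt) + lam2 * perceptual
```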

Each scene is trained for 30,000 iterations, sampling one real-view image and one generated-view image per iteration. ViewCrafter generates auxiliary videos (25 frames per sequence). Training runs on a single A800 GPU.
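
The three regularizers are small enough to sketch self-contained. The version below assumes plane tensors of shape \((C, H, W)\), the pose index \(s\) along the last axis of the \(xs, ys, zs\) planes, and a hypothetical weight `lam_p`:

```python
import torch

def pose_reg(T, q, T_init, q_init, lam_p=1.0):
    """L_pose: keep optimized extrinsics near the COLMAP initialization.
    T: (M, 3) translations, q: (M, 4) quaternions; lam_p is a hypothetical weight."""
    return lam_p * ((T - T_init).norm(dim=-1) + (q - q_init).norm(dim=-1)).sum()

def tv_reg(spatial_planes):
    """L_TV: total variation on the spatial (xy, xz, yz) planes, each (C, H, W)."""
    loss = 0.0
    for P in spatial_planes:
        loss = loss + (P[:, 1:, :] - P[:, :-1, :]).square().mean()
        loss = loss + (P[:, :, 1:] - P[:, :, :-1]).square().mean()
    return loss

def pose_smooth_reg(pose_planes):
    """L_smooth: second-order smoothness along the pose axis s, applied only to
    the xs/ys/zs planes. The summand (P^{s-1} - P^s) - (P^s - P^{s+1}) equals
    a discrete Laplacian P^{s-1} - 2 P^s + P^{s+1} in s."""
    loss = 0.0
    for P in pose_planes:
        lap = P[:, :, :-2] - 2.0 * P[:, :, 1:-1] + P[:, :, 2:]
        loss = loss + lap.square().sum()
    return loss
```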

Key Experimental Results

Main Results

Quantitative comparisons on three standard 4D benchmarks (2–3 sparse camera inputs):

Method        Technicolor PSNR↑   Neural 3D PSNR↑   Nvidia Dynamic PSNR↑
4DGaussians   16.20               17.40             16.81
4D-Rotor      14.85               18.20             19.38
MonoFusion*   17.97               18.43             20.22
Ours          23.15               21.91             24.81

SparseCam4D also leads on LPIPS↓ across benchmarks: 0.299 vs. MonoFusion's 0.352 on Technicolor, and 0.150 vs. 0.192 on Nvidia Dynamic.

Ablation Study

STDF ablation (LPIPS↓ on the Train and Jumping scenes):

Setting                Train LPIPS↓   Jumping LPIPS↓
w/o distortion field   0.608          0.319
w/o time axis          0.458          0.279
w/o pose axis          0.469          0.268
Full STDF              0.264          0.170

Removing pose optimization degrades LPIPS from 0.264→0.336 (Train) and 0.170→0.217 (Jumping).

Key Findings

  • Removing STDF and directly using generated images causes severe blurring: Spatio-temporal slice visualizations clearly reveal temporal jitter induced by inconsistencies.
  • Both the pose and time axes are indispensable: Removing either axis significantly degrades performance, confirming that generative inconsistencies span both the spatial and temporal dimensions.
  • STDF visualizations are semantically meaningful: High-distortion regions (e.g., faces, wine bottles) correspond to areas where diffusion model generation exhibits the largest deformations.
  • Generalizes across diffusion models: Significant improvements are achieved under both ViewCrafter and ReCamMaster (+2.51 dB and +1.76 dB, respectively).
  • This represents the first evaluation of sparse-camera 4D reconstruction across all viewpoints on standard multi-camera dynamic scene benchmarks.

Highlights & Insights

  • Elegant "use during training, discard at inference" design: STDF adapts to generated observation inconsistencies only during training, incurring zero overhead at inference.
  • Ennea-plane decomposition: Extends K-planes from 4D to 5D by introducing the pose index dimension, yielding a compact and effective representation.
  • Paradigm shift from "combating inconsistency" to "explicitly modeling inconsistency": Rather than attempting to produce consistent diffusion outputs, the method acknowledges and explicitly models inconsistencies.
  • STDF visualizations offer interesting insights into how diffusion models "perceive the physical world" — different regions exhibit varying degrees of distortion.

Limitations & Future Work

  • Performance depends on the generation quality of the specific video diffusion model; insufficiently high generation quality may fail to provide useful auxiliary information.
  • Per-scene training is required, and 30k iterations still entail non-trivial computational cost.
  • Pose optimization requires COLMAP initialization, which may fail under extreme sparsity (e.g., a single camera).
  • Dynamic topological changes (e.g., object appearance/disappearance) are not addressed.
  • Future work may explore extending the STDF paradigm to other reconstruction tasks that leverage generative models as auxiliary supervision.

Related Work

  • MonoFusion / Shape-of-Motion: Representative methods of the geometry regularization paradigm; this paper demonstrates that geometric priors alone are insufficient.
  • ViewCrafter / ReCamMaster: Camera-controlled video diffusion models that provide auxiliary observations but introduce inconsistencies.
  • K-planes: The foundation of factorized scene representations; STDF extends its design space.
  • The "explicit inconsistency modeling" philosophy underlying STDF could plausibly extend to other 3D/4D reconstruction tasks that leverage generative models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First framework to jointly model spatio-temporal inconsistencies in generative observations; the Ennea-plane design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three standard benchmarks, comprehensive ablations, cross-diffusion-model validation, and visualization analysis — very rigorous.
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear; visualizations of spatio-temporal inconsistencies are highly persuasive.
  • Value: ⭐⭐⭐⭐⭐ Reduces the camera requirement for 4D reconstruction from 20+ cameras to just 2–3, with enormous practical application potential.