SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras¶
Conference: CVPR 2026 arXiv: 2603.26481 Code: https://inspatio.github.io/sparse-cam4d/ Area: 4D Reconstruction Keywords: sparse-camera 4D reconstruction, spatio-temporal distortion field, 4D Gaussian splatting, video diffusion models, dynamic scenes
TL;DR¶
This paper proposes SparseCam4D, the first method to achieve sparse-camera (2–3 views) 4D reconstruction on standard multi-camera dynamic scene benchmarks. The core innovation is the Spatio-Temporal Distortion Field (STDF), which explicitly models spatio-temporal inconsistencies in generative observations and decouples them from the underlying 4D Gaussian representation, enabling high-fidelity, spatio-temporally consistent rendering of dynamic scenes.
Background & Motivation¶
Background: High-quality 4D reconstruction relies on dense camera arrays (typically 18–21 synchronized cameras) to achieve photorealistic rendering. However, such costly laboratory-grade equipment severely limits practical applicability.
Limitations of Prior Work: Sparse-view 4D reconstruction faces two major challenges: (1) Geometry regularization methods (e.g., MonoFusion, which uses monocular depth and 3D tracking) provide only structural constraints without guaranteeing appearance quality — rendering quality degrades rapidly under viewpoint shifts; (2) Camera-controlled video diffusion models can generate high-quality multi-view data as auxiliary observations, but the generated frames suffer from severe spatio-temporal inconsistencies — spatial inconsistencies manifest as appearance/geometry discrepancies across views at the same timestep, while temporal inconsistencies manifest as flickering and motion instability within a single view across time. Using such data directly for reconstruction introduces significant blurring and artifacts.
Key Challenge: Although observations generated by diffusion models appear photorealistic, they exhibit systematic deviations from the true scene being reconstructed — deviations that span both spatial and temporal dimensions and cannot be simply ignored or handled independently.
Goal: To leverage rich but inconsistent generative observations to assist sparse-camera 4D reconstruction, extracting useful information while disentangling inconsistencies.
Key Insight: Explicitly model inconsistencies as a learnable spatio-temporal distortion field used during training and discarded at inference — with zero additional computational overhead.
Core Idea: A spatio-temporal distortion field decomposed via Ennea-planes jointly models spatial and temporal inconsistencies in generative observations, enabling 4D Gaussian splatting to learn correct scene representations from inconsistent diffusion-generated data.
Method¶
Overall Architecture¶
Input: \(N\) sparse-camera videos (\(N = 2\)–\(3\), uncalibrated). Pipeline: (1) A video diffusion model generates auxiliary observations from novel viewpoints; (2) COLMAP provides coarse pose initialization; (3) A 4D Gaussian splatting scene representation and STDF are constructed and jointly optimized with pose, rendering, and regularization objectives. Real views are rendered via standard 4D Gaussians; generated views are rendered via STDF-warped 4D Gaussians. After training, STDF is discarded and only the canonical 4D Gaussians are retained.
Key Designs¶
- Spatio-Temporal Distortion Field (STDF):
- Function: Models per-viewpoint, per-timestep inconsistencies for each generated view, producing distortion values for 4D Gaussian attributes.
- Mechanism: Decomposes the 5D volume \((x, y, z, t, s)\) (spatial coordinates + time index + pose index) into 9 two-dimensional feature planes (excluding the semantically trivial \(t\)-\(s\) plane), termed Ennea-planes. Features are obtained by projecting a given coordinate onto each plane, applying bilinear interpolation, performing element-wise multiplication across planes, concatenating multi-resolution features, and decoding via a multi-head MLP into distortion offsets \(\Delta\mu, \Delta q_l, \Delta q_r, \Delta s\) for position, rotation, and scale. The distorted Gaussians \(\mathcal{G}'_{4D} = \mathcal{G}_{4D} + \Delta\mathcal{G}_{4D}\) are used to render generated views, while the original Gaussians render real views.
- Design Motivation: K-planes decomposition factorizes high-dimensional problems into products of low-dimensional planes, yielding a compact yet expressive representation. The separate introduction of the \(s\) (pose index) and \(t\) (time index) dimensions enables joint modeling of spatial and temporal inconsistencies — ablations confirm that removing either dimension significantly degrades performance.
- Joint Pose Optimization:
- Function: Corrects inaccurate camera extrinsics under sparse input.
- Mechanism: Camera extrinsics (translation \(T\) and rotation quaternion \(q\)) are treated as learnable variables and optimized jointly with 4D Gaussian attributes, with a pose regularization loss \(\mathcal{L}_\text{pose} = \lambda_p(\|T - \hat{T}\| + \|q - \hat{q}\|)\) to prevent excessive deviation from COLMAP initialization. The first 3,000 iterations serve as a warm-up phase (standard training); pose and STDF are jointly optimized thereafter, with pose optimization stopping at 7,000 iterations.
- Design Motivation: Spatio-temporal inconsistencies in generated frames can severely compromise COLMAP pose estimation accuracy; joint optimization progressively corrects this during training.
- Progressive Regularization Strategy:
- Function: Ensures smoothness of STDF and training stability.
- Mechanism: Generated views are supervised with a perceptual loss \(\mathcal{L}_\text{lpips}\) in place of pure pixel loss to capture texture and structural similarity. Total variation regularization \(\mathcal{L}_{TV}\) is applied to spatial planes. A second-order smoothness regularization \(\mathcal{L}_\text{smooth}\) is applied along the pose axis: \(\sum\|(P^{i,s-1} - P^{i,s}) - (P^{i,s} - P^{i,s+1})\|_2^2\), acting only on the \(xs, ys, zs\) planes.
- Design Motivation: Diffusion-generated distortions vary continuously along the pose axis but may exhibit abrupt changes along the time axis; pose-axis smoothness regularization encodes this prior. Perceptual loss is more robust to local inconsistencies in generated images.
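The Ennea-plane feature query described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the plane resolution, feature dimension, and random initialization are assumptions, and the multi-resolution concatenation and multi-head MLP decoder are only indicated in comments.

```python
import numpy as np

PLANE_RES = 32   # hypothetical feature-plane resolution
FEAT_DIM = 8     # hypothetical channels per plane

# The 5D coordinate (x, y, z, t, s) is factorized into 9 axis-pair planes,
# skipping the semantically trivial (t, s) pair.
AXIS_PAIRS = [(0, 1), (0, 2), (1, 2),   # spatial: xy, xz, yz
              (0, 3), (1, 3), (2, 3),   # space-time: xt, yt, zt
              (0, 4), (1, 4), (2, 4)]   # space-pose: xs, ys, zs

rng = np.random.default_rng(0)
planes = [rng.standard_normal((PLANE_RES, PLANE_RES, FEAT_DIM)) * 0.01
          for _ in AXIS_PAIRS]

def bilinear(plane, u, v):
    """Bilinear interpolation of one feature plane at (u, v) in [0, 1]."""
    gu, gv = u * (PLANE_RES - 1), v * (PLANE_RES - 1)
    u0, v0 = int(np.floor(gu)), int(np.floor(gv))
    u1, v1 = min(u0 + 1, PLANE_RES - 1), min(v0 + 1, PLANE_RES - 1)
    du, dv = gu - u0, gv - v0
    return ((1 - du) * (1 - dv) * plane[u0, v0] +
            du * (1 - dv) * plane[u1, v0] +
            (1 - du) * dv * plane[u0, v1] +
            du * dv * plane[u1, v1])

def query_stdf(coord):
    """coord = (x, y, z, t, s), each normalized to [0, 1].
    Fuses the 9 plane features by element-wise (Hadamard) product."""
    feat = np.ones(FEAT_DIM)
    for plane, (a, b) in zip(planes, AXIS_PAIRS):
        feat *= bilinear(plane, coord[a], coord[b])
    # In the paper, such features are concatenated across multiple plane
    # resolutions and decoded by a multi-head MLP into the offsets
    # (delta_mu, delta_q_l, delta_q_r, delta_s); omitted here.
    return feat

f = query_stdf((0.5, 0.2, 0.7, 0.3, 0.9))
print(f.shape)
```

A single query suffices to see the factorization at work: each axis pair contributes one interpolated feature vector, and the product couples spatial, temporal, and pose dimensions without ever storing the full 5D grid.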
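The pose-optimization schedule and regularizer can likewise be sketched in a few lines. The weight `LAMBDA_P` is a hypothetical value (not stated in this summary); only the loss form and the warm-up/stop iteration counts come from the text.

```python
import numpy as np

WARMUP_ITERS = 3_000   # standard training only; pose and STDF frozen
POSE_STOP = 7_000      # pose updates stop at this iteration
LAMBDA_P = 0.1         # hypothetical weight; actual value not given here

def pose_reg_loss(T, q, T_colmap, q_colmap, lam=LAMBDA_P):
    """L_pose = lambda_p * (||T - T_hat|| + ||q - q_hat||), anchoring the
    learnable extrinsics to the COLMAP initialization (T_hat, q_hat)."""
    return lam * (np.linalg.norm(T - T_colmap) + np.linalg.norm(q - q_colmap))

def pose_is_optimized(it):
    """Pose (jointly with STDF) is updated only after warm-up and
    before the stop point."""
    return WARMUP_ITERS <= it < POSE_STOP

# COLMAP initialization and a slightly perturbed learnable pose.
T_hat, q_hat = np.zeros(3), np.array([1.0, 0.0, 0.0, 0.0])
T, q = T_hat + 0.05, q_hat.copy()
print(round(pose_reg_loss(T, q, T_hat, q_hat), 4))
print(pose_is_optimized(2_999), pose_is_optimized(5_000))
```

The regularizer lets the optimizer correct COLMAP's errors (which are amplified by inconsistent generated frames) while preventing the pose from drifting arbitrarily far from its initialization.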
Loss & Training¶
Total loss: \(\mathcal{L} = \mathcal{L}_\text{input} + \mathcal{L}_\text{gen} + \mathcal{L}_\text{pose} + \mathcal{L}_{TV} + \mathcal{L}_\text{smooth}\)
- Input views: \((1-\lambda)\mathcal{L}_1 + \lambda\mathcal{L}_\text{D-SSIM}\), \(\lambda = 0.2\)
- Generated views: \(\lambda_1\mathcal{L}_1 + \lambda_2\mathcal{L}_\text{lpips}\), \(\lambda_1 = 0.02\), \(\lambda_2 = 0.2\)
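The per-view losses and the pose-axis smoothness term can be sketched with the weights quoted above. The D-SSIM and LPIPS terms are passed in as plain numbers here, since the actual perceptual networks are out of scope for this note; only the weights and the second-order difference form come from the text.

```python
import numpy as np

LAM = 0.2               # input-view D-SSIM weight
LAM1, LAM2 = 0.02, 0.2  # generated-view L1 / LPIPS weights

def l1(a, b):
    return float(np.abs(a - b).mean())

def input_view_loss(render, gt, dssim):
    """Real views: (1 - lambda) * L1 + lambda * L_D-SSIM."""
    return (1 - LAM) * l1(render, gt) + LAM * dssim

def gen_view_loss(render, gen, lpips):
    """Generated views: lambda_1 * L1 + lambda_2 * L_lpips; the perceptual
    term dominates, tolerating local inconsistencies in generated frames."""
    return LAM1 * l1(render, gen) + LAM2 * lpips

def second_order_smooth(P):
    """L_smooth along the pose axis s for one plane P[i, s]:
    sum || (P[i,s-1] - P[i,s]) - (P[i,s] - P[i,s+1]) ||^2,
    i.e. the squared second difference P[i,s-1] - 2*P[i,s] + P[i,s+1]."""
    d2 = P[:, :-2] - 2 * P[:, 1:-1] + P[:, 2:]
    return float((d2 ** 2).sum())

P = np.zeros((4, 5))
P[:, 2] = 1.0   # an abrupt bump along the pose axis is heavily penalized
print(round(second_order_smooth(P), 2))
```

A linear ramp along the pose axis incurs zero penalty, while the isolated bump above does not; this is exactly the prior that distortions vary continuously across nearby generated viewpoints.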
Each scene is trained for 30,000 iterations, sampling one real-view image and one generated-view image per iteration. ViewCrafter generates auxiliary videos (25 frames per sequence). Training runs on a single A800 GPU.
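The per-iteration sampling scheme is simple enough to state as a toy sketch; camera/frame counts and naming are placeholders, not the paper's dataset layout.

```python
import random

random.seed(0)
# Placeholder pools: 3 real cameras and 6 generated viewpoints, 25 frames each.
real_frames = [f"cam{c}_t{t}" for c in range(3) for t in range(25)]
gen_frames = [f"gen{s}_t{t}" for s in range(6) for t in range(25)]

TOTAL_ITERS = 30_000

def sample_batch():
    """Each iteration supervises one real-view frame (via the canonical
    4D Gaussians) and one generated-view frame (via STDF-warped Gaussians)."""
    return random.choice(real_frames), random.choice(gen_frames)

real, gen = sample_batch()
print(real.startswith("cam"), gen.startswith("gen"))
```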
Key Experimental Results¶
Main Results¶
Quantitative comparisons on three standard 4D benchmarks (2–3 sparse camera inputs):
| Method | Technicolor PSNR↑ | Neural 3D PSNR↑ | Nvidia Dynamic PSNR↑ |
|---|---|---|---|
| 4DGaussians | 16.20 | 17.40 | 16.81 |
| 4D-Rotor | 14.85 | 18.20 | 19.38 |
| MonoFusion* | 17.97 | 18.43 | 20.22 |
| Ours | 23.15 | 21.91 | 24.81 |
LPIPS↓ also improves consistently: Technicolor 0.299 vs. 0.352 for MonoFusion; Nvidia Dynamic 0.150 vs. 0.192.
Ablation Study¶
STDF ablation (Train/Jumping scenes, LPIPS↓):
| Setting | Train LPIPS | Jumping LPIPS |
|---|---|---|
| w/o distortion field | 0.608 | 0.319 |
| w/o time axis | 0.458 | 0.279 |
| w/o pose axis | 0.469 | 0.268 |
| Full STDF | 0.264 | 0.170 |
Removing pose optimization degrades LPIPS from 0.264→0.336 (Train) and 0.170→0.217 (Jumping).
Key Findings¶
- Removing STDF and directly using generated images causes severe blurring: Spatio-temporal slice visualizations clearly reveal temporal jitter induced by inconsistencies.
- Both spatial and temporal axes are indispensable: Removing either axis significantly degrades performance, confirming that generative inconsistencies span both spatio-temporal dimensions.
- STDF visualizations are semantically meaningful: High-distortion regions (e.g., faces, wine bottles) correspond to areas where diffusion model generation exhibits the largest deformations.
- Generalizes across diffusion models: Significant improvements are achieved under both ViewCrafter and ReCamMaster (+2.51 dB and +1.76 dB, respectively).
- This represents the first evaluation of sparse-camera 4D reconstruction across all viewpoints on standard multi-camera dynamic scene benchmarks.
Highlights & Insights¶
- Elegant "use during training, discard at inference" design: STDF adapts to generated observation inconsistencies only during training, incurring zero overhead at inference.
- Ennea-plane decomposition: Extends K-planes from 4D to 5D by introducing the pose index dimension, yielding a compact and effective representation.
- Paradigm shift from "combating inconsistency" to "explicitly modeling inconsistency": Rather than attempting to produce consistent diffusion outputs, the method acknowledges and explicitly models inconsistencies.
- STDF visualizations offer interesting insights into how diffusion models "perceive the physical world" — different regions exhibit varying degrees of distortion.
Limitations & Future Work¶
- Performance depends on the generation quality of the specific video diffusion model; insufficiently high generation quality may fail to provide useful auxiliary information.
- Per-scene training is required, and 30k iterations still entail non-trivial computational cost.
- Pose optimization requires COLMAP initialization, which may fail under extreme sparsity (e.g., a single camera).
- Dynamic topological changes (e.g., object appearance/disappearance) are not addressed.
- Future work may explore extending the STDF paradigm to other reconstruction tasks that leverage generative models as auxiliary supervision.
Related Work & Insights¶
- MonoFusion / Shape-of-Motion: Representative methods of the geometry regularization paradigm; this paper demonstrates that geometric priors alone are insufficient.
- ViewCrafter / ReCamMaster: Camera-controlled video diffusion models that provide auxiliary observations but introduce inconsistencies.
- K-planes: The foundation of factorized scene representations; STDF extends its design space.
- The "explicit inconsistency modeling" philosophy underlying STDF can be broadly applied to all 3D/4D reconstruction tasks that leverage generative models.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First framework to jointly model spatio-temporal inconsistencies in generative observations; the Ennea-plane design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three standard benchmarks, comprehensive ablations, cross-diffusion-model validation, and visualization analysis — very rigorous.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear; visualizations of spatio-temporal inconsistencies are highly persuasive.
- Value: ⭐⭐⭐⭐⭐ Reduces the camera requirement for 4D reconstruction from 20+ cameras to just 2–3, with enormous practical application potential.