# LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates
**Conference:** CVPR 2026 · **arXiv:** 2510.09881 · **Code:** N/A · **Area:** 3D Vision · **Keywords:** 3D Gaussian Splatting, scene updating, sparse-view, temporal reconstruction, object-level tracking
## TL;DR
This paper proposes the LTGS framework, which constructs reusable object-level Gaussian templates to efficiently update 3DGS scene reconstructions from spatiotemporally sparse observations, enabling temporal modeling of long-term environmental evolution.
## Background & Motivation
Novel view synthesis methods such as 3DGS and NeRF can reconstruct high-quality static 3D scenes from standard camera inputs; however, objects in everyday environments frequently change (e.g., furniture rearrangement, object addition or removal), causing reconstructions to become outdated rapidly. Existing strategies exhibit notable shortcomings:
- Reconstruction from scratch: Discards prior information, resulting in massive computational redundancy.
- 4D representations (e.g., 4DGS, NSC): Require continuous dense observations and can only handle smooth motion, failing to capture abrupt geometric changes.
- Continual learning methods (e.g., CL-NeRF, CL-Splats): Require a relatively large number of update images (>10), lack structural priors, and degrade under sparse inputs.
- Few-shot reconstruction (e.g., InstantSplat): Cannot preserve the initial reconstruction, leading to severe floating artifacts in free-viewpoint rendering.
The core requirement is to efficiently detect and update object-level changes in a scene using only a very small number (e.g., 3) of randomly captured update images, while preserving the quality of the initial reconstruction.
## Method
### Overall Architecture
LTGS is an integrated pipeline that, given an initial 3DGS reconstruction \(\mathcal{G}_0\) and sparse images \(\mathcal{I} = \{I^i\}_t\) across multiple timesteps, outputs a temporally ordered scene evolution \(\mathcal{S} = \{\mathcal{G}_0, \mathcal{G}_1, \ldots, \mathcal{G}_M\}\). The pipeline consists of four stages: change detection → object tracking and template construction → template association and registration → long-term Gaussian optimization.
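A structural sketch of this loop is given below, under assumed interfaces; the stage callables are hypothetical stand-ins for the four modules detailed in "Key Designs", not the authors' actual API.

```python
# Structural sketch of the four-stage update loop (hypothetical interfaces;
# the stage callables stand in for the modules detailed under "Key Designs").
from typing import Any, Callable, List

def evolve_scene(G0: Any, updates: List[List[Any]],
                 detect: Callable, match: Callable,
                 register: Callable, optimize: Callable) -> List[Any]:
    """Given an initial reconstruction G0 and a few images per timestep,
    produce the chronology S = [G0, G1, ..., GM]."""
    chronology = [G0]
    for images in updates:                       # ~3 casually captured photos per step
        masks = detect(chronology[-1], images)   # 1. change detection
        matches = match(masks, images)           # 2. object tracking + template construction
        poses = register(matches)                # 3. template association + 6DoF registration
        chronology.append(optimize(chronology[-1], poses, images))  # 4. long-term optimization
    return chronology
```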
### Key Designs
- Change Detection Module: Scene changes are detected by combining semantic and photometric cues. Semantic discrepancy is measured via cosine similarity of SAM features (robust to illumination changes), while photometric discrepancy is captured with SSIM to pick up subtle shifts. The two signals are fused and binarized into pseudo-masks, which are then filtered by SAM masks to remove floating artifacts and extract reliable object-level change regions. Masks are dilated by 3 pixels so that downstream 3D aggregation retains sufficient context. Design motivation: purely photometric methods are sensitive to lighting variation, while purely semantic methods miss subtle changes; combining them is more robust. A minimal sketch of this fusion follows the list.
- Object Template Construction and 2D Instance Matching: Objects are associated across views and across timesteps from sparse observations. Within a timestep, dense matching graphs are built from MASt3R geometric features, and instance IDs are assigned through graph-based analysis. Across timesteps, SAM features are aggregated per instance to form a cosine-similarity matrix, and Hungarian matching establishes instance correspondences (see the matching sketch after this list). Object templates at the initial timestep are segmented from \(\mathcal{G}_0\) via optimal label assignment; new objects at later timesteps are initialized as Gaussians from MASt3R-estimated 3D point clouds. Design motivation: under sparse settings, low-level features are insufficient for matching small objects; combining MASt3R and SAM leverages complementary dense geometric and semantic cues.
- Object Template Tracking and 3D Registration: A 6DoF transformation is estimated for each matched object instance. Because MASt3R point clouds are incomplete and unevenly distributed, conventional ICP/RANSAC registration fails; the method therefore augments each point with DINO features and runs a robust point-cloud registration pipeline to estimate \(P_{t \to \tilde{t}, k} = \{R_{t \to \tilde{t}, k}, T_{t \to \tilde{t}, k}\}\), verifying geometric consistency with a Chamfer-distance threshold (see the registration sketch after this list). Non-rigid changes such as articulated motion are naturally represented as distinct instances. Design motivation: keeping a single shared template per instance, paired with relative transformations, avoids redundant reconstruction and enables reuse across timesteps.
- Long-Term Gaussian Optimization: Selected templates are mapped to each timestep via their 6DoF transformations (see the transform sketch after this list):

  $$(\mu_{t,k},\, R_{t,k},\, c_{t,k}) = \left(\mu_{0,k}\, R_{0 \to t,k}^\top + T_{0 \to t,k}^\top,\;\; R_{0 \to t,k}\, R_{0,k},\;\; c_{0,k}\, \mathcal{R}_{\text{SH}}(R_{0 \to t,k})^\top\right)$$

  A temporal opacity filter \(\mathcal{M}_{t,o}\) controls the visibility of transient objects, and the 6DoF poses are kept as optimizable parameters to compensate for pixel-level registration errors. Consistency constraints from rendered training views of the initial timestep prevent overfitting to the limited subsequent observations.
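The sketches below are minimal illustrations of the four designs above, not the authors' implementation; all function names, thresholds, and weights are assumptions. First, the semantic + photometric change-mask fusion, assuming per-pixel SAM-style feature maps have already been extracted and upsampled to image resolution:

```python
# Sketch of the semantic + photometric change-mask fusion (illustrative only;
# thresholds and the fusion rule are assumptions, not the paper's exact values).
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.metrics import structural_similarity

def change_mask(feat_render, feat_obs, img_render, img_obs,
                tau_sem=0.7, tau_photo=0.6, dilate_px=3):
    """feat_*: (H, W, C) per-pixel feature maps (e.g. SAM encoder features,
    upsampled to image resolution); img_*: (H, W, 3) float images in [0, 1]."""
    # Semantic discrepancy: 1 - cosine similarity of per-pixel features.
    num = (feat_render * feat_obs).sum(-1)
    den = np.linalg.norm(feat_render, axis=-1) * np.linalg.norm(feat_obs, axis=-1) + 1e-8
    sem_change = 1.0 - num / den                       # high where semantics differ

    # Photometric discrepancy: 1 - local SSIM, sensitive to subtle shifts.
    _, ssim_map = structural_similarity(img_render, img_obs,
                                        channel_axis=-1, full=True, data_range=1.0)
    photo_change = 1.0 - ssim_map.mean(axis=-1)

    # Fuse and binarize; either cue alone is unreliable (lighting vs. subtlety).
    # (SAM-mask filtering of the resulting pseudo-masks is omitted in this sketch.)
    fused = (sem_change > 1.0 - tau_sem) | (photo_change > 1.0 - tau_photo)

    # Dilate so downstream 3D aggregation sees enough context around the change.
    return binary_dilation(fused, iterations=dilate_px)
```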
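Next, cross-timestep instance association. This sketch assumes one aggregated SAM feature vector per instance; the similarity floor `min_sim` is an illustrative choice.

```python
# Sketch of cross-timestep instance association: build a cosine-similarity
# matrix over aggregated per-instance features and solve it with the
# Hungarian algorithm. Names and the similarity floor are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_instances(feats_prev, feats_curr, min_sim=0.5):
    """feats_*: (N, C) and (M, C) arrays, one aggregated feature per instance.
    Returns (prev_idx, curr_idx) matches; unmatched current instances would
    become new object templates."""
    a = feats_prev / (np.linalg.norm(feats_prev, axis=1, keepdims=True) + 1e-8)
    b = feats_curr / (np.linalg.norm(feats_curr, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                                   # (N, M) cosine similarities

    # Hungarian matching maximizes total similarity (minimizes negative sim).
    rows, cols = linear_sum_assignment(-sim)

    # Reject weak assignments: those instances are treated as removed or new.
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]
```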
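The feature-augmented registration can be sketched as nearest-neighbor matching in a joint geometry + descriptor space followed by the Kabsch/Procrustes solution, with the Chamfer check described above. The feature weight, single-shot correspondence step, and threshold are simplifying assumptions; the paper's pipeline is more robust.

```python
# Sketch of feature-augmented rigid registration with a Chamfer-distance
# sanity check (illustrative; not the paper's exact pipeline).
import numpy as np
from scipy.spatial import cKDTree

def register(src_xyz, src_feat, dst_xyz, dst_feat, feat_weight=1.0):
    """Estimate R, T such that dst ≈ src @ R.T + T."""
    # Nearest neighbors in concatenated geometry + feature space; descriptors
    # (e.g. DINO features) disambiguate sparse, unevenly sampled points.
    src_aug = np.hstack([src_xyz, feat_weight * src_feat])
    dst_aug = np.hstack([dst_xyz, feat_weight * dst_feat])
    _, idx = cKDTree(dst_aug).query(src_aug)
    tgt = dst_xyz[idx]

    # Kabsch: optimal rotation between centered correspondence sets.
    mu_s, mu_t = src_xyz.mean(0), tgt.mean(0)
    H = (src_xyz - mu_s).T @ (tgt - mu_t)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = mu_t - R @ mu_s
    return R, T

def chamfer_ok(src_xyz, dst_xyz, R, T, tau=0.05):
    """Verify geometric consistency of a registration via symmetric Chamfer."""
    moved = src_xyz @ R.T + T
    d1, _ = cKDTree(dst_xyz).query(moved)
    d2, _ = cKDTree(moved).query(dst_xyz)
    return 0.5 * (d1.mean() + d2.mean()) < tau
```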
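Finally, the template-mapping equation above applied to Gaussian means and orientations. Only the rotation-invariant DC color band is handled in this sketch; a full implementation would apply the SH rotation \(\mathcal{R}_{\text{SH}}\) to the higher bands as well.

```python
# Sketch of mapping a stored object template to timestep t with a rigid 6DoF
# transform, following the update rule above. Names are illustrative.
import numpy as np
from scipy.spatial.transform import Rotation

def transform_template(mu0, quat0, sh0, R, T):
    """mu0: (N, 3) Gaussian means; quat0: (N, 4) orientations (x, y, z, w);
    sh0: (N, 3) DC spherical-harmonic coefficients; R: (3, 3), T: (3,)."""
    mu_t = mu0 @ R.T + T                     # mu_t = mu0 R^T + T^T, row-wise
    # Compose rotations: R_t = R @ R_0 for every Gaussian.
    quat_t = (Rotation.from_matrix(R) * Rotation.from_quat(quat0)).as_quat()
    sh_t = sh0                               # DC band is invariant under rotation
    return mu_t, quat_t, sh_t
```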
### Loss & Training
- Rendering loss: standard L1 + D-SSIM (consistent with the original 3DGS formulation).
- Convergence is achieved in only 5,000 iterations (no densification, cloning, or opacity reset required).
- The background \(\mathcal{B}_0\) is initialized from the initial reconstruction; regions exposed by object removal are supplemented with MASt3R point clouds.
- Total processing time is approximately 6.5 minutes on an RTX 4090: change detection 2.5 min + instance matching 0.5 min + optimization 3.5 min.
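A sketch of the rendering loss under the standard 3DGS formulation, \((1-\lambda)\,L_1 + \lambda\,(1-\mathrm{SSIM})\) with \(\lambda = 0.2\). The uniform SSIM window here is a simplification (the original 3DGS uses a Gaussian window).

```python
# Sketch of the photometric objective: (1 - lam) * L1 + lam * (1 - SSIM),
# with lam = 0.2 as in the original 3DGS paper. Constants follow the standard
# SSIM formulation for images in [0, 1].
import torch
import torch.nn.functional as F

def ssim(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """x, y: (B, C, H, W) images in [0, 1]; returns mean SSIM."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    sigma_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    s = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
        ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return s.mean()

def rendering_loss(render, gt, lam=0.2):
    l1 = (render - gt).abs().mean()
    return (1.0 - lam) * l1 + lam * (1.0 - ssim(render, gt))
```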
## Key Experimental Results
### Main Results
| Dataset | Metric | LTGS (Ours) | CL-Splats | 4DGS | CL-NeRF | Gain |
|---|---|---|---|---|---|---|
| CL-NeRF (synthetic) | PSNR↑ | 27.17 | 25.84 | 26.13 | 25.53 | +1.04 vs 4DGS |
| CL-NeRF (synthetic) | SSIM↑ | 0.795 | 0.772 | 0.786 | 0.730 | +0.009 vs 4DGS |
| CL-NeRF (synthetic) | LPIPS↓ | 0.376 | 0.416 | 0.411 | 0.465 | −0.035 vs 4DGS |
| Real dataset | PSNR↑ | 23.46 | 21.12 | 21.49 | 20.95 | +1.97 vs 4DGS |
| Real dataset | SSIM↑ | 0.889 | 0.829 | 0.850 | 0.815 | +0.039 vs 4DGS |
| Real dataset | LPIPS↓ | 0.230 | 0.312 | 0.322 | 0.379 | −0.092 vs 4DGS |
| Real dataset | Time↓ | 7 min | 3 min | 29 min | 2 h | 4× faster than 4DGS |
### Ablation Study
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ | Notes |
|---|---|---|---|---|
| Full (LTGS) | 23.46 | 0.889 | 0.230 | Complete method |
| w/o object tracking | 23.26 | 0.885 | 0.234 | No template association; each timestep reconstructed independently |
| w/o pose optimization | 23.33 | 0.886 | 0.232 | 6DoF poses not optimized |
| w/o background initialization | 23.29 | 0.885 | 0.233 | Exposed occlusion regions not supplemented |
| w/o training-view constraint | 23.11 | 0.885 | 0.240 | Initial training views not used |
### Key Findings
- The performance advantage is more pronounced on real scenes (PSNR +1.97 vs. 4DGS), demonstrating the critical role of object template priors as constraints under sparse inputs.
- Object tracking is the most important component: without template reuse, sparse views are insufficient to eliminate ghosting artifacts from removed objects.
- Noise injection experiments on foundation models (MASt3R, SAM) confirm the robustness of the framework, as contour aggregation and graph matching smooth out local perturbations.
- Non-rigid and articulated motion is naturally handled by creating independent templates for different states.
## Highlights & Insights
- The object template reuse paradigm is highly elegant: build once, transform repeatedly — reducing what is conceptually an \(O(M)\) reconstruction problem to \(O(1)\) templates plus \(O(M)\) rigid transformations.
- The framework effectively leverages the complementary strengths of three foundation models: SAM (semantics), MASt3R (geometry), and DINO (appearance), each fulfilling a distinct role.
- The system is deployment-friendly: only 3 randomly captured images per update step are required, with a processing time of approximately 7 minutes, making it suitable for practical applications such as digital twins and location-based services.
- A newly collected real-world dataset fills an evaluation gap for long-term sparse-update scenarios.
## Limitations & Future Work
- Geometric changes only: The method cannot handle significant illumination changes, shadow variations, or screen content changes.
- Rigid object assumption: Although articulated motion is indirectly handled via multi-template mechanisms, explicit deformation modeling is absent.
- The approach depends on a high-quality initial reconstruction \(\mathcal{G}_0\); poor initial quality limits the effectiveness of subsequent updates.
- Per-component PSNR gains in ablation experiments are modest (~0.2 dB), with overall improvement manifesting primarily in visual quality.
- Future work could incorporate object articulation modeling (e.g., [13]) to support finer-grained non-rigid tracking.
## Related Work & Insights
- CL-Splats / CL-NeRF (continual learning): Preserve temporal information but require more input images; LTGS overcomes the sparse-input bottleneck through structural priors.
- InstantSplat (few-shot reconstruction): Reconstructs each timestep independently, discarding historical information; LTGS addresses this via background preservation and template reuse.
- 3DGS-CD (change detection): Detects changes but does not track them; LTGS adds complete object association and template optimization.
- Insight: Deep integration of scene understanding (detection + segmentation + matching) with scene reconstruction (3DGS optimization) is a key direction for practical 3D systems.
## Rating
- Novelty: ⭐⭐⭐⭐ The object template reuse paradigm is novel, modeling everyday scene changes as decomposable structured updates.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on synthetic and real datasets with 7 baselines, complete ablation, and robustness analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, pipeline diagrams are intuitive, and the motivation and design of each module are internally consistent.
- Value: ⭐⭐⭐⭐ The problem setting is highly practical, with direct applicability to digital twins, robotics, and related downstream applications.