LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction¶
Conference: CVPR 2026 | arXiv: 2512.13680 | Code: Project Page | Area: Human Understanding / 3D Reconstruction | Keywords: Streaming 4D Reconstruction, Training-Free Framework, Layer-wise Scale Alignment, Sliding Window, Sim(3) Registration
TL;DR¶
This paper proposes LASER, a training-free framework that converts offline feed-forward reconstruction models (e.g., VGGT, π³) into streaming systems via Layer-wise Scale Alignment (LSA), achieving real-time streaming 4D reconstruction of kilometer-scale videos at 14 FPS with 6 GB peak memory on an RTX A6000.
Background & Motivation¶
Limitations of Prior Work:

- Offline model constraints: Feed-forward reconstruction models such as VGGT and π³ perform well on static image sets but cannot handle streaming video input due to quadratic memory complexity, running out of memory (OOM) on long sequences such as KITTI.
- Streaming methods require retraining: Streaming approaches including CUT3R, StreamVGGT, and STream3R achieve incremental processing through learned memory mechanisms or causal attention, but all require extensive retraining or knowledge distillation, incurring substantial computational cost.
- Drift in recurrent designs: Recurrent designs such as CUT3R suffer from drift and catastrophic forgetting on long sequences; methods relying on growing memory face scalability limits.
- Insufficiency of simple Sim(3) alignment: The concurrent work VGGT-Long takes a training-free approach via chunking and Sim(3) alignment, but rigid global alignment proves insufficient along the depth dimension.
- Layer-wise depth inconsistency: Monocular scale ambiguity causes the relative depth scales of different scene layers (e.g., foreground vs. background) to shift inconsistently across windows; the uniform scaling of a global Sim(3) transformation cannot resolve this anisotropic effect.
- Practical deployment requirements: Autonomous driving, robotics, and AR/VR demand efficient, consistent processing of video streams, i.e., online processing that preserves reconstruction quality.
Method¶
Overall Architecture¶
LASER adopts a sliding window strategy to process video streams. Given a video \(\{I_t\}\), overlapping windows \(\{W_i\}\) are formed, each containing \(L\) consecutive frames with an overlap of \(O\) frames between adjacent windows. Each window is processed by a frozen offline reconstructor (VGGT or π³) that predicts dense point maps and camera poses; the resulting local submaps are then registered into a global map via incremental alignment.
Pipeline: Video stream → Overlapping sliding windows → Frozen feed-forward reconstructor predicts point maps/poses → Sim(3) global alignment → Layer-wise Scale Alignment (LSA) → Globally consistent reconstruction.
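As a minimal illustration of the windowing scheme, the sketch below generates overlapping window indices; the function name and stride logic are assumptions for illustration, with \(L = 20\) matching the paper's default window size and the overlap \(O\) left as a parameter.

```python
def make_windows(num_frames: int, L: int = 20, O: int = 5) -> list[list[int]]:
    """Generate overlapping sliding windows of L frames with O-frame overlap.

    Hypothetical helper: L = 20 matches the paper's default window size;
    the overlap O and the stride logic are illustrative assumptions.
    """
    stride = L - O
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + L, num_frames)
        windows.append(list(range(start, end)))
        if end == num_frames:
            break
        start += stride
    return windows

# Example: 50 frames -> windows [0..19], [15..34], [30..49]
print(make_windows(50, L=20, O=5))
```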
Key Designs: Layer-wise Scale Alignment (LSA)¶
Problem Identification: Global Sim(3) registration assumes isotropic scaling, but under low-parallax motion the scale constraint along the depth direction is unreliable, leading to inconsistent scaling across depth layers (foreground over- or under-scaled relative to background).
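Concretely, a global Sim(3) applies a single isotropic scale \(s\) to every point, while the failure mode calls for a per-layer scale (the notation below is assumed for illustration):

\[
p' = s R p + t \qquad \text{(global Sim(3): one } s \text{ for all layers)}
\]
\[
\hat{D}_{t,m} = \hat{s}_m D_{t,m}, \quad m = 1, \dots, M \qquad \text{(layer-wise: one } \hat{s}_m \text{ per depth layer)}
\]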
Depth Layer Extraction: After Sim(3) registration, an efficient segmentation algorithm partitions the pseudo-depth map into \(M\) disjoint depth layers \(\{L_{t,m}\}\), each corresponding to a continuous geometric surface with consistent depth.
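The summary does not spell out the segmentation algorithm, so the sketch below is a plausible stand-in rather than the authors' method: quantize log-depth into bins and keep spatially connected components as layers. `extract_depth_layers`, the bin count, and the size threshold are all illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def extract_depth_layers(depth: np.ndarray, num_bins: int = 8, min_pixels: int = 200):
    """Partition a (positive-valued) depth map into disjoint layer masks.

    Stand-in for the paper's efficient segmentation: bin log-depth by
    quantiles, then keep spatially connected components as layers.
    """
    log_d = np.log(depth + 1e-6)
    edges = np.quantile(log_d, np.linspace(0.0, 1.0, num_bins + 1))
    bin_idx = np.digitize(log_d, edges[1:-1])  # values in 0..num_bins-1
    layers = []
    for b in range(num_bins):
        components, n = ndimage.label(bin_idx == b)  # connected components per bin
        for c in range(1, n + 1):
            mask = components == c
            if mask.sum() >= min_pixels:  # drop tiny fragments
                layers.append(mask)
    return layers  # list of boolean masks, one per depth layer
```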
Depth Layer Graph Construction: All depth layers are organized into a directed graph \(H=(V,E)\) with two types of edges:

- Inter-window edges \(E_{\text{inter}}\): connect corresponding layers from two windows at overlapping timestamps whose masks have IoU \(> \tau\) (default \(\tau = 0.3\)).
- Intra-window edges \(E_{\text{intra}}\): connect the same depth layer across adjacent frames within the same window.
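A minimal sketch of the inter-window edge test, assuming boolean layer masks from the extraction step; `mask_iou` and `inter_window_edges` are hypothetical helpers, with \(\tau = 0.3\) taken from the paper.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean layer masks at the same timestamp."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0

def inter_window_edges(layers_prev: list, layers_next: list, tau: float = 0.3):
    """Link layers of an overlapping frame seen by two adjacent windows
    when their masks agree with IoU > tau (paper default tau = 0.3)."""
    return [
        (i, j)
        for i, a in enumerate(layers_prev)
        for j, b in enumerate(layers_next)
        if mask_iou(a, b) > tau
    ]
```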
Layer-wise Scale Estimation: For each inter-window edge, a layer-wise scaling factor \(\hat{s}\) is optimized via IRLS (Huber loss) to align the depth values of corresponding layers in adjacent windows.
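A sketch of the per-edge scale fit: IRLS alternates between computing Huber weights from the residuals and solving the weighted least-squares scale in closed form. The Huber threshold and iteration count below are illustrative defaults, not values from the paper.

```python
import numpy as np

def irls_scale(d_ref: np.ndarray, d_src: np.ndarray,
               delta: float = 1.345, iters: int = 10) -> float:
    """Estimate a scalar s with d_ref ≈ s * d_src by IRLS under a Huber loss.

    d_ref, d_src: flattened depth values of a corresponding layer in two
    adjacent windows. delta and iters are illustrative defaults.
    """
    s = np.median(d_ref) / np.median(d_src)  # robust initialization
    for _ in range(iters):
        r = d_ref - s * d_src                          # residuals at current scale
        sigma = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust std via MAD
        a = np.abs(r) / sigma
        w = np.where(a <= delta, 1.0, delta / a)       # Huber weights
        # closed-form weighted least-squares update for the scale
        s = np.sum(w * d_src * d_ref) / np.sum(w * d_src * d_src)
    return float(s)
```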
Scale Propagation and Aggregation: Layer-wise scales estimated from overlapping regions along \(E_{\text{inter}}\) are first propagated temporally to non-overlapping frames along \(E_{\text{intra}}\). The final scale for each layer is computed as an IoU-weighted average, ensuring consistency across windows and the temporal axis.
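A simplified sketch of the propagate-then-aggregate step, assuming each inter-window edge carries a scale estimate and its IoU; the data layout and helper name are assumptions for illustration.

```python
from collections import defaultdict

def aggregate_layer_scales(inter_edges, intra_edges):
    """Combine per-edge scale estimates into one scale per layer.

    inter_edges: iterable of (layer_id, s_hat, iou) from overlapping frames.
    intra_edges: iterable of (src_layer_id, dst_layer_id) temporal links,
                 used to carry scales to layers in non-overlapping frames.
    """
    num, den = defaultdict(float), defaultdict(float)
    for lid, s_hat, iou in inter_edges:   # IoU-weighted accumulation
        num[lid] += iou * s_hat
        den[lid] += iou
    scales = {lid: num[lid] / den[lid] for lid in num}
    for src, dst in intra_edges:          # temporal propagation along E_intra
        if dst not in scales and src in scales:
            scales[dst] = scales[src]
    return scales
```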
Loss & Training¶
- Global scale \(s_i^w\) is estimated via IRLS robust optimization with Huber loss to suppress outliers.
- Rotation and translation are optimized via the Kabsch algorithm under the estimated scale (see the sketch after this list).
- Layer-wise scales are likewise optimized via IRLS with Huber loss.
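For reference, a compact sketch of the Kabsch step: with the scale fixed by the IRLS estimate, the optimal rotation comes from an SVD of the cross-covariance of the centered correspondences, and the translation aligns the centroids. This is the standard Kabsch algorithm, not code from the paper.

```python
import numpy as np

def kabsch(P: np.ndarray, Q: np.ndarray, s: float = 1.0):
    """Optimal rotation R and translation t aligning s*P to Q.

    P, Q: (N, 3) corresponding points; s is the scale assumed to come
    from the preceding IRLS step.
    """
    P = s * P
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)   # centroids
    H = (P - cP).T @ (Q - cQ)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t  # maps a point p as R @ (s * p) + t
```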
Key Experimental Results¶
Main Results¶
Video Depth Estimation (Table 1):
| Method | Type | Sintel Abs Rel↓ | Bonn Abs Rel↓ | KITTI Abs Rel↓ |
|---|---|---|---|---|
| π³ (offline) | Offline | 0.245 | 0.050 | 0.038 |
| CUT3R | Streaming | 0.421 | 0.078 | 0.118 |
| STream3Rβ | Streaming | 0.264 | 0.069 | 0.080 |
| π³+Ours | Streaming | 0.247 | 0.048 | 0.054 |
Camera Pose Estimation (Table 2):
| Method | Sintel ATE↓ | ScanNet ATE↓ | TUM ATE↓ |
|---|---|---|---|
| π³ (offline) | 0.073 | 0.030 | 0.014 |
| CUT3R | 0.213 | 0.099 | 0.046 |
| TTT3R | 0.201 | 0.064 | 0.028 |
| π³+Ours | 0.061 | 0.031 | 0.016 |
ATE on Sintel is reduced by 68.6% versus the previous best streaming method, and the point-map accuracy error (Acc↓) on 7-Scenes is reduced by 63.9%.
Large-Scale KITTI Odometry (Table 3): Offline models VGGT and π³ run OOM on all sequences; CUT3R runs OOM on most. LASER(π³) maintains stable performance across all 11 sequences, achieving a mean ATE of 24.17, outperforming VGGT-Long (27.64) and π³-Long (30.72).
Ablation Study¶
LSA Component Ablation (Table 5, Sintel depth):
| Configuration | Abs Rel↓ | δ<1.25↑ |
|---|---|---|
| Full LASER | 0.247 | 68.8 |
| w/o LSA | 0.328 | 51.4 |
| Segmentation replaced by SAM 2 | 0.251 | 67.8 |
| w/o \(E_{\text{intra}}\) | 0.261 | 64.7 |
Key Findings:

- Removing LSA degrades Abs Rel by 32.8% (0.247 → 0.328), confirming that layer-wise scale alignment is the core contribution.
- SAM 2, despite finer segmentation granularity, yields no improvement; the simple, efficient segmentation is sufficient.
- Removing the intra-window temporal propagation edges \(E_{\text{intra}}\) impairs global consistency.
- The IoU threshold \(\tau\) is robust in the range 0.2–0.6; the default value of 0.3 performs best.
- Window size \(L=20\) achieves the best balance.
Efficiency Analysis¶
- π³+Ours: ~14.2 FPS, 6 GB peak memory (RTX A6000).
- VGGT+Ours: ~10.9 FPS, 10 GB peak memory.
- Fastest speed and lowest memory consumption among all streaming methods.
Highlights & Insights¶
- Zero training cost: No retraining is required whatsoever; any offline reconstruction model can be directly converted to a streaming system, enabling plug-and-play adoption as new models emerge.
- Identification and resolution of layer-wise depth inconsistency: The paper provides a deep insight into the anisotropic scaling failure mode of global Sim(3) alignment and proposes a solution grounded in classical layered scene representation.
- Comprehensive state-of-the-art performance: LASER surpasses existing streaming methods across three tasks — depth estimation, pose estimation, and point map reconstruction — with several metrics approaching or exceeding offline models.
- Practically deployable: At 14 FPS with 6 GB memory usage and support for kilometer-scale long sequences, the system holds significant real-world application value.
- Elegant design philosophy: Classical geometric principles are used to bridge the shortcomings of deep learning models without requiring end-to-end retraining.
Limitations & Future Work¶
- Performance is bounded by the capabilities of the underlying offline model (e.g., π³'s weaker normal accuracy leads to suboptimal NC metrics).
- Layered segmentation depends on the quality of the depth map and may fail in extreme scenarios such as pure rotation or textureless regions.
- The sliding window strategy introduces a fixed latency, making it unsuitable for ultra-low-latency applications.
- Large-scale scenes still require additional loop closure to reduce long-range drift.
- The paper is categorized under human understanding, whereas the actual contribution is a general-purpose 3D/4D reconstruction framework.
Related Work & Insights¶
- Offline feed-forward reconstruction: DUSt3R → VGGT → π³, progressing from pairwise image regression to dense reconstruction from arbitrary view collections.
- Streaming reconstruction (with training): CUT3R (recurrent memory), StreamVGGT (causal attention), STream3R (sliding window + token pooling), WinT3R, TTT3R (test-time adaptation).
- Training-free streaming (concurrent work): VGGT-Long uses chunking + Sim(3); this paper demonstrates that simple Sim(3) alignment is insufficient.
- Classical methods: ORB-SLAM2, DROID-SLAM, etc., offer high accuracy but require calibration and produce only sparse reconstructions.
- 4D reconstruction: From per-scene optimization with NeRF/3DGS to feed-forward dynamic reconstruction.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The identification of the layer-wise depth inconsistency problem and the LSA design are original; the combination of classical geometry and modern deep learning is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three tasks, six datasets, extensive baseline comparisons, comprehensive ablations, and efficiency analysis; very thorough.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clearly articulated, figures are intuitive, and the method is described rigorously.
- Value: ⭐⭐⭐⭐⭐ — Training-free, plug-and-play, efficient, and practical; highly valuable for real-world deployment.