
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

Conference: CVPR 2026 · arXiv: 2512.13680 · Code: Project Page
Area: Human Understanding / 3D Reconstruction
Keywords: Streaming 4D Reconstruction, Training-Free Framework, Layer-wise Scale Alignment, Sliding Window, Sim(3) Registration

TL;DR

This paper proposes LASER, a training-free framework that converts offline feed-forward reconstruction models (e.g., VGGT, π³) into streaming systems via Layer-wise Scale Alignment (LSA), achieving real-time streaming 4D reconstruction of kilometer-scale videos at 14 FPS with 6 GB peak memory on an RTX A6000.

Background & Motivation

Limitations of Prior Work:

  • Offline model constraints: Feed-forward reconstruction models such as VGGT and π³ perform well on static image sets but cannot handle streaming video input due to quadratic memory complexity, running out of memory (OOM) on long sequences such as KITTI.
  • Streaming methods require retraining: Streaming approaches including CUT3R, StreamVGGT, and STream3R achieve incremental processing through learned memory mechanisms or causal attention, but all require extensive retraining or knowledge distillation, incurring substantial computational cost.
  • Drift in recurrent designs: Recurrent designs such as CUT3R suffer from drift and catastrophic forgetting on long sequences; methods relying on growing memory face scalability limitations.
  • Insufficiency of simple Sim(3) alignment: The concurrent work VGGT-Long adopts a training-free approach via chunking and Sim(3) alignment, but rigid global alignment proves insufficient along the depth dimension.
  • Layer-wise depth inconsistency: Monocular scale ambiguity causes the relative depth scales of different scene layers (e.g., foreground vs. background) to shift inconsistently across windows; the uniform scaling of a global Sim(3) transformation cannot resolve this anisotropic scaling.
  • Practical deployment requirements: Applications in autonomous driving, robotics, and AR/VR demand efficient and consistent processing of video streams, requiring online processing while maintaining reconstruction quality.

Method

Overall Architecture

LASER adopts a sliding window strategy to process video streams. Given a video \(\{I_t\}\), overlapping windows \(\{W_i\}\) are formed, each containing \(L\) consecutive frames with an overlap of \(O\) frames between adjacent windows. Each window is processed by a frozen offline reconstructor (VGGT or π³) that predicts dense point maps and camera poses; the resulting local submaps are then registered into a global map via incremental alignment.

Pipeline: Video stream → Overlapping sliding windows → Frozen feed-forward reconstructor predicts point maps/poses → Sim(3) global alignment → Layer-wise Scale Alignment (LSA) → Globally consistent reconstruction.
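The windowing schedule can be sketched in a few lines. \(L\) and \(O\) follow the paper's notation; the stride \(L-O\) and the exact start indices below are our own straightforward reading, not the authors' code:

```python
# Minimal sketch of the overlapping sliding-window schedule: windows of
# length L with O overlapping frames between adjacent windows.

def sliding_windows(num_frames, L=20, O=5):
    """Return [start, end) frame ranges of overlapping windows."""
    stride = L - O
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + L, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows

wins = sliding_windows(50, L=20, O=5)  # e.g. [(0, 20), (15, 35), (30, 50)]
```

Each range would then be fed to the frozen reconstructor, and consecutive windows share \(O\) frames for alignment.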

Key Designs: Layer-wise Scale Alignment (LSA)

Problem Identification: Global Sim(3) registration assumes isotropic scaling, but under low-parallax motion the scale constraint along the depth direction is unreliable, leading to inconsistent scaling across depth layers (foreground over- or under-scaled relative to background).

Depth Layer Extraction: After Sim(3) registration, an efficient segmentation algorithm partitions the pseudo-depth map into \(M\) disjoint depth layers \(\{L_{t,m}\}\), each corresponding to a continuous geometric surface with consistent depth.
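The paper's segmentation algorithm is not detailed here; as a purely hypothetical stand-in, quantile-based depth binning illustrates the notion of \(M\) disjoint layers \(\{L_{t,m}\}\):

```python
import numpy as np

# Hypothetical stand-in for the paper's unspecified "efficient segmentation":
# partition a pseudo-depth map into M disjoint layers by depth quantiles.
# The actual algorithm may differ; this only illustrates disjoint layers.

def extract_depth_layers(depth, M=4):
    edges = np.quantile(depth, np.linspace(0.0, 1.0, M + 1))
    # interior edges only, so labels fall in {0, ..., M-1}
    labels = np.digitize(depth, edges[1:-1])
    return [labels == m for m in range(M)]  # boolean masks, disjoint

depth = np.random.rand(64, 64).astype(np.float32)
layers = extract_depth_layers(depth, M=4)
```

The masks are disjoint and cover the map, which is all the downstream graph construction requires.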

Depth Layer Graph Construction: All depth layers are organized into a directed graph \(H=(V,E)\) with two types of edges:

  • Inter-window edges \(E_{\text{inter}}\): connect corresponding layers from two windows at overlapping timestamps with IoU \(> \tau\ (=0.3)\).
  • Intra-window edges \(E_{\text{intra}}\): connect the same depth layer across adjacent frames within the same window.
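A minimal sketch of how such a graph might be assembled, assuming boolean layer masks and index-based intra-window matching (both our simplifications of the paper's bookkeeping):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def build_layer_graph(layers, tau=0.3):
    """layers: dict (window, frame) -> list of boolean masks, one per layer.
    Returns inter-window and intra-window edge lists over (win, frame, layer) nodes."""
    e_inter, e_intra = [], []
    for (wi, t), masks_a in layers.items():
        # inter-window edges: the same timestamp seen by a later window
        for (wj, u), masks_b in layers.items():
            if wj <= wi or u != t:
                continue
            for m, ma in enumerate(masks_a):
                for n, mb in enumerate(masks_b):
                    if iou(ma, mb) > tau:
                        e_inter.append(((wi, t, m), (wj, u, n)))
        # intra-window edges: same layer index in the next frame of this window
        nxt = layers.get((wi, t + 1))
        if nxt is not None:
            for m in range(min(len(masks_a), len(nxt))):
                e_intra.append(((wi, t, m), (wi, t + 1, m)))
    return e_inter, e_intra

# toy example: windows 0 and 1 overlap at frame 1, each frame has two layers
fg = np.array([[True, False], [False, False]])
bg = ~fg
layer_masks = {(0, 0): [fg, bg], (0, 1): [fg, bg], (1, 1): [fg, bg]}
e_inter, e_intra = build_layer_graph(layer_masks)
```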

Layer-wise Scale Estimation: For each inter-window edge, a layer-wise scaling factor \(\hat{s}\) is optimized via IRLS (Huber loss) to align the depth values of corresponding layers in adjacent windows.
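A generic IRLS loop with Huber reweighting for a single scalar scale looks as follows; the initialization, iteration count, and Huber threshold here are our own choices, not the paper's:

```python
import numpy as np

# IRLS sketch: find s minimizing sum_i huber(d_ref[i] - s * d_src[i]),
# alternating a weighted least-squares solve for s with Huber reweighting.

def irls_scale(d_src, d_ref, delta=1.345, iters=20):
    w = np.ones_like(d_src)
    for _ in range(iters):
        # closed-form weighted least-squares solution for s
        s = np.sum(w * d_src * d_ref) / np.sum(w * d_src * d_src)
        r = np.abs(d_ref - s * d_src)
        # standard Huber weights: 1 inside delta, delta/|r| outside
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
    return s

rng = np.random.default_rng(0)
d_src = rng.uniform(1.0, 10.0, 500)
d_ref = 2.0 * d_src + rng.normal(0.0, 0.01, 500)
d_ref[:20] += 5.0  # outliers that the Huber loss should down-weight
s_hat = irls_scale(d_src, d_ref)
```

The robust reweighting keeps the estimated scale close to the true 2.0 despite the contaminated correspondences.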

Scale Propagation and Aggregation: Layer-wise scales estimated from overlapping regions along \(E_{\text{inter}}\) are first propagated temporally to non-overlapping frames along \(E_{\text{intra}}\). The final scale for each layer is computed as an IoU-weighted average, ensuring consistency across windows and the temporal axis.
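The IoU-weighted aggregation step reduces to a weighted average per layer. A minimal sketch, with the convention (our assumption) that an unobserved layer falls back to the identity scale:

```python
# Each inter-window edge contributes an estimated scale for a layer, weighted
# by the IoU of the overlap it was estimated from; the final per-layer scale
# is the IoU-weighted average of these estimates.

def aggregate_scale(edge_estimates):
    """edge_estimates: list of (scale, iou) pairs for one depth layer."""
    den = sum(w for _, w in edge_estimates)
    if den == 0:
        return 1.0  # identity scale if the layer was never matched
    return sum(s * w for s, w in edge_estimates) / den

s = aggregate_scale([(1.8, 0.9), (2.2, 0.3)])  # high-IoU estimate dominates
```

Frames outside the overlap would then inherit these aggregates along \(E_{\text{intra}}\), as described above.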

Loss & Training

  • Global scale \(s_i^w\) is estimated via IRLS robust optimization with Huber loss to suppress outliers.
  • Rotation and translation are optimized via the Kabsch algorithm under the estimated scale.
  • Layer-wise scales are likewise optimized via IRLS with Huber loss.
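With the global scale \(s_i^w\) fixed, the rotation/translation solve is the textbook Kabsch algorithm (SVD of the cross-covariance). The sketch below is that standard form, not the authors' exact implementation:

```python
import numpy as np

def kabsch_rt(P, Q, s=1.0):
    """Find R, t such that Q ≈ s * R @ P_i + t for corresponding Nx3 point sets."""
    Ps = s * P
    cp, cq = Ps.mean(axis=0), Q.mean(axis=0)
    H = (Ps - cp).T @ (Q - cq)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])          # guard against reflections
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

# recover a known rotation about z and translation at scale s = 2
theta = 0.5
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, 2.0, 3.0])
P = np.random.default_rng(1).normal(size=(50, 3))
Q = 2.0 * P @ R_true.T + t_true
R_est, t_est = kabsch_rt(P, Q, s=2.0)
```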

Key Experimental Results

Main Results

Video Depth Estimation (Table 1):

| Method | Type | Sintel Abs Rel↓ | Bonn Abs Rel↓ | KITTI Abs Rel↓ |
|---|---|---|---|---|
| π³ (offline) | Offline | 0.245 | 0.050 | 0.038 |
| CUT3R | Streaming | 0.421 | 0.078 | 0.118 |
| STream3R | Streaming | 0.264 | 0.069 | 0.080 |
| π³ + Ours | Streaming | 0.247 | 0.048 | 0.054 |

Camera Pose Estimation (Table 2):

| Method | Sintel ATE↓ | ScanNet ATE↓ | TUM ATE↓ |
|---|---|---|---|
| π³ (offline) | 0.073 | 0.030 | 0.014 |
| CUT3R | 0.213 | 0.099 | 0.046 |
| TTT3R | 0.201 | 0.064 | 0.028 |
| π³ + Ours | 0.061 | 0.031 | 0.016 |

Versus the previous best streaming method, ATE on Sintel is reduced by 68.6%, and the point-map accuracy error (Acc, lower is better) on 7-Scenes is reduced by 63.9%.

Large-Scale KITTI Odometry (Table 3): Offline models VGGT and π³ run OOM on all sequences; CUT3R runs OOM on most. LASER(π³) maintains stable performance across all 11 sequences, achieving a mean ATE of 24.17, outperforming VGGT-Long (27.64) and π³-Long (30.72).

Ablation Study

LSA Component Ablation (Table 5, Sintel depth):

| Configuration | Abs Rel↓ | δ<1.25↑ |
|---|---|---|
| Full LASER | 0.247 | 68.8 |
| w/o LSA | 0.328 | 51.4 |
| SAM 2 replacing segmentation | 0.251 | 67.8 |
| w/o \(E_{\text{intra}}\) | 0.261 | 64.7 |

Key Findings:

  • Removing LSA degrades Abs Rel by 32.8% (0.247 → 0.328), confirming that layer-wise scale alignment is the core contribution.
  • SAM 2, despite finer segmentation granularity, yields no improvement; simple and efficient segmentation is sufficient.
  • Removing intra-window temporal propagation edges \(E_{\text{intra}}\) impairs global consistency.
  • The IoU threshold \(\tau\) is robust across the range 0.2–0.6; the default value of 0.3 performs best.
  • A window size of \(L=20\) achieves the best accuracy/efficiency balance.

Efficiency Analysis

  • π³+Ours: ~14.2 FPS, 6 GB peak memory (RTX A6000).
  • VGGT+Ours: ~10.9 FPS, 10 GB peak memory.
  • Fastest speed and lowest memory consumption among all streaming methods.

Highlights & Insights

  • Zero training cost: No retraining is required whatsoever; any offline reconstruction model can be directly converted to a streaming system, enabling plug-and-play adoption as new models emerge.
  • Identification and resolution of layer-wise depth inconsistency: The paper provides a deep insight into the anisotropic scaling failure mode of global Sim(3) alignment and proposes a solution grounded in classical layered scene representation.
  • Comprehensive state-of-the-art performance: LASER surpasses existing streaming methods across three tasks — depth estimation, pose estimation, and point map reconstruction — with several metrics approaching or exceeding offline models.
  • Practically deployable: At 14 FPS with 6 GB memory usage and support for kilometer-scale long sequences, the system holds significant real-world application value.
  • Elegant design philosophy: Classical geometric principles are used to bridge the shortcomings of deep learning models without requiring end-to-end retraining.

Limitations & Future Work

  • Performance is bounded by the capabilities of the underlying offline model (e.g., π³'s weaker normal accuracy leads to suboptimal NC metrics).
  • Layered segmentation depends on the quality of the depth map and may fail in extreme scenarios such as pure rotation or textureless regions.
  • The sliding window strategy introduces a fixed latency, making it unsuitable for ultra-low-latency applications.
  • Large-scale scenes still require additional loop closure to reduce long-range drift.
  • The paper is categorized under human understanding, whereas the actual contribution is a general-purpose 3D/4D reconstruction framework.

Related Work

  • Offline feed-forward reconstruction: DUSt3R → VGGT → π³, progressing from pairwise image regression to dense reconstruction from arbitrary view collections.
  • Streaming reconstruction (with training): CUT3R (recurrent memory), StreamVGGT (causal attention), STream3R (sliding window + token pooling), WinT3R, TTT3R (test-time adaptation).
  • Training-free streaming (concurrent work): VGGT-Long uses chunking + Sim(3); this paper demonstrates that simple Sim(3) alignment is insufficient.
  • Classical methods: ORB-SLAM2, DROID-SLAM, etc., offer high accuracy but require calibration and produce only sparse reconstructions.
  • 4D reconstruction: From per-scene optimization with NeRF/3DGS to feed-forward dynamic reconstruction.

Rating

  • Novelty: ⭐⭐⭐⭐ — The identification of the layer-wise depth inconsistency problem and the LSA design are original; the combination of classical geometry and modern deep learning is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three tasks, six datasets, extensive baseline comparisons, comprehensive ablations, and efficiency analysis; very thorough.
  • Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clearly articulated, figures are intuitive, and the method is described rigorously.
  • Value: ⭐⭐⭐⭐⭐ — Training-free, plug-and-play, efficient, and practical; highly valuable for real-world deployment.