LongStream: Long-Sequence Streaming Autoregressive Visual Geometry¶

Conference: CVPR 2026
arXiv: 2602.13172
Code: Project Page
Area: 3D Vision
Keywords: Streaming 3D reconstruction, autoregressive model, pose estimation, KV cache, long sequence

TL;DR¶

LongStream is a gauge-decoupled streaming visual geometry model that achieves stable metric-scale scene reconstruction at 18 FPS over thousand-frame sequences, via keyframe-relative pose prediction, orthogonal scale learning, and cache-consistent training.

Background & Motivation¶

Long-sequence streaming 3D reconstruction remains a major open challenge in visual geometry. Existing autoregressive streaming models (e.g., Stream3R, StreamVGGT) degrade severely on long sequences, for four related reasons:

Root Cause — gauge-coupled design: Existing models anchor poses to the first-frame coordinate system and regress absolute poses. This turns long-sequence prediction into an increasingly difficult extrapolation problem — models are trained on short index ranges but must predict large indices at inference, creating a "train-short, test-long" domain gap.
Attention decay: As sequences grow longer, the attention mechanism becomes overly dependent on first-frame tokens (attention sink), leading to keyframe jumps and accumulated pose errors.
Scale drift: Geometry and metric scale estimation are entangled, causing progressive Sim(3) scale drift.
KV cache contamination: Stale or degraded information accumulated in long-term KV caches further exacerbates geometric drift.

Existing streaming methods collapse within tens of meters, whereas applications such as autonomous driving require kilometer-scale stable reconstruction.

Method¶

Overall Architecture¶

LongStream adopts a ViT encoder + causal Transformer aggregator + task head architecture. Given streaming input frames, it predicts per frame: keyframe-relative pose \(\mathbf{T}_{i \leftarrow k}\), depth map, point map, and global scale factor. Training and inference share identical KV cache layouts to ensure consistency.

Key Designs¶

SE(3) Gauge Decoupling — Keyframe-Relative Pose: Rather than anchoring to a fixed first frame, the model predicts each frame's pose relative to the most recent keyframe:

\[\mathbf{T}_{i \leftarrow k} = \mathbf{T}_i \circ \mathbf{T}_k^{-1}\]

This formulation is invariant under any world-coordinate reparameterization \(S \in SE(3)\). The key effect is transforming the long-range extrapolation problem (large index range) into a constant-difficulty local estimation task (bounded index interval \(i - k\)), while eliminating the inherent bias toward the first frame. The pose head fuses current-frame and keyframe features and performs RAFT-style iterative updates \(\mathbf{p}^{(t+1)} = \mathbf{p}^{(t)} + \Delta\mathbf{p}^{(t)}\).
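
The invariance claim can be checked numerically. The sketch below is illustrative only and assumes poses are world-to-camera matrices, so that a world reparameterization \(S\) acts as \(\mathbf{T} \mapsto \mathbf{T} \circ S^{-1}\); the helper names `rigid_transform` and `relative_pose` are hypothetical and not taken from the paper's code.

```python
import numpy as np

def rigid_transform(yaw, translation):
    """Build a 4x4 SE(3) matrix: rotation about z by `yaw`, plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def relative_pose(T_i, T_k):
    """Keyframe-relative pose T_{i<-k} = T_i o T_k^{-1}."""
    return T_i @ np.linalg.inv(T_k)

# Two world-to-camera poses (assumed convention) and an arbitrary gauge change S.
T_k = rigid_transform(0.3, [1.0, 0.0, 0.0])
T_i = rigid_transform(0.5, [2.0, 0.5, 0.0])
S = rigid_transform(-1.2, [10.0, -4.0, 3.0])
S_inv = np.linalg.inv(S)

rel_original = relative_pose(T_i, T_k)
rel_regauged = relative_pose(T_i @ S_inv, T_k @ S_inv)

# Absolute poses change under S, but the keyframe-relative pose does not.
absolute_changed = not np.allclose(T_i, T_i @ S_inv)
gauge_invariant = np.allclose(rel_original, rel_regauged)
print(gauge_invariant, absolute_changed)
```

The \(S^{-1} S\) factors cancel inside the product, which is why the regression target stays the same no matter where the world origin is placed.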

Sim(3) Gauge Decoupling — Orthogonal Scale Learning: Inspired by scale-invariant (SI-Log) approaches, geometry learning and metric scale estimation are decoupled at both the architectural and objective levels:
- Geometry branch: Optimized in normalized space with loss \(\mathcal{L}_{geom} = \|\tilde{X}_{pred} - \tilde{X}_{gt}\|_1\), where \(\tilde{X} = X / \text{Norm}(X)\), ensuring \(\partial\mathcal{L}/\partial s = 0\).
- Scale head: Independently predicts a global scale factor \(s = \exp(\mathbf{w}^\top \mathbf{h}_{scale})\), trained only on metrically calibrated data.
- Scale affects only translation, depth, and point clouds; rotation and field of view are unaffected, achieving complete decoupling.
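
A minimal sketch of the two branches, under stated assumptions: `Norm(X)` is taken to be the mean Euclidean norm of the points (the summary does not specify it), the L1 loss is averaged rather than summed, and the function names are hypothetical.

```python
import numpy as np

def normalize(points):
    """Divide a point set by its mean Euclidean norm (one assumed choice of Norm(X))."""
    return points / np.linalg.norm(points, axis=-1).mean()

def geom_loss(pred, gt):
    """Mean absolute error between normalized point sets (averaged L1)."""
    return np.abs(normalize(pred) - normalize(gt)).mean()

def scale_from_features(w, h_scale):
    """Scale head s = exp(w^T h_scale); the exponential keeps s strictly positive."""
    return float(np.exp(w @ h_scale))

rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 3))
pred = gt + 0.05 * rng.normal(size=(100, 3))

loss_original = geom_loss(pred, gt)
loss_rescaled = geom_loss(7.5 * pred, gt)  # a global rescaling of the prediction
print(np.isclose(loss_original, loss_rescaled))
```

Because the normalization divides out any global factor, rescaling the prediction leaves the geometry loss unchanged, so the geometry branch receives no gradient with respect to scale and the metric scale has to come from the separate head.
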
Cache-Consistent Training (CCT): Addresses attention sink dependency and KV cache contamination:
- Constant sink tokens are removed during training; pure causal masking with a sliding window is used instead.
- KV caches are explicitly propagated and pruned between training chunks, ensuring that cache visibility during training exactly matches inference.
- For very long sequences, periodic cache refresh is introduced: sink frames and KV caches are hard-reset every \(N\) keyframes (analogous to state marginalization in SLAM). Because the entire model operates in keyframe-relative coordinates, this reset does not break consistency.
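
The cache policy described above can be sketched as a toy data structure. This is an illustration, not the authors' implementation: the cache stores frame ids in place of key/value tensors, `StreamingCache`, `window`, and `refresh_every` are hypothetical names, and "reset every \(N\) keyframes" is interpreted here as clearing the cache when keyframe \(N+1\) arrives.

```python
from collections import deque

class StreamingCache:
    """Sliding-window cache with a periodic hard reset (toy model of the policy)."""

    def __init__(self, window, refresh_every):
        # deque(maxlen=...) drops the oldest entry automatically: the sliding window.
        self.entries = deque(maxlen=window)
        self.refresh_every = refresh_every
        self.keyframes_since_reset = 0

    def append(self, frame_id, is_keyframe):
        if is_keyframe:
            self.keyframes_since_reset += 1
            if self.keyframes_since_reset > self.refresh_every:
                # Hard reset: discard stale entries and restart from this keyframe.
                self.entries.clear()
                self.keyframes_since_reset = 1
        self.entries.append(frame_id)

    def visible(self):
        """Frames the current step may attend to."""
        return list(self.entries)

cache = StreamingCache(window=4, refresh_every=2)
history = []
for frame_id in range(10):
    cache.append(frame_id, is_keyframe=(frame_id % 3 == 0))
    history.append(cache.visible())
print(history[5], history[6], history[9])
```

Running the same `append` logic in both training and inference is the point of cache-consistent training: the set returned by `visible()` is identical in the two settings.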

Loss & Training¶

Framed probabilistically as posterior maximization, training minimizes the sum of four losses:

\[\mathcal{L} = \mathcal{L}_{geom} + \mathcal{L}_{depth} + \mathcal{L}_{pose} + \mathcal{L}_{scale}\]

\(\mathcal{L}_{pose}\): L1 loss on rotation (quaternion), translation (normalized space), and focal length offset across iterative updates, with decay weight \(\gamma^{t-1}\).
\(\mathcal{L}_{geom}\): Normalized point cloud L1 (scale-invariant).
\(\mathcal{L}_{depth}\): Depth supervision.
\(\mathcal{L}_{scale}\): Log-space metric scale L1 \(\|\log\hat{s} - \log s_{gt}\|_1\) (applied only on calibrated data).
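
As a rough sketch of how these terms combine, the snippet below applies the \(\gamma^{t-1}\) weighting exactly as written above and adds the scale term only for calibrated samples. The value `gamma=0.8`, the scalar per-iteration errors, and all function names are assumptions made for illustration.

```python
import math

def pose_loss(per_iteration_errors, gamma=0.8):
    """Sum of gamma**(t-1) * error_t over iterative updates t = 1..T."""
    return sum(gamma ** (t - 1) * err
               for t, err in enumerate(per_iteration_errors, start=1))

def scale_loss(s_pred, s_gt):
    """Log-space L1 between predicted and ground-truth metric scale."""
    return abs(math.log(s_pred) - math.log(s_gt))

def total_loss(geom, depth, pose, s_pred=None, s_gt=None):
    """L = L_geom + L_depth + L_pose (+ L_scale when metric scale is available)."""
    loss = geom + depth + pose
    if s_pred is not None and s_gt is not None:  # metrically calibrated sample
        loss += scale_loss(s_pred, s_gt)
    return loss

print(pose_loss([1.0, 1.0, 1.0], gamma=0.5))
print(scale_loss(2.0, 1.0), scale_loss(0.5, 1.0))
```

Working in log space makes over- and under-estimating the scale by the same factor cost the same, which is why the two `scale_loss` calls agree.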

Key Experimental Results¶

Main Results¶

Dataset	Metric	LongStream	Prev. Best Streaming	Gain
KITTI (11 sequences)	Mean ATE↓	51.90	177.73 (TTT3R)	−70.8%
KITTI seq00 (3.7 km)	ATE	92.55	190.93 (TTT3R)	−51.5%
KITTI seq04 (0.4 km)	ATE	1.95	11.62 (TTT3R)	−83.2%
TUM-RGBD	ATE	Best	Stream3R/StreamVGGT collapse	—
Waymo	ATE	Best	Severe drift in prior methods	—

Note: LongStream runs at 18 FPS in real time, and its memory use and latency remain stable on long sequences (unlike models such as VGGT, which run out of memory).
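
The Gain column in the table above is the relative change in ATE with respect to the previous best streaming method. A short check using only the numbers reported in the table (the helper name `relative_change` is hypothetical):

```python
def relative_change(ours, baseline):
    """Percentage change of `ours` relative to `baseline` (negative = lower error)."""
    return 100.0 * (ours - baseline) / baseline

# (LongStream ATE, previous best ATE) as listed in the results table.
rows = {
    "KITTI mean": (51.90, 177.73),
    "KITTI seq00": (92.55, 190.93),
    "KITTI seq04": (1.95, 11.62),
}
gains = {name: round(relative_change(ours, prev), 1)
         for name, (ours, prev) in rows.items()}
print(gains)
```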

Ablation Study¶

Configuration	Observed Behavior	Notes
No CCT + causal inference	Strong attention sink, poor accuracy	Train-inference inconsistency causes degradation
No CCT + sliding window inference (with sink)	Amplified sink, reduced accuracy	Sliding window amplifies sink bias
CCT + causal inference	Sink strongly suppressed, best accuracy	CCT eliminates sink dependency
CCT + sliding window inference	Sink similarly suppressed, strong accuracy	CCT effective across all inference modes

Key Findings¶

Existing streaming methods collapse within tens of meters, while LongStream remains stable on kilometer-scale sequences.
Attention sink is not a useful feature — it is a byproduct of train-inference inconsistency; its elimination via CCT yields substantial performance gains.
The keyframe-relative pose formulation converts long-range extrapolation into a bounded local estimation problem, which is the theoretical basis for long-sequence stability.
Periodic cache refresh is naturally compatible with keyframe-relative coordinates, requiring no additional alignment.
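
The compatibility noted in the last finding follows from the identity \(\mathbf{T}_i = \mathbf{T}_{i \leftarrow k} \circ \mathbf{T}_k\): a global trajectory can be rebuilt by composition alone, whatever the cache contained when each relative pose was predicted. The sketch below illustrates this with synthetic world-to-camera poses (an assumed convention); it is not the authors' pipeline, and `chain_keyframe_relative` is a hypothetical helper.

```python
import numpy as np

def rigid_transform(yaw, translation):
    """Build a 4x4 SE(3) matrix: rotation about z by `yaw`, plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def chain_keyframe_relative(relative_poses, keyframe_ids):
    """Recover global poses via T_i = T_{i<-k} o T_k.

    The first frame is its own keyframe and fixes the gauge at identity.
    """
    global_poses = []
    for i, (T_rel, k) in enumerate(zip(relative_poses, keyframe_ids)):
        T_k = np.eye(4) if k == i else global_poses[k]
        global_poses.append(T_rel @ T_k)
    return global_poses

# Synthetic ground-truth trajectory with frame 0 at the origin.
gt = [np.eye(4)] + [rigid_transform(0.1 * i, [float(i), 0.2 * i, 0.0])
                    for i in range(1, 5)]
keyframe_ids = [0, 0, 0, 2, 2]  # frames 3 and 4 refer to keyframe 2
relative = [gt[i] @ np.linalg.inv(gt[k]) for i, k in enumerate(keyframe_ids)]

recovered = chain_keyframe_relative(relative, keyframe_ids)
max_error = max(np.abs(r - g).max() for r, g in zip(recovered, gt))
print(max_error < 1e-9)
```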

Highlights & Insights¶

Gauge decoupling: The two fundamental degrees of freedom responsible for long-sequence degradation — SE(3) coordinate freedom and Sim(3) scale freedom — are identified at the theoretical level and addressed with elegant, targeted solutions.
Reinterpreting attention sink: The sink is shown not to be an intrinsic requirement of streaming Transformers, but rather a symptom of the train-inference gap. CCT directly eliminates this gap.
Periodic cache refresh: Inspired by state marginalization in SLAM, this mechanism pairs seamlessly with keyframe-relative coordinates.
Practical significance: 18 FPS throughput, kilometer-scale stability, and metric scale make LongStream genuinely deployable in autonomous driving and related applications.

Limitations & Future Work¶

Keyframe selection strategy is not discussed in detail; adaptive keyframe scheduling may be needed for fast motion or texture-poor scenes.
The scale head is trained only on metrically calibrated data; scale quality on uncalibrated data depends on the training data distribution.
The value of \(N\) for periodic cache refresh may require scene-specific tuning.
Advantages over baselines are less pronounced on small-scale indoor scenes (e.g., TUM-RGBD) than outdoors.
Dynamic object handling is not discussed; additional processing may be required for scenes with numerous moving objects.

Related Work & Insights¶

VGGT/StreamVGGT: Direct baselines for LongStream, illustrating how absolute pose regression anchored to the first frame degrades on long sequences.
DUSt3R/MASt3R: Offline reconstruction methods whose ideas LongStream extends to the streaming setting.
CUT3R: RNN-based streaming reconstruction; LongStream replaces the RNN with a Transformer + KV cache for improved long-range dependency modeling.
The keyframe-relative pose idea generalizes to other vision tasks requiring long-sequence processing.
The train-inference consistency principle underlying CCT has reference value for all streaming models using KV caches.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Three contributions (gauge decoupling, CCT, and periodic cache refresh), each with clear theoretical motivation and an independent role.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple indoor and outdoor datasets with comprehensive visualizations, including attention map analysis.
Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is precise, theoretical derivations are clear, and figures are highly informative.
Value: ⭐⭐⭐⭐⭐ First system to achieve kilometer-scale real-time streaming reconstruction, with significant implications for practical deployment in autonomous driving and robotics.