Speed3R: Sparse Feed-forward 3D Reconstruction Models¶
- Conference: CVPR 2026
- arXiv: 2603.08055
- Code: https://visual-ai.github.io/speed3r/
- Area: 3D Vision
- Keywords: 3D Reconstruction, Sparse Attention, Feed-forward, Inference Acceleration, Structure-from-Motion
TL;DR¶
Speed3R introduces a trainable dual-branch Global Sparse Attention (GSA) mechanism for feed-forward 3D reconstruction models. A compression branch provides coarse-grained scene summaries while a selection branch focuses fine-grained attention on critical tokens, achieving a 12.4× inference speedup on 1024-view sequences with only marginal accuracy degradation.
Background & Motivation¶
Background: Recent feed-forward 3D reconstruction models (VGGT, \(\pi^3\)) can jointly infer dense geometry and camera poses in a single forward pass, bypassing the multi-stage pipelines of classical SfM/MVS.
Limitations of Prior Work: These models rely on dense global attention, whose computational cost scales as \(O(n^2)\) with the number of tokens. Inference speed becomes a severe bottleneck at large view counts or high resolutions — for example, \(\pi^3\) requires 202 seconds to process 1024 images.
Key Challenge: Training-free approaches such as FastVGGT (token merge-unmerge) and Block-Sparse VGGT (top-k attention) cannot be optimized end-to-end, and aggressive pruning leads to significant accuracy degradation.
Core Insight: The classical SfM intuition — that sparse keypoints suffice for robust pose estimation — has not yet been fully exploited by feed-forward methods.
Goal: Motivated by both SfM and sparse attention in LLMs (NSA, MoBA), this work designs an end-to-end trainable sparse attention mechanism and transfers dense-model performance to it via knowledge distillation.
Method¶
Overall Architecture¶
Speed3R adopts a three-stage architecture:
- Per-frame Feature Encoder: \(N\) images \(\{I_i\}_{i=1}^N\) are independently processed by a visual encoder (e.g., DINOv2) to extract patch-level feature tokens.
- Alternating Attention Transformer: Multiple Transformer blocks alternate between local intra-frame attention (Frame Attention) and Global Sparse Attention (GSA, the core contribution), replacing the dense global attention in the original models.
- Task-specific Prediction Heads: Refined tokens are fed into downstream heads to predict per-view camera parameters \(\{\hat{C}_i\}\), depth maps \(\{\hat{D}_i\}\), and associated uncertainties \(\{\hat{\alpha}_i\}\).
Key Designs: Global Sparse Attention (GSA)¶
The core idea of GSA is coarse-to-fine: first build a global scene understanding from low-resolution representations, then guide the model to attend only to the most informative token subset in high-resolution space.
Input Decomposition: The GSA input \(X \in \mathbb{R}^{M \times C}\) is formed by concatenating special tokens \(X_{\text{spec}}\) and image tokens \(X_{\text{img}}\). Projections \(W_Q, W_K, W_V\) generate Q/K/V, which are split by token type:
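A reconstruction from the definitions above, writing the concatenation as \(X = [X_{\text{spec}}; X_{\text{img}}]\):

\[
Q = X W_Q, \quad K = X W_K, \quad V = X W_V, \qquad
Q = [Q_{\text{spec}}; Q_{\text{img}}], \quad K = [K_{\text{spec}}; K_{\text{img}}], \quad V = [V_{\text{spec}}; V_{\text{img}}]
\]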
Full Attention for Special Tokens: Special tokens (e.g., pose tokens) serve as global information bottlenecks for critical tasks such as pose estimation, and attend to all tokens via standard dense self-attention:
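In standard scaled-dot-product form (a reconstruction, with \(d\) the head dimension):

\[
O_{\text{spec}} = \operatorname{softmax}\!\left(\frac{Q_{\text{spec}} K^{\top}}{\sqrt{d}}\right) V
\]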
Since \(M_{\text{spec}}\) is small, this step incurs negligible overhead.
Dual-branch Sparse Attention for Image Tokens: The large number of image tokens is handled via a dual-branch strategy.
Compression Branch¶
Provides an efficient coarse-grained global scene summary. \(Q_{\text{img}}, K_{\text{img}}, V_{\text{img}}\) are spatially downsampled using \(s \times s\) non-overlapping average pooling, yielding compressed tensors \(Q_{\text{comp}}, K_{\text{comp}}, V_{\text{comp}} \in \mathbb{R}^{M'_{\text{img}} \times d}\) where \(M'_{\text{img}} = M_{\text{img}} / s^2\).
Attention is computed in the compressed space:
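Again a standard scaled-dot-product form, consistent with the definitions above:

\[
O'_{\text{comp}} = \operatorname{softmax}\!\left(\frac{Q_{\text{comp}} K_{\text{comp}}^{\top}}{\sqrt{d}}\right) V_{\text{comp}}
\]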
A guidance score matrix is also computed for use by the selection branch:
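One plausible choice is to reuse the compressed attention weights (whether the paper takes pre- or post-softmax scores is an assumption here):

\[
S_{\text{guide}} = \operatorname{softmax}\!\left(\frac{Q_{\text{comp}} K_{\text{comp}}^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{M'_{\text{img}} \times M'_{\text{img}}}
\]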
The coarse output is upsampled back to the original resolution via nearest-neighbor interpolation: \(O_{\text{comp}} = \text{Upsample}(O'_{\text{comp}})\).
Selection Branch¶
Recovers fine-grained attention. Using the guidance scores \(S_{\text{guide}}\), \(\text{TopKSelect}(\cdot)\) identifies the most relevant coarse-region indices for each query. The corresponding \(K_{\text{sel}}, V_{\text{sel}}\) are then retrieved from the full-resolution \(K_{\text{img}}, V_{\text{img}}\) (queries within the same compression window share the same KV pairs):
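Reconstructed from this description, with \(\operatorname{Gather}\) denoting retrieval of the full-resolution keys/values of the selected coarse regions:

\[
(K_{\text{sel}}, V_{\text{sel}}) = \operatorname{Gather}\big(K_{\text{img}}, V_{\text{img}};\, \operatorname{TopKSelect}(S_{\text{guide}}, k)\big), \qquad
O_{\text{sel}} = \operatorname{softmax}\!\left(\frac{Q_{\text{img}} K_{\text{sel}}^{\top}}{\sqrt{d}}\right) V_{\text{sel}}
\]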
Each query attends to only \(k \ll M_{\text{img}}\) tokens, making this step highly efficient.
Gated Aggregation¶
The outputs of both branches are dynamically fused via a learnable gating mechanism:
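Assuming a single complementary gate \(g\) (a two-gate variant would also fit the description):

\[
O_{\text{img}} = g \odot O_{\text{comp}} + (1 - g) \odot O_{\text{sel}}, \qquad g = \sigma(X_{\text{img}} W_g)
\]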
where \(\sigma\) denotes sigmoid and \(W_g\) is a learned projection matrix. The model adaptively determines for each token whether to emphasize the global summary or local detail.
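Putting the pieces together, here is a minimal, naive single-head PyTorch sketch of GSA over the image tokens. All names and shapes are hypothetical (the released implementation is batched, multi-head, and fused); for simplicity the tokens of all frames are arranged in a single \(H \times W\) patch grid, and \(S_{\text{guide}}\) is taken to be the compressed attention weights, as above:

```python
import torch
import torch.nn.functional as F

def gsa_image_tokens(x_img, Wq, Wk, Wv, Wg, H, W, s=4, k=32):
    """Naive single-head Global Sparse Attention over image tokens.

    x_img: (M_img, C) image tokens, M_img = H * W patches (H, W divisible
    by s; k must not exceed the number of coarse regions M' = M_img / s^2).
    Hypothetical sketch following the paper's description, not its code.
    """
    d = Wq.shape[1]
    q, kk, v = x_img @ Wq, x_img @ Wk, x_img @ Wv       # (M_img, d) each

    # ---- Compression branch: s x s non-overlapping average pooling ----
    def pool(t):
        t2 = t.T.reshape(1, d, H, W)                    # back to 2D grid
        return F.avg_pool2d(t2, s).reshape(d, -1).T     # (M', d)
    q_c, k_c, v_c = pool(q), pool(kk), pool(v)

    attn_c = F.softmax(q_c @ k_c.T / d ** 0.5, dim=-1)  # (M', M') coarse attn
    o_c = attn_c @ v_c                                  # coarse output
    # Upsample coarse output back to full resolution (nearest neighbor).
    o_comp = F.interpolate(o_c.T.reshape(1, d, H // s, W // s),
                           scale_factor=s, mode="nearest")
    o_comp = o_comp.reshape(d, -1).T                    # (M_img, d)

    # ---- Selection branch: top-k coarse regions per query region ----
    s_guide = attn_c                                    # guidance scores
    topk = s_guide.topk(k, dim=-1).indices              # (M', k)
    # Map each coarse region to the indices of its s*s full-res tokens.
    m_grid = torch.arange(H * W).reshape(H, W)
    win = m_grid.unfold(0, s, s).unfold(1, s, s).reshape(-1, s * s)
    sel = win[topk].reshape(topk.shape[0], -1)          # (M', k*s*s)

    o_sel = torch.empty_like(q)
    for r in range(sel.shape[0]):                       # one query region
        idx_q = win[r]                # queries in one window share KV pairs
        k_sel, v_sel = kk[sel[r]], v[sel[r]]
        a = F.softmax(q[idx_q] @ k_sel.T / d ** 0.5, dim=-1)
        o_sel[idx_q] = a @ v_sel

    # ---- Gated aggregation (assumed single complementary gate) ----
    g = torch.sigmoid(x_img @ Wg)                       # (M_img, d)
    return g * o_comp + (1 - g) * o_sel
```

Note that this naive version materializes the full \(S_{\text{guide}}\) matrix, which is exactly the memory problem the fused kernel below avoids.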
Efficient Triton Kernel Implementation¶
A naïve implementation would materialize the full \(S_{\text{guide}}\) score matrix, incurring excessive memory usage. Instead, a fused GSA Triton kernel integrates a streaming Top-K algorithm into the FlashAttention workflow: score-matrix tiles are computed on-chip in SRAM while running top-k index sets are maintained alongside, so region selection and the compressed output are produced in a single pass without ever materializing the full score matrix.
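The kernel itself is not reproduced here; the following pure-PyTorch sketch (tile size and names hypothetical) only illustrates the streaming Top-K idea of processing key tiles one at a time while carrying a running top-k set:

```python
import torch

def streaming_topk_scores(q_c, k_c, k=32, tile=128):
    """Running top-k over score tiles, never materializing the full matrix.

    q_c, k_c: (M', d) compressed queries/keys. Returns (M', k) indices of
    the highest-scoring coarse regions per query (assumes k <= M'). A fused
    Triton kernel would do this tile-by-tile in SRAM inside the
    FlashAttention loop.
    """
    m, d = q_c.shape
    best_val = torch.full((m, k), float("-inf"))
    best_idx = torch.zeros((m, k), dtype=torch.long)
    for start in range(0, m, tile):
        cols = torch.arange(start, min(start + tile, m))
        scores = q_c @ k_c[cols].T / d ** 0.5      # only this tile's scores
        # Merge the tile with the running top-k; keep the best k overall.
        vals = torch.cat([best_val, scores], dim=1)
        idxs = torch.cat([best_idx, cols.expand(m, -1)], dim=1)
        top = vals.topk(k, dim=1)
        best_val = top.values
        best_idx = idxs.gather(1, top.indices)
    return best_idx
```

Here `torch.topk` over the concatenated running set stands in for the on-chip merge; the actual kernel fuses this loop with the FlashAttention softmax accumulation in SRAM.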
Speed3R-VGGT Adaptation¶
VGGT treats the first frame as a global reference and employs dedicated camera tokens. To preserve reference-frame information, the selection-branch attention set consists of two components (formalized after the list):
- Fixed global context: all tokens from the reference frame plus tokens from every 100th frame
- Dynamic Top-K tokens: key token windows from non-reference frames determined by the standard selection procedure
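In set notation (a formalization of the two bullets, not an equation from the paper), an image-token query \(q\) attends over

\[
\mathcal{A}(q) = \underbrace{\mathcal{T}_{\text{ref}} \cup \{\mathcal{T}_{100j}\}_{j \geq 1}}_{\text{fixed global context}} \cup \underbrace{\operatorname{TopKSelect}\big(S_{\text{guide}}(q), k\big)}_{\text{dynamic Top-K}}
\]

where \(\mathcal{T}_i\) denotes the image tokens of frame \(i\).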
Speed3R-\(\pi^3\) Adaptation¶
\(\pi^3\) has no reference frame or camera token dependencies, so GSA can be applied directly. Experiments show that the register tokens in \(\pi^3\) can be removed in the sparse variant without affecting performance, further simplifying the model.
Loss & Training¶
- Knowledge Distillation: A pretrained dense model serves as the teacher; its depth and pose predictions are used as pseudo-labels to train the sparse student (see the sketch after this list).
- Total Loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{depth}} + \lambda \mathcal{L}_{\text{camera}}\)
- Data: A mixture of 7 datasets (ARKitScenes, ScanNet++, DL3DV, CO3D, Hypersim, WildRGB-D, Virtual KITTI 2)
- Training Configuration: 80 epochs (800 steps per epoch), 8× NVIDIA H20 GPUs for approximately 7 days, learning rate \(1 \times 10^{-5}\), gradient accumulation factor=4 (effective batch size 32)
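A minimal sketch of one distillation step, assuming the teacher and student map images to (depth, camera) outputs and using hypothetical L1/MSE criteria (the paper only specifies \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{depth}} + \lambda \mathcal{L}_{\text{camera}}\)):

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, images, lam=1.0):
    """One training step with the dense teacher's outputs as pseudo-labels.

    `student`/`teacher` are assumed to map images -> (depth, camera);
    the criteria below (L1 for depth, MSE for camera) are illustrative,
    not the paper's exact loss terms.
    """
    with torch.no_grad():
        t_depth, t_cam = teacher(images)      # pseudo-labels, no gradient
    s_depth, s_cam = student(images)
    loss_depth = F.l1_loss(s_depth, t_depth)
    loss_cam = F.mse_loss(s_cam, t_cam)
    return loss_depth + lam * loss_cam        # L_total = L_depth + λ·L_camera
```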
Key Experimental Results¶
Main Results: Multi-view Pose Estimation (RE10K / CO3Dv2)¶
| Method | Sparsity (%) | RE10K AUC@30↑ | CO3Dv2 AUC@30↑ |
|---|---|---|---|
| VGGT (dense) | 0 | 74.17 | 88.33 |
| Block Sparse-VGGT | 75 | 63.82 | 79.92 |
| FastVGGT | 82 | 69.99 | 84.03 |
| Speed3R-VGGT | 84 | 74.81 | 87.71 |
| \(\pi^3\) (dense) | 0 | 87.37 | 89.67 |
| Block Sparse-\(\pi^3\) | 75 | 75.39 | 80.72 |
| FastVGGT-\(\pi^3\) | 90 | 86.04 | 86.39 |
| Speed3R-\(\pi^3\) | 94 | 87.17 | 89.41 |
Key Findings:
- Speed3R-VGGT at 84% sparsity surpasses the dense VGGT baseline on RE10K (74.81 vs. 74.17)
- Speed3R-\(\pi^3\) at 94% sparsity nearly matches the performance of dense \(\pi^3\)
- Outperforms the training-free competitors on both backbones despite operating at higher sparsity (84% vs. 75–82% on VGGT; 94% vs. 75–90% on \(\pi^3\))
Long-sequence Pose Estimation (Tanks & Temples, avg. 300 images/scene)¶
| Method | RRA@5↑ | RTA@5↑ | AUC@30↑ | Time (s)↓ |
|---|---|---|---|---|
| VGGT (dense) | 70.29 | 79.30 | 77.67 | 34.51 |
| Block Sparse-VGGT | 66.83 | 71.29 | 74.15 | 10.79 |
| FastVGGT | 69.28 | 77.98 | 76.29 | 15.98 |
| Speed3R-VGGT | 69.51 | 77.81 | 76.57 | 6.55 |
| \(\pi^3\) (dense) | 72.14 | 81.26 | 79.63 | 22.32 |
| Block Sparse-\(\pi^3\) | 67.85 | 78.91 | 76.64 | 8.16 |
| FastVGGT-\(\pi^3\) | 69.78 | 79.51 | 77.76 | 11.96 |
| Speed3R-\(\pi^3\) | 70.72 | 80.72 | 79.77 | 4.19 |
Key Findings: Speed3R-\(\pi^3\) achieves the best results among sparse methods on all metrics while being the fastest (4.19s), 5.3× faster than dense \(\pi^3\).
Ablation Study (Speed3R-\(\pi^3\), RE10K & T&T)¶
| Configuration | RE10K AUC@30↑ | T&T AUC@30↑ | Time (s)↓ |
|---|---|---|---|
| Base (4×4 window, top-32) | 86.35 | 78.69 | 4.19 |
| Remove compression-branch value | 86.29 | 77.90 | 3.99 |
| Remove selection branch | 83.44 | 76.84 | 3.56 |
| Top-8 | 85.37 | 78.17 | 3.72 |
| Top-16 | 85.98 | 78.55 | 3.92 |
| Top-64 | 86.42 | 78.90 | 4.64 |
| 8×8 window | 86.49 | 78.71 | 5.27 |
| Without knowledge distillation | 85.18 | 77.81 | 4.19 |
Key Findings:
- Selection branch is essential: Removing it causes large drops on both datasets (RE10K −2.91, T&T −1.85)
- Compression branch matters for long sequences: Removing its Value has negligible effect on short sequences but hurts long sequences (T&T −0.79)
- Knowledge distillation is critical: Removing it reduces RE10K by 1.17 and T&T by 0.88, demonstrating its effectiveness in mitigating noisy labels in real-world datasets
- 4×4 window + top-32 is the optimal trade-off: top-8/16 are insufficient in accuracy; top-64 and 8×8 windows offer marginal gains at higher cost
Inference Latency Comparison¶
| Sequence Length | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|
| Full Attn. (\(\pi^3\)) | 0.50s | 1.31s | 3.97s | 13.41s | 50.01s | 202.39s |
| Block Sparse | 0.46s | 0.85s | 1.69s | 3.77s | 9.64s | 29.58s |
| FastVGGT | 0.44s | 0.88s | 1.96s | 4.95s | 14.13s | 45.49s |
| Speed3R | 0.37s | 0.71s | 1.44s | 3.06s | 6.83s | 16.38s |
At 1024 images, Speed3R requires only 16.38s vs. 202.39s for the dense model, yielding a 12.4× speedup.
Test-time Adaptation (Tanks & Temples)¶
Training uses top-32; increasing top-k at inference continuously improves long-sequence performance. At top-128, RTA@5 reaches 82.00, surpassing the dense model (81.26), and AUC@30 reaches 80.33, also exceeding the dense model (79.63), with a runtime of only 6.07s.
Highlights & Insights¶
- Bridging Classical and Modern Methods: Combines the SfM insight that sparse keypoints suffice for robust pose estimation with LLM sparse attention techniques, yielding a trainable sparse attention mechanism tailored to 3D reconstruction.
- Coarse-to-fine Dual-branch Design: The compression branch establishes global understanding, which then guides the selection branch to focus on critical regions, balancing global coverage with local precision.
- End-to-end Trainability: Joint optimization gives a significant advantage over training-free methods such as FastVGGT and Block-Sparse VGGT.
- General Plug-and-play Applicability: Successfully adapted to both VGGT and \(\pi^3\) architectures, demonstrating strong generalizability.
- Custom Triton Kernel: Fuses Top-K with FlashAttention for efficient memory access, avoiding materialization of the full score matrix.
Limitations & Future Work¶
- Short-sequence Accuracy Gap: A performance gap relative to dense models remains under strict thresholds (AUC@5), as high-precision pose regression is particularly challenging for sparse methods.
- Memory Overhead: The dual-branch GSA architecture incurs a 15% memory overhead compared to full attention, limiting a single 80GB GPU to at most 1024 images.
- Dependence on Pretrained Dense Models: The knowledge distillation strategy requires a high-quality dense teacher, increasing training pipeline complexity.
- Pose Regression vs. Generative Tasks: The high numerical precision demanded by pose regression makes 3D reconstruction less amenable to sparse attention than text or image generation.
Rating¶
⭐⭐⭐⭐ — The first trainable sparse attention method targeting feed-forward 3D reconstruction, with a practically significant 12.4× speedup, an elegant dual-branch design, and thorough ablations. However, accuracy gaps remain under strict short-sequence metrics, and the approach is inherently constrained by the high-precision demands of pose regression in 3D reconstruction.