Light3R-SfM: Towards Feed-forward Structure-from-Motion¶

Conference: CVPR 2025
arXiv: 2501.14914
Code: None
Area: 3D Vision/SfM
Keywords: feed-forward SfM, global alignment, attention mechanism, shortest path tree, pointmap regression

TL;DR¶

Light3R-SfM proposes the first feed-forward end-to-end SfM framework. It replaces the traditional optimization-based global alignment with a learnable latent global alignment module, and constructs a scene graph via a shortest path tree based on retrieval scores. It completes reconstruction in only 33 seconds on the Tanks&Temples 200-image setup (49x faster than MASt3R-SfM) while maintaining comparable accuracy.

Background & Motivation¶

Background: SfM recovers camera poses and 3D scene structure from unordered image sets, serving as the foundation for downstream tasks like NeRF and 3DGS. Traditional methods are categorized into incremental (COLMAP gradually adding images) and global (GLOMAP jointly aligning all cameras). Recently, DUSt3R proposed pairwise 3D reconstruction via pointmap regression, and MASt3R-SfM further improved it with image retrieval and sparse matching optimization, but still relies on expensive iterative optimization for global alignment.

Limitations of Prior Work: Optimization-based global alignment (such as bundle adjustment) is critical for accurate 3D reconstruction but is computationally extremely expensive—MASt3R-SfM takes about 27 minutes to process 200 images, requiring significant memory and time for medium-scale image sets. The existing feed-forward method, Spann3R, uses an explicit spatial memory bank, but is limited by fixed capacity and prone to cumulative drift.

Key Challenge: Accurate global alignment traditionally relies on expensive optimization (enforcing multi-view consistency plus bundle adjustment), whereas feed-forward inference must produce its result in a single forward pass. The question is how to obtain globally consistent camera poses without any optimization.

Goal: (1) Replace optimization-based global alignment with a learnable feed-forward approach; (2) drastically reduce running time while maintaining accuracy; (3) scale to large-scale image sets (hundreds of images).

Key Insight: The authors observe that if global information is shared among all image features via an attention mechanism during the encoding stage, the subsequent pairwise decoding can implicitly output globally consistent pointmaps, avoiding explicit global optimization. The key challenge is how to make the attention mechanism work efficiently on large-scale image sets.

Core Idea: Insert a scalable latent global alignment module between image encoding and 3D decoding, sharing implicit global information through self-attention on global tokens followed by cross-attention to local tokens, thereby replacing expensive optimization-based alignment.

Method¶

Overall Architecture¶

Given an unordered set of images, an image encoder first extracts token features for each image. Global information is then exchanged across all images through the latent global alignment module (comprising L layers of self-attention on global tokens + cross-attention to local tokens). The similarity matrix between images is computed using the average pooling of encoded features to construct a shortest path tree (SPT) as the scene graph. For each edge in the SPT, pairwise pointmap decoding is executed on the image pair to obtain locally aligned pointmaps and confidence scores. Finally, the SPT is traversed via BFS, and the pointmaps are accumulated into a global reconstruction using Procrustes alignment edge-by-edge.

Key Designs¶

Latent Global Alignment Module:
- Function: Enables global information sharing among all images in the feature space, allowing subsequent pairwise decoding to output implicitly globally aligned pointmaps.
- Mechanism: Performs spatial average pooling on the tokens \(F_i^{(0)}\) of each image to obtain a global token \(g_i^{(0)}\). A stack of L layers is applied: (1) self-attention is performed among all global tokens: \(\{g_i^{(l+1)}\} = \text{Self}(\{g_i^{(l)}\})\); (2) local tokens of each image perform cross-attention with all global tokens: \(F_i^{(l+1)} = \text{Cross}(F_i^{(l)}, \{g_i^{(l+1)}\})\). Finally, a residual connection is used: \(F_i = F_i^{(0)} + F_i^{(L)}\).
- Design Motivation: Naive self-attention over all tokens costs \(O((N \times T)^2)\) for N images with T tokens each and does not scale. With the global token factorization, self-attention runs over only the N global tokens, \(O(N^2)\), and each of the \(N \times T\) local tokens cross-attends to the N global tokens, \(O(N^2 \times T)\), which is roughly a factor of T cheaper than full attention.
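
The factorization described for the latent global alignment module can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the real module presumably uses learned projections, multiple heads, normalization, and MLPs, all of which are omitted here, and the names `attention` and `latent_global_alignment` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention (assumption: no learned weights)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def latent_global_alignment(F0, num_layers=2):
    """F0: (N, T, D) local tokens of N images. Returns updated (N, T, D) tokens."""
    N, T, D = F0.shape
    g = F0.mean(axis=1)                # (N, D): one global token per image
    F = F0.copy()
    for _ in range(num_layers):
        # (1) self-attention among the N global tokens: N x N score matrix
        g = attention(g, g, g)
        # (2) every local token attends to all N global tokens: (N*T) x N scores
        F = attention(F.reshape(N * T, D), g, g).reshape(N, T, D)
    return F0 + F                      # residual connection F_i = F_i^(0) + F_i^(L)
```

In this sketch the largest score matrix has \((N \times T) \times N\) entries, compared with \((N \times T) \times (N \times T)\) for attention over all tokens of all images.
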
Shortest Path Tree Scene Graph:
- Function: Connects all images with the minimum number of edges (N-1) while minimizing cumulative drift.
- Mechanism: Computes a cosine similarity matrix \(S_{ij}\) using average pooling of the encoded features, and runs Dijkstra's algorithm with negative similarity as edge weights to build the SPT. The node with the minimum total cost to all other nodes is selected as the root: \(\arg\min_j \sum_i -S_{ij}\). The difference between SPT and MST is that SPT minimizes the path cost from the root to each node, yielding a flatter tree structure.
- Design Motivation: MST minimizes total edge weight but may yield deep trees, leading to severe cumulative drift during BFS traversal; SPT generates flatter trees, reducing the propagation path length of errors. Moreover, it requires only N-1 edges, significantly reducing the amount of decoding compared to a fully connected graph.
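
A small sketch of the scene-graph construction, under stated assumptions: Dijkstra's algorithm needs non-negative edge weights, so this example uses `1 - S` as the edge cost instead of the raw negative similarity (the shift does not change the root selected by \(\arg\min_j \sum_i -S_{ij}\)). The function name `shortest_path_tree` and the toy similarity matrix are hypothetical.

```python
import heapq
import numpy as np

def shortest_path_tree(S):
    """S: (N, N) symmetric similarity matrix. Returns (root, parent list)."""
    N = S.shape[0]
    cost = 1.0 - S                      # assumption: shifted so that costs are >= 0
    np.fill_diagonal(cost, 0.0)
    root = int(np.argmin(cost.sum(axis=0)))   # node with minimum total cost to all others
    dist = [float("inf")] * N
    parent = [-1] * N                   # parent[root] stays -1
    done = [False] * N
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if done[u]:
            continue                    # stale heap entry
        done[u] = True
        for v in range(N):
            nd = d + float(cost[u, v])
            if v != u and not done[v] and nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return root, parent

# Toy example: 5 images along a line, similarity decays with index distance.
idx = np.arange(5)
S = 1.0 - (np.abs(idx[:, None] - idx[None, :]) / 4.0) ** 2
root, parent = shortest_path_tree(S)
```

For this toy matrix the middle image becomes the root and the tree has N-1 = 4 edges, so only four image pairs would need to be decoded instead of all ten.
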
Global Optimization-free Reconstruction:
- Function: Accumulates pairwise local pointmaps into a globally consistent reconstruction.
- Mechanism: Traverses the SPT via BFS, initializing the global reconstruction with the first edge. For each subsequent edge (k,l), node k already has a global pointmap \(X^k\). Procrustes alignment between \(X^k\) and the edge's prediction \(X^{k,k}\), weighted by \(\log C^k\), estimates a rigid transformation \(P_k\). The pointmap of node l is then transformed to the global coordinate frame: \(X^l = P_k^{-1} X^{k,l}\). The confidence is updated via the element-wise geometric mean: \(C^k := \sqrt{C^k \odot C^{k,k}}\).
- Design Motivation: Procrustes alignment has a closed-form solution. Its computational cost is linear with respect to the number of images and is negligible, making it far more efficient than bundle adjustment.
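
The closed-form alignment step can be sketched as a weighted Kabsch/Procrustes solve. This is an illustrative sketch, not the paper's code: the function names are hypothetical, confidences are assumed to be at least 1 so that their logarithm is non-negative, and the transform is estimated directly from the edge's local frame to the global frame, which plays the role of \(P_k^{-1}\).

```python
import numpy as np

def weighted_procrustes(src, dst, w):
    """Rigid (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2."""
    w = w / w.sum()
    mu_s = w @ src                                     # weighted centroids, shape (3,)
    mu_d = w @ dst
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))   # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

def register_edge(X_k_global, X_kk, X_kl, C_k):
    """Place node l in the global frame using the two pointmaps of node k.

    X_k_global, X_kk, X_kl: (H, W, 3) pointmaps; C_k: (H, W) confidences >= 1.
    """
    w = np.log(C_k).reshape(-1) + 1e-8   # log-confidence weights; epsilon avoids all-zero weights
    R, t = weighted_procrustes(X_kk.reshape(-1, 3), X_k_global.reshape(-1, 3), w)
    return X_kl @ R.T + t                # local edge frame -> global frame
```

Each edge costs one 3x3 SVD plus a few reductions over the pixels, so the total cost over the N-1 tree edges grows linearly with the number of images.
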

Loss & Training¶

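Local and global pointmaps are supervised jointly: \(\mathcal{L} = \mathcal{L}_{\text{pair}} + \lambda \mathcal{L}_{\text{global}}\) (\(\lambda=0.1\)). The local loss \(\mathcal{L}_{\text{pair}}\) applies a confidence-weighted regression loss to the two pointmaps of each edge: \(\mathcal{L}_{\text{conf}} = \sum_p C_p \|X_p - \bar{X}_p\| - \alpha \log C_p\), where the \(-\alpha \log C_p\) regularizer (as in DUSt3R) keeps the confidences from collapsing toward zero. The global loss \(\mathcal{L}_{\text{global}}\) first aligns the accumulated global pointmap to the ground-truth coordinate system via Procrustes alignment and then applies the same confidence-weighted loss. Because errors in the accumulated pointmaps reflect errors in the relative poses, the global loss implicitly supervises pose accuracy.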
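
As a rough illustration of the objective, the sketch below evaluates a confidence-weighted pointmap loss with the DUSt3R-style \(-\alpha \log C_p\) regularizer and combines a pairwise and a global term with \(\lambda = 0.1\). The function names and the default `alpha=0.2` are assumptions for illustration, and the Procrustes alignment of the global pointmap to the ground truth is assumed to have been applied before calling `conf_weighted_loss`.

```python
import numpy as np

def conf_weighted_loss(X_pred, X_gt, C, alpha=0.2):
    """sum_p C_p * ||X_p - X_gt_p|| - alpha * log(C_p), with C_p > 0."""
    err = np.linalg.norm(X_pred - X_gt, axis=-1)   # per-point Euclidean error
    return float(np.sum(C * err - alpha * np.log(C)))

def total_loss(pair_loss, global_loss, lam=0.1):
    """L = L_pair + lambda * L_global."""
    return pair_loss + lam * global_loss

# Toy check: four points, each off by exactly 1 unit, confidence 1 everywhere.
X_gt = np.zeros((4, 3))
X_pred = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
C = np.ones(4)
pair = conf_weighted_loss(X_pred, X_gt, C)         # 4 * 1 - 0.2 * log(1) = 4.0
```

Raising a point's confidence increases the penalty on its error but is rewarded by the log term, so the network is pushed to report high confidence only where its prediction is accurate.
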

Key Experimental Results¶

Main Results¶

Tanks & Temples, 200 images (OPT: optimization-based alignment; FFD: feed-forward alignment):

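| Method | Alignment | RRA@5↑ | RTA@5↑ | ATE↓ | Reg.↑ | Time (s)↓ |
|---|---|---|---|---|---|---|
| COLMAP | OPT | 64.7 | 57.7 | 0.019 | 97.0 | - |
| GLOMAP | OPT | 73.5 | 74.8 | 0.016 | 100 | 536.7 |
| VGGSfM | OPT | 84.5 | 86.3 | 0.007 | 47.6 | 1511.6 |
| MASt3R-SfM | OPT | 68.2 | 68.4 | 0.013 | 100 | 1609.0 |
| Spann3R | FFD | 22.8 | 28.6 | 0.019 | 100 | 60.4 |
| Light3R-SfM | FFD | 52.4 | 53.1 | 0.016 | 100 | 33.4 |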

Tanks & Temples, full sequence:

| Method | Alignment | RRA@5↑ | RTA@5↑ | ATE↓ | Time (s)↓ |
|---|---|---|---|---|---|
| GLOMAP | OPT | 75.8 | 76.7 | 0.010 | 1977.7 |
| MASt3R-SfM | OPT | 49.2 | 54.0 | 0.011 | 2723.1 |
| Spann3R | FFD | 20.3 | 24.7 | 0.016 | 116.2 |
| Light3R-SfM | FFD | 52.0 | 52.8 | 0.011 | 63.4 |

Ablation Study¶

| Configuration | Key Metrics | Description |
|---|---|---|
| No global alignment (DUSt3R baseline) | ATE ~0.03+ | No global information sharing, large errors |
| MST scene graph | Slightly higher ATE | Deep tree causes cumulative drift |
| SPT scene graph | Best ATE | Flatter tree reduces drift |
| With global loss | Best ATE | Implicitly supervises global consistency |
| Without global loss | Degraded ATE | Lack of global consistency constraints |

Key Findings¶

Light3R-SfM significantly outperforms Spann3R among feed-forward methods: RRA 52.4 vs 22.8 (2.3x) and ATE 0.016 vs 0.019 under the 200-image setup.
About 48x faster than MASt3R-SfM by the measured timings (33.4 s vs 1609.0 s; the rounded 33 seconds vs 27 minutes gives roughly 49x), with an ATE difference of only 0.003 (0.016 vs 0.013).
Achieves a 100% registration rate (all images are successfully registered), whereas COLMAP and VGGSfM register only 97.0% and 47.6% of the images, respectively, in the 200-image setup.
Under the full sequence setup, the RRA of Light3R-SfM even surpasses MASt3R-SfM (52.0 vs 49.2), suggesting that optimization-based methods also face challenges when the number of images is very large.
SPT is better suited for cumulative reconstruction than MST because a flatter tree minimizes cumulative drift.

Highlights & Insights¶

The concept of "replacing optimization with attention for global alignment" is a paradigm shift in the SfM field—replacing traditional bundle adjustment with an end-to-end learnable attention mechanism fundamentally eliminates the computational bottleneck of SfM. Global token factorization makes this idea feasible on large-scale image sets.
Replacing MST with SPT for scene graph construction is clever—while both are tree structures with N-1 edges, the SPT results in a flatter tree shape, physically reducing the error accumulation path during BFS traversal. This simple modification is applicable to any incremental method based on tree traversal.
The design of the global loss allows the training to implicitly supervise pose accuracy—no direct supervision on pose parameters is needed. Instead, pose consistency is indirectly constrained through reconstruction error after global pointmap alignment.

Limitations & Future Work¶

Accuracy still lags behind optimization-based methods (GLOMAP, MASt3R-SfM), especially under dense view setups.
The global token is merely average pooling of the image tokens, and such information compression might be too aggressive.
The choice of root node and the BFS traversal order in SPT affect the results, but a systematic analysis is lacking.
Cumulative alignment still introduces drift, particularly when the tree is deep.
A post-processing bundle adjustment is not integrated—while the authors argue this is the direction for feed-forward SfM, optional optimization refinement might be needed in practical applications.
Adaptive scene graph construction (not limited to N-1 edges) could be explored to trade off between accuracy and efficiency.

vs MASt3R-SfM: MASt3R-SfM achieves higher accuracy in the 200-image setup through sparse matching, optimization-based alignment, and bundle adjustment, but requires about 27 minutes for 200 images; Light3R-SfM finishes in about 33 seconds in a feed-forward manner, with lower RRA/RTA (52.4/53.1 vs 68.2/68.4) and a slightly higher ATE (0.016 vs 0.013).
vs Spann3R: Spann3R uses an explicit spatial memory bank for online reconstruction, constrained by fixed capacity and prone to drift; Light3R-SfM achieves offline global information sharing via attention, outperforming it across the board.
vs DUSt3R: DUSt3R requires pairwise reconstruction for all exhaustive image pairs followed by optimization alignment, which is not scalable; Light3R-SfM uses SPT to reduce the decoding burden and replaces optimization with latent alignment.
Insight: The relationship between feed-forward SfM and optimization-based SfM is similar to that between feed-forward depth estimation and MVS optimization—the former is faster but has an accuracy gap, which may gradually narrow through larger models and datasets.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The first practical feed-forward SfM framework with a novel latent global alignment module design.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple image-count settings, comparisons with several baselines, and runtime analysis, though validation on more datasets is lacking.
Writing Quality: ⭐⭐⭐⭐ Clear framework and strong motivation arguments.
Value: ⭐⭐⭐⭐⭐ Opens up a new direction for feed-forward SfM with important practical value for large-scale 3D reconstruction, featuring an impressive 49x speedup.