Global-Aware Edge Prioritization for Pose Graph Initialization¶
Conference: CVPR 2026 arXiv: 2602.21963 Code: GitHub Area: 3D Vision Keywords: Structure-from-Motion, Pose Graph Initialization, Graph Neural Network, Minimum Spanning Tree, Edge Ranking
TL;DR¶
This paper proposes a GNN-based global edge prioritization method that upgrades pose graph initialization from independent pairwise image retrieval to globally structure-aware edge ranking combined with multi-minimum-spanning-tree construction, achieving significant improvements in SfM reconstruction accuracy under extremely sparse settings.
Background & Motivation¶
Background: Structure-from-Motion (SfM) is a classical problem of reconstructing 3D structure and camera poses from image collections. Both incremental (COLMAP) and global (GLOMAP) SfM pipelines begin with the same step: constructing an initial pose graph—selecting a sparse subset from \(\binom{N}{2}\) candidate image pairs for geometric verification.
Limitations of Prior Work:

- Existing methods almost exclusively rely on per-image retrieval (e.g., NetVLAD, CosPlace, MegaLoc), independently connecting each image to its \(k\) nearest neighbors.
- This greedy strategy ignores global structure: it may produce elongated chains, weakly connected regions, or multiple loosely coupled sub-structures.
- Once the initial edges are selected, subsequent stages can only remove, not add, edges; any globally critical connections missed at initialization are irrecoverable.
- Re-ranking methods (Patch-NetVLAD, VOP) still operate at the pairwise level and cannot reason about global topology.
Key Challenge: The pose graph serves as the structural backbone of SfM, and its quality determines reconstruction success. However, current initialization strategies consider only local visual similarity and lack global reasoning capability, making them particularly fragile in sparse or ambiguous scenes.
Key Insight: This paper introduces the concept of "edge priority"—rather than evaluating image pairs independently, all candidate edges are ranked according to their global utility for SfM, and a compact yet globally connected pose graph is then constructed via multi-minimum-spanning-tree building.
Method¶
Overall Architecture¶
Given \(N\) images, the goal is to construct an initial pose graph \(\mathcal{G}_0 = (\mathcal{V}, \mathcal{E}_0)\). The pipeline consists of three steps:

1. Image Encoding: Global descriptors \(d_i\) are extracted using DINOv2+SALAD.
2. GNN Edge Ranking: Message passing via a GNN over the complete graph predicts a global matchability score \(\hat{r}_{ij}\) for each edge.
3. Multi-MST Graph Construction: Edges are selected through iterative minimum spanning tree construction based on the predicted scores, augmented by connectivity-aware score modulation.
Key Designs¶
- GNN Edge Ranking Predictor
Edge-node message passing is performed over the complete graph of image embeddings. Each directed edge is initialized as:
\(e_{ij}^0 = \text{ReLU}(f_l[d_i, d_j, \langle d_i, d_j \rangle])\)
Over two rounds of message passing, edge features are updated as \(e_{ij}^t = f_{\text{edge}}([e_{ij}^{t-1}, d_i^t, d_j^t])\); each node aggregates incoming messages, \(m_i^t = \frac{1}{N}\sum_j m_{ji}^t\), and is updated as \(d_i^{t+1} = f_{\text{update}}([d_i^t, m_i^t])\). A final MLP then predicts the edge ranking score \(\hat{r}_{ij} = f_{\text{MLP}}(e_{ij}^2)\).
Design Motivation: Pairwise cosine similarity cannot perceive global structure. GNN message passing enables each edge's score to incorporate contextual information from the entire image collection, thereby distinguishing edges that are "locally similar but globally useless."
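The two message-passing rounds above can be sketched in numpy. This is an illustrative skeleton only: the learned maps \(f_l\), \(f_{\text{edge}}\), \(f_{\text{update}}\), \(f_{\text{MLP}}\) are replaced by random projections, and both the interpretation of \(m_{ji}\) as the incoming edge feature \(e_{ji}\) and the sigmoid squashing of the final score are our assumptions, not details stated in the source.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gnn_edge_scores(d, rounds=2, seed=0):
    """Sketch of edge-node message passing over the complete graph.

    d: (N, D) image descriptors. All "learned" layers below are random
    projections, used purely to make the tensor shapes concrete.
    """
    rng = np.random.default_rng(seed)
    n, dim = d.shape
    f_l      = rng.standard_normal((2 * dim + 1, dim)) * 0.1
    f_edge   = rng.standard_normal((3 * dim, dim)) * 0.1
    f_update = rng.standard_normal((2 * dim, dim)) * 0.1
    f_mlp    = rng.standard_normal((dim, 1)) * 0.1

    # Edge init: e_ij^0 = ReLU(f_l([d_i, d_j, <d_i, d_j>])), all pairs at once.
    di = np.broadcast_to(d[:, None, :], (n, n, dim))
    dj = np.broadcast_to(d[None, :, :], (n, n, dim))
    dots = (d @ d.T)[..., None]
    e = relu(np.concatenate([di, dj, dots], axis=-1) @ f_l)

    for _ in range(rounds):
        # Edge update: e_ij^t = f_edge([e_ij^{t-1}, d_i^t, d_j^t])
        di = np.broadcast_to(d[:, None, :], (n, n, dim))
        dj = np.broadcast_to(d[None, :, :], (n, n, dim))
        e = relu(np.concatenate([e, di, dj], axis=-1) @ f_edge)
        # Node update: m_i = (1/N) sum_j e_{ji}, then d_i <- f_update([d_i, m_i])
        m = e.mean(axis=0)  # aggregate over the source index j
        d = relu(np.concatenate([d, m], axis=-1) @ f_update)

    # \hat{r}_{ij} = f_MLP(e_ij^2), squashed to [0, 1] for use as a score.
    return sigmoid(e @ f_mlp)[..., 0]
```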
- Geometric Self-Supervised Signals
Supervision is generated by the SfM pipeline itself, requiring no manual annotation:

- RANSAC inlier count \(u_{ij}\): measures immediate geometric verifiability between two views.
- Common visible 3D point count \(v_{ij}\): measures the long-term contribution of two views to the global reconstruction.
- The two signals are normalized and combined as \(\tilde{r}_{ij} = \frac{1}{2}(\text{norm}(u_{ij}) + \text{norm}(v_{ij}))\).
The training loss adopts NDCGLoss2++ (a differentiable ranking loss), optimizing the NDCG consistency between predicted and ground-truth rankings rather than regressing absolute values.
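A minimal sketch of the label construction, assuming min-max normalization (the source only says the two signals are normalized and averaged):

```python
import numpy as np

def ranking_labels(inliers, covis):
    """Self-supervised ranking target \tilde{r}_{ij} for a batch of edges.

    inliers: RANSAC inlier counts u_ij; covis: common visible 3D-point
    counts v_ij. Min-max normalization is an assumption on our part.
    """
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return 0.5 * (minmax(inliers) + minmax(covis))
```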
- Multi-Minimum Spanning Tree + Connectivity-Aware Modulation
- Predicted scores are converted to weights \(w_{ij} = 1 - \hat{r}_{ij}\), and Kruskal's algorithm is applied iteratively to construct \(k\) MSTs.
- Each new MST excludes already-selected edges (assigned \(\infty\) weight), ensuring complementary edge sets are selected.
- Key Innovation: From the second MST onward, scores are modulated using graph distance:
\(s_{ij}^{(m)} = (1-\lambda)\hat{r}_{ij} + \lambda \bar{d}^{(m-1)}(i,j)\)
where \(\bar{d}^{(m-1)}(i,j)\) is the normalized shortest-path distance in the current graph. High-scoring edges between distant nodes are prioritized, effectively reducing graph diameter and reinforcing weakly connected regions. Only the top-5 candidate edges per image are updated, and edges with scores below 0.9 are discarded to prevent unreliable edges from being erroneously promoted.
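The iterative selection above can be sketched as follows. Hop-count shortest paths, normalization by the maximum observed distance, and the treatment of disconnected pairs as maximally distant are our assumptions; the top-5-per-image and 0.9-threshold filtering from the paper is omitted for brevity.

```python
import numpy as np
from collections import deque

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def bfs_dists(n, adj, src):
    """Hop-count shortest-path distances from src in the current graph."""
    dist = [np.inf] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == np.inf:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def multi_mst(scores, k=2, lam=0.5):
    """Select edges via k iterative MSTs with connectivity-aware modulation.

    scores: (N, N) symmetric predicted scores \hat{r}_{ij} in [0, 1].
    """
    n = scores.shape[0]
    selected = set()                     # edges stored as (i, j) with i < j
    adj = [set() for _ in range(n)]
    for m in range(k):
        s = scores.copy()
        if m > 0:
            # s_ij^(m) = (1 - lam) * r_ij + lam * dbar^(m-1)(i, j)
            d = np.array([bfs_dists(n, adj, i) for i in range(n)])
            d[~np.isfinite(d)] = n       # disconnected pairs: maximal distance
            dbar = d / max(d.max(), 1)   # normalize to [0, 1]
            s = (1 - lam) * scores + lam * dbar
        # Kruskal on weights w = 1 - s, excluding already-selected edges.
        pairs = [(1 - s[i, j], i, j) for i in range(n) for j in range(i + 1, n)
                 if (i, j) not in selected]
        parent = list(range(n))
        for w, i, j in sorted(pairs):
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:                 # edge joins two components
                parent[ri] = rj
                selected.add((i, j))
                adj[i].add(j); adj[j].add(i)
    return selected
```

Each pass after the first re-scores candidate edges so that high-scoring edges bridging distant components win ties, which is what shrinks the graph diameter.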
Loss & Training¶
- Loss: NDCGLoss2++, a differentiable NDCG approximation based on LambdaRank.
- Training Data: 153 MegaDepth scenes; each batch samples 240 images from a single scene, with at most 4 batches per scene.
- Optimizer: AdamW, learning rate \(10^{-5}\), 50 epochs.
- Large-Scale Extension: For scenes with >500 images, METIS graph partitioning splits the graph into subgraphs for batched inference, with overlapping edge scores averaged.
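NDCGLoss2++ itself involves LambdaRank-style pairwise weighting; as a reference point, the plain (non-differentiable) NDCG metric it approximates can be computed as below. The exponential gain and log2 discount follow the standard NDCG definition, not anything specific to this paper.

```python
import numpy as np

def ndcg(pred_scores, true_rel, k=None):
    """Standard NDCG@k of the ranking induced by pred_scores against true_rel."""
    pred_scores = np.asarray(pred_scores, dtype=float)
    true_rel = np.asarray(true_rel, dtype=float)
    k = k or len(true_rel)
    order = np.argsort(-pred_scores)[:k]        # ranking induced by predictions
    ideal = np.sort(true_rel)[::-1][:k]         # best achievable ordering
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum((2.0 ** true_rel[order] - 1) * discounts)
    idcg = np.sum((2.0 ** ideal - 1) * discounts)
    return dcg / idcg if idcg > 0 else 0.0
```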
Key Experimental Results¶
Main Results¶
| Dataset | Metric | Ours (k=2 MSTs) | MegaLoc | Gain |
|---|---|---|---|---|
| IMC23-PhotoTourism | AUC@5° | ~71.7 | ~65.3 | +6.4 |
| MegaDepth | AUC@5° | Outperforms all baselines | Second best | Largest margin at k=1–2 |
| VisymScenes | % correctly reconstructed cameras (k=5) | >75% | <75% | Surpasses DoppelGanger++ |
Ablation Study¶
| Configuration | AUC@5° (k=1/2/3/5) | Notes |
|---|---|---|
| Full method | 64.2/71.7/72.6/73.5 | All components enabled |
| w/o GNN | 55.4/70.4/72.3/72.3 | Large drop at k=1, confirming importance of global reasoning |
| SALAD backbone | 61.2/71.0/72.4/73.4 | Slightly lower but still far above raw SALAD, confirming backbone agnosticism |
| Oracle-RANSAC | 65.7/72.4/73.0/74.1 | Best at small k (immediate verifiability) |
| Oracle-3D points | 65.4/72.1/73.5/74.3 | Best at large k (long-term utility) |
| kNN vs. MST edge selection | MST far superior to kNN | kNN prone to fragmentation; MST guarantees connectivity |
Key Findings¶
- Global edge ranking provides the greatest advantage under extremely sparse settings (k=1–2 MSTs); methods converge as k increases.
- Every baseline improves when its edges are selected via the MST framework rather than its native kNN selection, demonstrating that MST itself is the stronger structural prior.
- Connectivity modulation is particularly critical in ambiguous scenes (VisymScenes), improving the score from 61.9 to 66.0 (at k=2).
- The method surpasses the specialized DoppelGanger++ filtering algorithm on VisymScenes without retraining, demonstrating that global reasoning naturally suppresses misleading edges.
- Inference overhead: GNN prediction is slower than pure retrieval but negligible compared to overall COLMAP runtime.
Highlights & Insights¶
- Precise problem formulation: Redefining pose graph initialization from "retrieval" to a "ranking" problem accurately identifies the most critical and irreversible bottleneck in the SfM pipeline.
- Elegant self-supervised signal design: Using SfM's own outputs (RANSAC inliers + 3D co-visibility points) as ranking supervision yields two complementary and fully automatic signals.
- Strong generality of the MST framework: Not only does the proposed method benefit, but all baselines also improve under the MST framework, demonstrating that MST is a fundamentally superior graph construction strategy compared to kNN.
- Simple yet effective connectivity modulation: Linearly modulating scores using graph shortest-path distances incurs virtually no additional overhead while significantly improving global topology.
Limitations & Future Work¶
- GNN inference at test time requires message passing over the complete graph; large-scale scenes necessitate graph partitioning approximations, and more efficient sparse GNN architectures could be explored.
- Only two rounds of message passing are currently used; deeper GNNs may capture richer global patterns.
- The modulation weight \(\lambda\) is fixed; learnable adaptive weighting could be investigated.
- The combination with global SfM methods (e.g., GLOMAP) has not been explored.
- Failure cases involving low-resolution images and small-scale landmarks remain challenging.
Related Work & Insights¶
- MegaLoc/SALAD: Currently the strongest retrieval baselines, but still operating at the pairwise evaluation level.
- DoppelGanger++: A specialized edge filtering method for ambiguous scenes, but operating after geometric verification; the proposed method suppresses ambiguity prior to verification.
- PoGO-Net, Damblon et al.: Apply GNNs to optimize/filter existing pose graphs; this paper is the first to apply GNNs to the initialization stage.
- Insights: Global reasoning combined with structure-aware selection strategies can be generalized to other graph construction problems (e.g., scene graph construction, point cloud registration graphs).
Rating¶
- Novelty: ⭐⭐⭐⭐ Redefines pose graph initialization as a global ranking problem; the combination of GNN + multi-MST + connectivity modulation is entirely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, multiple baselines, detailed ablations, oracle comparisons, and timing analysis — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Problem statement is clear, method motivation is well-justified, and formulations and figures are well-presented.
- Value: ⭐⭐⭐⭐ Directly pluggable into existing SfM pipelines with high practical value in sparse and ambiguous scenarios.