Skip to content

Global-Aware Edge Prioritization for Pose Graph Initialization

Conference: CVPR 2026
arXiv: 2602.21963
Code: GitHub
Area: 3D Vision
Keywords: Structure-from-Motion, Pose Graph Initialization, Graph Neural Networks, Minimum Spanning Tree, Edge Ranking

TL;DR

A GNN-based global edge prioritization method is proposed, upgrading pose graph initialization from independent pairwise image retrieval to global structure-aware edge ranking and multi-MST construction, significantly improving SfM reconstruction accuracy in extremely sparse settings.

Background & Motivation

Background: Structure-from-Motion (SfM) is a classic problem for reconstructing 3D structures and camera poses from image collections. Whether incremental (COLMAP) or global (GLOMAP), all SfM pipelines begin with the same step: constructing an initial pose graph by selecting a sparse subset of pairs for geometric verification from \(\binom{N}{2}\) candidate image pairs.

Limitations of Prior Work: - Current methods rely almost entirely on per-image retrieval (e.g., NetVLAD, CosPlace, MegaLoc), independently connecting each image to its \(k\) nearest neighbors. - This greedy strategy ignores global structure: it may produce elongated chains, weakly connected regions, or multiple loosely coupled sub-structures. - Once initial edges are selected, subsequent stages only delete but never add; missing critical global connections is irreversible. - Re-ranking methods (Patch-NetVLAD, VOP) still operate at the pairwise level and cannot perceive global topology.

Key Challenge: The pose graph is the structural skeleton of SfM, and its quality determines the success of the reconstruction. However, current initialization strategies only consider local visual similarity and lack global reasoning capabilities, making them particularly fragile in sparse or ambiguous scenes.

Key Insight: This work introduces the concept of "edge priority"—instead of evaluating image pairs independently, all candidate edges are ranked based on their global utility for SfM, and a compact, globally connected pose graph is constructed via multiple Minimum Spanning Trees (MSTs).

Method

Overall Architecture

Given \(N\) images, the goal is to construct an initial pose graph \(\mathcal{G}_0 = (\mathcal{V}, \mathcal{E}_0)\) upon which subsequent SfM performs geometric verification and reconstruction. Unlike traditional pipelines that collect edges by connecting each image to \(k\) nearest neighbors via per-image retrieval, this work adopts a two-stage process: "global ranking followed by structure-aware selection." First, images are encoded into global descriptors \(d_i\) using DINOv2+SALAD. Then, a GNN performs message propagation on the complete graph of all images to predict a global matchability score \(\hat{r}_{ij}\) for each candidate edge. Finally, instead of using a greedy threshold, edges are selected sequentially via multiple MSTs, with scores dynamically modulated based on the current graph's connectivity to ensure the resulting graph is both compact and globally connected. This global edge ranking is trained using geometric self-supervision signals derived directly from SfM outputs, requiring no manual annotation.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    A["N Input Images"] --> B["DINOv2+SALAD Encoding<br/>Global Descriptors d_i"]
    B --> C["GNN Edge Ranking Predictor<br/>Two-round Edge-Node Message Passing on Complete Graph"]
    C --> D["Edge Global Matchability Score r̂_ij"]
    D --> E["Multi-MST + Connectivity-Aware Modulation<br/>Build k MSTs via Kruskal, Modulating Complementary Edges by Graph Distance"]
    E --> F["Initial Pose Graph G0<br/>→ Downstream SfM Geometric Verification & Reconstruction"]
    subgraph SUP["Geometric Self-Supervision (Training Supervision)"]
        direction TB
        G["RANSAC Inlier Count u_ij<br/>Co-visible 3D Points v_ij"] --> H["Target Ranking Score r̃_ij<br/>NDCGLoss2++ Ranking Loss"]
    end
    H -.Training Labels.-> C

Key Designs

1. GNN Edge Ranking Predictor: Enabling Global Context for Every Edge Score

Pairwise cosine similarity fails because it evaluates image pairs in isolation, failing to distinguish edges that "look locally similar but are useless for global reconstruction." This method performs alternating edge-node message passing on the complete graph formed by image embeddings. Each directed edge is initialized as \(e_{ij}^0 = \text{ReLU}(f_l[d_i, d_j, \langle d_i, d_j \rangle])\), concatenating endpoints' descriptors and their inner product. In subsequent propagation rounds, edge features are updated via \(e_{ij}^t = f_{\text{edge}}([e_{ij}^{t-1}, d_i^t, d_j^t])\), and nodes refresh themselves by aggregating neighbor messages \(m_i^t = \frac{1}{N}\sum_j m_{ji}^t\) using \(d_i^{t+1} = f_{\text{update}}([d_i^t, m_i^t])\). A final MLP reads out the edge features into ranking scores \(\hat{r}_{ij} = f_{\text{MLP}}(e_{ij}^2)\). Through this propagation, every edge score incorporates the context of the entire image set, allowing the model to suppress globally redundant or misleading edges.

2. Geometric Self-Supervision: Using SfM Outputs as Ranking Labels

The challenge is the lack of existing annotations for "global utility of edges." Instead of manual labeling, two complementary signals are extracted directly from the SfM pipeline: the RANSAC inlier count \(u_{ij}\) reflects immediate geometric verifiability, while the number of co-visible 3D points \(v_{ij}\) reflects the long-term contribution of the edge to the global reconstruction. These are combined into a target score \(\tilde{r}_{ij} = \frac{1}{2}(\text{norm}(u_{ij}) + \text{norm}(v_{ij}))\). Training uses NDCGLoss2++, a differentiable ranking loss that optimizes the consistency between predicted and ground-truth rankings, focusing on relative importance rather than absolute value regression.

3. Multi-MST + Connectivity-Aware Modulation: Selecting Globally Connected Compact Edges

To avoid fragmented graphs resulting from simple thresholding or kNN, scores are converted to weights \(w_{ij} = 1 - \hat{r}_{ij}\). Kruskal's algorithm is used to iteratively construct \(k\) MSTs, with selected edges excluded from subsequent rounds to force the selection of complementary edges. Starting from the second MST, scores are modulated using the current graph distance:

\[s_{ij}^{(m)} = (1-\lambda)\hat{r}_{ij} + \lambda \bar{d}^{(m-1)}(i,j)\]

where \(\bar{d}^{(m-1)}(i,j)\) is the normalized shortest path distance in the current graph. This raises the priority of strong edges whose endpoints are distant in the graph, effectively filling "cross-region shortcuts" to reduce graph diameter and reinforce weak connections. To avoid promoting noise, modulation only applies to the top-5 candidate edges per image and discards edges with scores below 0.9.

Loss & Training

  • Loss: NDCGLoss2++, a differentiable NDCG approximation based on LambdaRank.
  • Training Data: 153 scenes from MegaDepth, taking 240 images per scene per batch, with up to 4 batches per scene.
  • Optimizer: AdamW, learning rate \(10^{-5}\), 50 epochs.
  • Large-scale Scaling: For >500 images, METIS graph partitioning is used for sub-graph batch inference, with overlapping edge scores averaged.

Key Experimental Results

Main Results

Dataset Metric Ours (k=2 MSTs) MegaLoc Gain
IMC23-PhotoTourism AUC@5° ~71.7 ~65.3 +6.4
MegaDepth AUC@5° Outperforms all baselines Second-best Largest advantage at k=1-2
VisymScenes Correct Camera% (k=5) >75% <75% Surpasses DoppelGanger++

Ablation Study

Configuration AUC@5° (k=1/2/3/5) Description
Full Method 64.2/71.7/72.6/73.5 All components enabled
Without GNN 55.4/70.4/72.3/72.3 Large drop at k=1, proving global reasoning importance
SALAD Backbone 61.2/71.0/72.4/73.4 Slight drop but far exceeds original SALAD; proves backbone agnosticism
Oracle-RANSAC 65.7/72.4/73.0/74.1 Optimal for small k (immediate verifiability)
Oracle-3D Points 65.4/72.1/73.5/74.3 Optimal for large k (long-term utility)
kNN vs. MST MST far superior to kNN kNN easily fragments; MST ensures connectivity

Key Findings

  • Global edge prioritization shows the greatest advantage in extremely sparse settings (k=1-2 MSTs); methods converge as k increases.
  • All baselines perform better under the MST framework than their native kNN selection, suggesting MST is inherently a better structural prior.
  • Connectivity modulation is crucial in ambiguous scenes (VisymScenes), increasing scores from 61.9 to 66.0 (at k=2).
  • Ours outperforms the specialized Doppelganger++ filtering algorithm on VisymScenes without retraining, proving global reasoning naturally suppresses misleading edges.
  • Inference overhead: GNN prediction is slower than pure retrieval but negligible compared to COLMAP processing time.

Highlights & Insights

  • Precise Problem Definition: Redefining pose graph initialization from a "retrieval" to a "ranking" problem addresses the most critical and irreversible bottleneck in the SfM pipeline.
  • Clever Self-Supervision: Using SfM outputs (RANSAC inliers + 3D co-visibility) creates complementary, fully automated supervision signals.
  • Versatile MST Framework: Not only does the proposed method benefit, but all baselines improve under the MST framework, indicating MST is a superior graph construction strategy compared to kNN.
  • Effective Connectivity Modulation: Linearly modulating scores with shortest path distance significantly improves global topology with almost no additional overhead.

Limitations & Future Work

  • GNN inference requires message passing on a complete graph; sub-graph partitioning is needed for large scenes. More efficient sparse GNNs can be explored.
  • Currently, only two rounds of message passing are used; deeper GNNs might capture richer global patterns.
  • The modulation weight \(\lambda\) is fixed; learnable adaptive weights could be explored.
  • Integration with Global SfM (e.g., GLOMAP) has not yet been investigated.
  • Success rate remains limited for low-resolution and small-scale landmarks in failure cases.
  • MegaLoc/SALAD: Current strongest retrieval baselines, though still restricted to pairwise evaluation.
  • DoppelGanger++: An edge filtering method for ambiguous scenes acting after geometric verification; the proposed method suppresses ambiguity before verification.
  • PoGO-Net, Damblon et al.: Works using GNNs to optimize/filter existing pose graphs; this is the first to apply GNNs to the initialization stage.
  • Insight: Global reasoning + structure-aware selection strategies can be extended to other graph construction problems (e.g., scene graph construction, point cloud registration graphs).

Rating

  • Novelty: ⭐⭐⭐⭐ Reframing pose graph initialization as a global ranking problem using GNN+Multi-MST+Connectivity Modulation is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across three datasets, multiple baselines, detailed ablations, oracle comparisons, and timing analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem statement, well-motivated method, and effective formulas/illustrations.
  • Value: ⭐⭐⭐⭐ High practical value as it is directly pluggable into existing SfM pipelines, especially for sparse and ambiguous scenes.