Dense-SfM: Structure from Motion with Dense Consistent Matching

Conference: CVPR 2025
arXiv: 2501.14277
Code: https://icetea-cv.github.io/densesfm/
Area: 3D Vision
Keywords: Structure from Motion, Dense Matching, Gaussian Splatting, Feature Tracks, Multi-view Consistency

TL;DR

Dense-SfM is a framework that resolves the fragmented-track problem of dense matching through Gaussian Splatting-based track expansion, and pairs it with a multi-view kernelized match refinement module built on a Transformer and Gaussian Processes to achieve high-precision dense SfM reconstruction.

Background & Motivation

Traditional SfM relies on sparse keypoint matching, which limits accuracy and density in textureless regions. Recent dense matching methods (e.g., DKM, RoMa) yield reliable matches in low-texture regions, but because they operate on image pairs, the resulting feature tracks are fragmented: most tracks span only two views, which makes them difficult to use directly in SfM.

The existing solution, DFSfM, adopts quantized matching to merge sub-pixel matches into grid nodes. Although this increases track length and consistency, it:
- Significantly degrades matching accuracy due to the quantization operation
- Reduces the number of matches, lowering point cloud density
- Relies heavily on subsequent refinement modules

The core insight of this paper is that Gaussian Splatting can be utilized to evaluate the visibility of 3D points in different views, thereby extending track lengths without sacrificing original matching accuracy.

Method

Overall Architecture

A three-stage pipeline:
1. Initial SfM: Uses a dense matcher (DKM/RoMa) for pairwise matching, keeps only reliable matches via bidirectional verification, and performs triangulation to construct an initial SfM model.
2. Track Expansion: Uses Gaussian Splatting to evaluate the visibility of each 3D point, projecting points onto more views to extend tracks.
3. Iterative Refinement: Iteratively optimizes the reconstruction via a multi-view kernelized matching module and geometric Bundle Adjustment.

Key Designs

Bidirectional Verified Dense Matching:
- Function: Selects highly reliable correspondences from dense matching results.
- Mechanism: Runs dense matching in both directions (A→B and B→A) and keeps a match only if its cycle-consistency error satisfies \(\|p_a - p_{a'}\|_2 \leq \epsilon_p\) (\(\epsilon_p=3\)px), where \(p_{a'}\) is the point obtained by mapping \(p_a\) to B and back to A. This is similar in spirit to the mutual nearest neighbor check. Candidates are first sampled via non-maximum suppression and then filtered by bidirectional verification.
- Design Motivation: Dense matchers generate many low-quality matches, making bidirectional verification an efficient means of quality control.
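The cycle-consistency check described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `warp_ab` and `warp_ba` are hypothetical callables standing in for a dense matcher's A→B and B→A predictions (returning `None` when a point falls outside the other image), and non-maximum suppression is assumed to have already produced `points_a`.

```python
import math
from typing import Callable, Iterable, List, Optional, Tuple

Point = Tuple[float, float]
Warp = Callable[[Point], Optional[Point]]  # hypothetical dense-matcher lookup


def bidirectional_filter(
    points_a: Iterable[Point],
    warp_ab: Warp,
    warp_ba: Warp,
    eps_p: float = 3.0,
) -> List[Tuple[Point, Point]]:
    """Keep matches (p_a, p_b) whose round trip A -> B -> A lands within eps_p pixels."""
    kept = []
    for p_a in points_a:
        p_b = warp_ab(p_a)            # forward match A -> B
        if p_b is None:
            continue
        p_a_back = warp_ba(p_b)       # backward match B -> A
        if p_a_back is None:
            continue
        # Cycle-consistency error ||p_a - p_a'||_2 compared against eps_p.
        if math.dist(p_a, p_a_back) <= eps_p:
            kept.append((p_a, p_b))
    return kept
```

With `eps_p=3.0` the threshold matches the 3 px value quoted above; a real pipeline would evaluate the two warps in batch on the matcher's output tensors rather than point by point.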
Gaussian Splatting-Based Track Expansion:
- Function: Solves the issue of fragmented short tracks generated by pairwise matching, expanding tracks from 2 views to multiple views.
- Mechanism: Initializes small Gaussians from the 3D points of the initial SfM (positions set to the point coordinates, rotations set to identity, opacity set to 1), then runs optimization and densification to cover the scene while keeping the parameters of the initial Gaussians fixed. During rendering, a point is judged visible in a view when \(M = [\max_{r \in R}\{\alpha_{SfM}\prod_{j=1}^{N_{SfM}-1}(1-\alpha_j)\} > \epsilon_v]\) holds (\(\epsilon_v=0.5\)), i.e. when the blending weight of its Gaussian exceeds the threshold for at least one ray \(r \in R\). Visible points are then projected to new views and added to the tracks after geometric verification.
- Design Motivation: Gaussian Splatting offers fast training and rendering, allowing efficient occlusion reasoning. The projection approach preserves matching accuracy without loss.
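To make the visibility test concrete, the sketch below evaluates the blending weight \(\alpha_{SfM}\prod_{j}(1-\alpha_j)\) on each ray and applies the \(\epsilon_v\) threshold. It assumes the per-ray alpha values are already sorted front to back and that the index of the SfM Gaussian on each ray is known; the function names and the input layout are illustrative assumptions, not the paper's renderer interface.

```python
from math import prod
from typing import Iterable, Sequence, Tuple


def blending_weight(alphas_front_to_back: Sequence[float], sfm_index: int) -> float:
    """alpha_SfM times the transmittance left after all Gaussians in front of it."""
    transmittance = prod(1.0 - a for a in alphas_front_to_back[:sfm_index])
    return alphas_front_to_back[sfm_index] * transmittance


def is_visible(rays: Iterable[Tuple[Sequence[float], int]], eps_v: float = 0.5) -> bool:
    """rays: (alphas_front_to_back, sfm_index) pairs, one per ray in R.

    The point counts as visible if its best blending weight over all rays exceeds eps_v.
    """
    best = max((blending_weight(a, i) for a, i in rays), default=0.0)
    return best > eps_v
```

For example, an SfM Gaussian with alpha 0.9 behind two faint Gaussians (alphas 0.1 and 0.2) keeps a weight of 0.9 × 0.9 × 0.8 = 0.648 and passes the 0.5 threshold, whereas the same Gaussian behind an occluder with alpha 0.8 drops to 0.18 and is rejected.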
Multi-View Kernelized Matching Module:
- Function: Refines expanded feature tracks to improve multi-view consistency.
- Mechanism: Processing is split into two paths. (1) Feature Path: Core feature maps of the reference and query views are processed with Transformer self- and cross-attention. (2) Coordinate Embedding Path: The posterior mean \(\mu(\mathbf{F}_R|\mathbf{F}_{Q_i})\) is computed with Gaussian Processes using an exponential cosine similarity kernel and combined with positional encoding to provide spatial geometric information. The concatenated features from both paths are fed into a CNN decoder that outputs pixel-wise coordinate probability distributions and confidence scores.
- Design Motivation: The refinement module in DFSfM relies solely on statistical variance as an uncertainty metric. Dense-SfM instead learns confidence directly with a neural network, using a joint loss \(\mathcal{L} = \frac{1}{N}\sum \left( s_{Q_i} \cdot \|p_{Q_i}-p_{gt}\|_2 - \alpha \log s_{Q_i} \right)\) to optimize accuracy and reliability simultaneously.
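The coordinate embedding path can be illustrated with the standard Gaussian Process regression posterior mean, \(\mu = K_{RQ}(K_{QQ} + \sigma^2 I)^{-1} y_Q\). The sketch below is a hedged illustration only: the kernel form \(k(x, y) = \exp(-(1 - \cos(x, y))/\tau)\), the temperature `tau`, the noise level `noise`, and the choice to regress raw 2D coordinates (skipping the positional encoding) are all assumptions made for the example rather than details confirmed by this summary.

```python
import numpy as np


def exp_cosine_kernel(x: np.ndarray, y: np.ndarray, tau: float = 0.2) -> np.ndarray:
    """Assumed kernel: exp(-(1 - cosine_similarity) / tau) between rows of x and y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return np.exp(-(1.0 - x @ y.T) / tau)


def gp_posterior_mean(
    feat_ref: np.ndarray,      # (M, D) reference-view features
    feat_query: np.ndarray,    # (N, D) query-view features
    coords_query: np.ndarray,  # (N, 2) coordinates attached to the query features
    noise: float = 0.1,        # assumed observation-noise variance
    tau: float = 0.2,          # assumed kernel temperature
) -> np.ndarray:
    """Posterior mean of the query coordinates, evaluated at the reference features."""
    k_qq = exp_cosine_kernel(feat_query, feat_query, tau)
    k_rq = exp_cosine_kernel(feat_ref, feat_query, tau)
    # Solve (K_QQ + noise * I) w = y instead of forming an explicit inverse.
    weights = np.linalg.solve(k_qq + noise * np.eye(len(feat_query)), coords_query)
    return k_rq @ weights
```

When a reference feature coincides with one of the query features and the noise is small, the posterior mean returns (approximately) that query's coordinate, which is the behavior one would want from a feature-conditioned coordinate regressor.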

Loss & Training

The refinement module is trained on the MegaDepth dataset using GT tracks provided by SfM models.
During training, random noise is added to the GT 2D positions as input.
The loss function combines an accuracy term and a confidence regularization term (\(\alpha=20\)); the latter penalizes low confidence and so prevents the predicted confidence from collapsing toward zero.
Bundle Adjustment is iterated twice, and points with a reprojection error exceeding \(\epsilon_f=3\)px are filtered.
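The confidence-weighted loss described above can be written out directly. The sketch assumes the sum runs over both terms for each query view and that confidences are strictly positive; the function name and the plain-list inputs are illustrative choices, not the authors' training code.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def confidence_weighted_loss(
    pred: Sequence[Point],
    gt: Sequence[Point],
    conf: Sequence[float],
    alpha: float = 20.0,
) -> float:
    """Mean over samples of  s * ||p - p_gt||_2  -  alpha * log(s)."""
    if not pred or not (len(pred) == len(gt) == len(conf)):
        raise ValueError("pred, gt and conf must be non-empty and of equal length")
    total = 0.0
    for p, g, s in zip(pred, gt, conf):
        # Accuracy term scaled by confidence, plus a penalty on low confidence.
        total += s * math.dist(p, g) - alpha * math.log(s)
    return total / len(pred)
```

Setting the derivative of a single term with respect to \(s\) to zero gives \(s^* = \alpha / \|p - p_{gt}\|_2\), so for a fixed prediction error the preferred confidence shrinks as the error grows, while the \(-\alpha \log s\) term keeps it from reaching zero.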

Key Experimental Results

Main Results (ETH3D 3D Triangulation)

| Method | Accuracy@2cm (%) | Completeness@2cm (%) | Description |
| --- | --- | --- | --- |
| SP+SG+PixSfM | 87.04 | 2.77 | Detector-based methods |
| LoFTR+DFSfM | 89.01 | 11.07 | Semi-dense + Quantization |
| RoMa+DFSfM | 88.42 | 9.79 | Dense + Quantization |
| RoMa+Ours | 92.62 | 17.06 | Dense + GS track expansion |

Ablation Study (from Tab. 3 in the paper)

| Configuration | Key Metric | Description |
| --- | --- | --- |
| Without track expansion | Lower accuracy and completeness | Fragmented short tracks limit refinement |
| + Track expansion (GS) | Significant improvement | Longer tracks provide more geometric constraints |
| DFSfM refinement module | Sub-optimal | Uncertainty metric based on statistical variance |
| Ours refinement module | Optimal | Learned confidence + Gaussian Process |

Key Findings

Dense-SfM (RoMa) achieves 92.62% accuracy@2cm and 17.06% completeness@2cm on ETH3D, the best result on both metrics among the listed baselines.
Compared to RoMa+DFSfM (which uses quantized matching), accuracy rises by 4.20 percentage points (88.42% → 92.62%) and completeness by 7.27 points (9.79% → 17.06%, a relative gain of about 74%), which shows the advantage of avoiding quantization.
It also performs well on the Texture-Poor SfM dataset, supporting its effectiveness in low-texture scenes.
Track expansion enables the subsequent refinement stage to yield higher-quality tracks, demonstrating that the two processes are mutually beneficial.
The Gaussian Process path in multi-view kernelized matching provides crucial auxiliary spatial position information.

Highlights & Insights

GS as a Visibility Tool: Transforms Gaussian Splatting from a rendering utility into a visibility reasoning tool within the SfM pipeline, offering a novel perspective.
Lossless Track Expansion: Extends tracks via projection and geometric verification rather than re-matching, fully preserving the original matching accuracy.
Dual-Path Feature + Coordinate Processing: The Transformer handles appearance features, while the Gaussian Process models spatial coordinates, processing these two complementary information sources.
Comparison with MASt3R-SfM: MASt3R-SfM performs poorly on ETH3D triangulation (with an accuracy@2cm of only 43.9%), indicating that end-to-end approaches still underperform in high-precision triangulation.

Limitations & Future Work

Training the Gaussian Splatting model introduces additional computational overhead.
Bidirectional dense matching is computationally expensive (requiring two forward passes per pair of images).
Track expansion depends on the camera pose quality of the initial SfM.
Integration with learning-based SfM methods (e.g., VGGSfM) remains unexplored.
The non-maximum suppression radius has to be set differently for each dataset.

DFSfM's quantization strategy increases track consistency but sacrifices accuracy and density; Dense-SfM avoids this trade-off by extending tracks without quantizing the matches.
PixSfM refines keypoints by minimizing a featuremetric error, whereas Dense-SfM employs a dual-path Transformer+GP architecture and reports better results.
Insight: There remains substantial design space for integrating dense matching with SfM, where track consistency remains a core challenge.

Rating

Novelty: ⭐⭐⭐⭐ The GS track-expansion scheme is innovative, and the GP+Transformer dual-path refinement is cleverly designed.
Experimental Thoroughness: ⭐⭐⭐⭐ Extensive coverage across three datasets (ETH3D, Texture-Poor, IMC) with thorough ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear flowcharts and well-explained design motivations for individual components.
Value: ⭐⭐⭐⭐ Successfully addresses practical pain points of incorporating dense matches into SfM, offering strong utility.