CoMatcher: Multi-View Collaborative Feature Matching¶

Conference: CVPR 2025
arXiv: 2504.01872
Code: https://github.com/EATMustard/CoMatcher
Area: 3D Vision
Keywords: Multi-view feature matching, Collaborative inference, Cross-view consistency, Wide-baseline matching, SfM

TL;DR¶

Proposes CoMatcher, a multi-view collaborative feature matcher that shifts from the independent two-view matching paradigm to a 1-to-N collaborative matching paradigm, leveraging contextual cues from complementary views and cross-view projective consistency constraints to improve matching reliability in complex scenes.

Background & Motivation¶

Feature matching is a core component of SfM/SLAM. The prevailing paradigm decomposes an image set into pairs, matches each pair independently with a two-view matcher, and then merges the pairwise matches into multi-view tracks. This approach faces fundamental difficulties in wide-baseline scenes with severe occlusions and repetitive textures:

Insufficient Information: A significant amount of information is lost when a complex 3D structure is projected to 2D, and two-view observations are insufficient to reliably infer the original 3D scene. Points far apart in space may appear very close in 2D, leading to matching ambiguities.
Inadequate Two-View Geometric Priors: Relying solely on two-view constraints makes it difficult to handle sudden depth discontinuities.
Error Accumulation: Errors from pairwise matching are amplified when merging tracks.

The core insight of this paper is: instead of continuously optimizing two-view matchers, it is better to directly exploit the rich relationships within the raw multi-view observations. By collaboratively establishing correspondences within a group of complementary views, a holistic understanding of the 3D scene can be formed, and a reliable global solution can be derived through cross-view projective consistency constraints.

Method¶

Overall Architecture¶

CoMatcher adopts a 1-to-N matching architecture: given \(M\) source views forming a group \(\mathcal{G}\) and a target view \(I_t\), the network simultaneously estimates the correspondences from each source view to the target view. The full pipeline has three steps, grouping, connecting, and matching: first, the image set is partitioned into groups based on co-visibility (grouping); then, an existing two-view framework is run within each group to build tracks that provide geometric guidance (connecting); finally, CoMatcher performs group-level collaborative matching (matching).

Key Designs¶

Multi-View Feature Interaction Module:
- Function: Enhances point feature representation through multi-view receptive fields, resolving the issue of features at occlusion boundaries being contaminated by irrelevant context.
- Mechanism: (a) Source Cross: For a point \(u\) in source view \(I_i\), information from all points in the other source views \(I_j\) is aggregated via multi-view cross-attention, with each source view contributing equally to avoid bias towards similar views. (b) Geometrically-Constrained Attention: Uses precomputed intra-group tracks \(\mathcal{M}(\mathcal{G})\) to obtain the projected positions of point \(u\) in other views, and embeds the projection position difference \(\Delta\mathbf{p}\) into attention scores as relative position encodings, guiding each point to focus on geometrically corresponding regions. (c) Target Cross: Aggregates target view features across different pairings.
- Design Motivation: Features of points at occlusion boundaries are unreliable when observed from a single view, but may be clearly visible from other views; geometric constraints limit the search space to avoid noise interference from irrelevant regions.
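
The effect of step (b) can be illustrated with a minimal sketch. It assumes a single attention head and replaces the learned relative position encoding described above with a hypothetical fixed penalty \(-\lVert\Delta\mathbf{p}\rVert^2 / (2\sigma^2)\) added to the attention score; the function names, `sigma`, and the normalized coordinates are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def geo_biased_attention(query, keys, key_positions, projected_pos, sigma=0.1):
    """Attention of one source point u over the keypoints of another view.

    query:         feature vector of u
    keys:          feature vectors of the keypoints in the other view
    key_positions: (x, y) of those keypoints in normalized image coordinates
    projected_pos: track-derived projection of u into that view, or None
    sigma:         hypothetical bandwidth of the distance penalty (assumption)
    """
    dim = len(query)
    scores = []
    for key, pos in zip(keys, key_positions):
        score = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        if projected_pos is not None:
            dx = pos[0] - projected_pos[0]
            dy = pos[1] - projected_pos[1]
            # Assumed stand-in for the learned encoding of delta-p.
            score -= (dx * dx + dy * dy) / (2 * sigma ** 2)
        scores.append(score)
    return softmax(scores)

# Two visually identical candidates (e.g. a repetitive texture).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
positions = [(0.2, 0.2), (0.8, 0.8)]

plain = geo_biased_attention(query, keys, positions, None)
guided = geo_biased_attention(query, keys, positions, (0.21, 0.2))
print(plain, guided)
```

Without the projection the two candidates receive equal weight; with it, nearly all attention goes to the candidate near the projected position, which is the behaviour the design motivation describes.
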
Multi-View Feature Correlation Strategy:
- Function: Leverages cross-view projective consistency to identify and correct ambiguous matches.
- Mechanism: A two-step process. (a) At each layer, a lightweight head predicts a confidence for each point, \(c_u^{I_i} = \text{Sigmoid}(\text{MLP}(\mathbf{f}_u^{I_i}))\), and points whose confidence falls below a threshold \(\theta\) are treated as ambiguous. (b) For each ambiguous point \(u\), its attention distribution is blended with the confidence-weighted average of the attention distributions of its corresponding points \(v\) in the other source views \(I_j\) on the same track: \(\boldsymbol{\alpha}_u^{I_i'} = c_u^{I_i}\boldsymbol{\alpha}_u^{I_i} + (1-c_u^{I_i})\frac{\sum_v c_v^{I_j}\boldsymbol{\alpha}_v^{I_j}}{\sum_v c_v^{I_j}}\). The threshold \(\theta\) is gradually increased in subsequent layers.
- Design Motivation: Matches of the same 3D point across different source views should be consistent (either all corresponding to the same target point, or all having no match). Traditional methods use this consistency only to filter matches post hoc, whereas this method applies the constraint during inference itself.
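
The correction rule in step (b) can be written out directly. The sketch below is a plain-Python illustration under two assumptions: every distribution is taken over the same list of target keypoints (so entries align by index), and the threshold value `theta=0.5` is a placeholder, since no value is given here. Function and argument names are hypothetical.

```python
def correct_attention(alpha_u, c_u, track_alphas, track_confs, theta=0.5):
    """Blend an ambiguous point's attention with its track's consensus.

    alpha_u:      attention distribution of point u (list of floats)
    c_u:          predicted confidence of u, in [0, 1]
    track_alphas: distributions of the points v on u's track in other views
    track_confs:  confidences c_v of those points
    theta:        ambiguity threshold (placeholder value, an assumption)
    """
    total_conf = sum(track_confs)
    # Confident points, or points with no usable track, are left untouched.
    if c_u >= theta or not track_alphas or total_conf == 0:
        return list(alpha_u)
    # Confidence-weighted average of the track members' distributions.
    consensus = [
        sum(c * a[k] for c, a in zip(track_confs, track_alphas)) / total_conf
        for k in range(len(alpha_u))
    ]
    # alpha' = c_u * alpha_u + (1 - c_u) * consensus
    return [c_u * a + (1 - c_u) * m for a, m in zip(alpha_u, consensus)]

# Point u cannot decide between target keypoints 0 and 1 ...
alpha_u = [0.5, 0.5, 0.0]
# ... but its track members in two other source views clearly prefer 0.
track_alphas = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
track_confs = [0.9, 0.6]

corrected = correct_attention(alpha_u, 0.2, track_alphas, track_confs)
print(corrected)
```

Because the result is a convex combination of probability distributions, it still sums to one, and a point with confidence near zero simply adopts the consensus of its track.
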
Group-wise Matching Pipeline:
- Function: Scales CoMatcher to large-scale image sets.
- Mechanism: Divides the image set into multiple groups based on co-visibility, with each group representing a local scene. Within a group, an existing framework (such as LightGlue) is first used to establish tracks and obtain multi-view projection information; each group as a whole is then collaboratively matched against other images using CoMatcher.
- Design Motivation: CoMatcher's 1-to-N architecture is most efficient when the number of source views \(M\) is small (trained with \(M=4\)), and the grouping strategy makes it scalable to image sets of arbitrary size.
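
As an illustration of the grouping step only, the sketch below partitions images greedily from a pairwise co-visibility score matrix. This is not the paper's grouping heuristic, which is not detailed here; the score matrix, `min_score`, and the greedy rule are all assumptions made for the example. Only `group_size=4` mirrors the \(M=4\) used in training.

```python
def greedy_covisibility_groups(covis, group_size=4, min_score=0.3):
    """Greedily partition image indices into co-visible groups.

    covis:      symmetric matrix, covis[i][j] in [0, 1] (hypothetical scores,
                e.g. derived from image retrieval or shared matches)
    group_size: maximum number of source views per group
    min_score:  minimum co-visibility with every current member (assumption)
    """
    unassigned = set(range(len(covis)))
    groups = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        group = [seed]
        while len(group) < group_size:
            best, best_score = None, min_score
            for cand in sorted(unassigned):
                # A candidate must overlap with every image already in the group.
                score = min(covis[cand][m] for m in group)
                if score >= best_score and (best is None or score > best_score):
                    best, best_score = cand, score
            if best is None:
                break
            group.append(best)
            unassigned.remove(best)
        groups.append(group)
    return groups

# Images 0-2 see one part of the scene, images 3-4 another.
covis = [
    [1.0, 0.8, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.8, 0.1, 0.1],
    [0.8, 0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.1, 0.7, 1.0],
]
print(greedy_covisibility_groups(covis))
```

Each resulting group would then be tracked internally by a two-view matcher and matched as a unit against target images.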

Loss & Training¶

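The total loss averages, over the \(M\) source views in \(\mathcal{G}\), the correspondence loss plus the confidence loss weighted by \(\alpha\) (a scalar loss weight, distinct from the attention distribution \(\boldsymbol{\alpha}\) above):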

\[\mathcal{L}_{total} = \frac{1}{M}\sum_{I_i \in \mathcal{G}}(\mathcal{L}_{corr}(I_i, I_t) + \alpha\mathcal{L}_{conf}(I_i))\]

\(\mathcal{L}_{corr}\): Negative log-likelihood loss of the assignment matrix, with ground truth calculated from relative pose/homography.
\(\mathcal{L}_{conf}\): Binary cross-entropy loss for confidence estimation, where the label indicates whether the current layer's estimation is consistent with the final estimation.
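
A small numeric sketch of how the two terms combine, under simplifying assumptions: the correspondence term below averages the negative log-likelihood over ground-truth matched pairs only (terms for unmatched points are omitted), and the value of `alpha` is a placeholder because the weight is not stated here. All function names are hypothetical.

```python
import math

def nll_correspondence(log_assignment, gt_matches):
    """Mean negative log-likelihood of ground-truth matches (i, j).

    Simplified: only matched pairs are scored; unmatched points are ignored.
    """
    return -sum(log_assignment[i][j] for i, j in gt_matches) / len(gt_matches)

def bce_confidence(confidences, labels, eps=1e-7):
    """Binary cross-entropy between predicted confidences and 0/1 labels.

    A label is 1 when the layer's estimate agrees with the final estimate.
    """
    total = 0.0
    for c, y in zip(confidences, labels):
        c = min(max(c, eps), 1 - eps)  # clamp to avoid log(0)
        total -= y * math.log(c) + (1 - y) * math.log(1 - c)
    return total / len(confidences)

def total_loss(corr_losses, conf_losses, alpha=1.0):
    """Average of (L_corr + alpha * L_conf) over the source views."""
    per_view = [lc + alpha * lf for lc, lf in zip(corr_losses, conf_losses)]
    return sum(per_view) / len(per_view)

# One source view with a 2x2 assignment matrix and two ground-truth matches.
log_p = [[math.log(0.9), math.log(0.1)],
         [math.log(0.2), math.log(0.8)]]
l_corr = nll_correspondence(log_p, [(0, 0), (1, 1)])
l_conf = bce_confidence([0.9, 0.3], [1, 0])

# Two source views with precomputed per-view losses, alpha = 0.5 (placeholder).
print(l_corr, l_conf, total_loss([0.4, 0.6], [0.2, 0.1], alpha=0.5))
```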

Two-stage training: first pre-trained on a synthetic homography dataset, and then fine-tuned on MegaDepth. Training takes about 6 days on 2 RTX 4090 GPUs.

Key Experimental Results¶

Main Results¶

Method	MegaDepth AUC@5°/10°/20° (RANSAC)	Time (ms)
SP+NN+mutual	51.4/67.3/75.9	9
SP+SuperGlue	65.1/77.2/89.2	87
SP+LightGlue	67.2/80.1/88.0	51
SP+End2End (multi-view)	67.4/81.5/87.0	152
SP+CoMatcher	68.3/82.2/89.1	69
DISK+LightGlue	68.6/80.4/87.2	54
DISK+CoMatcher	68.5/82.1/88.4	73

IMC 2020 Benchmark (Stereo Task, mAA@5°/10°)¶

Method	mAA@5°	mAA@10°	Time (ms/pair)
SP+SuperGlue	-	-	87
SP+LightGlue	-	-	51
SP+CoMatcher (groupwise)	Significantly Improved	Significantly Improved	~73

Key Findings¶

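CoMatcher outperforms sparse two-view matchers using the same detector on most metrics. With SuperPoint it improves on LightGlue at all three thresholds (68.3/82.2/89.1 vs 67.2/80.1/88.0), although the table lists SuperGlue marginally higher at AUC@20° (89.2 vs 89.1) and DISK+LightGlue marginally higher at AUC@5° (68.6 vs 68.5). Its runtime (69 ms) falls between LightGlue (51 ms) and SuperGlue (87 ms).
Compared with End2End (also a multi-view method), CoMatcher's 1-to-N architecture outperforms the N-to-N architecture (HPatches DLT AUC@5px: 37.1 vs 34.3) and is more efficient (69 ms vs 152 ms in the table above).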
When using DLT (a non-robust estimator), CoMatcher's accuracy is close to that of RANSAC, reflecting the high reliability of the matching quality.
The improvement is particularly prominent at semantic edges and depth discontinuities (qualitative results in Fig. 4), demonstrating the advantage of multi-view collaborative inference in handling occlusions.

Highlights & Insights¶

Paradigm Shift: Shifting from "local optimal merging" in two-view matching to multi-view "global collaborative inference" is an important direction in the feature matching field.
The confidence-guided attention correction strategy is simple yet effective—low-confidence points borrow the attention distribution of high-confidence corresponding points, avoiding complex post-processing.
The geometrically-constrained attention mechanism (using projection position differences of tracks as position encodings) cleverly limits the search space without losing flexibility.

Limitations & Future Work¶

The number of source views \(M=4\) is fixed during training, and generalizability to more views has not been fully verified.
Relies on the quality of intra-group precomputed tracks; if the initial tracks are erroneous, the accuracy of the geometric constraint will be affected.
The grouping algorithm is based on a heuristic design, leaving room for exploring optimal grouping strategies.
Only handles sparse, detector-based feature matching and has not been extended to detector-free semi-dense or dense matching (e.g., LoFTR).

Related Work & Insights¶

CoMatcher is complementary to LightGlue's "lightweight and efficient two-view matching" route: it trades a small amount of efficiency (69 ms vs 51 ms with SuperPoint in the table above) for reliability.
Multi-view consistency constraints are shifted forward from post-processing filtering to guidance during inference, similar to a paradigm shift from "generate then filter" to "constrained generation".
The design idea of the group-wise matching framework can be extended to other multi-view reasoning tasks (such as multi-view depth estimation and multi-view pose estimation).

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The 1-to-N collaborative matching paradigm is a significant breakthrough in the feature matching field, and the confidence-guided cross-view correction strategy is novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks (HPatches/MegaDepth/IMC2020), complete ablation study, and detailed running time analysis.
Writing Quality: ⭐⭐⭐⭐ The motivation is in-depth and the architecture diagram is clear, but there are many notations, and some formulas could be further simplified.
Value: ⭐⭐⭐⭐⭐ Substantially improves the core pipeline of SfM/SLAM, and can be directly integrated into existing pipelines.