
ARGMatch: Adaptive Refinement Gathering for Efficient Dense Matching

Conference: ICCV 2025 · arXiv: N/A (CVF OpenAccess) · Code: https://github.com/ACuOoOoO/argmatch · Area: Model Compression · Keywords: Dense matching, coarse-to-fine, content-aware refinement, local consistency, efficient feature matching

TL;DR

This paper proposes an Adaptive Refinement Gathering pipeline comprising three modules—a content-aware offset estimator, a local consistency matching corrector, and a local consistency upsampler—augmented with an adaptive gating mechanism. The approach substantially reduces reliance on heavyweight feature extractors and global matchers, achieving performance comparable to state-of-the-art methods with a lightweight model.

Background & Motivation

Efficiency Bottleneck in Dense Matching

Establishing dense pixel correspondences is a fundamental step in multi-view tasks such as 3D reconstruction and visual localization. Brute-force global matching has computational complexity that scales quadratically with image resolution, rendering it infeasible for high-resolution scenarios. Although coarse-to-fine schemes alleviate computational cost, efficiency remains constrained by heavyweight feature extractors (e.g., DINOv2) and complex global matchers (e.g., Gaussian process matching).

Three Deficiencies in Existing Refiners

Why can lightweight substitutes not simply replace existing components? The authors argue that redundancy in existing methods cannot be trivially eliminated; the root cause lies in the inefficiency of refiner design:

High feature dependency: Existing correlation volume (CV)-based refiners require high-dimensional, highly discriminative features to sharpen the correlation volume, necessitating heavyweight feature extractors.

Limited error correction range: CV-based refiners can only correct errors within a local window; handling larger errors demands more sophisticated global initialization.

Insufficient joint optimization: When errors exceed the local window, existing optimization strategies force the refiner to fix errors beyond its capacity rather than propagating gradients to upstream modules for appropriate optimization.

Mechanism

The paper proposes a more powerful refinement pipeline to "liberate" the system from dependence on front-end components: if the refiner is sufficiently capable, heavyweight feature extractors and global matchers are no longer required to provide high-quality initialization.

Method

Overall Architecture

ArgMatch adopts a coarse-to-fine scheme: a lightweight feature extractor generates feature pyramids at 1/16, 1/8, and 1/4 resolution → a global matcher predicts an initial match \(M_{1/16}\) at the coarsest resolution → the adaptive refinement gathering pipeline progressively refines matches to half resolution \(M_{1/2}\).

The core contributions lie in the three modules of the refinement pipeline and the adaptive gating/local consistency mechanisms.
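
To make the data flow concrete, here is a minimal sketch of the coarse-to-fine pass described above; the function names (`extract_pyramid`, `global_match`, `refine_level`) are placeholders for this summary, not the authors' API.

```python
def argmatch_forward(img_a, img_b, extract_pyramid, global_match, refine_level):
    """Hypothetical end-to-end pass: global init at 1/16, then progressive refinement.

    extract_pyramid, global_match and refine_level stand in for the lightweight
    feature extractor, the coarse global matcher, and one step of the adaptive
    refinement gathering pipeline, respectively.
    """
    feats_a = extract_pyramid(img_a)   # {"1/16": ..., "1/8": ..., "1/4": ...}
    feats_b = extract_pyramid(img_b)

    # Initial dense match and confidence map at the coarsest resolution.
    match, conf = global_match(feats_a["1/16"], feats_b["1/16"])

    # Each refinement step corrects the match and 2x-upsamples it,
    # ending at half resolution (1/16 -> 1/8 -> 1/4 -> 1/2).
    for level in ("1/16", "1/8", "1/4"):
        match, conf = refine_level(match, conf, feats_a[level], feats_b[level])

    return match, conf  # dense matches M_{1/2} and confidence at half resolution
```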

Key Design 1: Content-Aware Offset Estimator

Conventional CV-based refinement estimates residual offsets \(\delta M\) by computing and decoding local correlation volumes. This paper improves upon this along two dimensions:

Scale-adaptive sampling: Even with a fixed number of samples, the spatial extent of the sampling window should adapt to geometric scale variation. Scale can be approximated via the gradient of the dense flow field:

\[s_g(i,j) \approx |\nabla M(i,j)| \approx \sum_{(m,n)} C(m,n) |M(i+m, j+n) - M(i,j)|\]

The final scale \(s \in [0.5, 3.5]\) is then estimated by a lightweight network \(f_1\) that takes as input the geometric scale, confidence map, and contextual features.

Why is scale-adaptive sampling necessary? In geometric matching, scale variation across regions can be substantial (e.g., near vs. far objects). A fixed window size cannot accommodate such variation, leading to under- or over-sampling.
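
A minimal PyTorch sketch of the flow-gradient scale proxy \(s_g\) from the equation above; the coefficients \(C(m,n)\) are set to a uniform average here purely for illustration, whereas the paper feeds the result, together with the confidence map and contextual features, into the lightweight network \(f_1\) that outputs \(s \in [0.5, 3.5]\).

```python
import torch
import torch.nn.functional as F

def geometric_scale(flow, kernel_size=3):
    """Approximate per-pixel geometric scale as a coefficient-weighted sum of
    flow differences to neighbours, i.e. a discrete |grad M|.

    flow: (B, 2, H, W) dense match/flow field.
    Returns: (B, 1, H, W) scale proxy s_g.
    """
    B, _, H, W = flow.shape
    pad = kernel_size // 2
    # Gather the kernel_size^2 neighbourhood of every pixel.
    neigh = F.unfold(flow, kernel_size, padding=pad)           # (B, 2*k*k, H*W)
    neigh = neigh.view(B, 2, kernel_size * kernel_size, H, W)
    center = flow.unsqueeze(2)                                  # (B, 2, 1, H, W)
    diff = (neigh - center).norm(dim=1)                         # (B, k*k, H, W)
    # Uniform C(m, n) for illustration; the paper uses learned/estimated coefficients.
    coeff = flow.new_full((1, kernel_size * kernel_size, 1, 1), 1.0 / kernel_size**2)
    return (coeff * diff).sum(dim=1, keepdim=True)
```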

Content-aware decoding: Conventional methods estimate offsets via the expectation of the correlation volume, but the similarity distribution may deviate from an ideal Gaussian (especially near object boundaries). This paper modulates the distribution using content information from the sampled region:

  1. A self-correlation volume (SV) is computed to capture the content distribution of the sampled region.
  2. SV features, center-sampled features, and scale are fused into a latent code \(z\).
  3. \(z\) modulates the original CV to produce a content-aware offset:
\[CV' = f_3(CV, z), \quad \delta M(i,j) = f_4(z(i,j))^\top CV'(i,j)\]

Why is content-aware decoding preferable to pursuing sharper features? Pursuing more discriminative features implies heavier feature extractors. Instead, modulating the correlation volume distribution using local content information from the sampled region enables accurate offset estimation even with low-dimensional features.
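
A minimal sketch of the modulation step \(CV' = f_3(CV, z)\), \(\delta M = f_4(z)^\top CV'\); the 1×1-convolution parameterisation of \(f_3\) and \(f_4\) and the layer widths are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ContentAwareDecoder(nn.Module):
    """Illustrative sketch of content-aware offset decoding (not the authors' code).

    cv: (B, K, H, W) local correlation volume over K sampled positions.
    z:  (B, D, H, W) latent code fused from the self-correlation volume,
        centre-sampled features and scale (steps 1-2 above).
    """
    def __init__(self, K, D, hidden=64):
        super().__init__()
        self.f3 = nn.Sequential(                   # CV' = f3(CV, z): content-modulated volume
            nn.Conv2d(K + D, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, K, 1),
        )
        self.f4 = nn.Conv2d(D, 2 * K, 1)           # per-pixel (K x 2) decoding matrix f4(z)

    def forward(self, cv, z):
        B, K, H, W = cv.shape
        cv_mod = self.f3(torch.cat([cv, z], dim=1))              # CV'
        proj = self.f4(z).view(B, K, 2, H, W)                    # f4(z(i, j))
        delta = torch.einsum("bkchw,bkhw->bchw", proj, cv_mod)   # f4(z)^T CV'
        return delta                                             # (B, 2, H, W) offsets dM
```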

Key Design 2: Local Consistency Matching Corrector

Local consistency principle: within a locally smooth region, the geometric transformation between views can be approximated by a rigid or affine transformation, so \(M''(i,j)\) can be regressed as a linear combination of neighboring matches:

\[M''(i,j) = \sum_{(m,n)} w(m,n,i,j)\, M'(u,v), \quad (u,v) = (i+m,\, j+n)\]

Weight design: The weight \(w\) integrates three factors:

  • Spatial correlation: a learned relative spatial tensor \(b\)
  • Semantic correlation: feature similarity between neighbors, \(F(i,j)^\top F(u,v)\)
  • Matching confidence: the certainty of neighboring matches, \(C(u,v)\)

\[w(m,n,i,j) = C(u,v) \cdot F(i,j)^\top F(u,v) + b(m,n)\]

A 5×5 window with a two-layer cascaded structure is adopted, yielding an effective receptive field larger than a single 5×5 window. The NATTEN library is used to accelerate the neighborhood attention computation.

Why can the corrector handle large errors? CV-based estimators are limited by the sampling window size, whereas the corrector operates on the neighborhood consistency principle—as long as reliable matches exist in the vicinity, even a point with a large initial matching error can be corrected via neighborhood regression.
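
A minimal dense-tensor PyTorch sketch of the corrector's weighting and regression for a single window; the paper's implementation uses a two-layer cascade accelerated with NATTEN, and the softmax normalisation below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def local_consistency_correct(match, feat, conf, bias, window=5):
    """Illustrative sketch of one local-consistency correction step.

    match: (B, 2, H, W) current matches M'
    feat:  (B, D, H, W) features F
    conf:  (B, 1, H, W) confidence map C
    bias:  (window*window,) learned relative spatial term b(m, n)
    """
    B, D, H, W = feat.shape
    k2, pad = window * window, window // 2

    unfold = lambda x: F.unfold(x, window, padding=pad).view(B, x.shape[1], k2, H, W)
    feat_n, conf_n, match_n = unfold(feat), unfold(conf), unfold(match)

    # w(m, n, i, j) = C(u, v) * <F(i, j), F(u, v)> + b(m, n)
    sim = (feat.unsqueeze(2) * feat_n).sum(dim=1)             # (B, k2, H, W)
    w = conf_n.squeeze(1) * sim + bias.view(1, k2, 1, 1)
    w = torch.softmax(w, dim=1)                               # normalisation assumed here

    # M''(i, j) = sum_{(m, n)} w(m, n, i, j) * M'(u, v)
    return (w.unsqueeze(1) * match_n).sum(dim=2)              # (B, 2, H, W)
```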

Key Design 3: Local Consistency Upsampler

Unlike bilinear interpolation, the upsampler employs a mechanism similar to the corrector, estimating neighborhood weights \(w_{up}\) from semantic similarity, spatial correlation, and confidence signals, and performing 2× upsampling via PixelShuffle.

Why not use bilinear interpolation? Bilinear interpolation is displacement-invariant and causes over-smoothing and artifacts at depth discontinuities. The local consistency upsampler assigns weights based on semantic consistency, preventing information propagation across depth boundaries. During backpropagation, this also blocks gradient flow across depth boundaries, avoiding the erroneous enforcement of continuity between different depth layers.
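
A minimal sketch of such an upsampler: per-subpixel convex weights are predicted from features and confidence, applied over the low-resolution neighborhood, and rearranged with PixelShuffle. The 3×3 window, the single-convolution weight head, and the softmax normalisation are illustrative assumptions, not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalConsistencyUpsampler(nn.Module):
    """Illustrative 2x upsampler: convex neighbourhood weights + PixelShuffle."""

    def __init__(self, feat_dim, window=3):
        super().__init__()
        self.window = window
        # Predict, for each of the 4 sub-pixel positions, weights over a
        # window x window neighbourhood from features and confidence.
        self.to_weights = nn.Conv2d(feat_dim + 1, 4 * window * window, 3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, match, feat, conf):
        B, _, H, W = match.shape
        k2, pad = self.window ** 2, self.window // 2
        w = self.to_weights(torch.cat([feat, conf], dim=1))       # (B, 4*k2, H, W)
        w = torch.softmax(w.view(B, 4, k2, H, W), dim=2)

        neigh = F.unfold(match, self.window, padding=pad).view(B, 2, k2, H, W)
        # Each sub-pixel output is a convex combination of low-res neighbours,
        # so information is not blended across boundaries the weights suppress.
        up = torch.einsum("bskhw,bckhw->bcshw", w, neigh)          # (B, 2, 4, H, W)
        up = up.reshape(B, 8, H, W)
        # If matches are stored in pixel units, they would additionally be scaled by 2.
        return self.shuffle(up)                                    # (B, 2, 2H, 2W)
```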

Adaptive Gating Aggregation

An adaptive gating mechanism is introduced to selectively integrate previous and updated matches:

\[M^* = sg(M) + \delta M, \quad M' = \beta M^* + (1-\beta)M\]

where \(sg(\cdot)\) denotes the stop-gradient operator and \(\beta\) is a binary gate derived from the estimated confidence score \(\alpha\) (\(\beta = \mathbb{1}[\alpha > 0.1]\)). The gating strategy prevents the offset estimator from being forced beyond its capacity and ensures that gradients for poorly matched points are correctly propagated to upstream modules.
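
A minimal sketch of the gating rule, assuming \(sg(\cdot)\) is realised with `detach()` and \(\beta\) is the hard confidence threshold described above:

```python
def gated_update(match, delta, alpha, threshold=0.1):
    """Sketch of the gate: M* = sg(M) + dM, M' = beta*M* + (1 - beta)*M,
    with beta = 1[alpha > threshold]. All arguments are torch tensors."""
    m_star = match.detach() + delta            # sg(M) + dM
    beta = (alpha > threshold).float()         # hard gate from the confidence score
    # Confident points adopt the refined match; elsewhere the previous match is kept,
    # so the error gradient flows back to upstream modules rather than the estimator.
    return beta * m_star + (1.0 - beta) * match
```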

Loss & Training

Multi-scale loss: \(L_{total} = L_{reg}^{1/2} + \sum_{t \in \{1/16,\, 1/8,\, 1/4\}} L_t\)

Each level includes: a regression loss \(L_{reg}\) (L2 distance + robust regression), a confidence map classification loss \(L_{cls}\) (balanced binary cross-entropy), and a gating loss \(L_{gate}\).
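
A tiny sketch of how the per-level terms combine, assuming each term is precomputed and equally weighted (the paper's actual loss weights are not restated here):

```python
def total_loss(per_level):
    """L_total = L_reg^{1/2} + sum_t (L_reg^t + L_cls^t + L_gate^t), equal weights assumed."""
    loss = per_level["1/2"]["reg"]                      # half-resolution regression term
    for t in ("1/16", "1/8", "1/4"):
        loss = loss + per_level[t]["reg"] + per_level[t]["cls"] + per_level[t]["gate"]
    return loss
```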

Training strategy: the coarsest level is first optimized until 90% of matching errors fall below 1px, followed by end-to-end training of the full network. Input resolution: 800×608, batch size: 8, hardware: 4× RTX 4090.

Key Experimental Results

Main Results

Geometric model estimation (multi-dataset comparison):

| Method | Params (M) | MegaDepth AUC@5° | ScanNet AUC@5° | Time (ms) | Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| RoMa | 415 | 62.6 | 28.4 | 1557 | 14.8 |
| DKM | 72.3 | 60.4 | 26.5 | 953 | 13.1 |
| LoFTR | 11.5 | 52.8 | 16.9 | 296 | 6.97 |
| ArgMatch | 38.3 | 61.2 | 28.2 | 270 | 2.30 |
| ArgMatch+ | 38.8 | 62.0 | 28.4 | 329 | 2.30 |

ArgMatch achieves accuracy comparable to RoMa using 1/11 of its parameters, 1/6 of its inference time, and 1/6 of its memory.

Dense matching accuracy (MegaDepth PCK):

| Method | PCK@0.5px | PCK@1px | PCK@3px |
| --- | --- | --- | --- |
| DKM | 56.2 | 79.8 | 94.4 |
| RoMa | 58.9 | 82.6 | 96.5 |
| ArgMatch+ | 60.2 | 82.9 | 96.5 |

ArgMatch+ surpasses RoMa by 1.3% on PCK@0.5px, achieving state-of-the-art performance at the finest granularity.

Visual localization (InLoc):

| Method | DUC1 (0.25m, 2°) | DUC2 (0.25m, 2°) |
| --- | --- | --- |
| RoMa | 54.5 | 56.5 |
| ArgMatch+ | 58.6 | 58.8 |

ArgMatch+ surpasses RoMa on the InLoc visual localization benchmark, achieving state-of-the-art results.

Ablation Study

Component contributions (MegaDepth):

| Configuration | AUC@5° | PCK@0.5px | Time (ms) | Params (M) |
| --- | --- | --- | --- | --- |
| ConvR (baseline) | 56.3 | 76.6 | 197 | 24.7 |
| +U (upsampler) | 58.0 | 79.0 | 222 | 27.2 |
| +R+U (corrector+upsampler) | 59.4 | 80.3 | 235 | 32.5 |
| −ConvR+O+U | 59.8 | 80.6 | 248 | 33.0 |
| ArgMatch (O+R+U) | 61.2 | 82.2 | 270 | 38.3 |
| −ScaleS | 60.7 | 81.7 | 269 | 38.2 |
| −ContD | 59.9 | 80.8 | 261 | 36.6 |
| −Gate | 60.1 | 81.4 | 270 | 38.3 |

Key findings: full integration of all three modules significantly outperforms any subset; content-aware decoding (ContD) contributes most; the gating strategy is critical for stable optimization.

Gradient stopping comparison:

| Configuration | AUC@5° | PCK@0.5px |
| --- | --- | --- |
| ConvR | 56.3 | 76.6 |
| ConvR + detach | 57.8 | 77.3 |
| ArgMatch | 61.2 | 82.2 |
| ArgMatch + detach | 60.4 | 81.6 |

Conventional pipelines require gradient detachment to stabilize training, whereas ArgMatch achieves superior end-to-end optimization through gating aggregation and local consistency; gradient detachment is in fact detrimental.

Key Findings

  1. Underestimated potential of lightweight models: The key lies not in heavier feature extractors or global matchers, but in a more intelligent refinement pipeline.
  2. Synergy among three modules: The full potential of each module is realized only when all three are jointly integrated within the pipeline.
  3. Local content information is central: Modulating the correlation volume with local content is more effective than pursuing more discriminative features.
  4. Dual role of local consistency: It not only corrects errors in the forward pass but also guides rational gradient allocation during backpropagation.

Highlights & Insights

  1. Paradigm shift in refinement pipeline design: The focus shifts from "better initialization is required" to "better refinement reduces dependence on initialization," fundamentally reconceiving the core design philosophy of dense matching.
  2. Fine-grained control of gradient propagation: Gating and local consistency jointly address the fundamental difficulty of gradient propagation in coarse-to-fine frameworks.
  3. Efficiency–accuracy trade-off: Comparable accuracy is achieved at 1/6 the computational cost of RoMa, demonstrating the high efficiency of the proposed design.
  4. Hallucination matching phenomenon: The paper honestly discusses the hallucination matching problem in occluded regions caused by local consistency regression.

Limitations & Future Work

  1. Hallucination matching: Local consistency regression may propagate neighborhood information into occluded regions, producing spurious matches; confidence map learning is suboptimal due to noisy MegaDepth annotations.
  2. Training targets only 1/2 resolution; full-resolution recovery relies on RoMa's convolutional module, which is not trained as part of this work.
  3. Although lightweight, the feature extractor can be further optimized.
  4. Performance in scenarios with extreme viewpoint and illumination changes leaves room for improvement.
Related Concepts

  • RoMa: The current dense matching state-of-the-art, relying on DINOv2 and complex global matching; ArgMatch approaches its performance at a fraction of the cost.
  • DKM: Employs Gaussian process global matching (\(O(n^3)\) complexity); representative of the heavyweight paradigm ArgMatch aims to replace.
  • RAFT: Classic iterative refinement framework for optical flow estimation; the upsampler design in this paper draws inspiration from it.
  • NATTEN: Neighborhood attention acceleration library, used in this paper to implement efficient local consistency regression.
  • Insight: Refinement pipeline design is a critical leverage point for improving efficiency, and merits further exploration in a broad range of coarse-to-fine tasks.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐