ARGMatch: Adaptive Refinement Gathering for Efficient Dense Matching¶
- Conference: ICCV 2025
- arXiv: N/A (CVF OpenAccess)
- Code: https://github.com/ACuOoOoO/argmatch
- Area: Model Compression
- Keywords: Dense matching, coarse-to-fine, content-aware refinement, local consistency, efficient feature matching
TL;DR¶
This paper proposes an Adaptive Refinement Gathering pipeline comprising three modules—a content-aware offset estimator, a local consistency matching corrector, and a local consistency upsampler—augmented with an adaptive gating mechanism. The approach substantially reduces reliance on heavyweight feature extractors and global matchers, achieving performance comparable to state-of-the-art methods with a lightweight model.
Background & Motivation¶
Efficiency Bottleneck in Dense Matching¶
Establishing dense pixel correspondences is a fundamental step in multi-view tasks such as 3D reconstruction and visual localization. Brute-force global matching has computational complexity that scales quadratically with image resolution, rendering it infeasible for high-resolution scenarios. Although coarse-to-fine schemes alleviate computational cost, efficiency remains constrained by heavyweight feature extractors (e.g., DINOv2) and complex global matchers (e.g., Gaussian process matching).
Three Deficiencies in Existing Refiners¶
Why can lightweight substitutes not simply replace existing components? The authors argue that redundancy in existing methods cannot be trivially eliminated; the root cause lies in the inefficiency of refiner design:
High feature dependency: Existing correlation volume (CV)-based refiners require high-dimensional, highly discriminative features to sharpen the correlation volume, necessitating heavyweight feature extractors.
Limited error correction range: CV-based refiners can only correct errors within a local window; handling larger errors demands more sophisticated global initialization.
Insufficient joint optimization: When errors exceed the local window, existing optimization strategies force the refiner to fix errors beyond its capacity rather than propagating gradients to upstream modules for appropriate optimization.
Mechanism¶
The paper proposes a more powerful refinement pipeline to "liberate" the system from dependence on front-end components: if the refiner is sufficiently capable, heavyweight feature extractors and global matchers are no longer required to provide high-quality initialization.
Method¶
Overall Architecture¶
ArgMatch adopts a coarse-to-fine scheme: a lightweight feature extractor generates feature pyramids at 1/16, 1/8, and 1/4 resolution → a global matcher predicts an initial match \(M_{1/16}\) at the coarsest resolution → the adaptive refinement gathering pipeline progressively refines matches to half resolution \(M_{1/2}\).
The core contributions lie in the three modules of the refinement pipeline and the adaptive gating/local consistency mechanisms.
Key Design 1: Content-Aware Offset Estimator¶
Conventional CV-based refinement estimates residual offsets \(\delta M\) by computing and decoding local correlation volumes. This paper improves upon this along two dimensions:
Scale-adaptive sampling: Even with a fixed window size, the sampling window should adapt to geometric scale variation. Scale can be approximated from the spatial gradient (the local Jacobian) of the dense flow field.
The final scale \(s \in [0.5, 3.5]\) is then estimated by a lightweight network \(f_1\) that takes as input the geometric scale, confidence map, and contextual features.
Why is scale-adaptive sampling necessary? In geometric matching, scale variation across regions can be substantial (e.g., near vs. far objects). A fixed window size cannot accommodate such variation, leading to under- or over-sampling.
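The geometric-scale approximation can be sketched as follows. The exact formula is not reproduced in this summary; a common proxy is the square root of the absolute Jacobian determinant of the match map, which measures local area change, and that is what this illustrative snippet computes.

```python
import numpy as np

def geometric_scale(match_xy: np.ndarray) -> np.ndarray:
    """Approximate per-pixel geometric scale from a dense match map.

    match_xy: (H, W, 2) array mapping each pixel to its match coordinates.
    Returns an (H, W) scale proxy: sqrt(|det J|) of the local Jacobian,
    i.e. the local area change of the flow field. One common choice; the
    paper's exact formulation may differ.
    """
    # Finite-difference Jacobian of the match map.
    dx = np.gradient(match_xy, axis=1)  # d(match)/dx, shape (H, W, 2)
    dy = np.gradient(match_xy, axis=0)  # d(match)/dy, shape (H, W, 2)
    det = dx[..., 0] * dy[..., 1] - dx[..., 1] * dy[..., 0]
    return np.sqrt(np.abs(det))

# Sanity check: identity map -> scale 1 everywhere; 2x magnification -> scale 2.
H, W = 8, 8
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
identity = np.stack([xs, ys], axis=-1).astype(float)
print(np.allclose(geometric_scale(identity), 1.0))      # True
print(np.allclose(geometric_scale(2 * identity), 2.0))  # True
```

In ArgMatch this raw geometric estimate is not used directly: it is fed, together with the confidence map and contextual features, into the lightweight network \(f_1\) that outputs the final clamped scale.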
Content-aware decoding: Conventional methods estimate offsets via the expectation of the correlation volume, but the similarity distribution may deviate from an ideal Gaussian (especially near object boundaries). This paper modulates the distribution using content information from the sampled region:
- A self-correlation volume (SV) is computed to capture the content distribution of the sampled region.
- SV features, center-sampled features, and scale are fused into a latent code \(z\).
- \(z\) modulates the original CV to produce a content-aware offset.
Why is content-aware decoding preferable to pursuing sharper features? Pursuing more discriminative features implies heavier feature extractors. Instead, modulating the correlation volume distribution using local content information from the sampled region enables accurate offset estimation even with low-dimensional features.
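A minimal sketch of the idea, assuming (for illustration only) that the latent code contributes additive logits to the correlation volume before the softmax expectation; the paper's actual fusion of \(z\) into the CV is more elaborate.

```python
import numpy as np

def content_aware_offset(cv: np.ndarray, z_logits: np.ndarray,
                         offsets: np.ndarray) -> np.ndarray:
    """Decode a sub-window offset from a local correlation volume (CV).

    cv:       (K,) similarities between the query feature and K window samples.
    z_logits: (K,) content-aware modulation derived from the latent code z
              (here simply added to the CV logits -- an assumption).
    offsets:  (K, 2) sampling offsets of the window relative to its center.

    Plain decoding takes the softmax expectation of the CV; content-aware
    decoding first reshapes the distribution with z so that, e.g., mass on
    the wrong side of an object boundary is suppressed.
    """
    logits = cv + z_logits
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p @ offsets  # (2,) expected offset

# Toy example: a 3-sample window at x-offsets -1, 0, +1.
offsets = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
cv = np.array([2.0, 2.0, 0.0])   # ambiguous between -1 and 0
z = np.array([-4.0, 0.0, 0.0])   # content says: left sample is off-object
biased = content_aware_offset(cv, z, offsets)
plain = content_aware_offset(cv, np.zeros(3), offsets)
print(plain[0] < biased[0])  # True: modulation pulls the estimate rightward
```

The point of the toy example: with low-dimensional features the raw CV is ambiguous, but the content signal disambiguates it without any heavier feature extractor.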
Key Design 2: Local Consistency Matching Corrector¶
Local consistency principle: Geometric transformations between neighboring smooth regions can be approximated by rigid or affine transformations, such that \(M''(i,j)\) can be regressed as a linear combination of neighboring matches.
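In symbols, this is a normalized weighted sum over a neighborhood \(\mathcal{N}(i,j)\) (a reconstruction from the surrounding description; the paper's exact formulation may differ):

```latex
M''(i,j) = \sum_{(u,v) \in \mathcal{N}(i,j)} w(i,j,u,v)\, M'(u,v),
\qquad \sum_{(u,v) \in \mathcal{N}(i,j)} w(i,j,u,v) = 1
```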
Weight design: The weight \(w\) integrates three factors:
- Spatial correlation: a learned relative spatial tensor \(b\)
- Semantic correlation: feature similarity between neighbors, \(F(i,j)^\top F(u,v)\)
- Matching confidence: the certainty of neighboring matches, \(C(u,v)\)
A 5×5 window with a two-layer cascaded structure is adopted, yielding a receptive field covering a larger spatial range. The NATTEN library is used to accelerate neighborhood attention computation.
Why can the corrector handle large errors? CV-based estimators are limited by the sampling window size, whereas the corrector operates on the neighborhood consistency principle—as long as reliable matches exist in the vicinity, even a point with a large initial matching error can be corrected via neighborhood regression.
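The correction mechanism can be sketched as a softmax-weighted neighborhood regression. The fusion below (spatial bias + semantic similarity + log-confidence as additive logits) is an assumption for illustration, as are the argument names; the paper implements this with cascaded 5×5 neighborhood attention via NATTEN rather than Python loops.

```python
import numpy as np

def local_consistency_correct(match, feats, conf, bias, radius=2):
    """One corrector step: regress each match from its neighborhood.

    match: (H, W, 2) current matches; feats: (H, W, D) features;
    conf: (H, W) confidences; bias: (2r+1, 2r+1) spatial logits standing
    in for the paper's learned relative spatial tensor b.
    """
    H, W, _ = match.shape
    out = np.zeros_like(match)
    for i in range(H):
        for j in range(W):
            logits, values = [], []
            for du in range(-radius, radius + 1):
                for dv in range(-radius, radius + 1):
                    u, v = i + du, j + dv
                    if not (0 <= u < H and 0 <= v < W):
                        continue
                    sem = feats[i, j] @ feats[u, v]  # semantic correlation
                    logits.append(bias[du + radius, dv + radius]
                                  + sem + np.log(conf[u, v] + 1e-6))
                    values.append(match[u, v])
            w = np.exp(np.array(logits) - max(logits))
            w /= w.sum()  # normalized neighborhood weights
            out[i, j] = w @ np.array(values)
    return out

# An outlier far outside any sampling window is pulled back by its
# confident neighbors -- the large-error case a CV refiner cannot fix.
match = np.full((5, 5, 2), 3.0); match[2, 2] = (30.0, 0.0)
conf = np.ones((5, 5)); conf[2, 2] = 0.01
feats = np.ones((5, 5, 4)) / 2.0
bias = np.zeros((5, 5))
fixed = local_consistency_correct(match, feats, conf, bias)
print(abs(fixed[2, 2, 0] - 3.0) < 0.1)  # True: outlier corrected
```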
Key Design 3: Local Consistency Upsampler¶
Unlike bilinear interpolation, the upsampler employs a mechanism similar to the corrector, estimating neighborhood weights \(w_{up}\) from semantic similarity, spatial correlation, and confidence signals, and performing 2× upsampling via PixelShuffle.
Why not use bilinear interpolation? Bilinear interpolation is displacement-invariant and causes over-smoothing and artifacts at depth discontinuities. The local consistency upsampler assigns weights based on semantic consistency, preventing information propagation across depth boundaries. During backpropagation, this also blocks gradient flow across depth boundaries, avoiding the erroneous enforcement of continuity between different depth layers.
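The PixelShuffle step itself is a pure rearrangement: each coarse pixel predicts \(r^2 = 4\) fine matches, which are spread onto the 2× grid without any cross-pixel interpolation. A minimal numpy sketch of that rearrangement (the weighted-match prediction that precedes it is omitted):

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Rearrange (C*r^2, H, W) -> (C, r*H, r*W), as in PixelShuffle.

    Each group of r^2 channels at a coarse pixel becomes an r x r block
    of fine pixels; no values are mixed across coarse pixels, which is
    what lets the upsampler respect depth boundaries.
    """
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)      # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# One coarse pixel carrying 4 predicted values -> a 2x2 fine block.
x = np.arange(4, dtype=float).reshape(4, 1, 1)
print(pixel_shuffle(x))  # [[[0. 1.] [2. 3.]]]
```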
Adaptive Gating Aggregation¶
An adaptive gating mechanism is introduced to selectively integrate previous and updated matches, with the blend controlled by a hard gate \(\beta\) derived from the estimated confidence score \(\alpha\): the updated match is kept only where \(\alpha > 0.1\) (\(\beta = \mathbb{1}[\alpha > 0.1]\)). The gating strategy prevents the offset estimator from being forced beyond its capacity and ensures that gradients for poorly matched points are correctly propagated to upstream modules.
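A sketch of the aggregation as \(M = \beta \hat{M} + (1-\beta) M_{prev}\) with the hard gate above; the convex-combination form and variable names are assumptions for illustration, while the 0.1 threshold follows the text.

```python
import numpy as np

def gate_aggregate(prev, updated, alpha, thresh=0.1):
    """Adaptive gating: keep the refined match only where confidence is
    sufficient, otherwise fall back to the previous match.

    prev, updated: (H, W, 2) match maps; alpha: (H, W) confidence.
    """
    beta = (alpha > thresh).astype(prev.dtype)[..., None]  # hard gate
    return beta * updated + (1.0 - beta) * prev

prev = np.zeros((2, 2, 2))
updated = np.ones((2, 2, 2))
alpha = np.array([[0.9, 0.05], [0.5, 0.0]])
out = gate_aggregate(prev, updated, alpha)
print(out[0, 0, 0], out[0, 1, 0])  # 1.0 0.0
```

Because the gate zeroes the update where confidence is low, the refiner's loss at those points flows back through the previous match instead, i.e. to the upstream modules that produced it.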
Loss & Training¶
Multi-scale loss: \(L_{total} = L_{reg}^{1/2} + \sum_{t \in \{1/16, 1/8, 1/4\}} L_t\)
Each level includes: a regression loss \(L_{reg}\) (L2 distance + robust regression), a confidence map classification loss \(L_{cls}\) (balanced binary cross-entropy), and a gating loss \(L_{gate}\).
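The balanced variant of the classification loss can be sketched as follows; the per-class averaging shown here is a generic form of class balancing, as the paper's exact weighting is not reproduced in this summary.

```python
import numpy as np

def balanced_bce(pred, target, eps=1e-6):
    """Balanced binary cross-entropy for the confidence-map loss L_cls.

    Positives (matchable pixels) and negatives are averaged separately
    and then summed, so the typically much larger negative class does
    not dominate the gradient.
    """
    pred = np.clip(pred, eps, 1 - eps)
    pos, neg = target > 0.5, target <= 0.5
    loss_pos = -np.log(pred[pos]).mean() if pos.any() else 0.0
    loss_neg = -np.log(1 - pred[neg]).mean() if neg.any() else 0.0
    return loss_pos + loss_neg

pred = np.array([0.9, 0.8, 0.2])
target = np.array([1.0, 1.0, 0.0])
print(balanced_bce(pred, target) < balanced_bce(1 - pred, target))  # True
```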
Training strategy: the coarsest level is first optimized until 90% of matching errors fall below 1px, followed by end-to-end training of the full network. Input resolution: 800×608, batch size: 8, hardware: 4× RTX 4090.
Key Experimental Results¶
Main Results¶
Geometric model estimation (multi-dataset comparison):
| Method | Params (M) | MegaDepth AUC@5° | ScanNet AUC@5° | Time (ms) | Memory (G) |
|---|---|---|---|---|---|
| RoMa | 415 | 62.6 | 28.4 | 1557 | 14.8 |
| DKM | 72.3 | 60.4 | 26.5 | 953 | 13.1 |
| LoFTR | 11.5 | 52.8 | 16.9 | 296 | 6.97 |
| ArgMatch | 38.3 | 61.2 | 28.2 | 270 | 2.30 |
| ArgMatch+ | 38.8 | 62.0 | 28.4 | 329 | 2.30 |
ArgMatch achieves accuracy comparable to RoMa using 1/11 of its parameters, 1/6 of its inference time, and 1/6 of its memory.
Dense matching accuracy (MegaDepth PCK):
| Method | PCK@0.5px | PCK@1px | PCK@3px |
|---|---|---|---|
| DKM | 56.2 | 79.8 | 94.4 |
| RoMa | 58.9 | 82.6 | 96.5 |
| ArgMatch+ | 60.2 | 82.9 | 96.5 |
ArgMatch+ surpasses RoMa by 1.3 percentage points on PCK@0.5px, achieving state-of-the-art performance at the finest granularity.
Visual localization (InLoc):
| Method | DUC1 (0.25m,2°) | DUC2 (0.25m,2°) |
|---|---|---|
| RoMa | 54.5 | 56.5 |
| ArgMatch+ | 58.6 | 58.8 |
ArgMatch+ surpasses RoMa on the InLoc visual localization benchmark, achieving state-of-the-art results.
Ablation Study¶
Component contributions (MegaDepth):
| Configuration | AUC@5° | PCK@0.5px | Time (ms) | Params (M) |
|---|---|---|---|---|
| ConvR (baseline) | 56.3 | 76.6 | 197 | 24.7 |
| +U (upsampler) | 58.0 | 79.0 | 222 | 27.2 |
| +R+U (corrector+upsampler) | 59.4 | 80.3 | 235 | 32.5 |
| −ConvR+O+U | 59.8 | 80.6 | 248 | 33.0 |
| ArgMatch (O+R+U) | 61.2 | 82.2 | 270 | 38.3 |
| −ScaleS | 60.7 | 81.7 | 269 | 38.2 |
| −ContD | 59.9 | 80.8 | 261 | 36.6 |
| −Gate | 60.1 | 81.4 | 270 | 38.3 |
Key findings: full integration of all three modules significantly outperforms any subset; content-aware decoding (ContD) contributes most; the gating strategy is critical for stable optimization.
Gradient stopping comparison:
| Configuration | AUC@5° | PCK@0.5px |
|---|---|---|
| ConvR | 56.3 | 76.6 |
| ConvR + detach | 57.8 | 77.3 |
| ArgMatch | 61.2 | 82.2 |
| ArgMatch + detach | 60.4 | 81.6 |
Conventional pipelines require gradient detachment to stabilize training, whereas ArgMatch achieves superior end-to-end optimization through gating aggregation and local consistency; gradient detachment is in fact detrimental.
Key Findings¶
- Underestimated potential of lightweight models: The key lies not in heavier feature extractors or global matchers, but in a more intelligent refinement pipeline.
- Synergy among three modules: The full potential of each module is realized only when all three are jointly integrated within the pipeline.
- Local content information is central: Modulating the correlation volume with local content is more effective than pursuing more discriminative features.
- Dual role of local consistency: It not only corrects errors in the forward pass but also guides rational gradient allocation during backpropagation.
Highlights & Insights¶
- Paradigm shift in refinement pipeline design: The focus shifts from "better initialization is required" to "better refinement reduces dependence on initialization," fundamentally reconceiving the core design philosophy of dense matching.
- Fine-grained control of gradient propagation: Gating and local consistency jointly address the fundamental difficulty of gradient propagation in coarse-to-fine frameworks.
- Efficiency–accuracy trade-off: Comparable accuracy is achieved at 1/6 the computational cost of RoMa, demonstrating the high efficiency of the proposed design.
- Hallucination matching phenomenon: The paper honestly discusses the hallucination matching problem in occluded regions caused by local consistency regression.
Limitations & Future Work¶
- Hallucination matching: Local consistency regression may propagate neighborhood information into occluded regions, producing spurious matches; confidence map learning is suboptimal due to noisy MegaDepth annotations.
- Training targets only 1/2 resolution; full-resolution recovery relies on RoMa's convolutional module, which is not trained as part of this work.
- Although lightweight, the feature extractor can be further optimized.
- Performance in scenarios with extreme viewpoint and illumination changes leaves room for improvement.
Related Work & Insights¶
- RoMa: The current dense matching state-of-the-art, relying on DINOv2 and complex global matching; ArgMatch approaches its performance at a fraction of the cost.
- DKM: Employs Gaussian process global matching (\(O(n^3)\) complexity); representative of the heavyweight paradigm ArgMatch aims to replace.
- RAFT: Classic iterative refinement framework for optical flow estimation; the upsampler design in this paper draws inspiration from it.
- NATTEN: Neighborhood attention acceleration library, used in this paper to implement efficient local consistency regression.
- Insight: Refinement pipeline design is a critical leverage point for improving efficiency, and merits further exploration in a broad range of coarse-to-fine tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐