Cross-View Completion Models are Zero-shot Correspondence Estimators

Conference: CVPR 2025
arXiv: 2412.09072
Code: cvlab-kaist.github.io/ZeroCo
Area: 3D Vision / Visual Correspondence
Keywords: Cross-View Completion, Zero-shot Matching, Cross-Attention, Cost Volume, Depth Estimation

TL;DR

Shows that the cross-attention maps in cross-view completion (CVC) models implicitly learn precise dense correspondence, and proposes ZeroCo, which uses these maps for zero-shot matching and learning-based geometric matching. ZeroCo substantially outperforms conventional methods that correlate encoder/decoder features.

Background & Motivation

Cross-view completion (CroCo/CroCo-v2) has emerged as a powerful self-supervised pre-training task, but the mechanism of its success remains unclear.
Existing methods leveraging CVC knowledge (DUSt3R, MASt3R, CroCo-Flow) only use decoder features as downstream representations.
Key Insight: Cross-attention maps in CVC models capture geometric correspondence more accurately than encoder/decoder features.
Analogy: CVC training closely resembles self-supervised correspondence learning (optical flow, stereo depth), since both reconstruct the target view from source-view features.

Method

Overall Architecture

Based on the pre-trained CroCo-v2 model, the proposed method directly utilizes the attention maps of the cross-attention layers in its decoder as the cost volume, achieving zero-shot dense correspondence estimation without any training. Furthermore, a lightweight learning module (cost aggregation + upsampling) is proposed to construct learning-based matching and depth estimation models.

Key Designs

Cross-Attention Map as Cost Volume (Zero-shot ZeroCo):
- Function: Directly utilizes the cross-attention map of the CVC model for zero-shot correspondence estimation.
- Mechanism: Extracts the pre-softmax cross-attention maps \(C^l(i,j) = D_t^{l,Q}(i) \cdot D_s^{l,K}(j)\) from the original and swapped input pairs respectively, averages them across layers, and fuses them as \(C' = \frac{1}{L}\sum_l C^l + (\frac{1}{L}\sum_l C^l_{\text{swap}})^T\) to enforce reciprocity, finally obtaining the flow via soft-argmax.
- Design Motivation: The cross-attention in CVC essentially learns "how to warp features from the source view to the target view", which is equivalent to learning correspondence.
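The fusion and soft-argmax steps described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fuse_cost_volumes` and `soft_argmax_flow`, the `(L, N_t, N_s)` array layout, the `temperature` parameter, and the toy permutation data are all hypothetical.

```python
import numpy as np

def fuse_cost_volumes(attn_fwd, attn_swap):
    """Fuse layer-averaged attention logits from both input orders.

    attn_fwd:  (L, N_t, N_s) pre-softmax logits for the (target, source) pair.
    attn_swap: (L, N_s, N_t) pre-softmax logits for the swapped pair.
    Returns C' = mean_l C^l + (mean_l C^l_swap)^T with shape (N_t, N_s).
    """
    return attn_fwd.mean(axis=0) + attn_swap.mean(axis=0).T

def soft_argmax_flow(cost, h, w, temperature=1.0):
    """Turn an (h*w, h*w) cost volume into a flow field of shape (h, w, 2).

    Flow is expressed in patch units as (dx, dy). `temperature` is an
    assumed knob, not a value taken from the paper.
    """
    logits = cost / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=1, keepdims=True)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    expected = prob @ grid  # expected source coordinate per target patch
    return (expected - grid).reshape(h, w, 2)

# Toy example: target patch i matches source patch perm[i].
h = w = 4
n = h * w
rng = np.random.default_rng(0)
perm = rng.permutation(n)
attn_fwd = np.zeros((2, n, n))
attn_fwd[:, np.arange(n), perm] = 50.0
attn_swap = np.zeros((2, n, n))
attn_swap[:, perm, np.arange(n)] = 50.0

fused = fuse_cost_volumes(attn_fwd, attn_swap)
flow = soft_argmax_flow(fused, h, w)
```

In this toy case both directions agree, so the fused peak is twice as strong as either direction alone. When the two directions disagree, the sum favors matches that are supported reciprocally.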
Register Token Correction:
- Function: Eliminates artifacts in the attention map caused by register tokens.
- Mechanism: Replaces the attention values corresponding to register tokens with the minimum attention value.
- Design Motivation: Register tokens in Transformers lead to shortcut learning, creating artifacts where the attention inappropriately concentrates on registers instead of the correct locations.
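The correction described above can be sketched as follows. This is a hedged illustration only: the function name `suppress_register_columns` is hypothetical, the register-token indices are assumed to be known in advance, and using the global minimum (rather than, say, a per-row minimum) is an assumption, since the summary does not specify which minimum is used.

```python
import numpy as np

def suppress_register_columns(attn, register_idx):
    """Overwrite attention toward register tokens with the minimum value.

    attn:         (N_t, N_s) attention map (target queries x source keys).
    register_idx: indices of source tokens assumed to act as registers.
    Uses the global minimum of `attn`; this choice is an assumption.
    """
    out = attn.copy()
    out[:, register_idx] = attn.min()
    return out

# Toy example: column 0 plays the role of a register that absorbs attention.
rng = np.random.default_rng(0)
attn = rng.random((4, 6))
attn[:, 0] += 10.0
corrected = suppress_register_columns(attn, [0])
```

After the correction, the per-row maximum can no longer fall on the register column, so a subsequent argmax or soft-argmax is driven by the remaining tokens.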
Learning-based Extension (ZeroCo-flow / ZeroCo-depth):
- Function: Appends lightweight heads to the frozen cross-attention maps to achieve fine-grained matching and depth estimation.
- Mechanism: ZeroCo-flow applies cost aggregation \(\mathcal{T}_c\) and upsampling \(\mathcal{U}\) along the target axis on the cross-attention map to obtain high-resolution flow. ZeroCo-depth feeds the aggregated attention map into a DPT head to predict depth.
- Design Motivation: Zero-shot cross-attention maps have coarse resolution; adding shallow learning heads addresses fine-grained estimation.

Loss & Training

ZeroCo Zero-shot: No training required, directly uses pre-trained CroCo-v2 weights.
ZeroCo-flow: Trained with standard correspondence regression loss.
ZeroCo-finetuned: Fine-tunes the cross-attention maps themselves.
ZeroCo-depth: Trained with reprojection, appearance consistency, and smoothness losses (commonly used losses in self-supervised depth estimation).
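To make the smoothness term concrete, here is a sketch of the edge-aware smoothness loss widely used in self-supervised depth estimation. Whether ZeroCo-depth uses exactly this form is not stated in this summary, so treat it as an assumption. The function name `edge_aware_smoothness` and the toy inputs are hypothetical.

```python
import numpy as np

def edge_aware_smoothness(disp, img):
    """Edge-aware smoothness penalty on a disparity (or inverse-depth) map.

    disp: (H, W) disparity map.
    img:  (H, W, 3) image used to down-weight the penalty at image edges.
    """
    norm_disp = disp / (disp.mean() + 1e-7)  # mean-normalize to avoid shrinking
    dx_d = np.abs(norm_disp[:, 1:] - norm_disp[:, :-1])
    dy_d = np.abs(norm_disp[1:, :] - norm_disp[:-1, :])
    dx_i = np.abs(img[:, 1:] - img[:, :-1]).mean(axis=-1)
    dy_i = np.abs(img[1:, :] - img[:-1, :]).mean(axis=-1)
    # Disparity gradients are penalized less where the image itself has edges.
    return float((dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean())

# Toy example: a horizontal disparity ramp over a flat and a textured image.
ramp = np.tile(np.arange(1, 9, dtype=np.float64), (8, 1))
flat_img = np.zeros((8, 8, 3))
textured_img = np.random.default_rng(0).random((8, 8, 3))
loss_flat = edge_aware_smoothness(ramp, flat_img)
loss_textured = edge_aware_smoothness(ramp, textured_img)
loss_constant = edge_aware_smoothness(np.ones((8, 8)), flat_img)
```

The same ramp is penalized less over the textured image than over the flat one, and a constant disparity map incurs no penalty.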

Key Experimental Results

Main Results (Zero-shot Matching HPatches-240)

Method	AEPE ↓ (I)	AEPE ↓ (V)	AEPE ↓ (Avg)
DINOv2 (Correlation)	18.81	36.60	28.08
DIFT_SD (Correlation)	15.89	40.34	29.06
CroCo Encoder (Corr.)	39.69	54.63	47.52
CroCo Decoder (Corr.)	32.38	54.84	44.63
ZeroCo (Cross-attn)	5.07	13.26	9.41
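AEPE (average end-point error), the metric reported in the table above, is the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch, where the function name `aepe` and the optional `valid` mask are illustrative choices rather than the benchmark's official evaluation code:

```python
import numpy as np

def aepe(flow_pred, flow_gt, valid=None):
    """Average end-point error between two flow fields.

    flow_pred, flow_gt: (H, W, 2) arrays of (dx, dy) displacements in pixels.
    valid:              optional (H, W) boolean mask of pixels to evaluate.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel L2 error
    if valid is not None:
        err = err[valid]
    return float(err.mean())

# Toy example: every prediction is off by the vector (3, 4), i.e. 5 pixels.
pred = np.zeros((2, 2, 2))
gt = np.tile(np.array([3.0, 4.0]), (2, 2, 1))
mask = np.array([[True, False], [False, True]])
err_all = aepe(pred, gt)
err_masked = aepe(pred, gt, valid=mask)
```

Lower is better, which is what the ↓ in the column headers indicates.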

Ablation Study

Configuration	HPatches AEPE ↓	Explanation
Forward attention only	Higher	Lacks reciprocity constraint
With register tokens	Artifacts	Abnormal attention concentration
Register removal + Bidirectional fusion	Optimal	Reciprocity + artifact elimination
Encoder feature correlation	47.52	Geometric information is diluted
Decoder feature correlation	44.63	Better than encoder, but far worse than cross-attention

Key Findings

In the zero-shot setting on HPatches-240, the cross-attention map reaches an average AEPE of 9.41, roughly a third of that of DINOv2 (28.08) and diffusion model features (29.06).
On ETH3D zero-shot matching, the average AEPE decreases from the best baseline of 25.69 to 12.72.
Correlation maps built from CroCo's own encoder or decoder features perform poorly (47.52 and 44.63 average AEPE), which indicates that the geometric knowledge is encoded mainly in the cross-attention.
Adding extremely lightweight heads achieves competitive results in learning-based matching and multi-frame depth estimation.
Shows robustness to dynamic objects (does not rely on epipolar geometry).

Highlights & Insights

The core finding is highly insightful: cross-attention \(\approx\) implicitly learned cost volumes, CVC \(\approx\) self-supervised correspondence learning.
Challenges existing practice (such as DUSt3R using decoder features) by providing evidence that cross-attention is the component that carries most of the geometric knowledge.
ZeroCo features a minimalist design: only reciprocity fusion + register correction, achieving SOTA-level zero-shot matching without training.
Clear analogy: the source-to-target warp in CVC mirrors source-to-target alignment in optical flow and stereo matching.
Provides a new paradigm for how pre-trained CVC models should be utilized.

Limitations & Future Work

The resolution of cross-attention maps is limited by the ViT patch size (usually 16×16); fine-grained matching still requires an extra learning head.
Zero-shot inference requires two forward passes: forward and swapped.
Only validated on CroCo-v2; other CVC models (such as the cross-attention in the DUSt3R decoder) have not been explored.
Direct applications of cross-attention maps in other 3D tasks (e.g., pose estimation, 3D reconstruction) could be explored.

Related Work & Insights

Reveals a possible success mechanism of DUSt3R/MASt3R: correspondence priors in cross-attention.
Contrasts with diffusion model feature matching methods like DIFT; CVC is specifically designed for geometric tasks and yields better performance.
Register token correction is related to studies on artifacts in ViTs (Darcet et al.).
Inspiration: Pre-trained models might contain better feature representations than those conventionally used; their internal mechanisms warrant deep analysis.

Rating

Novelty: ⭐⭐⭐⭐⭐ Reveals a neglected but highly important finding.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation on zero-shot, learning-based, and depth estimation tasks, with rich visualizations.
Writing Quality: ⭐⭐⭐⭐⭐ Elegant analogy-driven narrative, clear figures and tables.
Value: ⭐⭐⭐⭐⭐ Reshapes the paradigm of using pre-trained CVC models, with far-reaching influence.