Cross-View Completion Models are Zero-shot Correspondence Estimators
Conference: CVPR 2025
arXiv: 2412.09072
Code: cvlab-kaist.github.io/ZeroCo
Area: 3D Vision / Visual Correspondence
Keywords: Cross-View Completion, Zero-shot Matching, Cross-Attention, Cost Volume, Depth Estimation
TL;DR
Reveals that the cross-attention maps in cross-view completion (CVC) models inherently learn precise dense correspondence, and proposes ZeroCo to leverage this finding in zero-shot matching and learning-based geometric matching, significantly outperforming conventional methods based on encoder/decoder features.
Background & Motivation
- Cross-view completion (CroCo/CroCo-v2) has emerged as a powerful self-supervised pre-training task, but the mechanism of its success remains unclear.
- Existing methods leveraging CVC knowledge (DUSt3R, MASt3R, CroCo-Flow) only use decoder features as downstream representations.
- Key Insight: Cross-attention maps in CVC models capture geometric correspondence more accurately than encoder/decoder features.
- Analogy: The learning process of CVC closely resembles self-supervised correspondence learning (optical flow, stereo depth), since both reconstruct the target view from source-view features.
Method
Overall Architecture
Based on the pre-trained CroCo-v2 model, the proposed method directly utilizes the attention maps of the cross-attention layers in its decoder as the cost volume, achieving zero-shot dense correspondence estimation without any training. Furthermore, a lightweight learning module (cost aggregation + upsampling) is proposed to construct learning-based matching and depth estimation models.
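To make the "attention map as cost volume" idea concrete, here is a minimal NumPy sketch. It assumes the per-layer query and key projections of the decoder have already been extracted as plain arrays; the function names `cross_attention_cost` and `layer_averaged_cost` and the toy features are hypothetical and are not taken from the ZeroCo code.

```python
import numpy as np

def cross_attention_cost(q_tgt, k_src):
    """Pre-softmax scores C(i, j) = Q_t(i) . K_s(j).

    q_tgt: (N_t, d) target-view queries, k_src: (N_s, d) source-view keys.
    Returns an (N_t, N_s) matrix that is read as a matching cost volume.
    """
    return q_tgt @ k_src.T

def layer_averaged_cost(queries, keys):
    """Average the per-layer score maps over the decoder layers."""
    costs = [cross_attention_cost(q, k) for q, k in zip(queries, keys)]
    return np.mean(costs, axis=0)

# Toy data (assumption): unit-norm source keys, and target queries that are
# a known permutation of them, so the ground-truth matches are known.
rng = np.random.default_rng(0)
k = rng.normal(size=(16, 32))
k /= np.linalg.norm(k, axis=1, keepdims=True)
perm = rng.permutation(16)
q = k[perm]

cost = layer_averaged_cost([q, q], [k, k])  # pretend there are two layers
match = cost.argmax(axis=1)                 # best source patch per target patch
```

In this toy setup the row-wise argmax recovers the permutation exactly, which is the behaviour one hopes the real attention maps approximate at patch resolution.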
Key Designs
- Cross-Attention Map as Cost Volume (Zero-shot ZeroCo):
  - Function: Directly utilizes the cross-attention map of the CVC model for zero-shot correspondence estimation.
  - Mechanism: Extracts the pre-softmax cross-attention maps \(C^l(i,j) = D_t^{l,Q}(i) \cdot D_s^{l,K}(j)\) from the original and swapped input pairs respectively, averages them across layers, and fuses them as \(C' = \frac{1}{L}\sum_l C^l + \left(\frac{1}{L}\sum_l C^l_{\text{swap}}\right)^T\) to enforce reciprocity, finally obtaining the flow via soft-argmax.
  - Design Motivation: The cross-attention in CVC essentially learns "how to warp features from the source view to the target view", which is equivalent to learning correspondence.
- Register Token Correction:
  - Function: Eliminates artifacts in the attention map caused by register tokens.
  - Mechanism: Replaces the attention values corresponding to register tokens with the minimum attention value.
  - Design Motivation: Register tokens in Transformers lead to shortcut learning, creating artifacts where the attention inappropriately concentrates on registers instead of the correct locations.
- Learning-based Extension (ZeroCo-flow / ZeroCo-depth):
  - Function: Appends lightweight heads to the frozen cross-attention maps to achieve fine-grained matching and depth estimation.
  - Mechanism: ZeroCo-flow applies cost aggregation \(\mathcal{T}_c\) and upsampling \(\mathcal{U}\) along the target axis on the cross-attention map to obtain high-resolution flow. ZeroCo-depth feeds the aggregated attention map into a DPT head to predict depth.
  - Design Motivation: Zero-shot cross-attention maps have coarse resolution; adding shallow learning heads addresses fine-grained estimation.
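The zero-shot steps described above (reciprocal fusion, register correction, soft-argmax) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the `temperature` parameter, the use of the global minimum for suppression, and the synthetic "register" column are all hypothetical.

```python
import numpy as np

def fuse_reciprocal(cost_fwd, cost_swap):
    """C' = C + C_swap^T. cost_fwd: (N_t, N_s), cost_swap: (N_s, N_t)."""
    return cost_fwd + cost_swap.T

def suppress_registers(cost, register_cols):
    """Overwrite register-like source columns with the minimum score.

    Assumption: the global minimum of the map is used as the fill value.
    """
    out = cost.copy()
    out[:, register_cols] = cost.min()
    return out

def soft_argmax_flow(cost, h, w, temperature=1.0):
    """Expected source position under a row-wise softmax, minus target position.

    cost: (h*w, h*w), rows = target patches, columns = source patches.
    Returns flow of shape (h, w, 2) as (dx, dy) in patch units.
    """
    logits = cost / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=1, keepdims=True)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    return (prob @ grid - grid).reshape(h, w, 2)

# Toy example: identity matching (true flow is zero) plus one column that
# attracts attention from every target patch, mimicking a register artifact.
h = w = 4
n = h * w
cost_fwd = 10.0 * np.eye(n)
cost_swap = 10.0 * np.eye(n)
cost_fwd[:, 0] += 30.0

fused = fuse_reciprocal(cost_fwd, cost_swap)
raw_flow = soft_argmax_flow(fused, h, w)
clean_flow = soft_argmax_flow(suppress_registers(fused, [0]), h, w)
```

In this toy case the uncorrected flow drags every patch toward source patch 0, while the corrected flow is close to zero for all patches except patch 0 itself, whose true match lies on the suppressed column.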
Loss & Training
- ZeroCo Zero-shot: No training required, directly uses pre-trained CroCo-v2 weights.
- ZeroCo-flow: Trained with standard correspondence regression loss.
- ZeroCo-finetuned: Fine-tunes the cross-attention maps themselves.
- ZeroCo-depth: Trained with reprojection, appearance consistency, and smoothness losses (commonly used losses in self-supervised depth estimation).
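As one example of the smoothness term mentioned for ZeroCo-depth, the sketch below shows an edge-aware smoothness loss in the form commonly used in self-supervised depth estimation. It is assumed here for illustration only; the summary does not state the exact formulation used in the paper, and the function name is hypothetical.

```python
import numpy as np

def edge_aware_smoothness(disp, image):
    """Penalise disparity gradients, down-weighted where the image has edges.

    disp: (H, W) disparity or inverse depth; image: (H, W, 3) in [0, 1].
    """
    disp = disp / (disp.mean() + 1e-7)  # mean-normalise to avoid shrinking
    d_dx = np.abs(disp[:, 1:] - disp[:, :-1])
    d_dy = np.abs(disp[1:, :] - disp[:-1, :])
    i_dx = np.abs(image[:, 1:] - image[:, :-1]).mean(axis=-1)
    i_dy = np.abs(image[1:, :] - image[:-1, :]).mean(axis=-1)
    return float((d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean())

# A disparity step is penalised less when the image has an edge at the same place.
disp = np.ones((4, 4))
disp[:, 2:] = 2.0
flat_image = np.zeros((4, 4, 3))
edge_image = np.zeros((4, 4, 3))
edge_image[:, 2:] = 1.0

loss_flat = edge_aware_smoothness(disp, flat_image)
loss_edge = edge_aware_smoothness(disp, edge_image)
loss_const = edge_aware_smoothness(np.ones((4, 4)), flat_image)
```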
Key Experimental Results
Main Results (Zero-shot Matching HPatches-240)
| Method | AEPE ↓ (I) | AEPE ↓ (V) | AEPE ↓ (Avg) |
|---|---|---|---|
| DINOv2 (Correlation) | 18.81 | 36.60 | 28.08 |
| DIFT_SD (Correlation) | 15.89 | 40.34 | 29.06 |
| CroCo Encoder (Correlation) | 39.69 | 54.63 | 47.52 |
| CroCo Decoder (Correlation) | 32.38 | 54.84 | 44.63 |
| ZeroCo (Cross-attn) | 5.07 | 13.26 | 9.41 |
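For reference, AEPE (average end-point error) in the table above is the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch, assuming flows stored as `(H, W, 2)` arrays and an optional validity mask (the function name and argument layout are illustrative):

```python
import numpy as np

def aepe(flow_pred, flow_gt, valid=None):
    """Mean Euclidean distance between predicted and ground-truth flow.

    flow_pred, flow_gt: (H, W, 2); valid: optional (H, W) boolean mask.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    return float(err[valid].mean())

# Every pixel is off by (3, 4), so the end-point error is 5 everywhere.
gt = np.zeros((2, 2, 2))
pred = np.zeros((2, 2, 2))
pred[..., 0] = 3.0
pred[..., 1] = 4.0
score = aepe(pred, gt)

# Restricting the evaluation to a single valid pixel gives the same value here.
mask = np.zeros((2, 2), dtype=bool)
mask[0, 0] = True
masked_score = aepe(pred, gt, mask)
```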
Ablation Study
| Configuration | HPatches AEPE ↓ | Explanation |
|---|---|---|
| Forward attention only | Higher | Lacks reciprocity constraint |
| With register tokens | Artifacts | Abnormal attention concentration |
| Register removal + Bidirectional fusion | Optimal | Reciprocity + artifact elimination |
| Encoder feature correlation | 47.52 | Geometric information is diluted |
| Decoder feature correlation | 44.63 | Better than encoder, but far worse than cross-attention |
Key Findings
- In the zero-shot setting on HPatches-240, the cross-attention map reaches an average AEPE of 9.41, roughly a third of the error of DINOv2 (28.08) and diffusion model features (DIFT_SD, 29.06).
- On ETH3D zero-shot matching, the average AEPE decreases from the best baseline of 25.69 to 12.72.
- Correlation maps built from CroCo's own encoder and decoder features perform poorly (47.52 and 44.63 average AEPE), which supports the claim that geometric knowledge is mainly encoded in cross-attention.
- Adding extremely lightweight heads achieves competitive results in learning-based matching and multi-frame depth estimation.
- Shows robustness to dynamic objects (does not rely on epipolar geometry).
Highlights & Insights
- The core finding is highly insightful: cross-attention \(\approx\) implicitly learned cost volumes, CVC \(\approx\) self-supervised correspondence learning.
- Challenges existing practices (such as DUSt3R using decoder features), arguing that cross-attention is the component that actually carries the geometric knowledge.
- ZeroCo features a minimalist design: only reciprocity fusion + register correction, achieving SOTA-level zero-shot matching without training.
- Clear analogy: the source-to-target warp in CVC mirrors the source-to-target alignment in optical flow and stereo matching.
- Provides a new paradigm for how pre-trained CVC models should be utilized.
Limitations & Future Work
- The resolution of cross-attention maps is limited by the ViT patch size (usually 16×16); fine-grained matching still requires an extra learning head.
- Zero-shot inference requires two forward passes: forward and swapped.
- Only validated on CroCo-v2; other CVC models (such as the cross-attention in the DUSt3R decoder) have not been explored.
- Direct applications of cross-attention maps in other 3D tasks (e.g., pose estimation, 3D reconstruction) could be explored.
Related Work & Insights
- Reveals a possible success mechanism of DUSt3R/MASt3R: correspondence priors in cross-attention.
- Contrasts with diffusion model feature matching methods like DIFT; CVC is specifically designed for geometric tasks and yields better performance.
- Register token correction is related to studies on artifacts in ViTs (Darcet et al.).
- Inspiration: Pre-trained models might contain better feature representations than those conventionally used; their internal mechanisms warrant deep analysis.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Reveals a neglected but highly important finding.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation on zero-shot, learning-based, and depth estimation tasks, with rich visualizations.
- Writing Quality: ⭐⭐⭐⭐⭐ Elegant analogy-driven narrative, clear figures and tables.
- Value: ⭐⭐⭐⭐⭐ Reshapes the paradigm of using pre-trained CVC models, with far-reaching influence.