Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?¶
Conference: CVPR 2026 · arXiv: 2604.10217 · Code: None · Area: Remote Sensing Imagery · Keywords: SAR-optical registration, image matching, cross-modal, zero-shot transfer, satellite imagery
TL;DR¶
This paper evaluates 24 families of pretrained image matchers on SAR-optical satellite registration under a zero-shot setting, finding that deployment protocol choices (geometric model, tile size, etc.) can change registration error by up to 33×, sometimes exceeding the effect of switching the matcher itself.
Background & Motivation¶
Background: Cloud cover during disaster response frequently renders optical imagery unavailable, necessitating the registration of SAR images to optical basemaps for generating georeferenced damage assessments. However, state-of-the-art image matchers are designed for indoor, urban, or natural scene imagery.
Limitations of Prior Work: Optical and SAR sensors observe the same scene through fundamentally different physical mechanisms — optical sensors capture reflected light (texture-rich), while SAR captures radar backscatter (speckle noise, layover, radiometric inversion). Whether pretrained matchers can function under such extreme domain shift remains unclear.
Key Challenge: Cross-modal matching requires modality-invariant feature representations, yet pretrained data contains virtually no satellite or SAR imagery.
Goal: Evaluate the cross-modal satellite registration performance of 24 matcher families in a purely zero-shot setting, without any fine-tuning or domain adaptation.
Key Insight: A unified, deterministic evaluation protocol encompassing large-image tile-based inference, robust geometric filtering, and tie-point-anchored metrics.
Core Idea: Cross-modal transfer is asymmetric — explicit cross-modal training does not consistently outperform training on natural images alone, and foundation model features may partially substitute for cross-modal supervision.
Method¶
Overall Architecture¶
24 matcher families × unified evaluation protocol (tile-based inference + RANSAC geometric filtering + tie-point metrics) × SpaceNet9 and two additional cross-modal benchmarks. Deployment protocol choices (geometric model, tile size, inlier gating) are systematically ablated.
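Tile-based inference first decomposes the very large satellite scene into overlapping tiles before running any matcher. A minimal sketch of such a tiler (the tile and overlap sizes here are illustrative defaults, not values from the paper, which sweeps tile size as a protocol parameter):

```python
def tile_grid(height, width, tile=1024, overlap=128):
    """Return (row, col) origins of overlapping tiles covering an image.

    Assumes the image is at least one tile in each dimension. The last
    tile in each direction is shifted so it ends exactly at the border,
    guaranteeing full coverage without growing past the image.
    """
    step = tile - overlap
    rows = list(range(0, height - tile + 1, step))
    cols = list(range(0, width - tile + 1, step))
    # Shift in a final tile if the regular stride leaves a border gap.
    if rows[-1] + tile < height:
        rows.append(height - tile)
    if cols[-1] + tile < width:
        cols.append(width - tile)
    return [(r, c) for r in rows for c in cols]
```

Each tile pair is then matched independently, and the per-tile correspondences are pooled before geometric filtering.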
Key Designs¶
- Unified Evaluation Protocol:
  - Function: Ensures all 24 matchers are evaluated under fully comparable conditions.
  - Mechanism: Tile-based inference to handle the very high resolution of satellite imagery, affine or fundamental-matrix RANSAC for geometric filtering, and tie-point-anchored reprojection-error metrics.
  - Design Motivation: Different matcher papers employ different evaluation setups, making direct comparison meaningless without standardization.
- Deployment Protocol Sensitivity Analysis:
  - Function: Quantifies the impact of hyperparameter choices on registration accuracy.
  - Mechanism: Systematic sweep over protocol parameters including geometric model (affine/fundamental matrix/homography), tile size, and inlier gating threshold. Using affine geometry alone reduces mean error from 12.34 px to 9.74 px. A single matcher can exhibit up to 33× accuracy variation across different protocols.
  - Design Motivation: In practical deployment, protocol selection may matter more than matcher selection.
- Cross-Modal Transfer Asymmetry Finding:
  - Function: Reveals that explicit cross-modal training is not a prerequisite for strong performance.
  - Mechanism: XoFTR (trained on visible-thermal pairs) and RoMa (no cross-modal training) both achieve the lowest mean error of 3.0 px. MatchAnything-ELoFTR (trained on synthetic cross-modal pairs) follows closely at 3.4 px. DINOv2 foundation model features may provide partial modality invariance.
  - Design Motivation: Challenges the intuition that cross-modal tasks necessarily require cross-modal training.
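The protocol's two geometric components — robust affine filtering and the tie-point reprojection-error metric — can be illustrated with a toy pure-NumPy RANSAC. This is not the paper's implementation (which presumably relies on a library estimator such as OpenCV's); it is a self-contained sketch of the idea:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine mapping src -> dst; both arrays are (N, 2)."""
    A = np.hstack([src, np.ones((len(src), 1))])  # (N, 3) homogeneous coords
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)   # (3, 2) solution
    return X.T                                    # standard 2x3 affine matrix

def reprojection_errors(M, src, dst):
    """Per-correspondence reprojection error in pixels (the tie-point metric)."""
    pred = src @ M[:, :2].T + M[:, 2]
    return np.linalg.norm(pred - dst, axis=1)

def ransac_affine(src, dst, thresh=3.0, iters=200, seed=0):
    """Toy RANSAC: sample minimal 3-point sets, keep the model with most inliers,
    then refit on all inliers."""
    rng = np.random.default_rng(seed)
    best_inl = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)
        M = fit_affine(src[idx], dst[idx])
        inl = reprojection_errors(M, src, dst) < thresh
        if inl.sum() > best_inl.sum():
            best_inl = inl
    return fit_affine(src[best_inl], dst[best_inl]), best_inl
```

The same reprojection-error function, evaluated at ground-truth tie points rather than at matched keypoints, is what anchors the paper's accuracy metric.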
Loss & Training¶
This is a purely zero-shot evaluation study; no training is involved. All matchers use their official pretrained weights.
Key Experimental Results¶
Main Results¶
| Matcher | SpaceNet9 Mean Error (px) | Cross-Modal Training |
|---|---|---|
| XoFTR | 3.0 | Yes (visible–thermal) |
| RoMa | 3.0 | No |
| MatchAnything-ELoFTR | 3.4 | Yes (synthetic cross-modal) |
| MASt3R/DUSt3R | Protocol-sensitive | No (3D reconstruction) |
Ablation Study¶
| Protocol Choice | Mean Error Change | Note |
|---|---|---|
| Affine geometry vs. others | 12.34→9.74 px | 21% reduction |
| Tile size variation | Up to 33× difference | For individual matchers |
| Inlier gating variation | Significant impact | Both overly strict and overly loose settings degrade performance |
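The ablation amounts to a grid sweep over protocol axes, selecting the configuration with the lowest mean tie-point error per matcher. A minimal sketch of that sweep scaffolding (the axis names and candidate values here are hypothetical placeholders, not the paper's exact grid):

```python
from itertools import product

# Hypothetical protocol axes mirroring the ablation's dimensions.
GEOM_MODELS = ["affine", "fundamental", "homography"]
TILE_SIZES = [512, 1024, 2048]
INLIER_THRESH = [1.0, 3.0, 5.0]

def protocol_grid():
    """Enumerate every protocol configuration in the sweep."""
    return [
        {"geom": g, "tile": t, "thresh": th}
        for g, t, th in product(GEOM_MODELS, TILE_SIZES, INLIER_THRESH)
    ]

def best_protocol(errors_by_protocol):
    """Pick the protocol key with the lowest mean per-scene error (px)."""
    return min(
        errors_by_protocol,
        key=lambda k: sum(errors_by_protocol[k]) / len(errors_by_protocol[k]),
    )
```

Selecting per matcher in this way is what surfaces the headline result: for some matchers, the best and worst cells of the grid differ by up to 33×.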
Key Findings¶
- The impact of deployment protocol choices can exceed that of switching matchers — affine geometry alone reduces error by 21%.
- 3D reconstruction-based matchers (MASt3R/DUSt3R) are highly fragile under default settings and are heavily dependent on protocol configuration.
- DINOv2 foundation model features may provide a form of implicit modality invariance.
Highlights & Insights¶
- "Protocol over algorithm" finding: For practitioners, optimizing the deployment protocol may be more effective than replacing the matcher.
- Cross-modal transfer asymmetry: Counterintuitively, RoMa achieves the lowest error without any cross-modal training.
- Hypothesis of implicit modality invariance in DINOv2: Foundation model features may inherently generalize across modalities.
Limitations & Future Work¶
- Only zero-shot performance is evaluated; the effect of few-shot fine-tuning remains unexplored.
- Scene coverage in SpaceNet9 may be limited (primarily urban areas).
- The hypothesis of implicit modality invariance in DINOv2 requires deeper mechanistic analysis.
Related Work & Insights¶
- vs. RemoteCLIP: RemoteCLIP achieves domain adaptation through large-scale remote sensing pretraining, whereas this paper demonstrates that zero-shot transfer may already be sufficient.
- vs. LoFTR/ELoFTR: Standard natural-image matchers exhibit inconsistent performance on cross-modal satellite registration; deployment protocol is the critical factor.
Rating¶
- Novelty: ⭐⭐⭐ Empirical study, but findings are valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 24 matcher families × multiple protocols × multiple benchmarks — extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐ In-depth analysis with a practical orientation.
- Value: ⭐⭐⭐⭐ Provides direct guidance for real-world deployment in applications such as disaster response.