# CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale
Conference: CVPR 2026 · arXiv: 2604.06245 · Code: https://hf.co/datasets/jfang/CraterBench-R
Area: Planetary Science / Image Retrieval
Keywords: crater retrieval, instance-level retrieval, ViT patch token, training-free token aggregation, two-stage retrieval
## TL;DR
This work is the first to formalize crater analysis as an instance-level image retrieval problem. It introduces the CraterBench-R benchmark (~25K Mars crater IDs, 50K gallery, 5K queries), and through systematic diagnosis reveals that single-vector pooling imposes an accuracy ceiling while supervised metric learning consistently degrades performance. A training-free instance token aggregation method is proposed—selecting K seed tokens via top-K attention or FPS and performing cosine nearest-neighbor residual assignment—to compress 196 ViT patch tokens into K representative tokens for late interaction matching. At K=64, the method matches full-token accuracy with substantially reduced storage. A practical two-stage pipeline (single-vector coarse retrieval + instance token re-ranking) recovers 89–94% of full-pipeline accuracy.
## Background & Motivation
Background: Mars orbital imagery contains millions of crater structures. Deep learning efforts have focused on detection—predicting locations and diameters—without providing visual representations suitable for association.
Practical Need: Scientific workflows depend on association—deduplication of the same crater across images, cross-observation matching, and morphological analogy discovery. These tasks are fundamentally retrieval problems, not detection problems.
Core Challenge: Martian crater appearance is highly complex—varying degradation states (pristine vs. heavily eroded), diverse infill mechanisms (dunes/dust/lava), and dramatic illumination variation across orbital passes—resulting in extreme structural and photometric variability.
Representation Bottleneck Findings: (1) Single-vector global descriptors (CLS/GeM pooling) over-compress spatial detail, imposing a hard accuracy ceiling. (2) Supervised metric learning (three commonly used losses) consistently degrades retrieval accuracy, including late interaction accuracy—attributed to only 2 views per ID, yielding insufficient positive diversity. (3) Retaining all 196 patch tokens for late interaction achieves high accuracy but is infeasible at planetary scale due to storage and computation costs.
Core Idea: Training-free instance token aggregation—post-hoc compression from frozen ViT features—avoids fine-tuning degradation while preserving spatial detail.
## Method
### Key Designs
- CraterBench-R Benchmark:
- ~25K crater IDs, each with 2 gallery views (~50K gallery images)
- 5K manually verified query images (1,000 crater IDs × 5 views) with cross-scale and context variation
- Mars CTX imagery with a complete evaluation protocol
- Diameter range: 1.0–401 km (median 1.5 km; 69% below 2 km)
- Gallery provided in two standard crops: 2× and 3× diameter context, explicitly evaluating robustness to context variation
- Queries manually verified to exclude degraded samples (pure background, severe artifacts, etc.)
- Evaluation metrics: Recall@K (K=1,5,10) and mAP; cluster-tolerant relevance to handle co-visible crater cases
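The benchmark's ranking metrics are standard; a minimal sketch of Recall@K and average precision over a single query's ranked gallery list (the paper's cluster-tolerant relevance handling for co-visible craters is omitted here, and the function names are illustrative, not from the released code):

```python
def recall_at_k(ranked_ids, positive_ids, k):
    # Query counts as a hit if any relevant gallery item appears in the top-k.
    return float(any(g in positive_ids for g in ranked_ids[:k]))

def average_precision(ranked_ids, positive_ids):
    # Precision at each relevant hit, averaged over the number of positives.
    hits, score = 0, 0.0
    for rank, g in enumerate(ranked_ids, start=1):
        if g in positive_ids:
            hits += 1
            score += hits / rank
    return score / max(len(positive_ids), 1)

# Toy example: 5 ranked gallery IDs, 2 of them relevant.
ranked = [7, 3, 9, 1, 4]
positives = {3, 1}
r1 = recall_at_k(ranked, positives, 1)       # 0.0 — top result is not relevant
r5 = recall_at_k(ranked, positives, 5)       # 1.0
ap = average_precision(ranked, positives)    # (1/2 + 2/4) / 2 = 0.5
```

mAP as reported in the tables below is this AP averaged over all 5K queries.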
- Baseline Diagnosis (30 frozen backbones):
- Self-supervised ViTs—especially domain-specific pretrained MarsDINO—perform best, outperforming general-purpose models with 79× more parameters
- ViT-B/16 MarsDINO (85M parameters): R@1=.374, mAP=.553—best single-vector result
- Same architecture with generic DINO pretraining: R@1=.304 → domain-specific pretraining yields a +7.0-point R@1 gain
- MAE (.022) and CLIP (.058) perform extremely poorly under the same ViT-B/16 architecture → pretraining objective matters more than architecture
- Single-vector pooling (CLS/GeM) constitutes an insurmountable accuracy ceiling
- Supervised metric learning (Triplet/ArcFace/SupCon): all three losses consistently degrade retrieval accuracy
- Triplet performs best yet still reduces CLS mAP from .368 to .318 and LI mAP from .602 to .530
- Root cause: only 2 views per ID → insufficient positive diversity → full-backbone fine-tuning corrupts the token-level structure required by late interaction
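The single-vector baselines pool a ViT's patch tokens into one descriptor. A minimal numpy sketch of GeM (generalized-mean) pooling, the pooling used by several of the strongest baselines above (`p=3` is a common default; this is an assumption for illustration, not the paper's exact configuration):

```python
import numpy as np

def gem_pool(tokens, p=3.0, eps=1e-6):
    """GeM pooling over (N, D) patch tokens.

    p=1 recovers average pooling; p -> inf approaches max pooling.
    Inputs are clamped to be positive, as GeM assumes non-negative features.
    """
    x = np.clip(tokens, eps, None)
    pooled = np.mean(x ** p, axis=0) ** (1.0 / p)
    return pooled / np.linalg.norm(pooled)   # l2-normalize for cosine retrieval

# 196 patch tokens of a ViT-B/16 on a 224x224 image -> one 768-d descriptor.
rng = np.random.default_rng(0)
desc = gem_pool(rng.random((196, 768)))
```

Whatever the pooling, the result is one D-dimensional vector per image, which is exactly the over-compression the diagnosis above identifies.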
- Instance Token Aggregation (Training-Free; Core Contribution):
- Step 1 — Seed Selection: Select K seed indices \(\mathcal{S}=\{s_1,\ldots,s_K\}\) via attention-based selection (top-K by CLS→patch attention weights) or FPS (farthest point sampling in cosine space)
- Step 2 — Assignment: Non-seed tokens are assigned to their nearest seed by cosine similarity, forming clusters \(C_k\)
- Step 3 — Aggregation: Seed and cluster tokens are merged in residual form: \(\mathbf{z}_k = \ell_2\left(\mathbf{t}_{s_k} + \frac{1}{\max(|C_k|, \epsilon)}\sum_{i \in C_k} \mathbf{t}_i\right)\)
- Why residual rather than centroid: The residual formulation preserves the seed's identity, maintaining discriminability even for small clusters; k-means centroids blur local morphological detail
- Output: K instance tokens used for ColBERT-style late interaction matching: \(s_{\mathrm{LI}}(q,g) = \frac{1}{K_q}\sum_{i=1}^{K_q}\max_{1 \leq j \leq K_g} \langle \mathbf{t}_i^q, \mathbf{t}_j^g \rangle\)
- Training-free → avoids the fine-tuning degradation trap
- At K=16: mAP is +17.9 pts above raw token selection; at K=64: ≈ full 196-token accuracy with 3× storage reduction
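The three steps above can be sketched in numpy. This is an illustrative reconstruction from the formulas, not the authors' released code: FPS seed selection in cosine space, nearest-seed assignment, residual aggregation, and the ColBERT-style late-interaction score (empty clusters are handled with a conditional rather than the \(\epsilon\) in the denominator):

```python
import numpy as np

def l2n(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fps_seeds(tokens, K):
    # Farthest point sampling in cosine space: greedily add the token
    # least similar to all seeds chosen so far.
    t = l2n(tokens)
    seeds, max_sim = [0], t @ t[0]
    for _ in range(K - 1):
        nxt = int(np.argmin(max_sim))
        seeds.append(nxt)
        max_sim = np.maximum(max_sim, t @ t[nxt])
    return np.array(seeds)

def aggregate(tokens, seeds):
    # Assign each non-seed token to its most cosine-similar seed, then
    # merge in residual form: seed token + cluster mean, l2-normalized.
    t = l2n(tokens)
    assign = seeds[np.argmax(t @ t[seeds].T, axis=1)]  # nearest seed per token
    out = []
    for s in seeds:
        members = np.where((assign == s) & (np.arange(len(tokens)) != s))[0]
        mean = tokens[members].mean(axis=0) if len(members) else 0.0
        out.append(tokens[s] + mean)
    return l2n(np.stack(out))

def late_interaction(q_tokens, g_tokens):
    # For each query token, take its best-matching gallery token; average.
    sim = l2n(q_tokens) @ l2n(g_tokens).T
    return float(sim.max(axis=1).mean())

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))            # frozen ViT patch tokens
z = aggregate(patches, fps_seeds(patches, 64))       # (64, 768) instance tokens
```

The residual form keeps the seed token dominant in each output, which is the stated reason it outperforms k-means centroids on small clusters.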
- Two-Stage Planetary-Scale Retrieval Pipeline:
- Stage 1: Single-vector FAISS coarse retrieval of top-S candidates (millisecond-level)
- Stage 2: Instance token late interaction re-ranking
- Offline aggregation complexity: \(O(NK)\) per image; online matching complexity: \(O(K^2D)\) per candidate
- At S=100: recovers 89–94% of full-pipeline accuracy
- At S=500: recovers ~96%
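A brute-force numpy sketch of the two-stage pipeline (the function name and toy shapes are illustrative; at planetary scale Stage 1 would use a FAISS index over the single-vector descriptors rather than a dense matrix product):

```python
import numpy as np

def two_stage_search(q_vec, q_tokens, g_vecs, g_tokens, S=100, topk=10):
    """Stage 1: cosine shortlist of S candidates from single-vector descriptors.
    Stage 2: late-interaction re-ranking of the shortlist with instance tokens.
    All inputs are assumed l2-normalized."""
    coarse = g_vecs @ q_vec                       # (N,) cosine scores
    shortlist = np.argsort(-coarse)[:S]
    def li(qt, gt):                               # ColBERT-style max-sim score
        return float((qt @ gt.T).max(axis=1).mean())
    fine = np.array([li(q_tokens, g_tokens[i]) for i in shortlist])
    return shortlist[np.argsort(-fine)][:topk]

# Toy gallery: 50 items, K=4 instance tokens of dim 8 each.
rng = np.random.default_rng(1)
norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
g_tokens = norm(rng.standard_normal((50, 4, 8)))
q_tokens = g_tokens[5].copy()                     # query matches gallery item 5
g_vecs = norm(g_tokens.mean(axis=1))              # stand-in single-vector descriptors
q_vec = norm(q_tokens.mean(axis=0))
top = two_stage_search(q_vec, q_tokens, g_vecs, g_tokens, S=20, topk=5)
```

The S parameter trades recall for cost exactly as in the numbers above: Stage 2 only sees what Stage 1 surfaces.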
## Key Experimental Results
### Main Results — Frozen Backbone Single-Vector Retrieval
| Model | Params | Pooling | R@1 | R@5 | mAP |
|---|---|---|---|---|---|
| EfficientNet-B0 | 4M | GAP | .150 | .214 | .248 |
| ResNet-50 | 24M | GeM | .142 | .217 | .244 |
| ViT-S/16 DINO | 22M | CLS | .273 | .360 | .420 |
| ViT-B/8 DINO | 86M | GeM | .304 | .379 | .461 |
| ViT-B/14 DINOv2 | 87M | Max | .240 | .323 | .377 |
| ViT-7B/16 DINOv3_sat | 6.7B | Max | .330 | .416 | .505 |
| ViT-B/16 MAE | 86M | GeM | .022 | .042 | .043 |
| ViT-B/16 CLIP | 86M | GeM | .058 | .091 | .107 |
| ViT-S/16 MarsDINO | 22M | GeM | .269 | .356 | .412 |
| ViT-B/16 MarsDINO | 85M | CLS | .374 | .472 | .553 |
### Ablation Study — Instance Token Aggregation
| Configuration | mAP | Notes |
|---|---|---|
| Single-vector (best backbone) | .553 | MarsDINO CLS pooling ceiling |
| Raw attention selection K=16 | .444 | Token selection only, no aggregation |
| Instance token aggregation K=16 | .623 | +17.9 pts, substantial gain |
| Raw attention selection K=64 | .716 | Accuracy increases with more tokens |
| Instance token aggregation K=64 | .760 | Matches or slightly exceeds full-token accuracy |
| Full 196-token late interaction | .744 (MarsDINO) | Full-token reference |
| Supervised Triplet fine-tuning | .318 (CLS) | Degraded, below frozen .368 |
## Key Findings
- "Fine-tuning degradation" is the most important negative result in this paper—in the few-view regime (only 2 views per ID), brute-force learning underperforms frozen features with post-hoc processing
- Residual assignment (vs. k-means centroids) retains more local morphological detail → stronger discriminability for crater edges and textures
- Self-supervised ViT > CLIP > ImageNet pretraining → domain-specific pretraining is the key factor for retrieval performance
- Attention-based seed selection provides the largest advantage at low K (K=16: +14 mAP over random); the gap narrows at high K
- Pretraining objective matters more than parameter count: 22M ViT-S/16 DINO (.420 mAP) outperforms 86M DeiT-B/16 (.303) and 134M VGG-16 (.068)
## Highlights & Insights
- Task Redefinition: The paradigm shift from detection (outputting coordinates) to retrieval (outputting similarity-ranked matches) directly addresses the authentic needs of planetary science workflows
- "Supervised Degradation" Finding and Explanation: In the few-view regime, metric learning lacks sufficient positive diversity → fine-tuning degrades general-purpose representations → frozen features with post-hoc processing is the correct strategy for such regimes
- Generality of Training-Free Token Aggregation: Not limited to craters—applicable to any scenario requiring efficient retrieval over frozen ViT features (remote sensing change detection, scene deduplication, geo-localization)
- Methodological Contribution to GeoAI: The pipeline of late interaction + deterministic compression + two-stage search is domain-agnostic
## Limitations & Future Work
- Only 2 views per ID → more views may restore the effectiveness of supervised methods
- Currently limited to Mars CTX → generalization to the Moon and other planetary bodies remains to be verified
- Seed token selection is attention-based → alternative saliency metrics may be superior
- The optimal value of K may vary with crater size and type
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First crater retrieval benchmark + training-free token aggregation + supervised degradation finding
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 30 backbones + 3 metric learning losses + K-value ablation + two-stage parameter analysis
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logical chain from problem definition → diagnosis → solution → experiments
- Value: ⭐⭐⭐⭐ Dual contributions to planetary science and GeoAI, with a generalizable retrieval methodology