Alligat0R: Pre-Training through Covisibility Segmentation for Relative Camera Pose Regression¶
Conference: NeurIPS 2025 arXiv: 2503.07561 Code: https://github.com/thibautloiseau/alligat0r Area: Image Segmentation Keywords: Covisibility Segmentation, Pre-Training, Relative Pose Regression, CroCo, ViT
TL;DR¶
This paper replaces CroCo's cross-view completion pre-training task with covisibility segmentation, predicting a per-pixel label (co-visible, occluded, or out-of-view) for each pixel of one view with respect to the other. The approach significantly outperforms CroCo in low-overlap scenarios and ranks first on the RUBIK benchmark with an overall success rate of 60.3%.
Background & Motivation¶
Background: CroCo pioneered cross-view completion as a pre-training task for 3D vision and has been adopted by foundational models such as DUSt3R and MASt3R.
Limitations of Prior Work:
- Cross-view completion in CroCo is ill-posed in non-co-visible regions: occluded or out-of-view pixels cannot be reconstructed from the other view.
- CroCo requires image pairs with at least 50% overlap, limiting the diversity of training data.
- The model learns ambiguous reconstructions in non-co-visible regions, wasting model capacity.
Key Challenge: Cross-view completion is meaningless in non-co-visible regions, yet low-overlap image pairs are abundant in real-world scenarios.
Goal: Design a pre-training task that provides effective supervision in both co-visible and non-co-visible regions.
Key Insight: Shift from "reconstruction" to "classification" — rather than reconstructing pixel values, predict the covisibility state of each pixel.
Core Idea: Replace cross-view completion with three-class covisibility segmentation, providing clear training signal across all regions.
Method¶
Overall Architecture¶
- Input: Two images of a scene from different viewpoints.
- Pre-training: A ViT encoder independently processes both images; a Transformer decoder fuses information via cross-attention and predicts 3-class covisibility labels per pixel.
- Fine-tuning: A pose regression head is added to predict the relative translation vector and quaternion rotation.
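To make the data flow concrete, here is a minimal PyTorch sketch of the described architecture: a shared ViT-style encoder applied symmetrically to both views, a cross-attention decoder, and a 3-class per-patch segmentation head. All module sizes, layer counts, and names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CovisibilitySegmenter(nn.Module):
    """Toy sketch of the Alligat0R pre-training architecture:
    shared encoder -> cross-attention decoder -> 3-class covisibility head."""

    def __init__(self, dim=64, heads=4, num_classes=3):
        super().__init__()
        # Shared encoder: both views pass through the same weights,
        # symmetrically and without masking
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True),
            num_layers=2)
        # Decoder fuses the two views via cross-attention
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, dim * 2, batch_first=True),
            num_layers=2)
        # FC head -> per-patch logits for co-visible / occluded / out-of-view
        self.seg_head = nn.Linear(dim, num_classes)

    def forward(self, tokens1, tokens2):
        f1, f2 = self.encoder(tokens1), self.encoder(tokens2)
        # Query each view's features against the other view's features
        d1 = self.decoder(f1, f2)
        d2 = self.decoder(f2, f1)
        return self.seg_head(d1), self.seg_head(d2)
```

In practice the tokens would come from ViT patch embeddings of each image; here they are just `(batch, patches, dim)` tensors.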
Key Designs¶
- Covisibility Segmentation Pre-Training Objective:
  - Function: Predict the covisibility status of each pixel with respect to the other view: co-visible, occluded, or out-of-view.
  - Mechanism: A ViT encoder symmetrically processes both images (without masking); the decoder performs cross-view reasoning via cross-attention; an FC layer outputs a 3-class softmax. Trained with cross-entropy loss.
  - Design Motivation: Correct predictions require the model to understand 3D structure, occlusion relationships, and field of view, while non-co-visible regions also receive an unambiguous training signal.
- Symmetric Forward Pass:
  - Function: Both images are processed by the same encoder without masking (unlike CroCo's asymmetric masking).
  - Design Motivation: More consistent with downstream tasks and computationally more efficient.
- Two-Stage Fine-Tuning Strategy:
  - Phase 1: Freeze the backbone and train only the pose regression head, using a homoscedastic loss to balance the translation and rotation terms.
  - Phase 2: Unfreeze the entire network and jointly optimize the pose loss and the covisibility segmentation loss (retaining the segmentation head).
  - Design Motivation: Retaining the segmentation head regularizes fine-tuning by preserving the pre-trained representations.
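The two-stage schedule can be sketched as follows; `backbone` and `pose_head` are placeholder modules and the learning rates are assumptions, shown only to illustrate the freeze/unfreeze mechanics.

```python
import torch
import torch.nn as nn

# Placeholders standing in for the pre-trained encoder/decoder and the
# added regression head (3-D translation + 4-D quaternion = 7 outputs)
backbone = nn.Linear(64, 64)
pose_head = nn.Linear(64, 7)

# Phase 1: freeze the backbone, train only the pose regression head
for p in backbone.parameters():
    p.requires_grad = False
opt_phase1 = torch.optim.AdamW(pose_head.parameters(), lr=1e-4)

# Phase 2: unfreeze everything and jointly optimize pose + segmentation losses
for p in backbone.parameters():
    p.requires_grad = True
opt_phase2 = torch.optim.AdamW(
    list(backbone.parameters()) + list(pose_head.parameters()), lr=1e-5)
```

A lower learning rate in phase 2 is a common choice when unfreezing a pre-trained backbone, to avoid destroying its representations.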
Loss & Training¶
- Pre-training: Cross-entropy loss summed over both views, \(L_{ce} = L_{ce1} + L_{ce2}\).
- Fine-tuning: A homoscedastic loss automatically balances the translation, rotation, and segmentation terms.
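Homoscedastic uncertainty weighting (in the style of Kendall et al.) learns one log-variance per loss term so that no manual weight tuning is needed. The exact parameterization used by the paper is not reproduced here; this is a common formulation given as an assumption:

```python
import torch
import torch.nn as nn

class HomoscedasticLoss(nn.Module):
    """Learned weighting of several loss terms:
    L = sum_i exp(-s_i) * L_i + s_i, with one learnable s_i per term."""

    def __init__(self, n_terms=3):
        super().__init__()
        # One learnable log-variance per term (translation, rotation, segmentation)
        self.log_vars = nn.Parameter(torch.zeros(n_terms))

    def forward(self, losses):
        total = torch.zeros(())
        for loss, s in zip(losses, self.log_vars):
            # exp(-s) down-weights noisy terms; +s penalizes inflating s
            total = total + torch.exp(-s) * loss + s
        return total
```

With the log-variances initialized to zero, the criterion starts as a plain sum of the terms and adapts the weights during training.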
Key Experimental Results¶
Main Results¶
Map-free Relocalization Benchmark
| Pre-training Method | \(\varepsilon_t < 0.25\)m (%) | \(\varepsilon_t < 0.5\)m (%) | \(\varepsilon_t < 5\)m (%) |
|---|---|---|---|
| CroCo v2 (official) | 75.7 | 87.4 | 91.5 |
| Alligat0R | 87.7 | 94.9 | 95.9 |
RUBIK Benchmark: Overall success rate of 60.3% ranks first, surpassing DUSt3R (54.8%) and MASt3R (53.6%).
Ablation Study¶
| Ablated Configuration | Finding |
|---|---|
| w/o covisibility head (removed during fine-tuning) | Performance degrades; retaining the segmentation head provides regularization. |
| nuScenes data only | ScanNet indoor data contributes complementary improvements. |
| Phase 1 only | Not unfreezing the backbone leads to substantially lower performance. |
Key Findings¶
- Large advantage in low-overlap scenarios: In the 20–40% overlap range, Alligat0R achieves a success rate of 61.5%.
- Speed advantage: Direct pose regression takes only 57 ms, versus 257 ms for DUSt3R.
- Zero-shot generalization: Zero-shot correspondence estimation on ETH3D surpasses CroCo v2.
Highlights & Insights¶
- Pre-training paradigm shift: Moving from "reconstruction" to "classification" elegantly resolves the ill-posed nature of CroCo in non-co-visible regions.
- Interpretability: Segmentation outputs provide intuitive visualization of the model's geometric understanding.
- Regularization via retained segmentation head: Not discarding the pre-training head during fine-tuning is a generalizable fine-tuning strategy worth wider adoption.
- Large-scale dataset Cub3: 5M image pairs with dense covisibility annotations.
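Dense covisibility labels of the kind Cub3 provides can in principle be derived from depth maps and relative pose by reprojection. The sketch below illustrates the idea for one view; the threshold, conventions, and function names are assumptions, not the authors' annotation pipeline.

```python
import numpy as np

# Label values are illustrative, matching the paper's three classes
CO_VISIBLE, OCCLUDED, OUT_OF_VIEW = 0, 1, 2

def covisibility_labels(depth1, depth2, K, R, t, eps=0.05):
    """Label each pixel of view 1 w.r.t. view 2, given depths, shared
    intrinsics K, and relative pose (R, t) mapping camera 1 to camera 2."""
    h, w = depth1.shape
    v, u = np.mgrid[0:h, 0:w]
    # Back-project view-1 pixels to 3D points in camera-1 coordinates
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    pts = rays * depth1.reshape(1, -1)
    # Transform into camera 2 and project
    pts2 = R @ pts + t[:, None]
    z2 = pts2[2]
    proj = K @ pts2
    u2 = proj[0] / np.maximum(z2, 1e-9)
    v2 = proj[1] / np.maximum(z2, 1e-9)

    labels = np.full(h * w, OUT_OF_VIEW, dtype=np.int64)
    inside = (z2 > 0) & (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    ui, vi = u2[inside].astype(int), v2[inside].astype(int)
    observed = depth2[vi, ui]          # depth actually seen by camera 2 there
    # Occluded if a nearer surface blocks the reprojected point
    occluded = z2[inside] > observed + eps
    labels[inside] = np.where(occluded, OCCLUDED, CO_VISIBLE)
    return labels.reshape(h, w)
```

For example, with identical depth maps and an identity relative pose every pixel is labeled co-visible, while a large sideways translation pushes all reprojections out of the image and yields out-of-view labels.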
Limitations & Future Work¶
- Validation is limited to pose regression; downstream tasks such as 3D reconstruction and Gaussian splatting remain untested.
- The three-class taxonomy may be too coarse, lacking fine-grained states such as partial occlusion.
- Integration with DUSt3R/MASt3R to validate improvements in 3D reconstruction is a promising direction.
Related Work & Insights¶
- vs. CroCo/CroCo v2: CroCo is ill-posed in non-co-visible regions; this work replaces it with segmentation, providing globally valid supervision.
- vs. DUSt3R/MASt3R: These CroCo-based foundation models may benefit further from Alligat0R pre-training.
- vs. Reloc3R: Both target pose regression, but Reloc3R predicts only translation direction, whereas this work predicts metric translation.
Rating¶
- Novelty: ⭐⭐⭐⭐ — A clean pre-training task design that directly addresses CroCo's core limitation.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across multiple benchmarks with ablations, visualizations, and generalization tests.
- Writing Quality: ⭐⭐⭐⭐ — Clear motivation and fair comparisons.
- Value: ⭐⭐⭐⭐ — Potentially influences the pre-training paradigm of the entire CroCo ecosystem.