Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction¶

Conference: CVPR 2026
arXiv: 2602.18996
Code: GitHub
Area: Segmentation / Cross-View Correspondence
Keywords: cross-view correspondence, cycle consistency, conditional segmentation, test-time training, egocentric view

TL;DR¶

This paper proposes CCMP, a cross-view object correspondence framework based on conditional binary segmentation. It leverages cycle-consistency constraints as a self-supervised signal and supports test-time training (TTT), achieving state-of-the-art performance of 44.57% mIoU on Ego-Exo4D.

Background & Motivation¶

Cross-view visual correspondence — particularly between egocentric and exocentric perspectives — is a core capability for embodied intelligence. For example, a service robot must localize objects in a third-person view based on instructions from a first-person perspective. This task poses three major challenges:

Severe appearance variation: Egocentric views suffer from shake, clutter, and motion blur, while exocentric views are stable but may lack fine-grained detail.

Large spatial context discrepancy: The surrounding environment of an object differs drastically across views, making background-based matching infeasible.

Distinct temporal dynamics: Object motion and deformation vary significantly between camera perspectives.

Existing methods either rely on auxiliary modules (ObjectRelator) or require pre-generated candidate masks (O-MaMa), resulting in complex architectures with limited generalizability.

Method¶

Overall Architecture¶

An end-to-end conditional binary segmentation framework: given a source image \(I_s\), a target image \(I_t\), and a source mask \(M_s\), the model predicts the corresponding object mask \(M_t\) in the target view. The framework consists of three components: a Source Feature Extractor (ConvNeXt-based DINOv3-L), a Transformer Encoder (ViT-based DINOv3-L), and a Multi-task Decoder.

Key Designs¶

Conditioning Token (CDT) Injection: Source image features are extracted via the backbone, and a compact object representation is obtained through mask-weighted average pooling: \(z_s = \sum_{i,j} \tilde{M}_s[i,j] \cdot F_s[:,i,j]\), where \(\tilde{M}_s\) is the normalized source mask. This representation is linearly projected into a CDT token and injected into the Transformer encoder, propagating object-aware information across visual tokens of the target image via cross-token attention. Design advantage: only a single additional token is introduced, remaining fully compatible with the pretrained backbone with minimal architectural modification.
Cycle-Consistency Loss: The core self-supervised signal. The source mask \(M_s\) is used to predict the target mask \(\hat{M}_t\), which is then mapped back to the source view to reconstruct \(\hat{M}_s\), under the constraint \(\mathcal{L}_{cycle} = \mathcal{L}_{bce}(M_s, \hat{M}_s)\). Because this loss needs no ground-truth target mask, it can be evaluated at inference time, which is what makes TTT possible. To complete the round trip, the model must learn view-invariant object representations.
Test-Time Training (TTT): At inference time, the cycle-consistency loss is used to fine-tune the model on each test sample pair. Only the last \(K\) layers of the Transformer encoder are updated, for \(T\) gradient steps at a learning rate of \(5 \times 10^{-6}\). For Ego2Exo, \(K=4, T=2\); for Exo2Ego, \(K=11, T=6\). TTT lets the model adapt to the distribution shift specific to each test pair.
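The three designs above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: nested Python lists stand in for tensors, `predict` and `grad_fn` are hypothetical callables standing in for the model and its backward pass, the soft predicted mask is fed back into the return trip (the summary does not say whether it is binarized first), and the update rule is plain SGD because the optimizer is not specified here.

```python
import copy
import math

def masked_average_pool(features, mask):
    """Pool C x H x W features (nested lists) with an H x W mask.

    Implements z_s = sum_ij M~[i][j] * F[:, i, j], where M~ is the mask
    divided by its sum so the weights add up to 1.
    """
    total = sum(sum(row) for row in mask)
    if total == 0:
        raise ValueError("source mask is empty")
    height, width = len(mask), len(mask[0])
    return [
        sum(mask[i][j] * channel[i][j] for i in range(height) for j in range(width)) / total
        for channel in features
    ]

def bce(target, pred, eps=1e-7):
    """Mean binary cross-entropy between two H x W masks."""
    losses = []
    for t_row, p_row in zip(target, pred):
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

def cycle_loss(predict, img_s, img_t, mask_s):
    """L_cycle = BCE(M_s, M_s_hat); no target-view ground truth is used."""
    mask_t_hat = predict(img_s, img_t, mask_s)      # source -> target
    mask_s_hat = predict(img_t, img_s, mask_t_hat)  # target -> back to source
    return bce(mask_s, mask_s_hat)

# K, T and lr as reported above; the dictionary layout is illustrative.
TTT_CONFIG = {
    "ego2exo": {"last_k": 4, "steps": 2, "lr": 5e-6},
    "exo2ego": {"last_k": 11, "steps": 6, "lr": 5e-6},
}

def run_ttt(layers, grad_fn, direction):
    """Update only the last K layers for T plain-SGD steps (optimizer is assumed).

    layers: list of per-layer weight lists; grad_fn(layers) returns gradients
    of the cycle loss in the same layout. The input is copied, so each test
    pair starts from the same weights (also an assumption).
    """
    cfg = TTT_CONFIG[direction]
    adapted = copy.deepcopy(layers)
    first_trainable = max(0, len(adapted) - cfg["last_k"])
    for _ in range(cfg["steps"]):
        grads = grad_fn(adapted)
        for idx in range(first_trainable, len(adapted)):
            adapted[idx] = [w - cfg["lr"] * g for w, g in zip(adapted[idx], grads[idx])]
    return adapted
```

In a real pipeline `grad_fn` would backpropagate `cycle_loss` through the encoder; the point of the sketch is only that the same label-free loss drives both training and the per-pair adaptation.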

Loss & Training¶

Mask loss \(\mathcal{L}_{mask} = \mathcal{L}_{bce} + \lambda_{dice}\mathcal{L}_{dice}\), \(\lambda_{dice}=5\)
Auxiliary loss \(\mathcal{L}_{aux}\): the same mask loss applied to the second-to-last Transformer layer output (deep supervision)
Total loss \(\mathcal{L}_{total} = \mathcal{L}_{mask} + \lambda_{aux}\mathcal{L}_{aux} + \lambda_{cycle}\mathcal{L}_{cycle}\), \(\lambda_{aux}=1, \lambda_{cycle}=10\)
Two-stage training: Stage 1 freezes the DINOv3 backbone for 64K iterations; Stage 2 fine-tunes all parameters for 640K iterations
Gradient accumulation over 16 steps; trained on 8×A800 GPUs for approximately 72 hours
Data augmentation: unified bidirectional Ego2Exo/Exo2Ego training, synthesis of same-view pairs (Ego2Ego, Exo2Exo), and relaxed temporal alignment
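The loss weighting listed above can be written out as a short sketch. The weights (5, 1, 10) come from this section; the Dice formulation with a smoothing constant of 1, the flattened-list mask representation, and all function names are assumptions made for illustration.

```python
import math

LAMBDA_DICE = 5.0    # weight of the Dice term inside the mask loss
LAMBDA_AUX = 1.0     # weight of the deep-supervision (auxiliary) loss
LAMBDA_CYCLE = 10.0  # weight of the cycle-consistency loss

def bce_loss(target, pred, eps=1e-7):
    """Mean binary cross-entropy over flattened masks."""
    total = 0.0
    for t, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(target)

def dice_loss(target, pred, smooth=1.0):
    """Soft Dice loss; the smoothing constant is an assumed convention."""
    intersection = sum(t * p for t, p in zip(target, pred))
    return 1.0 - (2.0 * intersection + smooth) / (sum(target) + sum(pred) + smooth)

def mask_loss(target, pred):
    """L_mask = L_bce + lambda_dice * L_dice."""
    return bce_loss(target, pred) + LAMBDA_DICE * dice_loss(target, pred)

def total_loss(target, final_pred, aux_pred, cycle_term):
    """L_total = L_mask + lambda_aux * L_aux + lambda_cycle * L_cycle.

    aux_pred is the mask decoded from the second-to-last Transformer layer;
    cycle_term is the already-computed cycle-consistency loss value.
    """
    return (
        mask_loss(target, final_pred)
        + LAMBDA_AUX * mask_loss(target, aux_pred)
        + LAMBDA_CYCLE * cycle_term
    )
```

With \(\lambda_{cycle}=10\), a cycle loss of 0.5 alone contributes 5.0 to the total, so the round-trip term can dominate once the supervised terms become small.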

Key Experimental Results¶

Main Results¶

Dataset	Metric	Ours	Prev. SOTA	Gain
Ego-Exo4D (Exo Query)	IoU	47.18	44.08 (O-MaMa)	+3.10
Ego-Exo4D (Ego Query)	IoU	41.95	42.57 (O-MaMa)	-0.62
Ego-Exo4D	mIoU	44.57	43.32 (O-MaMa)	+1.25
Ego-Exo4D	CA (Ego)	0.669	0.590 (O-MaMa)	+13.4% (relative)
HANDAL-X (zero-shot)	IoU	78.8	42.8 (ObjectRelator)	+36.0
HANDAL-X (fine-tuned)	IoU	85.0	84.7 (ObjectRelator)	+0.3
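The Gain column mixes two conventions: the IoU rows report absolute differences in points, while the CA row matches a relative improvement. A small check, using only the numbers in the table (the helper names are illustrative), makes the distinction explicit.

```python
# (label, ours, previous SOTA) copied from the main results table.
IOU_ROWS = [
    ("Ego-Exo4D (Exo Query) IoU", 47.18, 44.08),
    ("Ego-Exo4D (Ego Query) IoU", 41.95, 42.57),
    ("Ego-Exo4D mIoU", 44.57, 43.32),
    ("HANDAL-X zero-shot IoU", 78.8, 42.8),
    ("HANDAL-X fine-tuned IoU", 85.0, 84.7),
]

def absolute_gain(ours, prev):
    """Difference in percentage points, rounded to two decimals."""
    return round(ours - prev, 2)

def relative_gain_percent(ours, prev):
    """Improvement relative to the previous value, in percent."""
    return round((ours / prev - 1.0) * 100.0, 1)

for label, ours, prev in IOU_ROWS:
    print(f"{label}: {absolute_gain(ours, prev):+.2f} points")

# The CA (Ego) row only reproduces the table entry as a relative gain.
print(f"CA (Ego): {relative_gain_percent(0.669, 0.590):+.1f}% relative")
```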

Ablation Study¶

Configuration	Ego-IoU	Exo-IoU	mIoU	Note
Full model	41.95	47.18	44.57	All components
w/o \(\mathcal{L}_{cycle}\)	40.28	45.82	43.05	Cycle consistency is critical
w/o \(\mathcal{L}_{aux}\)	40.64	43.81	42.90	Deep supervision matters
w/o TTT	41.79	44.18	42.99	TTT contributes +1.58
w/o same-view augmentation	40.88	45.50	43.19	Data diversity is important
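The contribution of each component can be read off as the drop relative to the full model. The sketch below takes the table values as reported and computes those drops; the dictionary layout and the `drop` helper are illustrative only.

```python
FULL = {"ego": 41.95, "exo": 47.18, "miou": 44.57}

# Rows copied from the ablation table as reported.
ABLATIONS = {
    "w/o L_cycle": {"ego": 40.28, "exo": 45.82, "miou": 43.05},
    "w/o L_aux": {"ego": 40.64, "exo": 43.81, "miou": 42.90},
    "w/o TTT": {"ego": 41.79, "exo": 44.18, "miou": 42.99},
    "w/o same-view augmentation": {"ego": 40.88, "exo": 45.50, "miou": 43.19},
}

def drop(name, key="miou"):
    """Points lost when the named component is removed."""
    return round(FULL[key] - ABLATIONS[name][key], 2)

for name in ABLATIONS:
    print(f"{name}: -{drop(name):.2f} mIoU")

# Per-direction view of the TTT contribution.
print(f"TTT gain, Ego query: +{drop('w/o TTT', 'ego'):.2f}")
print(f"TTT gain, Exo query: +{drop('w/o TTT', 'exo'):.2f}")
```

On these numbers, almost all of the TTT gain comes from the Exo query direction, which is also the direction given the larger adaptation budget (\(K=11, T=6\)).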

Key Findings¶

Ego Query is consistently harder than Exo Query, as exocentric target objects tend to be smaller and surrounded by more cluttered environments.
General segmentation models such as SEEM and PSALM perform poorly on this task (IoU < 10%), demonstrating the necessity of cross-view training.
Zero-shot IoU of 78.8% on HANDAL-X substantially outperforms all baselines, indicating strong cross-domain generalization after training on Ego-Exo4D.
Even when DINOv3 features are replaced with the weaker DINOv2, the proposed method still outperforms "baseline + DINOv3," confirming that performance gains primarily stem from the method design rather than stronger features.

Highlights & Insights¶

Minimal design, strong performance: A single CDT token and a cycle-consistency loss add negligible architectural overhead, yet the model reaches 44.57% mIoU on Ego-Exo4D.
Cycle consistency unifies training and inference: the same loss serves as supervision during training and enables TTT at inference time.
TTT is applied to cross-view correspondence for the first time and improves both query directions in the ablation (+1.58 mIoU overall).
The data augmentation strategy (same-view pairing, relaxed temporal alignment) is simple yet effective and warrants adoption in related settings.

Limitations & Future Work¶

Ego Query performance remains slightly below O-MaMa (41.95 vs. 42.57); segmentation of small objects in exocentric views still has room for improvement.
TTT requires additional gradient update steps at inference time, increasing latency.
The visibility prediction (CLS Head) is trained in a post-hoc manner, separate from the main model, which may limit joint optimization.
Extreme cases where the object is invisible during the cycle are not explicitly handled, though such cases are rare in Ego-Exo4D.

Related Work & Insights¶

O-MaMa: A mask-matching method that relies on FastSAM to pre-generate candidates; the proposed end-to-end approach needs no candidate-generation stage.
ObjectRelator: Fuses visual and textual cues with a complex architecture; the purely visual approach proposed here generalizes better (+36.0 IoU points zero-shot on HANDAL-X).
TTT family (Sun et al. 2020, 2024): Test-time training has been extended from classification to video and language tasks; this work is the first to apply it to cross-view correspondence.
Insight: The combination of cycle consistency and TTT can be generalized to other correspondence tasks requiring self-supervision, such as 3D point cloud registration.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of cycle consistency and TTT is novel; CDT injection is elegant and effective.
Experimental Thoroughness: ⭐⭐⭐⭐ Dual-benchmark evaluation on Ego-Exo4D and HANDAL-X with comprehensive ablations.
Writing Quality: ⭐⭐⭐⭐ Method description is clear; motivation and design rationale are coherent.
Value: ⭐⭐⭐⭐ Establishes a clean and efficient new baseline for cross-view correspondence; the TTT strategy has broad applicability.