O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views¶
Conference: ICCV 2025 | arXiv: 2506.06026 | Code: Maria-SanVil/O-MaMa | Area: Image Segmentation | Keywords: Cross-View Segmentation, Mask Matching, Ego-Exo Correspondences, Contrastive Learning, DINOv2
TL;DR¶
This work reframes cross-view (ego-exo) object segmentation as a mask matching problem. It leverages FastSAM to generate candidate masks, DINOv2 to extract semantic features, and contrastive learning to match objects across views, achieving state-of-the-art performance on the Ego-Exo4D benchmark with only 1% of the trainable parameters used by prior methods.
Background & Motivation¶
Multi-agent collaboration scenarios — including multi-robot manipulation, AR assistants, and human-robot collaboration — require establishing object correspondences between egocentric (first-person) and exocentric (third-person) views. While single-image segmentation is well-studied, cross-view segmentation presents unique challenges:
Severe viewpoint changes: The egocentric view captures fine-grained hand-object interactions but suffers from high dynamics and motion blur, while the exocentric view covers the entire scene but exhibits large object scale variation.
Occlusion and domain shift: Objects may be partially occluded in one of the views, and differences in camera optics and imaging conditions introduce significant domain gaps.
Failure of traditional geometric matching: Even RoMa, a state-of-the-art dense feature matcher, achieves only a 67.6% success rate in ego-exo settings.
Core insight: Rather than training a model to perform cross-view segmentation from scratch at the pixel level, one can leverage the zero-shot segmentation capability of SAM-based models to generate high-quality candidate masks, and then reduce the problem to determining which candidate mask corresponds to the target object — a matching problem.
Method¶
Overall Architecture¶
The O-MaMa pipeline proceeds as follows:

1. Generate \(N\) candidate masks \(\{\mathcal{M}_n\}_{n=1}^N\) in the target view using FastSAM.
2. Extract a descriptor for each candidate mask via the Mask-Context Encoder.
3. Fuse global cross-view information via Ego↔Exo Cross Attention.
4. Learn view-invariant features using the Mask Matching Contrastive Loss.
5. At inference, select the candidate mask whose embedding is most similar to the source-mask embedding.
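As an illustration of the final matching step, the sketch below scores every candidate descriptor against the source descriptor by cosine similarity and returns the index of the best match. It is a minimal sketch with assumed tensor shapes and variable names, not the released implementation.

```python
import torch
import torch.nn.functional as F

def select_best_candidate(source_desc: torch.Tensor,
                          candidate_descs: torch.Tensor) -> int:
    """Pick the candidate mask whose embedding best matches the source mask.

    source_desc:     (D,)   projected descriptor of the query (source) mask
    candidate_descs: (N, D) projected descriptors of the N FastSAM candidates
    Returns the index of the highest-similarity candidate.
    """
    sims = F.cosine_similarity(candidate_descs,
                               source_desc.unsqueeze(0), dim=-1)  # (N,)
    return int(sims.argmax().item())
```

The selected index picks one of the FastSAM candidates \(\mathcal{M}_n\) as the predicted cross-view correspondence.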
Key Designs¶
- Mask-Context Encoder (see the descriptor sketch after this list):
  - Dense feature maps \(\psi(I)\) are extracted with DINOv2 ViT-B/14 and upsampled by 4× to preserve fine-grained spatial detail.
  - Object descriptor \(\mathbf{o}_n\): average pooling of the DINOv2 features over the mask region.
  - Context descriptor \(\mathbf{c}_n\): average pooling over an expanded bounding-box region, incorporating surrounding context to aid cross-view localization.
  - Design motivation: DINOv2's self-supervised features exhibit strong semantic understanding and object decomposition capability. Experiments confirm that Avg-Pool(mask) outperforms Avg-Pool(bbox), Max-Pool(bbox), centroid-based features, and CLIP features.
- Hard Negative Adjacent Mining (see the adjacency sketch after this list):
  - Problem: neighboring objects share similar context but differ in identity; naive global negative sampling is insufficient for learning discriminative representations.
  - A Delaunay triangulation is applied to construct an adjacency graph over the mask segments.
  - First- and second-order neighbors of each object are collected: \(\mathcal{O}_n^- = \mathcal{N}(\mathbf{o}_n) \cup \mathcal{N}^2(\mathbf{o}_n)\).
  - Hard negatives for contrastive learning are sampled from this neighborhood set.
  - Ablations show this strategy yields gains of +4.2 IoU (Ego2Exo) and +1.2 IoU (Exo2Ego).
- Ego↔Exo Cross Attention (see the attention sketch after this list):
  - Candidate mask descriptors \(\mathbf{o}_n\) serve as queries, while the full DINOv2 feature map \(\psi(I^S)\) of the source image serves as keys and values.
  - Standard cross-attention is computed as \(\hat{\mathbf{o}}_n = \text{Softmax}\!\left(\frac{\mathbf{o}_n W_Q \,(\psi(I^S) W_K)^\top}{\sqrt{d}}\right) \psi(I^S) W_V\).
  - Learnable positional encodings and LayerNorm are incorporated.
  - A cross-view embedding \(\hat{\mathbf{o}}_S\) is computed analogously for the source mask, attending to the target-view feature map.
  - Design motivation: context embeddings encode only local information and lack global cross-view semantic associations, which the cross attention supplies.
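The three sketches below illustrate, under assumed tensor shapes and names, how each design could be implemented; they are minimal reconstructions from the description above, not the authors' code. First, the Mask-Context Encoder: average-pool a dense feature map over the mask to obtain \(\mathbf{o}_n\) and over an expanded bounding box to obtain \(\mathbf{c}_n\) (the `expand_ratio` value is an assumption).

```python
import torch

def mask_context_descriptors(feat: torch.Tensor, mask: torch.Tensor,
                             expand_ratio: float = 1.5):
    """Compute object and context descriptors for one candidate mask.

    feat: (D, H, W) dense features (e.g. upsampled DINOv2 patch features)
    mask: (H, W) boolean candidate mask from FastSAM (assumed non-empty)
    Returns (o_n, c_n), each of shape (D,).
    """
    D, H, W = feat.shape

    # Object descriptor: average-pool the features inside the mask.
    o_n = feat[:, mask].mean(dim=1)

    # Context descriptor: average-pool over an expanded bounding box.
    ys, xs = torch.nonzero(mask, as_tuple=True)
    y0, y1 = ys.min().item(), ys.max().item()
    x0, x1 = xs.min().item(), xs.max().item()
    dy = int((y1 - y0 + 1) * (expand_ratio - 1) / 2)
    dx = int((x1 - x0 + 1) * (expand_ratio - 1) / 2)
    y0, y1 = max(0, y0 - dy), min(H - 1, y1 + dy)
    x0, x1 = max(0, x0 - dx), min(W - 1, x1 + dx)
    c_n = feat[:, y0:y1 + 1, x0:x1 + 1].flatten(1).mean(dim=1)

    return o_n, c_n
```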
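Next, hard negative mining via Delaunay adjacency: triangulating the mask centroids gives first-order neighbors, and paths of length two give second-order neighbors. The use of SciPy and of centroids as triangulation points is an assumption consistent with the description, not necessarily the paper's exact graph construction.

```python
import numpy as np
from scipy.spatial import Delaunay

def adjacent_hard_negatives(centroids: np.ndarray):
    """Map each mask index to its 1st- and 2nd-order Delaunay neighbors.

    centroids: (N, 2) array of (x, y) mask centroids for one image
               (assumes N >= 3 non-collinear points).
    Returns a dict {i: set of neighbor indices to use as hard negatives}.
    """
    n = len(centroids)
    adj = np.zeros((n, n), dtype=bool)
    tri = Delaunay(centroids)
    for simplex in tri.simplices:        # each simplex is a triangle (i, j, k)
        for i in simplex:
            for j in simplex:
                if i != j:
                    adj[i, j] = True

    adj_int = adj.astype(int)
    adj2 = (adj_int @ adj_int) > 0       # paths of length 2 -> 2nd-order neighbors
    return {i: set(np.where(adj[i] | adj2[i])[0].tolist()) - {i}
            for i in range(n)}
```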
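Finally, a single-layer version of the Ego↔Exo cross attention in which candidate descriptors query the flattened source-view feature map. The residual connection and the omission of positional encodings are simplifications of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class EgoExoCrossAttention(nn.Module):
    """Candidate mask descriptors attend to the source-view feature map."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_desc: torch.Tensor, src_feat: torch.Tensor):
        """obj_desc: (B, N, D)  candidate descriptors o_n (queries)
           src_feat: (B, HW, D) flattened source-view DINOv2 features (keys/values)
           Returns cross-view embeddings of shape (B, N, D)."""
        out, _ = self.attn(query=obj_desc, key=src_feat, value=src_feat)
        return self.norm(obj_desc + out)  # residual + LayerNorm
```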
Loss & Training¶
- Mask Matching Contrastive Loss (a sketch follows this list): based on InfoNCE, with hard negatives sampled from the adjacent neighbors within a batch \(\mathcal{B}\):
\[
\mathcal{L}_M(\rho^+, \rho_S) = -\log \frac{\exp(\text{sim}(f_\theta(\rho^+), f_\theta(\rho_S))/\tau)}{\sum_{n=1}^{|\mathcal{B}|} \exp(\text{sim}(f_\theta(\rho_n), f_\theta(\rho_S))/\tau)}
\]
- The final descriptor \(\rho_n = [\hat{\mathbf{o}}_n; \mathbf{c}_n; \mathbf{o}_n]\) (cross-view embedding + context + object) is projected into a shared latent space via a shallow MLP \(f_\theta\).
- Optimizer: AdamW, lr=\(8 \times 10^{-5}\), cosine annealing, batch size of 24 image pairs, 32 candidate masks sampled per target image.
- Hardware: 2× NVIDIA RTX 4090.
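The loss can be read as a standard InfoNCE over the projected descriptors \(\rho\): the source descriptor should be most similar to its ground-truth candidate among all candidates in the (hard-negative-enriched) batch. A minimal sketch, assuming the concatenation \([\hat{\mathbf{o}}_n; \mathbf{c}_n; \mathbf{o}_n]\) and the MLP projection \(f_\theta\) have already been applied; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def mask_matching_loss(rho_source: torch.Tensor,
                       rho_candidates: torch.Tensor,
                       pos_index: int,
                       tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style mask matching loss for one source mask.

    rho_source:     (D,)   projected descriptor of the source mask
    rho_candidates: (N, D) projected descriptors of the batch candidates,
                    including the positive and the adjacent hard negatives
    pos_index:      row index of the ground-truth (positive) candidate
    """
    sims = F.cosine_similarity(rho_candidates,
                               rho_source.unsqueeze(0), dim=-1) / tau  # (N,)
    log_probs = F.log_softmax(sims, dim=0)
    return -log_probs[pos_index]
```

At training time the candidate set would combine in-batch masks with the Delaunay-mined hard negatives; that selection is left to the caller here.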
Key Experimental Results¶
Main Results (Tables)¶
Ego-Exo4D Correspondences v2 Test Split
| Method | Ego2Exo IoU ↑ | Exo2Ego IoU ↑ | Total IoU ↑ | Trainable Params (M) |
|---|---|---|---|---|
| XMem + XSegTx | 34.9 | 25.0 | 30.0 | 67.1 |
| PSALM (zero-shot) | 7.4 | 2.1 | 4.8 | 0 |
| k-NN baseline | 31.9 | 30.9 | 31.4 | 0 |
| O-MaMa | 42.6 | 44.1 | 43.4 | 11.6 |
Ego-Exo4D Correspondences v1 Val Split
| Method | Ego2Exo IoU ↑ | Exo2Ego IoU ↑ | Total IoU ↑ | Trainable Params (M) |
|---|---|---|---|---|
| PSALM (fine-tuned) | 41.3 | 44.1 | 42.7 | 1587.1 |
| ObjectRelator | 44.3 | 50.9 | 47.6 | 1587.3 |
| O-MaMa | 50.1 | 54.2 | 52.1 | 11.6 |
O-MaMa surpasses ObjectRelator (the previous SOTA) by a relative +13.1% IoU on Ego2Exo and +6.5% on Exo2Ego while training only about 1% as many parameters (11.6M vs. 1587.3M).
Ablation Study (Tables)¶
Per-Module Ablation (10% Validation Set)
| Config | \(\mathcal{L}_M\) | Context | Adj.Neg | CrossAttn | Ego2Exo IoU | Exo2Ego IoU | Total IoU |
|---|---|---|---|---|---|---|---|
| Baseline | ✗ | ✗ | ✗ | ✗ | 35.2 | 34.9 | 35.1 |
| A | ✓ | ✗ | ✗ | ✗ | 42.2 | 44.7 | 43.5 |
| C | ✓ | ✓ | ✓ | ✗ | 46.9 | 45.6 | 46.3 |
| E (full) | ✓ | ✓ | ✓ | ✓ | 48.3 | 49.6 | 49.0 |
Relative IoU gains over the baseline: Ego2Exo +37.2% (35.2 → 48.3), Exo2Ego +42.1% (34.9 → 49.6).
Mask Descriptor Comparison
| Descriptor | k-NN Ego2Exo | k-NN Exo2Ego | Learned Ego2Exo | Learned Exo2Ego |
|---|---|---|---|---|
| Avg-Pool(Mask)-DINOv2 | 35.2 | 34.9 | 42.2 | 44.7 |
| Avg-Pool(BBox)-DINOv2 | 21.8 | 21.2 | 27.8 | 44.1 |
| Avg-Pool(BBox)-CLIP | 24.5 | 23.9 | 27.5 | 40.4 |
| Centroid-DINOv2 | 25.6 | 24.1 | - | - |
DINOv2 mask-pooled features substantially outperform CLIP and alternative pooling strategies.
Key Findings¶
- Problem reformulation is the primary contribution: Recasting cross-view segmentation as mask matching allows a zero-shot k-NN baseline (40.5 IoU) to already surpass many trained models.
- Geometric constraints offer limited benefit: RoMa achieves only a 67.6% success rate, and adding geometric matching improves the baseline only marginally compared with the contrastive loss (35.2 → 35.4 vs. 35.2 → 42.2 IoU).
- DINOv2 > CLIP: DINOv2's fine-grained semantic features outperform CLIP's coarser representations for this task.
- Small objects remain challenging: O-MaMa performs well on medium and large objects, but mask descriptors for very small objects carry insufficient discriminative information.
- Inference speed: approximately 250 ms per frame on average, of which FastSAM mask generation accounts for about 70 ms.
Highlights & Insights¶
- The power of problem reformulation: Recasting the difficult pixel-level cross-view segmentation task as a mask-level retrieval/matching problem substantially reduces task difficulty, enabling a lightweight model to achieve state-of-the-art performance.
- DINOv2's object decomposition capability: Self-supervised DINOv2 pretraining provides remarkably strong object-level semantic representations — even in a zero-shot setting, they surpass models trained specifically for this task.
- Delaunay triangulation for hard negative mining: Spatial proximity is elegantly exploited to enhance the discriminability of contrastive learning, proving more effective than random negative sampling.
- Exceptional parameter efficiency: 11.6M trainable parameters vs. ObjectRelator's 1587.3M, demonstrating that foundation model features are already of sufficient quality and require only minimal task-specific adaptation.
Limitations & Future Work¶
- FastSAM may produce incomplete segmentations covering only part of an object, leading to correct matches with suboptimal IoU.
- Very small objects represent a primary bottleneck, as their mask descriptors contain insufficient information.
- Temporal information from video is not exploited (frames are processed independently); incorporating temporal continuity could further improve performance.
- The method depends on the quality of FastSAM proposals — if the target object is not covered by any candidate mask, matching is impossible.
- Stronger segmentation models such as SAM2 have not been explored as candidate generators.
Related Work & Insights¶
- Ego-Exo4D: Provides a large-scale synchronized ego-exo video dataset and the associated Correspondences benchmark.
- ObjectRelator: Fine-tunes PSALM (LLM-based) for cross-view segmentation at very large parameter cost.
- FastSAM / SAM: Provide high-quality zero-shot segmentation, forming the foundation of this approach.
- DINOv2: A self-supervised visual foundation model that supplies object-level semantic representations.
- Insight: When foundation models already supply the core capabilities (segmentation, feature extraction), lightweight task-specific adaptation, such as contrastive learning combined with matching, may be the more effective paradigm.
Rating¶
- Novelty: ⭐⭐⭐⭐ Reframing cross-view segmentation as mask matching is a concise and effective contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation on two dataset splits, comparison against multiple baselines, comprehensive ablations (modules, descriptors, geometric constraints), and task-level analysis.
- Writing Quality: ⭐⭐⭐⭐ The method is intuitive, the architecture diagrams are clear, and the experimental analysis is thorough.
- Value: ⭐⭐⭐⭐⭐ Achieving state-of-the-art with 1% of prior parameters; the problem reformulation paradigm offers broad inspiration for cross-view understanding tasks.