
Fine-grained Spatiotemporal Grounding on Egocentric Videos

Conference: ICCV 2025 · arXiv: 2508.00518 · Code: https://github.com/LaVi-Lab/EgoMask
Area: Video Understanding
Keywords: egocentric video, spatiotemporal grounding, pixel-level segmentation, benchmark, video understanding

TL;DR

This paper presents EgoMask, the first pixel-level spatiotemporal grounding benchmark for egocentric videos, comprising short/medium/long evaluation splits and a large-scale training set EgoMask-Train. Through systematic analysis, it reveals key differences between egocentric and exocentric videos, and demonstrates that fine-tuned models can achieve substantial performance gains.

Background & Motivation

Spatiotemporal Video Grounding (STVG) aims to localize target entities in video given textual queries. Existing work has focused predominantly on exocentric videos. Although egocentric videos are increasingly important for AR and robotics applications, pixel-level spatiotemporal grounding in this domain remains largely unexplored.

Quantitative analysis reveals four key differences in how entities appear in egocentric versus exocentric videos:

  • Shorter total presence: the target is visible for only 21.56% of the video duration (vs. 77–94% in exocentric)
  • Sparser continuous trajectories: a single trajectory spans only 1.33% of the video (vs. 65–90% in exocentric), and absence duration is 6× presence duration
  • Smaller targets: mask area is only 1.20% of the frame (vs. 5%+ in exocentric)
  • Larger positional displacement: inter-frame mask IoU is only 14.96% (vs. 50%+ in exocentric)

Existing datasets (EgoTracks provides only bounding boxes; RefEgo covers only short videos) cannot support pixel-level evaluation, motivating the need for a new benchmark.

Method

Overall Architecture

The primary contribution of this work is dataset construction rather than model design. An automated annotation pipeline is designed in two stages: (1) pixel-level mask generation; and (2) referring expression generation. These stages are used to build both the evaluation benchmark EgoMask and the training dataset EgoMask-Train.

Key Designs

  1. Pixel-level Mask Generation: Leveraging bounding box annotations from EgoTracks, SAM2 is applied for video-level object segmentation. Only video segments in which the target object appears are annotated, and the bounding box in the first frame serves as the box prompt for SAM2. Post-processing retains only mask regions that overlap the provided bounding box annotations, minimizing hallucination errors (see the filtering sketch after this list).

  2. Referring Expression Generation: Two strategies are employed to ensure diversity — (1) GPT-4o is prompted directly to generate short and long descriptions (using three frames where the target is most visible, with bounding boxes overlaid); (2) GPT-4o first generates metadata (visual attributes, world knowledge, object functions, etc.), which is then combined via predefined templates into referring expressions. All annotations undergo human verification.

  3. Multi-duration Tiered Evaluation Design:

    • EgoMask-Short (<1 min, 200 videos, 400 expressions): sampled from RefEgo, with manually annotated masks and refined expressions
    • EgoMask-Medium (1–3 min, 100 videos, 200 expressions): randomly clipped from annotated long videos
    • EgoMask-Long (>3 min, 15 videos, 100 expressions): based on the EgoTracks validation set, generated via the pipeline and manually refined
    • EgoMask-Train: 2,624 videos, 9,592 objects, 47,968 expressions
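To make the post-processing in step 1 concrete, here is a minimal NumPy/SciPy sketch of an overlap filter of this kind. It assumes per-frame boolean masks have already been produced by SAM2; the connected-component filtering and the "any overlap with the annotated box" rule are assumptions about how such a filter could be implemented, not the authors' exact code.

```python
import numpy as np
from scipy import ndimage

def filter_mask_by_box(mask: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Keep only connected mask components that overlap the annotated box.

    mask: (H, W) boolean SAM2 output for one frame.
    box:  (x1, y1, x2, y2) EgoTracks bounding box for the same frame.
    """
    x1, y1, x2, y2 = box
    box_region = np.zeros_like(mask, dtype=bool)
    box_region[y1:y2, x1:x2] = True

    # Label connected components of the predicted mask.
    labels, n_components = ndimage.label(mask)
    kept = np.zeros_like(mask, dtype=bool)
    for comp_id in range(1, n_components + 1):
        component = labels == comp_id
        # Components with no overlap with the annotated box are the
        # likely hallucinations; drop them.
        if np.logical_and(component, box_region).any():
            kept |= component
    return kept
```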

Loss & Training

Fine-tuning is applied to two state-of-the-art models (a schematic optimizer sketch follows below):

  • Sa2VA-4B (+FT): fine-tuned on EgoMask-Train plus 3 exocentric video segmentation datasets; 8× A100 GPUs, ~10 hours; AdamW, lr=4e-6, batch size=16
  • VideoLISA-3.8B (+FT): fine-tuned on 80% EgoMask-Train + 20% original training data; 4× A100 GPUs, 20 epochs, 500 steps per epoch, ~12 hours; AdamW, lr=3e-5, batch size=16
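As a deliberately schematic illustration of the reported recipe, the sketch below sets up AdamW with the Sa2VA-4B settings; `model`, the loss, and the data handling are placeholders, not the authors' training code.

```python
import torch

# Placeholder module; in practice this is the Sa2VA-4B or
# VideoLISA-3.8B checkpoint prepared for fine-tuning.
model = torch.nn.Linear(16, 16)

# Reported settings: lr=4e-6 for Sa2VA-4B (3e-5 for VideoLISA-3.8B),
# batch size 16 in both recipes.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-6)

def train_step(batch: torch.Tensor, target: torch.Tensor) -> float:
    """One schematic fine-tuning step; the real loss is segmentation-based."""
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch), target)  # stand-in loss
    loss.backward()
    optimizer.step()
    return loss.item()
```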

Evaluation Metric Design

Four metrics are proposed (a sketch of one possible implementation follows below):

  • T_recall: ratio of predicted frames to ground-truth frames (temporal localization ability)
  • IoU_all: mean IoU over all frames (equivalent to the conventional \(\mathcal{J}\) metric)
  • IoU_gold: mean IoU computed only over ground-truth frames
  • IoU_gold_pred: mean IoU computed over the union of ground-truth and predicted frames (penalizes hallucinated predictions on background frames)
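A minimal NumPy sketch of one plausible reading of these metrics, assuming per-frame boolean masks and treating "the target is predicted/present on a frame" as "the mask is non-empty". A frame where both masks are empty scores IoU 1.0 here, which is exactly what lets IoU_all inflate on background-heavy long videos (see Key Findings). The paper's exact definitions may differ.

```python
import numpy as np

def frame_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of two boolean masks; both-empty counts as a perfect 1.0."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter / union)

def evaluate(preds: list[np.ndarray], gts: list[np.ndarray]) -> dict:
    """preds, gts: per-frame (H, W) boolean masks of equal length."""
    pred_on = [p.any() for p in preds]  # model predicts target presence
    gt_on = [g.any() for g in gts]      # target actually present
    ious = [frame_iou(p, g) for p, g in zip(preds, gts)]

    gold = [i for i, on in enumerate(gt_on) if on]
    gold_or_pred = [i for i in range(len(ious)) if gt_on[i] or pred_on[i]]

    return {
        # fraction of ground-truth frames the model fires on
        "T_recall": sum(pred_on[i] for i in gold) / max(len(gold), 1),
        # mean IoU over every frame; background frames score 1.0
        "IoU_all": float(np.mean(ious)),
        # mean IoU restricted to frames where the target is present
        "IoU_gold": float(np.mean([ious[i] for i in gold])) if gold else 0.0,
        # union of GT and predicted frames: hallucinated frames score 0
        "IoU_gold_pred": float(np.mean([ious[i] for i in gold_or_pred]))
        if gold_or_pred else 0.0,
    }
```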

Key Experimental Results

Main Results

Results on the EgoMask benchmark (IoU_gold_pred):

Method                 Short           Medium          Long
Grounded-SAM2          49.95           25.73           24.80
Sa2VA-26B              37.30           25.83           12.96
Sa2VA-4B               29.00           17.02            8.11
Sa2VA-4B (+FT)         30.97 (+1.97)   18.52 (+1.50)    8.24 (+0.13)
VideoLISA-3.8B         17.85            6.48            5.15
VideoLISA-3.8B (+FT)   23.36 (+5.51)    9.98 (+3.50)    7.16 (+2.01)

Comparison on exocentric benchmarks before and after fine-tuning (verifying that original capability is preserved):

Method                 Ref-DAVIS       MeViS           ReasonVOS
VideoLISA-3.8B         65.82           49.20           42.41
VideoLISA-3.8B (+FT)   65.60 (-0.22)   49.20 (+0.00)   44.18 (+1.77)
Sa2VA-4B               69.75           50.01           42.35
Sa2VA-4B (+FT)         69.97 (+0.22)   55.55 (+5.54)   45.54 (+3.19)

Ablation Study

Effect of SAM2 initialization strategy in Grounded-SAM2 (values are IoU_gold_pred):

Split    Highest-confidence init   Naive init   Gap
Short    49.95                     40.42         -9.53
Medium   25.73                     15.11        -10.62
Long     24.80                     11.65        -13.15

Effect of key frame validity in Sa2VA: when the target object does not appear in the first 5 frames, performance drops sharply to near 0%.

Key Findings

  • All state-of-the-art models perform poorly on egocentric videos: the best result on the Short split is only ~50% IoU_gold_pred, and less than 30% on Medium/Long splits
  • Fine-tuning yields substantial improvements: VideoLISA achieves an average relative gain of 41.30% while retaining performance on exocentric benchmarks
  • EgoMask-Train is complementary to exocentric data — fine-tuning even improves ReasonVOS by 1.77%
  • SAM2 initialization is critical: naive initialization costs roughly 10–13 points of IoU_gold_pred across the three splits (see the ablation above)
  • The IoU_all metric is misleading for long videos due to the high proportion of background frames (Sa2VA's IoU_all paradoxically increases as video length grows); IoU_gold_pred more faithfully reflects true performance
  • Inference speed: VideoLISA (frame-by-frame segmentation) achieves only 0.42 FPS, whereas SAM2-based methods achieve at least 3.17 FPS

Highlights & Insights

  • Systematic difference analysis: the first quantitative characterization of differences between egocentric and exocentric videos along four key dimensions, providing guidance for future model design
  • Fully automated annotation pipeline: the combination of SAM2 and GPT-4o substantially reduces annotation cost, requiring only human refinement rather than annotation from scratch
  • Multi-duration tiered design: Short/Medium/Long splits cover practical application scenarios ranging from seconds to minutes
  • Metric innovation: IoU_gold_pred penalizes hallucinated predictions and is better suited to sparse target scenarios than conventional IoU_all
  • The dataset provides genuine complementary value — fine-tuning on EgoMask-Train improves rather than degrades exocentric performance

Limitations & Future Work

  • Current model designs are not specifically optimized for the characteristics of egocentric videos
  • Sa2VA is constrained by input token limits, making key frame selection inflexible — target objects may not appear in the first 5 frames of long videos
  • Training set localization annotations are at 1 FPS (vs. the original 5 FPS in EgoTracks), potentially missing fine-grained details of fast-moving objects
  • The Long split contains only 15 videos and 100 expressions, which is relatively small in scale
  • Promising future directions include enhancing long-video understanding capability and optimizing frame selection strategies to better capture target entities
  • Compared to existing RVOS datasets (Ref-DAVIS, MeViS, ReasonVOS), EgoMask is the first to cover egocentric videos with multi-duration evaluation
  • Sa2VA outperforms Grounded-SAM2 on exocentric benchmarks, yet underperforms it on egocentric benchmarks — suggesting that end-to-end models have not fully exploited the potential of pretrained grounding models in egocentric settings
  • Frame selection strategy is a key direction for improving SAM2-based methods

Rating

  • Novelty: ⭐⭐⭐⭐ First pixel-level egocentric spatiotemporal grounding benchmark, filling an important gap
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model comparison + fine-tuning validation + detailed analysis + multi-metric evaluation
  • Writing Quality: ⭐⭐⭐⭐ Thorough analysis, detailed statistics, and rich visualizations
  • Value: ⭐⭐⭐⭐⭐ The dataset and analysis make a significant contribution to the egocentric video understanding community