Object-Shot Enhanced Grounding Network for Egocentric Video
Conference: CVPR 2025
arXiv: 2505.04270
Code: https://github.com/Yisen-Feng/OSGNet
Area: Video Understanding / Egocentric Video / Video Temporal Grounding
Keywords: Ego4D NLQ, Object-aware grounding, shot contrastive learning, Mamba
TL;DR
OSGNet targets two major shortcomings of existing approaches to egocentric Natural Language Queries (NLQ): visual features that lack fine-grained object information, and the neglect of the attention switching implied by head-mounted camera motion. To address them, it proposes a two-branch architecture: an object branch (Co-DETR + CLIP text encoder) and a shot branch (shot segmentation based on head turns, plus shot-level contrastive learning). It achieves state-of-the-art (SOTA) performance on Ego4D-NLQ, Goal-Step, and TACoS.
Background & Motivation
- Background: The Ego4D NLQ task requires localizing the temporal segments of answers in long egocentric videos based on query sentences (e.g., "How many drill bits did I remove from the drill before I moved the yellow carton?"), which is a core capability of embodied AI and intelligent assistants.
- Limitations of Prior Work:
  - Pre-trained egocentric video features (learned with clip-narration contrastive learning) mostly capture "actions" and lose the fine-grained information about "small background objects" (e.g., a measuring tape) that queries care about.
  - Third-person video localization methods (e.g., Moment-DETR, SnAG) perform poorly when transferred directly to this task, because they ignore the "high-frequency camera shot transitions" of head-mounted cameras.
- Key Challenge: The training objective of general video-text backbones (action alignment) is inconsistent with the goal of the NLQ task (fine-grained memory retrieval of background objects).
- Goal:
  - Compensate for the "loss of fine-grained object information";
  - Leverage the hidden signal "wearer's head movement = attention switching".
- Key Insight: Explicitly inject object category information into video tokens using off-the-shelf detectors combined with CLIP-based textification. Meanwhile, segment videos into shot segments based on "head-turn points" and apply shot-query contrastive learning.
- Core Idea: A two-branch framework: the main branch uses parallel cross-attention to fuse video, query, and object features, while the shot branch aligns shot-query pairs using a contrastive loss.
Method
Overall Architecture
- Feature Extraction: (a) Object Extraction: Co-DETR (pretrained on LVIS) detects objects in each frame, filters them based on query nouns and a confidence threshold \(\theta\), and encodes class names via CLIP ViT-B/32 to obtain \(\mathbf{O}_{clip} \in \mathbb{R}^{T \times N_o \times D_o}\); (b) Video backbone (EgoVLP/InternVideo) yields \(\mathbf{V}_{clip} \in \mathbb{R}^{T \times D}\); (c) Text encoder yields \(\mathbf{Q}_F \in \mathbb{R}^{L \times D}\).
- Object Encoder: Uses the object features as attention queries and the text-query features \(\mathbf{Q}_F\) as keys/values in cross-attention, yielding query-aligned object features \(\mathbf{O}_F\).
- Main Branch: Stacks multiple layers of [BiMamba video \(\rightarrow\) CA(video, query) \(\parallel\) CA(video, object) \(\rightarrow\) gating fusion] to get fused features. A multi-scale transformer is then utilized to generate a feature pyramid, followed by classification and regression heads to output temporal intervals.
- Shot Branch: Separately uses video features, conducts shot segmentation based on head movements, and performs shot-level contrastive learning.
- Inference: Selects the top-K predicted intervals, ranked by confidence score, from the main branch.
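As a rough illustration of one main-branch block (two parallel cross-attentions followed by gating fusion), here is a minimal pure-Python sketch. It is not the authors' implementation: the BiMamba step is omitted, the attention has no learned projections or multiple heads, the gate uses a single linear layer in place of the MLP, and object features are treated as one flat list of tokens. The names `cross_attention`, `gated_fusion`, `gate_w`, and `gate_b` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def residual(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def gated_fusion(v_hat, q_feat, o_feat, gate_w, gate_b):
    # Parallel branches: video attends to the query and to the objects separately.
    v_q = residual(v_hat, cross_attention(v_hat, q_feat, q_feat))
    v_o = residual(v_hat, cross_attention(v_hat, o_feat, o_feat))
    fused = []
    for rq, ro in zip(v_q, v_o):
        concat = rq + ro  # feature-wise concatenation of the two branches
        # Assumption: one linear layer + sigmoid stands in for the gating MLP.
        gate = [1.0 / (1.0 + math.exp(-(sum(w * c for w, c in zip(row, concat)) + b)))
                for row, b in zip(gate_w, gate_b)]
        fused.append([g * a + (1.0 - g) * b for g, a, b in zip(gate, rq, ro)])
    return v_q, v_o, fused

# Toy inputs: T=3 video tokens, 2 query tokens, 2 object tokens, D=2.
v_hat = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q_feat = [[1.0, 0.0], [0.0, 1.0]]
o_feat = [[0.2, 0.8], [0.9, 0.1]]
gate_w = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]  # D x 2D
gate_b = [0.0, 0.0]
v_q, v_o, fused = gated_fusion(v_hat, q_feat, o_feat, gate_w, gate_b)
```

Because the gate lies in (0, 1), every fused value is a convex combination of the query-aware and object-aware values, so neither modality can be dropped entirely by the fusion step.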
Key Designs
- Object Injection and Parallel Cross-Attention
  - Function: Introduces the fine-grained objects of interest from the query as an additional modality to compensate for the information missing from the video backbone.
  - Mechanism: Applies two separate cross-attention operations, from the visual features to the query and from the visual features to the objects: \(\mathbf{V}_Q^{(i)} = \hat{\mathbf{V}}^{(i)} + \mathrm{CA}(\hat{\mathbf{V}}^{(i)}, \mathbf{Q}_F, \mathbf{Q}_F)\) and \(\mathbf{V}_O^{(i)} = \hat{\mathbf{V}}^{(i)} + \mathrm{CA}(\hat{\mathbf{V}}^{(i)}, \mathbf{O}_F, \mathbf{O}_F)\), where \(\hat{\mathbf{V}}^{(i)}\) denotes the video features after the BiMamba layer of block \(i\). The two results are then fused by a gate \(\mathbf{A}^{(i)} = \sigma(\mathrm{MLP}(\mathbf{V}_Q^{(i)} \,\|\, \mathbf{V}_O^{(i)}))\): \(\mathbf{V}^{(i+1)} = \mathbf{A}^{(i)}\cdot\mathbf{V}_Q^{(i)} + (1-\mathbf{A}^{(i)})\cdot\mathbf{V}_O^{(i)}\).
  - Design Motivation: Query keywords ("drill bit", "yellow carton") are not necessarily captured in the video features. Exposing objects explicitly via an object detector and attending to them in a parallel cross-attention prevents query information from overshadowing object details.
- BiMamba Long Video Modeling
  - Function: Replaces traditional self-attention to process long sequences (egocentric videos can span up to several hours).
  - Mechanism: Employs bidirectional Mamba within the fusion block, achieving linear complexity and saving memory compared to transformers.
  - Design Motivation: The length of videos in NLQ tasks far exceeds that of typical moment retrieval, making self-attention prohibitively expensive in memory.
- Shot Branch: Head-Turn Segmentation + Contrastive Learning
  - Function: Leverages the unique egocentric signal "head-mounted camera movement corresponds to wearer attention shift" to automatically segment the video into semantically independent shots.
  - Mechanism: Estimates the head rotation amplitude from camera trajectory/motion (optical flow or camera-pose variations) and segments shots based on peak values. A representation is extracted for each shot and aligned with the query text feature using a contrastive loss, mapping mutually relevant shots and queries closer in the embedding space.
  - Design Motivation: NLQ tasks often contain temporal structures like "do action A first, then B, and finally return to C". Head turns correspond to transitions in user attention or task boundaries, which serve as natural, weakly-supervised segmentation cues.
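The peak-based shot segmentation described for the shot branch can be sketched roughly as follows. This is only an illustration under stated assumptions: the per-frame rotation magnitudes are taken as given (the summary does not specify how they are computed), and the `threshold` and `min_gap` parameters, as well as the function names, are hypothetical.

```python
def find_head_turns(rotation, threshold, min_gap):
    """Return frame indices that are local peaks of rotation magnitude.

    A frame counts as a head turn if it is a local maximum, its magnitude is
    at least `threshold`, and it is at least `min_gap` frames after the
    previously accepted turn (all three rules are illustrative assumptions).
    """
    peaks = []
    for t in range(1, len(rotation) - 1):
        is_peak = rotation[t] >= rotation[t - 1] and rotation[t] > rotation[t + 1]
        if is_peak and rotation[t] >= threshold:
            if not peaks or t - peaks[-1] >= min_gap:
                peaks.append(t)
    return peaks

def split_into_shots(num_frames, cut_points):
    """Turn cut points into half-open [start, end) shots covering all frames."""
    bounds = [0] + list(cut_points) + [num_frames]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]

# Toy rotation magnitudes for 10 frames with two clear head turns.
rotation = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.8, 0.85, 0.2, 0.1]
turns = find_head_turns(rotation, threshold=0.5, min_gap=3)
shots = split_into_shots(len(rotation), turns)
print(turns)  # [2, 7]
print(shots)  # [(0, 2), (2, 7), (7, 10)]
```

Each resulting shot would then be pooled into a single representation before being contrasted with the query feature.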
Loss & Training
- Localization loss: \(\mathcal{L}_{ML} = \mathcal{L}_{cls} + \mathcal{L}_{reg}\). Focal loss is used for classification, while Distance-IoU loss (only for positive samples) is used for regression.
- Contrastive loss \(\mathcal{L}_{shot}\) (InfoNCE) for the shot branch.
- Total loss: \(\mathcal{L} = \mathcal{L}_{ML} + \lambda \mathcal{L}_{shot}\).
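To make the loss terms above concrete, here is a small sketch of a one-dimensional Distance-IoU loss, an InfoNCE loss over shot embeddings, and their weighted combination. It rests on several assumptions not stated in the summary: cosine similarity with a temperature of 0.07, exactly one positive shot per query, and arbitrary example values for \(\lambda\) and the individual loss terms. The focal classification loss is left out, and all function names are hypothetical.

```python
import math

def diou_loss_1d(pred, gt):
    """Distance-IoU loss adapted to 1D intervals (start, end) of positive length."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    hull = max(pe, ge) - min(ps, gs)           # smallest enclosing interval
    center_gap = ((ps + pe) - (gs + ge)) / 2   # distance between interval centers
    return 1.0 - inter / union + (center_gap / hull) ** 2

def info_nce(query, shots, positive_idx, temperature=0.07):
    """InfoNCE: the query should score highest with shots[positive_idx]."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    logits = [cosine(query, s) / temperature for s in shots]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_idx]

def total_loss(loss_cls, loss_reg, loss_shot, lam):
    """L = L_cls + L_reg + lambda * L_shot."""
    return loss_cls + loss_reg + lam * loss_shot

query = [1.0, 0.0]
shot_embeddings = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
loss_match = info_nce(query, shot_embeddings, positive_idx=0)     # aligned shot
loss_mismatch = info_nce(query, shot_embeddings, positive_idx=2)  # opposite shot
total = total_loss(loss_cls=0.5, loss_reg=0.25, loss_shot=2.0, lam=0.5)
```

The contrastive term is small when the query already sits closest to its matching shot and large when the labelled positive is far away, which is the pressure that pulls relevant shot-query pairs together.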
Key Experimental Results
Main Results
Ego4D-NLQ v1 Test (R@1, IoU=0.5):
| Method | Feature | R@1, IoU=0.5 |
|---|---|---|
| InternVideo | E+I | 10.06 |
| CONE | E | 7.84 |
| SnAG | E | 10.29 |
| RGNet | E | 10.61 |
| RGNet† (NaQ pretrain) | E | 11.69 |
| OSGNet | E | 10.71 |
| OSGNet† | E | 15.46 |
OSGNet† improves upon RGNet† by +3.77 points in R@1 at IoU=0.5 (15.46 vs. 11.69) and by +6.74 points on a second recall metric.
- Ego4D-Goal-Step: +3.65 recall points over BayesianVSLNet.
- TACoS (third-person comparison): +3.32 recall points over SnAG, suggesting the method is also effective on generic videos.
- vs GroundVQA on NLQ: +2.15 recall points.
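For readers unfamiliar with the metric, the sketch below shows how recall at K for a given temporal IoU threshold (as in "R@1, IoU=0.5") is typically computed, and re-derives the +3.77 gap from the table values. The function names and the toy predictions are illustrative assumptions, not the benchmark's official evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_predictions, ground_truths, k, iou_threshold):
    """Percentage of queries with at least one top-k prediction at IoU >= threshold."""
    hits = 0
    for preds, gt in zip(ranked_predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)

# Two toy queries, each with predictions sorted by confidence (seconds).
ranked_predictions = [
    [(10, 20), (0, 5)],     # top-1 overlaps the answer well
    [(25, 45), (42, 48)],   # only the second prediction is good enough
]
ground_truths = [(12, 20), (40, 50)]

r1 = recall_at_k(ranked_predictions, ground_truths, k=1, iou_threshold=0.5)
r2 = recall_at_k(ranked_predictions, ground_truths, k=2, iou_threshold=0.5)
print(r1, r2)  # 50.0 100.0

# Gap between OSGNet† and RGNet† in the table above.
gain = round(15.46 - 11.69, 2)
print(gain)  # 3.77
```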
Ablation Study
| Configuration | R@1, IoU=0.5 (Ego4D-NLQ) |
|---|---|
| Baseline (no object, no shot) | ~13 |
| + Object branch | ~14 |
| + Shot branch | ~14.3 |
| Full (object + shot) | 15.46 |
| Replacing BiMamba with self-attn | Out of memory (OOM) or performance degradation |
Key Findings
- Object information brings the largest improvement (about +5 R@1) for "background object" queries (e.g., "where is the screwdriver").
- The shot branch significantly benefits long videos (>5min) but shows almost no gain for short videos.
- The object filtering threshold \(\theta\) should be chosen carefully: too high results in missing small objects, while too low introduces noise, indicating a clear sweet spot.
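The trade-off around \(\theta\) can be seen in a toy version of the object filter. This sketch assumes detections arrive as (class name, confidence) pairs and that query nouns are matched to class names by exact, case-insensitive string equality; the function name, the example detections, and the scores are all made up for illustration.

```python
def filter_objects(detections, query_nouns, theta):
    """Keep detections whose class name is a query noun and whose score is >= theta."""
    nouns = {n.lower() for n in query_nouns}
    return [(label, score) for label, score in detections
            if label.lower() in nouns and score >= theta]

# Hypothetical detector output for one frame.
detections = [
    ("drill", 0.92),
    ("carton", 0.35),
    ("tape measure", 0.12),  # small object, low confidence
    ("chair", 0.88),         # confident, but not mentioned in the query
]
query_nouns = ["drill", "carton", "tape measure"]

strict = filter_objects(detections, query_nouns, theta=0.5)
moderate = filter_objects(detections, query_nouns, theta=0.3)
loose = filter_objects(detections, query_nouns, theta=0.1)
print(strict)    # [('drill', 0.92)]
print(moderate)  # [('drill', 0.92), ('carton', 0.35)]
print(loose)     # [('drill', 0.92), ('carton', 0.35), ('tape measure', 0.12)]
```

A strict threshold drops the small, low-confidence object, while a loose one admits every weak detection of a query noun, including likely false positives in real data.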
Highlights & Insights
- First explicit modeling of "head-mounted camera motion = attention signal": This is a unique, previously overlooked supervision signal in egocentric videos, which can potentially be extended to egocentric action recognition and action anticipation.
- Lightweight object-as-text injection: Using the detector to produce class labels instead of extracting heavy region features, and then encoding the class names with CLIP, avoids introducing a large number of extra parameters. This technique of "textifying detection results" can be readily adopted in other video QA tasks.
- Parallel CA + gating is more stable than sequential CA because it avoids modality order bias, making it a reusable multimodal fusion design.
- Replacing transformers with BiMamba validates the utility of state space models (SSMs) in long video modeling.
Limitations & Future Work
- The object branch heavily relies on the detection capability of Co-DETR on the LVIS vocabulary, which may still miss long-tail objects.
- Shot segmentation depends on low-level optical flow/motion estimation, which is prone to false cuts under dramatic lighting changes or unstable camera shake.
- Textifying objects discards geometric information (position/shape/size), which might be insufficient for queries involving spatial relationships; future research could incorporate bounding box coordinates as auxiliary tokens.
- Multi-modality expansion (e.g., audio) has not been explored; egocentric videos often contain verbal instructions that can serve as additional cues.
- Future work: Feeding shot segments as "chapters" into an LLM for retrieval-based reading could further boost performance in long videos.
Related Work & Insights
- vs SnAG (CVPR 24): Single-branch without object information; OSGNet adds both the object and shot branches.
- vs NaQ (CVPR 23): A pure data-augmentation approach; OSGNet is orthogonal to and can be combined with NaQ pretraining.
- vs GroundVQA: GroundVQA adopts LLMs for video grounding, whereas this work follows a specialized model approach, making it more computationally efficient.
- Inspiration: Tasks struggling with features missing critical fine-grained details can benefit from injecting text tokens generated by specialized expert models.
Rating
- Novelty: ⭐⭐⭐ The combination of object injection and shot contrastive learning fits the requirements of the NLQ task beautifully, though neither technique is entirely new on its own.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on three major datasets with detailed ablation studies and benchmarking against NaQ pretraining.
- Writing Quality: ⭐⭐⭐⭐ The motivation behind the dual-branch architecture is clearly articulated.
- Value: ⭐⭐⭐⭐ Achieves noticeable gains on egocentric NLQ, presenting a solid SOTA baseline; the shot segmentation idea holds transfer value for general long video processing.