VideoGEM: Training-Free Action Grounding in Videos
Conference: CVPR 2025
arXiv: 2503.20348
Code: https://github.com/felixVogel02/VideoGEM
Area: Video Understanding
Keywords: Action Grounding, Vision-Language Models, Training-Free, Attention Mechanism, Prompt Decomposition
TL;DR
VideoGEM is presented as the first training-free method for spatial action grounding, built on pre-trained image- and video-language models. Using layer weighting and prompt decomposition, it outperforms existing training-based methods on average across four action grounding datasets.
Background & Motivation
Background: Vision-language foundation models (e.g., CLIP) have demonstrated strong zero-shot grounding capabilities, but mainly for objects in images. Extending these capabilities to action and event grounding in videos is difficult because actions lack distinct physical boundaries and are typically described by higher-level semantic concepts.
Limitations of Prior Work: Current spatial video grounding methods (e.g., CoMMA, WWW-CLIP) still require specialized training—either fine-tuning with grounding losses or training on large-scale video-text pairs. On the other hand, while training-free methods like GEM perform well on object grounding in images, action grounding requires the model to capture contextual information that extends beyond object boundaries.
Key Challenge: Vision-language models exhibit a strong object bias; when prompted with verb-object combinations, the model tends to locate the object rather than the action itself. Furthermore, high-level semantic concepts like actions typically emerge only in the higher layers of the model, whereas GEM assigns equal weights to all layers.
Goal: To design a training-free method that enables vision-language models to perform spatial action grounding in videos without altering any pre-trained weights.
Key Insight: It is observed that high-level semantic concepts (such as actions) primarily emerge in the higher layers of ViTs, indicating that these layers should be assigned higher weights. At the same time, action descriptions naturally consist of two independent components—verbs and objects—which should be processed separately.
Core Idea: To extend the self-self attention of GEM to video inputs, and to capture high-level semantic action concepts through static+dynamic layer weighting and prompt decomposition.
Method
Overall Architecture
The input to VideoGEM is a video and its corresponding action description text. The entire pipeline consists of three core components: (1) extending GEM's self-self attention to video frame processing, (2) weighting GEM layers to prioritize high-level semantics, and (3) decomposing the action prompt into three sub-prompts (verb, object, and action), calculating heatmaps for each, and combining them via weighted fusion to obtain the final grounding.
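The first and last steps of this pipeline (joint attention over the tokens of all frames, then a text-similarity heatmap) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the single shared projection, the `temperature` parameter, the min-max normalization, and the random inputs are all assumptions made for the example, and GEM's actual self-self attention module is not reproduced here.

```python
import numpy as np

def self_self_attention(tokens, proj, values, temperature=1.0):
    """Simplified self-self attention (assumed form, for illustration).

    Instead of comparing queries with keys, one projection of the tokens
    is compared with itself. tokens/values: (T*N, C), proj: (C, C).
    """
    p = tokens @ proj
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    logits = (p @ p.T) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # row-wise softmax
    return attn @ values

def similarity_heatmaps(patch_tokens, text_embedding, T, H, W):
    """Cosine similarity of every patch token with one text embedding.

    Returns heatmaps of shape (T, H, W); min-max scaling to [0, 1] is an
    assumption of this sketch.
    """
    tokens = patch_tokens / np.linalg.norm(patch_tokens, axis=-1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    sim = tokens @ text
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
    return sim.reshape(T, H, W)

# Hypothetical sizes and random features stand in for a real backbone.
rng = np.random.default_rng(0)
T, H, W, C = 4, 7, 7, 16
frames = rng.normal(size=(T, H * W, C))   # T frames, N = H*W patches each
tokens = frames.reshape(T * H * W, C)     # flatten to T*N tokens across frames
proj = rng.normal(size=(C, C))
refined = self_self_attention(tokens, proj, values=tokens)
text = rng.normal(size=C)
maps = similarity_heatmaps(refined, text, T, H, W)
print(maps.shape)
```

Flattening all frames into one token sequence before attention is what lets each patch aggregate information from other frames as well as from its own.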
Key Designs
- Video Self-Self Attention Extension:
  - Function: Extending the GEM mechanism from image to video inputs, enabling cross-frame spatiotemporal attention.
  - Mechanism: Given a video with \(T\) frames, where each frame is divided into \(N\) patches to yield \(T \times N\) tokens, self-self attention is jointly computed over tokens across all frames to automatically aggregate spatial and temporal information. Finally, heatmaps are generated by calculating the cosine similarity between each patch token and the text embedding.
  - Design Motivation: Directly processing multiple frames allows the video backbone (e.g., ViCLIP) to naturally capture temporal context, rather than processing frames independently.
- Static + Dynamic Layer Weighting:
  - Function: Adaptively assigning weights to different Transformer layers to prioritize layers that capture high-level semantics.
  - Mechanism: The static weight \(w_s^l\) increases monotonically with the layer index, so higher layers receive larger fixed weights. The dynamic weight \(w_d^l\) is obtained by removing one layer at a time and measuring the change in alignment between the CLS token and the text; the layer whose removal causes the largest drop in similarity is considered the most important. The two are combined as \(w_c^l = w_s^l - 1/D + w_d^l\). Assuming the dynamic weights are normalized to sum to 1 over the \(D\) layers they cover, the \(-1/D\) offset cancels their total, so the combined weights sum to the same value as the static weights.
  - Design Motivation: Analysis reveals that abstract concepts like actions and verbs emerge only in the higher layers of the model, so uniform weighting gives too much influence to lower layers that carry little action-relevant information. Dynamic weighting further adapts to specific prompts, as different concepts may be represented to varying degrees across different layers.
- Prompt Decomposition:
  - Function: Decomposing the action description into three independent prompts (verb, object, and full action), performing grounding for each separately, and then merging them.
  - Mechanism: Extracting verbs and objects from the action description, generating formatted prompt texts for each (e.g., "A photo of a person [verb]ing"), computing individual heatmaps to obtain centroid predictions \(c_{verb}\), \(c_{obj}\), and \(c_{act}\), and producing the final prediction via a weighted average \(c_{dec} = 0.2 \cdot c_{verb} + 0.2 \cdot c_{obj} + 0.6 \cdot c_{act}\).
  - Design Motivation: Vision-language models suffer from object bias: when prompted directly with a verb-object combination, the model tends to focus solely on the object region. Processing them separately allows the verb heatmap to focus on action-executing regions (like hands) and the object heatmap to focus on the manipulated object, complementing each other to refine the final grounding.
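The two numerical rules in the list above, the weight combination \(w_c^l = w_s^l - 1/D + w_d^l\) and the centroid fusion \(c_{dec}\), can be sketched as below. This is an illustrative sketch under stated assumptions rather than the paper's code: the function names are hypothetical, the softmax used to turn similarity drops into dynamic weights is an assumed normalization, and all input numbers are made up.

```python
import numpy as np

def dynamic_weights(sim_full, sim_without_layer):
    """Map per-layer similarity drops to weights that sum to 1.

    sim_full: CLS-text similarity with all layers active.
    sim_without_layer: similarity after removing each of the D layers in turn.
    The softmax normalization is an assumption of this sketch.
    """
    drops = sim_full - np.asarray(sim_without_layer, dtype=float)
    exp = np.exp(drops - drops.max())
    return exp / exp.sum()

def combine_weights(static_w, dynamic_w):
    """w_c^l = w_s^l - 1/D + w_d^l over the D weighted layers."""
    static_w = np.asarray(static_w, dtype=float)
    dynamic_w = np.asarray(dynamic_w, dtype=float)
    D = len(static_w)
    return static_w - 1.0 / D + dynamic_w

def fuse_centroids(c_verb, c_obj, c_act, weights=(0.2, 0.2, 0.6)):
    """c_dec = 0.2*c_verb + 0.2*c_obj + 0.6*c_act, per coordinate."""
    points = np.array([c_verb, c_obj, c_act], dtype=float)  # shape (3, 2)
    return np.asarray(weights, dtype=float) @ points

# Hypothetical values, for illustration only.
static_w = np.array([0.5, 1.0, 1.5, 2.0])   # grows with the layer index
sims_removed = [0.30, 0.29, 0.27, 0.20]     # similarity with one layer removed
dyn_w = dynamic_weights(0.31, sims_removed)
comb_w = combine_weights(static_w, dyn_w)
c_dec = fuse_centroids((10.0, 20.0), (30.0, 40.0), (20.0, 30.0))

print("combined weights:", comb_w)
print("sum preserved:", np.isclose(comb_w.sum(), static_w.sum()))
print("fused centroid:", c_dec)
```

Because the dynamic weights sum to 1 and \(1/D\) is subtracted from each of the \(D\) layers, the combined weights keep the same total as the static ones; only their distribution across layers shifts toward the layers whose removal hurts the CLS-text alignment most.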
Loss & Training
This method is entirely training-free and does not involve any loss functions or training procedures. All operations are performed during inference—leveraging the existing weights of the pre-trained backbone to directly generate grounding results through a parallel path of self-self attention and weighting strategies.
Key Experimental Results
Main Results
| Method | Training Required | V-HICO | Daly | YC (YouCook-Interactions) | gYT (GroundingYouTube) | Average |
|---|---|---|---|---|---|---|
| WWW-CLIP (CLIP*) | Yes | 62.34 | 71.35 | 58.35 | 56.98 | 62.26 |
| GEM (ViCLIP) | No | 65.08 | 73.75 | 53.62 | 51.28 | 60.93 |
| VideoGEM (CLIP) | No | 76.90 | 84.53 | 52.57 | 47.46 | 65.37 |
| VideoGEM (OpenCLIP) | No | 76.42 | 80.32 | 60.05 | 45.33 | 65.53 |
| VideoGEM (ViCLIP) | No | 75.75 | 78.25 | 55.10 | 57.21 | 66.58 |
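As a quick arithmetic check, the Average column can be recomputed from the four per-dataset scores, together with each method's margin over the training-based WWW-CLIP row. The sketch below uses only numbers copied from the table; the choice of `Decimal` with half-up rounding is an assumption about how the reported averages were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# (V-HICO, Daly, YC, gYT) scores and the reported average, copied from the table.
rows = {
    "WWW-CLIP (CLIP*)":    (["62.34", "71.35", "58.35", "56.98"], "62.26"),
    "GEM (ViCLIP)":        (["65.08", "73.75", "53.62", "51.28"], "60.93"),
    "VideoGEM (CLIP)":     (["76.90", "84.53", "52.57", "47.46"], "65.37"),
    "VideoGEM (OpenCLIP)": (["76.42", "80.32", "60.05", "45.33"], "65.53"),
    "VideoGEM (ViCLIP)":   (["75.75", "78.25", "55.10", "57.21"], "66.58"),
}

def mean(scores):
    """Exact mean of decimal strings (avoids binary floating-point rounding)."""
    return sum(Decimal(s) for s in scores) / len(scores)

def round2(x):
    """Round to two decimals, half up (assumed rounding convention)."""
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

baseline = mean(rows["WWW-CLIP (CLIP*)"][0])
matches = {}
margins = {}
for name, (scores, reported) in rows.items():
    avg = mean(scores)
    matches[name] = round2(avg) == Decimal(reported)
    margins[name] = avg - baseline
    print(f"{name:22s} mean={round2(avg)} reported={reported} "
          f"margin_vs_WWW-CLIP={round2(margins[name])}")
```

Under these assumptions every recomputed mean agrees with the reported Average, and all three VideoGEM rows sit more than 3 points above the WWW-CLIP average.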
Ablation Study
| Configuration (ViCLIP) | V-HICO | Daly | gYT | Average (incl. YC) |
|---|---|---|---|---|
| No Weighting | 74.79 | 76.84 | 56.39 | 65.60 |
| Dynamic Only | 74.49 | 76.85 | 56.47 | 65.61 |
| Static Only | 76.18 | 78.38 | 56.75 | 66.58 |
| Static + Dynamic | 75.75 | 78.25 | 57.21 | 66.58 |
Key Findings
- Across all three backbones, VideoGEM beats the best training-based method in the table (WWW-CLIP, 62.26) by more than 3 percentage points on the four-dataset average (65.37 to 66.58), while requiring no training at all.
- Image backbones (CLIP/OpenCLIP) perform better on object-oriented datasets (V-HICO, Daly), while the video backbone (ViCLIP) significantly outperforms others on the action-oriented GroundingYouTube dataset.
- Layer importance analysis shows that removing the last few layers has the most significant impact on accuracy, but completely removing the lower layers also degrades performance—supporting the design philosophy of "higher layers are more important, but lower layers are indispensable".
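- Dynamic weighting has a more pronounced effect on OpenCLIP (a gain of more than 3% on GroundingYouTube) than on ViCLIP. The explanation given is that ViCLIP's CLS token is formed primarily in the final layer, so removing earlier layers presumably changes its alignment with the text very little and the dynamic weights carry less signal.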
Highlights & Insights
- Outperforming training-based methods without training is the most significant highlight of this work. By carefully manipulating the internal attention mechanism of pre-trained models (rather than altering weights), robust action grounding is achieved, demonstrating that sufficient spatial semantic information is already encoded in foundation models.
- The prompt decomposition strategy can generalize to other tasks requiring the grounding of complex semantic concepts, such as scene graph grounding, relationship understanding, etc.—where the key idea is to decompose composite concepts into atomic units, locate them separately, and then combine them.
- The dynamic layer weighting mechanism provides a general approach to evaluate the contribution of each layer to specific concepts, which can be transferred to the fields of explainability analysis and feature selection.
Limitations & Future Work
- On YouCook-Interactions the gains are limited: with the CLIP (52.57) and ViCLIP (55.10) backbones VideoGEM falls below the training-based WWW-CLIP (58.35), and only OpenCLIP (60.05) exceeds it. A likely reason is that cooking scenes require domain-specific knowledge.
- Prompt decomposition relies on NLP tools to extract verbs and objects, which can fail with complex natural language descriptions.
- Layer weight parameters (\(K\), \(D\), static weight values) require manual adjustment; future work can explore adaptive mechanisms.
- Temporal action grounding is not discussed, as the method is limited to spatial grounding.
Related Work & Insights
- vs GEM: GEM only addresses object grounding in images and weights all layers uniformly. VideoGEM extends it to video and adds layer weighting and prompt decomposition, raising the four-dataset average with the ViCLIP backbone from 60.93 to 66.58.
- vs WWW-CLIP: WWW-CLIP requires training on HowTo100M (HT100M) and fine-tuning the backbone, whereas VideoGEM is training-free and still reaches a higher four-dataset average (65.37 to 66.58 vs. 62.26), suggesting that effective inference strategies can compensate for the absence of grounding-specific training.
- vs CoMMA: CoMMA utilizes multi-layer cross-modal attention and requires specialized training, whereas VideoGEM only leverages a self-self attention variant of existing attention layers.
Rating
- Novelty: ⭐⭐⭐⭐ The first training-free video action grounding method, though the core mechanism is an incremental extension of GEM.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four datasets, three backbones, and comprehensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with well-justified motivation.
- Value: ⭐⭐⭐⭐ High practical value as a training-free method, though generalization to more scenarios remains to be verified.