ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
Conference: CVPR 2026 | arXiv: 2603.23186 | Code: https://github.com/MICV-yonsei/ViKey | Area: Multimodal VLM | Keywords: Visual Prompting, Video Large Language Models, Temporal Understanding, Frame Index, Training-Free
TL;DR
ViKey overlays frame-index visual prompts (VPs) onto video frames and incorporates a lightweight Keyword-Frame Mapping (KFM) module to significantly improve temporal reasoning in VideoLLMs without any training, achieving near-dense-frame performance with as few as 20% of frames.
Background & Motivation
VideoLLMs demonstrate strong performance on multimodal video tasks, yet processing dense video frames incurs prohibitive computational costs, making frame selection a standard practice. However, frame selection introduces a critical side effect: disruption of temporal continuity.
Limitations of Prior Work: Removing intermediate frames causes VideoLLMs to lose the ability to infer event ordering. For instance, given a sparse sequence showing a player crossing a line followed by a referee showing a red card, humans can infer causality, whereas VideoLLMs may incorrectly conclude that the referee committed the foul.
Key Challenge: Frame selection reduces the video to discrete temporal snapshots, making it inherently difficult to reconstruct a coherent event sequence. Existing solutions—such as enhanced temporal encodings and extended context modules—are complex and require substantial training.
Key Insight: Visual prompting (VP) has been shown to effectively guide spatial attention, yet its potential for cross-frame temporal reasoning remains largely unexplored. The authors observe that simply annotating each frame with its index number enables the model to perceive temporal continuity.
Method

Overall Architecture
ViKey is a training-free, plug-and-play framework: input video frames → overlay frame-index visual prompts → extract key textual concepts from the query → map keywords to the most relevant frames via KFM → rewrite the query with explicit frame indices → feed into the VideoLLM for inference.
Key Designs

- Sequential Visual Prompting (see the first sketch after this list):
  - Function: Embeds frame-index information (e.g., "frame #01") into the pixel space of each frame.
  - Mechanism: A text-format frame index is overlaid at the bottom-left corner of each frame, with the font size adapted to the frame resolution: \(\text{fontsize} = \min(\text{width}, \text{height}) / s\) for a scaling divisor \(s\). The effectiveness of VP is validated through three carefully designed experiments: (1) a positional-encoding degradation experiment demonstrates that VP can independently recover frame-order information; (2) a frame-level reference experiment shows that VP enables the model to retrieve frame content by index, analogous to a dictionary lookup; (3) attention analysis reveals that VP amplifies image-token attention weights in the middle-to-upper layers.
  - Design Motivation: Placement at the bottom-left is motivated by an observed positional bias: bottom placements yield substantially higher accuracy than top placements (reverse lookup: bottom 100% vs. top 60–79%), likely because subtitles and watermarks frequently appear near the bottom of frames in training data.
- Keyword-Frame Mapping (KFM) (see the second sketch after this list):
  - Function: Anchors key concepts from the text query to the most relevant video frames.
  - Mechanism: Salient keywords are extracted from the user query, and cosine similarity between keyword and per-frame embeddings in a shared embedding space identifies the best-matching frames. The query is then rewritten with explicit frame indices, e.g., "In frame #03, what does the player do?", giving the model explicit temporal anchors for reasoning.
  - Design Motivation: VP supplies frame-level indexing capability, while KFM establishes an explicit mapping between textual queries and visual frames; their combination enables precise temporal localization.
- Positional Bias Analysis and Optimization:
  - Function: Characterizes and exploits VideoLLM preferences for VP placement.
  - Mechanism: VP effectiveness is systematically evaluated at the four corner positions (TL/TR/BL/BR). BL and BR achieve 100% accuracy on the reverse-lookup task, whereas TL reaches only about 60%. The dominant error pattern for TL is an off-by-one association: the model links the current frame's index to the content of the subsequent frame.
  - Design Motivation: Frame tokens are concatenated into a single sequence with no explicit boundaries. A top-positioned index is prone to being associated with the tokens of the following frame, whereas a bottom-positioned index aligns more naturally with the terminal tokens of the current frame.
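First, a minimal sketch of the sequential visual prompt, assuming Pillow is available (the `size=` argument of `load_default` needs Pillow ≥ 10.1). The label template, margin, and color are illustrative assumptions, not the authors' exact settings; the `position` parameter mirrors the four-corner positional-bias analysis, defaulting to the paper's preferred bottom-left.

```python
# Minimal sketch of a ViKey-style frame-index overlay (not the authors' code).
from PIL import Image, ImageDraw, ImageFont

def overlay_frame_index(frame: Image.Image, index: int,
                        s: int = 10, position: str = "BL") -> Image.Image:
    """Burn a text-format index (e.g. 'frame #01') into one corner of the frame."""
    frame = frame.copy()
    draw = ImageDraw.Draw(frame)
    # Font size adapts to resolution: fontsize = min(width, height) / s.
    fontsize = max(1, min(frame.width, frame.height) // s)
    font = ImageFont.load_default(size=fontsize)  # `size=` needs Pillow >= 10.1
    label = f"frame #{index:02d}"
    margin = max(2, fontsize // 4)
    text_w = draw.textlength(label, font=font)
    # `position` is one of TL/TR/BL/BR; the paper's bias analysis favors BL.
    x = margin if position in ("TL", "BL") else int(frame.width - text_w - margin)
    y = margin if position in ("TL", "TR") else frame.height - fontsize - margin
    draw.text((x, y), label, fill="white", font=font)
    return frame

# Usage: index every selected frame before handing them to the VideoLLM.
# prompted = [overlay_frame_index(f, i) for i, f in enumerate(frames, start=1)]
```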
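Second, a minimal sketch of KFM's keyword-to-frame matching plus query rewriting, assuming keyword and frame embeddings have already been computed in a shared space (e.g., by a CLIP-style encoder). The function names and the single-anchor rewrite template are illustrative assumptions; the paper's keyword extraction and rewriting steps may differ.

```python
# Minimal sketch of Keyword-Frame Mapping via cosine similarity (illustrative).
import numpy as np

def map_keywords_to_frames(keyword_embs: np.ndarray,
                           frame_embs: np.ndarray) -> np.ndarray:
    """For each keyword, return the index of the most similar frame."""
    k = keyword_embs / np.linalg.norm(keyword_embs, axis=-1, keepdims=True)
    f = frame_embs / np.linalg.norm(frame_embs, axis=-1, keepdims=True)
    return (k @ f.T).argmax(axis=-1)       # (num_keywords,) 0-based frame indices

def rewrite_query(query: str, frame_idx: int) -> str:
    """One plausible rewrite: prepend an explicit frame anchor to the query."""
    # +1 converts the 0-based index to the 1-based "frame #01" labels above,
    # e.g. "What does the player do?" -> "In frame #03, what does the player do?"
    return f"In frame #{frame_idx + 1:02d}, {query[0].lower() + query[1:]}"

# Toy run with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
kw_embs = rng.normal(size=(1, 512))        # one extracted keyword
frame_embs = rng.normal(size=(8, 512))     # eight candidate frames
best = map_keywords_to_frames(kw_embs, frame_embs)[0]
print(rewrite_query("What does the player do?", int(best)))
```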
Loss & Training
ViKey is entirely training-free; it requires neither modification of model parameters nor any additional training.
Key Experimental Results

Main Results
| Model + Setting | TempCompass | MVBench | VideoMME | LongVideoBench |
|---|---|---|---|---|
| LLaVA-Video-7B (64 frames) | 74.68 | 82.50 | — | 56.42 |
| + ViKey (64 frames) | 77.83 | 87.00 | Improved | 58.66 |
| + ViKey (13 frames = 20%) | ~75 | ~83 | ≈ 64-frame | ~56 |
Consistent improvements are observed across temporal reasoning subsets of TempCompass, MVBench, VideoMME, and LongVideoBench.
Ablation Study
| Configuration | Lookup Accuracy | Reverse Lookup Accuracy | Notes |
|---|---|---|---|
| No VP | 12.43% | 18.57% | Extremely low baseline |
| VP (bottom-left) | 64.62% | 100.00% | Substantial gain in frame-level reference |
| VP (top-left) | 55.56% | 60.19% | Evident positional bias |
| VP + KFM | Best | Best | Complementary combination |
Key Findings
- VP recovers 2.9–9.9 percentage points of temporal understanding performance even under extreme conditions where positional encodings are corrupted.
- VP increases the average attention weight allocated to image tokens by 11.65%, concentrated in the middle-to-upper layers (layers 4–6, 11–14, and beyond layer 21).
- Using only 20% of frames with ViKey approaches the dense 100%-frame baseline on several benchmarks, demonstrating high efficiency.
Highlights & Insights
- Minimal yet effective: Annotating frames with simple index numbers substantially improves temporal reasoning. This "modify the input, not the model" paradigm is both elegant and practical, enabling zero-cost integration into any VideoLLM.
- Positional bias discovery: The marked superiority of bottom-positioned VP over top-positioned VP reveals a training-induced bias in VideoLLMs—stronger attention toward bottom regions—a finding with broader implications for all VP-based methods.
- Frames as a dictionary: Treating frame indices as keys and frame contents as values introduces a novel paradigm for fine-grained temporal control in VideoLLMs.
Limitations & Future Work
- The keyword extraction component of KFM relies on an external embedding model, which may become a bottleneck for very long videos.
- VP inherently occupies pixel space within frames; interference may arise for videos that already contain subtitles or watermarks.
- The positional bias suggests that the model may be "memorizing" text at specific locations rather than genuinely comprehending temporal relationships.
- Future directions include adaptive VP size/position selection and joint optimization with frame-selection strategies.
Related Work & Insights
- vs. conventional frame selection methods: Frame selection focuses on which frames to retain; ViKey focuses on how to make retained frames more informative. The two approaches are complementary.
- vs. temporal encoding enhancement methods: Training-required methods such as extended context modules achieve comparable performance, whereas ViKey requires no training whatsoever.
- vs. spatial VP methods: Prior VP work guides spatial attention within frames (e.g., circle annotations); ViKey is the first to systematically explore VP for cross-frame temporal reasoning.
Rating
- Novelty: ⭐⭐⭐⭐ A simple yet insightful observation; the first systematic exploration of VP for temporal reasoning.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three analytical experiments, four benchmarks, and multiple models—highly rigorous.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly articulated, experimental design is precise, and analysis is thorough.
- Value: ⭐⭐⭐⭐ Training-free and plug-and-play; highly practical.