TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos¶
Conference: ACL 2026
arXiv: 2412.02930
Code: N/A
Area: Segmentation
Keywords: Video LLM, Time-Aware Encoding, BiLSTM, Long Video Understanding, Industrial Assembly Dataset
TL;DR¶
This paper proposes TemporalVLM, which extracts local fine-grained temporal features through a time-aware segment encoder (overlapping sliding Video Q-Former + fusion module), then aggregates global long-range dependencies using BiLSTM. This is the first work to introduce LSTM into Video LLMs, outperforming prior methods on four tasks: dense video captioning, temporal grounding, highlight detection, and action segmentation.
Background & Motivation¶
Background: Video LLMs achieve video understanding by combining video encoders with LLMs. Existing methods typically map videos to a fixed number of tokens, causing performance degradation on long videos, and encode frames and timestamps separately, leading to poor temporal reasoning.
Limitations of Prior Work: (1) Treating the entire video as a single segment with fixed token count loses fine-grained information for long videos; (2) Using pooling or query aggregation for global features fails to capture long-range temporal dependencies; (3) Separate encoding of frames and timestamps produces time-insensitive representations.
Key Challenge: Temporal reasoning in long videos requires both local fine-grained understanding (precise localization of individual events) and global semantic understanding (temporal relationships between events), but existing architectures cannot address both simultaneously.
Goal: Design a "coarse-to-fine" video encoder that simultaneously extracts time-aware local features and global features.
Key Insight: Segment long videos into multiple short clips, extract local features at the clip level with a time-aware encoder, then aggregate global features across clips using BiLSTM — combining clip-level granularity with sequence-level long-range modeling.
Core Idea: Overlapping sliding windows + fusion module for time-aware local encoding, BiLSTM for bidirectional long-range aggregation — the first introduction of LSTM into Video LLMs.
Method¶
Overall Architecture¶
The input video is divided into C=6 segments, with 96 frames sampled per segment. Time-aware segment encoder: frames are encoded by EVA-CLIP and processed jointly with their timestamps through an Image Q-Former to obtain time-aware frame features; an overlapping sliding Video Q-Former and a fusion module then produce segment-level local features. BiLSTM module: the local features of all segments are concatenated in temporal order and passed through a bidirectional LSTM for global feature aggregation. The final features are projected into LLaMA-2 7B's embedding space.
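To make the dataflow concrete, here is a minimal PyTorch sketch of the pipeline; the dimensions, module names, and the `Linear`/`Identity` stand-ins are placeholders of mine, not the released implementation. The segment encoder itself is spelled out under Key Designs below.

```python
import torch
import torch.nn as nn

# Toy end-to-end dataflow. C=6 segments, 96 frames per segment, feature dim d (assumed).
C, F, d, llm_dim = 6, 96, 768, 4096

class TemporalVLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the Image Q-Former that encodes frames jointly with their timestamps.
        self.time_aware_encoder = nn.Linear(d + 1, d)
        # Stand-in for the overlapping sliding Video Q-Former + fusion module (see Key Designs).
        self.segment_encoder = nn.Identity()
        # Bidirectional LSTM aggregates long-range dependencies across all segments.
        self.bilstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        # Projection into the frozen LLM's (LLaMA-2 7B) embedding space.
        self.proj = nn.Linear(2 * d, llm_dim)

    def forward(self, frame_feats, timestamps):
        # frame_feats: (C, F, d) from a frozen vision backbone (EVA-CLIP in the paper)
        # timestamps:  (C, F, 1) per-frame times, encoded together with the frames
        x = self.time_aware_encoder(torch.cat([frame_feats, timestamps], dim=-1))
        local = self.segment_encoder(x)          # per-segment time-aware local features
        seq = local.reshape(1, C * F, d)         # concatenate segments in temporal order
        global_feats, _ = self.bilstm(seq)       # (1, C*F, 2d): [h_t^f, h_t^b]
        return self.proj(global_feats)           # video tokens fed to the LLM with the prompt

video_tokens = TemporalVLMSketch()(torch.randn(C, F, d), torch.rand(C, F, 1))
print(video_tokens.shape)  # torch.Size([1, 576, 4096])
```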
Key Designs¶
- Overlapping Sliding Video Q-Former + Fusion Module (a minimal sketch follows this list):
  - Function: Extract fused, time-aware local features within each segment
  - Mechanism: Frame features are processed by a sliding Video Q-Former with window size q=32 and overlap o=16, producing a feature sequence \(\mathbf{S}\) that contains redundant boundary tokens; a multi-head self-attention fusion module is applied to \(\mathbf{S}\), fusing the diverse temporal perspectives of the different windows into context-aware embeddings
  - Design Motivation: Compared to TimeChat's non-overlapping windows, the overlap produces spatially redundant but temporally complementary tokens; the fusion module leverages this diversity to generate richer segment-level features
- BiLSTM Global Feature Aggregation:
  - Function: Capture bidirectional long-range temporal dependencies across segments
  - Mechanism: Local features from all segments are concatenated in temporal order and processed by forward and backward LSTMs, with the final output being the concatenation \(\mathbf{h}_t = [\mathbf{h}_t^f, \mathbf{h}_t^b]\)
  - Design Motivation: Pooling loses temporal information, and a Transformer's fixed positional encoding and context window are less suited than an LSTM's recurrent structure to capturing temporal dependencies; the ablation confirms BiLSTM outperforms average pooling, linear layers, unidirectional LSTM, and a Transformer
- IndustryASM Dataset:
  - Function: Fill the gap in temporal segmentation benchmarks for long industrial manufacturing videos
  - Mechanism: 4,851 videos averaging 105 seconds and covering 47 industrial assembly tasks; frame-level action segmentation annotated by industrial engineers with 92% annotation consistency
  - Design Motivation: Existing datasets are biased toward cooking activities or are web-sourced (multi-shot); industrial assembly is closer to practical applications and provides continuous single-shot recordings
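To show how the overlap and fusion step could work in practice, here is a small sketch of the first design above; the learned-query cross-attention is my own simplification of a Video Q-Former, and `SlidingWindowFusion`, the query count, and the head count are hypothetical. Only the window size q=32 and overlap o=16 follow the paper.

```python
import torch
import torch.nn as nn

class SlidingWindowFusion(nn.Module):
    """Sketch of the overlapping sliding Video Q-Former + fusion step (not the authors' code).

    Each window of q frames yields a few query tokens; the concatenated window outputs S are
    redundant across window boundaries, and a self-attention fusion layer mixes them."""
    def __init__(self, d=768, q=32, o=16, n_query=8, n_heads=8):
        super().__init__()
        self.q, self.o = q, o
        self.queries = nn.Parameter(torch.randn(n_query, d))
        self.window_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)  # stand-in Q-Former
        self.fusion = nn.MultiheadAttention(d, n_heads, batch_first=True)       # fusion over S

    def forward(self, frames):                    # frames: (F, d) time-aware frame features
        F_, _ = frames.shape
        step = self.q - self.o                    # stride 16: adjacent windows share 16 frames
        outs = []
        for start in range(0, F_ - self.q + 1, step):
            win = frames[start:start + self.q].unsqueeze(0)   # (1, q, d)
            qry = self.queries.unsqueeze(0)                   # (1, n_query, d)
            tok, _ = self.window_attn(qry, win, win)          # window-level query tokens
            outs.append(tok)
        S = torch.cat(outs, dim=1)                # redundant but temporally complementary tokens
        fused, _ = self.fusion(S, S, S)           # self-attention fuses views across windows
        return fused.squeeze(0)                   # segment-level local features

seg = SlidingWindowFusion()
print(seg(torch.randn(96, 768)).shape)            # torch.Size([40, 768]): 5 windows x 8 queries
```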
Loss & Training¶
Standard autoregressive cross-entropy loss (Eq. 8). The LLM and image encoder are frozen; only the BiLSTM, the projection layers, and the LoRA adapters are fine-tuned. Training runs on 8×A100 GPUs.
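A small sketch of the training objective, with the freezing/LoRA recipe indicated in comments; the LoRA hyperparameters and target modules are assumptions of mine, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def autoregressive_ce_loss(logits, labels, ignore_index=-100):
    """Standard next-token cross-entropy (the spirit of the paper's Eq. 8):
    predict token t+1 from positions <= t; positions set to ignore_index
    (e.g. the prompt and video tokens) do not contribute to the loss."""
    shift_logits = logits[:, :-1, :].contiguous()   # (B, T-1, V)
    shift_labels = labels[:, 1:].contiguous()       # (B, T-1)
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=ignore_index)

# Hypothetical freezing/LoRA setup mirroring the paper's recipe:
#   llm.requires_grad_(False); vision_encoder.requires_grad_(False)
#   from peft import LoraConfig, get_peft_model
#   llm = get_peft_model(llm, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))
# Trainable parameters: the BiLSTM, the projection layers, and the LoRA adapters.
```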
Key Experimental Results¶
Main Results¶
Dense Video Captioning (YouCook2) + Temporal Grounding (Charades-STA) Zero-shot Comparison
| Method | SODA_c (YouCook2) | CIDEr (YouCook2) | R@1 IoU=0.5 (Charades-STA) |
|---|---|---|---|
| VideoChat-Embed | 0.2 | 0.6 | 3.2 |
| TimeChat | — | — | — |
| LongVLM | 0.8 | 2.5 | 13.9 |
| TemporalVLM | Best | Best | Best |
Ablation Study¶
Global Aggregation Method Comparison
| Aggregation | Note |
|---|---|
| Average Pooling | Loses temporal information |
| Linear Layer | No sequence modeling |
| Unidirectional LSTM | Forward information only |
| Transformer | Fixed positional encoding inferior to recurrence |
| BiLSTM | Bidirectional long-range dependencies, optimal |
Key Findings¶
- TemporalVLM outperforms prior methods on all four temporal reasoning tasks
- BiLSTM as global aggregation module consistently outperforms all alternatives
- Overlapping windows + fusion module significantly improves over non-overlapping windows
- Effective on IndustryASM industrial dataset as well, demonstrating practical application value
- First demonstration that LSTMs offer distinct advantages inside Video LLMs and should not be entirely replaced by Transformers
Highlights & Insights¶
- The "return to LSTM" choice is counter-intuitive but effective — in temporal modeling, the inductive bias of recurrent structures outperforms general-purpose attention
- Redundant information from overlapping windows becomes a diversity source for the fusion module — turning a deficiency into an advantage
- The IndustryASM dataset fills an important gap in industrial scenarios
Limitations & Future Work¶
- Fixed 6-segment partition may not suit all video lengths; adaptive segmentation strategies are worth exploring
- BiLSTM's sequential processing limits parallelism; SSM/Mamba variants may be more efficient
- Only evaluated with LLaMA-2 7B; larger or newer LLMs are not assessed
- Generalizability of IndustryASM is untested: it is unclear whether 47 assembly tasks cover the diversity of industrial scenarios
Related Work & Insights¶
- vs TimeChat: Uses non-overlapping Video Q-Former without global aggregation; TemporalVLM's overlap + fusion + BiLSTM comprehensively outperforms
- vs LongVLM: Also segments clips but uses pooling for global features without leveraging timestamps; TemporalVLM's time-aware encoding and BiLSTM aggregation are more effective
Rating¶
- Novelty: ⭐⭐⭐⭐ First introduction of BiLSTM into Video LLMs; overlapping fusion design is novel
- Experimental Thoroughness: ⭐⭐⭐⭐ 4 tasks + detailed ablation + new dataset
- Writing Quality: ⭐⭐⭐⭐ Architecture diagram is clear; comparison with prior methods is intuitive
- Value: ⭐⭐⭐⭐ IndustryASM dataset and BiLSTM findings are valuable to the community