TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos¶
Conference: ACL 2026
arXiv: 2412.02930
Code: N/A
Area: Segmentation
Keywords: Video LLM, Time-Aware Encoding, BiLSTM, Long Video Understanding, Industrial Assembly Dataset
TL;DR¶
This paper proposes TemporalVLM, which extracts local fine-grained temporal features through a time-aware segment encoder (overlapping sliding Video Q-Former + fusion module), then aggregates global long-range dependencies using BiLSTM. This is the first work to introduce LSTM into Video LLMs, outperforming prior methods on four tasks: dense video captioning, temporal grounding, highlight detection, and action segmentation.
Background & Motivation¶
Background: Video LLMs achieve video understanding by combining video encoders with LLMs. Existing methods typically map videos to a fixed number of tokens, causing performance degradation on long videos, and encode frames and timestamps separately, leading to poor temporal reasoning.
Limitations of Prior Work: (1) Treating the entire video as a single segment with fixed token count loses fine-grained information for long videos; (2) Using pooling or query aggregation for global features fails to capture long-range temporal dependencies; (3) Separate encoding of frames and timestamps produces time-insensitive representations.
Key Challenge: Temporal reasoning in long videos requires both local fine-grained understanding (precise localization of individual events) and global semantic understanding (temporal relationships between events), but existing architectures cannot address both simultaneously.
Goal: Design a "coarse-to-fine" video encoder that simultaneously extracts time-aware local features and global features.
Key Insight: Segment long videos into multiple short clips, extract local features at the clip level with a time-aware encoder, then aggregate global features across clips using BiLSTM — combining clip-level granularity with sequence-level long-range modeling.
Core Idea: Overlapping sliding windows + fusion module for time-aware local encoding, BiLSTM for bidirectional long-range aggregation — the first introduction of LSTM into Video LLMs.
Method¶
Overall Architecture¶
The input video is divided into C=6 segments, with 96 frames sampled per segment. Time-aware segment encoder: frames are encoded by EVA-CLIP and processed jointly with their timestamps through an Image Q-Former to obtain time-aware frame features; an overlapping sliding Video Q-Former and a fusion module then produce segment-level local features. BiLSTM module: the local features of all segments are concatenated in temporal order and passed through a bidirectional LSTM for global feature aggregation. The final features are projected into LLaMA-2 7B's embedding space.
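To make the dataflow concrete, here is a minimal PyTorch sketch of the pipeline; the dimensions, module names, and the `Linear`/`Identity` stand-ins are placeholders of mine, not the released implementation. The segment encoder itself is spelled out under Key Designs below.

```python
import torch
import torch.nn as nn

# Toy end-to-end dataflow. C=6 segments, 96 frames per segment, feature dim d (assumed).
C, F, d, llm_dim = 6, 96, 768, 4096

class TemporalVLMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the Image Q-Former that encodes frames jointly with their timestamps.
        self.time_aware_encoder = nn.Linear(d + 1, d)
        # Stand-in for the overlapping sliding Video Q-Former + fusion module (see Key Designs).
        self.segment_encoder = nn.Identity()
        # Bidirectional LSTM aggregates long-range dependencies across all segments.
        self.bilstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        # Projection into the frozen LLM's (LLaMA-2 7B) embedding space.
        self.proj = nn.Linear(2 * d, llm_dim)

    def forward(self, frame_feats, timestamps):
        # frame_feats: (C, F, d) from a frozen vision backbone (EVA-CLIP in the paper)
        # timestamps:  (C, F, 1) per-frame times, encoded together with the frames
        x = self.time_aware_encoder(torch.cat([frame_feats, timestamps], dim=-1))
        local = self.segment_encoder(x)          # per-segment time-aware local features
        seq = local.reshape(1, C * F, d)         # concatenate segments in temporal order
        global_feats, _ = self.bilstm(seq)       # (1, C*F, 2d): [h_t^f, h_t^b]
        return self.proj(global_feats)           # video tokens fed to the LLM with the prompt

video_tokens = TemporalVLMSketch()(torch.randn(C, F, d), torch.rand(C, F, 1))
print(video_tokens.shape)  # torch.Size([1, 576, 4096])
```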
Key Designs¶
- Overlapping Sliding Video Q-Former + Fusion Module (a minimal sketch follows this list):
  - Function: Extract fused, time-aware local features within each segment
  - Mechanism: Frame features are processed by a sliding Video Q-Former with window size q=32 and overlap o=16, producing a feature sequence \(\mathbf{S}\) that contains redundant boundary tokens; a multi-head self-attention fusion module is applied to \(\mathbf{S}\), fusing the diverse temporal perspectives of the different windows into context-aware embeddings
  - Design Motivation: Compared to TimeChat's non-overlapping windows, the overlap produces spatially redundant but temporally complementary tokens; the fusion module leverages this diversity to generate richer segment-level features
- BiLSTM Global Feature Aggregation:
  - Function: Capture bidirectional long-range temporal dependencies across segments
  - Mechanism: Local features from all segments are concatenated in temporal order and processed by forward and backward LSTMs, with the final output being the concatenation \(\mathbf{h}_t = [\mathbf{h}_t^f, \mathbf{h}_t^b]\)
  - Design Motivation: Pooling loses temporal information, and a Transformer's fixed positional encoding and context window are less suited than an LSTM's recurrent structure to capturing temporal dependencies; the ablation confirms BiLSTM outperforms average pooling, linear layers, unidirectional LSTM, and a Transformer
- IndustryASM Dataset:
  - Function: Fill the gap in temporal segmentation benchmarks for long industrial manufacturing videos
  - Mechanism: 4,851 videos averaging 105 seconds and covering 47 industrial assembly tasks; frame-level action segmentation annotated by industrial engineers with 92% annotation consistency
  - Design Motivation: Existing datasets are biased toward cooking activities or are web-sourced (multi-shot); industrial assembly is closer to practical applications and provides continuous single-shot recordings
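To show how the overlap and fusion step could work in practice, here is a small sketch of the first design above; the learned-query cross-attention is my own simplification of a Video Q-Former, and `SlidingWindowFusion`, the query count, and the head count are hypothetical. Only the window size q=32 and overlap o=16 follow the paper.

```python
import torch
import torch.nn as nn

class SlidingWindowFusion(nn.Module):
    """Sketch of the overlapping sliding Video Q-Former + fusion step (not the authors' code).

    Each window of q frames yields a few query tokens; the concatenated window outputs S are
    redundant across window boundaries, and a self-attention fusion layer mixes them."""
    def __init__(self, d=768, q=32, o=16, n_query=8, n_heads=8):
        super().__init__()
        self.q, self.o = q, o
        self.queries = nn.Parameter(torch.randn(n_query, d))
        self.window_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)  # stand-in Q-Former
        self.fusion = nn.MultiheadAttention(d, n_heads, batch_first=True)       # fusion over S

    def forward(self, frames):                    # frames: (F, d) time-aware frame features
        F_, _ = frames.shape
        step = self.q - self.o                    # stride 16: adjacent windows share 16 frames
        outs = []
        for start in range(0, F_ - self.q + 1, step):
            win = frames[start:start + self.q].unsqueeze(0)   # (1, q, d)
            qry = self.queries.unsqueeze(0)                   # (1, n_query, d)
            tok, _ = self.window_attn(qry, win, win)          # window-level query tokens
            outs.append(tok)
        S = torch.cat(outs, dim=1)                # redundant but temporally complementary tokens
        fused, _ = self.fusion(S, S, S)           # self-attention fuses views across windows
        return fused.squeeze(0)                   # segment-level local features

seg = SlidingWindowFusion()
print(seg(torch.randn(96, 768)).shape)            # torch.Size([40, 768]): 5 windows x 8 queries
```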
Loss & Training¶
Standard autoregressive cross-entropy loss (Eq. 8). The LLM and image encoder are frozen; only the BiLSTM, the projection layers, and the LoRA adapters are fine-tuned. Training runs on 8×A100 GPUs.
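A small sketch of the training objective, with the freezing/LoRA recipe indicated in comments; the LoRA hyperparameters and target modules are assumptions of mine, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def autoregressive_ce_loss(logits, labels, ignore_index=-100):
    """Standard next-token cross-entropy (the spirit of the paper's Eq. 8):
    predict token t+1 from positions <= t; positions set to ignore_index
    (e.g. the prompt and video tokens) do not contribute to the loss."""
    shift_logits = logits[:, :-1, :].contiguous()   # (B, T-1, V)
    shift_labels = labels[:, 1:].contiguous()       # (B, T-1)
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=ignore_index)

# Hypothetical freezing/LoRA setup mirroring the paper's recipe:
#   llm.requires_grad_(False); vision_encoder.requires_grad_(False)
#   from peft import LoraConfig, get_peft_model
#   llm = get_peft_model(llm, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))
# Trainable parameters: the BiLSTM, the projection layers, and the LoRA adapters.
```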
Key Experimental Results¶
Main Results¶
Dense Video Captioning (YouCook2) + Temporal Grounding (Charades-STA) Zero-shot Comparison
| Method | SODA_c (YouCook2) | CIDEr (YouCook2) | R@1 IoU=0.5 (Charades-STA) |
|---|---|---|---|
| VideoChat-Embed | 0.2 | 0.6 | 3.2 |
| TimeChat | — | — | — |
| LongVLM | 0.8 | 2.5 | 13.9 |
| TemporalVLM | Best | Best | Best |
Ablation Study¶
Global Aggregation Method Comparison
| Aggregation | Note |
|---|---|
| Average Pooling | Loses temporal information |
| Linear Layer | No sequence modeling |
| Unidirectional LSTM | Forward information only |
| Transformer | Fixed positional encoding inferior to recurrence |
| BiLSTM | Bidirectional long-range dependencies, optimal |
Key Findings¶
- TemporalVLM outperforms prior methods on all four temporal reasoning tasks
- BiLSTM as global aggregation module consistently outperforms all alternatives
- Overlapping windows + fusion module significantly improves over non-overlapping windows
- Effective on IndustryASM industrial dataset as well, demonstrating practical application value
- First demonstration that LSTMs offer distinct advantages inside Video LLMs and should not be entirely replaced by Transformers
Highlights & Insights¶
- The "return to LSTM" choice is counter-intuitive but effective — in temporal modeling, the inductive bias of recurrent structures outperforms general-purpose attention
- Redundant information from overlapping windows becomes a diversity source for the fusion module — turning a deficiency into an advantage
- The IndustryASM dataset fills an important gap in industrial scenarios
Limitations & Future Work¶
- Fixed 6-segment partition may not suit all video lengths; adaptive segmentation strategies are worth exploring
- BiLSTM's sequential processing limits parallelism; SSM/Mamba variants may be more efficient
- Only evaluated with LLaMA-2 7B; larger or newer LLMs are not assessed
- Generalizability of IndustryASM is untested: it is unclear whether 47 assembly tasks cover the diversity of industrial scenarios
Related Work & Insights¶
- vs TimeChat: Uses non-overlapping Video Q-Former without global aggregation; TemporalVLM's overlap + fusion + BiLSTM comprehensively outperforms
- vs LongVLM: Also segments clips but uses pooling for global features without leveraging timestamps; TemporalVLM's time-aware encoding and BiLSTM aggregation are more effective
Rating¶
- Novelty: ⭐⭐⭐⭐ First introduction of BiLSTM into Video LLMs; overlapping fusion design is novel
- Experimental Thoroughness: ⭐⭐⭐⭐ 4 tasks + detailed ablation + new dataset
- Writing Quality: ⭐⭐⭐⭐ Architecture diagram is clear; comparison with prior methods is intuitive
- Value: ⭐⭐⭐⭐ IndustryASM dataset and BiLSTM findings are valuable to the community