TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

Conference: ACL 2026
arXiv: 2412.02930
Code: N/A
Area: Segmentation
Keywords: Video LLM, Time-Aware Encoding, BiLSTM, Long Video Understanding, Industrial Assembly Dataset

TL;DR

This paper proposes TemporalVLM, which extracts fine-grained local temporal features with a time-aware segment encoder (overlapping sliding Video Q-Former + fusion module), then models global long-range dependencies across segments with a BiLSTM. The authors present this as the first work to introduce LSTMs into Video LLMs, and it outperforms prior methods on four tasks: dense video captioning, temporal grounding, highlight detection, and action segmentation.

Background & Motivation

Background: Video LLMs achieve video understanding by combining video encoders with LLMs. Existing methods typically map videos to a fixed number of tokens, causing performance degradation on long videos, and encode frames and timestamps separately, leading to poor temporal reasoning.

Limitations of Prior Work: (1) Treating the entire video as a single segment with fixed token count loses fine-grained information for long videos; (2) Using pooling or query aggregation for global features fails to capture long-range temporal dependencies; (3) Separate encoding of frames and timestamps produces time-insensitive representations.

Key Challenge: Temporal reasoning in long videos requires both local fine-grained understanding (precise localization of individual events) and global semantic understanding (temporal relationships between events), but existing architectures cannot address both simultaneously.

Goal: Design a "coarse-to-fine" video encoder that simultaneously extracts time-aware local features and global features.

Key Insight: Segment long videos into multiple short clips, extract local features at the clip level with a time-aware encoder, then aggregate global features across clips using BiLSTM — combining clip-level granularity with sequence-level long-range modeling.

Core Idea: Overlapping sliding windows + fusion module for time-aware local encoding, BiLSTM for bidirectional long-range aggregation — the first introduction of LSTM into Video LLMs.

Method

Overall Architecture

The input video is divided into C=6 segments, each sampling 96 frames.

  • Time-aware segment encoder: frames are encoded by EVA-CLIP, then jointly processed with their timestamps through an Image Q-Former to obtain time-aware frame features; an overlapping sliding Video Q-Former and a fusion module then produce the segment's local features.
  • BiLSTM module: local features from all segments are temporally concatenated and passed through a bidirectional LSTM for global feature aggregation.
  • Projection: the final features are projected into LLaMA-2 7B's embedding space.
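Below is a minimal PyTorch sketch of this data flow. The module names (TemporalVLMSketch, segment_encoder) and the dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the paper uses C=6 segments of 96 frames each.
C, D_VID, D_LLM = 6, 768, 4096

class TemporalVLMSketch(nn.Module):
    """Coarse-to-fine pipeline sketch: local segment encoding -> BiLSTM -> LLM space."""
    def __init__(self, segment_encoder: nn.Module):
        super().__init__()
        # Hypothetical time-aware encoder: (frames, timestamps) -> (B, n_tokens, D_VID)
        self.segment_encoder = segment_encoder
        # The BiLSTM output concatenates both directions, h_t = [h_t^f, h_t^b],
        # so its feature dimension doubles.
        self.bilstm = nn.LSTM(D_VID, D_VID, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * D_VID, D_LLM)  # into LLaMA-2's embedding space

    def forward(self, segments: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # segments: (B, C, 96, ...) raw frames; timestamps: (B, C, 96)
        local = [self.segment_encoder(segments[:, c], timestamps[:, c]) for c in range(C)]
        local = torch.cat(local, dim=1)       # temporal concatenation across segments
        global_feats, _ = self.bilstm(local)  # bidirectional long-range aggregation
        return self.proj(global_feats)        # video tokens fed to the frozen LLM
```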

Key Designs

  1. Overlapping Sliding Video Q-Former + Fusion Module:

    • Function: Extract fused time-aware local features within segments
    • Mechanism: Process frame features with a sliding Video Q-Former using window size q=32 and overlap o=16, producing a feature sequence \(\mathbf{S}\) that contains redundant boundary tokens. A multi-head self-attention fusion module is applied to \(\mathbf{S}\), fusing the diverse temporal perspectives of different windows into context-aware embeddings (see the sketch after this list)
    • Design Motivation: Compared to TimeChat's non-overlapping windows, overlapping windows produce spatially redundant but temporally complementary tokens; the fusion module exploits this diversity to generate richer segment-level features
  2. BiLSTM Global Feature Aggregation:

    • Function: Capture bidirectional long-range temporal dependencies across segments
    • Mechanism: Local features from all segments are temporally concatenated, processed by forward and backward LSTM separately, with final output being the concatenation \(\mathbf{h}_t = [\mathbf{h}_t^f, \mathbf{h}_t^b]\)
    • Design Motivation: Pooling discards temporal ordering, and a Transformer with fixed positional encodings is less suited than an LSTM's recurrence for modeling temporal dependencies. Ablation confirms BiLSTM outperforms average pooling, linear layers, unidirectional LSTM, and Transformer
  3. IndustryASM Dataset:

    • Function: Fill the gap in industrial manufacturing long video temporal segmentation benchmarks
    • Mechanism: 4,851 videos, averaging 105 seconds, covering 47 industrial assembly tasks. Frame-level action segmentation annotated by industrial engineers with 92% annotation consistency
    • Design Motivation: Existing datasets are biased toward cooking activities or web-sourced (multi-shot); industrial assembly is closer to practical applications and provides continuous single-shot recordings
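A minimal sketch of the overlapping sliding Video Q-Former plus fusion module, assuming a hypothetical `video_qformer` that maps a window of frame features to a fixed set of query tokens; all names and sizes beyond q=32 and o=16 are illustrative.

```python
import torch
import torch.nn as nn

class OverlapSlideFusion(nn.Module):
    """Overlapping sliding Video Q-Former + multi-head self-attention fusion (sketch)."""
    def __init__(self, video_qformer: nn.Module, dim: int = 768,
                 window: int = 32, overlap: int = 16, heads: int = 8):
        super().__init__()
        self.video_qformer = video_qformer  # window of frames -> (B, n_query, dim)
        self.window = window
        self.stride = window - overlap      # q=32, o=16 -> stride of 16 frames
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim) time-aware frame features for one segment
        B, T, _ = frame_feats.shape
        tokens = []
        for start in range(0, T - self.window + 1, self.stride):
            win = frame_feats[:, start:start + self.window]  # overlapping window
            tokens.append(self.video_qformer(win))
        S = torch.cat(tokens, dim=1)  # redundant but temporally complementary tokens
        # Self-attention fuses the per-window views into context-aware embeddings.
        fused, _ = self.fusion(S, S, S)
        return fused
```

With 96 frames per segment, q=32 and o=16 yield five overlapping windows, so each interior boundary frame is seen from two temporal contexts before fusion.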

Loss & Training

Standard autoregressive cross-entropy loss (Eq. 8). The LLM and image encoder are frozen; only the BiLSTM, the projection layers, and the LoRA adapters are fine-tuned. Training uses 8×A100 GPUs.
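A sketch of this freezing scheme using HuggingFace PEFT; the LoRA hyperparameters and target modules here are illustrative assumptions, not the paper's settings.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model  # assumes the HuggingFace PEFT library

def prepare_for_training(llm, image_encoder: nn.Module,
                         bilstm: nn.Module, proj: nn.Module):
    """Freeze the backbones; train only BiLSTM, projections, and LoRA adapters."""
    for p in image_encoder.parameters():  # EVA-CLIP stays frozen
        p.requires_grad = False
    for p in llm.parameters():            # LLaMA-2 7B weights stay frozen...
        p.requires_grad = False
    # ...except low-rank adapters injected into the attention projections
    # (r, alpha, and target_modules are illustrative, not from the paper).
    lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    llm = get_peft_model(llm, lora_cfg)
    trainable = (list(bilstm.parameters()) + list(proj.parameters())
                 + [p for p in llm.parameters() if p.requires_grad])
    return llm, trainable
```

The loss itself is the usual next-token cross-entropy over the target caption/answer tokens.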

Key Experimental Results

Main Results

Dense Video Captioning (YouCook2) + Temporal Grounding (Charades-STA) Zero-shot Comparison

Method            SODA_c   CIDEr   R@1 (IoU=0.5)
VideoChat-Embed   0.2      0.6     3.2
TimeChat          —        —       —
LongVLM           0.8      2.5     13.9
TemporalVLM       Best     Best    Best

Ablation Study

Global Aggregation Method Comparison

Aggregation          Note
Average Pooling      Loses temporal information
Linear Layer         No sequence modeling
Unidirectional LSTM  Forward information only
Transformer          Fixed positional encoding inferior to recurrence
BiLSTM               Bidirectional long-range dependencies; optimal

Key Findings

  • TemporalVLM outperforms prior methods on all four temporal reasoning tasks
  • BiLSTM as global aggregation module consistently outperforms all alternatives
  • Overlapping windows + fusion module significantly improves over non-overlapping windows
  • Effective on IndustryASM industrial dataset as well, demonstrating practical application value
  • First demonstration that LSTMs retain unique advantages in Video LLMs and should not be wholly replaced by Transformers

Highlights & Insights

  • The "return to LSTM" choice is counter-intuitive but effective — in temporal modeling, the inductive bias of recurrent structures outperforms general-purpose attention
  • Redundant information from overlapping windows becomes a diversity source for the fusion module — turning a deficiency into an advantage
  • The IndustryASM dataset fills an important gap in industrial scenarios

Limitations & Future Work

  • Fixed 6-segment partition may not suit all video lengths; adaptive segmentation strategies are worth exploring
  • BiLSTM's sequential processing limits parallelism; SSM/Mamba variants may be more efficient
  • Only evaluated with LLaMA-2 7B; larger or newer LLMs are not assessed
  • IndustryASM generalizability: it is unclear whether 47 assembly tasks cover the diversity of industrial scenarios

Comparison with Prior Work

  • vs TimeChat: uses a non-overlapping Video Q-Former with no global aggregation; TemporalVLM's overlap + fusion + BiLSTM outperforms it across the board
  • vs LongVLM: also segments the video into clips, but uses pooling for global features and does not leverage timestamps; TemporalVLM's time-aware encoding and BiLSTM aggregation are more effective

Rating

  • Novelty: ⭐⭐⭐⭐ First introduction of BiLSTM into Video LLMs; overlapping fusion design is novel
  • Experimental Thoroughness: ⭐⭐⭐⭐ 4 tasks + detailed ablation + new dataset
  • Writing Quality: ⭐⭐⭐⭐ Architecture diagram is clear; comparison with prior methods is intuitive
  • Value: ⭐⭐⭐⭐ IndustryASM dataset and BiLSTM findings are valuable to the community