
Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Conference: NeurIPS 2025 | arXiv: 2506.05332 | Code: Project Page | Area: Video Understanding / Multimodal VLM | Keywords: Long Video Understanding, Video-LMM, Memory Augmentation, Instruction-Following Dataset, Hour-Scale Video

TL;DR

This work constructs VideoMarathon, the first large-scale hour-level video instruction-following dataset (9,700 hours, 3.3M QA pairs, 22 task types), and proposes Hour-LLaVA, a model that combines a memory repository, a forgetting mechanism, and a MemAug module to enable efficient training and inference on hour-scale videos sampled at 1 FPS. Hour-LLaVA achieves state-of-the-art results among open-source models of comparable scale across four long-video benchmarks.

Background & Motivation

Background: Recent Video-LMMs have achieved notable progress on video QA and video summarization tasks, yet training data predominantly consists of short clips (average under 1 minute). Existing datasets such as LLaVA-Video-178K have an average video duration of only 0.6 minutes.

Limitations of Prior Work: Evaluation benchmarks (e.g., LVBench with an average of 67 minutes, Video-MME with an average of 17 minutes) require models to comprehend hour-scale videos, whereas models are trained exclusively on short clips of a few minutes, resulting in a severe train–test length mismatch. Existing approaches (e.g., uniform sampling of 64 frames) incur substantial information loss when processing long videos.

Key Challenge: The absence of high-quality long-video instruction-following data, combined with existing models' inability to efficiently handle the massive token count of hour-scale videos. GPU memory constraints prevent models from directly consuming all visual tokens produced by 1-FPS sampling over thousands of frames.

Goal: (1) Construct large-scale long-video training data; (2) Design a model architecture that exploits complete video context under limited computational budget.

Key Insight: Drawing inspiration from the human memory system—selectively retaining and recalling critical information while systematically discarding redundancy—the work designs a memory augmentation mechanism that balances token compression with information preservation.

Core Idea: A hierarchical video captioning pipeline generates large-scale long-video QA data; a memory repository caches full video features, which are then compressed via a forgetting mechanism and enriched through cross-attention, enabling learnable token compression.

Method

Overall Architecture

The system comprises two major components: (1) VideoMarathon Dataset—generated via hierarchical video captioning (clip → event → global) combined with DeepSeek-V3 to produce 3.3M QA pairs; (2) Hour-LLaVA Model—a video encoder (SigLIP) extracts features at 1 FPS → projector (2-layer MLP) → forgetting mechanism (spatial + temporal compression to 1/16) → MemAug module (4-layer Transformer with cross- and self-attention recall) → LLM decoder (Qwen2-7B) generates answers. The full video features are stored in a memory repository; the compressed, decayed tokens are enhanced by MemAug before being fed to the LLM.
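A minimal PyTorch-style sketch of this forward pass. The names `encoder`, `projector`, `forget`, `memaug`, and `llm` are placeholders for SigLIP, the 2-layer MLP projector, the forgetting mechanism, the MemAug module, and Qwen2-7B; all shapes and signatures are illustrative assumptions, not the authors' implementation.

```python
import torch

def hour_llava_forward(frames, question_tokens, encoder, projector,
                       forget, memaug, llm):
    """Sketch of the Hour-LLaVA pipeline (shapes/interfaces are assumed).

    frames:          (T, 3, H, W) video frames sampled at 1 FPS
    question_tokens: (Lq, D) embedded question tokens
    """
    # 1. Encode every frame and project into the LLM embedding space.
    H_v = projector(encoder(frames))          # (T, N, D) full video tokens

    # 2. Memory repository: keep the complete 1-FPS token sequence around.
    memory = H_v.flatten(0, 1)                # (T*N, D)

    # 3. Forgetting mechanism: spatial + temporal decay to ~1/16 of the tokens.
    H_v_tilde = forget(H_v)                   # (~T*N/16, D) decayed tokens

    # 4. MemAug: recall discarded details from memory, conditioned on the question.
    H_v_hat = memaug(H_v_tilde, question_tokens, memory)

    # 5. LLM decoder consumes augmented video tokens plus question tokens
    #    (HF-style `inputs_embeds` interface assumed).
    inputs = torch.cat([H_v_hat, question_tokens], dim=0)
    return llm(inputs_embeds=inputs.unsqueeze(0))
```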

Key Designs

  1. VideoMarathon Dataset Construction:

    • 28K long videos (3–60 minutes, totaling ~9,700 hours) are collected from five sources: Panda-70M, Ego4D, ActivityNet, YouCook2, and MovieChat-1K.
    • Hierarchical captioning pipeline: Qwen2VL-7B generates detailed per-clip descriptions across six dimensions (temporality, spatiality, objects, actions, scenes, and summary); DeepSeek-V3 then aggregates these into event-level and global-level descriptions.
    • QA pairs covering 22 task types (6 major themes) are generated from the hierarchical descriptions, comprising 1.73M open-ended QA pairs and 1.57M multiple-choice questions.
    • Design Motivation: Only by training on large-scale long videos can models explicitly learn long-range dependencies. Experiments confirm that LLaVA-Video cannot benefit from VideoMarathon training due to sparse sampling, which prevents learning long-term patterns.
  2. Forgetting Mechanism + MemAug Module:

    • Memory Repository: Full video tokens sampled at 1 FPS are stored in the memory repository, granting the model permanent access to the complete video context.
    • Forgetting Mechanism: Spatially, 3/4 of tokens are randomly discarded (compression ratio 1/4); temporally, 3/4 of frames are uniformly discarded; the overall compression ratio is approximately 1/16, yielding decayed video tokens \(\tilde{\mathbf{H}}_v\).
    • MemAug Module: A 4-layer Transformer block. In cross-attention, the decayed tokens \(\tilde{\mathbf{H}}_v\) and question tokens \(\mathbf{H}_q\) serve as queries while the memory repository \(\mathbf{H}_v\) provides keys and values, retrieving discarded information from the complete context on demand. In self-attention, question information flows into video tokens, endowing them with question-awareness. Formally: \(\hat{\mathbf{H}}_v = f_{\theta_M}(\tilde{\mathbf{H}}_v, \mathbf{H}_q | \mathbf{H}_v)\)
    • Design Motivation: Unlike handcrafted keyframe selection or question-guided compression, MemAug is a learnable compression mechanism supervised end-to-end by the next-token prediction loss (a minimal sketch of the forgetting mechanism and MemAug follows this list).
  3. Three-Stage Training:

    • Stage 1: Image–language pre-training (3B image–text pairs; only MemAug is trained).
    • Stage 2: Video–language adaptation (0.6M mixed data; full-parameter training for 1 epoch).
    • Stage 3: Video instruction fine-tuning (4.4M mixed data, including 0.7M VideoMarathon long-video samples; visual encoder frozen; optimal long-to-short video ratio of 3:1).
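
A minimal PyTorch sketch of the forgetting mechanism and MemAug module from design 2 above. The keep ratios (random 1/4 spatial, uniform 1/4 temporal) and the roles of cross- and self-attention follow the paper's description; class names, hidden size, head count, and layer layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Forget(nn.Module):
    """Forgetting mechanism: keep 1/4 of tokens per frame (random) and
    1/4 of frames (uniform), i.e. ~1/16 of the original tokens overall."""
    def forward(self, H_v):                       # H_v: (T, N, D)
        T, N, D = H_v.shape
        H_v = H_v[::4]                            # temporal: uniformly keep 1/4 of frames
        keep = torch.randperm(N, device=H_v.device)[: N // 4]
        H_v = H_v[:, keep]                        # spatial: randomly keep 1/4 of tokens
        return H_v.reshape(-1, D)                 # decayed tokens, ~T*N/16

class MemAugLayer(nn.Module):
    """One MemAug layer: cross-attention into the memory repository,
    then self-attention that mixes question and video tokens."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, memory):
        # Queries: decayed video tokens + question tokens; keys/values: full memory.
        x = x + self.cross_attn(x, memory, memory)[0]
        x = x + self.self_attn(x, x, x)[0]        # question-awareness via self-attention
        return x + self.ffn(x)

class MemAug(nn.Module):
    """4-layer MemAug module: H_hat_v = f(H_tilde_v, H_q | H_v)."""
    def __init__(self, dim=1024, depth=4):
        super().__init__()
        self.layers = nn.ModuleList([MemAugLayer(dim) for _ in range(depth)])

    def forward(self, H_v_tilde, H_q, H_v):
        x = torch.cat([H_v_tilde, H_q], dim=0).unsqueeze(0)   # (1, Lv + Lq, D)
        memory = H_v.unsqueeze(0)                             # (1, T*N, D)
        for layer in self.layers:
            x = layer(x, memory)
        return x.squeeze(0)[: H_v_tilde.shape[0]]             # augmented video tokens
```

Because the memory repository is only read through cross-attention, the LLM's input length stays at the compressed ~1/16 budget while the full 1-FPS context remains addressable.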

Loss & Training

Standard autoregressive cross-entropy loss. AdamW optimizer with a learning rate of 2e-5 and cosine annealing. Hour-LLaVA-7B is trained on 64 AMD MI300X GPUs.
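
A minimal sketch of this optimization setup (AdamW, learning rate 2e-5, cosine annealing, autoregressive cross-entropy). The model interface and step count are placeholders; the loss is taken from an HF-style `labels` argument purely for illustration.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimization(model, total_steps):
    # AdamW with the reported learning rate of 2e-5.
    optimizer = AdamW(model.parameters(), lr=2e-5)
    # Cosine annealing over the full training schedule.
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler

def training_step(model, batch, optimizer, scheduler):
    # Standard next-token cross-entropy: HF-style models compute it
    # internally when `labels` are provided (assumed interface).
    out = model(input_ids=batch["input_ids"],
                pixel_values=batch["pixel_values"],
                labels=batch["labels"])
    out.loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return out.loss.item()
```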

Key Experimental Results

Main Results

| Method | Params | TempCompass (11s) | LongVideoBench (459s) | Video-MME Long (2466s) | LVBench (4037s) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | – | 70.9 | 66.7 | 65.3 | 48.9 |
| LLaVA-Video-7B | 7B | 60.0 | 58.2 | 56.3 | 33.7 |
| Apollo-7B | 7B | 60.0 | 56.0 | 60.0 | 37.1 |
| Video-XL | 7B | 59.3 | 55.4 | 55.5 | 38.8 |
| Hour-LLaVA-7B | 7B | 63.2 | 60.1 | 62.2 | 40.5 |

Ablation Study

| Configuration | TempCompass | LongVideoBench | LVBench | Notes |
| --- | --- | --- | --- | --- |
| Hour-LLaVA (MemAug) | 59.7 | 54.0 | 40.6 | Full 3B model |
| Uniform compression (no MemAug) | 59.3 | 52.1 | 38.3 | Learnable vs. uniform |
| Keyframe compression | 59.1 | 52.0 | 38.9 | Learnable vs. keyframe |
| Question-guided compression | 56.0 | 51.0 | 37.5 | Worst; premature filtering loses context |
| Memory repository 100% → <10% | – | – | ~35 | Reducing repository size causes an ~5-point drop |

Key Findings

  • MemAug consistently outperforms all handcrafted compression baselines across benchmarks, exceeding the best handcrafted method by approximately 2 points on LVBench.
  • LLaVA-Video cannot benefit from VideoMarathon training even when fine-tuned on it — sparse sampling is the fundamental bottleneck.
  • With spatial forgetting retaining only 1/4 of tokens, MemAug matches the no-compression baseline on image tasks (MMStar: 51.9 vs. 52.8), demonstrating that memory augmentation also generalizes to token compression for Image-LMMs.
  • Hour-LLaVA-3B already surpasses more than half of 7B-scale models and exhibits out-of-distribution generalization on LVBench, where average video length exceeds the training maximum.

Highlights & Insights

  • Filling the Data Gap: VideoMarathon is the first large-scale hour-level video instruction-following dataset. The three-level clip → event → global captioning strategy elegantly addresses the challenge of generating high-quality QA pairs for long videos using large models, and is transferable to any long-video annotation scenario.
  • Learnable vs. Handcrafted Compression: A unified experimental framework compares uniform, keyframe-based, and question-guided compression strategies. Notably, question-guided compression performs worst — premature information filtering discards critical context — underscoring the advantage of end-to-end learning.
  • MemAug's Potential for Image-LMMs: The spatial compression ablation shows that 1/4 tokens + MemAug can match full-token performance, suggesting that memory augmentation could directly accelerate Image-LMM inference.

Limitations & Future Work

  • Training data is entirely model-synthesized (Qwen2VL + DeepSeek-V3), introducing noise and hallucinations; no noise-robust training strategy is designed.
  • Only video and text modalities are handled; audio information is ignored.
  • Evaluation relies predominantly on multiple-choice QA, which may not comprehensively measure long-video understanding capabilities.
  • The longest training videos are 60 minutes; performance on ultra-long videos (e.g., full-length films of 2–3 hours) remains to be validated.
  • The forgetting mechanism is relatively naive; content-importance-aware adaptive forgetting warrants further exploration.

Comparison with Related Work

  • vs. LLaVA-Video: Uniform sampling of 64 frames prevents learning from long-video data; the memory repository + MemAug architecture of Hour-LLaVA is the critical differentiator.
  • vs. LongVU / Video-XL: These methods rely on handcrafted heuristic compression (keyframe selection, KV-cache retrieval); Hour-LLaVA with learnable MemAug consistently outperforms them under the same token budget.
  • vs. LongVILA: LongVILA extends context to 2M tokens but demands substantial computational resources and sequence parallelism; Hour-LLaVA achieves greater efficiency through compression and recovery.

Rating

  • Novelty: ⭐⭐⭐⭐ The dataset construction pipeline is comprehensive and practical; the MemAug design is well-motivated though of moderate novelty.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four benchmarks with detailed ablations covering forgetting strategies, data mixing, memory repository size, and compression method comparisons.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, detailed method descriptions, and rich figures and tables.
  • Value: ⭐⭐⭐⭐⭐ VideoMarathon fills a substantial gap in long-video training data, and Hour-LLaVA demonstrates the necessity of hour-scale video training.