
Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Conference: NeurIPS 2025 | arXiv: 2506.05332 | Code: Project Page | Area: Video Understanding / Multimodal VLM | Keywords: Long Video Understanding, Video-LMM, Memory Augmentation, Instruction-Following Dataset, Hour-Scale Video

TL;DR

This work constructs VideoMarathon, the first large-scale hour-level video instruction-following dataset (9,700 hours, 3.3M QA pairs, 22 task types), and proposes Hour-LLaVA, a model that combines a memory repository, a forgetting mechanism, and a MemAug module to enable efficient training and inference on hour-scale videos sampled at 1 FPS. Hour-LLaVA achieves state-of-the-art results among open-source models of comparable scale across four long-video benchmarks.

Background & Motivation

Background: Recent Video-LMMs have achieved notable progress on video QA and video summarization tasks, yet training data predominantly consists of short clips (average under 1 minute). Existing datasets such as LLaVA-Video-178K have an average video duration of only 0.6 minutes.

Limitations of Prior Work: Evaluation benchmarks (e.g., LVBench with an average of 67 minutes, Video-MME with an average of 17 minutes) require models to comprehend hour-scale videos, whereas models are trained exclusively on short clips of a few minutes, resulting in a severe train–test length mismatch. Existing approaches (e.g., uniform sampling of 64 frames) incur substantial information loss when processing long videos.

Key Challenge: The absence of high-quality long-video instruction-following data, combined with existing models' inability to efficiently handle the massive token count of hour-scale videos. GPU memory constraints prevent models from directly consuming all visual tokens produced by 1-FPS sampling over thousands of frames.

Goal: (1) Construct large-scale long-video training data; (2) Design a model architecture that exploits complete video context under limited computational budget.

Key Insight: Drawing inspiration from the human memory system—selectively retaining and recalling critical information while systematically discarding redundancy—the work designs a memory augmentation mechanism that balances token compression with information preservation.

Core Idea: A hierarchical video captioning pipeline generates large-scale long-video QA data; a memory repository caches full video features, which are then compressed via a forgetting mechanism and enriched through cross-attention, enabling learnable token compression.

Method

Overall Architecture

The system comprises two major components: (1) VideoMarathon Dataset—generated via hierarchical video captioning (clip → event → global) combined with DeepSeek-V3 to produce 3.3M QA pairs; (2) Hour-LLaVA Model—a video encoder (SigLIP) extracts features at 1 FPS → projector (2-layer MLP) → forgetting mechanism (spatial + temporal compression to 1/16) → MemAug module (4-layer Transformer with cross- and self-attention recall) → LLM decoder (Qwen2-7B) generates answers. The full video features are stored in a memory repository; the compressed, decayed tokens are enhanced by MemAug before being fed to the LLM.
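A minimal PyTorch-style sketch of this forward pass. The names `encoder`, `projector`, `forget`, `memaug`, and `llm` are placeholders for SigLIP, the 2-layer MLP projector, the forgetting mechanism, the MemAug module, and Qwen2-7B; all shapes and signatures are illustrative assumptions, not the authors' implementation.

```python
import torch

def hour_llava_forward(frames, question_tokens, encoder, projector,
                       forget, memaug, llm):
    """Sketch of the Hour-LLaVA pipeline (shapes/interfaces are assumed).

    frames:          (T, 3, H, W) video frames sampled at 1 FPS
    question_tokens: (Lq, D) embedded question tokens
    """
    # 1. Encode every frame and project into the LLM embedding space.
    H_v = projector(encoder(frames))          # (T, N, D) full video tokens

    # 2. Memory repository: keep the complete 1-FPS token sequence around.
    memory = H_v.flatten(0, 1)                # (T*N, D)

    # 3. Forgetting mechanism: spatial + temporal decay to ~1/16 of the tokens.
    H_v_tilde = forget(H_v)                   # (~T*N/16, D) decayed tokens

    # 4. MemAug: recall discarded details from memory, conditioned on the question.
    H_v_hat = memaug(H_v_tilde, question_tokens, memory)

    # 5. LLM decoder consumes augmented video tokens plus question tokens
    #    (HF-style `inputs_embeds` interface assumed).
    inputs = torch.cat([H_v_hat, question_tokens], dim=0)
    return llm(inputs_embeds=inputs.unsqueeze(0))
```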

Key Designs

  1. VideoMarathon Dataset Construction:

    • 28K long videos (3–60 minutes, totaling ~9,700 hours) are collected from five sources: Panda-70M, Ego4D, ActivityNet, YouCook2, and MovieChat-1K.
    • Hierarchical captioning pipeline: Qwen2VL-7B generates detailed per-clip descriptions across six dimensions (temporality, spatiality, objects, actions, scenes, and summary); DeepSeek-V3 then aggregates these into event-level and global-level descriptions.
    • QA pairs covering 22 task types (6 major themes) are generated from the hierarchical descriptions, comprising 1.73M open-ended QA pairs and 1.57M multiple-choice questions.
    • Design Motivation: Only by training on large-scale long videos can models explicitly learn long-range dependencies. Experiments confirm that LLaVA-Video cannot benefit from VideoMarathon training due to sparse sampling, which prevents learning long-term patterns.
  2. Forgetting Mechanism + MemAug Module:

    • Memory Repository: Full video tokens sampled at 1 FPS are stored in the memory repository, granting the model permanent access to the complete video context.
    • Forgetting Mechanism: Spatially, 3/4 of tokens are randomly discarded (compression ratio 1/4); temporally, 3/4 of frames are uniformly discarded; the overall compression ratio is approximately 1/16, yielding decayed video tokens \(\tilde{\mathbf{H}}_v\).
    • MemAug Module: A 4-layer Transformer block. In cross-attention, the decayed tokens \(\tilde{\mathbf{H}}_v\) and question tokens \(\mathbf{H}_q\) serve as queries while the memory repository \(\mathbf{H}_v\) provides keys and values, retrieving discarded information from the complete context on demand. In self-attention, question information flows into video tokens, endowing them with question-awareness. Formally: \(\hat{\mathbf{H}}_v = f_{\theta_M}(\tilde{\mathbf{H}}_v, \mathbf{H}_q | \mathbf{H}_v)\)
    • Design Motivation: Unlike handcrafted keyframe selection or question-guided compression, MemAug is a learnable compression mechanism supervised end-to-end by the next-token prediction loss (a minimal sketch of the forgetting mechanism and MemAug follows this list).
  3. Three-Stage Training:

    • Stage 1: Image–language pre-training (3B image–text pairs; only MemAug is trained).
    • Stage 2: Video–language adaptation (0.6M mixed data; full-parameter training for 1 epoch).
    • Stage 3: Video instruction fine-tuning (4.4M mixed data, including 0.7M VideoMarathon long-video samples; visual encoder frozen; optimal long-to-short video ratio of 3:1).
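
A minimal PyTorch sketch of the forgetting mechanism and MemAug module from design 2 above. The keep ratios (random 1/4 spatial, uniform 1/4 temporal) and the roles of cross- and self-attention follow the paper's description; class names, hidden size, head count, and layer layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Forget(nn.Module):
    """Forgetting mechanism: keep 1/4 of tokens per frame (random) and
    1/4 of frames (uniform), i.e. ~1/16 of the original tokens overall."""
    def forward(self, H_v):                       # H_v: (T, N, D)
        T, N, D = H_v.shape
        H_v = H_v[::4]                            # temporal: uniformly keep 1/4 of frames
        keep = torch.randperm(N, device=H_v.device)[: N // 4]
        H_v = H_v[:, keep]                        # spatial: randomly keep 1/4 of tokens
        return H_v.reshape(-1, D)                 # decayed tokens, ~T*N/16

class MemAugLayer(nn.Module):
    """One MemAug layer: cross-attention into the memory repository,
    then self-attention that mixes question and video tokens."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, memory):
        # Queries: decayed video tokens + question tokens; keys/values: full memory.
        x = x + self.cross_attn(x, memory, memory)[0]
        x = x + self.self_attn(x, x, x)[0]        # question-awareness via self-attention
        return x + self.ffn(x)

class MemAug(nn.Module):
    """4-layer MemAug module: H_hat_v = f(H_tilde_v, H_q | H_v)."""
    def __init__(self, dim=1024, depth=4):
        super().__init__()
        self.layers = nn.ModuleList([MemAugLayer(dim) for _ in range(depth)])

    def forward(self, H_v_tilde, H_q, H_v):
        x = torch.cat([H_v_tilde, H_q], dim=0).unsqueeze(0)   # (1, Lv + Lq, D)
        memory = H_v.unsqueeze(0)                             # (1, T*N, D)
        for layer in self.layers:
            x = layer(x, memory)
        return x.squeeze(0)[: H_v_tilde.shape[0]]             # augmented video tokens
```

Because the memory repository is only read through cross-attention, the LLM's input length stays at the compressed ~1/16 budget while the full 1-FPS context remains addressable.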

Loss & Training

Standard autoregressive cross-entropy loss. AdamW optimizer with a learning rate of 2e-5 and cosine annealing. Hour-LLaVA-7B is trained on 64 AMD MI300X GPUs.
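
A minimal sketch of this optimization setup (AdamW, learning rate 2e-5, cosine annealing, autoregressive cross-entropy). The model interface and step count are placeholders; the loss is taken from an HF-style `labels` argument purely for illustration.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimization(model, total_steps):
    # AdamW with the reported learning rate of 2e-5.
    optimizer = AdamW(model.parameters(), lr=2e-5)
    # Cosine annealing over the full training schedule.
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler

def training_step(model, batch, optimizer, scheduler):
    # Standard next-token cross-entropy: HF-style models compute it
    # internally when `labels` are provided (assumed interface).
    out = model(input_ids=batch["input_ids"],
                pixel_values=batch["pixel_values"],
                labels=batch["labels"])
    out.loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return out.loss.item()
```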

Key Experimental Results

Main Results

| Method | Params | TempCompass (11s) | LongVideoBench (459s) | Video-MME Long (2466s) | LVBench (4037s) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | – | 70.9 | 66.7 | 65.3 | 48.9 |
| LLaVA-Video-7B | 7B | 60.0 | 58.2 | 56.3 | 33.7 |
| Apollo-7B | 7B | 60.0 | 56.0 | 60.0 | 37.1 |
| Video-XL | 7B | 59.3 | 55.4 | 55.5 | 38.8 |
| Hour-LLaVA-7B | 7B | 63.2 | 60.1 | 62.2 | 40.5 |

Ablation Study

| Configuration | TempCompass | LongVideoBench | LVBench | Notes |
| --- | --- | --- | --- | --- |
| Hour-LLaVA (MemAug) | 59.7 | 54.0 | 40.6 | Full 3B model |
| Uniform compression (no MemAug) | 59.3 | 52.1 | 38.3 | Learnable vs. uniform |
| Keyframe compression | 59.1 | 52.0 | 38.9 | Learnable vs. keyframe |
| Question-guided compression | 56.0 | 51.0 | 37.5 | Worst; premature filtering loses context |
| Memory repository 100% → <10% | – | – | ~35 | Reducing repository size causes an ~5-point drop |

Key Findings

  • MemAug consistently outperforms all handcrafted compression baselines across benchmarks, exceeding the best handcrafted method by approximately 2 points on LVBench.
  • LLaVA-Video cannot benefit from VideoMarathon training even when fine-tuned on it — sparse sampling is the fundamental bottleneck.
  • With spatial forgetting retaining only 1/4 of tokens, MemAug matches the no-compression baseline on image tasks (MMStar: 51.9 vs. 52.8), demonstrating that memory augmentation also generalizes to token compression for Image-LMMs.
  • Hour-LLaVA-3B already surpasses more than half of 7B-scale models and exhibits out-of-distribution generalization on LVBench, where average video length exceeds the training maximum.

Highlights & Insights

  • Filling the Data Gap: VideoMarathon is the first large-scale hour-level video instruction-following dataset. The three-level clip → event → global captioning strategy elegantly addresses the challenge of generating high-quality QA pairs for long videos using large models, and is transferable to any long-video annotation scenario.
  • Learnable vs. Handcrafted Compression: A unified experimental framework compares uniform, keyframe-based, and question-guided compression strategies. Notably, question-guided compression performs worst — premature information filtering discards critical context — underscoring the advantage of end-to-end learning.
  • MemAug's Potential for Image-LMMs: The spatial compression ablation shows that 1/4 tokens + MemAug can match full-token performance, suggesting that memory augmentation could directly accelerate Image-LMM inference.

Limitations & Future Work

  • Training data is entirely model-synthesized (Qwen2VL + DeepSeek-V3), introducing noise and hallucinations; no noise-robust training strategy is designed.
  • Only video and text modalities are handled; audio information is ignored.
  • Evaluation relies predominantly on multiple-choice QA, which may not comprehensively measure long-video understanding capabilities.
  • The longest training videos are 60 minutes; performance on ultra-long videos (e.g., full-length films of 2–3 hours) remains to be validated.
  • The forgetting mechanism is relatively naive; content-importance-aware adaptive forgetting warrants further exploration.

Comparison with Related Work

  • vs. LLaVA-Video: Uniform sampling of 64 frames prevents learning from long-video data; the memory repository + MemAug architecture of Hour-LLaVA is the critical differentiator.
  • vs. LongVU / Video-XL: These methods rely on handcrafted heuristic compression (keyframe selection, KV-cache retrieval); Hour-LLaVA with learnable MemAug consistently outperforms them under the same token budget.
  • vs. LongVILA: LongVILA extends context to 2M tokens but demands substantial computational resources and sequence parallelism; Hour-LLaVA achieves greater efficiency through compression and recovery.

Rating

  • Novelty: ⭐⭐⭐⭐ The dataset construction pipeline is comprehensive and practical; the MemAug design is well-motivated though of moderate novelty.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four benchmarks with detailed ablations covering forgetting strategies, data mixing, memory repository size, and compression method comparisons.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, detailed method descriptions, and rich figures and tables.
  • Value: ⭐⭐⭐⭐⭐ VideoMarathon fills a substantial gap in long-video training data, and Hour-LLaVA demonstrates the necessity of hour-scale video training.