VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video SpatioTemporal Augmentation
Conference: CVPR 2025
arXiv: 2412.00927
Code: Project Page
Area: Video Understanding
Keywords: Long Video Understanding, High-Resolution Video, Data Augmentation, Needle-in-a-Haystack, Video LMM
TL;DR
VISTA synthesizes long-duration and high-resolution video instruction data by combining existing video-caption data spatially and temporally, using 7 augmentation methods. Fine-tuning on the resulting VISTA-400K dataset yields an average improvement of 3.3% on long video understanding benchmarks. The paper also introduces HRVideoBench, the first high-resolution video understanding benchmark, on which fine-tuning yields a 6.5% improvement.
Background & Motivation
- Challenges of Long Videos and High Resolutions: Current open-source video LMMs are primarily optimized for short, low-resolution videos. Handling long-sequence video inputs (long-duration or high-resolution) remains a major challenge.
- Scarcity of High-Quality Data: Existing video instruction datasets face limitations such as short durations (VideoChat2 focuses on short videos), low sampling rates (ShareGPT4Video is only 0.15fps, making the content nearly static), and low resolution (FineVideo is dominated by 360p).
- Opaqueness of Closed-source Solutions: Models like Kangaroo and Qwen2-VL claim to use long-video training data but keep the data details private, which prevents the community from understanding what kind of data truly aids long-context video understanding.
- Inspiration from Data Augmentation: CutMix, MixUp, and VideoMix in image/video classification have proven that synthetic data can train more robust classifiers. This paper extends this idea to instruction tuning for video LMMs.
- Gap in High-Resolution Benchmarks: Previously, there was no comprehensive benchmark specifically designed to evaluate video LMMs' understanding capabilities on high-resolution videos.
Method
Overall Architecture
The VISTA framework: Given a candidate video set \(\mathbf{V}\) and its captions \(\mathbf{C}\), augmented videos \(V^* = \Phi(\mathbf{V})\) are generated via a video augmentation operator \(\Phi\), and QA pairs \((q,a) = \Theta(\mathbf{C})\) are produced using Gemini-1.5-Pro as the QA generator \(\Theta\). It contains 7 augmentation methods, producing the VISTA-400K dataset (approximately 400K entries).
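The composition of the two operators can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: `synthesize`, `concat_in_time`, and `stub_qa` are hypothetical names, frames are represented by placeholder strings, and the QA generator is a stub standing in for the Gemini-1.5-Pro call. The point it shows is that the QA generator only receives captions, never pixels.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = str          # placeholder frame type; real frames would be image arrays
Video = List[Frame]


@dataclass
class Sample:
    video: Video
    question: str
    answer: str


def synthesize(
    videos: List[Video],
    captions: List[str],
    augment: Callable[[List[Video]], Video],               # plays the role of Phi
    generate_qa: Callable[[List[str]], Tuple[str, str]],   # plays the role of Theta
) -> Sample:
    """Build one instruction sample: V* = Phi(V), (q, a) = Theta(C)."""
    augmented = augment(videos)
    question, answer = generate_qa(captions)  # text only; no video is passed
    return Sample(augmented, question, answer)


# Hypothetical stand-ins, for illustration only.
def concat_in_time(videos: List[Video]) -> Video:
    return [frame for video in videos for frame in video]


def stub_qa(captions: List[str]) -> Tuple[str, str]:
    return "What happens in the video, in order?", " Then ".join(captions)


clips = [["a0", "a1"], ["b0", "b1", "b2"]]
sample = synthesize(clips, ["a dog runs", "a cat sleeps"], concat_in_time, stub_qa)
```

Keeping the two operators as independent callables mirrors the description above: any of the 7 augmentations could be swapped in for `augment` without changing how the QA pairs are produced.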
Key Designs
Design 1: Temporal-Domain Augmentation (Long Video Captioning & Event Relationship QA)
- Function: Synthesizes long videos by temporally stitching short clips together to generate instruction data for summarization and event-ordering understanding.
- Core Idea: Extracts multiple short clips (interval \(\le 5\) seconds) from the same source video and stitches them into a long video. Gemini is used to generate a long video caption and event-relationship QAs based on individual clip descriptions, including free-form and multiple-choice questions.
- Design Motivation: Extends video duration while preserving natural scene transitions; event-order understanding is a core capability of long video understanding.
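A minimal sketch of the clip-selection step follows. It rests on one reading of "interval \(\le 5\) seconds", namely that the gap between consecutive clips in the source video is at most 5 seconds; the summary does not confirm this reading. The helper names `chain_clips` and `stitched_duration` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec) of a clip in the source video


def chain_clips(spans: List[Span], max_gap: float = 5.0) -> List[Span]:
    """Keep the longest prefix of time-ordered clips whose gaps stay within max_gap."""
    if not spans:
        return []
    ordered = sorted(spans)
    chain = [ordered[0]]
    for start, end in ordered[1:]:
        if start - chain[-1][1] > max_gap:
            break  # gap too large (assumed criterion): stop extending the chain
        chain.append((start, end))
    return chain


def stitched_duration(chain: List[Span]) -> float:
    """Duration of the stitched video: clip lengths only, the gaps are dropped."""
    return sum(end - start for start, end in chain)


spans = [(0.0, 8.0), (10.0, 19.0), (22.0, 30.0), (50.0, 58.0)]
chain = chain_clips(spans)
duration = stitched_duration(chain)
```

In this example the first three clips are chained (gaps of 2 s and 3 s) and the fourth is excluded because it starts 20 s after the third ends.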
Design 2: Spatiotemporal Needle-in-a-Haystack (NIAH) QA
- Function: Trains the model to precisely retrieve key information from the massive number of tokens in long and high-resolution videos.
- Core Idea: Four variants: (1) Temporal NIAH: A short clip is randomly inserted into the middle of a long video; (2) Two Needle NIAH: A short clip is split into two and inserted at different time points of a long video; (3) Spatial NIAH: A low-resolution video is superimposed onto a random location of a high-resolution video; (4) Spatiotemporal NIAH: The needle is inserted in both the temporal and spatial dimensions at once. Distractor choices for multiple-choice questions are generated from the haystack descriptions, so that a model that fails to locate the needle is more likely to choose incorrectly.
- Design Motivation: NIAH is a standard paradigm for evaluating the long-context retrieval capabilities of LLMs/LMMs; the four variants cover different retrieval dimensions across time, space, and spatiotemporal domains.
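The temporal and spatial variants can be illustrated with NumPy arrays. This is a sketch under stated assumptions, not the paper's implementation: videos are arrays of shape (frames, height, width, channels), the function names are hypothetical, and the spatial needle is assumed to have no more frames than the haystack.

```python
import numpy as np


def temporal_niah(haystack: np.ndarray, needle: np.ndarray, rng: np.random.Generator):
    """Insert the needle frames at a random interior position of the haystack.

    Both arrays are (frames, height, width, channels) with matching frame size.
    Returns the new video and the insertion index.
    """
    pos = int(rng.integers(1, haystack.shape[0]))  # 1 .. T-1, never at either end
    video = np.concatenate([haystack[:pos], needle, haystack[pos:]], axis=0)
    return video, pos


def spatial_niah(haystack: np.ndarray, needle: np.ndarray, rng: np.random.Generator):
    """Paste a smaller needle video at a random location of the haystack frames.

    Only the first needle.shape[0] frames of the haystack are overwritten.
    Returns the new video and the (top, left) corner of the pasted region.
    """
    t, h, w = needle.shape[:3]
    top = int(rng.integers(0, haystack.shape[1] - h + 1))
    left = int(rng.integers(0, haystack.shape[2] - w + 1))
    video = haystack.copy()
    video[:t, top:top + h, left:left + w] = needle
    return video, (top, left)


rng = np.random.default_rng(0)
hay = np.zeros((20, 72, 128, 3), dtype=np.uint8)
full_needle = np.full((5, 72, 128, 3), 255, dtype=np.uint8)
small_needle = np.full((5, 18, 32, 3), 255, dtype=np.uint8)

t_video, pos = temporal_niah(hay, full_needle, rng)
s_video, (top, left) = spatial_niah(hay, small_needle, rng)
```

Applying `spatial_niah` to build a composite clip and then passing that clip to `temporal_niah` as the needle would give one possible form of the spatiotemporal variant.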
Design 3: High-Resolution Video Grid QA
- Function: Enhances the model's capability to understand local regions within high-resolution videos.
- Core Idea: Randomly samples 64 low-resolution videos and arranges them into an \(8 \times 8\) grid (each \(240 \times 135\)), synthesizing them into a \(1920 \times 1080\) video. A specific cell \((i,j)\) is randomly selected to generate QA pairs regarding its content. Distractors are generated from other cells.
- Design Motivation: Simulates scenarios requiring focus on local details in high-resolution videos, training the model to locate and understand small-region contents based on spatial indexing.
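The grid layout reduces to a reshape and transpose when all 64 videos share the same frame count and size. The sketch below assumes exactly that (equal-length clips, already resized to \(240 \times 135\)); `make_grid` and `crop_cell` are hypothetical helper names, and each toy video is filled with its own index so the cell mapping is easy to check.

```python
import numpy as np


def make_grid(videos: np.ndarray, rows: int = 8, cols: int = 8) -> np.ndarray:
    """Tile rows*cols equally sized videos into one video, row-major.

    videos: (rows*cols, frames, h, w, channels)
    returns: (frames, rows*h, cols*w, channels)
    """
    n, t, h, w, c = videos.shape
    if n != rows * cols:
        raise ValueError("expected rows * cols videos")
    grid = videos.reshape(rows, cols, t, h, w, c)
    grid = grid.transpose(2, 0, 3, 1, 4, 5)  # -> (t, rows, h, cols, w, c)
    return grid.reshape(t, rows * h, cols * w, c)


def crop_cell(grid_video: np.ndarray, i: int, j: int, h: int, w: int) -> np.ndarray:
    """Return the sub-video shown in cell (i, j), with 0-based row i and column j."""
    return grid_video[:, i * h:(i + 1) * h, j * w:(j + 1) * w]


# 64 toy videos of 2 frames each, 135 pixels high and 240 wide; video k is filled with k.
videos = np.stack([np.full((2, 135, 240, 3), k, dtype=np.uint8) for k in range(64)])
grid = make_grid(videos)
cell = crop_cell(grid, 3, 5, 135, 240)  # row 3, column 5 -> video index 3 * 8 + 5
```

Here eight columns of 240 pixels give a width of 1920 and eight rows of 135 pixels give a height of 1080, matching the resolution stated above.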
Loss & Training
The standard video LMM instruction-tuning loss (cross-entropy/next-token prediction) is employed to fine-tune the baseline model on VISTA-400K.
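The summary does not write the objective out. In its usual form, and with notation assumed here rather than taken from the paper (\(a_t\) is the \(t\)-th answer token, \(T\) the answer length, and \(p_{\mathrm{LMM}}\) the model's predictive distribution), the per-sample next-token loss would read:

\[
\mathcal{L} = -\sum_{t=1}^{T} \log p_{\mathrm{LMM}}\left(a_t \mid V^*, q, a_{<t}\right)
\]

Only the answer tokens contribute to the sum under this common convention; the augmented video \(V^*\) and the question \(q\) serve as conditioning context.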
Key Experimental Results
VISTA-400K Dataset Statistics
| Subset | Type | Avg. Duration | Avg. Resolution | Data Size |
|---|---|---|---|---|
| Long Video Captioning | Caption | 33.2s | 1277×720 | 58.6K |
| Event Relationship QA | QA | 33.4s | 1278×720 | 56.9K |
| Temporal NIAH | QA | 67.6s | 640×358 | 59.8K |
| Two Needle NIAH | QA | 112.4s | 591×382 | 52.3K |
| Spatial NIAH | QA | 9.9s | 1726×971 | 60.0K |
| Spatiotemporal NIAH | QA | 89.9s | 591×383 | 56.5K |
| HR Video Grid QA | QA | 3s | 1920×1080 | 59.9K |
| VISTA-400K | - | 48.6s | 1160×666 | 403.9K |
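As a consistency check on the aggregate row, the short script below recomputes the total size and the average duration from the per-subset rows. It assumes the aggregate duration is a size-weighted mean, which the table does not state. The summed sizes come to about 404.0K against the listed 403.9K, a difference consistent with rounding of the per-subset sizes.

```python
# (average duration in seconds, size in thousands) per subset, copied from the table
subsets = {
    "Long Video Captioning": (33.2, 58.6),
    "Event Relationship QA": (33.4, 56.9),
    "Temporal NIAH": (67.6, 59.8),
    "Two Needle NIAH": (112.4, 52.3),
    "Spatial NIAH": (9.9, 60.0),
    "Spatiotemporal NIAH": (89.9, 56.5),
    "HR Video Grid QA": (3.0, 59.9),
}

total_k = sum(size for _, size in subsets.values())
weighted_avg = sum(dur * size for dur, size in subsets.values()) / total_k

print(f"total size: {total_k:.1f}K")
print(f"size-weighted average duration: {weighted_avg:.1f}s")
```

The weighted mean comes out at roughly 48.6 s, in agreement with the listed average duration.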
Fine-Tuning Performance Improvement
| Metric | Avg. Gain on Long Video Benchmarks | Gain on HRVideoBench |
|---|---|---|
| VISTA Fine-Tuning | +3.3% | +6.5% |
Key Findings
- Achieves an average improvement of 3.3% on four long-video benchmarks: Video-MME, MLVU, LVBench, and LongVideoBench.
- Scores a 6.5% boost on the newly introduced HRVideoBench, validating the effectiveness of Spatial NIAH and Grid QA.
- Ablation studies indicate that performance drops significantly when video augmentation is omitted, confirming that the synthetic videos themselves are key.
- QA synthesis requires only text processing (via the Gemini API) and no multimodal functionality, which keeps the cost substantially lower than that of methods that send video to a multimodal model.
Highlights & Insights
- Data-centric Perspective: Significant improvements in long/high-resolution video understanding are achieved using only high-quality synthetic data, without modifying the model architecture.
- Innovation in NIAH Training Data: Translates the NIAH paradigm, commonly used in LLM evaluation, into a training data generation method.
- Fully Open-Sourced & Reproducible: Data sources are all public datasets, and the synthesis pipeline is scalable.
- HRVideoBench bridges the evaluation gap in high-resolution video understanding.
- Significant Cost-efficiency Advantage: Does not rely on Gemini's multimodal capabilities, operating purely on text-processing APIs.
Limitations & Future Work
- Stitching or superimposing synthetic videos may introduce unnatural visual artifacts, which might lead the model to learn spurious patterns like "stitching boundaries".
- QAs generated by the NIAH method can be relatively simple and do not fully cover question types requiring deep reasoning.
- Current augmentations are mainly based on simple geometric combinations (stitching, superposition, grids), lacking semantic-level video synthesis.
- HRVideoBench contains only 200 questions, so its scale and semantic diversity leave room for expansion.
- Reliance on Gemini-1.5-Pro for generating QA pairs introduces a dependency on closed-source models.
Related Work & Insights
- vs. ShareGPT4Video: While ShareGPT4Video collects 40K high-quality captions, its videos are near-static at 0.15fps. VISTA yields videos with true temporal dynamics and spatial diversity via spatiotemporal augmentations.
- vs. Kangaroo / Qwen2-VL: These models claim to utilize long-video training data but keep details proprietary. VISTA is fully open-sourced and reproducible, enabling the community to study which data actually works.
- VISTA's video augmentation concepts can be extended to data synthesis in other fields such as 3D scene understanding and audio-visual multimodality.
- The NIAH training paradigm serves as a valuable reference for all LMMs requiring long-context retrieval capabilities.
- The Grid QA method operates similarly to visual grounding training strategies and can be migrated to high-resolution image understanding.
Rating
⭐⭐⭐⭐ Addresses the data bottleneck of long and high-resolution video understanding with a simple and practical data augmentation framework. The design of the NIAH training data is particularly creative. Both VISTA-400K and HRVideoBench are valuable contributions to the community.