VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Conference: CVPR 2025
arXiv: 2501.00599
Code: Available (Project Page + Code links)
Area: Video Understanding
Keywords: Video Region Understanding, Object-level Instruction Data, Spatial-Temporal Object Encoder, Video LLMs, Fine-grained Understanding
TL;DR
VideoRefer Suite combines a dataset (700K object-level video instruction samples), a model (built around a spatial-temporal object encoder that enables pixel-level region understanding), and a benchmark (multi-dimensional evaluation) so that Video LLMs can perceive, reason about, and retrieve any object in a video at any timestamp.
Background & Motivation
Background: Video LLMs (e.g., VideoLLaMA2) perform exceptionally well in overall video understanding, but they primarily focus on scene-level understanding and fail to precisely target user-specified objects. Although region understanding methods in the image domain (such as GPT4RoI, Ferret, Osprey) are relatively mature, research on object-level understanding in the video domain remains limited.
Limitations of Prior Work: (1) Existing video region understanding methods (such as Artemis) only support single-object, coarse bounding-box-level features, so they fall short on multi-object relationships and complex reasoning; (2) Directly converting bounding box coordinates into text prompts (e.g., VTimeLLM) leads to imprecise region understanding; (3) High-quality object-level video instruction data and comprehensive evaluation benchmarks are lacking.
Key Challenge: Fine-grained understanding in videos requires both precise spatial localization (pixel-level masks) and rich temporal context (cross-frame tracking). Existing architectures and data are insufficient to support this requirement.
Goal: Build VideoRefer Suite, a complete solution covering data, model, and benchmark, so that Video LLMs can perform fine-grained video understanding for any object at any time.
Key Insight: Employ a multi-agent data engine to automatically construct high-quality object-level annotations, design a unified spatial-temporal object encoder to support a mixture of single-frame and multi-frame inputs, and establish a comprehensive benchmark containing both description generation and multiple-choice questions.
Core Idea: Use pixel-level masks as a unified region representation, extract object representations through a spatial token extractor (Mask Pooling), adaptively aggregate cross-frame information with a temporal token merging module, and interleave object-level tokens with scene-level tokens as input to the LLM.
Method
Overall Architecture
VideoRefer is built upon VideoLLaMA2.1. The input video is processed by a shared visual encoder to extract frame-level feature maps \(\mathbf{F}_I\), and users specify the target objects via masks. A spatial-temporal object encoder (REnc) takes the target masks and feature maps and produces object-level tokens \(\mathcal{T}_R\). Scene-level tokens \(\mathcal{T}_Z\), object-level tokens \(\mathcal{T}_R\), and text tokens \(\mathcal{T}_x\) are then interleaved and fed to the LLM for fine-grained video object understanding. The model supports both single-frame and multi-frame modes.
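The interleaving step can be illustrated with a small symbolic sketch that only tracks token order, not embeddings. The `<region>` placeholder name, the function name, and the choice to place scene tokens before the prompt are assumptions made for illustration; the source does not specify them.

```python
REGION_TAG = "<region>"  # hypothetical placeholder; the actual tag may differ


def build_llm_sequence(num_scene_tokens, object_token_counts, prompt):
    """Return the order in which tokens would be fed to the LLM.

    num_scene_tokens:    number of scene-level tokens (T_Z)
    object_token_counts: tokens per referred object (T_R); 1 in single-frame
                         mode, u in multi-frame mode
    prompt:              text with one REGION_TAG per referred object
    """
    segments = prompt.split(REGION_TAG)
    if len(segments) - 1 != len(object_token_counts):
        raise ValueError("prompt must contain one region tag per object")

    # Assumption: scene tokens come first, followed by the prompt.
    sequence = [f"scene[{i}]" for i in range(num_scene_tokens)]
    for obj_idx, segment in enumerate(segments):
        # Text tokens (T_x); whitespace splitting stands in for a tokenizer.
        sequence += [f"text:{word}" for word in segment.split()]
        # Splice the object's tokens in where its region tag appeared.
        if obj_idx < len(object_token_counts):
            count = object_token_counts[obj_idx]
            sequence += [f"obj{obj_idx}[{j}]" for j in range(count)]
    return sequence


layout = build_llm_sequence(2, [1, 2], "Is <region> behind <region> ?")
print(layout)
```

Here the first object contributes one token and the second contributes two, so a single prompt can mix objects referred to in single-frame and multi-frame mode.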
Key Designs
- Multi-Agent Data Engine (VideoRefer-700K):
  - Function: Automatically construct large-scale, high-quality object-level video instruction data
  - Mechanism: Five agents work in a pipeline: (1) the Analyzer (Qwen2-7B) extracts nouns from raw captions; (2) the Annotator (InternVL2-26B) is queried twice, once for a dynamic action description and once for a static appearance description; (3) the Segmentor (Grounding-DINO + HQ-SAM + SAM2) generates pixel-level masks; (4) the Reviewer (Qwen2-7B) uses Osprey region descriptions to verify that each mask matches its description, keeping only the 40% of samples that pass validation; (5) the Refiner (GPT-4o) summarizes and refines the final descriptions
  - Design Motivation: Combining expert models with different specialties lets the automated pipeline reach the required scale (700K), while strict filtering by the Reviewer and refinement by GPT-4o keep the data quality high
- Spatial-Temporal Object Encoder (Spatial Token Extractor + Temporal Token Merge):
  - Function: Extract precise object-level representations from video frames
  - Mechanism: Spatially, the 2D binary mask is resized to the feature map resolution, the features inside the masked region are aggregated via Mask Pooling, and the result is passed through an MLP to obtain the object token \(\mathbf{O} \in \mathbb{R}^{1 \times C}\). Temporally, for multi-frame object tokens \(\mathbf{O} \in \mathbb{R}^{k \times C}\), the cosine similarities \(\mathbf{S}_{m,m+1}\) between adjacent frames are computed. The \(k-u\) adjacent pairs with the highest similarity are merged by average pooling, leaving \(u\) representative tokens
  - Design Motivation: Mask Pooling is more precise than RoI Align (pixel-level vs. box-level). Temporal merging removes redundancy by consolidating the representations of similar frames while preserving key temporal variations
- VideoRefer-Bench Comprehensive Benchmark:
  - Function: Comprehensively evaluate the regional video understanding capability of Video LLMs
  - Mechanism: Two sub-benchmarks: Bench\(^D\) (description generation, 400 samples, scored by GPT-4o from 0 to 5 on 4 dimensions: main-body consistency, appearance, dynamics, and hallucination) and Bench\(^Q\) (multiple-choice questions, 1000 tasks, covering 5 types: foundational, sequential, relational, reasoning, and predictive). Every QA pair is tied to a specified video region, which prevents models from answering without viewing the video
  - Design Motivation: Existing benchmarks either only feature description tasks or lack region-level understanding requirements. VideoRefer-Bench covers multiple dimensions to ensure comprehensive assessment
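The spatial-temporal object encoder (the second design above) can be sketched in plain Python on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the mask is resized with nearest-neighbour sampling, the MLP projection is omitted, and the selected adjacent pairs are merged by averaging each run of linked frames.

```python
import math


def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a binary 2D mask (assumed resize method)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[(r * in_h) // out_h][(c * in_w) // out_w]
             for c in range(out_w)] for r in range(out_h)]


def mask_pool(feature_map, mask):
    """Average the C-dim features at positions covered by the resized mask.

    feature_map: H x W x C nested lists; mask: binary 2D list, any resolution.
    The MLP that would follow is left out of this sketch.
    """
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    small = resize_mask_nearest(mask, h, w)
    selected = [feature_map[r][col]
                for r in range(h) for col in range(w) if small[r][col]]
    if not selected:  # empty mask: return a zero token
        return [0.0] * c
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(c)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def temporal_token_merge(tokens, u):
    """Reduce k per-frame object tokens to u tokens by merging similar neighbours."""
    if u < 1:
        raise ValueError("u must be at least 1")
    k = len(tokens)
    if u >= k:
        return [list(t) for t in tokens]
    # S_{m,m+1}: similarity between each pair of adjacent frames.
    sims = [cosine(tokens[m], tokens[m + 1]) for m in range(k - 1)]
    # Link the k-u most similar adjacent pairs.
    linked = set(sorted(range(k - 1), key=lambda m: sims[m], reverse=True)[:k - u])
    groups, current = [], [tokens[0]]
    for m in range(k - 1):
        if m in linked:
            current.append(tokens[m + 1])
        else:
            groups.append(current)
            current = [tokens[m + 1]]
    groups.append(current)
    # Average pooling inside each run of linked frames leaves u tokens.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


features = [[[1.0], [2.0]], [[3.0], [4.0]]]          # 2 x 2 feature map, C = 1
left_half = [[1, 1, 0, 0] for _ in range(4)]         # 4 x 4 mask, left half set
pooled = mask_pool(features, left_half)

frame_tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
merged = temporal_token_merge(frame_tokens, u=2)
print(pooled, merged)
```

Linking `k - u` of the `k - 1` adjacent pairs leaves `u - 1` boundaries, which is why exactly `u` tokens remain. A real implementation would use batched tensor operations rather than nested lists.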
Loss & Training
Training uses the standard autoregressive language modeling loss \(\mathcal{L} = -\sum_{t} \log P(y_t \mid y_{<t}, V, R_1, \dots, R_n, x)\), where \(V\) is the video, \(R_1, \dots, R_n\) are the referred regions, and \(x\) is the text instruction. Training has two stages: Stage 1 pre-trains the alignment between the object encoder and the LLM using 500K short descriptions; Stage 2 fine-tunes all trainable parameters using 125K detailed descriptions + 75K QA.
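As a reminder of what the objective computes, the sketch below evaluates a next-token negative log-likelihood from per-step logits. It is generic language-modeling arithmetic on toy inputs, not VideoRefer training code; the function names are hypothetical, and the conditioning on the video, regions, and instruction is assumed to be already reflected in the logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def autoregressive_nll(step_logits, target_ids):
    """Sum of -log P(y_t | context) over the answer tokens.

    step_logits[t] holds the logits predicted at step t, given the video,
    the region tokens, the instruction, and the earlier answer tokens.
    """
    return -sum(math.log(softmax(logits)[target])
                for logits, target in zip(step_logits, target_ids))


# Toy case: a 4-word vocabulary and uniform predictions at 3 steps.
loss = autoregressive_nll([[0.0, 0.0, 0.0, 0.0]] * 3, [2, 0, 1])
print(loss)
```

With uniform predictions each step costs `log(4)`, so the loss is `3 * log(4)`; it decreases as probability mass moves toward the target tokens.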
Key Experimental Results
Main Results (VideoRefer-Bench\(^Q\))
| Method | Foundational | Sequential | Relational | Reasoning | Prediction | Overall |
|---|---|---|---|---|---|---|
| Qwen2-VL-7B | 52.0 | 49.2 | 50.0 | 43.3 | 45.0 | 48.7 |
| Artemis-7B | 48.0 | 45.2 | 40.0 | 36.7 | 37.5 | 43.0 |
| VideoRefer-7B | 72.0 | 66.4 | 60.0 | 60.0 | 57.5 | 64.8 |
Ablation Study
| Configuration | Bench\(^D\) Avg | Bench\(^Q\) Avg |
|---|---|---|
| Box-level feature (RoI) | 2.51 | 57.2 |
| w/o Temporal Token Merge | 2.68 | 61.5 |
| w/o Reviewer Filtering | 2.55 | 59.8 |
| Full VideoRefer | 2.82 | 64.8 |
Key Findings
- Mask-level features significantly outperform box-level features (Bench\(^Q\) 64.8 vs 57.2), demonstrating that pixel-level precision is crucial for region understanding.
- Temporal Token Merge adds 3.3 points on Bench\(^Q\) (61.5 to 64.8), highlighting the value of cross-frame temporal aggregation.
- Quality filtering by the Reviewer (retaining only 40% of samples) yields a 5.0-point gain on Bench\(^Q\) (59.8 to 64.8), showing that data quality matters more than data quantity.
- VideoRefer also improves on general video understanding benchmarks (e.g., MVBench +2.4%), suggesting that object-level understanding also benefits overall video understanding.
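The gains quoted in these findings are differences against the "Full VideoRefer" row of the ablation table. The short script below recomputes them from the table values as a consistency check; it introduces no new data, and the variable and function names are illustrative only.

```python
# (Bench-D average, Bench-Q average) copied from the ablation table.
ablation = {
    "Box-level feature (RoI)": (2.51, 57.2),
    "w/o Temporal Token Merge": (2.68, 61.5),
    "w/o Reviewer Filtering": (2.55, 59.8),
}
full = (2.82, 64.8)  # Full VideoRefer


def gains_over(variant):
    """Return how much the full model gains over an ablated variant."""
    bench_d, bench_q = ablation[variant]
    return round(full[0] - bench_d, 2), round(full[1] - bench_q, 1)


for name in ablation:
    d_gain, q_gain = gains_over(name)
    print(f"{name}: Bench-D +{d_gain:.2f}, Bench-Q +{q_gain:.1f}")
```

The Bench\(^Q\) differences reproduce the 3.3-point and 5.0-point figures above, and the box-level variant trails the full model by 7.6 points.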
Highlights & Insights
- Complete Data-Model-Benchmark Framework: This systematic approach establishes solid infrastructure for the video region understanding field.
- Multi-Agent Data Engine: The methodology of using multi-model collaboration combined with strict review to construct high-quality data is generalizable to other domains.
- Unified Mask Representation: Consolidating boxes, points, and free-form regions into binary masks simplifies model design and improves flexibility.
- SAM2 Integration: Because regions are represented as masks, the model pairs naturally with SAM2, so a user can click on an object and have the resulting mask passed to the model for understanding.
Limitations & Future Work
- Mask Pooling in the object encoder is a simple averaging operation, which might lose spatial structural details within the region.
- Temporal Token Merge is based on simple cosine similarity, which might cause misalignment or incorrect merging of critical frames in fast-moving scenes.
- The data engine relies on multiple large models (especially GPT-4o as the Refiner), resulting in high costs.
- Future work could explore finer region representations (e.g., multi-granularity feature pyramids).
Related Work & Insights
- vs. Artemis: Artemis performs single-object referring with an external RoI tracker and box-level features. VideoRefer supports mask-level features and multi-object scenarios, and considerably outperforms it on complex reasoning (60.0 vs. 36.7 on the Reasoning split of Bench\(^Q\)).
- vs. Osprey: Osprey is a pioneer in image region understanding; VideoRefer extends its core ideas to the video domain and integrates the temporal dimension.
- The Reviewer's workflow in the data engine (using an independent model to verify annotation consistency) is worth adopting in other data generation pipelines.
Rating
- Novelty: ⭐⭐⭐⭐ Systematic work with logically sound component designs
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Custom benchmark, general benchmark, detailed ablation, and data quality analysis
- Writing Quality: ⭐⭐⭐⭐ Clear structure and rich illustrations
- Value: ⭐⭐⭐⭐⭐ Provides a foundational infrastructure for video region understanding