SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

Conference: CVPR 2025
arXiv: 2603.12382
Code: To be confirmed
Area: Multimodal VLM
Keywords: Video Segmentation, Video MLLM, Temporal Consistency, Pixel-grounding, Dual-prompt Mechanism

TL;DR

This paper proposes SPARROW, a plug-and-play framework that addresses two weaknesses of pixel-grounded video MLLMs: poor temporal referential consistency and unstable first-frame initialization. It combines Target-Specific Tracking Features (TSF) with a dual-prompt ([BOX]+[SEG]) mechanism and yields consistent improvements across 3 video MLLMs (VideoGLaMM, UniPixel, GLUS) on 6 benchmarks.

Background & Motivation

Background: Multimodal Large Language Models (MLLMs) have achieved excellent performance in image-level pixel grounding, but extending them to the video domain faces challenges such as motion dynamics, occlusions, and temporal consistency.

Limitations of Prior Work: Existing video MLLMs (e.g., VideoGLaMM, UniPixel, GLUS) rely on static [SEG] text tokens for frame-by-frame grounding, which only convey semantic information of "what to look at" and fail to capture the target's spatial-temporal variations in location and appearance. This results in spatial drift (target segmentation shifting over time), identity switching (the same target misassociated across different frames), and referential inconsistency (the same language description referring to different regions in different frames).

Key Challenge: Text prompts are static while videos are dynamic, forcing the model to rely solely on visual cues to infer motion and appearance changes, without an explicit temporal reference mechanism. Concurrently, the first-frame initialization is unstable (as [SEG] only provides semantic cues without spatial priors), leading to error propagation and accumulation over time.

Goal: (i) Temporal referential consistency—how to maintain target identity across frames without drifting; (ii) First-frame grounding robustness—how to provide accurate initial localization to avoid error propagation.

Key Insight: Inject temporally aligned target features from a tracking perspective for training supervision, and utilize a coarse-to-fine dual-prompt strategy to stabilize initialization.

Core Idea: Use target features generated by offline tracking for temporal supervision during training (TSF), combined with a BOX+SEG dual-prompt for coarse-to-fine grounding during inference, achieving plug-and-play enhancement for video MLLMs.

Method

Overall Architecture

The inputs are a video \(\mathbf{V} \in \mathbb{R}^{T_v \times H \times W \times C}\) and a text query \(Q\). A dual-branch encoder (spatial encoder \(\mathcal{F}_g\) + temporal encoder \(\mathcal{F}_h\)) extracts video features, which a V→L adapter projects into the LLM embedding space. The LLM (fine-tuned with LoRA) outputs two grounding tokens, [BOX] and [SEG], which L→V adapters project back to the visual space to drive a class-agnostic proposer and the SAM2 decoder, producing the final segmentation masks. The pipeline is plug-and-play: the architectures of the base LLM and the visual backbones are left unchanged, and the LLM is adapted only through LoRA.

Key Designs

Target-Specific Tracking Features (TSF):
- Function: Inject temporally aligned target-level features during training to teach the model how to maintain target identity across frames.
- Mechanism: Given a text query, GroundingDINO is used to detect the target in a specific frame, and CLDTracker propagates it across frames to obtain trajectory boxes. To reduce redundancy, K-means clustering (\(K=4\)) is conducted in the joint visual-spatial feature space, selecting the samples closest to the centroids as a representative subset. After region encoding, these are projected as TSF tokens and concatenated with the LLM input.
- Design Motivation: The static [SEG] token cannot encode motion information. TSF provides diverse target appearance representations (across different frames and poses), enabling the model to learn identity persistence during training. During inference, TSF is not used by default, eliminating the need for external detectors/trackers.
- Supporting Dataset: Integrated 7 public datasets containing a total of 30,646 video sequences and 45,231 QA pairs.
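
The representative-selection step described above (cluster the trajectory samples, keep the sample nearest each centroid) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `kmeans_representatives`, the evenly spaced centroid initialization, and the toy 3-dimensional "joint visual-spatial" features are all assumptions made for the example.

```python
def kmeans_representatives(features, k=4, iters=20):
    """Return sorted indices of the samples closest to k-means centroids.

    `features` is a list of equal-length float vectors, one per tracked
    frame (assumed here to mix appearance and box-position values).
    """
    n = len(features)
    if n <= k:
        return list(range(n))

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Assumed deterministic init: spread centroids evenly over the trajectory.
    init = [round(j * (n - 1) / max(k - 1, 1)) for j in range(k)]
    centroids = [list(features[i]) for i in init]

    for _ in range(iters):
        # Assignment step: attach every sample to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda c: dist2(f, centroids[c]))
            clusters[nearest].append(f)
        # Update step: move each centroid to its cluster mean (skip empty clusters).
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    # Keep, for each centroid, the closest sample not already chosen.
    chosen = []
    for c in range(k):
        order = sorted(range(n), key=lambda i: dist2(features[i], centroids[c]))
        for i in order:
            if i not in chosen:
                chosen.append(i)
                break
    return sorted(chosen)


# Toy trajectory: 4 appearance/pose modes, 5 frames each (hypothetical data).
track = [[10 * g + 0.1 * t, 10 * g - 0.1 * t, 0.2 * g]
         for g in range(4) for t in range(5)]
reps = kmeans_representatives(track, k=4)
print(reps)
```

On this toy trajectory the function keeps one frame per mode, which is the property TSF relies on: a small subset that still covers the target's different appearances.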

Dual-Prompt Grounding (Dual-Prompt):
- Function: Combine the spatial prior of [BOX] and the semantic grounding of [SEG] to achieve coarse-to-fine segmentation.
- Mechanism: The LLM outputs the [BOX] embedding, which drives a class-agnostic proposer to generate \(K=300\) candidate boxes (here \(K\) is the proposal count, not the K-means cluster count used for TSF). The candidates are scored, filtered, and refined by box regression through cross-attention that fuses language and visual features. The high-confidence boxes that survive filtering are fed, together with the [SEG] embedding, into the SAM2 decoder to generate refined masks.
- Design Motivation: Relying solely on [SEG] for first-frame initialization is unstable, making recovery from drift difficult; [BOX] provides coarse-grained geometric constraints while [SEG] performs semantic refinement. This naturally supports multi-instance queries.
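
The score-and-filter step can be illustrated with a deliberately simplified sketch. The paper describes cross-attention between language and visual features; the version below swaps in a plain dot product followed by a sigmoid, purely to show the control flow. The function name `filter_proposals`, the threshold, and `max_keep` are hypothetical, and box regression is omitted.

```python
import math


def filter_proposals(box_embedding, proposals, threshold=0.5, max_keep=5):
    """Score candidate boxes against the [BOX] embedding and keep the best.

    `proposals` is a list of (box, feature) pairs, where box is
    (x1, y1, x2, y2) and feature has the same length as box_embedding.
    Scoring is a dot product + sigmoid, a stand-in for cross-attention.
    """
    scored = []
    for box, feature in proposals:
        logit = sum(q * f for q, f in zip(box_embedding, feature))
        score = 1.0 / (1.0 + math.exp(-logit))
        if score >= threshold:
            scored.append((score, box))
    # Highest-confidence boxes first; these would prompt the mask decoder.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:max_keep]


# Hypothetical 2-d embedding and three candidate boxes.
box_embedding = [1.0, 0.0]
proposals = [
    ((0, 0, 10, 10), [2.0, 0.0]),   # strongly matches the query
    ((5, 5, 20, 20), [-1.0, 3.0]),  # does not match
    ((1, 1, 9, 9), [0.5, 1.0]),     # weak match
]
kept = filter_proposals(box_embedding, proposals)
print([box for _, box in kept])
```

Because more than one box can pass the threshold, the same mechanism extends to multi-instance queries: each surviving box would be paired with the [SEG] embedding as a separate prompt.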

Class-Agnostic Proposer:
- Function: Generate category-free candidate boxes on frozen Hiera features.
- Mechanism: Construct an FPN on multi-scale Hiera features, which are then fed into a Deformable-DETR decoder where the classification branch is replaced with a single objectness head. Pre-trained on COCO, Objects365, OpenImages, and V3Det.
- Design Motivation: Decouple from external detectors, keeping it lightweight and free from category supervision.

Loss & Training

Two-stage training:
- Stage 1 (TSF Injection): Train the V→L adapters, the L→V SEG adapters, and LoRA. Loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{BCE}} + \mathcal{L}_{\text{DICE}}\)
- Stage 2 (BOX Learning): Freeze the Stage 1 weights, pre-train the proposer, and then fine-tune the Filtration Head and the L→V BOX adapter, with loss weights \(\lambda_{\text{cls}}=1.0\) and \(\lambda_{\text{box}}=2.0\).
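
To make the Stage 1 objective concrete, the sketch below computes the two mask terms, binary cross-entropy and Dice, on a flattened mask and adds them to a given text cross-entropy value. It uses the standard textbook definitions; the smoothing constants, the per-pixel averaging, and the function names are assumptions, since the source does not specify them.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels (probs in [0, 1], targets 0/1)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """1 - Dice coefficient, with additive smoothing (assumed value)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def stage1_loss(ce, probs, targets):
    """L_total = L_CE + L_BCE + L_DICE, with unit weights as in the text."""
    return ce + bce_loss(probs, targets) + dice_loss(probs, targets)


# Flattened 2x2 mask: a perfect prediction versus a poor one.
target = [1, 0, 1, 0]
perfect = [1.0, 0.0, 1.0, 0.0]
poor = [0.2, 0.9, 0.3, 0.8]
print(stage1_loss(0.0, perfect, target), stage1_loss(0.0, poor, target))
```

The BCE term penalizes each pixel independently, while the Dice term measures overlap of the whole mask, which is why the two are commonly paired for segmentation.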

Key Experimental Results

Main Results

Benchmark	Metric	Baseline (VideoGLaMM)	+SPARROW	Gain
MeViS val	J&F	45.2	47.5	+2.3
MeViS val^u	J&F	48.5	57.4	+8.9
Ref-DAVIS17	J&F	69.5	76.8	+7.3
Ref-YTVOS	J&F	66.8	68.9	+2.1
VidSTG	mIoU	39.66	45.06	+5.40
VideoGCG	mIoU	62.34	65.59	+3.25

Consistent improvements are also observed on stronger baselines: +2.2 on Ref-DAVIS17 for UniPixel and +2.6 on Ref-DAVIS17 for GLUS.
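
For readers unfamiliar with the J&F metric in the table: J is the region similarity (mask IoU) and J&F is the mean of J and the boundary F-measure F, as in the DAVIS evaluation protocol. The sketch below computes J on toy masks and averages it with a supplied F value; computing F itself requires boundary matching and is left out. Function names are illustrative, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and boundary F-measure."""
    return (j + f) / 2.0


pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]
j = region_similarity(pred_mask, gt_mask)  # 2 shared pixels / 4 in the union
print(j, j_and_f(j, 0.7))  # 0.7 is a made-up F value for illustration
```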

Ablation Study (Ref-DAVIS17, VideoGLaMM baseline)

Configuration	J&F	Explanation
Baseline	69.5	No TSF, No BOX
+ TSF(train only)	72.4 (+2.9)	TSF training supervision is effective
+ BOX only	72.5 (+3.0)	Dual-prompts are effective
+ TSF(train) + BOX	76.8 (+7.3)	Default setup, the two are complementary
+ TSF(train+infer) + BOX	77.7 (+8.2)	Best performance when also using TSF during inference
[SEG] only inference	69.5	Single prompt is weak
[BOX]+[SEG] inference	72.5 (+3.0)	Significant synergistic gain of dual-prompts
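
The "complementary" reading of the table can be checked with simple arithmetic on the reported scores: if TSF and BOX merely added up, the combined gain would be 2.9 + 3.0 = 5.9, whereas the reported combined gain is 7.3. The snippet below reproduces that calculation; the dictionary keys are labels made up for this example, and the numbers are copied from the table.

```python
baseline = 69.5
ablation = {
    "tsf_train": 72.4,
    "box_only": 72.5,
    "tsf_train_plus_box": 76.8,
    "tsf_train_infer_plus_box": 77.7,
}

# Gain of each configuration over the baseline, rounded to one decimal.
gains = {name: round(score - baseline, 1) for name, score in ablation.items()}

# Gain expected if the two components contributed independently.
additive = round(gains["tsf_train"] + gains["box_only"], 1)

# Extra gain beyond the additive expectation in the default setup.
synergy = round(gains["tsf_train_plus_box"] - additive, 1)

print(gains)
print("additive expectation:", additive, "extra gain:", synergy)
```

The default setup therefore exceeds the additive expectation by 1.4 points, and using TSF at inference as well adds a further 0.9 (77.7 vs. 76.8).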

Key Findings

Even when TSF is not used during inference, it still yields a +2.9 gain, demonstrating that the model has internalized temporal consistency capabilities through training.
In the ablation, [BOX]+[SEG] inference outperforms [SEG]-only inference by 3.0 points (72.5 vs. 69.5 J&F on Ref-DAVIS17).
The largest improvement is observed on the weaker baseline (VideoGLaMM) (+8.9 on MeViS val^u), with stable gains on stronger baselines as well.
Supervising the Filtration Head with [BOX] features performs 1.9 points better than with [SEG] features.

Highlights & Insights

Plug-and-play design: TSF and Dual-Prompt are lightweight modules that can be added to an existing video MLLM without altering its backbone network; this is validated on 3 different architectures (VideoGLaMM, UniPixel, GLUS).
Training-inference decoupling: TSF uses pseudo-tracking signals for supervision during training and can be removed during inference (default config), preserving performance gains without adding inference overhead.
Coarse-to-fine dual-prompt: The cascading design of [BOX]→[SEG] addresses both "imprecise localization" and "unclear boundary" issues, while naturally supporting multi-instance queries.

Limitations & Future Work

Dependency on proposal recall: Small, heavily occluded, or unseen objects might be missed by the proposer, after which BOX/SEG cannot recover them.
TSF relies on pseudo-tracking annotations from GroundingDINO and CLDTracker, where tracking noise or ID switching might introduce bias.
Early BOX errors in long sequences may still accumulate due to the lack of an explicit error-correction mechanism.

vs VideoGLaMM: Uses only [SEG] for frame-by-frame grounding without temporal cues; SPARROW improves it by +7.3 J&F on Ref-DAVIS17.
vs UniPixel: Features online memory but initializes from the first-frame mask; SPARROW's BOX provides better initialization.
vs GLUS: Uses global context + dense query frames but remains per-frame semantics; SPARROW incorporates explicit tracking features.

Rating

Novelty: ⭐⭐⭐⭐ The combination of TSF and dual-prompts is novel in video MLLMs, though neither component is entirely original on its own.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely thorough, with 3 baselines × 6 datasets × detailed ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear structure and well-elaborated motivation.
Value: ⭐⭐⭐⭐ High practicality due to the plug-and-play design, though the improvement margins on stronger baselines are limited.