PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?¶
Conference: NeurIPS 2025 · arXiv: 2509.02807 · Code: https://github.com/MSiam/PixFoundation-2.0 · Area: Video Understanding / Visual Grounding · Keywords: Motion-centric evaluation, visual grounding, video MLLM, referring segmentation, motion understanding
TL;DR¶
By introducing four motion-centric probing techniques and the MoCentric-Bench benchmark, this paper demonstrates that current video multimodal LLMs fail to genuinely exploit motion information in pixel-level visual grounding tasks and can be deceived by static keyframes.
Background & Motivation¶
Background: Multimodal large language models perform well on high-level video tasks (QA, captioning), but their pixel-level visual grounding capabilities have not been thoroughly investigated.
Limitations of Prior Work: Existing benchmarks assume motion is necessary, yet in practice a single static frame is often sufficient to resolve motion-referring expressions. Video MLLMs claim to leverage temporal information but may rely solely on powerful visual encoders and LLMs.
Key Challenge: There is no means to distinguish "true motion" (temporal dynamics) from "pseudo-motion" (simulatable by static keyframes), and no benchmark exists for evaluating motion-order understanding.
Key Insight: The paper designs two categories of probing techniques—motion existence and motion order—and constructs MoCentric-Bench to compel models to utilize genuine motion.
Method¶
Overall Architecture¶
A three-tier structure: (1) problem diagnosis, analyzing deficiencies in existing benchmarks such as Ref-DAVIS and MeViS; (2) motion-centric probe design, automatically synthesizing fake motion and reversed motion; (3) baselines and adaptation, providing a strong single-frame baseline and a LoRA-fine-tuned variant of Sa2VA.
Key Designs¶
- Motion Existence Probe
- Function: Determines whether a model genuinely relies on motion information.
- Mechanism: Qwen2.5-VL performs coarse-grained temporal localization to identify keyframes; fake videos (a keyframe repeated for the length of the clip) are generated and combined with the original video in four layout configurations to test whether models are deceived (see the probe-construction sketch after this list).
- Design Motivation: Motion-referring expressions (e.g., "jumping to the left") may be resolvable from a single frame combined with spatial layout alone.
- Motion Order Probe
- Function: Determines whether a model understands temporal direction.
- Mechanism: GPT-4o converts expressions such as "pull" into their reverse counterparts (e.g., "push"); videos are played in reverse to test whether models can distinguish forward from backward motion (also illustrated in the probe-construction sketch after this list).
- Design Motivation: A model that truly understands motion should be able to differentiate forward and reversed video sequences.
- MLLM + SAM 2 Strong Baseline
- Function: Validates the competitiveness of single-frame approaches.
- Mechanism: Qwen2.5-VL performs bounding-box localization on keyframes; SAM 2 turns the box into full segmentation masks; no temporal reasoning is involved (see the baseline sketch after this list).
- Design Motivation: If this single-frame baseline approaches state-of-the-art performance, it indicates that existing datasets are insufficiently motion-centric.
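
Both probes amount to perturbing the frame sequence fed to the grounding model. Below is a minimal sketch of how such probe inputs could be built, assuming frames are held as a list of NumPy arrays; the function names and the concatenation layout are illustrative assumptions, not the paper's exact implementation (which uses four layout configurations).

```python
import numpy as np

def make_fake_video(frames: list[np.ndarray], keyframe_idx: int) -> list[np.ndarray]:
    # Motion existence probe: a clip that repeats one keyframe, so it keeps
    # appearance and spatial layout but contains no real motion.
    return [frames[keyframe_idx].copy() for _ in frames]

def make_reversed_video(frames: list[np.ndarray]) -> list[np.ndarray]:
    # Motion order probe: play the clip backwards; the referring expression
    # is rewritten to its reverse counterpart (e.g., "pull" -> "push").
    return frames[::-1]

def mix_clips(original: list[np.ndarray], probe: list[np.ndarray], probe_first: bool = True) -> list[np.ndarray]:
    # One illustrative layout: concatenate the probe clip with the original
    # clip and ask the model to segment the referred object in the result.
    return probe + original if probe_first else original + probe

# Usage with placeholder frames.
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]
mixed = mix_clips(frames, make_fake_video(frames, keyframe_idx=7))
```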
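The single-frame baseline is, conceptually, a two-stage pipeline: the MLLM localizes the referred object with a bounding box on a keyframe, and SAM 2 turns that box prompt into masks. The sketch below assumes the video-predictor API of the official `sam2` package; `query_mllm_for_box` is a hypothetical stand-in for the Qwen2.5-VL call, and the config/checkpoint names are placeholders.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

def query_mllm_for_box(video_dir: str, frame_idx: int, expression: str) -> tuple[float, float, float, float]:
    # Hypothetical helper: prompt Qwen2.5-VL with the keyframe and the
    # referring expression, then parse an (x1, y1, x2, y2) box from its answer.
    raise NotImplementedError

# Config and checkpoint names are placeholders.
predictor = build_sam2_video_predictor("configs/sam2.1/sam2.1_hiera_l.yaml",
                                       "./checkpoints/sam2.1_hiera_large.pt")

def single_frame_baseline(video_dir: str, keyframe_idx: int, expression: str) -> dict[int, np.ndarray]:
    box = query_mllm_for_box(video_dir, keyframe_idx, expression)
    state = predictor.init_state(video_path=video_dir)
    predictor.add_new_points_or_box(state, frame_idx=keyframe_idx, obj_id=0, box=box)
    masks = {}
    # SAM 2 propagates the box prompt across the clip; the MLLM itself
    # never reasons over time, which is the point of this baseline.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
    return masks
```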
Loss & Training¶
The Sa2VA★ fine-tuned variant applies LoRA to the visual encoder and is trained with supervised learning on MoCentric-Bench synthetic data.
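
As a rough illustration of this adaptation recipe, the snippet below attaches LoRA adapters to a vision encoder with Hugging Face PEFT; the encoder, target module names, and hyperparameters are assumptions for illustration and do not reproduce Sa2VA's actual configuration.

```python
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model

# Stand-in vision tower; Sa2VA's actual visual encoder differs, this just
# makes the sketch runnable end to end.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# Rank, scaling, and target module names are illustrative assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)

vision_encoder = get_peft_model(vision_encoder, lora_config)
vision_encoder.print_trainable_parameters()  # only the LoRA adapters are trainable
```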
Key Experimental Results¶
Main Results (Existing Benchmarks)¶
| Method | Ref-DAVIS17 J&F | MeViS val J&F |
|---|---|---|
| LISA | 64.8 | 37.2 |
| VideoGLAMM | 69.5 | 45.2 |
| Sa2VA | 75.2 | 51.5 |
| MLLM+SAM2 (single-frame baseline) | 70.5 | 44.5 |
| MLLM+SAM2† (keyframe) | 71.7 | 46.9 |
Ablation Study (MoCentric-Bench)¶
| Model | Orig. val J&F | Single-frame mix | Change (%) | Reversed mix | Change (%) |
|---|---|---|---|---|---|
| VideoGLAMM | 48.2 | 21.6 | -55.2% | 34.0 | -29.4% |
| Sa2VA | 58.9 | 28.5 | -51.6% | 61.1 | +3.7% |
| MLLM+SAM2† | 57.4 | 28.1 | -51.0% | 53.1 | -7.5% |
| Sa2VA★ (fine-tuned) | 63.1 | 38.2 | -39.5% | 56.4 | -10.6% |
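
The change columns are plain relative differences with respect to the original validation J&F; a quick check reproduces the Sa2VA row, for example.

```python
def relative_change(original: float, perturbed: float) -> float:
    # Relative change in J&F (%) with respect to the original validation score.
    return (perturbed - original) / original * 100

print(round(relative_change(58.9, 28.5), 1))  # -51.6 (Sa2VA, single-frame mix)
print(round(relative_change(58.9, 61.1), 1))  # 3.7 (Sa2VA, reversed mix; the +3.7% in the table)
```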
Key Findings¶
- The single-frame baseline matches or surpasses multiple state-of-the-art methods, exposing a heavy reliance on static information in existing datasets.
- All models suffer substantial performance drops under single-frame mixing (39–55%), indicating that they are deceived by fake motion.
- Sa2VA even shows a marginal gain (+3.7%) under reversed mixing, demonstrating that it has no understanding of motion direction whatsoever.
Highlights & Insights¶
- First motion-centric evaluation: Systematically exposes methodological issues in existing benchmarks, with far-reaching implications for evaluation protocol design in video understanding.
- Automated data synthesis: VLM + LLM pipelines automatically generate fake and reversed motion, resulting in a low-cost and scalable approach.
- Clear identification of weaknesses: The state-of-the-art method (Sa2VA) suffers a dramatic performance drop under motion-centric probing (58.9 → 28.5 with single-frame mixing), a finding that will drive video MLLMs toward genuine motion understanding.
Limitations & Future Work¶
- MoCentric-Bench is relatively small in scale (32–152 objects); larger-scale validation is needed.
- Fine-tuning processes only the first five frames, which may be insufficient to fully exploit motion.
- Reversed expressions may be semantically incompatible with certain videos, requiring manual filtering.
Related Work & Insights¶
- vs. ATP (Buch et al., 2022): ATP also contrasts single-frame and full-video performance, but is limited to video-level understanding tasks; this paper is the first to systematically study pixel-level grounding tasks in this manner.
- vs. Kowal et al. (2022): Prior work analyzes the ratio of static to dynamic content; this paper proposes a complete probing framework.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First motion-centric visual grounding evaluation
- Experimental Thoroughness: ⭐⭐⭐⭐ Four probing techniques are comprehensive; the benchmark could be larger
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is clear; methods are readily reproducible
- Value: ⭐⭐⭐⭐⭐ Advances video MLLMs toward genuine motion understanding