
Geometry-Guided Camera Motion Understanding in VideoLLMs

Conference: CVPR 2026 | arXiv: 2603.13119 | Code: To be released | Area: Interpretability | Keywords: Camera motion understanding, VideoLLM, geometry guidance, 3D foundation model, motion primitives, structured prompting

TL;DR

This work systematically reveals camera motion blind spots in VideoLLMs through a benchmarking-diagnosis-injection framework, and significantly improves fine-grained camera motion understanding without fine-tuning by leveraging a frozen 3D foundation model (VGGT) for geometric feature extraction, a lightweight temporal classifier, and structured prompt injection.

Background & Motivation

State of the Field

Camera motion (pan/tilt/dolly, etc.) is a core geometric signal in cinematic grammar, directly shaping narrative structure, attentional guidance, and the expression of spatial layout. Existing VideoLLMs, however, are predominantly optimized for high-level semantics (object recognition, action understanding) and lack explicit camera motion supervision. Experiments show that most VideoLLMs achieve near-random accuracy (~25%) on camera motion VQA, indicating that this critical signal is severely neglected. Notably, models fine-tuned specifically on CameraBench perform even worse than the original Qwen2.5-VL, exposing fundamental issues with conventional fine-tuning approaches.

Problem Formulation

The paper addresses two questions:

  1. What causes the systematic failure of VideoLLMs in recognizing fine-grained camera motion primitives?
  2. How can reliable camera motion information be injected without modifying VideoLLM weights?

Method

Overall Architecture

The pipeline consists of three stages: (1) extracting per-frame camera tokens from a frozen 3D foundation model (VGGT); (2) predicting constraint-aware motion labels via a lightweight Transformer temporal classifier; (3) injecting per-second motion label sequences as structured prompts into VideoLLM inference. The entire pipeline is plug-and-play and model-agnostic.
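A minimal sketch of how the three stages might compose at inference time is shown below; the module interfaces (`extract_camera_tokens`, `predict`, `generate`) and the prompt wording are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the three-stage pipeline at inference time.
# All module interfaces, names, and the prompt wording are illustrative
# assumptions, not the authors' released code.
import torch

def answer_with_motion_injection(frames, vggt, motion_classifier, videollm, question):
    """frames: (T, 3, H, W) tensor of video frames, one second per label step."""
    # Stage 1: per-frame camera tokens from the frozen 3D foundation model (VGGT).
    with torch.no_grad():
        camera_tokens = vggt.extract_camera_tokens(frames)        # (T, D), hypothetical API

    # Stage 2: constraint-aware motion primitives from the lightweight temporal classifier.
    labels_per_second = motion_classifier.predict(camera_tokens)  # e.g. [["pan-left"], ["dolly-in", "tilt-up"], ...]

    # Stage 3: inject the per-second label sequence as a structured prompt.
    motion_prompt = "Camera motion per second: " + "; ".join(
        f"[{t}-{t + 1}s] " + ("+".join(labels) if labels else "static")
        for t, labels in enumerate(labels_per_second)
    )
    return videollm.generate(frames=frames, prompt=motion_prompt + "\n" + question)
```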

Key Designs

  1. CameraMotionDataset: Built upon the MultiCamVideo subset of ReCamMaster, comprising 12,274 one-second clips, each deterministically annotated with 15 atomic motion primitives (pan-left/right, tilt-up/down, dolly-in/out, etc.) derived from precise extrinsic matrices, with a human-verified agreement rate of 93%.
  2. Constraint-Aware Multi-Label Classification: A mutual exclusion matrix \(\mathbf{M} \in \{0,1\}^{K \times K}\) is defined over the 15 primitive classes. Training incorporates an incompatibility regularizer \(\mathcal{L}_{inc} = \sum M_{ij} p_i p_j\) and a cardinality regularizer \(\mathcal{L}_{card}\) to ensure physically plausible prediction combinations (a loss sketch follows this list).
  3. Probing Diagnosis: Q-Former probes are applied to frozen ViT layers of Qwen2.5-VL to read camera motion signals. Layer 7 (the first full-attention layer) yields the highest performance, with signal strength declining monotonically at greater depths — indicating that token compression and semantic alignment training progressively erase motion cues.
  4. VGGT-Q-Former Distillation: VGGT camera tokens (from a 1.2B-parameter model) are distilled into a Q-Former with only 8.72M parameters, yielding a 5.3× throughput improvement and reducing peak memory to 39% of the full model's, at the cost of an 8.13% drop in instance accuracy.
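The following is a minimal sketch of the constraint-aware regularizers from Key Design 2; the soft cardinality formulation and the example exclusion entries are assumptions, since only the incompatibility term is given explicitly above.

```python
# Sketch of the constraint-aware regularizers in Key Design 2. The mutual
# exclusion term follows L_inc = sum_ij M_ij * p_i * p_j; the soft 1-3
# cardinality penalty and the example matrix entries are assumptions.
import torch

def incompatibility_loss(p, M):
    """p: (B, K) sigmoid probabilities; M: (K, K) binary mutual-exclusion matrix (float)."""
    return torch.einsum("bi,ij,bj->b", p, M, p).mean()

def cardinality_loss(p, min_labels=1.0, max_labels=3.0):
    """Penalizes clips whose expected number of active labels falls outside [1, 3]."""
    expected = p.sum(dim=-1)
    return (torch.relu(min_labels - expected) + torch.relu(expected - max_labels)).mean()

# Toy usage: 15 primitives, with pan-left (index 0) and pan-right (index 1) mutually exclusive.
K = 15
M = torch.zeros(K, K)
M[0, 1] = M[1, 0] = 1.0
p = torch.sigmoid(torch.randn(4, K))
reg = incompatibility_loss(p, M) + cardinality_loss(p)
```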

Loss & Training

  • Primary loss: BCE \(\mathcal{L}_{bce}\)
  • Constraint regularizer: \(\mathcal{L}_{inc} = \sum M_{ij} p_i p_j\) penalizes co-activation of mutually exclusive primitives
  • Cardinality regularizer: \(\mathcal{L}_{card}\) constrains each clip to predict 1–3 labels
  • Distillation: MSE regression loss \(\mathcal{L}_{reg} = \sum \|\tilde{c}_t - c'_t\|^2\), trained progressively in three stages
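Putting the listed terms together, one plausible form of the overall objective is sketched below; the loss weights \(\lambda\), the soft 1–3 cardinality penalty, and the token shapes are assumptions, while the distillation term follows the MSE regression form given above.

```python
# Sketch of a combined training objective built from the listed terms; the
# loss weights and the student/teacher token shapes are assumptions.
import torch
import torch.nn.functional as F

def classification_objective(logits, targets, M, lambda_inc=1.0, lambda_card=0.1):
    """logits, targets: (B, K) with float multi-hot targets; M: (K, K) float exclusion matrix."""
    p = torch.sigmoid(logits)
    l_bce = F.binary_cross_entropy_with_logits(logits, targets)
    l_inc = torch.einsum("bi,ij,bj->b", p, M, p).mean()                          # mutual exclusion
    l_card = (torch.relu(1.0 - p.sum(-1)) + torch.relu(p.sum(-1) - 3.0)).mean()  # 1-3 labels per clip
    return l_bce + lambda_inc * l_inc + lambda_card * l_card

def distillation_objective(student_tokens, teacher_tokens):
    """MSE regression of Q-Former outputs onto frozen VGGT camera tokens."""
    return ((student_tokens - teacher_tokens) ** 2).sum(dim=-1).mean()
```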

Key Experimental Results

| Method | Instance Acc | Macro-F1 | Weighted-F1 |
| --- | --- | --- | --- |
| VGGT + Constraints | 0.738 | 0.87 | 0.92 |
| VGGT w/o Constraints | 0.572 | 0.79 | 0.84 |
| VGGT-Q-Former Distilled | 0.638 | 0.83 | 0.87 |
| Q-Former Probing | 0.450 | 0.69 | 0.74 |

  • Most off-the-shelf VideoLLMs achieve near-random accuracy (~25%) on CameraMotionVQA.
  • After motion label injection, VideoLLM descriptions shift from vague motion statements to cinematically structured narratives with explicit directionality and temporal organization.
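A hypothetical illustration of the injected structured prompt; the exact wording, ordering, and label format are assumptions based on the pipeline description above.

```text
System: You are given geometry-derived camera motion labels for each second of the video.
Camera motion per second: [0-1s] pan-left; [1-2s] pan-left + tilt-up; [2-3s] dolly-in; [3-4s] static
Question: Describe the camera work in this clip.
```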

Ablation Study

  • Removing constraint regularization reduces instance accuracy from 73.8% to 57.2%, demonstrating the critical role of mutual exclusion constraints.
  • Probing results: camera motion signal is strongest at ViT layer 7 (shallow full-attention layer) and nearly vanishes by layer 31.
  • Distilled vs. full VGGT: throughput of 23.36 vs. 4.39 samples/s with controlled accuracy loss.
  • Temporal convolution vs. average pooling: removing temporal modeling leads to a notable accuracy drop.

Highlights & Insights

  • Using probing to quantitatively diagnose "where information is lost" provides an excellent methodological framework for understanding bottlenecks in large models.
  • The constraint-aware label design is elegant: the mutual exclusion matrix combined with cardinality regularization enforces physical plausibility directly at the loss level.
  • Structured prompt injection modifies model inference behavior without any weight updates.

Limitations & Future Work

  • The dataset is synthetic (UE5-rendered); generalization to real-world videos remains to be validated.
  • Coverage is limited to extrinsic motion (pan/tilt/dolly); intrinsic changes such as zoom are not addressed.
  • Only VGGT is explored as the 3D foundation model backbone; comparisons with other geometry models are absent.
  • Prediction of the static class is unreliable, as VGGT's reconstruction prior assumes camera motion, making static segments potentially out-of-distribution.

Related Work Comparison

  • CameraBench: Provides primitive-level motion annotations and VQA evaluation, but labels are manually annotated without precise camera parameters; this paper's extrinsic-based deterministic annotation is more reliable.
  • SpatialVID: Offers per-frame depth and pose-driven motion instructions, but targets video generation rather than understanding; this paper inversely leverages geometric signals to enhance comprehension.
  • Shot-by-Shot: Uses shot-level cinematic grammar cues to guide description generation, but does not address primitive-level motion recognition.

Rating

  • Novelty: ⭐⭐⭐⭐ (complete diagnosis-injection framework design; novel constraint-aware multi-label classification)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (benchmarking + probing + distillation + qualitative analysis)
  • Writing Quality: ⭐⭐⭐⭐ (clear logic, professional figures and tables)
  • Value: ⭐⭐⭐ (targeted at cinematic video understanding scenarios)