Geometry-Guided Camera Motion Understanding in VideoLLMs¶
Conference: CVPR 2026 | arXiv: 2603.13119 | Code: To be released | Area: Interpretability | Keywords: Camera motion understanding, VideoLLM, geometry guidance, 3D foundation model, motion primitives, structured prompting
TL;DR¶
This work systematically reveals camera motion blind spots in VideoLLMs through a benchmarking-diagnosis-injection framework, and significantly improves fine-grained camera motion understanding without fine-tuning by leveraging a frozen 3D foundation model (VGGT) for geometric feature extraction, a lightweight temporal classifier, and structured prompt injection.
Background & Motivation¶
State of the Field¶
Background: Camera motion (pan/tilt/dolly, etc.) constitutes a core geometric signal in cinematic grammar, directly influencing narrative structure, attentional guidance, and spatial layout expression. However, existing VideoLLMs are predominantly optimized for high-level semantics (object recognition, action understanding) and lack explicit camera motion supervision. Experiments reveal that most VideoLLMs achieve near-random accuracy (~25%) on camera motion VQA, indicating that this critical signal is severely neglected. Notably, models fine-tuned specifically on CameraBench perform even worse than the original Qwen2.5-VL, exposing fundamental issues with conventional fine-tuning approaches.
Problem Formulation¶
Research questions: (1) What causes the systematic failure of VideoLLMs in recognizing fine-grained camera motion primitives? (2) How can reliable camera motion information be injected without modifying VideoLLM weights?
Method¶
Overall Architecture¶
The pipeline consists of three stages: (1) extracting per-frame camera tokens from a frozen 3D foundation model (VGGT); (2) predicting constraint-aware motion labels via a lightweight Transformer temporal classifier; (3) injecting per-second motion label sequences as structured prompts into VideoLLM inference. The entire pipeline is plug-and-play and model-agnostic.
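As a rough data-flow sketch of the three stages, assuming nothing about the released code (all shapes, the random stand-in "models", and the prompt wording below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 15            # number of atomic motion primitive classes
T, D = 16, 128    # frames per clip, camera-token dimension

# Stage 1 (mocked): a frozen 3D foundation model would map T frames to one
# geometric camera token per frame; random vectors stand in for VGGT here.
camera_tokens = rng.standard_normal((T, D))

# Stage 2 (mocked): a lightweight temporal classifier pools the token
# sequence and emits per-clip multi-label probabilities over K primitives.
w = rng.standard_normal((D, K))
probs = 1.0 / (1.0 + np.exp(-(camera_tokens.mean(axis=0) @ w)))
labels = [k for k in range(K) if probs[k] > 0.5]

# Stage 3: predicted labels for each one-second clip are rendered as text
# and injected into the VideoLLM prompt (no weight updates anywhere).
prompt_line = f"Detected camera motion primitive ids: {labels}"
```

Because every stage only reads the previous stage's output, the classifier and the prompt template can be swapped independently of the downstream VideoLLM, which is what makes the pipeline plug-and-play.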
Key Designs¶
- CameraMotionDataset: Built upon the MultiCamVideo subset of ReCamMaster, comprising 12,274 one-second clips, each deterministically annotated with 15 atomic motion primitives (pan-left/right, tilt-up/down, dolly-in/out, etc.) derived from precise extrinsic matrices, with a human-verified agreement rate of 93%.
- Constraint-Aware Multi-Label Classification: A mutual exclusion matrix \(\mathbf{M} \in \{0,1\}^{K \times K}\) is defined over the 15 primitive classes. Training incorporates an incompatibility regularizer \(\mathcal{L}_{inc} = \sum M_{ij} p_i p_j\) and a cardinality regularizer \(\mathcal{L}_{card}\) to ensure physically plausible prediction combinations.
- Probing Diagnosis: Q-Former probes are applied to frozen ViT layers of Qwen2.5-VL to read camera motion signals. Layer 7 (the first full-attention layer) yields the highest performance, with signal strength declining monotonically at greater depths — indicating that token compression and semantic alignment training progressively erase motion cues.
- VGGT-Q-Former Distillation: VGGT camera tokens (from a 1.2B-parameter model) are distilled into a Q-Former with only 8.72M parameters, achieving a 5.3× throughput improvement and reducing peak memory to 39% of the original, at the cost of only an 8.13% drop in instance accuracy.
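The mutual exclusion matrix \(\mathbf{M}\) can be built directly from physically incompatible direction pairs. A minimal sketch over the six primitives named in this summary (the full 15-class list and pairing are not given here, so both are assumptions):

```python
import numpy as np

# Subset of the 15 motion primitives; only these six are named in the text.
PRIMITIVES = ["pan-left", "pan-right", "tilt-up", "tilt-down", "dolly-in", "dolly-out"]

# Opposite directions along the same camera axis cannot co-occur in one clip.
EXCLUSIVE_PAIRS = [
    ("pan-left", "pan-right"),
    ("tilt-up", "tilt-down"),
    ("dolly-in", "dolly-out"),
]

def build_exclusion_matrix(primitives, pairs):
    """Return a symmetric 0/1 matrix with M[i, j] = 1 iff i and j are exclusive."""
    idx = {p: i for i, p in enumerate(primitives)}
    m = np.zeros((len(primitives), len(primitives)), dtype=int)
    for a, b in pairs:
        m[idx[a], idx[b]] = m[idx[b], idx[a]] = 1
    return m

M = build_exclusion_matrix(PRIMITIVES, EXCLUSIVE_PAIRS)
```

Since \(\mathbf{M}\) is fixed by geometry rather than learned, it adds no parameters and can be reused unchanged across classifier variants.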
Loss & Training¶
- Primary loss: BCE \(\mathcal{L}_{bce}\)
- Constraint regularizer: \(\mathcal{L}_{inc} = \sum M_{ij} p_i p_j\) penalizes co-activation of mutually exclusive primitives
- Cardinality regularizer: \(\mathcal{L}_{card}\) constrains each clip to predict 1–3 labels
- Distillation: MSE regression loss \(\mathcal{L}_{reg} = \sum \|\tilde{c}_t - c'_t\|^2\), trained progressively in three stages
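A minimal NumPy sketch of the classification objective above; the loss weights and the hinge form of \(\mathcal{L}_{card}\) are assumptions (the summary only states that predictions are constrained to 1–3 labels):

```python
import numpy as np

def total_loss(logits, targets, M, lam_inc=1.0, lam_card=0.1):
    """BCE + incompatibility + cardinality regularizers (illustrative weights)."""
    p = 1.0 / (1.0 + np.exp(-logits))  # per-class sigmoid probabilities
    eps = 1e-8
    # Multi-label binary cross-entropy.
    bce = -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    # L_inc = sum_ij M_ij p_i p_j: penalize co-activating exclusive primitives.
    l_inc = float(p @ M @ p)
    # Hypothetical hinge keeping the expected label count within [1, 3].
    card = p.sum()
    l_card = max(0.0, 1.0 - card) + max(0.0, card - 3.0)
    return bce + lam_inc * l_inc + lam_card * l_card
```

Note that \(\mathcal{L}_{inc}\) is quadratic in the probabilities, so it only bites when two exclusive classes are simultaneously confident; it does not discourage a single confident prediction.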
Key Experimental Results¶
| Method | Instance Acc | Macro-F1 | Weighted-F1 |
|---|---|---|---|
| VGGT + Constraints | 0.738 | 0.87 | 0.92 |
| VGGT w/o Constraints | 0.572 | 0.79 | 0.84 |
| VGGT-Q-Former Distilled | 0.638 | 0.83 | 0.87 |
| Q-Former Probing | 0.450 | 0.69 | 0.74 |
- Most off-the-shelf VideoLLMs achieve near-random accuracy (~25%) on CameraMotionVQA.
- After motion label injection, VideoLLM descriptions shift from vague motion statements to cinematically structured narratives with explicit directionality and temporal organization.
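The injection step amounts to string formatting. A hypothetical template (the paper's exact prompt wording is not given in this summary, so the format below is illustrative only):

```python
def inject_motion_prompt(per_second_labels, question):
    """Prepend per-second motion labels to a VideoLLM question as structured text."""
    lines = [
        f"  second {t}: {', '.join(labels) if labels else 'static'}"
        for t, labels in enumerate(per_second_labels, start=1)
    ]
    return (
        "Camera motion annotations (per second):\n"
        + "\n".join(lines)
        + f"\n\nUsing these annotations, answer: {question}"
    )

prompt = inject_motion_prompt(
    [["pan-left"], ["pan-left", "dolly-in"], []],
    "Describe the camera work in this clip.",
)
```

Because the labels arrive as ordinary prompt text, the same template works with any instruction-following VideoLLM, which is what makes the method model-agnostic.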
Ablation Study¶
- Removing constraint regularization reduces instance accuracy from 73.8% to 57.2%, demonstrating the critical role of mutual exclusion constraints.
- Probing results: camera motion signal is strongest at ViT layer 7 (shallow full-attention layer) and nearly vanishes by layer 31.
- Distilled vs. full VGGT: throughput of 23.36 vs. 4.39 samples/s with controlled accuracy loss.
- Temporal convolution vs. average pooling: removing temporal modeling leads to a notable accuracy drop.
Highlights & Insights¶
- Using probing to quantitatively diagnose "where information is lost" provides an excellent methodological framework for understanding bottlenecks in large models.
- The constraint-aware label design is elegant: the mutual exclusion matrix combined with cardinality regularization enforces physical plausibility directly at the loss level.
- Structured prompt injection modifies model inference behavior without any weight updates.
Limitations & Future Work¶
- The dataset is synthetic (UE5-rendered); generalization to real-world videos remains to be validated.
- Coverage is limited to extrinsic motion (pan/tilt/dolly); intrinsic changes such as zoom are not addressed.
- Only VGGT is explored as the 3D foundation model backbone; comparisons with other geometry models are absent.
- Prediction of the static class is unreliable, as VGGT's reconstruction prior assumes camera motion, making static segments potentially out-of-distribution.
Related Work & Insights¶
- CameraBench: Provides primitive-level motion annotations and VQA evaluation, but labels are manually annotated without precise camera parameters; this paper's extrinsic-based deterministic annotation is more reliable.
- SpatialVID: Offers per-frame depth and pose-driven motion instructions, but targets video generation rather than understanding; this paper inversely leverages geometric signals to enhance comprehension.
- Shot-by-Shot: Uses shot-level cinematic grammar cues to guide description generation, but does not address primitive-level motion recognition.
Rating¶
- Novelty: ⭐⭐⭐⭐ (complete diagnosis-injection framework design; novel constraint-aware multi-label classification)
- Experimental Thoroughness: ⭐⭐⭐⭐ (benchmarking + probing + distillation + qualitative analysis)
- Writing Quality: ⭐⭐⭐⭐ (clear logic, professional figures and tables)
- Value: ⭐⭐⭐ (targeted at cinematic video understanding scenarios)