Geometry-Guided Camera Motion Understanding in VideoLLMs¶
Conference: CVPR 2026 | arXiv: 2603.13119 | Code: To be released | Area: Interpretability | Keywords: Camera motion understanding, VideoLLM, geometry guidance, 3D foundation model, motion primitives, structured prompting
TL;DR¶
This work systematically reveals camera motion blind spots in VideoLLMs through a benchmarking-diagnosis-injection framework, and significantly improves fine-grained camera motion understanding without fine-tuning by leveraging a frozen 3D foundation model (VGGT) for geometric feature extraction, a lightweight temporal classifier, and structured prompt injection.
Background & Motivation¶
State of the Field¶
Background: Camera motion (pan/tilt/dolly, etc.) constitutes a core geometric signal in cinematic grammar, directly influencing narrative structure, attentional guidance, and spatial layout expression. However, existing VideoLLMs are predominantly optimized for high-level semantics (object recognition, action understanding) and lack explicit camera motion supervision. Experiments reveal that most VideoLLMs achieve near-random accuracy (~25%) on camera motion VQA, indicating that this critical signal is severely neglected. Notably, models fine-tuned specifically on CameraBench perform even worse than the original Qwen2.5-VL, exposing fundamental issues with conventional fine-tuning approaches.
Problem Formulation¶
Research questions: (1) What causes the systematic failure of VideoLLMs in recognizing fine-grained camera motion primitives? (2) How can reliable camera motion information be injected without modifying VideoLLM weights?
Method¶
Overall Architecture¶
The pipeline consists of three stages: (1) extracting per-frame camera tokens from a frozen 3D foundation model (VGGT); (2) predicting constraint-aware motion labels via a lightweight Transformer temporal classifier; (3) injecting per-second motion label sequences as structured prompts into VideoLLM inference. The entire pipeline is plug-and-play and model-agnostic.
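As a rough data-flow sketch of the three stages, assuming nothing about the released code (all shapes, the random stand-in "models", and the prompt wording below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 15            # number of atomic motion primitive classes
T, D = 16, 128    # frames per clip, camera-token dimension

# Stage 1 (mocked): a frozen 3D foundation model would map T frames to one
# geometric camera token per frame; random vectors stand in for VGGT here.
camera_tokens = rng.standard_normal((T, D))

# Stage 2 (mocked): a lightweight temporal classifier pools the token
# sequence and emits per-clip multi-label probabilities over K primitives.
w = rng.standard_normal((D, K))
probs = 1.0 / (1.0 + np.exp(-(camera_tokens.mean(axis=0) @ w)))
labels = [k for k in range(K) if probs[k] > 0.5]

# Stage 3: predicted labels for each one-second clip are rendered as text
# and injected into the VideoLLM prompt (no weight updates anywhere).
prompt_line = f"Detected camera motion primitive ids: {labels}"
```

Because every stage only reads the previous stage's output, the classifier and the prompt template can be swapped independently of the downstream VideoLLM, which is what makes the pipeline plug-and-play.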
Key Designs¶
- CameraMotionDataset: Built upon the MultiCamVideo subset of ReCamMaster, comprising 12,274 one-second clips, each deterministically annotated with 15 atomic motion primitives (pan-left/right, tilt-up/down, dolly-in/out, etc.) derived from precise extrinsic matrices, with a human-verified agreement rate of 93%.
- Constraint-Aware Multi-Label Classification: A mutual exclusion matrix \(\mathbf{M} \in \{0,1\}^{K \times K}\) is defined over the 15 primitive classes. Training incorporates an incompatibility regularizer \(\mathcal{L}_{inc} = \sum M_{ij} p_i p_j\) and a cardinality regularizer \(\mathcal{L}_{card}\) to ensure physically plausible prediction combinations.
- Probing Diagnosis: Q-Former probes are applied to frozen ViT layers of Qwen2.5-VL to read camera motion signals. Layer 7 (the first full-attention layer) yields the highest performance, with signal strength declining monotonically at greater depths — indicating that token compression and semantic alignment training progressively erase motion cues.
- VGGT-Q-Former Distillation: VGGT camera tokens (from a 1.2B-parameter model) are distilled into a Q-Former with only 8.72M parameters, achieving a 5.3× throughput improvement and reducing peak memory to 39% of the original, at the cost of only an 8.13% drop in instance accuracy.
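The mutual exclusion matrix \(\mathbf{M}\) can be built directly from physically incompatible direction pairs. A minimal sketch over the six primitives named in this summary (the full 15-class list and pairing are not given here, so both are assumptions):

```python
import numpy as np

# Subset of the 15 motion primitives; only these six are named in the text.
PRIMITIVES = ["pan-left", "pan-right", "tilt-up", "tilt-down", "dolly-in", "dolly-out"]

# Opposite directions along the same camera axis cannot co-occur in one clip.
EXCLUSIVE_PAIRS = [
    ("pan-left", "pan-right"),
    ("tilt-up", "tilt-down"),
    ("dolly-in", "dolly-out"),
]

def build_exclusion_matrix(primitives, pairs):
    """Return a symmetric 0/1 matrix with M[i, j] = 1 iff i and j are exclusive."""
    idx = {p: i for i, p in enumerate(primitives)}
    m = np.zeros((len(primitives), len(primitives)), dtype=int)
    for a, b in pairs:
        m[idx[a], idx[b]] = m[idx[b], idx[a]] = 1
    return m

M = build_exclusion_matrix(PRIMITIVES, EXCLUSIVE_PAIRS)
```

Since \(\mathbf{M}\) is fixed by geometry rather than learned, it adds no parameters and can be reused unchanged across classifier variants.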
Loss & Training¶
- Primary loss: BCE \(\mathcal{L}_{bce}\)
- Constraint regularizer: \(\mathcal{L}_{inc} = \sum M_{ij} p_i p_j\) penalizes co-activation of mutually exclusive primitives
- Cardinality regularizer: \(\mathcal{L}_{card}\) constrains each clip to predict 1–3 labels
- Distillation: MSE regression loss \(\mathcal{L}_{reg} = \sum \|\tilde{c}_t - c'_t\|^2\), trained progressively in three stages
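A minimal NumPy sketch of the classification objective above; the loss weights and the hinge form of \(\mathcal{L}_{card}\) are assumptions (the summary only states that predictions are constrained to 1–3 labels):

```python
import numpy as np

def total_loss(logits, targets, M, lam_inc=1.0, lam_card=0.1):
    """BCE + incompatibility + cardinality regularizers (illustrative weights)."""
    p = 1.0 / (1.0 + np.exp(-logits))  # per-class sigmoid probabilities
    eps = 1e-8
    # Multi-label binary cross-entropy.
    bce = -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    # L_inc = sum_ij M_ij p_i p_j: penalize co-activating exclusive primitives.
    l_inc = float(p @ M @ p)
    # Hypothetical hinge keeping the expected label count within [1, 3].
    card = p.sum()
    l_card = max(0.0, 1.0 - card) + max(0.0, card - 3.0)
    return bce + lam_inc * l_inc + lam_card * l_card
```

Note that \(\mathcal{L}_{inc}\) is quadratic in the probabilities, so it only bites when two exclusive classes are simultaneously confident; it does not discourage a single confident prediction.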
Key Experimental Results¶
| Method | Instance Acc | Macro-F1 | Weighted-F1 |
|---|---|---|---|
| VGGT + Constraints | 0.738 | 0.87 | 0.92 |
| VGGT w/o Constraints | 0.572 | 0.79 | 0.84 |
| VGGT-Q-Former Distilled | 0.638 | 0.83 | 0.87 |
| Q-Former Probing | 0.450 | 0.69 | 0.74 |
- Most off-the-shelf VideoLLMs achieve near-random accuracy (~25%) on CameraMotionVQA.
- After motion label injection, VideoLLM descriptions shift from vague motion statements to cinematically structured narratives with explicit directionality and temporal organization.
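The injection step amounts to string formatting. A hypothetical template (the paper's exact prompt wording is not given in this summary, so the format below is illustrative only):

```python
def inject_motion_prompt(per_second_labels, question):
    """Prepend per-second motion labels to a VideoLLM question as structured text."""
    lines = [
        f"  second {t}: {', '.join(labels) if labels else 'static'}"
        for t, labels in enumerate(per_second_labels, start=1)
    ]
    return (
        "Camera motion annotations (per second):\n"
        + "\n".join(lines)
        + f"\n\nUsing these annotations, answer: {question}"
    )

prompt = inject_motion_prompt(
    [["pan-left"], ["pan-left", "dolly-in"], []],
    "Describe the camera work in this clip.",
)
```

Because the labels arrive as ordinary prompt text, the same template works with any instruction-following VideoLLM, which is what makes the method model-agnostic.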
Ablation Study¶
- Removing constraint regularization reduces instance accuracy from 73.8% to 57.2%, demonstrating the critical role of mutual exclusion constraints.
- Probing results: camera motion signal is strongest at ViT layer 7 (shallow full-attention layer) and nearly vanishes by layer 31.
- Distilled vs. full VGGT: throughput of 23.36 vs. 4.39 samples/s with controlled accuracy loss.
- Temporal convolution vs. average pooling: removing temporal modeling leads to a notable accuracy drop.
Highlights & Insights¶
- Using probing to quantitatively diagnose "where information is lost" provides an excellent methodological framework for understanding bottlenecks in large models.
- The constraint-aware label design is elegant: the mutual exclusion matrix combined with cardinality regularization enforces physical plausibility directly at the loss level.
- Structured prompt injection modifies model inference behavior without any weight updates.
Limitations & Future Work¶
- The dataset is synthetic (UE5-rendered); generalization to real-world videos remains to be validated.
- Coverage is limited to extrinsic motion (pan/tilt/dolly); intrinsic changes such as zoom are not addressed.
- Only VGGT is explored as the 3D foundation model backbone; comparisons with other geometry models are absent.
- Prediction of the static class is unreliable, as VGGT's reconstruction prior assumes camera motion, making static segments potentially out-of-distribution.
Related Work & Insights¶
- CameraBench: Provides primitive-level motion annotations and VQA evaluation, but labels are manually annotated without precise camera parameters; this paper's extrinsic-based deterministic annotation is more reliable.
- SpatialVID: Offers per-frame depth and pose-driven motion instructions, but targets video generation rather than understanding; this paper inversely leverages geometric signals to enhance comprehension.
- Shot-by-Shot: Uses shot-level cinematic grammar cues to guide description generation, but does not address primitive-level motion recognition.
Rating¶
- Novelty: ⭐⭐⭐⭐ (complete diagnosis-injection framework design; novel constraint-aware multi-label classification)
- Experimental Thoroughness: ⭐⭐⭐⭐ (benchmarking + probing + distillation + qualitative analysis)
- Writing Quality: ⭐⭐⭐⭐ (clear logic, professional figures and tables)
- Value: ⭐⭐⭐ (targeted at cinematic video understanding scenarios)