# VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
- Conference: CVPR 2026
- arXiv: 2506.21742
- Code: Available (dataset and data collection framework are open-sourced)
- Area: Video Understanding
- Keywords: Video Question Answering, Implicit Reasoning, Visual Relational Reasoning, Benchmark, Multimodal Understanding
## TL;DR
This paper proposes VRR-QA, a benchmark comprising 1K carefully annotated video QA pairs designed to evaluate models' ability to reason about implicit visual relationships in videos—such as off-screen events, cross-frame causality, and spatial relationship inference. The benchmark reveals significant deficiencies in implicit reasoning among current state-of-the-art VideoQA models, including GPT-O3: the best-performing model achieves only 64% accuracy, far below the human baseline of 83%.
## Background & Motivation
- Background: VideoQA has achieved notable progress in recent years through multimodal learning that aligns visual and textual modalities. Existing benchmarks (e.g., MVBench, TempCompass, VideoMME) primarily target questions with explicitly visible answers—recognizing actions, objects, and events that are directly observable on screen.
- Limitations of Prior Work: Human video comprehension goes beyond perceiving what appears on screen; it involves inferring relationships that are implied but not directly depicted—for instance, deducing a bullet's trajectory from the direction a character is running, even when the bullet and the target never appear in the same frame. Existing benchmarks rarely cover such implicit reasoning tasks.
- Key Challenge: Current models rely heavily on surface-level visual cues and perform poorly when cross-frame inference is required to recover spatial relationships, causal chains, or social dynamics that are never explicitly shown. Yet no systematic benchmark exists to quantify this capability gap.
- Goal: To construct the first VideoQA benchmark dedicated to implicit visual relational reasoning, systematically evaluate existing models on this task, and define a taxonomy of nine reasoning dimensions.
- Key Insight: Movies and animated works are selected as video sources because such creative content inherently employs narrative techniques (e.g., implied causality, off-screen actions, perspective shifts) that embed implicit reasoning into content comprehension, while preventing explicit visual cues from leaking into the questions.
- Core Idea: By constructing questions around film clips that require inference rather than direct observation, the paper establishes a benchmark that genuinely tests implicit reasoning in video understanding.
## Method

### Overall Architecture
The VRR-QA construction pipeline consists of four stages: (1) selection of 1K creative video clips from diverse films; (2) expert annotation of temporal segments, questions, and answer choices with the in-house FrameQuiz tool; (3) collection of a human baseline by having non-expert annotators answer the questions through the ImplicitEval tool; and (4) preliminary category classification by GPT-4.1, followed by expert re-annotation of the reasoning categories. The final benchmark contains 1K QA pairs spanning 107 films, 15 genres, and seven decades.
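To make the final artifact concrete, here is a minimal sketch of what a single benchmark record could look like, assuming a flat per-item schema; the field names are illustrative and not the paper's released format.

```python
from dataclasses import dataclass

@dataclass
class VRRQAItem:
    """Hypothetical schema for one of the 1K multiple-choice QA pairs."""
    film_id: str                  # which of the 107 source films the clip comes from
    genre: str                    # one of the 15 genres (3D animation, live action, ...)
    clip_path: str                # path to the selected creative video clip
    segment: tuple[float, float]  # annotated temporal segment in seconds (start, end)
    question: str                 # expert-authored question requiring implicit reasoning
    choices: list[str]            # answer options shown to models and human annotators
    answer_index: int             # index of the correct option
    category: str                 # one of the nine implicit reasoning categories
```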
### Key Designs
- Taxonomy of Nine Implicit Reasoning Categories (see the sketch after this list):
  - Function: Provides a systematic classification of implicit visual reasoning capabilities.
  - Mechanism: Covers horizontal spatial reasoning (inferring relative left–right positions of objects), vertical spatial reasoning (above–below relationships), relative depth and distance, perspective and visibility (inferring who can see what), motion and trajectory dynamics, causal and motivational reasoning, implicit counting (requiring cross-frame aggregation of scattered visual evidence), physical and environmental context, and social interaction and relationships.
  - Design Motivation: Ensures comprehensive coverage of the implicit reasoning dimensions involved in human video understanding, so that evaluation results can pinpoint specific model weaknesses.
- Expert End-to-End Annotation:
  - Function: Ensures annotation quality and the implicitness of each question.
  - Mechanism: Unlike benchmarks that rely on template generation or LLM-assisted annotation, all 1K questions in VRR-QA are authored by the paper's authors (computer vision experts) and cross-validated. The annotation tool supports frame-by-frame inspection, temporal segment labeling, and save-and-replay verification.
  - Design Motivation: Expert annotation by CV researchers ensures that each question genuinely probes implicit reasoning rather than surface-level perception; template- and LLM-generated questions tend to produce items answerable from explicit visual cues.
- "No Visual Leakage" Film-Based Video Source:
  - Function: Prevents models from answering questions through direct visual observation.
  - Mechanism: Film clips are selected specifically because they omit direct depictions of the target relationship—for example, a bullet flying toward a princess never appears in the same frame as Mario, so inferring the bullet's displacement relative to Mario requires reasoning about the princess's running direction and the bullet's trajectory. The 1K clips are drawn from 107 films across 15 genres (3D animation, live action, etc.).
  - Design Motivation: The narrative conventions of cinema make implicit reasoning a prerequisite for comprehension rather than an optional enhancement.
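For reference, the nine taxonomy labels above can be written down as a simple enumeration; the category names follow the list in the Mechanism item, while the constant identifiers are my own shorthand.

```python
from enum import Enum

class ReasoningCategory(Enum):
    """The nine implicit reasoning dimensions defined by the VRR-QA taxonomy."""
    HORIZONTAL_SPATIAL = "horizontal spatial reasoning"          # relative left-right positions
    VERTICAL_SPATIAL = "vertical spatial reasoning"              # above-below relationships
    DEPTH_DISTANCE = "relative depth and distance"
    PERSPECTIVE_VISIBILITY = "perspective and visibility"        # who can see what
    MOTION_TRAJECTORY = "motion and trajectory dynamics"
    CAUSAL_MOTIVATIONAL = "causal and motivational reasoning"
    IMPLICIT_COUNTING = "implicit counting"                      # cross-frame evidence aggregation
    PHYSICAL_ENVIRONMENT = "physical and environmental context"
    SOCIAL_INTERACTION = "social interaction and relationships"
```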
### Evaluation Design
Evaluation covers 30+ VideoQA model configurations, including open-source models (LLaVA series, Qwen2-VL, InternVL3, Gemma 3, etc.) and proprietary models (GPT-O3, GPT-5.2, Gemini 3 Flash, Claude 4.5 Sonnet, etc.), examining the effects of varying model scale and number of input frames.
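The paper's exact harness is not reproduced here, but the protocol it describes (multiple-choice answers, a configurable number of uniformly sampled frames) can be sketched roughly as follows; `query_vlm` is a placeholder for whichever open-source or proprietary model API is under test, and the item layout is assumed.

```python
import string
from typing import Callable, Sequence

def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Uniformly spaced frame indices across the clip."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

def format_prompt(question: str, choices: Sequence[str]) -> str:
    """Plain multiple-choice prompt asking for a single option letter."""
    letters = string.ascii_uppercase
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)

def evaluate(items: Sequence[dict], query_vlm: Callable[[list, str], str],
             num_frames: int = 16) -> float:
    """Overall accuracy of one model configuration on the benchmark."""
    correct = 0
    for item in items:  # each item carries pre-decoded frames plus QA metadata
        indices = sample_frame_indices(len(item["frames"]), num_frames)
        frames = [item["frames"][i] for i in indices]
        reply = query_vlm(frames, format_prompt(item["question"], item["choices"]))
        predicted = reply.strip()[:1].upper()   # take the first character as the chosen letter
        correct += predicted == string.ascii_uppercase[item["answer_index"]]
    return correct / len(items)
```

Varying `num_frames` in such a loop corresponds to the frame-count analysis reported below.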
## Key Experimental Results

### Main Results
| Model | Overall Accuracy | Macro Avg. | Horizontal Spatial | Motion & Trajectory | Motivation | Implicit Counting |
|---|---|---|---|---|---|---|
| Human Baseline | 83.0% | 85.6% | 85.4% | 91.9% | 94.4% | 65.9% |
| GPT-O3 | 64.1% | 68.6% | 50.3% | 71.4% | 85.4% | 39.5% |
| Gemini 3 Flash | 61.8% | 67.6% | 52.8% | 73.6% | 86.6% | 48.3% |
| GPT-4.1 | 54.3% | 58.6% | 42.9% | 59.3% | 82.9% | 41.9% |
| InternVL3 (7B) | 43.3% | 50.2% | 34.8% | 51.7% | 64.6% | 34.9% |
| LLaVA-Video (7B) | 42.1% | 46.3% | 36.0% | 60.4% | 62.2% | 14.0% |
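The table reports both an overall accuracy (every question weighted equally) and a macro average (every reasoning category weighted equally), which differ because the nine categories are not the same size. A small sketch of the two aggregations, assuming per-item records with a category label and a correctness flag:

```python
from collections import defaultdict

def overall_and_macro(results: list[dict]) -> tuple[float, float]:
    """results: one {"category": str, "correct": bool} entry per QA item."""
    # Overall (micro) accuracy: fraction of all questions answered correctly.
    overall = sum(r["correct"] for r in results) / len(results)

    # Macro average: mean of the per-category accuracies, so small categories
    # such as implicit counting weigh as much as large ones.
    per_category = defaultdict(list)
    for r in results:
        per_category[r["category"]].append(r["correct"])
    macro = sum(sum(v) / len(v) for v in per_category.values()) / len(per_category)
    return overall, macro
```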
### Analysis
| Analysis Dimension | Finding |
|---|---|
| Reasoning vs. Non-Reasoning Models | The reasoning model GPT-O3 outperforms GPT-4.1 by 9.8 percentage points (64.1% vs. 54.3%), demonstrating that deep reasoning is critical for implicit understanding. |
| Model Scale Effect | Larger variants of GPT-4.1 substantially outperform smaller ones; among open-source models, Qwen2.5-VL-32B offers only marginal gains over its 7B counterpart. |
| Effect of Frame Count | Increasing the number of input frames does not consistently improve performance, indicating that the bottleneck lies in reasoning capability rather than insufficient visual information. |
| Hardest Categories | Implicit counting and horizontal spatial reasoning are the weakest areas for all models, with the largest gaps relative to human performance. |
| Textual Diversity | VRR-QA achieves a mean pairwise similarity (MPS) of 0.161—lower than all comparison benchmarks—indicating the highest question diversity. |
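The mean pairwise similarity figure can be understood as: embed every question, then average the cosine similarity over all question pairs, with a lower value indicating more diverse phrasing. A sketch using TF-IDF vectors as the text representation (the exact embedding the paper uses is not restated here, so treat this as illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_pairwise_similarity(questions: list[str]) -> float:
    """Average cosine similarity over all pairs of questions; lower means more diverse."""
    vectors = TfidfVectorizer().fit_transform(questions)
    sims = cosine_similarity(vectors)            # n x n symmetric matrix, diagonal = 1
    n = len(questions)
    off_diagonal_sum = (sims.sum() - n) / 2.0    # drop self-similarities, count each pair once
    return off_diagonal_sum / (n * (n - 1) / 2)
```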
### Key Findings
- No open-source model exceeds 50% overall accuracy on VRR-QA.
- The reasoning-oriented model GPT-O3 performs best across all categories but still lags substantially behind humans on horizontal spatial reasoning and implicit counting.
- Even the strongest proprietary model trails the human baseline by approximately 19 percentage points.
- Performance varies markedly across categories: social interaction and motivational reasoning are relatively tractable (GPT-O3 reaches 85–86%), while implicit counting is extremely challenging (GPT-O3 achieves only 39.5%).
## Highlights & Insights
- Benchmark Design That Addresses a Critical Gap: VRR-QA is the first VideoQA benchmark dedicated to implicit reasoning. Its design principles—selecting cinematic content, employing expert annotation, and constructing genuinely implicit questions—offer valuable lessons for future benchmark development.
- Fine-Grained Taxonomy: The nine reasoning dimensions provide a clear capability map for future research, enabling precise diagnosis of which reasoning types are most challenging for current models.
- Validation of Reasoning Model Advantages: Experiments clearly demonstrate the critical role of deliberate reasoning capability (e.g., GPT-O3's chain-of-thought reasoning) in implicit understanding, pointing toward a promising direction for future VideoQA model architecture design.
## Limitations & Future Work
- The dataset scale is relatively small (only 1K QA pairs), which may be insufficient to support large-scale training or fine-grained statistical analyses.
- The exclusive use of film footage does not cover implicit reasoning in practical domains such as instructional or surveillance video.
- The multiple-choice format may not fully capture models' open-ended reasoning capabilities.
- No training split or fine-tuning protocol is provided; the benchmark is intended solely for evaluation.
- Future directions include constructing larger-scale implicit reasoning training data and designing pre-training strategies grounded in implicit reasoning.
## Related Work & Insights
- vs. MVBench: MVBench aggregates explicit questions from existing datasets; VRR-QA focuses on original, implicit reasoning questions.
- vs. VideoMME: VideoMME evaluates multimodal understanding (including subtitles and audio); VRR-QA is vision-only and dedicated to implicit reasoning.
- vs. TempCompass: TempCompass tests temporal understanding through algorithmically edited videos; VRR-QA uses natural cinematic content to probe deep reasoning.
## Rating
- Novelty: ⭐⭐⭐⭐ Fills a clear gap in implicit reasoning evaluation for VideoQA; the taxonomy design is well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 30+ model configurations with thorough multi-dimensional analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, vivid examples, and persuasive motivation.
- Value: ⭐⭐⭐⭐ Exposes fundamental deficiencies in current VideoQA models and charts a clear direction for improvement.