Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding
Conference: CVPR 2026
Paper: CVF Open Access
Code: To be confirmed
Area: Human Understanding / Multimodal VLM
Keywords: Cross-view understanding, MLLM evaluation benchmark, Multi-view fusion, Single-view bias, Spatio-temporal reasoning
TL;DR
Existing MLLM benchmarks assume "single-view sufficiency" and therefore reward only single-image recognition. To close this loophole, the work constructs CVBench: 3,000 human-understanding questions, each verified to be unsolvable from any single view but solvable across views (12 spatio-temporal tasks, 4 synchronized cameras). Even the strongest models trail humans by more than 55 points (38.6% vs. 94.4% on spatial tasks, 34.9% vs. 93.5% on temporal tasks), and the error analysis identifies a failure mechanism shared by the analyzed models: "single-view bias."
Background & Motivation
Background: Human perception of social scenes is inherently a multi-view integration problem—the same scene is observed by multiple cameras from different angles over time, and humans easily fuse complementary or occluded visual cues into a consistent understanding of "who is who, what they are doing, and how they interact." However, modern MLLM evaluations are almost entirely built on single-view scenarios, where even recent video benchmarks assume "the given single image/video contains all information necessary to answer the question."
Limitations of Prior Work: This "sufficient-view" paradigm only rewards recognition or temporal reasoning within a single continuous visual stream and does not assess cross-view fusion. Although many MLLMs accept multi-image inputs, they have not been systematically tested on whether they can synthesize complementary, or even conflicting, information. Consequently, models frequently fail in real-world multi-camera environments: they confuse similar-looking people, double-count the same person across cameras, misjudge contact under depth ambiguity, or fail to predict actions from partial limb views. These failure modes matter in security, sports analysis, and human-robot collaboration.
Key Challenge: Existing multi-image/video VQA benchmarks permit "cherry-picking": a model can score by finding the easiest view that contains the answer, without being penalized for ignoring contradictory evidence or for failing to synthesize a 3D-consistent explanation. It therefore remains unclear whether models perform true cross-view fusion or merely select the best single view.
Goal: Formalize "cross-view human understanding" as a core but severely undervalued MLLM capability and construct a benchmark that verifiably mandates multi-view synthesis and diagnoses failure causes.
Key Insight: The authors adopt an operational construction principle: to test cross-view fusion, every question must be unsolvable from any single view and must require merging two or more views to disambiguate. This turns "cherry-picking" into a failing strategy.
Core Idea: Use "verifiable single-view insufficiency" as a hard constraint to rebuild human understanding benchmarks. Combined with hard negatives tailored to failure modes, this upgrades the benchmark from "permitting multi-view input" to "mandating multi-view synthesis."
Method
Overall Architecture
CVBench is an evaluation benchmark rather than a model, so its "method" is a data-construction procedure designed to be both challenging and fair. The benchmark is organized along two complementary axes: the Spatial Axis (same time, different cameras; tests identity association, de-duplicated counting, and fine-grained contact/occlusion reasoning) and the Temporal Axis (multiple timestamps, multiple cameras; tests identity continuity, action recognition, and motion prediction). Each axis is further divided by granularity into coarse-grained (scene-level interpretation) and fine-grained (limb/action precision) tasks, for 12 tasks in total. All clips are standardized to 4 synchronized views and last 6–30 seconds. Each question requires maintaining identity and fusing complementary cues, and is annotated with ground-truth "evidence spans" that localize where the answer is resolved.
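The structural constraints described above (two axes, two granularities, 4 views, 6–30 second clips, evidence spans) can be pictured as a record with a validation step. This is a minimal sketch only: the class names, field names, and the rule that evidence must cover at least two views are illustrative assumptions, not the released CVBench schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Axis(Enum):
    SPATIAL = "spatial"    # same time, different cameras
    TEMPORAL = "temporal"  # multiple timestamps, multiple cameras


class Granularity(Enum):
    COARSE = "coarse"  # scene-level interpretation
    FINE = "fine"      # limb/action precision


@dataclass
class EvidenceSpan:
    view: int       # camera index, 0..3 (hypothetical convention)
    start_s: float  # span start within the clip, in seconds
    end_s: float    # span end within the clip, in seconds


@dataclass
class CVBenchItem:
    question: str
    options: List[str]
    answer_index: int
    axis: Axis
    granularity: Granularity
    clip_seconds: float
    evidence: List[EvidenceSpan] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of violated constraints (empty if the item is well-formed)."""
        problems = []
        if len(self.options) != 5:
            problems.append("expected 5 options")
        if not 0 <= self.answer_index < len(self.options):
            problems.append("answer_index out of range")
        if not 6 <= self.clip_seconds <= 30:
            problems.append("clip length outside 6-30 s")
        for span in self.evidence:
            if not 0 <= span.view < 4:
                problems.append("evidence view outside the 4 cameras")
            if not 0 <= span.start_s < span.end_s <= self.clip_seconds:
                problems.append("evidence span outside the clip")
        # Assumption: a cross-view question needs evidence from two or more views.
        if len({span.view for span in self.evidence}) < 2:
            problems.append("evidence should cover at least two views")
        return problems
```

An item whose evidence comes from a single camera would be reported by `validate()`, which mirrors the requirement that two or more views be merged.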
The construction pipeline is a serial process that runs from the source videos to the final question set:
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Multi-camera Human Video Sources<br/>8 Datasets (EgoExo4D / M3GYM / WILDTRACK, etc.)"] --> B["Time Alignment + 4 Orthogonal Views Selection<br/>SoM Labeling for Target Individuals"]
B --> C["SVI + CVR Dual Principle Verification<br/>Single-View Unsolvable ∧ Cross-View Solvable"]
C --> D["12 Tasks across Two Axes<br/>Spatial/Temporal × Coarse/Fine Granularity"]
D --> E["QA Generation + Failure-Driven Hard Negatives<br/>5 Options including 'None of the above'"]
E --> F["Three-Stage Quality Control<br/>Peer Review → Human Testing → Blind Text Validation"]
F --> G["CVBench: 3,000 Questions"]
Key Designs
1. SVI + CVR Principles: Making "Cherry-Picking" a Failed Strategy
This is the fundamental difference between CVBench and prior multi-image benchmarks, which allowed multi-view input but did not prevent models from answering from a single easy view. Each candidate item must pass two checks. Single-View Insufficiency (SVI): annotators must confirm that no single view answers the question unambiguously. Cross-View Resolvability (CVR): annotators must confirm that merging views makes the answer uniquely determinable. Together, the two checks target "single-view bias" by eliminating any question that a single perspective can solve.
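The two checks can be read as a conjunction over annotator judgments. The sketch below is one possible encoding: the `AnnotatorJudgment` record and the requirement that all annotators agree are assumptions made for illustration, since the summary does not specify how disagreements are resolved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatorJudgment:
    # One flag per camera view: can the question be answered from that view alone?
    solvable_from_view: List[bool]
    # Can the question be answered uniquely once all views are merged?
    solvable_from_all_views: bool


def passes_svi(j: AnnotatorJudgment) -> bool:
    """Single-View Insufficiency: no individual view suffices."""
    return not any(j.solvable_from_view)


def passes_cvr(j: AnnotatorJudgment) -> bool:
    """Cross-View Resolvability: the merged views give a unique answer."""
    return j.solvable_from_all_views


def accept_item(judgments: List[AnnotatorJudgment]) -> bool:
    """Keep an item only if every annotator confirms SVI and CVR.

    Unanimity is an assumption of this sketch, not a documented rule.
    """
    return bool(judgments) and all(passes_svi(j) and passes_cvr(j) for j in judgments)


# Example: the second annotator finds view 3 sufficient, so the item is rejected.
good = AnnotatorJudgment([False, False, False, False], True)
leaky = AnnotatorJudgment([False, False, True, False], True)
print(accept_item([good, good]))   # True
print(accept_item([good, leaky]))  # False
```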
2. 12-Task Taxonomy: Decomposing Cross-View Understanding into Diagnosable Sub-abilities
The taxonomy spans the "Spatial × Temporal" and "Coarse × Fine Granularity" axes. Spatial-coarse tasks include cross-view de-duplicated counting and identity association. Spatial-fine tasks include limb occlusion and contact recognition (requiring sub-centimeter geometric disambiguation). Temporal tasks involve trajectory summarization and motion recognition across discontinuous fields of view. The dataset contains 3,000 items: 1,508 spatial and 1,492 temporal.
3. Failure-Driven Hard Negative Construction
Distractors are constructed semi-automatically from common failure modes. For counting tasks, distractors include counts derived from partial views: if View 1 sees 2 people and View 2 sees 2 people while the global truth is 3, both "2" and the naive sum "4" are included as distractors. For temporal tasks, distractors include reversed action sequences and over-generalized verbs. Each question has 5 options, including "None of the above," to discourage guessing based on plausibility.
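The counting example above can be turned into a small generator. This is a hedged sketch, not the authors' tooling: the function name, the padding rule for missing distractors, and the fixed option order are all assumptions. In this sketch "None of the above" is always a distractor, and shuffling is omitted to keep the output deterministic.

```python
from typing import List


def counting_options(per_view_counts: List[int], true_count: int,
                     n_options: int = 5) -> List[str]:
    """Build answer options whose distractors mirror single-view failure modes."""
    n_numeric_distractors = n_options - 2  # reserve slots for truth and "None of the above"

    # Failure-mode distractors: the naive sum first, then each per-view count.
    candidates: List[int] = []
    for c in [sum(per_view_counts)] + per_view_counts:
        if c != true_count and c not in candidates:
            candidates.append(c)

    # Assumption: pad with counts adjacent to the truth when too few distractors exist.
    offset = 1
    while len(candidates) < n_numeric_distractors:
        for c in (true_count + offset, true_count - offset):
            if (c >= 0 and c != true_count and c not in candidates
                    and len(candidates) < n_numeric_distractors):
                candidates.append(c)
        offset += 1

    numeric = sorted(candidates[:n_numeric_distractors] + [true_count])
    return [str(c) for c in numeric] + ["None of the above"]


# The example from the text: two views see 2 people each, the true count is 3.
print(counting_options([2, 2], 3))
# ['2', '3', '4', '5', 'None of the above']
```

Here "2" is the partial-view count and "4" is the naive sum, as in the text; "5" is a padding distractor introduced only by this sketch.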
4. Three-Stage Quality Control
To ensure that answers depend on visual grounding rather than world knowledge, the benchmark applies three stages. Stage 1: Peer Review for SVI/CVR validation. Stage 2: Human Testing to establish a performance upper bound (~94%). Stage 3: Blind Text Validation, where text-only LLMs attempt the questions. If a model can answer correctly via linguistic priors, the question is rewritten.
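Stage 3 amounts to a filter over the question set. The sketch below shows one way such a filter could look; the item layout, the `TextOnlyModel` callable type, and the `min_correct` threshold are hypothetical, and the stub model stands in for a real text-only LLM call. With 5 options a text-only model is right by chance about 20% of the time, so a practical filter would likely require repeated or multi-model agreement.

```python
from typing import Callable, Dict, List

# Hypothetical item layout: {"question": str, "options": [str, ...], "answer_index": int}
Item = Dict[str, object]
# A text-only model sees the question and options (no video) and returns an option index.
TextOnlyModel = Callable[[str, List[str]], int]


def flag_for_rewrite(items: List[Item], models: List[TextOnlyModel],
                     min_correct: int = 1) -> List[Item]:
    """Return the items that at least `min_correct` text-only models answer correctly."""
    flagged = []
    for item in items:
        hits = sum(
            1 for model in models
            if model(item["question"], item["options"]) == item["answer_index"]
        )
        if hits >= min_correct:
            flagged.append(item)
    return flagged


# Stub standing in for a real text-only LLM: it always picks the first option.
def always_first(question: str, options: List[str]) -> int:
    return 0


items = [
    {"question": "How many distinct people are in the scene?",
     "options": ["2", "3", "4", "5", "None of the above"], "answer_index": 1},
    {"question": "Which person hands over the ball?",
     "options": ["Person A", "Person B", "Person C", "Person D", "None of the above"],
     "answer_index": 0},
]
print([it["question"] for it in flag_for_rewrite(items, [always_first])])
# ['Which person hands over the ball?']
```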
Key Experimental Results
Main Results
Spatial Tasks (Table 2; accuracy in %, "All" = overall accuracy):
| Model | Category | Coarse-grained | Fine-grained | All |
|---|---|---|---|---|
| Qwen2.5-VL-7B | Open-source | 29.9 | 24.4 | 27.2 |
| InternVL3-78B | Best Open-source | 38.5 | 31.1 | 34.8 |
| Gemini-2.5-Flash | Closed-source | 40.5 | 33.6 | 37.1 |
| GPT-5 | Closed-source | 41.2 | 35.5 | 38.4 |
| Gemini-2.5-Pro | Best Closed-source | 40.9 | 36.3 | 38.6 |
| Human | Baseline | 96.7 | 92.1 | 94.4 |
| Random / Blind | Baseline | — | — | 20.0 / 18.5 |
Temporal Tasks (Table 3, %):
| Model | Category | Coarse-grained | Fine-grained | All |
|---|---|---|---|---|
| Qwen2.5-VL-7B | Open-source | 28.6 | 23.5 | 26.0 |
| InternVL3-78B | Best Open-source | 35.3 | 30.3 | 32.8 |
| GPT-5 | Closed-source | 35.6 | 31.8 | 33.7 |
| Gemini-2.5-Pro | Best Closed-source | 35.8 | 33.9 | 34.9 |
| Human | Baseline | 95.7 | 91.4 | 93.5 |
| Random / Blind | Baseline | — | — | 20.0 / 21.3 |
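
The gap between the best model and the human baseline can be read directly off the "All" columns. The snippet below is a small worked computation over the rows reproduced in the two tables above; the dictionary layout and function name are illustrative only.

```python
# "All" accuracies (%) copied from the spatial and temporal tables above.
scores = {
    "spatial": {
        "Qwen2.5-VL-7B": 27.2, "InternVL3-78B": 34.8, "Gemini-2.5-Flash": 37.1,
        "GPT-5": 38.4, "Gemini-2.5-Pro": 38.6,
    },
    "temporal": {
        "Qwen2.5-VL-7B": 26.0, "InternVL3-78B": 32.8,
        "GPT-5": 33.7, "Gemini-2.5-Pro": 34.9,
    },
}
human = {"spatial": 94.4, "temporal": 93.5}
CHANCE = 20.0  # random baseline for 5-option questions


def best_model_gap(axis: str):
    """Return (best model, gap to human, margin over chance) in percentage points."""
    best_model = max(scores[axis], key=scores[axis].get)
    best = scores[axis][best_model]
    return best_model, round(human[axis] - best, 1), round(best - CHANCE, 1)


for axis in ("spatial", "temporal"):
    print(axis, best_model_gap(axis))
# spatial ('Gemini-2.5-Pro', 55.8, 18.6)
# temporal ('Gemini-2.5-Pro', 58.6, 14.9)
```

On both axes the best listed model sits closer to the 20% chance level than to human performance.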
Ablation Study
A manual review of 500 failure cases (250 spatial + 250 temporal) assigned each case a primary error cause (Table 4; % of failures within each domain):
| Failure Category | Domain | InternVL3-78B | GPT-5 |
|---|---|---|---|
| Single-View Bias | Spatial | 42.0 | 38.8 |
| Geometric Failure | Spatial | 38.4 | 35.2 |
| Identity Confusion | Spatial | 14.8 | 22.0 |
| Temporal Incoherence | Temporal | 44.8 | 40.4 |
| Identity Confusion | Temporal | 35.2 | 37.6 |
| Single-View Bias | Temporal | 15.6 | 16.8 |
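
The percentages in the table above can be converted into case counts and checked for completeness. This is a worked computation under a stated assumption: that the 250 cases per domain apply to each model separately. The summary does not say so explicitly, although every percentage then maps to a whole number of cases. The share not covered by the three listed categories is reported as "unlisted"; what it contains is not stated in the summary.

```python
# Failure shares (%) copied from Table 4 above.
table4 = {
    ("Spatial", "InternVL3-78B"): {"Single-View Bias": 42.0, "Geometric Failure": 38.4,
                                   "Identity Confusion": 14.8},
    ("Spatial", "GPT-5"): {"Single-View Bias": 38.8, "Geometric Failure": 35.2,
                           "Identity Confusion": 22.0},
    ("Temporal", "InternVL3-78B"): {"Temporal Incoherence": 44.8, "Identity Confusion": 35.2,
                                    "Single-View Bias": 15.6},
    ("Temporal", "GPT-5"): {"Temporal Incoherence": 40.4, "Identity Confusion": 37.6,
                            "Single-View Bias": 16.8},
}
CASES_PER_DOMAIN = 250  # assumption: 250 reviewed failures per domain, per model


def summarize(shares):
    """Convert shares to case counts and report the share not covered by the listed rows."""
    counts = {name: round(pct * CASES_PER_DOMAIN / 100) for name, pct in shares.items()}
    unlisted_pct = round(100 - sum(shares.values()), 1)
    return counts, unlisted_pct


for key, shares in table4.items():
    print(key, summarize(shares))
```

Under this assumption, for example, 42.0% of InternVL3-78B's spatial failures corresponds to 105 of 250 cases, and between 4.0% and 5.2% of failures in each column fall outside the three listed categories.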
Key Findings
- "Single-View Bias" is a systematic defect: Roughly 40% of spatial failures (42.0% for InternVL3-78B, 38.8% for GPT-5) stem from models "cherry-picking" a single view instead of fusing views. Models often rely on a confident but incorrect single-view prediction rather than using the other views as corrective signals.
- Weak Geometric/Physical Reasoning: Geometric failures account for 35–38% of spatial failures (38.4% for InternVL3-78B, 35.2% for GPT-5): models cannot distinguish true contact from proximity and rely on view-dependent 2D pixel adjacency.
- Temporal Double-Counting: Models often recount the same action when a person moves between viewpoints.
- Robustness to Language Priors: The "Blind" baseline (18.5% spatial, 21.3% temporal) is close to the 20% random baseline, indicating that the questions cannot be answered from text alone and require visual grounding.
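Whether the blind baseline is "near" chance can be checked with a one-sample z-score against the 20% chance level. The sketch below assumes that the blind baseline was evaluated on every item of each axis (1,508 spatial, 1,492 temporal) and that items are independent; neither is stated in the summary.

```python
import math


def z_vs_chance(accuracy_pct: float, n_items: int, n_options: int = 5) -> float:
    """z-score of an observed accuracy against the uniform-guessing rate."""
    p0 = 1 / n_options                         # chance accuracy, 0.2 for 5 options
    se = math.sqrt(p0 * (1 - p0) / n_items)    # binomial standard error under chance
    return (accuracy_pct / 100 - p0) / se


# Blind-baseline accuracies from the tables; item counts from the taxonomy description.
z_spatial = z_vs_chance(18.5, 1508)
z_temporal = z_vs_chance(21.3, 1492)
print(round(z_spatial, 2), round(z_temporal, 2))
```

Both values come out at roughly -1.5 and +1.3, inside the usual ±1.96 band, so under these assumptions the blind baseline is statistically indistinguishable from random guessing.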
Highlights & Insights
- Formalizing "Evaluation Loopholes" into Hard Constraints: The SVI \(\land\) CVR principles verifiably block the "cherry-picking" escape route.
- Isomorphism between Distractors and Failure Modes: Distractors explicitly target "naive summation" or "action reversal," mapping scores directly to specific capability deficits.
- Evidence-Span Supervision: Metadata on which frames resolve the ambiguity lets the benchmark help diagnose why a model failed, not just whether it was wrong.
Limitations & Future Work
- The authors emphasize diagnosis over solutions; CVBench identifies shortfalls but does not propose a specific cross-view architecture.
- The 4-view setup with 6–30 second clips is fixed; whether conclusions generalize to more cameras or longer durations remains to be verified.
- The multiple-choice format, while reproducible, may not capture subtle reasoning nuances.
Related Work & Insights
- vs. Video VQA: Standard video benchmarks reward single-stream recognition, whereas CVBench requires that no single stream be sufficient.
- vs. Classic Multi-view Datasets: CVBench bridges the gap between structured multi-view data (like MOT or 3D pose) and modern MLLM language-grounded evaluation.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Formalizes SVI as a verifiable constraint for cross-view assessment.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive testing of 11 models across 12 tasks with human baselines.
- Writing Quality: ⭐⭐⭐⭐ Clear logic from motivation to diagnostic results.
- Value: ⭐⭐⭐⭐⭐ Provides a crucial diagnostic tool for the next generation of spatio-temporal MLLMs.