Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs¶
- Conference: NeurIPS 2025 (Spotlight)
- arXiv: 2505.21955
- Code: Available (https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding)
- Area: Multimodal VLM / Scene Understanding
- Keywords: Multi-view understanding, egocentric view, exocentric view, scene graph, VQA, CoT
TL;DR¶
This paper proposes E3VQA, the first multi-view (ego–exo) VQA benchmark, together with M3CoT, a training-free prompting technique that fuses three complementary scene graphs into a unified scene representation, to enhance multi-view scene understanding in Large Vision-Language Models (LVLMs); it yields accuracy gains of 4.84% on GPT-4o and 5.94% on Gemini 2.0 Flash.
Background & Motivation¶
Deployment of LVLMs in Interactive Applications¶
Large vision-language models are increasingly deployed in interactive applications such as virtual/augmented reality, where the first-person (egocentric) view provided by head-mounted cameras serves as a key input. However, egocentric views carry inherent limitations:
Narrow field of view: Head-mounted cameras have a limited FOV and cannot capture the full scene.
Lack of global context: Only the local region the user is attending to is visible.
Difficulty in spatial reasoning: Answering questions requiring global spatial information is challenging.
Complementary Value of the Third-Person View¶
The third-person (exocentric) view can provide:

- Global scene layout
- Complete object visibility
- A global view of spatial relationships
- Contextual information about user–environment interaction
Core Problem¶
How can first- and third-person views be effectively fused to enable LVLMs to understand scenes more comprehensively? Existing work almost exclusively relies on a single viewpoint.
Method¶
Overall Architecture¶
```
Input: Synchronized ego-exo image pairs
├── E3VQA Benchmark: 4K high-quality QA pairs
└── M3CoT Reasoning: Three-view scene graph fusion
    ├── Ego Scene Graph
    ├── Exo Scene Graph
    └── Cross-view Scene Graph
→ Unified scene representation → LVLM answer
```
Key Designs¶
1. E3VQA Benchmark¶
Data Construction:

- Synchronized ego-exo image pairs sourced from the Ego-Exo4D dataset
- 4,000 high-quality question–answer pairs covering multiple question types:
    - Spatial reasoning: Where is the object? What are the relative positions?
    - Action understanding: What is the user doing?
    - Causal reasoning: Why is the user attending to this object?
    - Counting and attributes: How many instances of a specific object are in the scene?
- Quality control: Manual annotation with multi-round verification
Question Type Distribution:

- Answerable from the ego view only: ~30%
- Answerable from the exo view only: ~25%
- Requiring both views for a correct answer: ~45%
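As a concrete illustration, a single E3VQA example could be represented with a schema along the following lines; the field names and the example content are hypothetical and do not necessarily match the released annotation format.

```python
from dataclasses import dataclass

# Hypothetical schema for one E3VQA example; field names are illustrative
# and may differ from the released annotation format.
@dataclass
class E3VQAItem:
    ego_image: str       # path to the synchronized egocentric frame
    exo_image: str       # path to the synchronized exocentric frame
    question: str
    answer: str
    question_type: str   # e.g. "spatial", "action", "causal", "counting"

example = E3VQAItem(
    ego_image="ego/frame_000123.jpg",
    exo_image="exo/frame_000123.jpg",
    question="Where is the cutting board relative to the user?",
    answer="On the counter to the user's left.",
    question_type="spatial",
)
```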
2. M3CoT (Multi-view Multi-modal Chain-of-Thought)¶
M3CoT is a training-free prompting technique whose core contribution is constructing a unified multi-view scene representation:
Step 1: Generate Three Scene Graphs
- Ego Scene Graph \(G_e\): Extracted from the egocentric image
    - Nodes: detected objects
    - Edges: spatial relations (left/right/above/below/front/behind) + interaction relations
- Exo Scene Graph \(G_x\): Extracted from the third-person image
    - Nodes: globally visible objects
    - Edges: global spatial relations
- Cross-view Scene Graph \(G_c\): Cross-view associations
    - Nodes: objects co-visible in both views
    - Edges: cross-view correspondences and complementary information
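A minimal sketch of what the three graphs might contain, encoded as (subject, relation, object) triplets; the objects and relations below are invented for illustration, and in M3CoT the graphs are produced by prompting the LVLM rather than written by hand.

```python
# Illustrative (subject, relation, object) triplets for the three scene graphs;
# in M3CoT these are extracted by the LVLM itself, not hand-written.
ego_graph = [
    ("user's hand", "holding", "knife"),
    ("knife", "above", "cutting board"),
]
exo_graph = [
    ("user", "standing at", "kitchen counter"),
    ("cutting board", "left of", "sink"),
]
cross_view_graph = [
    ("cutting board (ego)", "same object as", "cutting board (exo)"),
    ("knife (ego)", "not visible in", "exo view"),
]
```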
Step 2: Fuse into a Unified Representation
The three scene graphs are merged into a single textualized unified scene description, which is provided as additional context to the LVLM.
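One plausible way to flatten the three graphs into such a textual description is sketched below; this is an assumption about the general idea, not the authors' implementation, and the example triplets are invented.

```python
def textualize_scene(graphs: dict) -> str:
    """Flatten per-view triplet lists into one unified textual scene description."""
    lines = []
    for view, triplets in graphs.items():
        lines.append(f"[{view} view]")
        lines.extend(f"- {subj} {rel} {obj}" for subj, rel, obj in triplets)
    return "\n".join(lines)

# Tiny self-contained example (triplets are invented for illustration).
unified_scene = textualize_scene({
    "ego": [("user's hand", "holding", "knife")],
    "exo": [("cutting board", "left of", "sink")],
    "cross-view": [("cutting board (ego)", "same object as", "cutting board (exo)")],
})
print(unified_scene)
```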
Step 3: Chain-of-Thought Reasoning
The fused scene representation guides the LVLM through multi-step reasoning:
```
[Ego Image] + [Exo Image] + [Unified Scene Representation]
→ Step 1: Understand the information provided by each view
→ Step 2: Identify the complementary relationship between the two views
→ Step 3: Synthesize and answer the question
```
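This reasoning flow can be expressed as a prompt template along the following lines; the wording is a guess at the style of such a prompt, not the paper's exact prompt text.

```python
# Illustrative prompt template for the three-step reasoning; not the paper's exact wording.
M3COT_PROMPT = """You are given an egocentric image, an exocentric image of the same moment,
and a unified scene description built from both views.

Unified scene description:
{unified_scene}

Question: {question}

Reason step by step:
1. Summarize what each view shows on its own.
2. Identify how the two views complement each other for this question.
3. Combine the evidence from both views and give the final answer.
"""

prompt = M3COT_PROMPT.format(
    unified_scene="[ego view]\n- user's hand holding knife",
    question="Where is the cutting board relative to the user?",
)
```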
Loss & Training¶
M3CoT is a completely training-free method:

- Leverages the in-context learning capability of LVLMs
- Implemented solely through carefully designed prompts
- Applicable to any LVLM that supports multi-image input
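Because the method only requires prompting a model that accepts multiple images, it can sit in front of any such API. Below is a minimal sketch using the OpenAI Python client as one possible backend; the helper names, model choice, and call details are assumptions for illustration, not the authors' evaluation code.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def ask_m3cot(ego_path: str, exo_path: str, prompt: str) -> str:
    """Send both views plus an M3CoT-style prompt to a multi-image LVLM."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": to_data_url(ego_path)}},
                {"type": "image_url", "image_url": {"url": to_data_url(exo_path)}},
            ],
        }],
    )
    return resp.choices[0].message.content
```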
Key Experimental Results¶
Main Results¶
Accuracy (%) of different LVLMs on the E3VQA benchmark (parenthesized values in the last column are gains over standard CoT):
| Model | Ego Only | Exo Only | Ego+Exo (no CoT) | +Standard CoT | +M3CoT (Ours) |
|---|---|---|---|---|---|
| GPT-4o | 52.3 | 48.7 | 58.4 | 61.2 | 66.0 (+4.84) |
| Gemini 2.0 Flash | 49.8 | 46.2 | 55.1 | 58.3 | 64.2 (+5.94) |
| Claude 3.5 Sonnet | 50.1 | 47.3 | 56.8 | 59.4 | 63.8 (+4.42) |
| LLaVA-1.5-13B | 38.2 | 35.6 | 42.1 | 44.7 | 48.3 (+3.58) |
| InternVL2-8B | 41.5 | 38.9 | 45.3 | 47.6 | 51.2 (+3.61) |
Performance gain by question type (GPT-4o):
| Question Type | Ego+Exo Baseline | +M3CoT | Gain |
|---|---|---|---|
| Spatial reasoning | 55.2 | 63.8 | +8.6 |
| Action understanding | 62.1 | 67.4 | +5.3 |
| Causal reasoning | 54.8 | 62.1 | +7.3 |
| Counting and attributes | 61.5 | 64.7 | +3.2 |
| Ego-only questions | 68.3 | 70.1 | +1.8 |
| Two-view questions | 48.6 | 58.2 | +9.6 |
Ablation Study¶
Contribution of each M3CoT component (GPT-4o):
| Configuration | Accuracy | Gain |
|---|---|---|
| Baseline (Ego+Exo, no CoT) | 58.4 | — |
| + Ego Scene Graph only | 61.5 | +3.1 |
| + Exo Scene Graph only | 60.8 | +2.4 |
| + Cross-view Scene Graph only | 62.3 | +3.9 |
| + Ego + Exo Scene Graphs | 63.7 | +5.3 |
| + Full M3CoT (all three) | 66.0 | +7.6 |
Key Findings¶
- Multi-view significantly outperforms single-view: Ego+Exo input exceeds Ego-only by roughly 4–7%, and M3CoT adds a further 3.6–5.9% over standard CoT prompting.
- Cross-view scene graph contributes the most: Cross-view association is more valuable than either single-view scene graph alone.
- Spatial reasoning benefits the most among content categories: Questions requiring global spatial information gain the most from multi-view fusion (+8.6%).
- Two-view questions see the largest improvement: M3CoT achieves a gain of up to 9.6% on this question category.
- Closed-source models benefit more: GPT-4o and Gemini obtain larger gains from M3CoT than open-source models.
Highlights & Insights¶
- First multi-view VQA benchmark: E3VQA fills the gap in evaluation of joint ego-exo scene understanding.
- Training-free improvement: M3CoT requires no additional training and is plug-and-play.
- Spotlight acceptance: Reflects high-quality problem formulation and systematic evaluation.
- Practical applicability: Directly applicable to intelligent assistants in AR/VR scenarios.
- Revealing LVLM limitations: Systematic evaluation exposes shortcomings of existing LVLMs in multi-view reasoning.
Limitations & Future Work¶
- Limited dataset scale: 4K QA pairs are relatively small and may not cover all scene types.
- Scene graph extraction quality: The effectiveness of M3CoT depends on the accuracy of scene graph extraction.
- Two-image constraint: Real AR/VR scenarios may involve more than two viewpoints.
- Static images only: Temporal information from video sequences is not considered.
- Computational overhead: M3CoT requires an additional scene graph extraction step, increasing inference latency.
Related Work & Insights¶
- Ego-Exo4D: Provides synchronized ego-exo video data, serving as the data foundation for E3VQA.
- Visual Question Answering (VQA): Classic single-view VQA benchmarks such as VQAv2 and GQA.
- Scene graph generation: Datasets such as Visual Genome have advanced scene graph generation techniques.
- Multi-view understanding: Classic multi-view learning methods; this paper is the first to integrate them with LVLMs.
- Future directions: Video-level multi-view reasoning, fusion of more than two views, and end-to-end training approaches.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First ego-exo joint VQA benchmark + training-free multi-view fusion method
- Theoretical Depth: ⭐⭐⭐ — Primary contributions are methodological and benchmark-oriented
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multiple models, multiple question types, thorough ablation
- Practical Impact: ⭐⭐⭐⭐⭐ — Direct benefit for AR/VR applications
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with rich visualizations