
Learning Multi-View Spatial Reasoning from Cross-View Relations

Conference: CVPR 2026 arXiv: 2603.27967 Code: https://cross-view-relations.github.io Area: 3D Vision Keywords: multi-view spatial reasoning, cross-view relations, vision-language models, robotic manipulation, dataset construction

TL;DR

XVR (Cross-View Relations) is a large-scale multi-view visual question answering dataset of 100K samples. By explicitly training VLMs on three categories of tasks (correspondence, verification, and viewpoint localization), XVR significantly improves cross-view spatial reasoning, yielding notable gains on both multi-view benchmarks and robotic manipulation tasks.

Background & Motivation

Vision-language models (VLMs) excel at single-view visual tasks but are severely limited in multi-view spatial reasoning, which is essential for robotic systems to understand 3D environments and perform cross-view manipulation.

  1. Single-view limitations: Existing spatial reasoning datasets and benchmarks are almost exclusively single-view, suffering from limited information and frequent occlusions.
  2. Shallow multi-view understanding: Even datasets with multi-view data (e.g., AllAnglesBench) focus only on "what objects are visible" in each view, rather than the geometric relationships between views.
  3. Lack of explicit cross-view supervision: Without explicit cross-view relational training, VLMs tend to produce predictions that appear locally plausible within individual views but are geometrically inconsistent across views.

Key Insight: Inspired by the Structure-from-Motion (SfM) pipeline, which integrates multi-view information through three key steps—establishing correspondences, verifying geometric consistency, and estimating camera poses—the authors translate these steps into three categories of cross-view supervision tasks to construct the XVR dataset and directly train VLMs for cross-view reasoning.

Method

Overall Architecture

  • Input: Multi-view images from calibrated multi-view captures (general domain) and robotic manipulation trajectories (robot domain).
  • Data generation pipeline: Visual QA samples in multiple-choice format are automatically generated using 3D geometric information and spatiotemporal metadata.
  • Output: 100K training samples + 1,866 test samples (XVR-Eval), covering 8 specific task types with an average of 4.32 images per sample.
  • Fine-tuning Qwen3-VL-2B on XVR yields substantial improvements.

Key Designs

  1. Three Cross-View Reasoning Categories:

    • Function: Provide structured cross-view relational supervision signals.
    • Mechanism: (a) Correspondence: includes point correspondence (matching the same 3D point across views) and directional correspondence (aligning directional arrows across views). (b) Verification: includes spatial verification (detecting 3D spatial inconsistencies across views) and temporal verification (identifying temporally discontinuous frames in a sequence). (c) Localization: includes four subtasks—viewpoint localization, directional view localization, cross-scene localization, and language-conditioned localization. In total, 8 task types.
    • Design Motivation: Directly mirrors the three core steps of SfM—establishing correspondences, verifying geometric consistency, and estimating camera poses—which constitute the foundational capabilities for multi-view 3D understanding.
  2. Dual-Domain Generation Pipeline:

    • Function: Automatically generate high-quality cross-view QA samples at scale.
    • Mechanism: The general domain leverages calibrated multi-view RGB-D captures from the WildRGB-D dataset, generating precise correspondence and localization tasks via 3D-to-2D projection. 3D points and camera positions are sampled and projected onto multiple views; spatially separated distractors are generated to ensure non-trivial questions. The robot domain leverages manipulation trajectory data from OXE and AgiBot-World, generating verification and localization tasks based on spatiotemporal metadata and camera identifiers. SSIM filtering is applied to ensure temporal differences are visually distinguishable. (Both the projection-based QA construction and the SSIM filter are sketched after this list.)
    • Design Motivation: The two data sources are complementary—the general domain provides precise geometric supervision, while the robot domain contributes rich viewpoint variation and temporal dynamics.
  3. VLA Downstream Transfer:

    • Function: Transfer cross-view reasoning capabilities to robotic manipulation.
    • Mechanism: Qwen3-VL-2B-XVR (fine-tuned on XVR) serves as the vision-language backbone for a VLA model, augmented with a diffusion action head (following the GR00T-N1.5 architecture). The system is trained and evaluated on Franka Emika arm manipulation tasks in the RoboCasa simulation environment.
    • Design Motivation: To validate that cross-view spatial reasoning not only improves perceptual capability but also directly translates into gains in embodied manipulation performance.
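
As a concrete illustration of the general-domain pipeline, here is a minimal sketch of how a point-correspondence question could be generated from calibrated cameras via 3D-to-2D projection. It assumes a simple pinhole model and dict-style camera entries; the function names, the ±200 px sampling window, and the 50 px distractor-separation threshold are illustrative assumptions rather than the paper's exact parameters, and a real pipeline would additionally check visibility, occlusion, and image bounds.

```python
import numpy as np

def project(point_3d, K, R, t):
    """Pinhole projection of a 3D world point into a calibrated camera (K, R, t)."""
    p_cam = R @ np.asarray(point_3d) + t      # world -> camera coordinates
    uv = K @ p_cam
    return uv[:2] / uv[2]                     # perspective divide -> pixel coordinates

def make_point_correspondence_qa(point_3d, views, num_options=4, min_sep_px=50.0, rng=None):
    """Build one multiple-choice sample: a 3D point is marked in a query view and
    the model must pick its pixel location in a different target view.
    `views` is a list of dicts with keys 'K', 'R', 't' (calibrated cameras)."""
    rng = rng or np.random.default_rng()
    idx_a, idx_b = rng.choice(len(views), size=2, replace=False)
    query_px = project(point_3d, **views[idx_a])    # marked location shown in the query view
    answer_px = project(point_3d, **views[idx_b])   # ground-truth location in the target view

    # Distractors are kept spatially separated from the true answer so the
    # question is non-trivial but still has a single unambiguous correct option.
    distractors = []
    while len(distractors) < num_options - 1:
        candidate = answer_px + rng.uniform(-200, 200, size=2)
        if np.linalg.norm(candidate - answer_px) >= min_sep_px:
            distractors.append(candidate)

    order = rng.permutation(num_options)
    options = [(distractors + [answer_px])[i] for i in order]
    correct = "ABCDEFGH"[int(np.where(order == num_options - 1)[0][0])]

    return {
        "views": (int(idx_a), int(idx_b)),
        "query_pixel": tuple(np.round(query_px).astype(int)),
        "options": [tuple(np.round(o).astype(int)) for o in options],
        "answer": correct,
    }
```

The same projection machinery, applied to sampled camera positions instead of scene points, would underpin the localization-style questions.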
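For the robot domain, the SSIM filter mentioned above could be as simple as the check below; the 0.9 threshold and the helper name are assumptions, and scikit-image's structural_similarity stands in for whatever implementation the authors used.

```python
import numpy as np
from skimage.metrics import structural_similarity

def visually_distinct(frame_a: np.ndarray, frame_b: np.ndarray, max_ssim: float = 0.9) -> bool:
    """Keep a frame pair for temporal verification only if the two frames differ
    enough to be told apart: an SSIM near 1.0 means near-identical images, which
    would make the 'spot the temporally inconsistent frame' question ambiguous."""
    # Assumes uint8 RGB frames of shape (H, W, 3); data_range=255 matches that dtype.
    score = structural_similarity(frame_a, frame_b, channel_axis=-1, data_range=255)
    return score <= max_ssim
```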

Loss & Training

Fine-tuning employs a standard multiple-choice VQA loss. Key data quality control measures include: retaining only samples with point cloud density ≥1M in the general domain; and keeping only sequences with ≥3 cameras, ≥20-second trajectories, and sufficient motion dynamics in the robot domain. XVR-Eval is constructed from data sources unseen during training to ensure generalization.
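
The "standard multiple-choice VQA loss" is the usual supervised fine-tuning objective: next-token cross-entropy restricted to the answer tokens, with the image/question/options prompt masked out. A minimal sketch of that label masking, assuming a Hugging Face-style causal VLM where the label -100 is ignored:

```python
import torch

IGNORE_INDEX = -100  # PyTorch cross-entropy skips positions carrying this label

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Supervise only the answer tokens of a multiple-choice sample (e.g. the
    option letter); the image/question/options prompt is masked out of the loss."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX
    return labels

# With these labels, a causal (V)LM computes the usual next-token cross-entropy
# restricted to the answer span, i.e. the loss is summed only over positions
# t >= prompt_len, predicting input_ids[t] from the preceding tokens.
```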

Key Experimental Results

Main Results

| Model | XVR-Eval Overall | Type |
| --- | --- | --- |
| Random | 32.64% | Baseline |
| Human | 83.85% | Human baseline |
| Eagle2-2B | 16.99% | Open-source |
| Qwen3-VL-2B-Instruct | 36.82% | Open-source |
| Qwen3-VL-4B-Instruct | 45.02% | Open-source |
| Claude-4.5-Sonnet | 51.18% | Closed-source |
| GPT-5 | 61.74% | Closed-source |
| Qwen3-VL-2B-XVR (Ours) | 68.06% | Fine-tuned |

The XVR-fine-tuned 2B model surpasses all closed-source models, including GPT-5, nearly doubling the base model's score (36.82% → 68.06%, roughly 1.8×).

Ablation Study (XVR-Eval Subtask Analysis)

| Task | Qwen3-VL-2B | Qwen3-VL-2B-XVR | Gain (pp) |
| --- | --- | --- | --- |
| Point Correspondence | 46.59% | 94.32% | +47.73 |
| Spatial Verification | 23.11% | 84.85% | +61.74 |
| Viewpoint Localization | 19.50% | 57.68% | +38.18 |
| Directional Correspondence | 26.14% | 53.79% | +27.65 |
| Temporal Verification | 45.29% | 41.18% | −4.11 |

External benchmark transfer: Consistent improvements on MindCube-Tiny and RoboSpatial-Home, with +7.6% on the Compatibility subtask and +7.0% on the Among subtask.

VLA manipulation success rate (RoboCasa): TurnOffMicrowave shows the largest gain (~+13%), with notable improvements also on CoffeePressButton and PnPCabToCounter.

Key Findings

  • Point Correspondence and Spatial Verification exhibit the most dramatic improvements (+47.73 and +61.74 pp, respectively), exceeding human-level performance, indicating that geometric matching tasks benefit most from explicit cross-view training.
  • Temporal Verification is the only task that declines (−4.11 pp), as XVR training emphasizes static spatial-geometric reasoning at the expense of temporal sensitivity—revealing a spatial–temporal reasoning trade-off.
  • 2B model > GPT-5: Explicit cross-view supervision is more valuable than model scale; Qwen3-VL-2B-XVR (2B parameters) outperforms GPT-5.
  • Gemini-Robotics-ER-1.5 achieves only 6.22% on Viewpoint Localization—below random chance—demonstrating that robot-specific training alone cannot substitute for explicit cross-view relational supervision.
  • Cross-domain transfer is effective: XVR training on outside-looking-in configurations generalizes to inside-looking-out scenarios in MindCube.

Highlights & Insights

  • The mapping from the SfM pipeline to VLM training is particularly elegant—translating the classical correspondence–verification–localization workflow into learnable QA tasks represents an effective strategy for injecting geometric knowledge into large models.
  • The finding that small model + explicit supervision > large model + zero-shot is significant: for structured tasks such as spatial reasoning, data quality and task design matter more than model scale.
  • Successful VLA transfer validates the hypothesis that better spatial perception leads to better manipulation; the XVR-trained visual backbone can be plugged in to improve robotic performance directly.
  • The dual-domain design of the data generation pipeline is noteworthy—automatically generating large-scale training data by exploiting metadata (camera parameters, trajectories) from existing datasets.

Limitations & Future Work

  • Temporal reasoning degradation: XVR prioritizes static multi-view spatial reasoning, sacrificing temporal dynamic understanding; future work could incorporate explicit temporal relational training.
  • VLA evaluation is simulation-only: The RoboCasa simulator does not fully capture the complexity of real physical environments; validation on real hardware is needed.
  • The general domain data is primarily drawn from WildRGB-D, which may offer limited scene diversity (predominantly tabletop objects); extending to outdoor and large-scale scene data could yield further gains.
  • Joint training with 3D perception tasks such as depth estimation and surface normal estimation has not been explored.
  • vs. MultiSPA: MultiSPA provides large-scale multi-frame spatial reasoning data with depth and visual correspondences, but lacks explicit cross-view geometric relational supervision; XVR's cross-view relations are more structured.
  • vs. MindCube: MindCube evaluates scene imagination from limited viewpoints; XVR achieves transferable gains on this benchmark, demonstrating that cross-view training generalizes to spatial imagination tasks.
  • vs. SpatialVLM / RoboSpatial: These works inject 3D spatial cues into single-view understanding; XVR extends this to cross-view relational understanding across multiple views, representing a more comprehensive form of spatial intelligence.
  • vs. pi0.5: pi0.5 enhances embodied reasoning by strengthening the VLM backbone; XVR provides a data-driven pathway toward similar objectives.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The mapping from SfM to VLM training is highly innovative; the three-category task design is theoretically grounded and empirically effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comparisons with 10 VLMs (including closed-source), internal and external benchmarks, VLA transfer, and human baselines—comprehensive coverage.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, rigorous task definitions, well-designed figures, and in-depth analysis.
  • Value: ⭐⭐⭐⭐⭐ Fills the gap in training data for multi-view spatial reasoning in VLMs; VLA transfer validates practical applicability.