OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Conference: ICLR 2026 | arXiv: 2506.03135 | Code: Project Page | Area: Multimodal VLM / Benchmarking | Keywords: Spatial Reasoning, VLM Benchmark, Cognitive Psychology, Dynamic Reasoning, Perspective Transformation

TL;DR

Grounded in cognitive psychology, this work introduces OmniSpatial—the first comprehensive spatial reasoning benchmark—systematically covering 4 dimensions (dynamic reasoning, complex spatial logic, spatial interaction, and perspective transformation) across 50 subcategories with 8.4K manually annotated QA pairs. The strongest reasoning model, o3, achieves only 56.33% while humans reach 92.63%, revealing that complex spatial reasoning remains a fundamental bottleneck for VLMs.

Background & Motivation

Background: Spatial reasoning is a core capability of VLMs. Existing benchmarks (SpatialBot-Bench, EmbSpatial, etc.) focus on basic spatial relationships—left/right discrimination, distance estimation, and object counting. State-of-the-art reasoning models (o3, Gemini-2.5-Pro) have already surpassed 90% accuracy on these benchmarks, indicating that basic spatial understanding is approaching saturation.

Limitations of Prior Work:

  • Basic spatial relations (left/right/front/back/counting) ≠ complex spatial reasoning (rotation/deformation/path planning/viewpoint transformation) → existing benchmarks underestimate the true capability gap.
  • Existing benchmarks rely heavily on template-based automatic annotation → insufficient data diversity and challenge, with rigid question phrasing (e.g., "Is A to the left of B?").
  • Lack of a systematic taxonomy grounded in cognitive psychology → task designs across benchmarks are fragmented and limited in coverage.

Key Challenge: The "high scores" VLMs achieve on existing benchmarks mask fundamental deficiencies in complex spatial reasoning in real-world scenarios. Locating an AED in an emergency, for example, requires not only identifying that it is "to the right of the door," but also reading a schematic diagram, associating the map with the physical scene, and planning a route.

Goal: To construct a "non-saturable" comprehensive spatial reasoning benchmark covering the full spectrum of spatial cognitive abilities from basic to high-order.

Key Insight: Drawing from cognitive psychology theories of spatial cognition (Chabris 2006; Meneghetti 2022), the paper decomposes complex spatial reasoning into 4 complementary dimensions, using this framework to design 50 subcategories and ensure theoretical completeness.

Core Idea: Redefine the complete boundaries of spatial reasoning evaluation by leveraging cognitive psychology theories of spatial cognition.

Method

Overall Architecture: A 4-Dimension × 50-Subcategory Taxonomy

Visual-spatial reasoning is formalized as the mapping \(f:(\mathbf{I}_{1:T}, q) \longrightarrow a\), where \(\mathbf{I}_{1:T}\) is an RGB observation stream, \(q\) is the task query, and \(a\) belongs to a verifiable answer/action space. Non-visual priors are excluded to ensure that improvements can be attributed to visual reasoning.
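As a minimal sketch of this interface (hypothetical Python types throughout; `SpatialTask`, `model.predict`, and exact-match scoring are illustrative assumptions, not code from the paper):

```python
from dataclasses import dataclass
from typing import Any, Sequence

@dataclass
class SpatialTask:
    frames: Sequence[Any]  # observation stream I_{1:T} (T = 1 for a single image)
    question: str          # task query q, phrased conversationally
    choices: list[str]     # the verifiable answer space
    answer: str            # gold label a, used only for scoring

def accuracy(model, tasks: list[SpatialTask]) -> float:
    """Exact-match accuracy of f: (I_{1:T}, q) -> a over a task list."""
    hits = sum(model.predict(t.frames, t.question, t.choices) == t.answer
               for t in tasks)
    return hits / len(tasks)
```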

The taxonomy rests on two pillars: (i) cognitive psychology foundations—independent faculties of spatial cognition (visualization, mental rotation, perspective transformation, spatial updating); and (ii) going beyond basic relations—tasks already saturated in existing benchmarks no longer provide discriminative power.

Key Design 1: Comprehensive Coverage Across Four Cognitive Dimensions

| Dimension | Subcategories | Core Cognitive Ability | Representative Tasks |
|---|---|---|---|
| Dynamic Reasoning | 11 | Inferring motion and temporal change from visual evidence | Motion trajectory prediction, physical simulation, traffic scene analysis |
| Complex Spatial Logic | 15 | High-order reasoning over relations, transformations, and geometric structures | 3D structural inference, mental folding/unfolding, spatial compatibility judgment |
| Spatial Interaction | 12 | Task-oriented reasoning under environmental constraints | Path planning, obstacle avoidance, context-aware action selection |
| Perspective Transformation | 12 | Adopting alternative viewpoints | Mental rotation, mirror image understanding, multi-agent perspective coordination |

Design Motivation: Each dimension corresponds to a distinct cognitive faculty—dynamic reasoning emphasizes motion inference, complex logic captures abstract transformations, spatial interaction focuses on real-time environmental engagement, and perspective transformation reflects cognitive flexibility. Together, they cover the full application spectrum from robotic manipulation to autonomous driving.

Key Design 2: Multi-Source Data and Rigorous Manual Annotation

Data are drawn from four sources: (1) web images spanning multiple countries, scenes, and weather conditions, collected with negative search operators (-ai, -generated) to exclude synthetic content; (2) cognitive test items from public spatial cognition tests that emphasize pure spatial reasoning; (3) driving examination questions covering license exam scenarios from at least 3 countries, plus frames extracted and annotated from U.S. driving test videos; (4) existing datasets, namely MME (with depth information) and HOI4D (human-object interaction video frame sequences).

Annotation employs conversational natural phrasing (e.g., "If you are entering the classroom, on which side are the students?") rather than templates ("Is A to the left of B?"). Six annotators cross-validated the labels, reaching a Krippendorff's \(\alpha\) of 0.84, which indicates high agreement. The final benchmark comprises a 1.5K test set (fully manually annotated) and a 6.9K training set.
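This agreement statistic can be computed with the open-source `krippendorff` package; the reliability matrix below is toy data for illustration only, since the paper does not specify its tooling:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Toy reliability matrix: rows = annotators, columns = QA items,
# values = chosen option index, np.nan = item not seen by that annotator.
# Illustrative numbers only; the paper reports alpha = 0.84 over 6 annotators.
ratings = np.array([
    [0, 1, 2, 2, np.nan, 3],
    [0, 1, 2, 2, 1,      3],
    [0, 1, 1, 2, 1,      np.nan],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```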

Key Design 3: PointGraph and SpatialCoT Augmentation Strategies

PointGraph: An open-vocabulary grounding model (Florence-2) is used to localize multiple objects and extract centroids and bounding boxes, assembled into a JSON-format scene graph → providing VLMs with explicit geometric cues to assist reasoning.
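A minimal sketch of the scene-graph assembly step, assuming detector output is already in hand (the `build_point_graph` helper and the JSON schema are hypothetical, not the paper's implementation):

```python
import json

def build_point_graph(detections):
    """Assemble detector output into a JSON scene graph of geometric cues.

    `detections` is assumed to be a list of (label, (x1, y1, x2, y2)) pairs
    from an open-vocabulary grounding model such as Florence-2.
    """
    nodes = []
    for label, (x1, y1, x2, y2) in detections:
        nodes.append({
            "object": label,
            "bbox": [x1, y1, x2, y2],
            "centroid": [(x1 + x2) / 2, (y1 + y2) / 2],
        })
    return json.dumps({"objects": nodes}, indent=2)

# The JSON string is prepended to the question so the VLM receives
# explicit object positions alongside the image.
scene_graph = build_point_graph([("door", (40, 10, 120, 300)),
                                 ("AED", (130, 90, 170, 150))])
```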

SpatialCoT: Inspired by human mental imagery, InstantMesh generates 6 novel viewpoints for each input image, combined into a multi-view mosaic → fed together with the question into the VLM → chain-of-thought reasoning. This provides strong geometric priors, helping resolve occlusion and viewpoint-related reasoning ambiguities.
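A sketch of the mosaic-composition step using Pillow, assuming the 6 InstantMesh renders are already available (the 3×2 grid layout and tile size are illustrative choices, not values from the paper):

```python
from PIL import Image

def make_view_mosaic(views, cols=3, rows=2, size=(336, 336)):
    """Tile novel-view renders into a single cols x rows mosaic image.

    `views` is assumed to be a list of PIL images, e.g. the 6 viewpoints
    rendered by InstantMesh (the rendering call itself is omitted here).
    """
    mosaic = Image.new("RGB", (cols * size[0], rows * size[1]))
    for i, view in enumerate(views[: cols * rows]):
        x, y = (i % cols) * size[0], (i // cols) * size[1]
        mosaic.paste(view.resize(size), (x, y))
    return mosaic

# The mosaic is passed to the VLM together with the original image, the
# question, and a chain-of-thought prompt such as
# "Reason step by step about the 3D layout before answering."
```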

Key Experimental Results

Main Results: Representative Model Performance on OmniSpatial-test (%)

| Model | Avg. | Operations | Motion Analysis | Traffic | Localization | Geography | Strategy | Pattern Recognition | Geometric Reasoning | Egocentric | Allocentric |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 24.98 | - | 24.86 | 26.30 | 25.88 | 23.43 | 27.27 | 21.44 | 24.77 | 22.55 | 24.84 |
| GPT-4o | 47.81 | 65.54 | 57.23 | 56.47 | 52.38 | 54.09 | 26.29 | 25.48 | 75.98 | 39.49 | 39.76 |
| o3 | 56.33 | 71.89 | 66.18 | 61.18 | 68.57 | 65.45 | 40.21 | 29.68 | 77.06 | 48.40 | 48.19 |
| Gemini-2.5-Pro | 55.19 | 67.57 | 71.39 | 62.35 | 75.24 | 64.55 | 43.30 | 34.84 | 74.51 | 38.03 | 37.35 |
| InternVL3-78B | 49.33 | 63.78 | 63.12 | 56.24 | 59.24 | 51.45 | 27.63 | 30.19 | 74.51 | 38.46 | 35.90 |
| SoFar-3B | 45.14 | 56.49 | 51.16 | 54.12 | 53.14 | 52.73 | 31.75 | 22.88 | 71.60 | 36.56 | 41.69 |
| Human | 92.63 | 94.62 | 96.07 | 91.38 | 95.11 | 92.15 | 89.02 | 85.90 | 98.53 | 94.30 | 90.26 |
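For a quick diagnostic view, the per-category o3-vs-human gap can be ranked directly from the numbers above (a throwaway script over the table's values, not part of the benchmark):

```python
# Per-category gap between the best model (o3) and humans, copied from the table.
categories = ["Operations", "Motion Analysis", "Traffic", "Localization",
              "Geography", "Strategy", "Pattern Recognition",
              "Geometric Reasoning", "Egocentric", "Allocentric"]
o3    = [71.89, 66.18, 61.18, 68.57, 65.45, 40.21, 29.68, 77.06, 48.40, 48.19]
human = [94.62, 96.07, 91.38, 95.11, 92.15, 89.02, 85.90, 98.53, 94.30, 90.26]

for name, m, h in sorted(zip(categories, o3, human),
                         key=lambda t: t[2] - t[1], reverse=True):
    print(f"{name:20s} gap = {h - m:5.2f} pts")
# Pattern Recognition (56.22) and Strategy (48.81) show the largest gaps.
```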

Saturation Comparison: Existing Benchmarks vs. OmniSpatial

| Model | SpatialBot-Bench | EmbSpatial | OmniSpatial |
|---|---|---|---|
| o3 | >90% | >90% | 56.33% |
| Gemini-2.5-Pro | >90% | >90% | 55.19% |
| Human | ~95% | ~95% | 92.63% |

Key Findings

  • The strongest reasoning model, o3 (56.33%), vs. humans (92.63%) → a gap of 36 percentage points → complex spatial reasoning is far from solved.
  • Strategy (~40%) and Pattern Recognition (~30%) are the most challenging dimensions → even o3 answers fewer than half correctly.
  • Perspective transformation (egocentric/allocentric, ~48%/~48%) proves notably difficult → VLMs lack intrinsic 3D representations and mental rotation capabilities.
  • Specialized spatial models (SpatialBot, RoboPoint) show no advantage on OmniSpatial (35–40%) → their "specialized" training sets are too simplistic.
  • PointGraph and SpatialCoT improve performance on certain dimensions but with limited gains → the root cause is a deficiency in fundamental spatial cognitive abilities.

Highlights & Insights

  • "A Warning Against Saturation": The paper clearly demonstrates that existing benchmarks have been "solved" by state-of-the-art models → the community needs harder evaluation standards. OmniSpatial elevates assessment from pattern matching to cognitive reasoning.
  • Theoretical Anchor in Cognitive Psychology: Rather than arbitrarily adding difficult questions, the taxonomy is derived from spatial cognition theory → systematicity and completeness are theoretically guaranteed.
  • Diagnostic Value of 50 Subcategories: Difficulty varies dramatically across subtasks (geometric reasoning ~75% vs. pattern recognition ~30%) → providing precise guidance for model improvement.
  • Human Performance Upper Bound at 92.63%: Even humans do not achieve 100% → some tasks (e.g., pattern recognition at 85.90%) are genuinely challenging for humans as well → demonstrating the depth of the benchmark design.

Limitations & Future Work

  • The benchmark primarily relies on static images and a small number of video frames → dynamic spatial reasoning can be further extended to continuous video.
  • All 3D reasoning tasks are still conducted on 2D images → truly interactive 3D environments (VR/simulators) are not addressed.
  • Manual annotation ensures high quality but incurs large scaling costs → semi-automatic annotation pipelines should be explored for continuous data expansion.
  • PointGraph and SpatialCoT show limited effectiveness as augmentation strategies → more fundamental improvements may require introducing 3D spatial priors at the model architecture level.

Comparison with Related Benchmarks

  • vs. SpatialBot-Bench/EmbSpatial: Only 6–8 basic spatial relation categories with template annotation → OmniSpatial offers 50 categories with manual annotation → a comprehensive upgrade in dimensionality and difficulty.
  • vs. VSI-Bench (Yang et al., 2024): 8 indoor scene categories with template annotation and 288 samples → OmniSpatial covers indoor/outdoor scenes across multiple countries with 6.5K images.
  • vs. RoboSpatial (Song et al., 2024): Template-based automatic annotation at the million-sample scale → large in volume but limited in diversity and difficulty.
  • Insight: Could OmniSpatial be integrated with embodied AI → enabling models to execute actions based on spatial reasoning in simulators → shifting from "answering questions" to "completing tasks"?

Rating

⭐⭐⭐⭐⭐ (5/5)

Overall assessment: The first comprehensive spatial reasoning benchmark grounded in cognitive psychology theory, featuring carefully curated manual annotations across 50 subcategories and 8.4K questions. The substantial gap between o3 (56.33%) and humans (92.63%) validates the benchmark's discriminative power and value, establishing a new standard for evaluating spatial cognitive abilities in VLMs.