OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models¶
Conference: ICLR 2026 | arXiv: 2506.03135 | Code: Project Page | Area: Multimodal VLM / Benchmarking | Keywords: Spatial Reasoning, VLM Benchmark, Cognitive Psychology, Dynamic Reasoning, Perspective Transformation
TL;DR¶
Grounded in cognitive psychology, this work introduces OmniSpatial—the first comprehensive spatial reasoning benchmark—systematically covering 4 dimensions (dynamic reasoning, complex spatial logic, spatial interaction, and perspective transformation) across 50 subcategories with 8.4K manually annotated QA pairs. The strongest reasoning model, o3, achieves only 56.33% while humans reach 92.63%, revealing that complex spatial reasoning remains a fundamental bottleneck for VLMs.
Background & Motivation¶
Background: Spatial reasoning is a core capability of VLMs. Existing benchmarks (SpatialBot-Bench, EmbSpatial, etc.) focus on basic spatial relationships—left/right discrimination, distance estimation, and object counting. State-of-the-art reasoning models (o3, Gemini-2.5-Pro) have already surpassed 90% accuracy on these benchmarks, indicating that basic spatial understanding is approaching saturation.
Limitations of Prior Work:
- Basic spatial relations (left/right/front/back/counting) ≠ complex spatial reasoning (rotation/deformation/path planning/viewpoint transformation) → existing benchmarks underestimate the true capability gap.
- Existing benchmarks rely heavily on template-based automatic annotation → insufficient data diversity and challenge, with rigid question phrasing (e.g., "Is A to the left of B?").
- Lack of a systematic taxonomy grounded in cognitive psychology → task designs across benchmarks are fragmented and limited in coverage.
Key Challenge: The "high scores" VLMs achieve on existing benchmarks mask fundamental deficiencies in complex spatial reasoning in real-world scenarios—locating an AED (automated external defibrillator) in an emergency requires not only identifying that it is "to the right of the door," but also reading a schematic diagram, associating a map with the physical scene, and planning a route.
Goal: To construct a "non-saturable" comprehensive spatial reasoning benchmark covering the full spectrum of spatial cognitive abilities from basic to high-order.
Key Insight: Drawing from cognitive psychology theories of spatial cognition (Chabris 2006; Meneghetti 2022), the paper decomposes complex spatial reasoning into 4 complementary dimensions, using this framework to design 50 subcategories and ensure theoretical completeness.
Core Idea: Redefine the complete boundaries of spatial reasoning evaluation by leveraging cognitive psychology theories of spatial cognition.
Method¶
Overall Architecture: A 4-Dimension × 50-Subcategory Taxonomy¶
Visual-spatial reasoning is formalized as the mapping \(f:(\mathbf{I}_{1:T}, q) \longrightarrow a\), where \(\mathbf{I}_{1:T}\) is an RGB observation stream, \(q\) is the task query, and \(a\) belongs to a verifiable answer/action space. Non-visual priors are excluded to ensure that improvements can be attributed to visual reasoning.
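To make this mapping concrete, here is a minimal sketch of how a benchmark item and its verifiable answer space could be represented; field names and the `predict` interface are illustrative, not the released data format:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SpatialQAItem:
    """One instance of f:(I_{1:T}, q) -> a with a verifiable multiple-choice answer."""
    image_paths: List[str]   # I_{1:T}: one or more RGB frames
    question: str            # q: natural-language spatial query
    choices: List[str]       # candidate answers defining the answer space
    answer_index: int        # ground-truth a, verifiable by exact match

def accuracy(predict: Callable[[List[str], str, List[str]], int],
             items: List[SpatialQAItem]) -> float:
    """Fraction of items for which the model's chosen option matches the ground truth."""
    correct = sum(
        predict(item.image_paths, item.question, item.choices) == item.answer_index
        for item in items
    )
    return correct / len(items)
```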
The taxonomy rests on two pillars: (i) cognitive psychology foundations—independent faculties of spatial cognition (visualization, mental rotation, perspective transformation, spatial updating); and (ii) going beyond basic relations—tasks already saturated in existing benchmarks no longer provide discriminative power.
Key Design 1: Comprehensive Coverage Across Four Cognitive Dimensions¶
| Dimension | Subcategories | Core Cognitive Ability | Representative Tasks |
|---|---|---|---|
| Dynamic Reasoning | 11 | Inferring motion and temporal change from visual evidence | Motion trajectory prediction, physical simulation, traffic scene analysis |
| Complex Spatial Logic | 15 | High-order reasoning over relations, transformations, and geometric structures | 3D structural inference, mental folding/unfolding, spatial compatibility judgment |
| Spatial Interaction | 12 | Task-oriented reasoning under environmental constraints | Path planning, obstacle avoidance, context-aware action selection |
| Perspective Transformation | 12 | Ability to adopt alternative viewpoints | Mental rotation, mirror image understanding, multi-agent perspective coordination |
Design Motivation: Each dimension corresponds to a distinct cognitive faculty—dynamic reasoning emphasizes motion inference, complex logic captures abstract transformations, spatial interaction focuses on real-time environmental engagement, and perspective transformation reflects cognitive flexibility. Together, they cover the full application spectrum from robotic manipulation to autonomous driving.
Key Design 2: Multi-Source Data and Rigorous Manual Annotation¶
Data are drawn from four sources: (1) web images—spanning multiple countries, scenes, and weather conditions, collected with negative search operators (e.g., "-ai", "-generated") to exclude synthetic content; (2) cognitive test items—public spatial cognition tests emphasizing pure spatial reasoning; (3) driving examination questions—license exam scenarios from at least 3 countries, plus frames extracted and annotated from U.S. driving test videos; (4) existing datasets—MME (with depth information) and HOI4D (human-object interaction video frame sequences).
Annotation employs conversational natural phrasing (e.g., "If you are entering the classroom, on which side are the students?") rather than templates ("Is A to the left of B?"), with 6 annotators cross-validated at Krippendorff's \(\alpha = 0.84\), indicating high agreement. The final benchmark comprises a 1.5K test set (fully manually annotated) and a 6.9K training set.
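For reference, agreement of this kind can be computed with the third-party `krippendorff` package; the sketch below uses toy labels, not the paper's annotation data:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = questions; values are chosen option indices,
# np.nan marks questions an annotator did not label.
reliability_data = np.array([
    [0, 1, 2, 3, 0, 1, np.nan],
    [0, 1, 2, 3, 0, 2, 1],
    [0, 1, 2, 2, 0, 1, 1],
], dtype=float)

# Answer options are categorical, so use the nominal level of measurement.
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.2f}")  # values around 0.8 and above indicate high agreement
```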
Key Design 3: PointGraph and SpatialCoT Augmentation Strategies¶
PointGraph: An open-vocabulary grounding model (Florence-2) is used to localize multiple objects and extract centroids and bounding boxes, assembled into a JSON-format scene graph → providing VLMs with explicit geometric cues to assist reasoning.
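A minimal sketch of the PointGraph idea, assuming detections (labels and pixel boxes) have already been produced by a grounding model such as Florence-2; the JSON field names and the example question are illustrative:

```python
import json

def build_point_graph(detections):
    """Turn raw detections [(label, (x1, y1, x2, y2)), ...] into a JSON
    scene graph of object centroids and boxes to prepend to the VLM prompt."""
    nodes = []
    for label, (x1, y1, x2, y2) in detections:
        nodes.append({
            "label": label,
            "bbox": [x1, y1, x2, y2],
            "centroid": [(x1 + x2) / 2, (y1 + y2) / 2],
        })
    return json.dumps({"objects": nodes}, indent=2)

detections = [("door", (40, 20, 120, 300)), ("AED cabinet", (150, 90, 200, 180))]
scene_graph = build_point_graph(detections)
prompt = (
    "Scene graph (pixel coordinates):\n" + scene_graph +
    "\n\nQuestion: If you are facing the door, on which side is the AED cabinet?"
)
```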
SpatialCoT: Inspired by human mental imagery, InstantMesh generates 6 novel viewpoints for each input image, combined into a multi-view mosaic → fed together with the question into the VLM → chain-of-thought reasoning. This provides strong geometric priors, helping resolve occlusion and viewpoint-related reasoning ambiguities.
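A minimal sketch of the mosaic-assembly step, assuming the six novel-view renders already exist on disk (file names and the question are illustrative); the mosaic and the original image are then sent to the VLM together with a chain-of-thought instruction:

```python
from PIL import Image

def make_view_mosaic(view_paths, tile=(3, 2), size=(256, 256)):
    """Tile novel-view renders (e.g., 6 views from InstantMesh) into one mosaic image."""
    cols, rows = tile
    mosaic = Image.new("RGB", (cols * size[0], rows * size[1]))
    for i, path in enumerate(view_paths[: cols * rows]):
        view = Image.open(path).resize(size)
        mosaic.paste(view, ((i % cols) * size[0], (i // cols) * size[1]))
    return mosaic

views = [f"novel_view_{i}.png" for i in range(6)]  # pre-rendered novel viewpoints
make_view_mosaic(views).save("mosaic.png")

cot_prompt = (
    "The second image shows six synthesized viewpoints of the same scene. "
    "Reason step by step about the 3D layout across views, then answer: "
    "which object is behind the chair from the photographer's viewpoint?"
)
```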
Key Experimental Results¶
Main Results: Representative Model Performance on OmniSpatial-test (%)¶
| Model | Avg. | Operations | Motion Analysis | Traffic | Localization | Geography | Strategy | Pattern Recognition | Geometric Reasoning | Egocentric | Allocentric |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 24.98 | - | 24.86 | 26.30 | 25.88 | 23.43 | 27.27 | 21.44 | 24.77 | 22.55 | 24.84 |
| GPT-4o | 47.81 | 65.54 | 57.23 | 56.47 | 52.38 | 54.09 | 26.29 | 25.48 | 75.98 | 39.49 | 39.76 |
| o3 | 56.33 | 71.89 | 66.18 | 61.18 | 68.57 | 65.45 | 40.21 | 29.68 | 77.06 | 48.40 | 48.19 |
| Gemini-2.5-Pro | 55.19 | 67.57 | 71.39 | 62.35 | 75.24 | 64.55 | 43.30 | 34.84 | 74.51 | 38.03 | 37.35 |
| InternVL3-78B | 49.33 | 63.78 | 63.12 | 56.24 | 59.24 | 51.45 | 27.63 | 30.19 | 74.51 | 38.46 | 35.90 |
| SoFar-3B | 45.14 | 56.49 | 51.16 | 54.12 | 53.14 | 52.73 | 31.75 | 22.88 | 71.60 | 36.56 | 41.69 |
| Human | 92.63 | 94.62 | 96.07 | 91.38 | 95.11 | 92.15 | 89.02 | 85.90 | 98.53 | 94.30 | 90.26 |
Saturation Comparison: Existing Benchmarks vs. OmniSpatial¶
| Model | SpatialBot-Bench | EmbSpatial | OmniSpatial |
|---|---|---|---|
| o3 | >90% | >90% | 56.33% |
| Gemini-2.5-Pro | >90% | >90% | 55.19% |
| Human | ~95% | ~95% | 92.63% |
Key Findings¶
- The strongest reasoning model, o3 (56.33%), vs. humans (92.63%) → a gap of 36 percentage points → complex spatial reasoning is far from solved.
- Strategy (~40%) and Pattern Recognition (~30%) are the most challenging task categories → even o3 answers fewer than half correctly.
- Perspective transformation (egocentric/allocentric, ~48%/~48%) proves notably difficult → VLMs lack intrinsic 3D representations and mental rotation capabilities.
- Specialized spatial models (SpatialBot, RoboPoint) show no advantage on OmniSpatial (35–40%) → their "specialized" training sets are too simplistic.
- PointGraph and SpatialCoT improve performance on certain dimensions but with limited gains → the root cause is a deficiency in fundamental spatial cognitive abilities.
Highlights & Insights¶
- "A Warning Against Saturation": The paper clearly demonstrates that existing benchmarks have been "solved" by state-of-the-art models → the community needs harder evaluation standards. OmniSpatial elevates assessment from pattern matching to cognitive reasoning.
- Theoretical Anchor in Cognitive Psychology: Rather than arbitrarily adding difficult questions, the taxonomy is derived from spatial cognition theory → systematicity and completeness are theoretically guaranteed.
- Diagnostic Value of 50 Subcategories: Difficulty varies dramatically across subtasks (geometric reasoning ~75% vs. pattern recognition ~30%) → providing precise guidance for model improvement.
- Human Performance Upper Bound at 92.63%: Even humans do not achieve 100% → some tasks (e.g., pattern recognition at 85.90%) are genuinely challenging for humans as well → demonstrating the depth of the benchmark design.
Limitations & Future Work¶
- The benchmark primarily relies on static images and a small number of video frames → dynamic spatial reasoning can be further extended to continuous video.
- All 3D reasoning tasks are still conducted on 2D images → truly interactive 3D environments (VR/simulators) are not addressed.
- Manual annotation ensures high quality but incurs large scaling costs → semi-automatic annotation pipelines should be explored for continuous data expansion.
- PointGraph and SpatialCoT show limited effectiveness as augmentation strategies → more fundamental improvements may require introducing 3D spatial priors at the model architecture level.
Related Work & Insights¶
- vs. SpatialBot-Bench/EmbSpatial: Only 6–8 basic spatial relation categories with template annotation → OmniSpatial offers 50 categories with manual annotation → a comprehensive upgrade in dimensionality and difficulty.
- vs. VSI-Bench (Yang et al., 2024): 8 indoor scene categories with template annotation and 288 samples → OmniSpatial covers indoor/outdoor scenes across multiple countries with 6.5K images.
- vs. RoboSpatial (Song et al., 2024): Template-based automatic annotation at the million-sample scale → large in volume but limited in diversity and difficulty.
- Insight: Could OmniSpatial be integrated with embodied AI → enabling models to execute actions based on spatial reasoning in simulators → shifting from "answering questions" to "completing tasks"?
Rating¶
⭐⭐⭐⭐⭐ (5/5)
Overall assessment: The first comprehensive spatial reasoning benchmark grounded in cognitive psychology theory, featuring carefully curated manual annotations across 50 categories × 8.4K questions. The substantial gap between o3 (56%) and humans (93%) validates the benchmark's discriminative power and value—establishing a new standard for evaluating spatial cognitive abilities in VLMs.