
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Conference: ICLR 2026
arXiv: 2602.03916
Code: spatialab-reasoning.github.io
Area: Multimodal VLM
Keywords: Spatial Reasoning, VLM Benchmark, MCQ Evaluation, Open-ended Evaluation, Real-world Scenarios

TL;DR

This paper introduces SpatiaLab, a real-world spatial reasoning benchmark comprising 1,400 visual QA pairs spanning 30 subcategories across 6 major spatial task categories. Supporting both MCQ and open-ended evaluation formats, SpatiaLab reveals a substantial gap between the strongest current VLMs (InternVL3.5-72B: 54.93% MCQ) and humans (87.57%), with the gap widening further under open-ended settings.

Background & Motivation

Background: Spatial reasoning is a foundational cognitive ability for humans and is critical to robotics, autonomous driving, and AR/VR. While VLMs have made progress in multimodal representation and language grounding, spatial judgment in real-world environments remains fragile.

Limitations of Prior Work:

  • Existing spatial reasoning benchmarks are overly simplified: most focus on binary spatial relations, coarse depth categorization, or synthetic/puzzle-style scenes.
  • Controlled environments reduce perceptual and reasoning difficulty, causing apparent saturation that masks failures under distribution shift.
  • Critical challenges such as occlusion reasoning, cross-view scale consistency, and path planning under partial observability are severely undersampled.
  • Models that perform well on benchmarks such as ScanQA and BLINK frequently fail in real-world settings.

Key Challenge: Humans seamlessly integrate multidimensional spatial information—relative position, depth, orientation, scale, navigation, and 3D geometry—whereas VLMs fall far short of human performance on any single dimension, let alone joint multi-dimensional reasoning.

Goal:

  • Construct a real-world benchmark covering all core axes of spatial reasoning.
  • Employ dual-format evaluation (MCQ and open-ended) to avoid format bias.
  • Evaluate 25+ VLMs and establish a human baseline.
  • Conduct in-depth failure analysis and provide actionable directions for improvement.

Key Insight: Drawing from cognitive psychology's taxonomy of spatial cognition, the paper systematically decomposes spatial reasoning into \(6 \times 5 = 30\) fine-grained task types, constructing the benchmark from real photographs rather than synthetic data.

Core Idea: SpatiaLab employs dual-format evaluation across 30 real-world spatial reasoning tasks to systematically expose fundamental deficiencies of VLMs in depth perception, occlusion reasoning, navigation planning, and 3D geometry.

Method

Overall Architecture

SpatiaLab = Benchmark Dataset + Evaluation Protocol + Improvement Strategy Exploration

  • 6 major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry.
  • Each major category contains 5 subcategories → 30 task types in total.
  • Each subcategory contains ≥25 questions, each major category ≥200 questions → 1,400 verified QA pairs in total.
  • Dual format: MCQ (4-choice) + open-ended generation.
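
To make the taxonomy and count constraints above concrete, here is a minimal schema sketch; the field names and the `check_counts` helper are illustrative assumptions, not the authors' released data format (the 30 subcategory names are not listed in this summary):

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

# The six major categories named in the paper.
CATEGORIES = [
    "Relative Positioning", "Depth & Occlusion", "Orientation",
    "Size & Scale", "Spatial Navigation", "3D Geometry",
]

@dataclass
class SpatiaLabItem:
    """One QA pair, usable in both evaluation formats (assumed field names)."""
    image_path: str
    category: str        # one of CATEGORIES
    subcategory: str     # one of the 5 subcategories under that category
    question: str
    choices: List[str]   # 4 options for the MCQ format
    answer: str          # gold answer, also used for open-ended grading

def check_counts(items: List[SpatiaLabItem]) -> None:
    """Sanity-check the count constraints stated in the paper."""
    assert len(items) == 1400, "1,400 verified QA pairs in total"
    per_cat = Counter(it.category for it in items)
    per_sub = Counter((it.category, it.subcategory) for it in items)
    assert all(n >= 200 for n in per_cat.values()), ">=200 questions per major category"
    assert all(n >= 25 for n in per_sub.values()), ">=25 questions per subcategory"
```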

Key Designs

  1. Multi-source Image Collection

    • Function: Construct a visually diverse real-world image repository.
    • Mechanism: Three complementary sources—automated web crawling, targeted online retrieval, and manual indoor/outdoor photography. Systematic coverage along 6 meta-dimensions: illumination, texture complexity, edge complexity, spatial relations, material type, and gravity constraints.
    • Design Motivation: Ensure the benchmark reflects real-world visual noise and complexity rather than controlled laboratory conditions.
    • Complexity statistics: average 21.48 objects per image, 11.88 partially visible, 3.23 depth layers, and 2.07 spatial reasoning steps per chain.
  2. Three-stage Annotation and Quality Control

    • Function: Ensure semantic validity, answer correctness, and task clarity of all QA pairs.
    • Mechanism: Phase 1 annotator training → Phase 2 paired spatial QA generation per image → Phase 3 dual-format encoding. Three rounds of review: semantic validation → independent verification → gold-standard establishment.
    • Design Motivation: Error rates in spatial reasoning QA are high under complex scenes; three rounds of review ensure the reliability of the final 1,400 questions.
  3. Improvement Strategy Exploration

    • Function: Systematically test multiple approaches for enhancing VLM spatial reasoning.
    • Methods covered: intrinsic (direct) reasoning, CoT prompting, CoT + self-reflection, SFT fine-tuning (40% of the data for training, the remaining 60% held out for evaluation), and a multi-agent system (SpatioXolver); a prompting sketch follows this list.
    • Design Motivation: Beyond exposing problems, the paper provides actionable improvement directions. SFT yields the best results on navigation and orientation; multi-agent reasoning helps on orientation but stagnates or degrades on other categories.
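
As a rough illustration of the CoT and CoT + self-reflection strategies above, the sketch below wraps a generic `query_vlm` call; the prompt wording and helper names are assumptions, not the paper's exact prompts or code.

```python
# Minimal sketch of CoT and CoT + self-reflection prompting for the MCQ format.
# `query_vlm` is a placeholder for whatever VLM chat API is under evaluation.

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: send one image plus a text prompt to the model, return its reply."""
    raise NotImplementedError

def cot_answer(image_path: str, question: str, choices: list[str]) -> str:
    options = "\n".join(f"{label}. {c}" for label, c in zip("ABCD", choices))
    prompt = (
        f"{question}\n{options}\n"
        "Think step by step about the spatial layout of the scene, "
        "then answer with a single option letter."
    )
    return query_vlm(image_path, prompt)

def cot_with_reflection(image_path: str, question: str, choices: list[str]) -> str:
    # Self-reflection: feed the first-pass reasoning back and ask the model to re-check it.
    draft = cot_answer(image_path, question, choices)
    reflect_prompt = (
        f"Question: {question}\n"
        f"Your previous reasoning and answer:\n{draft}\n"
        "Re-examine the image for depth ordering, occlusion, and scale cues. "
        "Correct your answer if needed and reply with a single option letter."
    )
    return query_vlm(image_path, reflect_prompt)
```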

Loss & Training

(This is a benchmark paper; no training loss is defined. SFT experiments fine-tune Qwen2.5-VL-3B-Instruct with a standard cross-entropy loss.)
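
For reference, that objective is the usual token-level cross-entropy over the answer tokens, conditioned on the image and question (notation mine, not the paper's):

\[
\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, q,\, I\right)
\]

where \(I\) is the input image, \(q\) the question, and \(y_{1:T}\) the tokens of the gold answer.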

Key Experimental Results

Main Results (MCQ Format, Accuracy %, 25+ Models)

| Model | 3D Geometry | Depth & Occlusion | Orientation | Relative Positioning | Size & Scale | Navigation | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Human Baseline | 93.70 | 74.13 | 91.58 | 91.51 | 88.89 | 87.76 | 87.57 |
| InternVL3.5-72B | 50.00 | 57.14 | 53.47 | 66.04 | 49.21 | 54.85 | 54.93 |
| GPT-5-mini | 48.74 | 54.83 | 60.40 | 62.74 | 44.84 | 56.54 | 54.29 |
| o4-mini-medium | 51.26 | 58.30 | 54.95 | 64.15 | 40.87 | 51.48 | 53.21 |
| Spatial-specialized models | ~42 | ~38 | ~48 | ~38 | ~43 | ~39 | ~41 |
| Random baseline | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 |

Open-ended Format Comparison

| Model | MCQ Overall | Open-ended Overall | Performance Drop |
| --- | --- | --- | --- |
| GPT-5-mini | 54.29 | 40.93 | −13.36 |
| o4-mini-medium | 53.21 | 37.86 | −15.35 |
| InternVL3.5-72B | 54.93 | 23.36 | −31.57 |
| Human Baseline | 87.57 | 64.93 | −22.64 |

Average MCQ→open-ended gap: −23.0%.
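
The drop column is simply the open-ended score minus the MCQ score, in percentage points; a quick sketch of the arithmetic (the −23.0% average is presumably taken over all evaluated models, so it cannot be recomputed from the rows shown here):

```python
# Recompute the "Performance Drop" column from the table above (percentage points).
scores = {
    "GPT-5-mini":      (54.29, 40.93),
    "o4-mini-medium":  (53.21, 37.86),
    "InternVL3.5-72B": (54.93, 23.36),
    "Human Baseline":  (87.57, 64.93),
}
for model, (mcq, open_ended) in scores.items():
    print(f"{model:17s} drop = {open_ended - mcq:+.2f} points")
```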

Key Findings

  • Top models achieve only 55% (MCQ) / 41% (open-ended): A large gap remains relative to human performance of 88% / 65%. Spatial-specialized models perform even worse (~41%), indicating that current specialization approaches are ineffective.
  • Open-ended evaluation exposes true capability: The average MCQ-to-open-ended drop is 23%; spatial-specialized models drop the most (~27%), suggesting that MCQ scores can overestimate true spatial reasoning ability.
  • Three most challenging categories: Size & Scale, Depth & Occlusion, and Spatial Navigation consistently emerge as bottlenecks, with most models scoring below 50% (or 30% in open-ended settings).
  • Model scale ≠ spatial reasoning ability: Llama-3.2-11B achieves only 30.5%, worse than many 4B models, indicating that spatial reasoning requires specialized capabilities beyond raw parameter count.
  • Limited gains from reasoning augmentation: CoT helps on the Orientation category; SFT improves Navigation (+7.69%); however, multi-agent systems degrade performance on Occlusion and Size & Scale.
  • Systematic failure patterns: Tasks involving object rotation (2%), reflective surfaces (<20%), and tool handedness (<30%) result in near-total failure.

Highlights & Insights

  • Well-designed real-world dual-format evaluation: The 30-task taxonomy across 1,400 questions is among the most fine-grained categorizations of real-world spatial reasoning to date; the MCQ + open-ended dual format addresses format bias, a critical issue overlooked by prior benchmarks.
  • Counterintuitive finding—spatial-specialized models underperform general models: SpaceOm, SpaceThinker, and SpaceQwen all lag behind InternVL3.5-72B on real-world scenes, demonstrating that spatial capabilities acquired from synthetic training data do not transfer to real-world settings.
  • Diagnostic value of error analysis: Cluster analysis reveals that failures concentrate in three patterns: spatial mislocalization, perspective/scale errors, and occlusion ordering failures—directly attributable to the lack of geometric supervision in VLM training.
  • Necessity of open-ended evaluation: The average MCQ-to-open-ended drop of 23% is largest on Navigation (which requires the most multi-step reasoning), indicating that current VLMs rely on elimination strategies rather than genuine spatial understanding.

Limitations & Future Work

  • Although high in quality, the 1,400 questions are limited in number; with as few as 25 questions per subcategory, per-subcategory scores may not be statistically stable.
  • Open-ended evaluation relies on an LLM judge (Gemini-2.5-Flash); although inter-rater agreement reaches Cohen's kappa = 0.738 (see the sketch after this list), the judging process itself remains imperfect.
  • Temporal spatial reasoning in video settings is not covered.
  • Directions for improvement: Developing spatial reasoning pre-training data based on physics engines, or incorporating explicit geometric encoding modules into VLMs to address spatial reasoning deficiencies.
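
On the LLM-judge point above: Cohen's kappa measures agreement beyond chance, \( \kappa = (p_o - p_e)/(1 - p_e) \). Below is a minimal sketch of how judge reliability could be checked, assuming the judge's correct/incorrect verdicts are compared against human verdicts on a shared audit subset (the paper's exact protocol is not described in this summary):

```python
# Agreement between the LLM judge's verdicts and human verdicts on the same answers.
# The verdict lists below are toy placeholders, not data from the paper.
from sklearn.metrics import cohen_kappa_score

human_verdicts = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = model answer judged correct
llm_verdicts   = [1, 0, 1, 0, 0, 1, 0, 1]  # verdicts from the LLM judge

kappa = cohen_kappa_score(human_verdicts, llm_verdicts)
print(f"Cohen's kappa = {kappa:.3f}")  # the paper reports 0.738
```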

Comparison with Related Benchmarks

  • vs. BLINK-Spatial (2024): 14 task types / 3.8K questions, but mixes synthetic and real data; best reported performance 59%. SpatiaLab focuses on 30 real-world task types, offering finer granularity and greater difficulty.
  • vs. OmniSpatial (2025): 50 categories but only 1.5K questions in a puzzle setting; best reported performance 56%. SpatiaLab emphasizes real-world scenes over puzzle-style settings.
  • vs. VSI-Bench (2025): an indoor video benchmark with 8 categories; best reported performance 45%. SpatiaLab covers a broader range of scene types and image modalities.

Rating

  • Novelty: ⭐⭐⭐⭐ The 30-category taxonomy and dual-format evaluation design are novel, though the core methodology (benchmark construction) is not an entirely new paradigm.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly comprehensive: 25+ models, human baselines, improvement strategy exploration, and error analysis.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured with in-depth analysis, though somewhat lengthy.
  • Value: ⭐⭐⭐⭐⭐ Fills a critical gap in real-world spatial reasoning evaluation, quantifies the VLM–human gap, and provides important guidance for the VLM community.