SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Conference: CVPR 2026
arXiv: 2602.20901
Code: https://github.com/xieyc99/SpatiaLQA
Area: Multimodal VLM
Keywords: Spatial logical reasoning, VLM benchmark, scene graph, indoor scene understanding, multi-step reasoning
TL;DR
This paper introduces SpatiaLQA, a benchmark of 9,605 QA pairs across 241 real-world indoor scenes; systematically evaluates 41 VLMs on spatial logical reasoning; and proposes a recursive scene-graph-assisted reasoning method (RSGAR) to strengthen VLMs' spatial logical reasoning capabilities.
Background & Motivation
Background: VLMs have achieved strong performance on general VQA and logical reasoning tasks, yet remain inadequate in complex real-world scenarios that require the integration of spatial understanding and multi-step logical reasoning.
Limitations of Prior Work: Existing benchmarks focus either on spatial understanding (e.g., SpatialRGPT-Bench) or logical reasoning (e.g., MathVista) in isolation, lacking an evaluation framework that integrates both. Furthermore, embodied question answering (EQA) tasks target action execution rather than purely visual-semantic reasoning.
Key Challenge: Spatial logical reasoning demands that a model simultaneously possess precise spatial perception and rigorous multi-step causal reasoning; the fusion of these two capabilities has not been systematically studied in existing VLMs.
Goal: (a) Construct a comprehensive spatial logical reasoning benchmark; (b) systematically evaluate existing VLMs on this task; (c) propose improvement methods.
Key Insight: Decompose complex scenes into task-relevant scene graphs, enabling VLMs to focus on the spatial context surrounding target objects.
Core Idea: A recursive scene graph construction method progressively decomposes complex indoor scenes into task-relevant spatial relation graphs, thereby enhancing VLMs' multi-step spatial reasoning capabilities.
Method
Overall Architecture
Given an indoor scene image and a question requiring multi-step spatial reasoning, the method produces a sequence of logically coherent operation steps. The pipeline consists of three stages: (1) obtaining depth maps and segmentation maps via visual foundation models; (2) recursively constructing a scene graph centered on the target object; (3) feeding the scene graph together with the question into a VLM to generate the final answer.
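A minimal Python sketch of this three-stage pipeline follows; every helper name (`estimate_depth`, `segment`, `build_scene_graph`, `query_vlm`) is a hypothetical stand-in for the corresponding model call, not the paper's implementation:

```python
# Hypothetical sketch of the three-stage pipeline; all helpers below are
# stand-ins for the actual foundation-model calls, not the paper's code.

def answer_spatial_question(image, question, target_object, max_iters=3):
    # Stage 1: dense geometric priors from visual foundation models
    depth_map = estimate_depth(image)   # e.g., Depth Anything V2
    seg_masks = segment(image)          # e.g., SAM instance masks

    # Stage 2: recursively build a scene graph centered on the target object
    graph = build_scene_graph(image, depth_map, seg_masks,
                              source=target_object, max_iters=max_iters)

    # Stage 3: condition the VLM on the scene graph plus the question
    prompt = f"Scene graph: {graph}\nQuestion: {question}"
    return query_vlm(image, prompt)
```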
Key Designs
- SpatiaLQA Benchmark Construction:
  - Function: Constructs 9,605 QA pairs derived from 241 real-world indoor scenes.
  - Mechanism: Three-stage data collection: 2,401 pairs annotated manually, 2,251 pairs obtained via subgraph extraction augmentation, and 4,953 pairs generated via graph expansion augmentation.
  - Design Motivation: Directly constructing large-scale spatial logical reasoning data is prohibitively costly; subgraph extraction based on logical dependencies and graph expansion enable efficient augmentation.
- Evaluation Metric Design:
  - Function: Step-level matching based on GPT-4o and the Hungarian algorithm.
  - Mechanism: GPT-4o first generates a matching matrix between predicted steps and annotated steps; the Hungarian algorithm then determines the optimal one-to-one assignment; precision and recall are finally computed for both content and preconditions (see the matching sketch after this list).
  - Design Motivation: Open-ended multi-step answers cannot be evaluated with conventional accuracy metrics, necessitating step-level semantic matching.
- Recursive Scene Graph-Assisted Reasoning (RSGAR):
  - Function: Leverages Depth Anything V2 and SAM to obtain depth and segmentation information, then recursively constructs a scene graph centered on the target object.
  - Mechanism: The task-specified object serves as the initial source node; the VLM identifies objects in direct contact with it along with their spatial relations, forming scene graph nodes and edges; the process iterates until a maximum number of iterations is reached.
  - Design Motivation: Directly processing complex scenes often causes VLMs to overlook critical spatial relations; progressive decomposition directs the model's attention to local spatial contexts.
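A minimal sketch of the step-level matching metric referenced above, assuming the judge (GPT-4o in the paper) has already produced a similarity matrix; the 0.5 threshold and the single combined F1 are illustrative simplifications (the paper scores contents and preconditions separately):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def step_f1(sim: np.ndarray, threshold: float = 0.5) -> float:
    """sim[i, j]: judge-assigned similarity of predicted step i to gold step j."""
    n_pred, n_gold = sim.shape
    # Hungarian algorithm finds the optimal one-to-one assignment;
    # negate because linear_sum_assignment minimizes total cost.
    rows, cols = linear_sum_assignment(-sim)
    matched = int((sim[rows, cols] >= threshold).sum())
    if matched == 0:
        return 0.0
    precision = matched / n_pred   # matched predicted steps / all predicted
    recall = matched / n_gold      # matched gold steps / all annotated
    return 2 * precision * recall / (precision + recall)
```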
Loss & Training
RSGAR is an inference-time method that requires no additional training, augmenting reasoning directly using pretrained VLMs and visual foundation models.
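Expanding the `build_scene_graph` step from the earlier pipeline sketch, the recursion can be written as a breadth-first expansion loop; `query_contacts` is a hypothetical wrapper around the VLM prompt that returns (neighbor, relation) pairs:

```python
# Illustrative sketch of the recursive expansion loop; `query_contacts` is a
# hypothetical stand-in, and the paper's exact prompting is not reproduced.

def build_scene_graph(image, depth_map, seg_masks, source, max_iters=3):
    nodes, edges = {source}, []        # graph centered on the target object
    frontier = [source]
    for _ in range(max_iters):         # fixed iteration budget (no adaptive stop)
        next_frontier = []
        for obj in frontier:
            # Ask the VLM for objects in direct contact with `obj` and their
            # spatial relations, grounded by the depth and segmentation maps.
            for neighbor, relation in query_contacts(image, depth_map,
                                                     seg_masks, obj):
                edges.append((obj, relation, neighbor))
                if neighbor not in nodes:
                    nodes.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return nodes, edges
```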
Key Experimental Results
Main Results
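Both columns are step-level F1 scores over the Hungarian-matched steps; assuming the standard harmonic-mean definition (the paper's exact weighting is not reproduced here):

\[ F_c = \frac{2\,P_c R_c}{P_c + R_c}, \qquad F_p = \frac{2\,P_p R_p}{P_p + R_p} \]

where \(P\) and \(R\) denote step-level precision and recall computed over matched step contents and preconditions, respectively.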
| Model | \(F_c\) (Content F1, %) | \(F_p\) (Precondition F1, %) |
|---|---|---|
| Human | 97.6 | 92.5 |
| GPT-4o | 52.5 | 19.2 |
| GPT-4o + RSGAR | 56.8 | 22.4 |
| Claude 3.5 Sonnet | 46.3 | 15.8 |
| Gemini 2.0 Flash | 44.1 | 14.7 |
| InternVL2-26B | 38.2 | 12.1 |
Ablation Study
| Configuration | \(F_c\) (%) | \(F_p\) (%) | Note |
|---|---|---|---|
| GPT-4o (baseline) | 52.5 | 19.2 | No scene graph assistance |
| + Depth map | 53.8 | 20.1 | Depth information only |
| + Segmentation map | 54.2 | 20.5 | Segmentation information only |
| + RSGAR (1 round) | 55.1 | 21.3 | Single-round scene graph |
| + RSGAR (3 rounds) | 56.8 | 22.4 | 3 recursive rounds, best performance |
Key Findings
- Even the strongest model, GPT-4o, reaches only \(F_c = 52.5\%\) on spatial logical reasoning, far below the human score of 97.6%.
- All VLMs perform considerably worse on precondition reasoning (\(F_p\)), indicating that understanding inter-step dependencies is the core challenge.
- Model performance degrades sharply as the number of answer steps increases, confirming that multi-step reasoning is the primary bottleneck.
- RSGAR yields consistent improvements across multiple VLMs, validating the effectiveness of scene graph decomposition.
Highlights & Insights
- Evaluation Framework Design: Combining GPT-4o for semantic matching with the Hungarian algorithm for optimal alignment elegantly addresses the challenge of evaluating open-ended multi-step responses. This two-stage evaluation paradigm is transferable to other multi-step reasoning tasks.
- Recursive Scene Graph Decomposition: Transforming end-to-end complex spatial reasoning into progressively focused sub-problem solving is analogous to a spatial counterpart of chain-of-thought reasoning, and cleverly exploits the complementary strengths of visual foundation models.
- Data Augmentation Strategy: Subgraph extraction and graph expansion efficiently scale a limited set of manual annotations into a large volume of benchmark QA pairs while preserving logical consistency (sketched below).
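A minimal illustration of the subgraph-extraction idea, assuming annotated reasoning steps form a dependency DAG; `networkx` is used here for convenience and none of this reflects the authors' implementation:

```python
# Illustrative subgraph-extraction augmentation: given an annotated reasoning
# chain as a dependency DAG, each step together with all of its prerequisite
# steps forms a smaller, logically self-contained QA instance.
import networkx as nx

def extract_subchains(dag: nx.DiGraph):
    """Yield the ancestor-closed subgraph rooted at each reasoning step."""
    for step in dag.nodes:
        prereqs = nx.ancestors(dag, step) | {step}  # step + all dependencies
        yield dag.subgraph(prereqs).copy()          # preserves logical consistency
```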
Limitations & Future Work
- The benchmark covers only indoor scenes; complex outdoor environments (e.g., traffic scenarios, construction sites) are not addressed.
- RSGAR depends on external visual foundation models (SAM, Depth Anything V2), introducing additional computational overhead and error propagation.
- The maximum number of iterations in scene graph construction is fixed, lacking an adaptive termination mechanism.
- Incorporating spatial logical reasoning capabilities into VLM training has not been explored; the proposed method enhances reasoning only at inference time.
Related Work & Insights
- vs. SpatialRGPT-Bench: SpatialRGPT focuses solely on spatial understanding without involving multi-step logical reasoning; SpatiaLQA extends this by incorporating step-level dependency relations.
- vs. EmbodiedBench: EmbodiedBench targets embodied execution with a predefined action primitive output space; SpatiaLQA focuses on open-vocabulary reasoning processes.
Rating
- Novelty: ⭐⭐⭐⭐ Proposes a new task definition and large-scale benchmark, filling the gap in spatial logical reasoning evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluates 41 VLMs covering mainstream models with comprehensive analysis.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, though some descriptions are verbose.
- Value: ⭐⭐⭐⭐ The benchmark resource is of significant value to the community; the proposed method has considerable room for improvement.