SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Conference: CVPR 2026
arXiv: 2602.20901
Code: https://github.com/xieyc99/SpatiaLQA
Area: Multimodal VLM
Keywords: Spatial logical reasoning, VLM benchmark, scene graph, indoor scene understanding, multi-step reasoning
TL;DR
This paper introduces SpatiaLQA, a benchmark of 9,605 QA pairs across 241 real-world indoor scenes; systematically evaluates 41 VLMs on spatial logical reasoning; and proposes a recursive scene-graph-assisted reasoning method (RSGAR) to strengthen VLMs' spatial logical reasoning capabilities.
Background & Motivation
Background: VLMs have achieved strong performance on general VQA and logical reasoning tasks, yet remain inadequate in complex real-world scenarios that require the integration of spatial understanding and multi-step logical reasoning.
Limitations of Prior Work: Existing benchmarks focus either on spatial understanding (e.g., SpatialRGPT-Bench) or logical reasoning (e.g., MathVista) in isolation, lacking an evaluation framework that integrates both. Furthermore, embodied question answering (EQA) tasks target action execution rather than purely visual-semantic reasoning.
Key Challenge: Spatial logical reasoning demands that a model simultaneously possess precise spatial perception and rigorous multi-step causal reasoning; the fusion of these two capabilities has not been systematically studied in existing VLMs.
Goal: (a) Construct a comprehensive spatial logical reasoning benchmark; (b) systematically evaluate existing VLMs on this task; (c) propose improvement methods.
Key Insight: Decompose complex scenes into task-relevant scene graphs, enabling VLMs to focus on the spatial context surrounding target objects.
Core Idea: A recursive scene graph construction method progressively decomposes complex indoor scenes into task-relevant spatial relation graphs, thereby enhancing VLMs' multi-step spatial reasoning capabilities.
Method
Overall Architecture
Given an indoor scene image and a question requiring multi-step spatial reasoning, the method produces a sequence of logically coherent operation steps. The pipeline consists of three stages: (1) obtaining depth maps and segmentation maps via visual foundation models; (2) recursively constructing a scene graph centered on the target object; (3) feeding the scene graph together with the question into a VLM to generate the final answer.
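A minimal Python sketch of this three-stage pipeline follows; every helper name (`estimate_depth`, `segment`, `build_scene_graph`, `query_vlm`) is a hypothetical stand-in for the corresponding model call, not the paper's implementation:

```python
# Hypothetical sketch of the three-stage pipeline; all helpers below are
# stand-ins for the actual foundation-model calls, not the paper's code.

def answer_spatial_question(image, question, target_object, max_iters=3):
    # Stage 1: dense geometric priors from visual foundation models
    depth_map = estimate_depth(image)   # e.g., Depth Anything V2
    seg_masks = segment(image)          # e.g., SAM instance masks

    # Stage 2: recursively build a scene graph centered on the target object
    graph = build_scene_graph(image, depth_map, seg_masks,
                              source=target_object, max_iters=max_iters)

    # Stage 3: condition the VLM on the scene graph plus the question
    prompt = f"Scene graph: {graph}\nQuestion: {question}"
    return query_vlm(image, prompt)
```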
Key Designs
- SpatiaLQA Benchmark Construction:
  - Function: Constructs 9,605 QA pairs derived from 241 real-world indoor scenes.
  - Mechanism: Three-stage data collection: 2,401 pairs annotated manually, 2,251 pairs obtained via subgraph extraction augmentation, and 4,953 pairs generated via graph expansion augmentation.
  - Design Motivation: Directly constructing large-scale spatial logical reasoning data is prohibitively costly; subgraph extraction based on logical dependencies and graph expansion enable efficient augmentation.
- Evaluation Metric Design:
  - Function: Step-level matching based on GPT-4o and the Hungarian algorithm.
  - Mechanism: GPT-4o first generates a matching matrix between predicted steps and annotated steps; the Hungarian algorithm then determines the optimal one-to-one assignment; precision and recall are finally computed for both content and preconditions (see the matching sketch after this list).
  - Design Motivation: Open-ended multi-step answers cannot be evaluated with conventional accuracy metrics, necessitating step-level semantic matching.
- Recursive Scene Graph-Assisted Reasoning (RSGAR):
  - Function: Leverages Depth Anything V2 and SAM to obtain depth and segmentation information, then recursively constructs a scene graph centered on the target object.
  - Mechanism: The task-specified object serves as the initial source node; the VLM identifies objects in direct contact with it along with their spatial relations, forming scene graph nodes and edges; the process iterates until a maximum number of iterations is reached.
  - Design Motivation: Directly processing complex scenes often causes VLMs to overlook critical spatial relations; progressive decomposition directs the model's attention to local spatial contexts.
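A minimal sketch of the step-level matching metric referenced above, assuming the judge (GPT-4o in the paper) has already produced a similarity matrix; the 0.5 threshold and the single combined F1 are illustrative simplifications (the paper scores contents and preconditions separately):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def step_f1(sim: np.ndarray, threshold: float = 0.5) -> float:
    """sim[i, j]: judge-assigned similarity of predicted step i to gold step j."""
    n_pred, n_gold = sim.shape
    # Hungarian algorithm finds the optimal one-to-one assignment;
    # negate because linear_sum_assignment minimizes total cost.
    rows, cols = linear_sum_assignment(-sim)
    matched = int((sim[rows, cols] >= threshold).sum())
    if matched == 0:
        return 0.0
    precision = matched / n_pred   # matched predicted steps / all predicted
    recall = matched / n_gold      # matched gold steps / all annotated
    return 2 * precision * recall / (precision + recall)
```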
Loss & Training
RSGAR is an inference-time method that requires no additional training, augmenting reasoning directly using pretrained VLMs and visual foundation models.
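Expanding the `build_scene_graph` step from the earlier pipeline sketch, the recursion can be written as a breadth-first expansion loop; `query_contacts` is a hypothetical wrapper around the VLM prompt that returns (neighbor, relation) pairs:

```python
# Illustrative sketch of the recursive expansion loop; `query_contacts` is a
# hypothetical stand-in, and the paper's exact prompting is not reproduced.

def build_scene_graph(image, depth_map, seg_masks, source, max_iters=3):
    nodes, edges = {source}, []        # graph centered on the target object
    frontier = [source]
    for _ in range(max_iters):         # fixed iteration budget (no adaptive stop)
        next_frontier = []
        for obj in frontier:
            # Ask the VLM for objects in direct contact with `obj` and their
            # spatial relations, grounded by the depth and segmentation maps.
            for neighbor, relation in query_contacts(image, depth_map,
                                                     seg_masks, obj):
                edges.append((obj, relation, neighbor))
                if neighbor not in nodes:
                    nodes.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return nodes, edges
```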
Key Experimental Results
Main Results
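Both columns are step-level F1 scores over the Hungarian-matched steps; assuming the standard harmonic-mean definition (the paper's exact weighting is not reproduced here):

\[ F_c = \frac{2\,P_c R_c}{P_c + R_c}, \qquad F_p = \frac{2\,P_p R_p}{P_p + R_p} \]

where \(P\) and \(R\) denote step-level precision and recall computed over matched step contents and preconditions, respectively.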
| Model | \(F_c\) (Content F1, %) | \(F_p\) (Precondition F1, %) |
|---|---|---|
| Human | 97.6 | 92.5 |
| GPT-4o | 52.5 | 19.2 |
| GPT-4o + RSGAR | 56.8 | 22.4 |
| Claude 3.5 Sonnet | 46.3 | 15.8 |
| Gemini 2.0 Flash | 44.1 | 14.7 |
| InternVL2-26B | 38.2 | 12.1 |
Ablation Study
| Configuration | \(F_c\) (%) | \(F_p\) (%) | Note |
|---|---|---|---|
| GPT-4o (baseline) | 52.5 | 19.2 | No scene graph assistance |
| + Depth map | 53.8 | 20.1 | Depth information only |
| + Segmentation map | 54.2 | 20.5 | Segmentation information only |
| + RSGAR (1 round) | 55.1 | 21.3 | Single-round scene graph |
| + RSGAR (3 rounds) | 56.8 | 22.4 | 3 recursive rounds, best performance |
Key Findings
- Even the strongest model, GPT-4o, reaches only \(F_c = 52.5\%\) on spatial logical reasoning, far below the human score of 97.6%.
- All VLMs perform considerably worse on precondition reasoning (\(F_p\)), indicating that understanding inter-step dependencies is the core challenge.
- Model performance degrades sharply as the number of answer steps increases, confirming that multi-step reasoning is the primary bottleneck.
- RSGAR yields consistent improvements across multiple VLMs, validating the effectiveness of scene graph decomposition.
Highlights & Insights
- Evaluation Framework Design: Combining GPT-4o for semantic matching with the Hungarian algorithm for optimal alignment elegantly addresses the challenge of evaluating open-ended multi-step responses. This two-stage evaluation paradigm is transferable to other multi-step reasoning tasks.
- Recursive Scene Graph Decomposition: Transforming end-to-end complex spatial reasoning into progressively focused sub-problem solving is analogous to a spatial counterpart of chain-of-thought reasoning, and cleverly exploits the complementary strengths of visual foundation models.
- Data Augmentation Strategy: Subgraph extraction and graph expansion efficiently scale a limited set of manual annotations into a large volume of benchmark QA pairs while preserving logical consistency (sketched below).
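A minimal illustration of the subgraph-extraction idea, assuming annotated reasoning steps form a dependency DAG; `networkx` is used here for convenience and none of this reflects the authors' implementation:

```python
# Illustrative subgraph-extraction augmentation: given an annotated reasoning
# chain as a dependency DAG, each step together with all of its prerequisite
# steps forms a smaller, logically self-contained QA instance.
import networkx as nx

def extract_subchains(dag: nx.DiGraph):
    """Yield the ancestor-closed subgraph rooted at each reasoning step."""
    for step in dag.nodes:
        prereqs = nx.ancestors(dag, step) | {step}  # step + all dependencies
        yield dag.subgraph(prereqs).copy()          # preserves logical consistency
```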
Limitations & Future Work
- The benchmark covers only indoor scenes; complex outdoor environments (e.g., traffic scenarios, construction sites) are not addressed.
- RSGAR depends on external visual foundation models (SAM, Depth Anything V2), introducing additional computational overhead and error propagation.
- The maximum number of iterations in scene graph construction is fixed, lacking an adaptive termination mechanism.
- Incorporating spatial logical reasoning capabilities into VLM training has not been explored; the proposed method enhances reasoning only at inference time.
Related Work & Insights
- vs. SpatialRGPT-Bench: SpatialRGPT focuses solely on spatial understanding without involving multi-step logical reasoning; SpatiaLQA extends this by incorporating step-level dependency relations.
- vs. EmbodiedBench: EmbodiedBench targets embodied execution with a predefined action primitive output space; SpatiaLQA focuses on open-vocabulary reasoning processes.
Rating
- Novelty: ⭐⭐⭐⭐ Proposes a new task definition and large-scale benchmark, filling the gap in spatial logical reasoning evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluates 41 VLMs covering mainstream models with comprehensive analysis.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, though some descriptions are verbose.
- Value: ⭐⭐⭐⭐ The benchmark resource is of significant value to the community; the proposed method has considerable room for improvement.