HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Conference: CVPR 2026 | arXiv: 2603.25411 | Code: N/A | Area: Multimodal VLM | Keywords: 3D spatial understanding, vision-language models, hierarchical task design, point cloud maps, spatial reasoning

TL;DR

HiSpatial decomposes 3D spatial intelligence into four cognitive levels (geometric perception → object attributes → inter-object relations → abstract reasoning), constructs an automated data pipeline processing ~5M images, 45M objects, and 2B QA pairs, and designs an RGB-D VLM that takes metric-scale point cloud maps as auxiliary input. With only 3B parameters, it surpasses GPT-5 and Gemini-2.5-Pro on multiple spatial reasoning benchmarks.

Background & Motivation

Background: VLMs achieve strong performance on 2D tasks such as VQA and image captioning, but extending these capabilities to 3D spatial understanding remains highly challenging. Recent works introduce spatially-oriented VQA tasks for SFT or RFT, yet face two primary challenges.
Limitations of Prior Work: (a) The absence of a unified, systematic task hierarchy—existing tasks provide incomplete coverage, and the dependencies among spatial reasoning skills at different levels remain unclear; (b) Large-scale, diverse, 3D-annotated data is difficult to obtain—existing 3D-annotated datasets are confined to indoor scenes, while large-scale web data lacks 3D supervision.
Key Challenge: Prior work addresses isolated aspects of spatial understanding (qualitative relation comparison, quantitative distance prediction, etc.), but no work has systematically investigated the hierarchical dependencies among these tasks: does training on lower-level tasks facilitate the emergence of higher-level capabilities?
Goal: (a) Define a comprehensive 3D spatial understanding task taxonomy with explicit hierarchical dependencies; (b) Construct a large-scale spatial VQA dataset; (c) Empirically validate inter-level dependencies and provide guidance for training strategies.
Key Insight: 3D spatial intelligence is analogized to four progressive stages of human cognition: perceiving depth and geometry → understanding intrinsic 3D properties of objects → reasoning about inter-object spatial relations → performing abstract spatial reasoning (viewpoint transformation, spatial counting, spatial problem solving).
Core Idea: A four-level cognitive hierarchy combined with a large-scale automated data pipeline and a metric-scale point-cloud-augmented RGB-D VLM, enabling systematic construction and validation of 3D spatial intelligence in VLMs.

Method

Overall Architecture

The method comprises three components: (1) defining a four-level spatial understanding task hierarchy (L0–L3); (2) constructing an automated pipeline to generate hierarchical spatial VQA pairs from large-scale images; (3) designing a VLM architecture with point cloud map auxiliary input and performing SFT on the generated data.

Key Designs

Four-Level Cognitive Spatial Task Taxonomy
- Function: Systematically covers the full spectrum of spatial understanding capabilities, from low-level perception to high-level reasoning.
- Mechanism:
  - Level 0 (Basic Geometric Perception): Pixel-level 3D point queries (outputting 3D coordinates given a 2D location) and pairwise depth ordering (judging the relative depth of two points). No semantic information required.
  - Level 1 (Object-Level Spatial Understanding): Object localization (predicting 3D position), orientation estimation (describing yaw direction in natural language), and size estimation (physical dimensions such as width and height). Requires integrating geometric perception with semantic grounding.
  - Level 2 (Inter-Object Relation Understanding): Relative direction estimation (qualitative, e.g., left/right/front/back, or precise 3D direction vectors), relative distance estimation (Euclidean distance and its components), and relational comparison (ranking multiple objects by attributes, judging directional consistency).
  - Level 3 (Abstract Spatial Reasoning): Viewpoint transformation (inferring directions/distances of other objects from a given object's perspective), spatial object counting (counting objects satisfying spatial constraints), and spatial problem solving (multi-step reasoning that maps high-level goals to spatial attributes).
- Design Motivation: Ablation experiments support the inter-level dependencies: removing L0+L1 training data lowers average L2 accuracy by 25.00 points (81.21 → 56.21; EmbSpatial alone falls from 80.71% to 37.53%) and average L3 accuracy by 14.51 points (56.29 → 41.78), indicating that lower-level tasks supply implicit spatial knowledge that higher-level reasoning depends on.
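
To make the levels concrete, the following is a minimal sketch of two L0 tasks (point query, depth ordering) and one L3 task (viewpoint transformation). It is illustrative only and not the paper's implementation: the function names, the camera convention (x right, y down, z forward, with a level camera so that the x-z plane is the ground plane), and the yaw convention are all assumptions.

```python
import math

# Assumed camera frame: x to the right, y down, z forward (metres).
# Assumed level camera, so the x-z plane serves as the ground plane.

def query_point(point_map, u, v):
    """L0 point query: return the 3D point stored at pixel column u, row v."""
    return point_map[v][u]

def nearer_pixel(point_map, uv_a, uv_b):
    """L0 depth ordering: report which of two pixels is closer to the camera."""
    z_a = query_point(point_map, *uv_a)[2]
    z_b = query_point(point_map, *uv_b)[2]
    return "A" if z_a < z_b else "B"

def direction_from_viewpoint(anchor_pos, anchor_yaw, target_pos):
    """L3 viewpoint transformation: where is the target, as seen by the anchor?

    anchor_yaw is the anchor's heading in the x-z plane, in radians (assumed
    convention): yaw 0 faces +z (away from the camera), yaw pi faces the camera.
    """
    dx = target_pos[0] - anchor_pos[0]
    dz = target_pos[2] - anchor_pos[2]
    # Project the offset onto the anchor's own forward and right axes.
    forward = dx * math.sin(anchor_yaw) + dz * math.cos(anchor_yaw)
    right = dx * math.cos(anchor_yaw) - dz * math.sin(anchor_yaw)
    if abs(right) >= abs(forward):
        return "right" if right > 0 else "left"
    return "front" if forward > 0 else "behind"

# Toy 1x2 point map: the left pixel is 2 m away, the right pixel 5 m away.
point_map = [[(-0.5, 0.0, 2.0), (0.5, 0.0, 5.0)]]
print(query_point(point_map, 1, 0))             # (0.5, 0.0, 5.0)
print(nearer_pixel(point_map, (0, 0), (1, 0)))  # A

chair, lamp = (0.0, 0.0, 3.0), (1.0, 0.0, 3.0)
print(direction_from_viewpoint(chair, 0.0, lamp))      # right
print(direction_from_viewpoint(chair, math.pi, lamp))  # left
```

The last two calls show why viewpoint transformation sits above L1 and L2: the answer flips from "right" to "left" once the chair's estimated orientation (an L1 attribute) is combined with the relative offset between the objects (an L2 quantity).
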
Automated Spatial VQA Data Pipeline
- Function: End-to-end generation of hierarchical spatial VQA pairs from large-scale image data.
- Mechanism: A three-stage pipeline:
  - (a) Spatial information estimation: MoGe-2 generates pixel-level 3D point cloud maps; RAM→GroundingDINO→SAM detects objects and, combined with the point cloud, obtains 3D bounding boxes and dimensions; OrientAnythingv2 estimates orientations; Perspective Fields establishes a gravity-aligned world coordinate system.
  - (b) Textual reference generation: Describe Anything/Qwen2.5-VL/Qwen3-VL generate object descriptions, which are verified via VLM grounding (references with IoU below a threshold are discarded).
  - (c) QA synthesis: Three formats are generated per task level (open-ended, multiple-choice, true/false); L3 spatial problem-solving questions are generated by GPT to require multi-step reasoning.
- Design Motivation: The verification step in textual reference generation eliminates ambiguity (one description matching multiple objects), ensuring QA accuracy. Three formats provide complementary learning signals. The pipeline ultimately yields a large-scale dataset of 5M images, 45M objects, and 2B QA pairs.
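
The verification and QA-synthesis stages can be sketched roughly as follows. This is a simplified illustration, not the paper's pipeline: the 0.5 IoU threshold, the question templates, the distractor scheme, and all function names are hypothetical, and the object centers are assumed to come from the estimated 3D bounding boxes.

```python
import math

IOU_THRESHOLD = 0.5  # hypothetical value; the source only says "a threshold"

def box_iou(a, b):
    """IoU of two axis-aligned 2D boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_reference(detected_box, grounded_box, threshold=IOU_THRESHOLD):
    """Keep a description only if grounding it lands back on the same object."""
    return box_iou(detected_box, grounded_box) >= threshold

def distance_qa(name_a, center_a, name_b, center_b):
    """Emit one relative-distance question in the three formats."""
    d = math.dist(center_a, center_b)  # Euclidean distance in metres
    # Fixed distractors for readability; a real pipeline would randomize them.
    options = [f"{d * scale:.2f} m" for scale in (0.5, 1.0, 1.5, 2.0)]
    question = f"How far is {name_a} from {name_b}?"
    return {
        "open_ended": {"question": question, "answer": f"{d:.2f} m"},
        "multiple_choice": {
            "question": question,
            "options": dict(zip("ABCD", options)),
            "answer": "B",
        },
        "true_false": {
            "question": f"True or false: {name_a} is less than 1 meter from {name_b}.",
            "answer": "true" if d < 1.0 else "false",
        },
    }

# A description whose grounded box drifts onto a neighbour is discarded.
print(keep_reference((0, 0, 2, 2), (1, 0, 3, 2)))  # False (IoU = 1/3)

qa = distance_qa("the red mug", (0.0, 0.0, 2.0), "the laptop", (3.0, 0.0, 6.0))
print(qa["open_ended"]["answer"])  # 5.00 m
```
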
Point Cloud Map-Augmented RGB-D VLM
- Function: Enhances spatial reasoning by introducing metric-scale 3D point cloud maps as auxiliary input.
- Mechanism: Built upon PaliGemma2-3B, the model takes a point cloud map \(\mathbf{X} \in \mathbb{R}^{H \times W \times 4}\) as input (first 3 channels encode 3D coordinates; the 4th channel is a validity mask). Sinusoidal positional encodings and a learnable patchify convolutional layer produce feature maps, which are concatenated with SigLIP visual features along the feature dimension and fused via a linear projector before being fed into the language model. During training, the visual encoder is frozen; the patchify layer, fusion projector, and LLM are jointly fine-tuned.
- Design Motivation: Unlike prior methods that use relative depth maps, this work employs metric-scale point cloud maps to provide richer 3D information. Ablations show that metric point clouds outperform relative depth by 6.76 points on quantitative tasks (75.26% → 82.02%), since metric-scale information directly supports precise distance and size estimation. Using GT point clouds yields a further 0.77-point gain (82.79%), suggesting additional headroom in settings with depth sensors (e.g., embodied AI).
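
The point-cloud branch described above can be sketched at the shape level in NumPy. This is an assumption-laden illustration rather than the released architecture: the number of sinusoidal frequencies, the patch size of 14, the way the validity mask is applied, the toy feature widths, and the random weights are all placeholders. A stride-`p` patchify convolution is written here as patch flattening followed by a matrix product, which is an equivalent formulation for non-overlapping patches.

```python
import numpy as np

def sinusoidal_encode(xyz, num_freqs=4):
    """Encode (H, W, 3) coordinates into (H, W, 3 * 2 * num_freqs) features."""
    freqs = 2.0 ** np.arange(num_freqs)            # assumed frequency schedule
    angles = xyz[..., None] * freqs                # (H, W, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*xyz.shape[:2], -1)

def patchify(feat, patch):
    """Split (H, W, C) into non-overlapping patches: (num_patches, p * p * C)."""
    h, w, c = feat.shape
    gh, gw = h // patch, w // patch
    x = feat[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def fuse_tokens(point_map, visual_tokens, w_patch, w_fuse, patch=14, num_freqs=4):
    """Fuse point-map tokens with visual tokens along the feature dimension."""
    xyz, valid = point_map[..., :3], point_map[..., 3:4]
    # Assumption: invalid pixels are zeroed and the mask is kept as a channel.
    enc = np.concatenate([sinusoidal_encode(xyz, num_freqs) * valid, valid], axis=-1)
    geo_tokens = patchify(enc, patch) @ w_patch    # learnable patchify layer
    fused = np.concatenate([visual_tokens, geo_tokens], axis=-1)
    return fused @ w_fuse                          # linear fusion projector

rng = np.random.default_rng(0)
H = W = 28                                         # toy resolution: 2 x 2 patches
point_map = rng.normal(size=(H, W, 4))
point_map[..., 3] = 1.0                            # mark every pixel as valid
visual_tokens = rng.normal(size=(4, 16))           # stand-in for SigLIP features
w_patch = rng.normal(size=(14 * 14 * 25, 8)) * 0.01
w_fuse = rng.normal(size=(16 + 8, 32)) * 0.1
tokens = fuse_tokens(point_map, visual_tokens, w_patch, w_fuse)
print(tokens.shape)  # (4, 32)
```

Concatenating along the feature dimension, rather than appending extra tokens, keeps the sequence length seen by the language model unchanged, so each visual token simply carries geometry for its own patch.
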

Loss & Training

Standard VLM SFT cross-entropy loss. AdamW optimizer, learning rate \(2 \times 10^{-5}\), batch size 256, trained for 70K steps. To preserve general capabilities, LLaVA-Next general VQA data is mixed with the spatial VQA data at a 1:7 (general:spatial) sampling ratio, i.e., roughly 12% general and 88% spatial samples.
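
As a quick sanity check on the training budget, the sketch below works out how many samples the schedule covers and how a per-sample source sampler might realize the mix. It assumes the 1:7 ratio means one general sample for every seven spatial samples, which matches the 88% spatial / 12% general split quoted under Key Findings; the sampler itself is hypothetical and not taken from the paper.

```python
import random

STEPS = 70_000
BATCH_SIZE = 256
GENERAL_PARTS, SPATIAL_PARTS = 1, 7  # assumed reading of the 1:7 ratio

def mixture_stats(steps=STEPS, batch_size=BATCH_SIZE):
    """Return total samples seen and the expected share from each source."""
    total = steps * batch_size
    p_spatial = SPATIAL_PARTS / (SPATIAL_PARTS + GENERAL_PARTS)
    return total, p_spatial, 1.0 - p_spatial

def sample_source(rng, p_spatial):
    """Pick the dataset for one training example (hypothetical sampler)."""
    return "spatial" if rng.random() < p_spatial else "general"

total, p_spatial, p_general = mixture_stats()
print(total)                 # 17920000
print(p_spatial, p_general)  # 0.875 0.125

rng = random.Random(0)
draws = [sample_source(rng, p_spatial) for _ in range(10_000)]
spatial_share = draws.count("spatial") / len(draws)
print(round(spatial_share, 2))
```

Under these assumptions the run sees about 17.9M samples, a small fraction of the 2B generated QA pairs, so most of the dataset is never visited in a single run.
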

Key Experimental Results

Main Results

Quantitative spatial VQA benchmarks (L1–L2 tasks):

| Model | Input | SpatialRGPT Avg | QSpatial Avg |
| --- | --- | --- | --- |
| GPT-5 | RGB | 40.47 | 68.45 |
| Gemini-2.5-Pro | RGB | 26.57 | 49.92 |
| MM-Spatial-3B | RGB-D | 68.70 | - |
| HiSpatial-3B | RGB-XYZ | 79.28 | 85.16 |

Qualitative spatial VQA benchmarks (L1–L3 tasks):

| Model | EmbSpatial | RoboSpatial | CV-Bench-3D | 3DSRBench |
| --- | --- | --- | --- | --- |
| GPT-4o | 63.38 | 77.20 | 84.90 | 44.20 |
| Gemini-2.5-Pro | 76.67 | 77.24 | 90.80 | 48.47 |
| Qwen-3-VL-8B | 78.50 | 82.11 | 90.66 | 52.80 |
| HiSpatial-3B | 80.71 | 86.18 | 97.58 | 63.81 |

In-house benchmark (L1–L3):

| Model | Object Distance (L1) | Object Direction (L2) | Spatial Problem Solving (L3) |
| --- | --- | --- | --- |
| GPT-5 | 47.19% | 59.27% | 33.33% |
| HiSpatial-3B | 92.18% | 67.21% | 47.44% |

Ablation Study

Inter-level dependency analysis:

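| L0 | L1 | L2 | L3 | L2 Tasks Avg | L3 Tasks Avg | Note |
| --- | --- | --- | --- | --- | --- | --- |
| ✓ | ✓ | ✓ | ✓ | 81.21 | 56.29 | Full model |
| ✓ | ✓ |  | ✓ | 79.69 (−1.52) | 48.15 (−8.14) | w/o L2: L3 drops 8.14 points |
| ✓ |  | ✓ |  | 56.21 (−25.00) | 41.78 (−14.51) | w/o L0+L1: L2 drops 25.00 points |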
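
The parenthesized deltas in this table are absolute differences in accuracy (percentage points) relative to the full model, not relative percentages. The short sketch below recomputes both views from the reported averages; the dictionary keys and function name are illustrative labels only.

```python
FULL = {"L2": 81.21, "L3": 56.29}
ABLATIONS = {
    "w/o L2": {"L2": 79.69, "L3": 48.15},
    "w/o L0+L1": {"L2": 56.21, "L3": 41.78},
}

def accuracy_drop(setting, level):
    """Return (absolute drop in points, relative drop in percent)."""
    full, ablated = FULL[level], ABLATIONS[setting][level]
    points = round(full - ablated, 2)
    relative = round(100.0 * (full - ablated) / full, 2)
    return points, relative

for setting in ABLATIONS:
    for level in ("L2", "L3"):
        points, relative = accuracy_drop(setting, level)
        print(f"{setting:10s} {level}: -{points:.2f} points ({relative:.2f}% relative)")
```

Read this way, removing L0+L1 costs L2 tasks 25.00 points, which is roughly a 31% relative drop from the full model's 81.21.
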

Effect of auxiliary 3D input:

| Input | Qualitative | Quantitative |
| --- | --- | --- |
| RGB only | 83.70 | 74.16 |
| RGB + relative depth | 84.29 (+0.59) | 75.26 (+0.90) |
| RGB + XYZ point cloud | 84.79 | 82.02 (+6.76) |
| RGB + GT XYZ | - | 82.79 (+0.77) |

Key Findings

Hierarchical dependencies are highly pronounced: Even when L2 training data far exceeds L0+L1 in volume, removing the latter causes a dramatic drop in L2 performance (EmbSpatial: 80.71% → 37.53%), confirming that low-level geometric perception provides an irreplaceable implicit knowledge foundation for higher-level reasoning.
Impact on L3 follows a hierarchical gradient: In the ablation table, removing L0+L1 harms L3 more than removing L2 alone (−14.51 vs. −8.14 points), indicating that L3 relies on the low-level geometric and object-level foundation in addition to L2 relational skills.
Metric-scale point clouds substantially outperform relative depth: The gap reaches 6.76 points on quantitative tasks (75.26% → 82.02%), as metric information directly enables precise distance and size estimation.
Spatial SFT does not degrade general capabilities: Training on 88% spatial + 12% general data improves MMBench from 49.86% to 69.67%, suggesting that heavy spatial training need not come at the expense of general VQA ability.

Highlights & Insights

The systematic four-level cognitive hierarchy is the paper's central contribution: it not only proposes a task set but also reveals hierarchical dependencies among tasks, giving concrete guidance for future training strategies (lower-level tasks should stay in the training mix even when the target is higher-level reasoning).
The large-scale automated data pipeline has strong reuse value: from MoGe-2 point cloud estimation to multi-model object detection, textual reference generation and verification, and multi-format QA synthesis, the entire workflow can be directly applied to new image datasets.
A 3B model surpassing GPT-5 and Gemini-2.5-Pro demonstrates that strong spatial understanding can be achieved in small models through high-quality domain data and principled architectural design, without requiring massive model scale.
The finding that metric-scale point cloud maps outperform relative depth maps points toward promising directions for downstream tasks with depth sensors, such as embodied AI.

Limitations & Future Work

The pipeline relies on MoGe-2 for point cloud estimation; quality may degrade in scenes with sparse texture or heavy occlusion.
The textual reference verification step has a limited pass rate, falling back to category labels and bounding boxes upon verification failure.
L3 spatial problem-solving questions are generated by GPT, potentially introducing bias and insufficient diversity.
Validation is conducted only on PaliGemma2-3B; whether the findings generalize to larger models remains unclear.
Evaluation is limited to static single-image scenes; 3D spatial understanding in video or multi-view settings is not addressed.

Comparison with Related Work

vs. SpatialRGPT: SpatialRGPT addresses only L1–L2 quantitative tasks and uses relative depth. HiSpatial covers all four levels and uses metric-scale point clouds, improving quantitative accuracy from 56.22% to 79.28%.
vs. MM-Spatial: Both use RGB-D input and a 3B model, but HiSpatial's hierarchical data is more comprehensive (2B QA pairs), improving quantitative average from 68.70% to 79.28%.
vs. RoboRefer: RoboRefer focuses on spatial referencing in embodied settings and lacks systematic hierarchical design. HiSpatial also achieves superior performance on RoboSpatial (86.18% vs. 84.55%).

Rating

Novelty: ⭐⭐⭐⭐ The four-level hierarchy is conceptually intuitive but rigorously executed; the inter-level dependency analysis is a genuinely novel contribution.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven external benchmarks, an in-house benchmark, detailed ablations, and general capability evaluations—exceptionally comprehensive.
Writing Quality: ⭐⭐⭐⭐ Clear structure, information-rich figures and tables, and detailed pipeline descriptions.
Value: ⭐⭐⭐⭐⭐ The data pipeline and hierarchical framework offer high reference value to the community; the 3B model surpassing GPT-5 serves as a compelling demonstration.