
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Conference: CVPR 2026 arXiv: 2603.25411 Code: N/A Area: Multimodal VLM Keywords: 3D spatial understanding, vision-language models, hierarchical task design, point cloud maps, spatial reasoning

TL;DR

HiSpatial decomposes 3D spatial intelligence into four cognitive levels (geometric perception → object attributes → inter-object relations → abstract reasoning), constructs an automated data pipeline processing ~5M images, 45M objects, and 2B QA pairs, and designs an RGB-D VLM that takes metric-scale point cloud maps as auxiliary input. With only 3B parameters, it surpasses GPT-5 and Gemini-2.5-Pro on multiple spatial reasoning benchmarks.

Background & Motivation

  1. Background: VLMs achieve strong performance on 2D tasks such as VQA and image captioning, but extending these capabilities to 3D spatial understanding remains difficult. Recent works introduce spatially oriented VQA tasks for SFT or RFT, yet face two primary challenges.

  2. Limitations of Prior Work: (a) The absence of a unified, systematic task hierarchy—existing tasks provide incomplete coverage, and the dependencies among spatial reasoning skills at different levels remain unclear; (b) Large-scale, diverse, 3D-annotated data is difficult to obtain—existing 3D-annotated datasets are confined to indoor scenes, while large-scale web data lacks 3D supervision.

  3. Key Challenge: Prior work addresses isolated aspects of spatial understanding (qualitative relation comparison, quantitative distance prediction, etc.), but no work has systematically investigated the hierarchical dependencies among these tasks: does training on lower-level tasks facilitate the emergence of higher-level capabilities?

  4. Goal: (a) Define a comprehensive 3D spatial understanding task taxonomy with explicit hierarchical dependencies; (b) Construct a large-scale spatial VQA dataset; (c) Empirically validate inter-level dependencies and provide guidance for training strategies.

  5. Key Insight: 3D spatial intelligence is modeled on four progressive stages of human cognition: perceiving depth and geometry → understanding intrinsic 3D properties of objects → reasoning about inter-object spatial relations → performing abstract spatial reasoning (viewpoint transformation, spatial counting, spatial problem solving).

  6. Core Idea: A four-level cognitive hierarchy combined with a large-scale automated data pipeline and a metric-scale point-cloud-augmented RGB-D VLM, enabling systematic construction and validation of 3D spatial intelligence in VLMs.

Method

Overall Architecture

The method comprises three components: (1) defining a four-level spatial understanding task hierarchy (L0–L3); (2) constructing an automated pipeline to generate hierarchical spatial VQA pairs from large-scale images; (3) designing a VLM architecture with point cloud map auxiliary input and performing SFT on the generated data.

Key Designs

  1. Four-Level Cognitive Spatial Task Taxonomy

    • Function: Systematically covers the full spectrum of spatial understanding capabilities, from low-level perception to high-level reasoning.
    • Mechanism:
      • Level 0 (Basic Geometric Perception): Pixel-level 3D point queries (outputting 3D coordinates given a 2D location) and pairwise depth ordering (judging the relative depth of two points). No semantic information required.
      • Level 1 (Object-Level Spatial Understanding): Object localization (predicting 3D position), orientation estimation (describing yaw direction in natural language), and size estimation (physical dimensions such as width and height). Requires integrating geometric perception with semantic grounding.
      • Level 2 (Inter-Object Relation Understanding): Relative direction estimation (qualitative, e.g., left/right/front/back, or precise 3D direction vectors), relative distance estimation (Euclidean distance and its components), and relational comparison (ranking multiple objects by attributes, judging directional consistency).
      • Level 3 (Abstract Spatial Reasoning): Viewpoint transformation (inferring directions/distances of other objects from a given object's perspective), spatial object counting (counting objects satisfying spatial constraints), and spatial problem solving (multi-step reasoning that maps high-level goals to spatial attributes).
    • Design Motivation: Ablation experiments clearly confirm inter-level dependencies—removing L0+L1 training data causes an average 25% drop in L2 performance (EmbSpatial drops from 80.71% to 37.53%) and an average 14.51% drop in L3 performance, demonstrating that lower-level tasks provide an irreplaceable implicit spatial knowledge foundation for higher-level reasoning.
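The four-level taxonomy above can be summarized as a simple data structure. This is an illustrative sketch, not the paper's code; the example questions are hypothetical paraphrases of the task definitions:

```python
# Illustrative sketch of the L0-L3 task taxonomy; the example questions
# below are hypothetical, written only to mirror each task definition.
TAXONOMY = {
    "L0": {  # basic geometric perception -- no semantics required
        "3d_point_query": "What is the 3D coordinate of pixel (412, 230)?",
        "depth_ordering": "Which point is closer to the camera, A or B?",
    },
    "L1": {  # object-level spatial understanding
        "localization": "Where is the red chair in 3D space?",
        "orientation": "Which way is the car facing?",
        "size_estimation": "How wide is the sofa in meters?",
    },
    "L2": {  # inter-object relation understanding
        "relative_direction": "Is the lamp to the left or right of the desk?",
        "relative_distance": "How far is the mug from the laptop?",
        "relational_comparison": "Rank the three boxes by height.",
    },
    "L3": {  # abstract spatial reasoning
        "viewpoint_transform": "From the chair's perspective, where is the door?",
        "spatial_counting": "How many cups are within 1 m of the sink?",
        "problem_solving": "Which shelf can fit the 40 cm-tall plant?",
    },
}

def level_of(task: str) -> str:
    """Return the cognitive level a task name belongs to."""
    for level, tasks in TAXONOMY.items():
        if task in tasks:
            return level
    raise KeyError(task)
```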
  2. Automated Spatial VQA Data Pipeline

    • Function: End-to-end generation of hierarchical spatial VQA pairs from large-scale image data.
    • Mechanism: A three-stage pipeline—(a) Spatial information estimation: MoGe-2 generates pixel-level 3D point cloud maps; RAM→GroundingDINO→SAM detects objects and, combined with the point cloud, obtains 3D bounding boxes and dimensions; OrientAnythingv2 estimates orientations; Perspective Fields establishes a gravity-aligned world coordinate system. (b) Textual reference generation: Describe Anything/Qwen2.5-VL/Qwen3-VL generate object descriptions, which are verified via VLM grounding (references with IoU below a threshold are discarded). (c) QA synthesis: Three formats are generated per task level (open-ended, multiple-choice, true/false); L3 spatial problem-solving questions are generated by GPT to require multi-step reasoning.
    • Design Motivation: The verification step in textual reference generation eliminates ambiguity (one description matching multiple objects), ensuring QA accuracy. Three formats provide complementary learning signals. The pipeline ultimately yields a large-scale dataset of 5M images, 45M objects, and 2B QA pairs.
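The verification step in stage (b) can be sketched as follows. This is a minimal stand-in, not the paper's code: the 0.5 IoU threshold is an assumed value (the paper only states that references below a threshold are discarded), and the VLM grounding call is replaced by a box argument:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def verify_reference(source_box, regrounded_box, threshold=0.5):
    """Keep a generated description only if grounding it with a VLM
    recovers (roughly) the box it was written for.  An ambiguous
    description that grounds to a different object fails the check.
    threshold=0.5 is an assumed value, not taken from the paper."""
    return iou(source_box, regrounded_box) >= threshold
```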
  3. Point Cloud Map-Augmented RGB-D VLM

    • Function: Enhances spatial reasoning by introducing metric-scale 3D point cloud maps as auxiliary input.
    • Mechanism: Built upon PaliGemma2-3B, the model takes a point cloud map \(\mathbf{X} \in \mathbb{R}^{H \times W \times 4}\) as input (first 3 channels encode 3D coordinates; the 4th channel is a validity mask). Sinusoidal positional encodings and a learnable patchify convolutional layer produce feature maps, which are concatenated with SigLIP visual features along the feature dimension and fused via a linear projector before being fed into the language model. During training, the visual encoder is frozen; the patchify layer, fusion projector, and LLM are jointly fine-tuned.
    • Design Motivation: Unlike prior methods that use relative depth maps, this work employs metric-scale point cloud maps to provide richer 3D information. Ablations show that metric point clouds outperform relative depth by 6.76% on quantitative tasks (75.26% → 82.02%), as metric-scale information directly supports precise distance and size estimation. Using GT point clouds yields further gains, indicating greater potential in settings with depth sensors (e.g., embodied AI).
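The fusion above can be sketched shape-wise in NumPy. This is a minimal illustration, not the paper's implementation: the patch size, feature dimensions, and random projections are placeholders for the learnable patchify convolution and fusion projector, and the sinusoidal positional encodings are omitted:

```python
import numpy as np

H, W, P = 224, 224, 14      # image size and patch size (assumed values)
D_PC, D_VIS = 256, 1152     # assumed point-cloud and SigLIP feature dims

# Point cloud map X in R^{H x W x 4}: 3 channels of metric XYZ + validity mask.
xyz = np.random.randn(H, W, 3).astype(np.float32)
mask = np.ones((H, W, 1), dtype=np.float32)
pc_map = np.concatenate([xyz, mask], axis=-1)            # (H, W, 4)

def patchify(x, patch, d_out, rng):
    """Stand-in for the learnable patchify conv: flatten each PxP patch
    and project it linearly to d_out features."""
    h, w, c = x.shape
    gh, gw = h // patch, w // patch
    patches = x.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * c)
    proj = rng.standard_normal((patch * patch * c, d_out)).astype(np.float32)
    return patches @ proj                                 # (num_patches, d_out)

rng = np.random.default_rng(0)
pc_tokens = patchify(pc_map, P, D_PC, rng)                # (256, 256)
vis_tokens = rng.standard_normal((pc_tokens.shape[0], D_VIS)).astype(np.float32)

# Concatenate along the feature dimension; a linear projector (here a random
# matrix) would then map the fused tokens into the LLM embedding space.
fused = np.concatenate([vis_tokens, pc_tokens], axis=-1)  # (256, D_VIS + D_PC)
```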

Loss & Training

Standard VLM SFT cross-entropy loss. AdamW optimizer, learning rate \(2 \times 10^{-5}\), batch size 256, trained for 70K steps. Spatial VQA data is mixed with LLaVA-Next general VQA data at a roughly 7:1 spatial-to-general sampling ratio (≈88% spatial, 12% general) to preserve general capabilities.
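The data mixing can be sketched as a weighted interleaving of two sample streams. A minimal sketch assuming the ~88% spatial / 12% general mix reported in the findings; the stream contents and sampler design are placeholders, not the paper's loader:

```python
import itertools
import random

def mix_streams(spatial, general, ratio=7, seed=0):
    """Yield samples from two streams so that on average `ratio` spatial
    samples are drawn per general sample (~88% spatial for ratio=7)."""
    rng = random.Random(seed)
    s, g = iter(spatial), iter(general)
    p_spatial = ratio / (ratio + 1)
    while True:
        try:
            yield next(s) if rng.random() < p_spatial else next(g)
        except StopIteration:  # either stream exhausted -> stop mixing
            return

# Example: two infinite labeled streams; inspect the realized mix.
batch = list(itertools.islice(
    mix_streams(itertools.repeat("spatial"), itertools.repeat("general")),
    8000))
frac_spatial = batch.count("spatial") / len(batch)   # close to 7/8
```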

Key Experimental Results

Main Results

Quantitative spatial VQA benchmarks (L1–L2 tasks):

| Model          | Input   | SpatialRGPT Avg | QSpatial Avg |
|----------------|---------|-----------------|--------------|
| GPT-5          | RGB     | 40.47           | 68.45        |
| Gemini-2.5-Pro | RGB     | 26.57           | 49.92        |
| MM-Spatial-3B  | RGB-D   | 68.70           | -            |
| HiSpatial-3B   | RGB-XYZ | 79.28           | 85.16        |

Qualitative spatial VQA benchmarks (L1–L3 tasks):

| Model          | EmbSpatial | RoboSpatial | CV-Bench-3D | 3DSRBench |
|----------------|------------|-------------|-------------|-----------|
| GPT-4o         | 63.38      | 77.20       | 84.90       | 44.20     |
| Gemini-2.5-Pro | 76.67      | 77.24       | 90.80       | 48.47     |
| Qwen-3-VL-8B   | 78.50      | 82.11       | 90.66       | 52.80     |
| HiSpatial-3B   | 80.71      | 86.18       | 97.58       | 63.81     |

In-house benchmark (L1–L3):

| Model        | Object Distance (L1) | Object Direction (L2) | Spatial Problem Solving (L3) |
|--------------|----------------------|-----------------------|------------------------------|
| GPT-5        | 47.19%               | 59.27%                | 33.33%                       |
| HiSpatial-3B | 92.18%               | 67.21%                | 47.44%                       |

Ablation Study

Inter-level dependency analysis:

| Training levels    | L2 Tasks Avg   | L3 Tasks Avg   | Note          |
|--------------------|----------------|----------------|---------------|
| L0+L1+L2+L3 (full) | 81.21          | 56.29          | Full model    |
| w/o L2             | 79.69 (−1.52)  | 48.15 (−8.14)  | L3 drops ~8%  |
| w/o L0+L1          | 56.21 (−25.00) | 41.78 (−14.51) | L2 drops 25%  |

Effect of auxiliary 3D input:

| Input                 | Qualitative   | Quantitative  |
|-----------------------|---------------|---------------|
| RGB only              | 83.70         | 74.16         |
| RGB + relative depth  | 84.29 (+0.59) | 75.26 (+1.10) |
| RGB + XYZ point cloud | 84.79 (+0.50) | 82.02 (+6.76) |
| RGB + GT XYZ          | -             | 82.79 (+0.77) |

(Deltas are relative to the row above.)

Key Findings

  • Hierarchical dependencies are highly pronounced: Even when L2 training data far exceeds L0+L1 in volume, removing the latter causes a dramatic drop in L2 performance (EmbSpatial: 80.71% → 37.53%), confirming that low-level geometric perception provides an irreplaceable implicit knowledge foundation for higher-level reasoning.
  • Impact on L3 follows a hierarchical gradient: Removing L0+L1 harms L3 more severely than removing L2 (−14.51% vs. −8.14%), indicating that abstract reasoning rests on the entire lower-level stack, not merely its immediate predecessor.
  • Metric-scale point clouds substantially outperform relative depth: The gap reaches 6.76% on quantitative tasks, as metric information directly enables precise distance and size estimation.
  • Spatial SFT does not degrade general capabilities: Training on 88% spatial + 12% general data improves MMBench from 49.86% to 69.67%, indicating that spatial understanding and general VQA capabilities are mutually reinforcing.
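The metric-vs-relative finding comes down to back-projection. A minimal sketch with assumed intrinsics (not from the paper): with metric depth, unprojecting two pixels through the camera intrinsics K yields 3D points whose Euclidean distance is in real meters, which is exactly what quantitative distance and size questions require; with relative depth, the same computation is only correct up to an unknown scale factor:

```python
import numpy as np

# Assumed pinhole intrinsics (focal 500 px, principal point at image center).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def unproject(u, v, depth_m, K):
    """Pixel (u, v) at metric depth (meters) -> 3D point in camera frame."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray * depth_m

p1 = unproject(100, 120, 2.0, K)
p2 = unproject(400, 300, 3.5, K)
dist_m = np.linalg.norm(p1 - p2)   # a real-world distance in meters

# With *relative* depth, depths are known only up to a global scale s,
# so the recovered distance would be s * dist_m for an unknown s.
```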

Highlights & Insights

  • The systematic four-level cognitive hierarchy is the paper's central contribution—it not only proposes a task set but also reveals hierarchical dependencies among tasks, providing clear guidance for future training strategies (training lower levels first is more effective).
  • The large-scale automated data pipeline has strong reuse value: from MoGe-2 point cloud estimation to multi-model object detection, textual reference generation and verification, and multi-format QA synthesis, the entire workflow can be directly applied to new image datasets.
  • A 3B model surpassing GPT-5 and Gemini-2.5-Pro demonstrates that strong spatial understanding can be achieved in small models through high-quality domain data and principled architectural design, without requiring massive model scale.
  • The finding that metric-scale point cloud maps outperform relative depth maps points toward promising directions for downstream tasks with depth sensors, such as embodied AI.

Limitations & Future Work

  • The pipeline relies on MoGe-2 for point cloud estimation; quality may degrade in scenes with sparse texture or heavy occlusion.
  • The textual reference verification step has a limited pass rate, falling back to category labels and bounding boxes upon verification failure.
  • L3 spatial problem-solving questions are generated by GPT, potentially introducing bias and insufficient diversity.
  • Validation is conducted only on PaliGemma2-3B; whether the findings generalize to larger models remains unclear.
  • Evaluation is limited to static single-image scenes; 3D spatial understanding in video or multi-view settings is not addressed.

Comparison with Prior Work

  • vs. SpatialRGPT: SpatialRGPT addresses only L1–L2 quantitative tasks and uses relative depth. HiSpatial covers all four levels and uses metric-scale point clouds, improving quantitative accuracy from 56.22% to 79.28%.
  • vs. MM-Spatial: Both use RGB-D input and a 3B model, but HiSpatial's hierarchical data is more comprehensive (2B QA pairs), improving quantitative average from 68.70% to 79.28%.
  • vs. RoboRefer: RoboRefer focuses on spatial referencing in embodied settings and lacks systematic hierarchical design. HiSpatial also achieves superior performance on RoboSpatial (86.18% vs. 84.55%).

Rating

  • Novelty: ⭐⭐⭐⭐ The four-level hierarchy is conceptually intuitive but rigorously executed; the inter-level dependency analysis constitutes a genuine novel contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven external benchmarks, an in-house benchmark, detailed ablations, and general capability evaluations—exceptionally comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, information-rich figures and tables, and detailed pipeline descriptions.
  • Value: ⭐⭐⭐⭐⭐ The data pipeline and hierarchical framework offer high reference value to the community; the 3B model surpassing GPT-5 serves as a compelling demonstration.