# HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
Conference: CVPR 2026 | arXiv: 2603.25411 | Code: N/A | Area: Multimodal VLM

Keywords: 3D spatial understanding, vision-language models, hierarchical task design, point cloud maps, spatial reasoning
## TL;DR
HiSpatial decomposes 3D spatial intelligence into four cognitive levels (geometric perception → object attributes → inter-object relations → abstract reasoning), builds an automated data pipeline that turns ~5M images and 45M objects into 2B QA pairs, and designs an RGB-D VLM that takes metric-scale point cloud maps as auxiliary input. With only 3B parameters, it surpasses GPT-5 and Gemini-2.5-Pro on multiple spatial reasoning benchmarks.
## Background & Motivation
- Background: VLMs achieve strong performance on 2D tasks such as VQA and image captioning, but extending these capabilities to 3D spatial understanding remains highly challenging. Recent works introduce spatially oriented VQA tasks for SFT or RFT, yet face two primary challenges.
- Limitations of Prior Work: (a) The absence of a unified, systematic task hierarchy: existing tasks provide incomplete coverage, and the dependencies among spatial reasoning skills at different levels remain unclear. (b) Large-scale, diverse, 3D-annotated data is difficult to obtain: existing 3D-annotated datasets are confined to indoor scenes, while large-scale web data lacks 3D supervision.
- Key Challenge: Prior work addresses isolated aspects of spatial understanding (qualitative relation comparison, quantitative distance prediction, etc.), but no work has systematically investigated the hierarchical dependencies among these tasks: does training on lower-level tasks facilitate the emergence of higher-level capabilities?
- Goal: (a) Define a comprehensive 3D spatial understanding task taxonomy with explicit hierarchical dependencies; (b) construct a large-scale spatial VQA dataset; (c) empirically validate inter-level dependencies and provide guidance for training strategies.
- Key Insight: 3D spatial intelligence is analogized to four progressive stages of human cognition: perceiving depth and geometry → understanding intrinsic 3D properties of objects → reasoning about inter-object spatial relations → performing abstract spatial reasoning (viewpoint transformation, spatial counting, spatial problem solving).
- Core Idea: A four-level cognitive hierarchy combined with a large-scale automated data pipeline and a metric-scale point-cloud-augmented RGB-D VLM, enabling systematic construction and validation of 3D spatial intelligence in VLMs.
## Method

### Overall Architecture
The method comprises three components: (1) defining a four-level spatial understanding task hierarchy (L0–L3); (2) constructing an automated pipeline to generate hierarchical spatial VQA pairs from large-scale images; (3) designing a VLM architecture with point cloud map auxiliary input and performing SFT on the generated data.
### Key Designs
- Four-Level Cognitive Spatial Task Taxonomy
  - Function: Systematically covers the full spectrum of spatial understanding capabilities, from low-level perception to high-level reasoning.
  - Mechanism:
    - Level 0 (Basic Geometric Perception): Pixel-level 3D point queries (outputting 3D coordinates given a 2D location) and pairwise depth ordering (judging the relative depth of two points). No semantic information required.
    - Level 1 (Object-Level Spatial Understanding): Object localization (predicting 3D position), orientation estimation (describing yaw direction in natural language), and size estimation (physical dimensions such as width and height). Requires integrating geometric perception with semantic grounding.
    - Level 2 (Inter-Object Relation Understanding): Relative direction estimation (qualitative, e.g., left/right/front/back, or precise 3D direction vectors), relative distance estimation (Euclidean distance and its components), and relational comparison (ranking multiple objects by attributes, judging directional consistency).
    - Level 3 (Abstract Spatial Reasoning): Viewpoint transformation (inferring directions/distances of other objects from a given object's perspective), spatial object counting (counting objects satisfying spatial constraints), and spatial problem solving (multi-step reasoning that maps high-level goals to spatial attributes).
  - Design Motivation: Ablation experiments clearly confirm inter-level dependencies: removing L0+L1 training data causes a 25.00-point average drop in L2 performance (EmbSpatial falls from 80.71% to 37.53%) and a 14.51-point average drop in L3 performance, demonstrating that lower-level tasks provide an irreplaceable implicit spatial knowledge foundation for higher-level reasoning. Hypothetical QA examples for each level are sketched below.
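To make the four levels concrete, here is a small set of hypothetical QA pairs, one per level. The schema, field names, and phrasings are illustrative guesses rather than the paper's released format.

```python
# Hypothetical QA pairs, one per level; not the paper's actual schema.
qa_examples = [
    {   # L0: basic geometric perception -- no semantics required
        "level": 0, "task": "3d_point_query",
        "question": "What is the 3D coordinate of the pixel at (412, 305)?",
        "answer": "(0.82, -0.14, 2.37) m",
    },
    {   # L1: object-level spatial understanding
        "level": 1, "task": "size_estimation",
        "question": "How wide is the wooden dining table?",
        "answer": "about 1.6 m",
    },
    {   # L2: inter-object relation understanding
        "level": 2, "task": "relative_distance",
        "question": "What is the Euclidean distance between the chair and the floor lamp?",
        "answer": "about 2.1 m",
    },
    {   # L3: abstract spatial reasoning (viewpoint transformation)
        "level": 3, "task": "viewpoint_transform",
        "question": "Standing at the sofa and facing the window, is the TV on your left or right?",
        "answer": "left",
    },
]
```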
- Automated Spatial VQA Data Pipeline
  - Function: End-to-end generation of hierarchical spatial VQA pairs from large-scale image data.
  - Mechanism: A three-stage pipeline:
    - (a) Spatial information estimation: MoGe-2 generates pixel-level 3D point cloud maps; RAM → GroundingDINO → SAM detects objects and, combined with the point cloud, yields 3D bounding boxes and dimensions; OrientAnythingv2 estimates orientations; Perspective Fields establishes a gravity-aligned world coordinate system.
    - (b) Textual reference generation: Describe Anything / Qwen2.5-VL / Qwen3-VL generate object descriptions, which are verified via VLM grounding (references whose re-grounded box falls below an IoU threshold are discarded).
    - (c) QA synthesis: Three formats are generated per task level (open-ended, multiple-choice, true/false); L3 spatial problem-solving questions are generated by GPT to require multi-step reasoning.
  - Design Motivation: The verification step in textual reference generation eliminates ambiguity (one description matching multiple objects), ensuring QA accuracy, and the three formats provide complementary learning signals. The pipeline ultimately yields a large-scale dataset of 5M images, 45M objects, and 2B QA pairs. A sketch of how an L2 answer could be derived in step (c) follows below.
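As an illustration of step (c), the snippet below derives an L2 relative-direction/distance answer from two object centroids in a gravity-aligned camera frame. The axis convention (x right, y up, z forward) and the answer templates are assumptions; the paper's actual synthesis templates are not published.

```python
import numpy as np

def relative_spatial_answer(anchor_xyz, target_xyz):
    """Derive an L2 answer from two 3D centroids (metric, camera frame).

    Assumes x points right, y up, z forward (away from the camera);
    the paper's actual coordinate convention may differ.
    """
    v = np.asarray(target_xyz, dtype=float) - np.asarray(anchor_xyz, dtype=float)
    dist = float(np.linalg.norm(v))                  # Euclidean distance
    lr = "to the right of" if v[0] > 0 else "to the left of"
    fb = "behind" if v[2] > 0 else "in front of"     # larger z = farther away
    return {
        "direction_vector": np.round(v, 2).tolist(),  # precise 3D direction
        "euclidean_distance_m": round(dist, 2),
        "qualitative": f"the target is {lr} and {fb} the anchor",
    }

print(relative_spatial_answer([0.0, 0.0, 2.0], [0.8, 0.1, 3.5]))
# -> distance ~1.70 m; target to the right of and behind the anchor
```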
- Point Cloud Map-Augmented RGB-D VLM
  - Function: Enhances spatial reasoning by introducing metric-scale 3D point cloud maps as auxiliary input.
  - Mechanism: Built upon PaliGemma2-3B, the model takes a point cloud map \(\mathbf{X} \in \mathbb{R}^{H \times W \times 4}\) as input (the first 3 channels encode 3D coordinates; the 4th channel is a validity mask). Sinusoidal positional encodings and a learnable patchify convolutional layer produce feature maps, which are concatenated with SigLIP visual features along the feature dimension and fused via a linear projector before being fed into the language model. During training, the visual encoder is frozen; the patchify layer, fusion projector, and LLM are jointly fine-tuned. A minimal PyTorch sketch of this fusion follows below.
  - Design Motivation: Unlike prior methods that use relative depth maps, this work employs metric-scale point cloud maps to provide richer 3D information. Ablations show that metric point clouds outperform relative depth by 6.76 points on quantitative tasks (75.26 → 82.02), as metric-scale information directly supports precise distance and size estimation. Using GT point clouds yields further gains, indicating greater potential in settings with depth sensors (e.g., embodied AI).
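A minimal PyTorch sketch of the point-cloud-map branch, assuming SigLIP-So400m patch features (dim 1152, patch size 14) and a Gemma-2 hidden size of 2304. The sinusoidal coordinate encoding and all layer sizes are assumptions, since the paper releases no code.

```python
import torch
import torch.nn as nn

def sinusoidal_encode(xyz, n_freq=8):
    """NeRF-style sinusoidal encoding of metric coordinates.
    xyz: (B, 3, H, W) -> (B, 3 * 2 * n_freq, H, W). One common choice;
    the paper's exact positional encoding is not specified."""
    feats = []
    for k in range(n_freq):
        feats += [torch.sin(2.0 ** k * xyz), torch.cos(2.0 ** k * xyz)]
    return torch.cat(feats, dim=1)

class PointMapFusion(nn.Module):
    """Point-cloud-map branch plus feature-dim fusion with SigLIP tokens.
    Layer sizes (d_pts, d_vis, d_llm) are guesses, not paper values."""

    def __init__(self, patch=14, n_freq=8, d_pts=256, d_vis=1152, d_llm=2304):
        super().__init__()
        in_ch = 3 * 2 * n_freq + 1                       # encoded XYZ + validity mask
        self.patchify = nn.Conv2d(in_ch, d_pts, kernel_size=patch, stride=patch)
        self.proj = nn.Linear(d_vis + d_pts, d_llm)      # linear fusion projector

    def forward(self, pointmap, vis_feats):
        # pointmap: (B, 4, H, W) -- 3 coordinate channels + validity mask
        # vis_feats: (B, N, d_vis) -- frozen SigLIP patch features, N = (H/p)*(W/p)
        xyz, mask = pointmap[:, :3], pointmap[:, 3:]
        x = torch.cat([sinusoidal_encode(xyz), mask], dim=1)
        p = self.patchify(x)                             # (B, d_pts, H/p, W/p)
        p = p.flatten(2).transpose(1, 2)                 # (B, N, d_pts)
        fused = torch.cat([vis_feats, p], dim=-1)        # concat along feature dim
        return self.proj(fused)                          # LLM input tokens

# Shape check on a 448x448 input (32x32 = 1024 patches):
fusion = PointMapFusion()
tokens = fusion(torch.randn(1, 4, 448, 448), torch.randn(1, 1024, 1152))
assert tokens.shape == (1, 1024, 2304)
```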
### Loss & Training

Standard VLM SFT with cross-entropy loss. AdamW optimizer, learning rate \(2 \times 10^{-5}\), batch size 256, trained for 70K steps. Spatial VQA data is mixed with LLaVA-Next general VQA data at a 1:7 general-to-spatial sampling ratio (roughly 12% general, 88% spatial; cf. Key Findings) to preserve general capabilities. A hedged sketch of this setup follows.
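In the sketch below, the learning rate, batch size, step count, and mixing ratio come from the text above, while the attribute names (`vision_tower`), dataset objects, and sampler-based mixing are assumptions.

```python
import torch
from torch.utils.data import ConcatDataset, WeightedRandomSampler

def make_sft_setup(model, spatial_ds, general_ds, batch_size=256, steps=70_000):
    # Freeze the visual encoder; tune the patchify layer, projector, and LLM.
    # `vision_tower` is a guessed attribute name for the SigLIP encoder.
    for p in model.vision_tower.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=2e-5)

    # ~88% spatial / 12% general: a 7:1 spatial-to-general sampling ratio.
    weights = ([7.0 / len(spatial_ds)] * len(spatial_ds)
               + [1.0 / len(general_ds)] * len(general_ds))
    sampler = WeightedRandomSampler(weights, num_samples=batch_size * steps,
                                    replacement=True)
    return optimizer, ConcatDataset([spatial_ds, general_ds]), sampler
```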
## Key Experimental Results

### Main Results
Quantitative spatial VQA benchmarks (L1–L2 tasks):
| Model | Input | SpatialRGPT Avg | QSpatial Avg |
|---|---|---|---|
| GPT-5 | RGB | 40.47 | 68.45 |
| Gemini-2.5-Pro | RGB | 26.57 | 49.92 |
| MM-Spatial-3B | RGB-D | 68.70 | - |
| HiSpatial-3B | RGB-XYZ | 79.28 | 85.16 |
Qualitative spatial VQA benchmarks (L1–L3 tasks):
| Model | EmbSpatial | RoboSpatial | CV-Bench-3D | 3DSRBench |
|---|---|---|---|---|
| GPT-4o | 63.38 | 77.20 | 84.90 | 44.20 |
| Gemini-2.5-Pro | 76.67 | 77.24 | 90.80 | 48.47 |
| Qwen-3-VL-8B | 78.50 | 82.11 | 90.66 | 52.80 |
| HiSpatial-3B | 80.71 | 86.18 | 97.58 | 63.81 |
In-house benchmark (L1–L3):
| Model | Object Distance (L1) | Object Direction (L2) | Spatial Problem Solving (L3) |
|---|---|---|---|
| GPT-5 | 47.19% | 59.27% | 33.33% |
| HiSpatial-3B | 92.18% | 67.21% | 47.44% |
### Ablation Study
Inter-level dependency analysis:
| L0 | L1 | L2 | L3 | L2 Tasks Avg | L3 Tasks Avg | Note |
|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | 81.21 | 56.29 | Full model |
| ✓ | ✓ | | ✓ | 79.69 (−1.52) | 48.15 (−8.14) | w/o L2: L3 drops ~8 points |
| | | ✓ | ✓ | 56.21 (−25.00) | 41.78 (−14.51) | w/o L0+L1: L2 drops 25 points |
Effect of auxiliary 3D input (deltas relative to the previous row):
| Input | Qualitative | Quantitative |
|---|---|---|
| RGB only | 83.70 | 74.16 |
| RGB + relative depth | 84.29 (+0.59) | 75.26 (+1.10) |
| RGB + XYZ point cloud | 84.79 (+0.50) | 82.02 (+6.76) |
| RGB + GT XYZ | — | 82.79 (+0.77) |
### Key Findings
- Hierarchical dependencies are highly pronounced: Even when L2 training data far exceeds L0+L1 in volume, removing the latter causes a dramatic drop in L2 performance (EmbSpatial: 80.71% → 37.53%), confirming that low-level geometric perception provides an irreplaceable implicit knowledge foundation for higher-level reasoning.
- Impact on L3 follows a hierarchical gradient: removing L0+L1 harms L3 more severely than removing L2 alone (−14.51 vs. −8.14 points), indicating that abstract reasoning depends even more on the low-level geometric foundation than on the intermediate relational skills it builds on directly.
- Metric-scale point clouds substantially outperform relative depth: The gap reaches 6.76 points on quantitative tasks, as metric information directly enables precise distance and size estimation.
- Spatial SFT does not degrade general capabilities: Training on 88% spatial + 12% general data improves MMBench from 49.86% to 69.67%, indicating that spatial understanding and general VQA capabilities are mutually reinforcing.
## Highlights & Insights
- The systematic four-level cognitive hierarchy is the paper's central contribution—it not only proposes a task set but also reveals hierarchical dependencies among tasks, providing clear guidance for future training strategies (training lower levels first is more effective).
- The large-scale automated data pipeline has strong reuse value: from MoGe-2 point cloud estimation to multi-model object detection, textual reference generation and verification, and multi-format QA synthesis, the entire workflow can be directly applied to new image datasets.
- A 3B model surpassing GPT-5 and Gemini-2.5-Pro demonstrates that strong spatial understanding can be achieved in small models through high-quality domain data and principled architectural design, without requiring massive model scale.
- The finding that metric-scale point cloud maps outperform relative depth maps points toward promising directions for downstream tasks with depth sensors, such as embodied AI.
## Limitations & Future Work
- The pipeline relies on MoGe-2 for point cloud estimation; quality may degrade in scenes with sparse texture or heavy occlusion.
- The textual reference verification step has a limited pass rate, falling back to category labels and bounding boxes upon verification failure.
- L3 spatial problem-solving questions are generated by GPT, potentially introducing bias and insufficient diversity.
- Validation is conducted only on PaliGemma2-3B; whether the findings generalize to larger models remains unclear.
- Evaluation is limited to static single-image scenes; 3D spatial understanding in video or multi-view settings is not addressed.
## Related Work & Insights
- vs. SpatialRGPT: SpatialRGPT addresses only L1–L2 quantitative tasks and uses relative depth. HiSpatial covers all four levels and uses metric-scale point clouds, improving quantitative accuracy from 56.22% to 79.28%.
- vs. MM-Spatial: Both use RGB-D input and a 3B model, but HiSpatial's hierarchical data is more comprehensive (2B QA pairs), improving quantitative average from 68.70% to 79.28%.
- vs. RoboRefer: RoboRefer focuses on spatial referencing in embodied settings and lacks systematic hierarchical design. HiSpatial also achieves superior performance on RoboSpatial (86.18% vs. 84.55%).
## Rating
- Novelty: ⭐⭐⭐⭐ The four-level hierarchy is conceptually intuitive but rigorously executed; the inter-level dependency analysis constitutes a genuine novel contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven external benchmarks, an in-house benchmark, detailed ablations, and general capability evaluations—exceptionally comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, information-rich figures and tables, and detailed pipeline descriptions.
- Value: ⭐⭐⭐⭐⭐ The data pipeline and hierarchical framework offer high reference value to the community; the 3B model surpassing GPT-5 serves as a compelling demonstration.