MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Conference: ICCV 2025 | arXiv: 2503.13111 | Code: GitHub | Area: Multimodal VLM | Keywords: 3D spatial understanding, multimodal LLM, depth estimation, multi-view, spatial reasoning

TL;DR

Apple proposes the CA-VQA dataset and MM-Spatial model, leveraging high-quality 3D scene data and open-set annotations to generate training/evaluation data covering spatial relation prediction, metric estimation, and 3D grounding. The resulting general-purpose MLLM achieves SOTA on 3D spatial understanding benchmarks while remaining competitive on other tasks.

Background & Motivation

MLLMs excel at 2D visual understanding, yet their 3D spatial reasoning capabilities remain limited:

Critical gaps in 3D perception: Existing MLLMs struggle with (1) relative depth judgments ("in front of" vs. "behind"); (2) metric distance/size estimation ("A is 2.74m away"); and (3) predicting precise 3D bounding boxes.

Limitations of Prior Work: Existing 3D spatial datasets suffer from various constraints — they cover only a subset of tasks, lack high-quality 3D ground truth, omit depth maps, do not support multi-view inputs, and fail to provide both training and evaluation splits simultaneously.

Underexplored depth and multi-view inputs: Few works systematically evaluate the impact of different depth map types (sensor vs. monocular estimation vs. GT) and multi-view inputs on 3D spatial understanding.

The authors aim to systematically advance 3D spatial understanding research in MLLMs through a comprehensive dataset.

Method

Overall Architecture

  1. Data generation pipeline: Built on the Cubify Anything 1M (CA-1M) dataset (7-DOF 3D bounding boxes with open-set semantic annotations for every object in ARKitScenes), from which templated QA pairs are generated automatically.
  2. CA-VQA dataset: ~10M QA pairs covering 220K frames and 2K videos for training; ~62K QA pairs over 2.6K frames for evaluation.
  3. MM-Spatial model: Based on the MM1.5-3B architecture (DFN-CLIP visual encoder + decoder-only LLM), acquiring spatial understanding capabilities via SFT.

Key Designs

  1. CA-VQA Dataset and Benchmark

Covers six spatial task categories:

  • Counting: "How many chairs are in the scene?"
  • Viewpoint-dependent relations: "Is X behind Y?" (depends on camera pose)
  • Metric regression: "What is the distance from X to Y/the camera?" "How wide/tall is X?"
  • 2D/3D Referring & Grounding
  • Binary & multiple-choice questions
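To make the templated generation concrete, here is a minimal sketch of how QA pairs might be derived from CA-1M-style 7-DOF box annotations. The template strings and field names (`category`, `center`) are hypothetical illustrations, not the paper's actual pipeline:

```python
import numpy as np

def qa_from_boxes(boxes: list[dict]) -> list[dict]:
    qa = []
    # Counting questions: one per distinct object category in the frame.
    for cat in sorted({b["category"] for b in boxes}):
        n = sum(b["category"] == cat for b in boxes)
        qa.append({"q": f"How many {cat}s are in the scene?", "a": str(n)})
    # Egocentric distance: the camera sits at the origin of the camera
    # frame, so metric distance is just the norm of the box center.
    for b in boxes:
        d = float(np.linalg.norm(np.asarray(b["center"])))
        qa.append({"q": f"What is the distance from the camera to the {b['category']}?",
                   "a": f"{d:.2f}m"})
    return qa
```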

Distinguishing features:

  • 3D ground truth from high-precision FARO laser scanners (not pseudo-labels)
  • Three depth map types per frame: GT depth, ARKit depth (iPad LiDAR), and monocular depth (DepthPro)
  • Multi-view support: up to 4 support frames per reference frame, with relative poses and camera intrinsics
  • Blind filtering strategy: 7 MLLMs serve as judges to remove samples answerable without visual input, reducing language-prior bias
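A minimal sketch of the blind-filtering idea, assuming each judge exposes a text-only `answer(question)` callable; the answer-matching rule and the keep/drop threshold are illustrative assumptions rather than the paper's exact protocol:

```python
from typing import Callable

def blind_filter(samples: list[dict],
                 judges: list[Callable[[str], str]],
                 is_correct: Callable[[str, str], bool],
                 max_blind_correct: int = 3) -> list[dict]:
    kept = []
    for s in samples:
        # Each judge sees only the question text, never the image.
        n_correct = sum(is_correct(j(s["question"]), s["answer"]) for j in judges)
        # Drop samples that most blind judges already answer correctly:
        # these are solvable from language priors alone.
        if n_correct <= max_blind_correct:
            kept.append(s)
    return kept
```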

  2. Depth Utilization: CoT / Tool-Use

Two strategies for leveraging metric depth (without directly encoding depth maps):

  • Tool-Use: The model predicts 2D bounding boxes and issues a function call; the tool returns the median depth within each box as text, upon which the model reasons (a sketch of such a tool appears below).
  • CoT (Chain-of-Thought): During training, step-by-step reasoning examples with GT depth are provided; at test time, the model predicts depth values on its own.

Design Motivation: Encoding full-image depth yields only normalized relative depth, whereas CoT/Tool-Use can exploit absolute metric depth. Moreover, CoT requires no external tools — the model learns to predict depth accurately through SFT.
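A minimal sketch of the Tool-Use path, assuming a metric depth map in meters and a pixel-space `(x1, y1, x2, y2)` box; the function name and return format are assumptions:

```python
import numpy as np

def depth_tool(depth_map: np.ndarray, box: tuple[int, int, int, int]) -> str:
    """Return the median metric depth inside a model-predicted 2D box as text."""
    x1, y1, x2, y2 = box
    patch = depth_map[y1:y2, x1:x2]
    valid = patch[np.isfinite(patch) & (patch > 0)]  # drop sensor holes / invalid pixels
    if valid.size == 0:
        return "no valid depth in region"
    # The median is robust to background pixels that leak into the box.
    return f"{float(np.median(valid)):.2f}m"
```

The model then continues reasoning over this textual depth value, which preserves the absolute scale that an encoded depth map would lose.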

  3. Multi-View Input

Leveraging MM1.5's multi-image input capability, support frames and the reference frame are concatenated as a sequence \(I_{t-N}, ..., I_{t-1}, I_t\), with per-frame camera intrinsics and relative poses provided in JSON format. Image splitting is applied only to the reference frame.

Design Motivation: Multi-view inputs provide additional geometric constraints that help resolve depth ambiguity inherent to single-view observations.
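A minimal sketch of how the camera metadata might be serialized, assuming per-frame intrinsics `K` (3x3) and a 4x4 pose relative to the reference frame; the JSON key names are assumptions:

```python
import json

def build_camera_metadata(frames: list[dict]) -> str:
    """Serialize per-frame intrinsics and reference-relative poses as JSON text."""
    meta = [{"frame": f["name"],
             "intrinsics": f["K"],          # 3x3 matrix as nested lists
             "pose_to_reference": f["T"]}   # 4x4 relative transform
            for f in frames]
    return json.dumps(meta)

# Support frames I_{t-N}..I_{t-1} and the reference frame I_t go to the
# vision encoder in temporal order; the JSON string is appended to the
# text prompt, and only the reference frame is split into sub-images.
```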

Loss & Training

  • Follows MM1.5's three-stage training: pretraining → continual pretraining → SFT.
  • In the SFT stage, a Spatial category (CA-VQA data) is added to MM1.5's base data mixture (General VQA, Knowledge, Text-rich, 2D Ref./Grounding).
  • Mixing ratios are tuned to ensure spatial task gains do not degrade other capabilities (a schematic mixture config is sketched after this list).
  • Image resolution: 672×672, with 4 sub-images + 1 global image.
  • Both the visual encoder and LLM are unfrozen.
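For illustration only, a schematic of what the SFT mixture might look like in MM1.5's category scheme; every weight below is a placeholder, since the paper's tuned ratios are not reproduced here:

```python
# Placeholder SFT data mixture: category names follow MM1.5's scheme,
# but the weights are made-up illustrations, not the tuned ratios.
sft_mixture = {
    "general_vqa":      0.30,  # placeholder
    "knowledge":        0.15,  # placeholder
    "text_rich":        0.25,  # placeholder
    "ref_grounding_2d": 0.15,  # placeholder
    "spatial_ca_vqa":   0.15,  # new Spatial category added for MM-Spatial
}
assert abs(sum(sft_mixture.values()) - 1.0) < 1e-9
```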

Key Experimental Results

Main Results

CA-VQA Benchmark results (average scores per task):

| Model | Binary | Count. | 2D AP@50 | 3D AP@15 | Multi-c. | Ego-Dist. | Obj-Size | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 44.2 | 69.0 | 0.0 | 0.0 | 36.6 | 11.7 | 11.0 | 22.8 |
| SpatialRGPT-8B | 53.6 | 68.8 | 5.5 | 0.0 | 37.2 | 10.5 | 7.0 | 23.9 |
| MM1.5-3B | 59.1 | 9.1 | 32.6 | 0.0 | 38.6 | 0.6 | 3.4 | 18.2 |
| MM-Spatial-3B | 68.8 | 75.8 | 53.2 | 20.7 | 74.2 | 40.0 | 24.4 | 47.0 |
| +CoT | 69.6 | 75.9 | 54.5 | 21.9 | 74.7 | 46.0 | 26.7 | 49.1 |
| +Multi-view+CoT | 69.2 | 76.1 | 55.0 | 23.6 | 75.3 | 46.1 | 28.2 | 49.7 |
| +Multi-view+Tool(GT) | 69.2 | 76.1 | 55.0 | 23.6 | 75.3 | 65.8 | 27.3 | 52.4 |

Despite having only 3B parameters, MM-Spatial-3B substantially outperforms GPT-4o and SpatialRGPT-8B on every task.

Cross-benchmark category results:

| Model | Spatial | General | Knowledge | Text-rich | Ref./Ground | Avg. |
|---|---|---|---|---|---|---|
| MM1.5-3B | 39.9 | 64.7 | 46.2 | 62.1 | 77.7 | 58.1 |
| MM-Spatial-3B | 70.1 | 65.0 | 46.2 | 62.1 | 79.1 | 64.5 |

Spatial capability improves substantially (+30.2), while other categories remain unchanged or improve marginally.

Ablation Study

Specialist model configuration comparison (CA-VQA):

| Configuration | Ego-Dist. @10% | Obj-Dist. @10% | Obj-Size @10% | Avg. |
|---|---|---|---|---|
| MM-Spatial | 47.3 | 24.4 | 24.3 | 49.4 |
| +CoT (self-predicted depth) | 49.5 | 27.9 | 26.7 | 50.8 |
| +Depth (Tool; Mon.) | 42.1 | 26.1 | 26.1 | 49.5 |
| +Depth (Tool; GT) | 74.0 | 32.4 | 27.4 | 54.5 |
| +Depth (Encoded; GT) | 48.3 | 25.4 | 24.5 | 49.9 |
| +Multi-view | 52.4 | 26.2 | 26.1 | 51.4 |
| +Multi-view+CoT | 55.2 | 29.7 | 28.6 | 52.7 |

The CoT model's self-predicted depth (50.8 avg.) surpasses tool-use with DepthPro monocular depth (49.5) and narrows the gap to GT-depth tool-use (54.5), demonstrating that the model learns metric depth estimation through SFT alone.
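For reference, the @10% columns follow the common accuracy-under-relative-error convention; a minimal sketch (array formulation assumed, details may differ from the paper):

```python
import numpy as np

def acc_at_relative_error(pred: np.ndarray, gt: np.ndarray, tau: float = 0.10) -> float:
    """Fraction of predictions within a tau relative error of the ground truth."""
    rel_err = np.abs(pred - gt) / np.maximum(np.abs(gt), 1e-8)
    return float(np.mean(rel_err <= tau))
```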

Key Findings

  1. Data-driven depth estimation: Through SFT data alone, MM-Spatial achieves performance approaching that of dedicated monocular depth estimation models — a surprising finding.
  2. Multi-view consistency is beneficial: Multi-view inputs provide consistent gains across all configurations, particularly in 3D grounding (AP@15: 24.2→27.5) and distance estimation.
  3. Encoded depth underperforms CoT: Encoding depth maps through the visual encoder (yielding only relative depth) is inferior to CoT's textualized absolute depth.
  4. Monocular depth < GT depth: Tool-use with DepthPro monocular depth underperforms GT depth, indicating that depth accuracy is a ceiling factor.
  5. Blind filtering is effective: Removing samples answerable by blind models yields a more challenging and reliable benchmark.

Highlights & Insights

  • Unmatched comprehensiveness: CA-VQA is the first spatial understanding dataset to simultaneously provide high-quality 3D ground truth, three depth map types, multi-view inputs, diverse task categories, and train/evaluation splits.
  • General capability preserved: MM-Spatial at only 3B parameters substantially surpasses GPT-4o on spatial tasks while maintaining performance on general, knowledge, and text-rich tasks.
  • Implications of CoT depth estimation: Models can learn depth perception from data without explicit depth sensors — a significant insight for edge device deployment.
  • Blind filtering strategy is broadly applicable: The 7-model joint filtering approach can be generalized to other benchmark construction efforts to reduce language prior bias.

Limitations & Future Work

  • The dataset is confined to indoor scenes (ARKitScenes); generalization to outdoor environments remains unverified.
  • Only the 3B model is explored; the effects of larger-scale models (7B, 70B) are unknown.
  • The absolute 3D grounding AP@15 remains low (maximum 27.5), leaving substantial room for improvement.
  • Object-to-object distance estimation accuracy under the 10% relative error threshold is only ~30%, which is insufficient for practical applications.
  • Whether the multi-view frame selection strategy (angle ≥15° or translation ≥30cm) is optimal has not been thoroughly ablated.

This work represents a comprehensive upgrade over prior efforts including SpatialRGPT, Cube-LLM, and SpatialBot. The key distinctions lie in data quality (precise 3D annotations vs. pseudo-labels) and input signal diversity (multi-view + three depth types). The success of CoT depth estimation suggests the broader possibility of "learning visual perception through language."

Rating

  • Novelty: ⭐⭐⭐⭐ The dataset construction pipeline and CoT depth estimation are highlights, though the model architecture itself introduces no new design.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive: multiple model variants, multiple benchmarks, ablation analysis, and blind filtering validation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Well-organized, with rich tables and figures; the dataset comparison table is particularly clear.
  • Value: ⭐⭐⭐⭐⭐ The dataset and benchmark will drive subsequent research on 3D spatial understanding in MLLMs; code is open-sourced.