Scaling Spatial Intelligence with Multimodal Foundation Models

Conference: CVPR 2026
arXiv: 2511.13719
Code: https://github.com/OpenSenseNova/SenseNova-SI
Area: Multimodal VLM / Spatial Intelligence
Keywords: Spatial intelligence, multimodal foundation models, data scaling, spatial reasoning, benchmarking

TL;DR

SenseNova-SI systematically constructs SenseNova-SI-8M, a diverse spatial dataset of 8M samples, and uses it to cultivate spatial intelligence in multimodal foundation models including Qwen3-VL, InternVL3, and Bagel. The resulting models achieve state-of-the-art performance on multiple spatial benchmarks such as VSI-Bench and MMSI while preserving general multimodal understanding.

Background & Motivation

Background: Multimodal foundation models (e.g., GPT-4V, Gemini, Qwen-VL) excel at visual understanding and text generation, yet exhibit notable deficiencies in spatial intelligence — including depth estimation, spatial relationship reasoning, 3D scene understanding, and perspective transformation reasoning.

Limitations of Prior Work: (1) Existing multimodal models perform significantly worse on spatial reasoning tasks than on general visual question answering, indicating insufficient spatial information in internet-scale training data; (2) No systematic spatial capability taxonomy exists to guide data construction; (3) The effects of data scaling on spatial intelligence, risks of overfitting, and language shortcut issues remain underexplored.

Key Hypothesis: The spatial intelligence deficit in multimodal models stems from the insufficient quantity and diversity of spatially relevant training samples, not from architectural limitations.

Goal: To cultivate spatial intelligence in existing multimodal foundation models through large-scale data scaling, and to rigorously analyze the effects of data scale, diversity, overfitting risk, and related factors.

Key Insight: Rather than modifying model architectures, the paper systematically constructs a large-scale dataset (8M samples) covering diverse spatial capabilities and leverages a data-driven approach to enhance spatial intelligence.

Core Idea: Guided by a rigorous spatial capability taxonomy, an 8M-scale diverse spatial dataset is constructed to significantly improve spatial intelligence in existing foundation models via fine-tuning.

Method

Overall Architecture

SenseNova-SI builds upon existing multimodal foundation models and adopts a data-driven strategy to enhance spatial intelligence. The overall pipeline proceeds as follows: (1) establish a spatial capability taxonomy; (2) systematically collect, generate, and augment 8M-scale data samples (SenseNova-SI-8M) guided by this taxonomy; (3) fine-tune Qwen3-VL and InternVL3 (visual understanding models) as well as Bagel (a unified understanding-generation model); (4) evaluate on multiple spatial benchmarks and analyze the impact of various factors.

Key Designs

  1. Spatial Capability Taxonomy:

    • Function: Provides systematic guidance for data construction.
    • Mechanism: Spatial intelligence is decomposed into multiple capability dimensions, including depth perception, distance/size estimation, spatial relationship reasoning (above/below, left/right, front/back, near/far), perspective transformation reasoning, 3D shape understanding, and spatial navigation. Each dimension corresponds to a specific type of training sample.
    • Design Motivation: Spatial intelligence is a composite of multiple sub-capabilities rather than a monolithic skill. Without an explicit taxonomy, data collection tends to over-represent certain sub-capabilities while neglecting others.
  2. SenseNova-SI-8M Dataset Construction:

    • Function: Provides large-scale, diverse spatial training data.
    • Mechanism: Data is collected through multiple channels: (a) extracting and converting existing 3D/spatial datasets into multimodal QA format; (b) synthesizing spatial scenes via 3D engines and generating question–answer pairs; (c) generating spatially-grounded QA pairs from real images using large models; (d) augmenting and expanding existing data. The result is 8M samples covering all dimensions of the taxonomy (a minimal conversion sketch follows after this list).
    • Design Motivation: Data scale and diversity are key factors for spatial intelligence. Prior work was insufficient in both data volume and coverage of spatial capability dimensions.
  3. Multi-Foundation Model Adaptation:

    • Function: Validates the generalizability of the data-driven approach across different architectures.
    • Mechanism: Three representative foundation models are selected — pure visual understanding models Qwen3-VL and InternVL3, and the unified understanding-generation model Bagel. Fine-tuning with SenseNova-SI-8M on all three demonstrates that spatial intelligence gains are not model-specific.
    • Design Motivation: A data scaling approach that works only for a single model architecture has limited value. Cross-model validation demonstrates that this is a broadly applicable strategy.
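
To make the taxonomy-guided construction concrete, here is a minimal sketch (not the authors' released pipeline) of converting a 3D annotation into capability-tagged QA, as in channel (a) above. The annotation schema, capability names, and question templates are illustrative assumptions.

```python
# Minimal sketch: taxonomy-guided QA generation from a 3D annotation.
# Schema, templates, and capability names are illustrative assumptions,
# not the paper's actual pipeline.
import random

TAXONOMY = {
    # capability -> question template over two labeled objects
    "depth_perception": "Which object is closer to the camera, the {a} or the {b}?",
    "spatial_relation": "Is the {a} to the left of the {b} in the image?",
}

def make_qa(annotation: dict) -> dict:
    """Convert one annotation (two objects with camera-space xyz, where
    +x is right and +z is depth) into a QA sample for a random capability."""
    a, b = annotation["objects"]
    capability, template = random.choice(list(TAXONOMY.items()))
    question = template.format(a=a["label"], b=b["label"])
    if capability == "depth_perception":
        answer = a["label"] if a["xyz"][2] < b["xyz"][2] else b["label"]
    else:  # spatial_relation: compare horizontal (x) coordinates
        answer = "yes" if a["xyz"][0] < b["xyz"][0] else "no"
    return {"image": annotation["image"], "capability": capability,
            "question": question, "answer": answer}

example = {"image": "scene_0001.jpg",
           "objects": [{"label": "chair", "xyz": (0.4, 0.0, 2.1)},
                       {"label": "table", "xyz": (1.2, 0.0, 3.5)}]}
print(make_qa(example))
```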

Loss & Training

  • Training Paradigm: Standard instruction tuning — existing foundation models are further fine-tuned on SenseNova-SI-8M.
  • Preventing Catastrophic Forgetting: A proportion of general multimodal data is mixed into fine-tuning to prevent degradation of general capabilities while improving spatial intelligence.
  • Data Mixing Strategy: Samples from different spatial capability dimensions are mixed at specified ratios to ensure adequate training coverage across all capabilities (a minimal sampler sketch follows below).
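
A minimal sketch of the ratio-based mixing described above, assuming per-source sample pools; the ratio values are illustrative, not the paper's reported mixture.

```python
# Minimal mixing-sampler sketch; the ratios are illustrative assumptions.
import random

MIX_RATIOS = {
    "depth_perception": 0.20,
    "spatial_relation": 0.25,
    "perspective_transform": 0.15,
    "navigation": 0.15,
    "general_multimodal": 0.25,  # mixed in against catastrophic forgetting
}

def sample_batch(pools: dict, batch_size: int) -> list:
    """Draw a batch whose composition follows MIX_RATIOS in expectation."""
    names = list(MIX_RATIOS)
    weights = [MIX_RATIOS[n] for n in names]
    picks = random.choices(names, weights=weights, k=batch_size)
    return [random.choice(pools[name]) for name in picks]

# Usage: pools maps each data source to its list of training samples.
pools = {name: [f"{name}_sample_{i}" for i in range(100)] for name in MIX_RATIOS}
print(sample_batch(pools, batch_size=8))
```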

Key Experimental Results

Main Results

| Benchmark   | Metric   | SenseNova-SI | Prev. SOTA | Gain (pts) |
|-------------|----------|--------------|------------|------------|
| VSI-Bench   | Accuracy | 68.8%        | ~55%       | +13.8      |
| MMSI        | Accuracy | 43.3%        | ~35%       | +8.3       |
| MindCube    | Accuracy | 85.7%        | ~70%       | +15.7      |
| ViewSpatial | Accuracy | 54.7%        | ~45%       | +9.7       |
| SITE        | Accuracy | 47.7%        | ~40%       | +7.7       |
| BLINK       | Accuracy | 63.9%        | ~55%       | +8.9       |
| 3DSR        | Accuracy | 55.5%        | ~45%       | +10.5      |
| EmbSpatial  | Accuracy | 72.0%        | ~60%       | +12.0      |
| MMBench-En  | Accuracy | 84.9%        | 84.9%      | No change  |

Ablation Study

| Configuration                | VSI-Bench | MMBench-En | Note                           |
|------------------------------|-----------|------------|--------------------------------|
| Base model (no fine-tuning)  | ~55%      | 84.9%      | Weak spatial, strong general   |
| + 1M spatial data            | ~60%      | 84.5%      | Limited spatial improvement    |
| + 4M spatial data            | ~65%      | 84.7%      | Continued gains with scaling   |
| + 8M spatial data (Full)     | 68.8%     | 84.9%      | Best spatial + preserved general |

Key Findings

  • Data Scaling Curve: Spatial intelligence improves approximately logarithmically with data volume, indicating continued but diminishing marginal returns from additional data (a quick fit against the ablation numbers follows after this list).
  • Emergent Generalization: After training on diverse data, the model exhibits a degree of generalization to unseen spatial task types.
  • Overfitting Risk: Signs of overfitting appear on certain spatial benchmarks, particularly when the training data distribution closely resembles the test set.
  • Language Shortcuts: Some spatial reasoning tasks contain language shortcuts (allowing correct answers without image inspection), highlighting the need for dataset debiasing.
  • Preservation of General Capabilities: With an appropriate data mixing strategy, gains in spatial intelligence do not compromise general multimodal understanding performance.
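
As a quick sanity check on the "approximately logarithmic" scaling claim, the ablation numbers above can be fit with acc ≈ a·ln(n) + b; the 1M/4M accuracies are the table's approximate (~) values.

```python
# Least-squares fit of accuracy = a*ln(n) + b to the ablation table above.
# The 1M/4M accuracies are the table's approximate (~) values.
import math

points = [(1e6, 60.0), (4e6, 65.0), (8e6, 68.8)]  # (samples, VSI-Bench %)
xs = [math.log(n) for n, _ in points]
ys = [acc for _, acc in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(f"acc ≈ {a:.2f}·ln(n) + {b:.2f}")                      # ≈ 4.14·ln(n) + 2.58
print(f"each doubling adds ≈ {a * math.log(2):.2f} points")  # ≈ 2.87
```

Under this fit, each doubling of data buys roughly 2.9 VSI-Bench points, consistent with the diminishing-returns observation.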

Highlights & Insights

  • Data-Driven Rather Than Architecture-Driven Paradigm: The work demonstrates that spatial intelligence can be acquired through large-scale data scaling without modifying model architectures, providing a reference approach for addressing other capability deficiencies.
  • Systematic Spatial Capability Taxonomy: The proposed taxonomy not only guides data construction but also provides the community with a framework for evaluating spatial intelligence.
  • Early Signals of Emergent Generalization: The generalization that emerges from data diversity warrants deeper investigation, suggesting that large-scale, diverse data may be a key pathway toward general spatial intelligence.

Limitations & Future Work

  • Constructing 8M samples incurs substantial cost, and the approach partially relies on synthetic data whose realism and diversity warrant further evaluation.
  • Spatial chain-of-thought reasoning (Spatial CoT) remains preliminary; more complex multi-step spatial reasoning capabilities are limited.
  • Evaluation is conducted primarily on static images; assessment of spatial intelligence in video and interactive settings is absent.
  • For spatial tasks requiring precise numerical outputs (e.g., accurate depth estimation), the precision of the current approach still has room for improvement.
  • Diminishing marginal returns from data scaling suggest that future work may need to integrate architectural improvements or novel training paradigms.

Comparison with Related Work

  • vs. SpatialVLM: SpatialVLM concentrates on object-pair spatial-relationship QA, whereas SenseNova-SI covers a broader range of spatial capability dimensions (e.g., perspective transformation and navigation) under an explicit taxonomy.
  • vs. SpatialRGPT: SpatialRGPT addresses region-level spatial reasoning; SenseNova-SI takes a more comprehensive approach to systematically improving overall spatial intelligence.
  • vs. Specialized 3D Models: Dedicated 3D vision models may outperform on specific tasks, but SenseNova-SI preserves general multimodal capabilities, representing a generalist model approach.

Rating

  • Novelty: ⭐⭐⭐ The core approach is data scaling with limited methodological innovation, though the systematic nature of the work is valuable.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 8 spatial benchmarks plus general benchmarks, with comprehensive ablation and analysis.
  • Writing Quality: ⭐⭐⭐⭐ Report-style writing with rich content, though somewhat verbose.
  • Value: ⭐⭐⭐⭐ Open-sourced models and data make a clear contribution to the community.