# SoPE: Spherical Coordinate-Based Positional Embedding for 3D LVLMs
**Conference:** CVPR 2026 · **arXiv:** 2602.22716 · **Code:** None · **Area:** 3D Vision / Multimodal VLM / Positional Encoding · **Keywords:** 3D LVLM, positional encoding, spherical coordinates, RoPE, SpatialLM, spatial reasoning
## TL;DR
This paper identifies spatial perception bias in RoPE when applied to 3D LVLMs (1D indexing disrupts 3D locality and ignores directionality), and proposes SoPE, a spherical coordinate-based positional embedding using a four-dimensional index \((t, r, \theta, \phi)\) with multi-dimensional frequency allocation and multi-scale mixing. SoPE achieves state-of-the-art performance on 3D layout estimation and object detection benchmarks built upon SpatialLM.
## Background & Motivation
Background: 3D LVLMs encode point clouds and process them jointly with an LLM for 3D scene understanding. Mainstream approaches inherit RoPE from LLMs, flattening point cloud tokens into a 1D sequence via raster-scan ordering.
Limitations of Prior Work: Information flow visualization reveals severe spatial perception bias — cross-modal attention concentrates on a few hotspot tokens, the majority of 3D tokens receive approximately uniform weights, and small objects along with structural boundaries are systematically suppressed. Two root causes are identified: (i) 1D raster indexing destroys the 3D spatial continuity of point clouds, causing spatially adjacent tokens to receive non-adjacent positional indices; (ii) the relative distance \(\Delta t = t_1 - t_2\) captures only sequential order, with no sensitivity to spatial position or directional change.
Key Challenge: RoPE is designed for 1D text and, when naively applied to 3D point clouds, inherently neglects spatial structure and directional information. Existing 2D/video extensions (VideoRoPE, M-RoPE) target image grids and are unsuitable for irregular point clouds.
Key Insight: Spherical coordinates \((r, \theta, \phi)\) naturally decouple distance from direction. Mapping 3D tokens into spherical space enables simultaneous encoding of position and orientation.
Core Idea: Replace the 1D raster index with spherical coordinates \((t, r, \theta, \phi)\), and allocate RoPE frequency bands functionally across different coordinate components.
## Method

### Overall Architecture
SpatialLM baseline → extract \((x, y, z)\) coordinates of point cloud tokens → convert to spherical coordinates \((r, \theta, \phi)\) while retaining temporal index \(t\) → allocate 128-dimensional RoPE frequency bands at ratio \(t:r:\theta:\phi = 24:2:3:3\) → apply multi-scale frequency mixing to each component → replace original RoPE → end-to-end training.
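Since no code is released, the following is a minimal sketch of the coordinate remapping step only: converting point-cloud token centers \((x, y, z)\) into the four-dimensional SoPE index \((t, r, \theta, \phi)\). The function name and tensor layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code, which is unreleased): remapping
# point-cloud token centers from Cartesian (x, y, z) to the four-dimensional
# SoPE index (t, r, theta, phi). Names and tensor layout are illustrative.
import torch

def cartesian_to_sope_index(xyz: torch.Tensor) -> torch.Tensor:
    """xyz: (N, 3) token centers -> (N, 4) positions (t, r, theta, phi)."""
    x, y, z = xyz.unbind(dim=-1)
    r = torch.sqrt(x**2 + y**2 + z**2).clamp(min=1e-6)  # radial distance
    theta = torch.acos((z / r).clamp(-1.0, 1.0))         # polar angle in [0, pi]
    phi = torch.atan2(y, x)                              # azimuth in (-pi, pi]
    t = torch.arange(xyz.shape[0], dtype=xyz.dtype)      # keep the 1D sequence index
    return torch.stack([t, r, theta, phi], dim=-1)
```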
### Key Designs
- **Spherical Coordinate Positional Projection**
    - Function: Remaps 3D tokens from 1D raster indices to geometrically aware four-dimensional positions \((t, r, \theta, \phi)\).
    - Mechanism: \(r = \sqrt{x^2+y^2+z^2}\), \(\theta = \arccos(z/r)\), \(\phi = \operatorname{atan2}(y, x)\) (see the combined sketch after this list). The relative displacement is decomposed into four components \(\Delta t, \Delta r, \Delta\theta, \Delta\phi\), naturally encoding both spatial position change and directional angular change.
    - Design Motivation: Cartesian 3D coordinates (RoPE-3D) encode position but cannot distinguish angular relationships; spherical decomposition renders radial distance and angular direction orthogonal, making directional information explicit.
- **Multi-Dimensional Frequency Allocation**
    - Function: Distributes the 128-dimensional RoPE frequency bands across the four coordinate components at ratio \(t:r:\theta:\phi = 24:2:3:3\).
    - Mechanism: Spherical components \((r, \theta, \phi)\) are mapped to high-frequency sub-bands (capturing fine-grained spatial and angular variation), while the temporal index \(t\) is mapped to low-frequency sub-bands (preserving long-range temporal coherence). The rotation matrix is block-diagonal, with each component encoded independently and combined additively.
    - Design Motivation: The value range of \(t\) greatly exceeds that of the angular components, necessitating more low-frequency bands for temporal smoothness; angular variations are typically small and fine-grained, requiring high-frequency bands for discrimination. The ratio is determined via large-scale ablation experiments (Uniform, Angular-Biased, Temporal-Biased).
- **Multi-Scale Frequency Mixing**
    - Function: Fuses linear, logarithmic, and periodic transformations at the RoPE phase level for each coordinate component.
    - Mechanism: \(\varphi_k(u) = \frac{1}{3}\left(\omega_k^{\mathrm{lin}} g^{\mathrm{lin}}(u) + \omega_k^{\mathrm{log}} g^{\mathrm{log}}(u) + \omega_k^{\mathrm{per}} g^{\mathrm{per}}(u)\right)\). The linear term preserves absolute precision, the logarithmic term emphasizes local neighborhoods, and the periodic term captures global structure. Equal-weight mixing introduces no additional learnable parameters.
    - Design Motivation: Single-scale encoding struggles to simultaneously capture fine-grained geometry and large-scale layout; multi-scale fusion endows the model with discriminative power across different spatial ranges.
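To make the three designs concrete, the sketch below turns the four-component index into per-pair rotation angles. Several details are assumptions rather than the paper's method: the 24:2:3:3 ratio is read as splitting the 128 dims (64 rotary pairs) into 48/4/6/6 pairs, the \(g^{\mathrm{lin}}, g^{\mathrm{log}}, g^{\mathrm{per}}\) transforms are taken as identity, signed log, and sine, and one shared frequency set is used per sub-band.

```python
# Hedged sketch of SoPE frequency allocation + multi-scale phase mixing.
# Assumptions (no released code): 24:2:3:3 over 128 dims = 96/8/12/12 dims,
# i.e. 48/4/6/6 rotary pairs; g^{lin}/g^{log}/g^{per} chosen as identity,
# signed log, and sine; one shared omega per sub-band.
import torch

HEAD_DIM = 128                                    # per-head dimension (64 rotary pairs)
SPLIT = {"t": 48, "r": 4, "theta": 6, "phi": 6}   # rotary pairs per component
BASE = 10000.0

def inv_freq(n_pairs: int) -> torch.Tensor:
    """Standard RoPE-style inverse frequencies for a sub-band of n_pairs."""
    return 1.0 / (BASE ** (torch.arange(n_pairs, dtype=torch.float32) / n_pairs))

def mixed_phase(u: torch.Tensor, omega: torch.Tensor) -> torch.Tensor:
    """Equal-weight mix of linear, logarithmic, and periodic phase transforms."""
    u = u[:, None]                                  # (N, 1) component values
    g_lin = u                                       # preserves absolute precision
    g_log = torch.sign(u) * torch.log1p(u.abs())    # emphasizes local neighborhoods
    g_per = torch.sin(u)                            # captures periodic/global structure
    return omega * (g_lin + g_log + g_per) / 3.0    # (N, n_pairs) phases

def sope_angles(pos: torch.Tensor) -> torch.Tensor:
    """pos: (N, 4) indices (t, r, theta, phi) -> (N, 64) rotation angles."""
    comps = dict(zip(("t", "r", "theta", "phi"), pos.unbind(dim=-1)))
    # Block-diagonal layout: each component fills only its own sub-band of pairs.
    return torch.cat([mixed_phase(comps[k], inv_freq(n)) for k, n in SPLIT.items()], dim=-1)

def apply_rotary(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Apply a standard RoPE rotation to q or k; x: (N, 128), angles: (N, 64)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Under this reading, SoPE stays a drop-in replacement for RoPE: only the angles fed to the usual rotary rotation change, which is consistent with the paper's claim of no additional inference overhead.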
### Loss & Training
Training follows the SpatialLM setup. Architecture: Sonata encoder + Qwen2.5-0.5B LLM + 2-layer MLP. Single-stage training on 4 × NVIDIA H20 GPUs. SoPE serves as a drop-in replacement for RoPE with no additional inference overhead.
## Key Experimental Results

### Main Results
| Method | ARKitScenes F1@0.25 | ARKitScenes F1@0.50 | SpatialLM Dataset F1@0.25 | SpatialLM Dataset F1@0.50 |
|---|---|---|---|---|
| SpatialLM (RoPE) | 63.9 | 60.7 | 69.7 | 62.0 |
| + CCA | 64.1 | 60.5 | 69.8 | 62.5 |
| + RoPE-3D | 64.2 | 61.4 | 69.7 | 62.4 |
| SpatialSoPE | 66.1 | 63.2 | 71.4 | 63.4 |

| Method | Structured3D IoU2D@0.25 | Structured3D IoU2D@0.50 |
|---|---|---|
| RoomFormer | 70.4 | 67.2 |
| SceneScript | 83.1 | 80.8 |
| SpatialLM (ft.) | 86.5 | 84.6 |
| SpatialSoPE (ft.) | 88.7 | 86.2 |
### Ablation Study
| Configuration | ARKitScenes F1@0.25 | ARKitScenes F1@0.50 | Note |
|---|---|---|---|
| Ratio 24:2:3:3 (optimal) | 66.1 | 63.2 | Ours |
| Ratio 8:6:9:9 (Angular-Biased) | 65.5 | 62.7 | Over-allocation to spherical |
| Ratio 1:1:1:1 (Uniform) | 63.0 | 59.0 | 3–4 points below the optimal ratio |
| Ratio 5:1:1:1 (Temporal-Biased) | 65.0 | 62.7 | Temporal-dominant |
| SoPE w/o multi-scale mixing | 65.4 | 61.4 | Multi-scale contributes +1.8 |
| RoPE-3D + multi-scale | 64.8 | 62.1 | Spherical > Cartesian |
### Key Findings
- Multi-scale mixing yields larger gains for SoPE (+0.7/+1.8) than for RoPE-3D — spherical coordinates are a prerequisite for fully benefiting from multi-scale mixing.
- Spherical > Cartesian > 2D projection; directional/angular encoding is the key differentiating factor.
- Information flow visualization confirms that SoPE produces more balanced cross-modal attention, eliminating the hotspot concentration observed with RoPE.
## Highlights & Insights
- Spherical coordinates naturally decouple distance from orientation — geometrically more appropriate than Cartesian coordinates for 3D positional encoding. The idea is direct and effective, yet previously unexplored.
- A simple modification (coordinate transformation + frequency reallocation) yields substantial gains (ARKitScenes +2.2/+2.5), demonstrating that positional encoding is indeed a critical bottleneck in 3D LVLMs.
- Information flow visualization as a diagnostic tool merits broader adoption — identifying which tokens are under-attended before designing targeted encoding improvements.
## Limitations & Future Work
- Validation is limited to a small 0.5B model; effectiveness on larger models (7B+) remains to be confirmed.
- The choice of spherical origin (scene geometric center vs. camera position) is not thoroughly investigated, which may affect encoding quality.
- The frequency allocation ratio is determined manually; adaptive or learnable schemes may yield further improvement.
- Evaluation is restricted to indoor 3D scenes; outdoor and large-scale settings such as autonomous driving remain untested.
## Related Work & Insights
- vs. RoPE-3D: Cartesian coordinate encoding improves spatial awareness but lacks directional information; SoPE's spherical decomposition encodes both simultaneously.
- vs. VideoRoPE/M-RoPE: These methods perform spatiotemporal decomposition for 2D images/video and are not applicable to the irregular structure of 3D point clouds.
- vs. DRoPE: The polar-coordinate directional extension targets task-specific properties such as heading periodicity; SoPE's spherical formulation is more general-purpose.
## Rating
- Novelty: ⭐⭐⭐⭐ First application of spherical coordinate PE in 3D LVLMs
- Experimental Thoroughness: ⭐⭐⭐⭐ Full ablations across multiple benchmarks, plus real-device deployment latency testing
- Writing Quality: ⭐⭐⭐⭐ Thorough motivation analysis with outstanding information flow visualization
- Value: ⭐⭐⭐⭐ Drop-in replacement for RoPE with high cross-domain reference value