SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation¶
- Conference: AAAI 2026
- arXiv: 2511.09555
- Authors: Hao Shi, Bin Xie, Yingfei Liu, Yang Yue, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Gao Huang
- Code: GitHub
- Area: Robotic Manipulation / Spatial Representation Learning
- Keywords: Semantic-geometric disentanglement, depth estimation prior, spatial Transformer, robust manipulation, RLBench
TL;DR¶
This paper proposes SpatialActor, a framework that explicitly disentangles semantic and geometric representations for robotic manipulation. It introduces a Semantic-Guided Geometry Module (SGM) that adaptively fuses noisy depth features with a pretrained depth estimation expert prior, and a Spatial Transformer (SPT) that encodes low-level spatial position cues. Evaluated on 50+ RLBench tasks, SpatialActor reaches an 87.4% average success rate on the 18-task benchmark (state of the art, +6.0% over the previous best) and outperforms RVT-2 by 19.4% under heavy noise.
Background & Motivation¶
3D spatial understanding is critical for robotic manipulation: Real-world manipulation tasks occur in three-dimensional space, requiring precise spatial reasoning, occlusion handling, and fine-grained object interaction. Pure 2D visual methods are insufficient for these scenarios.
Sparsity issues in point cloud methods: Point cloud-based methods (PolarNet, PerAct, etc.) can explicitly represent 3D geometry, but sparse sampling leads to loss of fine-grained semantic information, and the high cost of 3D annotation limits pretraining scale.
Semantic-geometric entanglement in image-based methods: RGB-D image methods (RVT, RVT-2, etc.) feed RGB and depth into a shared feature space, where entanglement of semantics and geometry makes the model highly sensitive to depth noise — RVT-2's success rate drops by 8.9% under mild noise.
Inevitable real-world depth noise: Sensor noise, illumination variation, and surface reflectance ensure that depth measurements in the real world always contain noise, severely limiting the practical deployment of existing methods.
Lack of low-level spatial cue modeling: Existing joint modeling approaches primarily retain high-level geometric information while neglecting low-level spatial cues — such as precise 2D-3D correspondences — that are critical for accurate interaction.
Core Problem: How can we construct spatial representations that simultaneously provide fine-grained spatial understanding, robustness to sensor noise, and precise low-level spatial cues?
Method¶
Overall Architecture¶
SpatialActor takes as input \(V\)-view RGB images \(I^v \in \mathbb{R}^{H \times W \times 3}\), depth maps \(D^v \in \mathbb{R}^{H \times W}\), proprioceptive state \(P\), and a language instruction \(L\). The framework adopts a three-branch disentangled design:
- Semantic branch: RGB + language instruction → CLIP vision-language encoder → semantic features \(F_{\text{sem}}^v\) and text features \(F_{\text{text}}\)
- Geometric branch: Raw depth \(D^v\) → depth encoder (ResNet-50) → noisy geometric features \(F_{\text{geo}}^v\), enhanced via SGM to yield fused geometric features \(F_{\text{fuse-geo}}^v\)
- Spatial branch: Semantic and fused geometric features are concatenated into \(H^v\), then processed by SPT for spatial position encoding and multi-layer interaction
An action head finally predicts the 3D end-effector pose \(A = (x, y, z, \theta_x, \theta_y, \theta_z, g)\).
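The overall data flow can be summarized in a minimal, runnable sketch. The tiny stand-in encoders, channel sizes, and single-view input below are illustrative assumptions (language conditioning, proprioception, and multi-view fusion are omitted); the SGM and SPT internals are sketched in the following subsections.

```python
# Minimal sketch of SpatialActor's three-branch data flow (PyTorch).
# The real model uses a CLIP vision-language encoder, a ResNet-50 depth encoder,
# and a frozen Depth Anything v2 expert; small stand-in layers keep this runnable.
import torch
import torch.nn as nn

class SpatialActorSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.semantic_enc = nn.Conv2d(3, dim, 3, stride=2, padding=1)  # stand-in for CLIP
        self.depth_enc = nn.Conv2d(1, dim, 3, stride=2, padding=1)     # stand-in for ResNet-50 on raw depth
        self.expert_enc = nn.Conv2d(3, dim, 3, stride=2, padding=1)    # stand-in for Depth Anything v2 (RGB -> geometry)
        self.sgm = nn.Conv2d(2 * dim, dim, 1)                          # stand-in for SGM fusion
        self.spt = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4, batch_first=True)  # stand-in for SPT
        self.heatmap_head = nn.Linear(2 * dim, 1)                      # per-token translation logits
        self.pose_head = nn.Linear(2 * dim, 4)                         # (theta_x, theta_y, theta_z, g)

    def forward(self, rgb, depth):
        f_sem = self.semantic_enc(rgb)                          # semantic features
        f_geo = self.depth_enc(depth)                           # noisy geometric features
        f_exp = self.expert_enc(rgb)                            # robust expert prior
        f_fuse = self.sgm(torch.cat([f_geo, f_exp], dim=1))     # fused geometric features
        tokens = torch.cat([f_sem, f_fuse], dim=1).flatten(2).transpose(1, 2)  # spatial tokens
        tokens = self.spt(tokens)                               # spatial interaction
        heatmap = self.heatmap_head(tokens).squeeze(-1)         # 2D heatmap logits
        rot_grip = self.pose_head(tokens.mean(dim=1))           # rotation + gripper
        return heatmap, rot_grip

model = SpatialActorSketch()
hm, rg = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(hm.shape, rg.shape)  # torch.Size([2, 1024]) torch.Size([2, 4])
```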
Semantic-Guided Geometry Module (SGM)¶
The core idea of SGM is to fuse two complementary geometric representations:
| Source | Characteristic | Advantage | Disadvantage |
|---|---|---|---|
| Pretrained depth estimation expert (Depth Anything v2) | Infers geometry from RGB | Robust, noise-resistant | Coarse-grained |
| Raw depth encoder (ResNet-50) | Extracts geometry from \(D^v\) | Fine-grained, pixel-level | Sensitive to noise |
The fusion mechanism employs multi-scale gating: an adaptively learned gate \(G^v\) preserves raw fine-grained details in reliable depth regions and falls back to the expert prior in highly noisy regions, balancing granularity against robustness.
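A minimal sketch of this gated fusion, assuming a sigmoid gate predicted at each scale from the concatenated features; SGM's actual gating network, feature scales, and channel sizes are illustrative assumptions here.

```python
# Sketch of semantic-guided gated fusion at multiple feature scales (PyTorch).
# geo_feats: noisy raw-depth features; exp_feats: robust depth-expert prior features.
# The gate learns to trust raw depth where it is reliable and to fall back
# to the expert prior in noisy regions.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One gate predictor per feature scale.
        self.gates = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2 * c, c, kernel_size=1), nn.Sigmoid())
            for c in channels
        ])

    def forward(self, geo_feats, exp_feats):
        fused = []
        for gate, f_geo, f_exp in zip(self.gates, geo_feats, exp_feats):
            g = gate(torch.cat([f_geo, f_exp], dim=1))  # per-pixel gate in [0, 1]
            fused.append(g * f_geo + (1 - g) * f_exp)   # blend fine detail with robust prior
        return fused

fusion = GatedFusion(channels=[64, 128])
geo = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
exp = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
out = fusion(geo, exp)
print([f.shape for f in out])
```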
Spatial Transformer (SPT)¶
SPT assigns precise 3D positional information to each spatial token:
- 3D coordinate computation: Camera intrinsics \(K^v\) and extrinsics \(E^v\) are used to back-project pixel \((x', y')\) and corresponding depth \(d\) into 3D coordinates \([x,y,z]^\top\) in the robot frame.
- Rotary Position Embedding (RoPE): the embedding dimension is split so that \(D/3\) dimensions are allocated to each axis of the 3D coordinates, and sine/cosine position encodings are generated so that tokens at different spatial positions carry distinct positional signatures (a sketch follows this list).
- View-level interaction: Self-attention + FFN refines token representations within each view.
- Scene-level interaction: Tokens from all views are concatenated with language features \(F_{\text{text}}\), and cross-view, cross-modal information is fused via self-attention + FFN.
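A minimal sketch of the back-projection and the per-axis \(D/3\) split described above. The camera parameters and frequency base are illustrative assumptions, and in the actual model the sine/cosine terms act as rotary rotations inside attention; only their construction is shown here.

```python
# Sketch: back-project a pixel to 3D robot-frame coordinates and build per-axis
# sinusoidal tables with D/3 dimensions per coordinate axis (NumPy).
import numpy as np

def backproject(u, v, d, K, E):
    """Pixel (u, v) with depth d -> 3D point in the robot/world frame."""
    x_cam = np.linalg.inv(K) @ np.array([u * d, v * d, d])  # camera-frame point
    x_hom = E @ np.append(x_cam, 1.0)                       # extrinsics: camera -> world
    return x_hom[:3]

def axis_sincos(coord, dim, base=100.0):
    """Sine/cosine features for one coordinate axis using `dim` dimensions."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    angles = coord * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

K = np.array([[120.0, 0.0, 64.0],
              [0.0, 120.0, 64.0],
              [0.0, 0.0, 1.0]])   # illustrative intrinsics for a 128x128 image
E = np.eye(4)                     # illustrative extrinsics (camera frame == robot frame)

p = backproject(u=80, v=40, d=0.6, K=K, E=E)    # 3D position of one spatial token
D = 96                                          # token embedding dimension
pos_emb = np.concatenate([axis_sincos(c, D // 3) for c in p])  # D/3 dims per axis
print(p, pos_emb.shape)                         # (3,) coordinates, (96,) embedding
```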
Action Prediction and Supervision¶
- A decoder (ConvexUp) generates per-view 2D heatmaps; argmax retrieves the target 2D location, which is then lifted to 3D via the camera model.
- An MLP regresses rotation angles \((\theta_x, \theta_y, \theta_z)\) and gripper state \(g\).
- Loss functions: cross-entropy on 2D heatmaps (translation) + cross-entropy on discretized Euler angles (rotation) + binary classification loss (gripper).
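A minimal sketch of the three-part supervision, assuming 72 discretization bins per Euler angle and unit loss weights (both are assumptions rather than values taken from the paper).

```python
# Sketch of the action supervision: cross-entropy over per-view 2D heatmaps
# (translation), cross-entropy over discretized Euler angles (rotation), and
# binary cross-entropy for the gripper state.
import torch
import torch.nn.functional as F

def action_loss(heatmap_logits, rot_logits, grip_logit,
                target_pixel, target_rot_bins, target_grip):
    """heatmap_logits: (V, H, W); rot_logits: (3, n_bins); grip_logit: scalar logit."""
    V, H, W = heatmap_logits.shape
    # Translation: each view's heatmap is a classification over H*W pixel locations.
    trans_loss = F.cross_entropy(
        heatmap_logits.view(V, H * W),
        target_pixel[:, 0] * W + target_pixel[:, 1],   # (V,) flattened target indices
    )
    # Rotation: one classification per Euler angle over discretized bins.
    rot_loss = F.cross_entropy(rot_logits, target_rot_bins)
    # Gripper: open/close binary classification.
    grip_loss = F.binary_cross_entropy_with_logits(grip_logit, target_grip)
    return trans_loss + rot_loss + grip_loss

V, H, W, n_bins = 4, 32, 32, 72
loss = action_loss(
    heatmap_logits=torch.randn(V, H, W),
    rot_logits=torch.randn(3, n_bins),
    grip_logit=torch.randn(()),
    target_pixel=torch.randint(0, H, (V, 2)),
    target_rot_bins=torch.randint(0, n_bins, (3,)),
    target_grip=torch.tensor(1.0),
)
print(loss)
```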
Key Experimental Results¶
Experiment 1: RLBench 18-Task Performance Comparison¶
Setup: RLBench benchmark, 18 tasks with 249 variations, 4 fixed RGB-D cameras (128×128 resolution), 100 expert demonstrations per task for training and 25 unseen episodes for testing. Trained on 8 GPUs for ~40k iterations with batch size 192.
| Method | Avg. Success Rate ↑ | Avg Rank ↓ | Insert Peg | Sort Shape | Drag Stick |
|---|---|---|---|---|---|
| PerAct | 49.4% | 7.1 | 5.6% | 16.8% | 70.4% |
| RVT | 62.9% | 5.3 | 11.2% | 36.0% | 88.0% |
| 3D Diffuser Actor | 81.3% | 2.8 | 65.6% | 44.0% | 96.8% |
| RVT-2 | 81.4% | 2.8 | 40.0% | 35.0% | 99.0% |
| SpatialActor | 87.4% | 2.3 | 93.3% | 73.3% | 98.7% |
Key Findings: SpatialActor achieves an average success rate of 87.4%, surpassing RVT-2 by 6.0%. On tasks requiring high spatial precision — Insert Peg and Sort Shape — it outperforms RVT-2 by 53.3% and 38.3%, respectively, demonstrating the advantage of disentangled spatial representations for fine-grained manipulation.
Experiment 2: Noise Robustness Evaluation¶
Setup: Gaussian noise is injected into the reconstructed point cloud at three levels: Light (20% of points, std=0.05), Middle (50% of points, std=0.1), and Heavy (80% of points, std=0.1). A sketch of this protocol follows the results below.
| Method | Light ↑ | Middle ↑ | Heavy ↑ |
|---|---|---|---|
| RVT-2 | 72.5% | 68.4% | 57.0% |
| SpatialActor | 86.4% (+13.9%) | 85.3% (+16.9%) | 76.4% (+19.4%) |
Key Findings: As noise severity increases, SpatialActor's advantage grows (13.9% → 16.9% → 19.4%). On the Insert Peg task under heavy noise, it surpasses RVT-2 by 61.3%, validating the noise resistance of the SGM gating fusion mechanism.
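For reference, the noise-injection protocol from the setup can be sketched as follows; the exact sampling scheme (which points are perturbed, isotropic Gaussian per coordinate) is an assumption consistent with the description above.

```python
# Sketch of the point-cloud noise injection used in the robustness evaluation:
# perturb a random fraction of points with isotropic Gaussian noise.
import numpy as np

def inject_noise(points, fraction, std, rng=None):
    """points: (N, 3) point cloud; perturb `fraction` of points with N(0, std^2)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = points.copy()
    idx = rng.choice(len(points), size=int(fraction * len(points)), replace=False)
    noisy[idx] += rng.normal(scale=std, size=(len(idx), 3))
    return noisy

levels = {"light": (0.2, 0.05), "middle": (0.5, 0.1), "heavy": (0.8, 0.1)}
cloud = np.random.rand(10000, 3)
noisy = {name: inject_noise(cloud, f, s) for name, (f, s) in levels.items()}
print({k: v.shape for k, v in noisy.items()})
```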
Additional Results¶
- Few-shot generalization (19 new tasks, 10 demonstrations each): SpatialActor 79.2% vs. RVT-2 46.9% (+32.3%), indicating that disentangled representations significantly improve transfer capability.
- ColosseumBench spatial perturbation (20 tasks): SpatialActor achieves best results under object size, container size, and camera pose perturbations (baseline 57.4%; camera perturbation 54.2%).
- Ablation study: disentanglement alone reaches 85.1% (+3.7%); adding SGM raises this to 86.4% (+1.3%; 73.9% under heavy noise); adding SPT raises it to 87.4% (+1.0%; 76.4% under heavy noise).
- Real-world experiment: WidowX arm + RealSense D435i, 8 tasks with 15 variations; SpatialActor 63% vs. RVT-2 43% (+20%).
Highlights & Insights¶
- Explicit semantic-geometric disentanglement: This breaks with the shared-feature-space paradigm of image-based methods, preventing depth noise from interfering with semantic understanding and addressing robustness at its source.
- Complementary geometric fusion: SGM cleverly combines the complementary strengths of a pretrained depth estimation expert (noise-resistant but coarse) and raw depth features (fine-grained but noisy); the gating mechanism adaptively modulates their relative trust.
- Low-level spatial cue modeling: SPT encodes true 3D coordinates into the Transformer via RoPE, establishing precise 2D-3D correspondences so that interactions among spatial tokens carry geometric meaning.
- Large margins under noise and in few-shot settings: a 76.4% success rate under heavy noise (vs. 57.0% for RVT-2) and 79.2% in the few-shot setting (vs. 46.9%), demonstrating substantial practical deployment value.
Limitations & Future Work¶
- Increased computational overhead: Incorporating the frozen Depth Anything v2 expert model increases inference-time computation and memory cost, which may become a bottleneck on resource-constrained robotic platforms.
- Dependence on pretrained depth estimation quality: The robustness of SGM relies on the generalization ability of the depth estimation expert; when the scene distribution deviates significantly from the pretraining data, the quality of the expert prior may degrade.
- Single-arm tabletop scenarios: Experiments are primarily validated on tabletop platforms (Franka / WidowX) and do not cover more complex settings such as bimanual coordination, dexterous hands, or mobile manipulation.
- Fixed viewpoint assumption: Simulation uses 4 fixed RGB-D cameras, and the real-world setup uses only 1 static camera; applicability to dynamic viewpoints or eye-in-hand configurations has not been verified.
Related Work & Insights¶
- Point cloud methods: PolarNet (Chen 2023), PerAct (Shridhar 2023) — explicit 3D structure but sparse.
- Image-based methods: RVT (Goyal 2023), RVT-2 (Goyal 2024), SAM-E (Zhang 2024) — dense semantics but with semantic-geometric entanglement.
- 3D diffusion policy: 3D Diffuser Actor (Ke 2024) — 3D scene representation + diffusion policy.
- Visual foundation models: CLIP (Radford 2021) for semantic priors; Depth Anything v2 (Yang 2025) for geometric priors.
- Voxel-based methods: C2F-ARM-BC (James 2022) — coarse-to-fine voxelization with high computational cost.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Practicality | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Recommended Reading | ⭐⭐⭐⭐ |
Overall Recommendation: ⭐⭐⭐⭐
Rationale: This paper proposes a spatially disentangled representation framework with clear engineering intuition and strong empirical results in robotic manipulation. The design of semantic-geometric disentanglement combined with complementary geometric fusion demonstrates a pronounced advantage under noisy conditions. Comprehensive evaluation across 50+ tasks and real-robot experiments substantially enhance credibility. The core idea of disentanglement plus complementary fusion also serves as a valuable reference for other tasks requiring robust multimodal fusion.