SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation¶
- Conference: AAAI 2026
- arXiv: 2511.09555
- Authors: Hao Shi, Bin Xie, Yingfei Liu, Yang Yue, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Gao Huang
- Code: GitHub
- Area: Robotic Manipulation / Spatial Representation Learning
- Keywords: Semantic-geometric disentanglement, depth estimation prior, spatial Transformer, robust manipulation, RLBench
TL;DR¶
This paper proposes SpatialActor, a framework that explicitly disentangles semantic and geometric representations for robotic manipulation. It introduces a Semantic-Guided Geometry Module (SGM) that adaptively fuses noisy depth features with a pretrained depth estimation expert prior, and a Spatial Transformer (SPT) that encodes low-level spatial position cues. Evaluated on 50+ RLBench tasks, SpatialActor reaches an 87.4% average success rate on the 18-task benchmark (state of the art, +6.0% over the previous best) and outperforms RVT-2 by 19.4% under heavy noise.
Background & Motivation¶
3D spatial understanding is critical for robotic manipulation: Real-world manipulation tasks occur in three-dimensional space, requiring precise spatial reasoning, occlusion handling, and fine-grained object interaction. Pure 2D visual methods are insufficient for these scenarios.
Sparsity issues in point cloud methods: Point cloud-based methods (PolarNet, PerAct, etc.) can explicitly represent 3D geometry, but sparse sampling leads to loss of fine-grained semantic information, and the high cost of 3D annotation limits pretraining scale.
Semantic-geometric entanglement in image-based methods: RGB-D image methods (RVT, RVT-2, etc.) feed RGB and depth into a shared feature space, where entanglement of semantics and geometry makes the model highly sensitive to depth noise — RVT-2's success rate drops by 8.9% under mild noise.
Inevitable real-world depth noise: Sensor noise, illumination variation, and surface reflectance ensure that depth measurements in the real world always contain noise, severely limiting the practical deployment of existing methods.
Lack of low-level spatial cue modeling: Existing joint modeling approaches primarily retain high-level geometric information while neglecting low-level spatial cues — such as precise 2D-3D correspondences — that are critical for accurate interaction.
Core Problem: How can we construct spatial representations that simultaneously provide fine-grained spatial understanding, robustness to sensor noise, and precise low-level spatial cues?
Method¶
Overall Architecture¶
SpatialActor takes as input \(V\)-view RGB images \(I^v \in \mathbb{R}^{H \times W \times 3}\), depth maps \(D^v \in \mathbb{R}^{H \times W}\), proprioceptive state \(P\), and a language instruction \(L\). The framework adopts a three-branch disentangled design:
- Semantic branch: RGB + language instruction → CLIP vision-language encoder → semantic features \(F_{\text{sem}}^v\) and text features \(F_{\text{text}}\)
- Geometric branch: Raw depth \(D^v\) → depth encoder (ResNet-50) → noisy geometric features \(F_{\text{geo}}^v\), enhanced via SGM to yield fused geometric features \(F_{\text{fuse-geo}}^v\)
- Spatial branch: Semantic and fused geometric features are concatenated into \(H^v\), then processed by SPT for spatial position encoding and multi-layer interaction
An action head finally predicts the 3D end-effector pose \(A = (x, y, z, \theta_x, \theta_y, \theta_z, g)\).
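The overall data flow can be summarized in a minimal, runnable sketch. The tiny stand-in encoders, channel sizes, and single-view input below are illustrative assumptions (language conditioning, proprioception, and multi-view fusion are omitted); the SGM and SPT internals are sketched in the following subsections.

```python
# Minimal sketch of SpatialActor's three-branch data flow (PyTorch).
# The real model uses a CLIP vision-language encoder, a ResNet-50 depth encoder,
# and a frozen Depth Anything v2 expert; small stand-in layers keep this runnable.
import torch
import torch.nn as nn

class SpatialActorSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.semantic_enc = nn.Conv2d(3, dim, 3, stride=2, padding=1)  # stand-in for CLIP
        self.depth_enc = nn.Conv2d(1, dim, 3, stride=2, padding=1)     # stand-in for ResNet-50 on raw depth
        self.expert_enc = nn.Conv2d(3, dim, 3, stride=2, padding=1)    # stand-in for Depth Anything v2 (RGB -> geometry)
        self.sgm = nn.Conv2d(2 * dim, dim, 1)                          # stand-in for SGM fusion
        self.spt = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4, batch_first=True)  # stand-in for SPT
        self.heatmap_head = nn.Linear(2 * dim, 1)                      # per-token translation logits
        self.pose_head = nn.Linear(2 * dim, 4)                         # (theta_x, theta_y, theta_z, g)

    def forward(self, rgb, depth):
        f_sem = self.semantic_enc(rgb)                          # semantic features
        f_geo = self.depth_enc(depth)                           # noisy geometric features
        f_exp = self.expert_enc(rgb)                            # robust expert prior
        f_fuse = self.sgm(torch.cat([f_geo, f_exp], dim=1))     # fused geometric features
        tokens = torch.cat([f_sem, f_fuse], dim=1).flatten(2).transpose(1, 2)  # spatial tokens
        tokens = self.spt(tokens)                               # spatial interaction
        heatmap = self.heatmap_head(tokens).squeeze(-1)         # 2D heatmap logits
        rot_grip = self.pose_head(tokens.mean(dim=1))           # rotation + gripper
        return heatmap, rot_grip

model = SpatialActorSketch()
hm, rg = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(hm.shape, rg.shape)  # torch.Size([2, 1024]) torch.Size([2, 4])
```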
Semantic-Guided Geometry Module (SGM)¶
The core idea of SGM is to fuse two complementary geometric representations:
| Source | Characteristic | Advantage | Disadvantage |
|---|---|---|---|
| Pretrained depth estimation expert (Depth Anything v2) | Infers geometry from RGB | Robust, noise-resistant | Coarse-grained |
| Raw depth encoder (ResNet-50) | Extracts geometry from \(D^v\) | Fine-grained, pixel-level | Sensitive to noise |
The fusion mechanism employs multi-scale gating: an adaptively learned gate \(G^v\) preserves raw fine-grained details in reliable depth regions and falls back to the expert prior in highly noisy regions, balancing granularity against robustness.
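A minimal sketch of this gated fusion, assuming a sigmoid gate predicted at each scale from the concatenated features; SGM's actual gating network, feature scales, and channel sizes are illustrative assumptions here.

```python
# Sketch of semantic-guided gated fusion at multiple feature scales (PyTorch).
# geo_feats: noisy raw-depth features; exp_feats: robust depth-expert prior features.
# The gate learns to trust raw depth where it is reliable and to fall back
# to the expert prior in noisy regions.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One gate predictor per feature scale.
        self.gates = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2 * c, c, kernel_size=1), nn.Sigmoid())
            for c in channels
        ])

    def forward(self, geo_feats, exp_feats):
        fused = []
        for gate, f_geo, f_exp in zip(self.gates, geo_feats, exp_feats):
            g = gate(torch.cat([f_geo, f_exp], dim=1))  # per-pixel gate in [0, 1]
            fused.append(g * f_geo + (1 - g) * f_exp)   # blend fine detail with robust prior
        return fused

fusion = GatedFusion(channels=[64, 128])
geo = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
exp = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
out = fusion(geo, exp)
print([f.shape for f in out])
```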
Spatial Transformer (SPT)¶
SPT assigns precise 3D positional information to each spatial token:
- 3D coordinate computation: Camera intrinsics \(K^v\) and extrinsics \(E^v\) are used to back-project pixel \((x', y')\) and corresponding depth \(d\) into 3D coordinates \([x,y,z]^\top\) in the robot frame.
- Rotary Position Embedding (RoPE): the embedding dimension is split so that \(D/3\) dimensions are allocated to each axis of the 3D coordinates, and sine/cosine position encodings are generated so that tokens at different spatial positions carry distinct positional signatures (a sketch follows this list).
- View-level interaction: Self-attention + FFN refines token representations within each view.
- Scene-level interaction: Tokens from all views are concatenated with language features \(F_{\text{text}}\), and cross-view, cross-modal information is fused via self-attention + FFN.
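A minimal sketch of the back-projection and the per-axis \(D/3\) split described above. The camera parameters and frequency base are illustrative assumptions, and in the actual model the sine/cosine terms act as rotary rotations inside attention; only their construction is shown here.

```python
# Sketch: back-project a pixel to 3D robot-frame coordinates and build per-axis
# sinusoidal tables with D/3 dimensions per coordinate axis (NumPy).
import numpy as np

def backproject(u, v, d, K, E):
    """Pixel (u, v) with depth d -> 3D point in the robot/world frame."""
    x_cam = np.linalg.inv(K) @ np.array([u * d, v * d, d])  # camera-frame point
    x_hom = E @ np.append(x_cam, 1.0)                       # extrinsics: camera -> world
    return x_hom[:3]

def axis_sincos(coord, dim, base=100.0):
    """Sine/cosine features for one coordinate axis using `dim` dimensions."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    angles = coord * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

K = np.array([[120.0, 0.0, 64.0],
              [0.0, 120.0, 64.0],
              [0.0, 0.0, 1.0]])   # illustrative intrinsics for a 128x128 image
E = np.eye(4)                     # illustrative extrinsics (camera frame == robot frame)

p = backproject(u=80, v=40, d=0.6, K=K, E=E)    # 3D position of one spatial token
D = 96                                          # token embedding dimension
pos_emb = np.concatenate([axis_sincos(c, D // 3) for c in p])  # D/3 dims per axis
print(p, pos_emb.shape)                         # (3,) coordinates, (96,) embedding
```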
Action Prediction and Supervision¶
- A decoder (ConvexUp) generates per-view 2D heatmaps; argmax retrieves the target 2D location, which is then lifted to 3D via the camera model.
- An MLP regresses rotation angles \((\theta_x, \theta_y, \theta_z)\) and gripper state \(g\).
- Loss functions: cross-entropy on 2D heatmaps (translation) + cross-entropy on discretized Euler angles (rotation) + binary classification loss (gripper).
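A minimal sketch of the three-part supervision, assuming 72 discretization bins per Euler angle and unit loss weights (both are assumptions rather than values taken from the paper).

```python
# Sketch of the action supervision: cross-entropy over per-view 2D heatmaps
# (translation), cross-entropy over discretized Euler angles (rotation), and
# binary cross-entropy for the gripper state.
import torch
import torch.nn.functional as F

def action_loss(heatmap_logits, rot_logits, grip_logit,
                target_pixel, target_rot_bins, target_grip):
    """heatmap_logits: (V, H, W); rot_logits: (3, n_bins); grip_logit: scalar logit."""
    V, H, W = heatmap_logits.shape
    # Translation: each view's heatmap is a classification over H*W pixel locations.
    trans_loss = F.cross_entropy(
        heatmap_logits.view(V, H * W),
        target_pixel[:, 0] * W + target_pixel[:, 1],   # (V,) flattened target indices
    )
    # Rotation: one classification per Euler angle over discretized bins.
    rot_loss = F.cross_entropy(rot_logits, target_rot_bins)
    # Gripper: open/close binary classification.
    grip_loss = F.binary_cross_entropy_with_logits(grip_logit, target_grip)
    return trans_loss + rot_loss + grip_loss

V, H, W, n_bins = 4, 32, 32, 72
loss = action_loss(
    heatmap_logits=torch.randn(V, H, W),
    rot_logits=torch.randn(3, n_bins),
    grip_logit=torch.randn(()),
    target_pixel=torch.randint(0, H, (V, 2)),
    target_rot_bins=torch.randint(0, n_bins, (3,)),
    target_grip=torch.tensor(1.0),
)
print(loss)
```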
Key Experimental Results¶
Experiment 1: RLBench 18-Task Performance Comparison¶
Setup: RLBench benchmark, 18 tasks with 249 variations, 4 fixed RGB-D cameras (128×128 resolution), 100 expert demonstrations per task for training and 25 unseen episodes for testing. Trained on 8 GPUs for ~40k iterations with batch size 192.
| Method | Avg. Success Rate ↑ | Avg Rank ↓ | Insert Peg | Sort Shape | Drag Stick |
|---|---|---|---|---|---|
| PerAct | 49.4% | 7.1 | 5.6% | 16.8% | 70.4% |
| RVT | 62.9% | 5.3 | 11.2% | 36.0% | 88.0% |
| 3D Diffuser Actor | 81.3% | 2.8 | 65.6% | 44.0% | 96.8% |
| RVT-2 | 81.4% | 2.8 | 40.0% | 35.0% | 99.0% |
| SpatialActor | 87.4% | 2.3 | 93.3% | 73.3% | 98.7% |
Key Findings: SpatialActor achieves an average success rate of 87.4%, surpassing RVT-2 by 6.0%. On tasks requiring high spatial precision — Insert Peg and Sort Shape — it outperforms RVT-2 by 53.3% and 38.3%, respectively, demonstrating the advantage of disentangled spatial representations for fine-grained manipulation.
Experiment 2: Noise Robustness Evaluation¶
Setup: Gaussian noise is injected into the reconstructed point cloud at three levels: Light (20% of points, std=0.05), Middle (50% of points, std=0.1), and Heavy (80% of points, std=0.1). A sketch of this protocol follows the results below.
| Method | Light ↑ | Middle ↑ | Heavy ↑ |
|---|---|---|---|
| RVT-2 | 72.5% | 68.4% | 57.0% |
| SpatialActor | 86.4% (+13.9%) | 85.3% (+16.9%) | 76.4% (+19.4%) |
Key Findings: As noise severity increases, SpatialActor's advantage grows (13.9% → 16.9% → 19.4%). On the Insert Peg task under heavy noise, it surpasses RVT-2 by 61.3%, validating the noise resistance of the SGM gating fusion mechanism.
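For reference, the noise-injection protocol from the setup can be sketched as follows; the exact sampling scheme (which points are perturbed, isotropic Gaussian per coordinate) is an assumption consistent with the description above.

```python
# Sketch of the point-cloud noise injection used in the robustness evaluation:
# perturb a random fraction of points with isotropic Gaussian noise.
import numpy as np

def inject_noise(points, fraction, std, rng=None):
    """points: (N, 3) point cloud; perturb `fraction` of points with N(0, std^2)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = points.copy()
    idx = rng.choice(len(points), size=int(fraction * len(points)), replace=False)
    noisy[idx] += rng.normal(scale=std, size=(len(idx), 3))
    return noisy

levels = {"light": (0.2, 0.05), "middle": (0.5, 0.1), "heavy": (0.8, 0.1)}
cloud = np.random.rand(10000, 3)
noisy = {name: inject_noise(cloud, f, s) for name, (f, s) in levels.items()}
print({k: v.shape for k, v in noisy.items()})
```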
Additional Results¶
- Few-shot generalization (19 new tasks, 10 demonstrations each): SpatialActor 79.2% vs. RVT-2 46.9% (+32.3%), indicating that disentangled representations significantly improve transfer capability.
- ColosseumBench spatial perturbation (20 tasks): SpatialActor achieves best results under object size, container size, and camera pose perturbations (baseline 57.4%; camera perturbation 54.2%).
- Ablation study: disentanglement alone reaches 85.1% (+3.7%); adding SGM raises this to 86.4% (+1.3%; 73.9% under heavy noise); adding SPT raises it to 87.4% (+1.0%; 76.4% under heavy noise).
- Real-world experiment: WidowX arm + RealSense D435i, 8 tasks with 15 variations; SpatialActor 63% vs. RVT-2 43% (+20%).
Highlights & Insights¶
- Explicit semantic-geometric disentanglement: This breaks with the shared-feature-space paradigm of image-based methods, preventing depth noise from interfering with semantic understanding and addressing robustness at its source.
- Complementary geometric fusion: SGM cleverly combines the complementary strengths of a pretrained depth estimation expert (noise-resistant but coarse) and raw depth features (fine-grained but noisy); the gating mechanism adaptively modulates their relative trust.
- Low-level spatial cue modeling: SPT encodes true 3D coordinates into the Transformer via RoPE, establishing precise 2D-3D correspondences so that interactions among spatial tokens carry geometric meaning.
- Large margins under noise and in few-shot settings: a 76.4% success rate under heavy noise (vs. 57.0% for RVT-2) and 79.2% in the few-shot setting (vs. 46.9%), demonstrating substantial practical deployment value.
Limitations & Future Work¶
- Increased computational overhead: Incorporating the frozen Depth Anything v2 expert model increases inference-time computation and memory cost, which may become a bottleneck on resource-constrained robotic platforms.
- Dependence on pretrained depth estimation quality: The robustness of SGM relies on the generalization ability of the depth estimation expert; when the scene distribution deviates significantly from the pretraining data, the quality of the expert prior may degrade.
- Single-arm tabletop scenarios: Experiments are primarily validated on tabletop platforms (Franka / WidowX) and do not cover more complex settings such as bimanual coordination, dexterous hands, or mobile manipulation.
- Fixed viewpoint assumption: Simulation uses 4 fixed RGB-D cameras, and the real-world setup uses only 1 static camera; applicability to dynamic viewpoints or eye-in-hand configurations has not been verified.
Related Work & Insights¶
- Point cloud methods: PolarNet (Chen 2023), PerAct (Shridhar 2023) — explicit 3D structure but sparse.
- Image-based methods: RVT (Goyal 2023), RVT-2 (Goyal 2024), SAM-E (Zhang 2024) — dense semantics but with semantic-geometric entanglement.
- 3D diffusion policy: 3D Diffuser Actor (Ke 2024) — 3D scene representation + diffusion policy.
- Visual foundation models: CLIP (Radford 2021) for semantic priors; Depth Anything v2 (Yang 2025) for geometric priors.
- Voxel-based methods: C2F-ARM-BC (James 2022) — coarse-to-fine voxelization with high computational cost.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Practicality | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Recommended Reading | ⭐⭐⭐⭐ |
Overall Recommendation: ⭐⭐⭐⭐
Rationale: This paper proposes a spatially disentangled representation framework with clear engineering intuition and strong empirical results in robotic manipulation. The design of semantic-geometric disentanglement combined with complementary geometric fusion demonstrates a pronounced advantage under noisy conditions. Comprehensive evaluation across 50+ tasks and real-robot experiments substantially enhance credibility. The core idea of disentanglement plus complementary fusion also serves as a valuable reference for other tasks requiring robust multimodal fusion.