SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics
Conference: CVPR 2026 | arXiv: 2603.12193 | Code: Project Page | Area: Robotics | Keywords: Active Perception, Active Manipulation, Vision-Language-Action, Semantic Camera Control, Decoupled Action Space, 3D Spatial Awareness
TL;DR
SaPaVe is an end-to-end framework that decouples camera motion from manipulation actions via a two-stage bottom-up learning strategy, enabling semantics-driven active perception and viewpoint-invariant manipulation execution. It surpasses GR00T N1 and π₀ by 31.25% and 40%, respectively, on real-world tasks.
Background & Motivation
- Core challenge of active manipulation: Robots must simultaneously possess semantic active perception (strategically adjusting viewpoints to acquire task-critical information) and active viewpoint execution (robustly completing operations under dynamic viewpoints)—two complementary capabilities that existing methods struggle to unify.
- Limitations of VLM discretization: VLM-based methods model active perception as a VQA task, selecting the optimal viewpoint from discrete candidates, and are therefore incapable of continuous, fine-grained camera control.
- Fragility of fixed-viewpoint VLAs: Existing end-to-end VLA models (e.g., π₀, GR00T N1) are primarily trained under fixed near-optimal viewpoints, making them highly sensitive to viewpoint changes and ill-suited for active manipulation scenarios.
- High cost of data acquisition: Real-world data containing both head camera motion and manipulation action annotations is extremely scarce and expensive to collect; directly fine-tuning VLAs in a unified action space leads to conflicts and suboptimal performance.
- Lack of 3D geometric awareness: Current VLA models do not sufficiently exploit 3D geometric priors, resulting in high sensitivity to camera perturbations and an inability to reason effectively under viewpoint changes.
- Absence of evaluation benchmarks: Existing simulation benchmarks (e.g., RLBench, CALVIN) are restricted to fixed viewpoints, and no standardized benchmark exists for evaluating active manipulation capabilities.
Method
Overall Architecture
SaPaVe is built on a VLM backbone that receives RGB images and language instructions, and outputs camera motion and manipulation actions through a decoupled action space. The core design philosophy is that camera motion is embodiment-agnostic and easier to learn than manipulation actions; accordingly, a bottom-up two-stage training strategy is adopted.
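To make the decoupled action space concrete, here is a minimal PyTorch-style sketch of two independent decoders on a shared VLM feature stream, one for 2-DoF camera motion and one for 26-DoF joint-position deltas. The module names, hidden sizes, and mean-pooling interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecoupledActionHeads(nn.Module):
    """Sketch of two independent action decoders on a shared VLM feature
    stream (dimensions and pooling are assumptions for illustration)."""

    def __init__(self, vlm_dim: int = 2048, hidden: int = 512):
        super().__init__()
        # Camera head: continuous 2-DoF head-camera motion (pitch, yaw).
        self.camera_head = nn.Sequential(
            nn.Linear(vlm_dim, hidden), nn.GELU(), nn.Linear(hidden, 2)
        )
        # Manipulation head: 26-DoF joint-position deltas.
        self.manip_head = nn.Sequential(
            nn.Linear(vlm_dim, hidden), nn.GELU(), nn.Linear(hidden, 26)
        )

    def forward(self, vlm_tokens: torch.Tensor):
        # Pool the VLM output tokens into a single conditioning vector.
        ctx = vlm_tokens.mean(dim=1)            # (B, vlm_dim)
        camera_action = self.camera_head(ctx)   # (B, 2)  pitch/yaw
        manip_action = self.manip_head(ctx)     # (B, 26) joint deltas
        return camera_action, manip_action
```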
Key Designs
- Decoupled Action Heads: Two independent decoders handle camera motion (2-DoF pitch/yaw) and manipulation actions (26-DoF joint-position deltas) separately, avoiding the learning conflicts caused by a unified action space.
- Camera Adapter: LoRA adapters are added on top of the VLM to learn semantic camera-motion priors without modifying the original VLM weights, thereby preserving general semantic understanding capabilities.
- Universal Spatial Knowledge Injection (USKI): A pretrained 3D geometric encoder (inherited from MAPAnything) supports arbitrary types of 3D geometric inputs (depth maps, camera intrinsics/extrinsics, etc.). The encoded spatial tokens are element-wise added to the VLM output tokens and injected into the decoupled action heads to guide action decoding, enhancing spatial robustness under viewpoint changes (see the sketch after this list).
- ActiveViewPose-200K Dataset: A dataset of 200K image–language–camera-motion triplets, efficiently constructed via a semi-automatic pipeline using 4K high-quality semantically annotated assets, 500 diverse scenes, heuristic algorithms, and GPT-4o-generated instructions.
- ActiveManip-Bench: Built on NVIDIA Isaac Sim, this benchmark features the G1 humanoid robot, 12 semantic active manipulation tasks, 100 objects, and 20 scenes; it is the first simulation benchmark for evaluating active manipulation.
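As a companion to the USKI item above, a minimal sketch of the spatial-token injection step, assuming the 3D encoder's tokens are first projected to the VLM hidden size so that element-wise addition is well defined; the projection layer, dimensions, and one-to-one token alignment are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatialKnowledgeInjection(nn.Module):
    """Sketch of USKI-style injection: spatial tokens from a (frozen) 3D
    geometric encoder are element-wise added to the VLM output tokens
    before they reach the decoupled action heads. The projection layer and
    aligned token counts are illustrative assumptions, not the paper's
    exact implementation."""

    def __init__(self, spatial_dim: int = 1024, vlm_dim: int = 2048):
        super().__init__()
        # Project spatial tokens into the VLM token space so shapes match.
        self.proj = nn.Linear(spatial_dim, vlm_dim)

    def forward(self, vlm_tokens: torch.Tensor, spatial_tokens: torch.Tensor):
        # vlm_tokens:     (B, N, vlm_dim)     tokens from the VLM backbone
        # spatial_tokens: (B, N, spatial_dim) tokens from the 3D encoder
        fused = vlm_tokens + self.proj(spatial_tokens)  # element-wise addition
        return fused  # consumed by the camera and manipulation heads
```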
Loss & Training
- Stage 1 – Semantic Active Perception Alignment: ActiveViewPose-200K is used to train only the Camera Adapter and camera decoder. The loss is the MSE between predicted and ground-truth camera motion: \(\mathcal{L}_{\text{stage1}} = \mathcal{L}_{\text{MSE}}(A_{\text{head}}, A_{\text{head}}^*)\)
- Stage 2 – Active Manipulation Fine-tuning: The Camera Adapter is frozen; ActiveViewPose-200K and robot manipulation data are mixed to train the decoupled action heads: \(\mathcal{L}_{\text{stage2}} = \lambda_{\text{head}} \mathcal{L}_{\text{head}} + \lambda_{\text{other}} \mathcal{L}_{\text{other}}\)
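A compact sketch of the two training stages; the freezing scheme and loss structure follow the description above, while the batch keys, optimizer handling, and the choice of MSE for the Stage-2 manipulation term are assumptions for illustration.

```python
import torch.nn.functional as F

def stage1_step(model, batch, optimizer):
    """Stage 1 sketch: only the camera adapter and camera decoder are
    trainable; MSE between predicted and ground-truth head-camera motion."""
    pred_cam, _ = model(batch["rgb"], batch["instruction"])
    loss = F.mse_loss(pred_cam, batch["camera_motion_gt"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def stage2_step(model, batch, optimizer, lam_head=1.0, lam_other=1.0):
    """Stage 2 sketch: camera adapter frozen; both decoupled heads trained on
    mixed data with L = lam_head * L_head + lam_other * L_other. Using MSE
    for the manipulation term is an assumption, not necessarily the paper's
    choice of loss."""
    pred_cam, pred_manip = model(batch["rgb"], batch["instruction"])
    loss_head = F.mse_loss(pred_cam, batch["camera_motion_gt"])
    loss_other = F.mse_loss(pred_manip, batch["joint_delta_gt"])
    loss = lam_head * loss_head + lam_other * loss_other
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```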
Key Experimental Results
Semantic Active Perception Evaluation (ActiveViewPose-200K Test Set)
| Method | Val | Test1 | Test2 | Avg |
|---|---|---|---|---|
| Qwen2.5-VL-72B | 63.9 | 65.1 | 58.0 | 62.3 |
| Multi-SpatialMLLM | 72.8 | 74.3 | 63.6 | 70.2 |
| Gemini-2.5-Pro | 73.3 | 76.5 | 68.2 | 72.7 |
| SaPaVe (Stage 1) | 85.5 | 89.1 | 78.3 | 84.3 |
With only 2B parameters, SaPaVe surpasses Gemini-2.5-Pro by 11.6 points on average, demonstrating that semantic active perception is not an emergent capability of general large models but instead requires training on dedicated data.
Real-World Active Manipulation (vs. Existing VLAs; success rate, %)
| Method | Occluded Pick-and-Place | Out-of-View Pick-and-Place | Occluded Articulated Manipulation | Out-of-View Articulated Manipulation | Avg |
|---|---|---|---|---|---|
| π₀ | 55 | 45 | 45 | 35 | 45.00 |
| GR00T-N1 | 60 | 55 | 50 | 50 | 53.75 |
| SaPaVe | 90 | 85 | 85 | 80 | 85.00 |
Ablation Study (success rate, %)
| Ablation | Occluded Pick-and-Place | Out-of-View Pick-and-Place | Occluded Articulated | Out-of-View Articulated | Avg |
|---|---|---|---|---|---|
| w/o Stage 1 | 65 | 55 | 50 | 45 | 53.75 |
| w/o Stage 2 | 75 | 60 | 70 | 60 | 66.25 |
| w/o Decoupled Action Heads | 80 | 70 | 70 | 65 | 71.25 |
| w/o Camera Adapter | 80 | 75 | 70 | 70 | 73.75 |
| w/o USKI | 75 | 75 | 65 | 60 | 68.75 |
| Full Model | 90 | 85 | 85 | 80 | 85.00 |
Each component contributes significantly. Stage 1 has the greatest impact on out-of-view tasks (success rates drop from 85%/80% to 55%/45% when it is removed), and USKI is critical for basic manipulation robustness.
Highlights & Insights
- Elegant decoupling: Separating embodiment-agnostic camera motion from embodiment-specific manipulation actions reduces data requirements and avoids learning conflicts.
- Data efficiency: The bottom-up strategy reuses large-scale, easily collectible camera motion data, requiring only limited robot data to generalize.
- Complete ecosystem: The simultaneous release of a dataset (ActiveViewPose-200K) and an evaluation benchmark (ActiveManip-Bench) fills a critical gap in active manipulation assessment.
- Strong experimental validity: Simulation, real-world, generalization, and ablation experiments collectively provide clear attribution of each component's contribution.
- Substantial real-world gains: Large improvements over π₀ (+40%) and GR00T N1 (+31.25%) in real-world settings demonstrate the necessity of active perception.
Limitations & Future Work
- Currently limited to 2-DoF head camera motion (pitch/yaw), without extension to full 6-DoF viewpoint control.
- Manipulation is evaluated on the Unitree G1 humanoid robot (26-DoF); transferability to other robot morphologies is not sufficiently validated.
- Camera motion in ActiveViewPose-200K is generated by heuristic algorithms, which may deviate from human-intuitive viewpoint adjustment strategies.
- Sim-to-real transfer details are insufficiently discussed.
- The combination of active camera and wrist camera yields limited gains (Tab. 2), suggesting that multi-view fusion strategies warrant further optimization.
Related Work & Insights
- vs. Next-Best-View methods: NBV methods are not end-to-end and lack semantic inputs; SaPaVe achieves semantics-driven end-to-end active perception.
- vs. VQA-based methods: VQA-based methods (e.g., Look Further) select from discrete candidate viewpoints, whereas SaPaVe directly predicts camera motion in continuous space.
- vs. π₀ / GR00T N1: Directly extending the action space of these VLAs to include camera motion performs poorly due to unified-space conflicts and the absence of perceptual priors; SaPaVe substantially surpasses them through decoupling and the two-stage strategy.
- vs. teleoperation methods (e.g., Open-TeleVision): These methods rely on costly real-world data collection, whereas SaPaVe's camera motion training data can be generated at scale with low cost.
Rating
- Novelty: ⭐⭐⭐⭐⭐ — The first end-to-end active manipulation framework; the decoupling and bottom-up strategy are highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Simulation, real-world, generalization, and ablation experiments, with dataset and benchmark released simultaneously.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and well-motivated problem statement, though some details require consulting the appendix.
- Value: ⭐⭐⭐⭐⭐ — Fills an important gap in active manipulation research and represents a significant contribution to the robotics community.