SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics¶

Conference: CVPR 2025
arXiv: 2603.12193
Code: https://lmzpai.github.io/SaPaVe
Area: Robotics
Keywords: VLA, active perception, active manipulation, camera control, humanoid robots

TL;DR¶

SaPaVe is an end-to-end active manipulation framework that decouples camera movement from manipulation in the action space. It is trained bottom-up in two stages: semantic camera control is learned first, as an active perception prior, on a 200K-sample semantic camera movement dataset, and camera and manipulation actions are then optimized jointly. A 3D geometry-aware module improves execution robustness under viewpoint changes. In real-world tasks, SaPaVe achieves 31.25% and 40% higher success rates than GR00T-N1 and \(\pi_0\), respectively.

Background & Motivation¶

Background: VLA models (\(\pi_0\), GR00T-N1) trained and deployed under fixed, near-optimal viewpoints have achieved good manipulation capabilities.

Limitations of Prior Work: (1) The real world contains occlusions and out-of-view objects, which cannot be covered by fixed viewpoints; (2) directly incorporating camera movement into the VLA action space disrupts existing fixed-viewpoint manipulation priors and requires a large amount of expensive active manipulation data; (3) VLA models lack 3D geometric understanding, leading to unstable execution under viewpoint changes.

Key Challenge: Active manipulation requires two complementary capabilities: semantic active perception (selecting appropriate viewpoints) and active viewpoint execution (manipulating even under non-optimal viewpoints). Existing methods either do not support active perception driven by semantic input or cannot handle manipulation under viewpoint changes.

Goal: Achieve both semantics-driven active perception and viewpoint-robust execution in VLAs in a data-efficient manner.

Key Insight: Camera movement is embodiment-agnostic (independent of the robot embodiment) and can be learned separately using large-scale image-language-camera movement data, while manipulation actions are embodiment-specific and require joint optimization.

Core Idea: Decoupled action space (camera vs. manipulation) + bottom-up training (learning active perception first, then active manipulation) + 3D geometry injection.

Method¶

Overall Architecture¶

Input RGB image \(I_t\) + language instruction \(L\) + optional 3D information \(G_t\) → VLM backbone → decoupled action heads outputting camera movement \(A_{\text{head}}\) (pitch/yaw) and manipulation actions \(A_{\text{other}}\) (26-DoF joint angle increments).

Key Designs¶

Decoupled Action Heads & Camera Adapter:
- Function: Separates camera control and manipulation actions into two decoders and uses a LoRA adapter to learn camera movement.
- Mechanism: The Camera Adapter is a LoRA module on the VLM, specifically learning semantic camera control priors. Two independent action decoders respectively predict \(A_{\text{head}} \in \mathbb{R}^2\) (pitch/yaw) and \(A_{\text{other}} \in \mathbb{R}^{26}\) (dual-arm + dual-hand joints).
- Design Motivation: Decoupling avoids interference between camera movement and manipulation actions inside a unified action space; the Camera Adapter keeps the original VLM weights intact.
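
The decoupling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the hidden size, the LoRA rank, the random weights, and the names `lora_linear` and `decode_actions` are hypothetical, and a single linear layer stands in for each action decoder. Only the output sizes (2 for pitch/yaw, 26 for the arm and hand joints) come from the text.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix used as placeholder weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def lora_linear(W, A, B, x, alpha=1.0):
    """Frozen base layer W plus a low-rank update B @ A (LoRA).

    W: (out, in) frozen weights; A: (rank, in); B: (out, rank).
    Only A and B would be trained, so W itself is never modified.
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + alpha * d for b, d in zip(base, delta)]

def decode_actions(h, W_head, W_other):
    """Two independent decoders read the same feature vector h."""
    a_head = matvec(W_head, h)    # camera movement: [pitch, yaw]
    a_other = matvec(W_other, h)  # manipulation: 26 joint-angle increments
    return a_head, a_other

rng = random.Random(0)
HIDDEN, RANK = 8, 2  # hypothetical sizes, chosen only for illustration

W = rand_matrix(HIDDEN, HIDDEN, rng)       # stands in for a frozen VLM layer
A = rand_matrix(RANK, HIDDEN, rng)         # LoRA down-projection
B = [[0.0] * RANK for _ in range(HIDDEN)]  # LoRA up-projection, zero-initialised
W_head = rand_matrix(2, HIDDEN, rng)       # camera decoder weights
W_other = rand_matrix(26, HIDDEN, rng)     # manipulation decoder weights

x = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
h = lora_linear(W, A, B, x)
a_head, a_other = decode_actions(h, W_head, W_other)
print(len(a_head), len(a_other))
```

With `B` initialised to zeros, the adapted layer reproduces the frozen layer exactly at the start of training, which is one way to see how an adapter can add camera-control behaviour without overwriting the original weights.
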

Universal Spatial Knowledge Injection:
- Function: Injects 3D geometric information (depth maps, camera intrinsic/extrinsic parameters, etc.) into the action generation process.
- Mechanism: Uses the encoder of a pre-trained 3D geometric model to encode geometric information into spatial tokens, which are element-wise added to the VLM output tokens to guide action prediction during the action denoising process.
- Design Motivation: Provides 3D spatial understanding for active manipulation under viewpoint changes, without requiring retraining or architecture modifications.
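
The element-wise addition in the mechanism above can be illustrated with a minimal sketch. It assumes the spatial tokens have already been brought to the same shape as the VLM output tokens; how SaPaVe aligns the shapes is not stated here, and the function name `inject_spatial_tokens` and the toy values are hypothetical.

```python
def inject_spatial_tokens(vlm_tokens, spatial_tokens):
    """Element-wise add spatial tokens to VLM output tokens.

    Both inputs are lists of token vectors with identical shape
    (num_tokens x dim); a shape mismatch raises ValueError.
    """
    if len(vlm_tokens) != len(spatial_tokens):
        raise ValueError("token counts differ")
    fused = []
    for v_tok, s_tok in zip(vlm_tokens, spatial_tokens):
        if len(v_tok) != len(s_tok):
            raise ValueError("token dimensions differ")
        fused.append([v + s for v, s in zip(v_tok, s_tok)])
    return fused

# Toy example: two tokens of dimension three.
vlm_tokens = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
spatial_tokens = [[0.25, 0.5, 0.0], [0.0, 0.0, 0.0]]
fused = inject_spatial_tokens(vlm_tokens, spatial_tokens)
print(fused)  # [[0.75, -0.5, 2.0], [1.0, 0.0, -0.5]]
```

Because the fusion is additive, an all-zero spatial token leaves the corresponding VLM token unchanged. That is one possible way to treat the 3D input as optional, though it is an assumption rather than something the summary confirms.
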

Two-Stage Bottom-Up Training:
- Stage 1 (Active Perception Alignment): Train the Camera Adapter + Camera Decoder using ActiveViewPose-200K, with MSE loss supervising the camera movement. Learn "where to look in what scenes".
- Stage 2 (Active Manipulation Fine-tuning): Freeze the Camera Adapter and train the two Action Decoders using mixed data (ActiveViewPose-200K + manipulation data) to jointly optimize both camera and manipulation.
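
The two stages can be summarised as a schedule of which modules are trained and which are frozen. The sketch below is a hedged illustration: the module names, the `Stage` dataclass, and the assumption that the VLM backbone stays frozen in both stages are not taken from the authors' code. Only the Stage 1 MSE loss on camera movement is stated in the text; the Stage 2 manipulation loss is not specified here and is left out.

```python
from dataclasses import dataclass

# Hypothetical module names for illustration.
MODULES = ["vlm_backbone", "camera_adapter", "camera_decoder",
           "manipulation_decoder"]

@dataclass(frozen=True)
class Stage:
    name: str
    datasets: tuple
    trainable: frozenset

STAGES = [
    # Stage 1: learn semantic camera control only.
    Stage("active_perception_alignment",
          ("ActiveViewPose-200K",),
          frozenset({"camera_adapter", "camera_decoder"})),
    # Stage 2: adapter frozen, both decoders trained on mixed data.
    Stage("active_manipulation_finetuning",
          ("ActiveViewPose-200K", "manipulation_data"),
          frozenset({"camera_decoder", "manipulation_decoder"})),
]

def frozen_modules(stage):
    """Modules that receive no gradient updates in this stage."""
    return [m for m in MODULES if m not in stage.trainable]

def mse(pred, target):
    """Mean squared error, as used to supervise camera movement in Stage 1."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

for stage in STAGES:
    print(stage.name, "frozen:", frozen_modules(stage))

# Toy (pitch, yaw) prediction against a target.
print(mse([0.5, -0.25], [0.0, -0.25]))  # 0.125
```

Writing the schedule as data makes the bottom-up ordering explicit: the camera adapter is trained only in Stage 1 and then held fixed while the decoders are optimised jointly.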

Datasets & Benchmark¶

ActiveViewPose-200K: 200K image-language-camera movement pairs, with 4K finely annotated 3D assets + 500 scenes, generated via a semi-automated pipeline.
ActiveManip-Bench: The first active manipulation simulation benchmark, featuring 12 tasks \(\times\) 100 objects \(\times\) 20 scenes.

Key Experimental Results¶

Main Results (ActiveManip-Bench Simulation)¶

Method	Unoccluded (%)	Occluded (%)	Out-of-View (%)	Average (%)
GR00T-N1	50.0	24.2	5.0	17.2
\(\pi_0\)	31.7	17.5	8.3	14.2
SaPaVe	83.3	76.7	70.0	75.2

Real-World Results¶

Method	Occluded Pick-and-Place (%)	Out-of-View Pick-and-Place (%)	Occluded Articulated (%)	Out-of-View Articulated (%)	Avg (%)
GR00T-N1	70	45	55	40	52.5
\(\pi_0\)	55	35	45	30	41.25
SaPaVe	90	85	85	80	85.0
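
As a quick arithmetic check, the Avg column of the real-world table is the plain mean of the four task columns. The snippet below recomputes it from the per-task values listed above; the dictionary keys are just labels chosen for this example.

```python
# Per-task real-world success rates (%), copied from the table above.
real_world = {
    "GR00T-N1": [70, 45, 55, 40],
    "pi_0": [55, 35, 45, 30],
    "SaPaVe": [90, 85, 85, 80],
}

# Unweighted mean over the four tasks.
averages = {method: sum(rates) / len(rates) for method, rates in real_world.items()}

for method, avg in averages.items():
    print(f"{method}: {avg}")
# GR00T-N1: 52.5
# pi_0: 41.25
# SaPaVe: 85.0
```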

Ablation Study¶

Ablation	Avg Success Rate
w/o Stage 1	53.75%
w/o Stage 2	66.25%
w/o Decoupled Head	71.25%
w/o Camera Adapter (full finetune)	73.75%
w/o Spatial Knowledge	71.25%
Full SaPaVe	85.0%
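
To read the ablation table at a glance, it helps to express each row as a drop in percentage points relative to the full model. The short sketch below does that with the values listed above; the variable names are illustrative.

```python
FULL = 85.0  # average success rate (%) of the full model

# Average success rate (%) for each ablated variant, from the table above.
ablations = {
    "w/o Stage 1": 53.75,
    "w/o Stage 2": 66.25,
    "w/o Decoupled Head": 71.25,
    "w/o Camera Adapter (full finetune)": 73.75,
    "w/o Spatial Knowledge": 71.25,
}

# Drop in percentage points when a component is removed.
drops = {name: FULL - rate for name, rate in ablations.items()}

# Largest drop first.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: -{drop:.2f} points")
```

Removing Stage 1 costs the most (31.25 points), followed by removing Stage 2 (18.75 points). The remaining three ablations each cost between 11.25 and 13.75 points.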

Key Findings¶

An active viewpoint outperforms a fixed viewpoint combined with a wrist camera: on Out-of-View tasks, the fixed-viewpoint success rate is below 20%, while the active-viewpoint success rate is above 70%.
Both stages of training are essential: removing Stage 1 cuts the Out-of-View success rate in half.
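Decoupled > Unified: A unified action space performs about 14 percentage points worse than the decoupled one (71.25% vs. 85.0% in the ablation).
LoRA adapter > Full-parameter fine-tuning: Full fine-tuning disrupts VLM semantic understanding (73.75% vs. 85.0% in the ablation).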
The 2B-parameter SaPaVe outperforms Gemini 2.5 Pro by 16% on semantic active perception.

Highlights & Insights¶

Insight that "camera movement is embodiment-agnostic": "Where to look" does not depend on the robot embodiment, so it can be learned separately from large-scale image-language-camera movement data and, in principle, transferred across robots.
Bottom-up training strategy: Building perception priors first and then learning manipulation on top of them is much more data-efficient than end-to-end joint training.
The first active manipulation benchmark (ActiveManip-Bench): Fills the evaluation gap, covering critical scenarios such as occlusions and out-of-view objects.

Limitations & Future Work¶

Only validated on the Unitree G1 humanoid robot; generalization to other robotic platforms (e.g., robotic arms) remains to be tested.
Camera movement is restricted to 2 DoF (pitch/yaw), without considering translation and full 6 DoF motion.
ActiveViewPose-200K contains synthetic data, and the sim-to-real gap may affect real-world active perception quality.
Not compared with Next-Best-View (NBV) methods or multi-view fusion methods.

Comparison with Related Work¶

vs. GR00T-N1: A fixed-viewpoint VLA; directly fine-tuning it with camera movement yields poor results. SaPaVe's decoupled, two-stage strategy outperforms it by 31.25% in the real world.
vs. \(\pi_0\): Shares the same fixed-viewpoint limitation. SaPaVe outperforms it by 40%.
vs. VQA-based active perception: These methods select among discrete candidate viewpoints and cannot achieve continuous camera control, whereas SaPaVe directly outputs continuous camera movements.

Rating¶

Novelty: ⭐⭐⭐⭐ The decoupled + bottom-up training design is clever, presenting a systematic contribution to active manipulation frameworks.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Very comprehensive experiments across simulation, real-world, ablation studies, generalization, and comparisons.
Writing Quality: ⭐⭐⭐⭐ Clear motivation and reasonable system design.
Value: ⭐⭐⭐⭐⭐ The first framework to systematically address active manipulation in VLAs, with datasets and a benchmark that hold long-term value.