This paper proposes BridgeVLA, which projects 3D point clouds into multi-view 2D images and uses 2D heatmaps as an intermediate representation to align the input and output spaces, enabling efficient and effective 3D robot manipulation learning.
Leveraging pretrained vision-language models (VLMs) to build vision-language-action (VLA) models has become the dominant paradigm for learning generalizable robot manipulation policies.
However, most VLA models rely solely on 2D image inputs and require large amounts of data collection, while 3D policies offer high sample efficiency but lack the broad semantic knowledge of VLMs.
Existing 3D VLA methods (e.g., 3D-VLA, SpatialVLA) suffer from two critical issues:
(1) Actions are represented as token sequences without spatial structure, failing to exploit the spatial priors in 3D data and resulting in low sample efficiency.
(2) The 3D inputs used during fine-tuning are misaligned with the 2D image inputs seen during pretraining, hindering knowledge transfer.
How can a 3D VLA model be designed to simultaneously inherit the semantic generalization of VLMs and the sample efficiency of 3D policies, while aligning inputs and outputs in a unified 2D space?
BridgeVLA adopts a two-stage training pipeline: 2D heatmap pretraining followed by 3D action fine-tuning, using PaliGemma (SigLIP visual encoder + Gemma Transformer) as the VLM backbone.
Function: Trains the VLM to predict spatial heatmaps conditioned on text descriptions, overcoming the limitation of vanilla VLMs that can only produce token sequences.
Data: 120K object detection samples from RoboPoint.
Heatmap Construction: For each target object with bounding box center \(\hat{\mathbf{x}}_i\), a truncated Gaussian probability map is constructed:
\[
H_i^{gt}(\mathbf{x}) = \begin{cases} \exp\!\left(-\|\mathbf{x}-\hat{\mathbf{x}}_i\|^2 / (2\sigma^2)\right) & \text{if } p_i(\mathbf{x}) \geq p_{\min} \\ 0 & \text{otherwise} \end{cases}
\]
where \(p_i(\mathbf{x}) = \exp\!\left(-\|\mathbf{x}-\hat{\mathbf{x}}_i\|^2 / (2\sigma^2)\right)\) is the untruncated Gaussian value; i.e., the tail below the threshold \(p_{\min}\) is set to zero.
For multiple targets, the final heatmap \(H^{gt}\) is obtained by averaging and normalizing.
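The construction above can be sketched in a few lines of numpy. The function name and the default values of \(\sigma\) and \(p_{\min}\) are illustrative, not the paper's exact hyperparameters:

```python
import numpy as np

def truncated_gaussian_heatmap(centers, h, w, sigma=2.0, p_min=0.1):
    """Build a ground-truth heatmap from target centers (cx, cy) in pixels.

    Each target contributes a Gaussian truncated below `p_min`; multiple
    targets are averaged and the result is normalized to sum to 1.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    maps = []
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        g[g < p_min] = 0.0           # truncate the Gaussian tail
        maps.append(g)
    heat = np.mean(maps, axis=0)      # average over targets
    return heat / heat.sum()          # normalize to a distribution

H = truncated_gaussian_heatmap([(8, 4), (2, 10)], h=16, w=16)
```

Normalizing to a spatial distribution is what makes the cross-entropy supervision below well-defined.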
Decoding: Image tokens output by the VLM are spatially rearranged according to patch positions into a feature grid, then restored to the original image resolution via a learnable convex upsampling module.
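The convex upsampling step can be illustrated with a simplified numpy version: each fine-resolution pixel is a convex (softmax-weighted) combination of the 3x3 coarse-grid neighborhood it falls into, in the style of RAFT-style convex upsampling. The weight tensor is assumed to come from a learned layer; here it is simply an input:

```python
import numpy as np

def convex_upsample(coarse, weights, factor=8):
    """Upsample a coarse (h, w) map to (h*factor, w*factor).

    `weights` has shape (h, w, factor, factor, 9); the 9 values per fine
    pixel are softmaxed so every output is a convex combination of the
    3x3 coarse neighborhood around its cell.
    """
    h, w = coarse.shape
    wts = np.exp(weights)
    wts = wts / wts.sum(axis=-1, keepdims=True)   # softmax over 9 neighbors
    padded = np.pad(coarse, 1, mode="edge")
    # gather the 3x3 neighborhood for every coarse cell: shape (h, w, 9)
    nbhd = np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)], axis=-1)
    # fine[i, j, a, b] = sum_k wts[i, j, a, b, k] * nbhd[i, j, k]
    fine = np.einsum("ijabk,ijk->ijab", wts, nbhd)
    return fine.transpose(0, 2, 1, 3).reshape(h * factor, w * factor)

out = convex_upsample(np.full((2, 2), 3.0), np.zeros((2, 2, 8, 8, 9)))
```

With zero (i.e., uniform-softmax) weights, a constant input stays constant after upsampling, which is a quick sanity check of the convexity property.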
Loss: Cross-entropy loss supervises heatmap prediction.
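A minimal sketch of this loss, treating the predicted heatmap as raw per-pixel logits and the ground truth as a spatial distribution (function name assumed):

```python
import numpy as np

def heatmap_cross_entropy(logits, target):
    """Cross-entropy between predicted logits and a ground-truth heatmap.

    Softmax is taken over all pixels jointly, so the prediction is a
    distribution over image locations; `target` must sum to 1.
    """
    z = logits.ravel() - logits.max()          # subtract max for stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax over pixels
    return -(target.ravel() * log_probs).sum()

loss = heatmap_cross_entropy(np.zeros((4, 4)), np.full((4, 4), 1 / 16))
```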
Scalability: This pretraining strategy can in principle leverage any vision-language dataset that can be reformulated as heatmap prediction (e.g., keypoint detection, semantic segmentation).
Input Alignment: RGB-D images are used to reconstruct point clouds, which are then rendered into three orthographic projection views (top, front, right) as VLM inputs, maintaining alignment with the 2D images seen during pretraining.
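A simplified sketch of the orthographic rendering step, assuming the point cloud has already been normalized into a unit cube; each view drops one axis and rasterizes the other two, with a depth sort so nearer points overwrite farther ones (axis conventions and resolution are illustrative):

```python
import numpy as np

def orthographic_views(points, colors, res=224, bounds=(-0.5, 0.5)):
    """Render a colored point cloud into top / front / right images."""
    lo, hi = bounds
    views = {}
    # each view keeps two axes and uses the third as depth
    for name, keep, depth_axis in [("top", (0, 1), 2),
                                   ("front", (0, 2), 1),
                                   ("right", (1, 2), 0)]:
        img = np.zeros((res, res, 3))
        order = np.argsort(points[:, depth_axis])   # far points drawn first
        uv = ((points[order][:, keep] - lo) / (hi - lo) * (res - 1)).astype(int)
        uv = np.clip(uv, 0, res - 1)
        img[uv[:, 1], uv[:, 0]] = colors[order]
        views[name] = img
    return views

views = orthographic_views(np.zeros((1, 3)), np.array([[1.0, 0.0, 0.0]]))
```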
Translation Prediction: Each of the three views produces a heatmap; back-projection is applied to score 3D grid points, and the highest-scoring point is taken as the end-effector translation.
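The back-projection scoring can be sketched as follows: each candidate 3D point is projected into the three views with the same conventions used for rendering, its three heatmap values are summed, and the argmax candidate becomes the predicted translation (axis conventions are assumptions):

```python
import numpy as np

def score_grid_points(grid, heatmaps, res=224, bounds=(-0.5, 0.5)):
    """Score (N, 3) candidate translations against per-view heatmaps."""
    lo, hi = bounds
    view_axes = {"top": (0, 1), "front": (0, 2), "right": (1, 2)}
    scores = np.zeros(len(grid))
    for name, keep in view_axes.items():
        # project candidates into this view and read off heatmap values
        uv = ((grid[:, keep] - lo) / (hi - lo) * (res - 1)).astype(int)
        uv = np.clip(uv, 0, res - 1)
        scores += heatmaps[name][uv[:, 1], uv[:, 0]]
    return grid[np.argmax(scores)], scores
```

Because the score is a sum over views, a point must be plausible in all three projections to win, which is what resolves the 2D-to-3D ambiguity of any single heatmap.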
Rotation/Gripper/Collision Prediction: Global features are obtained via max-pooling over output tokens from each view, local features are extracted from heatmap peaks, and the concatenated features are passed through an MLP to predict rotation (Euler angles, 72 bins per axis), gripper state, and collision flag.
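The 72-bin Euler-angle discretization amounts to 5-degree resolution per axis, turning rotation regression into three classification problems. A minimal sketch of the binning (function names assumed):

```python
import numpy as np

def euler_to_bins(euler_deg, n_bins=72):
    """Discretize Euler angles in degrees, range [0, 360), into bin indices."""
    bin_width = 360.0 / n_bins                       # 5 degrees for 72 bins
    return (np.asarray(euler_deg) // bin_width).astype(int) % n_bins

def bins_to_euler(bins, n_bins=72):
    """Map bin indices back to bin-center angles in degrees."""
    bin_width = 360.0 / n_bins
    return (np.asarray(bins) + 0.5) * bin_width
```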
Coarse-to-Fine Strategy: After an initial prediction, the point cloud is cropped and zoomed around the predicted translation, and a second forward pass is performed; the result of the second pass is used for execution.
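The crop-and-zoom step before the second pass can be sketched as below; the crop radius is an assumed hyperparameter, and the rescaling maps the crop box back to the same normalized cube the model saw in the first pass:

```python
import numpy as np

def crop_and_zoom(points, colors, center, radius=0.1):
    """Crop the cloud around a coarse translation and re-normalize.

    Keeps points within a per-axis box of half-width `radius` around the
    coarse prediction, then rescales them so the crop box fills
    [-0.5, 0.5], giving the second forward pass a zoomed-in scene.
    """
    mask = np.all(np.abs(points - center) <= radius, axis=1)
    zoomed = (points[mask] - center) / (2 * radius)
    return zoomed, colors[mask]
```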
Key Design: No additional 3D positional information or robot state is injected into the VLM, maximally preserving the pretraining-finetuning distributional consistency.
Data Efficiency: Achieves a 95.4% success rate with only 3 demonstrations per task, while π₀ and SpatialVLA fail almost completely even with 10 demonstrations.
Reaches 96.9% with 10 demonstrations per task (vs. 90% for RVT-2), and outperforms RVT-2 by an average of 32% across all evaluated settings.
Comprehensively surpasses RVT-2 across 6 generalization settings, with particularly large margins under lighting variation and compositional generalization.
Input-output alignment paradigm: BridgeVLA is the first 3D VLA to unify pretraining and fine-tuning input-output spaces into a common 2D domain — an elegant design choice.
Exceptional sample efficiency: 95.4% success rate with only 3 demonstrations per task, far exceeding π₀, SpatialVLA, and other methods.
The advantage of heatmap intermediate representation is thoroughly validated through ablation — removing it causes a 57-percentage-point drop in success rate.
Not injecting extra 3D information yields better results — maintaining distributional consistency with pretraining matters more than incorporating additional information.
Experiments are highly comprehensive: 3 simulation benchmarks + real-robot evaluation across 7 settings + 3 ablation studies.
Poor performance on long-horizon tasks: Success rate of 0% on GemBench L4, lacking subtask decomposition capability; future work could incorporate LLM-based task planning.
Occlusion issues: In tasks such as Place Cups, target keypoints may be occluded in all orthographic views; dynamic projection view selection warrants exploration.
Limited category-level generalization: Absolute success rates in real-robot category-generalization settings remain modest, partly due to viewpoint discrepancies between pretraining and robot data.
Pretraining data scale: Only 120K detection samples are currently used; expanding to semantic segmentation, keypoint detection, and other tasks is expected to improve generalization.
Information loss from orthographic projections: Three fixed orthographic views may discard critical information from certain perspectives.
Compared to SpatialVLA: BridgeVLA avoids injecting 3D positional information into the VLM, instead preserving distributional consistency through orthographic projection, yielding better results.
Compared to RVT-2: BridgeVLA inherits the orthographic projection + heatmap design but adds a VLM backbone and pretraining, thereby acquiring semantic generalization capability.
Compared to π₀: 3D perception combined with heatmap output substantially improves sample efficiency.
Maintaining pretraining distributional consistency is more important than injecting additional information — this insight has broad implications for VLM-based robot policy design.
The heatmap-as-intermediate-representation paradigm is generalizable to other spatial prediction tasks (e.g., grasp point prediction, navigation).
The object detection → heatmap task design in the pretraining stage is clever, providing a general paradigm for adapting VLMs to downstream spatial tasks.
The coarse-to-fine two-pass inference strategy effectively improves precision, but incurs 2× inference overhead — more efficient alternatives are worth exploring.
Writing Quality: 8/10 — Clear structure with five research questions addressed systematically; strong logical coherence.
Value: 8/10 — Achieving 95% success rate with only 3 demonstrations represents a meaningful breakthrough in sample efficiency with practical application value.