A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Conference: ICCV 2025 · arXiv: 2504.12636 · Code: https://a-embodied.github.io/A0/ · Area: Robotic Manipulation · Keywords: Robotic Manipulation, Spatial Affordance, Hierarchical Model, Diffusion Model, Cross-Platform Generalization
TL;DR
This paper proposes A₀, an affordance-aware hierarchical diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding (predicting contact points and post-contact trajectories) and low-level action execution. Pretrained on 1M contact-point annotations and fine-tuned with a small amount of task-specific data, A₀ deploys across four platforms (Franka, Kinova, Realman, Dobot) and reaches a 45% success rate on complex trajectory tasks such as whiteboard wiping.
Background & Motivation
Background: Robotic manipulation approaches fall into two categories: modular methods (leveraging visual foundation models) and end-to-end VLA methods (directly generating actions).
Limitations of Prior Work: Modular methods lack deep understanding of object spatial affordances; end-to-end methods generate actions without understanding spatial positions, leading to poor performance on complex manipulations (e.g., wiping whiteboards, stacking objects).
Core Idea: The paper proposes an Embodiment-Agnostic Affordance Representation — object-centric prediction of contact points and post-contact 2D waypoint trajectories — decoupling the method from specific robot platforms.
Method
Key Designs
- Embodiment-Agnostic Affordance Representation: Unifies affordance information from robot data, hand-object interaction (HOI) data, and custom data into the \((I, L, C, T)\) format: image, language instruction, contact point, and trajectory waypoints (see the data-format sketch after this list).
- A₀ Diffusion Model: Built on the DiT architecture, it takes noisy waypoints and diffusion timesteps as input and injects SigLIP visual features and Qwen2.5-7B text features via cross-attention. A Position Offset Attention module extracts inter-frame motion information.
- Two-Stage Training:
  - Pretraining: learns general object localization on 1M PixMo-One-Point data.
  - Fine-tuning: learns dynamic manipulation on annotated trajectory data.
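To make the unified sample format concrete, here is a minimal sketch of an \((I, L, C, T)\) record. The class name, field names, and shapes are illustrative assumptions rather than the paper's actual schema; the point is that every field lives in image space, not in any robot's coordinate frame.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AffordanceSample:
    """One (I, L, C, T) sample in the embodiment-agnostic format (assumed schema)."""
    image: np.ndarray          # I: RGB frame, shape (H, W, 3), uint8
    instruction: str           # L: language instruction
    contact_point: np.ndarray  # C: contact pixel (u, v), shape (2,)
    waypoints: np.ndarray      # T: post-contact 2D waypoints, shape (T, 2)

# Pixel-space supervision is what makes the representation embodiment-agnostic:
# the same sample can train a model later deployed on any arm.
sample = AffordanceSample(
    image=np.zeros((480, 640, 3), dtype=np.uint8),
    instruction="wipe the whiteboard",
    contact_point=np.array([320.0, 200.0]),
    waypoints=np.array([[330.0, 200.0], [345.0, 208.0], [360.0, 216.0]]),
)
```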
Loss & Training
- Pretraining: \(\mathcal{L}_p = \text{MSE}\big(x_t^0,\ f_\theta(k, x_t^k, I_t, \ell)\big)\), where \(k\) is the diffusion step, \(x_t^k\) the noised contact point, \(I_t\) the current image, and \(\ell\) the language instruction.
- Fine-tuning: \(\mathcal{L}_s = \text{MSE}\big(x_{t:t+T}^0,\ f_\theta(k, x_{t:t+T}^k, I_{t-1:t}, \ell)\big)\), which extends the target from a single point to a \(T\)-step waypoint sequence conditioned on the two most recent frames.
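Below is a minimal sketch of one fine-tuning step under \(\mathcal{L}_s\), assuming a plain DDPM-style linear noise schedule and a denoiser `f_theta` that regresses the clean waypoints \(x^0\) directly; the batch fields, scheduler, and `f_theta` signature are assumptions based on the loss above, not the paper's code. The pretraining loss \(\mathcal{L}_p\) is the special case where the sequence collapses to a single contact point.

```python
import torch
import torch.nn.functional as F

def finetune_step(f_theta, batch, num_steps=1000):
    """One step of L_s = MSE(x^0_{t:t+T}, f_theta(k, x^k_{t:t+T}, I_{t-1:t}, l)).

    Assumed batch fields: 'waypoints' (B, T, 2) clean 2D waypoints,
    'images' conditioning frames I_{t-1:t}, 'text' instruction features.
    """
    x0 = batch["waypoints"]                                   # clean x^0
    B = x0.shape[0]

    # Sample a diffusion step k and corrupt x^0 into x^k (linear beta schedule).
    k = torch.randint(0, num_steps, (B,), device=x0.device)
    betas = torch.linspace(1e-4, 0.02, num_steps, device=x0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[k].view(B, 1, 1)
    noise = torch.randn_like(x0)
    xk = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise

    # DiT denoiser conditioned on images and text via cross-attention.
    x0_pred = f_theta(k, xk, batch["images"], batch["text"])
    loss = F.mse_loss(x0_pred, x0)                            # L_s
    loss.backward()
    return loss.item()
```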
Key Experimental Results
| Setting | A₀ Success Rate | Best Baseline | Notes |
|---|---|---|---|
| Franka | 62.50% | 55.0% (OpenVLA) | Average over 8 tasks |
| Kinova | 53.75% | 42.5% | Cross-platform generalization |
| Wipe Board | 45% | ~20% | Trajectory-following task |
Key Findings
- Pretraining contact point localization capability significantly improves post-fine-tuning manipulation performance.
- The 2D waypoint representation is inherently cross-platform; deploying to a new robot requires only 2D→3D back-projection and grasp sampling (see the sketch after this list).
- The advantage is most pronounced on trajectory-following tasks, as conventional methods lack modeling of post-contact trajectories.
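As a concrete illustration of that deployment recipe, here is a minimal pinhole back-projection of one predicted waypoint into the robot base frame. The function and argument names are assumptions, and the grasp-sampling step (turning the 3D contact point into a full grasp pose) is deliberately omitted.

```python
import numpy as np

def backproject_waypoint(uv, depth_map, K, T_base_cam):
    """Lift a 2D waypoint (u, v) in pixels to a 3D point in the robot base frame.

    depth_map:  depth image aligned to the RGB frame, in meters, shape (H, W)
    K:          3x3 camera intrinsics matrix
    T_base_cam: 4x4 camera-to-base transform from hand-eye calibration (assumed)
    """
    u, v = uv
    z = depth_map[int(round(v)), int(round(u))]  # depth at the waypoint
    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    p_cam = np.array([x, y, z, 1.0])
    return (T_base_cam @ p_cam)[:3]
```

A per-platform grasp sampler then converts the back-projected contact point into a grasp pose, which is why the same 2D predictions transfer across Franka, Kinova, Realman, and Dobot.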
Pretraining Data Scale
| Data Source | Contact Points | Usage |
|---|---|---|
| PixMo-One-Point | 1M | Object localization pretraining |
| HOI Data | 50K | Hand-object interaction |
| Robot Data | 20K | Manipulation tasks |
Cross-Platform Deployment Results
| Platform | Tasks | Avg. Success Rate | Deployment |
|---|---|---|---|
| Franka | 8 | 62.5% | Direct |
| Kinova | 6 | 53.8% | Direct |
| Realman | 4 | 48.5% | Adapted |
| Dobot | 3 | 45.0% | Adapted |
Highlights & Insights
- The "embodiment-agnostic" design is highly practical: by predicting 2D points and trajectories on objects rather than robot-specific configurations, the method adapts to arbitrary platforms via depth back-projection and a grasp sampler.
- The large-scale contact point pretraining paradigm is noteworthy: cheap point-annotation data establishes a strong spatial localization prior that transfers effectively to downstream tasks.
Limitations & Future Work
- The method relies on an external grasp sampler for precise grasp pose estimation; task execution fails when the sampler fails.
- The 2D-to-3D depth back-projection is limited by depth estimation accuracy and may fail on transparent or reflective objects.
- The 45% success rate on complex trajectory tasks such as whiteboard wiping leaves substantial room for improvement.
- The PixMo-One-Point pretraining data primarily covers static localization; dynamic trajectory data remains scarce.
- The 2D waypoint representation cannot handle tasks requiring precise force control (e.g., assembly).
- Multi-step manipulation planning and long-horizon task execution are not explored.
- The computational overhead of Position Offset Attention is not analyzed in detail.
Rating
- Novelty: ⭐⭐⭐⭐ Affordance-based hierarchy combined with embodiment-agnostic representation is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four robot platforms with diverse tasks.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear and well-organized.
- Value: ⭐⭐⭐⭐⭐ Directly relevant to practical robotic deployment.