BiPreManip: Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration

Conference: CVPR 2026
arXiv: 2603.21679
Code: Project Page
Area: Robotic Manipulation / Human Understanding
Keywords: Bimanual collaborative manipulation, visual affordance, preparatory manipulation, anticipatory reasoning, point cloud

TL;DR

This paper proposes BiPreManip, a framework for bimanual preparatory manipulation based on visual affordance representations. The system first anticipates the primary hand's target interaction region, then guides the assistive hand to perform preparatory actions (e.g., flipping a bottle so its cap faces the primary hand), achieving substantial improvements over baselines in both simulated and real-world environments.

Background & Motivation

Background: Bimanual manipulation research has advanced considerably in recent years (ACT, RDT-1B, 3D FlowMatch Actor, etc.), covering symmetric, sequentially independent, and complementary-role collaboration paradigms.

Limitations of Prior Work: Existing methods assume both hands can directly interact with an object, yet many everyday scenarios require one hand to first change the object's state before the other hand can operate—e.g., pushing a tablet to the table edge before grasping it, or standing a pen upright before uncapping it.

Key Challenge: Preparatory manipulation demands asymmetric anticipatory coordination and long-horizon interdependent planning—the assistive hand must understand the primary hand's future intent while avoiding interference with its anticipated interaction region.

Goal: Define and address a new problem category, "collaborative preparatory manipulation," so that robots can learn coordinated bimanual prepare-then-operate behavior.

Key Insight: Affordance-driven reasoning—first use an affordance map to anticipate the final goal action, then inversely derive the assistive hand's preparatory behavior.

Core Idea: Cross-arm reasoning is achieved via an anticipatory affordance map, such that every action of the assistive hand serves the primary hand's ultimate goal.

Method

Overall Architecture

Input: object point cloud + language instruction → Goal Affordance Network predicts anticipatory affordance → Pre-Affordance Network reasons about the assistive hand's preparatory action → Anticipatory Object Pose Predictor estimates the target object pose → Reorient Actor executes reorientation → Goal Affordance Network is invoked again to execute the final goal action.
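
The control flow above can be sketched as a short driver function. This is only an illustrative outline: the function name, the dictionary keys, and every module signature are assumptions made for the sketch, not the paper's API, and the modules are replaced by stubs that record the call order.

```python
from typing import Any, Callable, Dict, List

Module = Callable[..., Any]


def run_prepare_then_operate(cloud: Any, instruction: str,
                             modules: Dict[str, Module]) -> Any:
    """Chain the stages in the order given in the architecture summary.

    All module signatures here are hypothetical placeholders.
    """
    # 1. Anticipate the primary hand's goal action on the current observation.
    anticipated_goal = modules["goal_affordance"](cloud, instruction)
    # 2. Condition the assistive hand's preparatory action on that anticipation.
    pre_action = modules["pre_affordance"](cloud, instruction, anticipated_goal)
    # 3. Estimate the object pose that would make the goal action feasible.
    target_pose = modules["pose_predictor"](cloud, instruction, anticipated_goal)
    # 4. Reorient the object; the stub stands in for the new observation.
    new_cloud = modules["reorient_actor"](cloud, pre_action, target_pose)
    # 5. Invoke the goal network again to obtain the action that is executed.
    return modules["goal_affordance"](new_cloud, instruction)


calls: List[str] = []


def make_stub(name: str) -> Module:
    """Build a stand-in module that logs its name and returns a tagged string."""
    def stub(*args: Any) -> Any:
        calls.append(name)
        return f"{name}_out"
    return stub


stage_names = ("goal_affordance", "pre_affordance", "pose_predictor", "reorient_actor")
stubs = {name: make_stub(name) for name in stage_names}
final_action = run_prepare_then_operate("cloud", "open the bottle", stubs)
```

The point of the sketch is that the same goal-affordance module appears at both ends of the chain: once to anticipate, and once to act after the object has been reoriented.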

Key Designs

Goal Affordance Network:
- PointNet++ encodes point cloud features \(f_p\); CLIP encodes language instructions as \(f_l\)
- An MLP fuses both to predict per-point affordance scores \(s\) (likelihood of each point serving as a contact region)
- A cVAE predicts the target gripper orientation \(d_{\text{goal}} \in SO(3)\); combined with the contact point, this yields a 6D action \(a_{\text{goal}} \in SE(3)\)
- Key distinction: The prediction is anticipatory rather than reactive—it imagines interactions that only become feasible after preparatory manipulation is complete
- Design Motivation: Affordance representations naturally encode "where and how to interact," offering greater generalizability than directly learning action sequences
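
As a minimal sketch of how a contact point and a gripper orientation combine into a 6D action, the snippet below builds a 4x4 homogeneous transform with NumPy. Choosing the contact point by `argmax` over the scores is an assumption for illustration (the summary does not say how the point is selected), and the identity rotation stands in for the cVAE output \(d_{\text{goal}}\).

```python
import numpy as np


def goal_action_from_affordance(points, scores, orientation):
    """Assemble a 4x4 SE(3) action from per-point scores and a rotation.

    points:      (N, 3) object point cloud
    scores:      (N,) per-point affordance scores
    orientation: (3, 3) rotation matrix (stand-in for the predicted d_goal)
    """
    points = np.asarray(points, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Assumption: take the highest-scoring point as the contact location.
    contact = points[np.argmax(scores)]
    action = np.eye(4)
    action[:3, :3] = orientation  # rotation part of the action
    action[:3, 3] = contact       # translation part of the action
    return action


points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.2], [0.0, 0.3, 0.1]])
scores = np.array([0.1, 0.9, 0.4])
a_goal = goal_action_from_affordance(points, scores, np.eye(3))
```

Here the second point has the highest score, so it becomes the translation component of `a_goal`.
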

Pre-Affordance Network:
- Conditioned on the anticipatory goal affordance, reasons about how the assistive hand should act
- Fuses \((f_p, f_l, f_{p_{\text{goal}}}, f_{d_{\text{goal}}})\) to predict a pre-affordance map
- A cVAE samples the assistive gripper orientation \(d_{\text{pre}}\), yielding preparatory action \(a_{\text{pre}} \in SE(3)\)
- Design Motivation: The assistive hand's preparatory behavior must align with the primary hand's future interaction space, precluding naive blind grasping

Anticipatory Object Pose Predictor + Reorient Actor:
- Estimates the ideal object pose \(T^{\text{obj}} = (t^{\text{obj}}, r^{\text{obj}}) \in SE(3)\) that enables collision-free contact at the target region for the primary hand
- Transforms the point cloud: \(O' = T^{\text{obj}} \cdot O\)
- The Reorient Actor receives the transformed point cloud and current grasp scene, predicting a 6D reorientation motion
- Design Motivation: Many preparatory tasks require adjusting object orientation (e.g., rotating a bottle so its cap faces the primary hand); explicit modeling of this step is more controllable than end-to-end learning
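
The point cloud transformation \(O' = T^{\text{obj}} \cdot O\) amounts to rotating every point by \(r^{\text{obj}}\) and then translating it by \(t^{\text{obj}}\). A small NumPy sketch, assuming the rotation is given as a 3x3 matrix and the cloud as an (N, 3) array (the function name and the example values are illustrative only):

```python
import numpy as np


def transform_point_cloud(points, rotation, translation):
    """Apply O' = T * O, with T = (translation, rotation), to an (N, 3) cloud."""
    points = np.asarray(points, dtype=float)
    rotation = np.asarray(rotation, dtype=float)
    translation = np.asarray(translation, dtype=float)
    # Row-vector form of R @ p + t applied to every point at once.
    return points @ rotation.T + translation


# Example: rotate 90 degrees about the z-axis, then shift 0.2 along x.
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
cloud = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.5]])
moved = transform_point_cloud(cloud, rot_z_90, [0.2, 0.0, 0.0])
```

The transformed cloud is what the summary describes as the input to the Reorient Actor, alongside the current grasp scene.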

Loss & Training

- Affordance scores: supervised with \(\ell_1\) loss; positive and negative samples are derived from demonstrations
- Gripper orientation: geodesic distance loss \(\mathcal{L}_{\text{ori}} = \arccos\frac{\operatorname{Tr}(d^\top d^*) - 1}{2}\) + KL regularization
- The anticipatory stage lacks direct annotations; supervision is constructed via pose transformations from the execution stage: \(R_{\text{grp,ant}} = R_{\text{obj,init}} \cdot R_{\text{obj,fin}}^\top \cdot R_{\text{grp,fin}}\)
- Both the pose predictor and reorientation actor are cVAEs trained with a combination of geodesic loss + \(\ell_1\) + KL
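
The two rotation formulas above can be checked numerically. The sketch below implements the geodesic distance and the anticipatory label construction with NumPy; the function names are illustrative, and the `np.clip` call is an added numerical safeguard (not something stated in the summary) that keeps the `arccos` argument in range under rounding error.

```python
import numpy as np


def geodesic_distance(r_pred, r_target):
    """Rotation angle between two 3x3 rotation matrices, in radians."""
    cos_angle = (np.trace(r_pred.T @ r_target) - 1.0) / 2.0
    # Assumption: clip for numerical safety before arccos.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


def anticipatory_gripper_rotation(r_obj_init, r_obj_fin, r_grp_fin):
    """R_grp,ant = R_obj,init * R_obj,fin^T * R_grp,fin."""
    return r_obj_init @ r_obj_fin.T @ r_grp_fin


def rot_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# The object is turned by 90 degrees between the initial and final states,
# and the gripper is aligned with the object's frame at execution time.
r_obj_init = rot_z(0.0)
r_obj_fin = rot_z(np.pi / 2)
r_grp_fin = rot_z(np.pi / 2)

r_grp_ant = anticipatory_gripper_rotation(r_obj_init, r_obj_fin, r_grp_fin)
angle = geodesic_distance(r_obj_init, r_obj_fin)
```

In this example the term \(R_{\text{obj,fin}}^\top R_{\text{grp,fin}}\) is the gripper rotation expressed in the object's frame, and multiplying by \(R_{\text{obj,init}}\) maps it back onto the object's initial pose. Because the gripper is aligned with the object at execution time, the anticipatory label coincides with the initial object rotation.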

Key Experimental Results

Main Results (Success Rate %, Trained / Unseen Objects)

| Category | BiPreManip | ACT | 3DFA | Heuristic | W2A |
|---|---|---|---|---|---|
| Bowl (Edge-Push) | 49/52 | 32/27 | 3/0 | 15/21 | 0/0 |
| Cap | 71/74 | 22/36 | 5/14 | 31/37 | 2/4 |
| Pen-Button (Artic.) | 67/72 | 15/9 | 14/25 | 27/34 | 0/0 |
| Lighter | 43/58 | 34/30 | 41/36 | 21/32 | 2/0 |
| Plate (PerAct2) | 85/82 | 30/26 | 71/68 | 81/78 | 4/4 |

Ablation Study

| Configuration | Bottle | Pen-button | Pen-cap | Notes |
|---|---|---|---|---|
| Full model | 30/26 | 67/72 | 26/32 | Best |
| w/o Ant-Aff | 27/13 | 48/58 | 23/10 | Removing anticipatory affordance causes notable degradation |
| w/o ObjPosePred | 24/15 | 51/50 | 21/8 | Removing pose prediction leads to reorientation failures |

Key Findings

- Across 18 object categories, BiPreManip significantly outperforms all baselines on the majority of tasks
- The method generalizes well to unseen objects; on some categories, success rates on unseen objects even exceed those on training objects
- Both the anticipatory affordance and the object pose predictor are critical components
- Real-world human-to-robot handover experiments further validate the practical utility of the approach

Highlights & Insights

- The paper defines a new problem category, "collaborative preparatory manipulation," filling an important gap in bimanual manipulation research
- Affordance-driven anticipatory reasoning is elegant: the "think before you act" paradigm closely mirrors human behavior
- Sharing parameters between the anticipatory and execution stages keeps the two semantically consistent, so that what is "imagined" stays coherent with what is "executed"
- The benchmark, comprising 18 object categories and 882 instances, is systematic and comprehensive

Limitations & Future Work

- The cVAE's multimodal modeling capacity is limited; more complex manipulations may require diffusion-based models
- The method relies on complete point cloud observations and may degrade under heavy occlusion
- Only two-step preparation (grasp + reorientation) is currently supported; longer-horizon preparatory sequences remain unexplored
- Integration with language models could enable more complex task decomposition

- Single-arm affordance methods such as Where2Act provide the foundation but do not support coordinated reasoning
- The Transformer architecture of ACT suits general bimanual prediction but lacks anticipatory reasoning capability
- Key insight: for manipulation tasks requiring sequential coordination, explicitly modeling "intent" is more effective than end-to-end learning

Rating

- Novelty: ⭐⭐⭐⭐⭐ Novel problem definition + anticipatory affordance-driven bimanual coordination framework
- Experimental Thoroughness: ⭐⭐⭐⭐ Simulation + real world, 18 categories, multiple baselines and ablations
- Writing Quality: ⭐⭐⭐⭐ Clear logic, intuitive illustrations
- Value: ⭐⭐⭐⭐⭐ Advances bimanual manipulation toward more practical scenarios