AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation¶
Conference: ICCV 2025 arXiv: 2412.06779 Code: https://anybimanual.github.io/ Area: Robotic Manipulation Keywords: bimanual manipulation, policy transfer, skill primitives, visual alignment, behavior cloning
TL;DR¶
This paper proposes AnyBimanual, a plug-and-play framework that transfers pretrained unimanual manipulation policies to general bimanual manipulation scenarios via a Skill Manager and a Visual Aligner, achieving significant multi-task generalization with only a small number of bimanual demonstrations.
Background & Motivation¶
Bimanual manipulation systems play an important role in domestic service, robotic surgery, and industrial assembly. Compared to single-arm systems, bimanual systems offer a larger workspace and can accomplish more complex tasks (e.g., one arm stabilizes the target while the other operates), but face critical bottlenecks:
Expensive Data: The action space of bimanual manipulation is extremely high-dimensional; collecting teleoperation demonstrations requires dedicated systems, additional sensors, and precise calibration, incurring substantial labor costs.
Generalization Difficulty: Constrained by data volume, directly learned bimanual policies struggle to generalize across diverse tasks.
Limitations of Existing Approaches:
- LLM/VLM-based high-level planning methods: constrained by predefined low-level executors and unable to handle contact-rich tasks.
- Fixed role assignment (stabilizer/actor): inflexible collaboration modes.
- Parameterized atomic actions: difficult to specify manually, limiting deployment scenarios.
Key Insight: Unimanual policies (e.g., PerAct, RVT) have demonstrated impressive cross-task generalization through large-scale parameters and training data. Bimanual tasks can often be decomposed into combinations of unimanual subtasks; therefore, the general manipulation knowledge embedded in unimanual policies can be extracted and transferred.
Method¶
Overall Architecture¶
AnyBimanual consists of two core modules: a Skill Manager handling the language branch and a Visual Aligner handling the visual branch. Two pretrained unimanual policy models predict the actions of the left and right arms respectively.
Key Designs¶
- Skill Manager
  - Maintains a discrete set of skill primitives \(\mathcal{Z} = \{z_1, z_2, ..., z_K\}\), where each skill \(z_k \in \mathbb{R}^D\) is an implicit embedding.
  - Expresses the language embedding for each arm as a linear combination of skill primitives plus a task-oriented compensation term: \(\hat{l}_t^{left} = \sum_{k=1}^K \hat{w}_{k,t}^{left} z_k + \epsilon_t^{left}, \quad \hat{l}_t^{right} = \sum_{k=1}^K \hat{w}_{k,t}^{right} z_k + \epsilon_t^{right}\)
  - Employs a multimodal Transformer to dynamically predict the combination weights and compensation terms at each step from the visual observation \(v_t\), language instruction \(l\), and proprioception \(p_t\): \((\hat{w}_t^{left}, \epsilon_t^{left}, \hat{w}_t^{right}, \epsilon_t^{right}) = f_\theta(v_t, l, p_t)\)
  - Skill primitives can be initialized with language template tokens from pretrained unimanual policies to mitigate the domain gap.
  - Design Motivation: a bimanual handover task, for example, can be decomposed into a "place" skill for the left arm and a "pick" skill for the right arm.
- Generalizable Skill Representation Learning
  - Sparse regularization encourages reconstruction of the language embeddings using the minimum number of skill primitives: \(\mathcal{L}_{skill} = \|\hat{w}^{left}\|_1 + \|\hat{w}^{right}\|_1 + \lambda_\epsilon(\|\epsilon^{left}\|_{2,1} + \|\epsilon^{right}\|_{2,1})\)
  - The \(\ell_1\) norm enforces sparse selection, enabling each skill representation to capture an independent primitive motion.
  - The \(\ell_{2,1}\) norm on the compensation term ensures task-specific knowledge is introduced only when necessary.
- Visual Aligner
  - Generates two spatial soft masks to decompose the voxel space, aligning the visual input of each arm to its pretraining distribution: \(v_t^{left} = (\hat{v}_t^{left} \odot v_t) \oplus v_t, \quad v_t^{right} = (\hat{v}_t^{right} \odot v_t) \oplus v_t\)
  - Mutual exclusivity is enforced by maximizing the symmetrized KL divergence between the two masks (a Jensen-Shannon-style objective): \(\mathcal{L}_{voxel} = -D_{KL}(\hat{v}_t^{left}\|\hat{v}_t^{right})/2 - D_{KL}(\hat{v}_t^{right}\|\hat{v}_t^{left})/2\)
  - Intuition: during asynchronous bimanual collaboration, the left and right arms attend to different regions of the workspace; the mutually exclusive decomposition naturally restores the bimanual scene to a unimanual configuration (see the code sketch after this list).
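To make the two modules concrete, here is a minimal PyTorch sketch of the Skill Manager's sparse skill composition and the Visual Aligner's soft voxel masks. It illustrates the equations above rather than the authors' implementation: the multimodal Transformer \(f_\theta\) is stubbed with a generic encoder, the fusion operator \(\oplus\) is assumed to be plain addition, modality tokens are assumed to be pre-projected to a common dimension, and all names and sizes (`num_skills`, `dim`, voxel channels) are placeholders.

```python
import torch
import torch.nn as nn

class SkillManager(nn.Module):
    """Sketch: express each arm's language embedding as a sparse linear
    combination of K learnable skill primitives plus a compensation term."""
    def __init__(self, num_skills: int = 20, dim: int = 512):
        super().__init__()
        # Skill primitive bank Z (could be initialized from the unimanual
        # policy's language template tokens, as the paper suggests).
        self.skills = nn.Parameter(torch.randn(num_skills, dim))
        # Stand-in for the multimodal Transformer f_theta over visual,
        # language, and proprioceptive tokens (all pre-projected to `dim`).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, 2 * num_skills + 2 * dim)

    def forward(self, vis_tokens, lang_tokens, prop_tokens):
        # Fuse the modalities and pool into a single context vector.
        ctx = self.encoder(
            torch.cat([vis_tokens, lang_tokens, prop_tokens], dim=1)
        ).mean(dim=1)
        K, D = self.skills.shape
        w_l, w_r, eps_l, eps_r = torch.split(self.head(ctx), [K, K, D, D], dim=-1)
        # Per-arm language embeddings: weighted sum of primitives + compensation.
        l_left = w_l @ self.skills + eps_l
        l_right = w_r @ self.skills + eps_r
        return l_left, l_right, (w_l, w_r, eps_l, eps_r)


class VisualAligner(nn.Module):
    """Sketch: predict two soft masks over the voxel grid so that each arm
    receives an observation closer to its unimanual pretraining distribution."""
    def __init__(self, channels: int = 10):
        super().__init__()
        self.mask_net = nn.Conv3d(channels, 2, kernel_size=3, padding=1)

    def forward(self, voxels):                        # voxels: (B, C, X, Y, Z)
        masks = torch.sigmoid(self.mask_net(voxels))  # two soft masks in [0, 1]
        m_left, m_right = masks[:, :1], masks[:, 1:]
        # "(mask ⊙ v) ⊕ v": masked voxels fused back with the original
        # observation (addition assumed for the ⊕ operator).
        v_left = m_left * voxels + voxels
        v_right = m_right * voxels + voxels
        return v_left, v_right, (m_left, m_right)
```

Each arm's frozen unimanual policy then consumes its aligned voxel input together with its composed language embedding to predict that arm's action.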
Overall Training Objective¶
The overall objective combines the cross-entropy behavior-cloning loss \(\mathcal{L}_{BC}\) with the skill sparsity loss \(\mathcal{L}_{skill}\) and the voxel exclusivity loss \(\mathcal{L}_{voxel}\). The framework is model-agnostic and supports different architectures, including Transformer-based and diffusion-based policies.
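As a rough illustration, the terms above can be combined as a weighted sum. The sketch below assumes hypothetical coefficients `lambda_skill` and `lambda_voxel` (not the paper's values), takes the \(\ell_{2,1}\) penalty as the mean per-sample \(\ell_2\) norm of the compensation vectors, and treats the flattened, normalized soft masks as distributions for the symmetrized KL term.

```python
import torch

def skill_sparsity_loss(w_l, w_r, eps_l, eps_r, lam_eps: float = 0.1):
    """L_skill: l1 sparsity on the combination weights, l2,1-style penalty
    on the compensation terms (mean of per-sample l2 norms)."""
    l1 = w_l.abs().sum(dim=-1).mean() + w_r.abs().sum(dim=-1).mean()
    l21 = eps_l.norm(dim=-1).mean() + eps_r.norm(dim=-1).mean()
    return l1 + lam_eps * l21

def voxel_exclusivity_loss(m_left, m_right, eps: float = 1e-8):
    """L_voxel: negated symmetrized KL between the two masks, flattened and
    normalized into distributions over voxels; minimizing it pushes the masks apart."""
    p = m_left.flatten(1) + eps
    q = m_right.flatten(1) + eps
    p = p / p.sum(dim=1, keepdim=True)
    q = q / q.sum(dim=1, keepdim=True)
    kl_pq = (p * (p / q).log()).sum(dim=1).mean()
    kl_qp = (q * (q / p).log()).sum(dim=1).mean()
    return -(kl_pq + kl_qp) / 2

def total_loss(bc_loss, skill_terms, mask_terms,
               lambda_skill: float = 0.01, lambda_voxel: float = 0.01):
    """Hypothetical weighted sum of the behavior-cloning loss and the two
    auxiliary terms; the coefficients are placeholders."""
    return (bc_loss
            + lambda_skill * skill_sparsity_loss(*skill_terms)
            + lambda_voxel * voxel_exclusivity_loss(*mask_terms))
```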
Key Experimental Results¶
Main Results (RLBench2, 100 demonstrations; success rate, %)¶
| Method | straighten rope | sweep dustpan | press buttons | put in fridge | Overall Avg. (all tasks) |
|---|---|---|---|---|---|
| PerAct2 | 8 | 52 | 41 | 16 | 14.67 |
| PerAct2+Pretrain | 17 | 55 | 40 | 22 | - |
| PerAct-LF | 11 | 47 | 9 | 7 | - |
| PerAct+AnyBimanual | 24 | 57 | 14 | 26 | 32.00 |
Overall average success rate: AnyBimanual (32.00%) vs. PerAct2 (14.67%), an absolute gain of 17.33 percentage points.
Ablation Study¶
| Row | Skill Manager | Visual Aligner | Long-horizon | Generalization | Synchronous | Overall Avg. |
|---|---|---|---|---|---|---|
| 1 | - | - | 16.29 | 23.50 | 3.50 | 14.67 |
| 2 | ✗ | ✗ | 19.57 | 25.50 | 9.50 | 16.75 |
| 3 | ✗ | ✓ | 21.57 | 44.00 | 15.50 | 19.75 |
| 4 | ✓ | ✗ | 23.71 | 42.00 | 17.00 | 25.67 |
| 5 | ✓ | ✓ | 27.29 | 44.00 | 25.00 | 32.00 |
Key Findings¶
- AnyBimanual improves over the PerAct2 state of the art by 17.33 percentage points, with particularly pronounced advantages on long-horizon, multi-variant, and synchronous coordination tasks.
- As a plug-and-play method, it also improves the performance of PerAct-LF (+72.76%) and RVT-LF (+39.41%).
- The Skill Manager contributes most to long-horizon tasks (+8.92%), while the Visual Aligner contributes significantly to generalization and synchronization tasks.
- The two modules are synergistic: relative to the plain transfer baseline (row 2, 16.75%), the Skill Manager alone adds 8.92 points and the Visual Aligner alone adds 3.00 points, while together they add 15.25 points, lifting the overall average from 14.67% to 32.00%.
- The average success rate across 9 real-world bimanual tasks is 84.62%, demonstrating strong practical utility.
- Visualizations show that the Skill Manager dynamically schedules reasonable skill combinations, and the Visual Aligner effectively decomposes the voxel space.
Highlights & Insights¶
- Elegant Framework Design: The core idea is concise: bimanual manipulation is treated as the coordinated composition of two unimanual skills, realized through sparse skill decomposition and mutually exclusive visual masking.
- Model Agnosticism: Compatible with different unimanual policies such as PerAct and RVT, offering strong practical versatility.
- Data Efficiency: Transfer is achievable with only 20–100 bimanual demonstrations, substantially reducing data requirements.
- Good Interpretability: Skill scheduling weights and voxel decomposition masks are both visualizable, facilitating understanding of model behavior.
Limitations & Future Work¶
- Performance is poor on simple tasks requiring precise rotation (e.g., Rotate Toothbrush, 20% success rate).
- Introducing additional complexity on short-horizon simple tasks (e.g., Lift ball) may slightly degrade performance.
- The number of skill primitives \(K\) must be set manually.
- Evaluation is currently limited to RLBench2 and constrained real-world scenarios; validation on larger-scale benchmarks is lacking.
- Performance depends on the quality of keyframe extraction.
Related Work & Insights¶
- This work contrasts with the fixed architectural design of PerAct2, demonstrating the paradigm advantage of "transfer rather than retrain."
- The sparse skill decomposition idea is generalizable to other multi-agent collaboration scenarios.
- The mutually exclusive visual masking design is inspired by observations of asynchronous bimanual collaboration and can be extended to multi-arm systems.
Rating¶
- Novelty: ⭐⭐⭐⭐ The framework design for transferring unimanual policies to bimanual settings is novel, with clear skill scheduling and visual alignment mechanisms.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 12 simulation and 9 real-world tasks, with multiple baselines, complete ablations, and rich visualizations.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clear and the method is described systematically.
- Value: ⭐⭐⭐⭐⭐ A highly practical plug-and-play solution with significant impact on robotic manipulation research.