ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation¶
Conference: AAAI 2026 arXiv: 2512.16302 Code: Website Area: Reinforcement Learning Keywords: One-shot imitation learning, long-horizon manipulation, interaction-awareness, invariant regions, task decomposition
TL;DR¶
This paper proposes ManiLong-Shot, a framework comprising three modules—interaction-aware task decomposition, invariant region prediction, and region matching—that generalizes to 20 unseen long-horizon manipulation tasks after training on only 10 short-horizon tasks, achieving a one-shot imitation success rate of 30.2%, an absolute improvement of 22.8 percentage points over the prior state of the art.
Background & Motivation¶
Problem Definition¶
One-Shot Imitation Learning (OSIL): learning new skills from a single demonstration without additional training. Robots must rapidly learn and execute diverse long-horizon manipulation tasks in everyday settings (e.g., "set the table," "tidy the kitchen"), which involve sequential interactions with multiple objects.
Limitations of Prior Work¶
- Short-horizon restriction: Most OSIL methods are designed for short-horizon skills only (e.g., IMOP, zhang2024oneshot) and cannot scale to multi-step long-horizon tasks.
- Task-variant dependency: Some methods require new tasks to be minor variants of training tasks, or rely on known 3D object models.
- Predefined primitive libraries: wu2024one depends on a predefined primitive library to compose long-horizon manipulation, limiting flexibility.
Core Motivation — Inspiration from Human Learning¶
When faced with an unseen task (e.g., arranging tableware), humans naturally decompose it into short-horizon primitives and infer the key interaction regions: 1. Pick up the plate (at the edge) → 2. Place the plate (target location on the table) → 3. Pick up the fork (at the handle)
Each primitive is delimited by a physical interaction event (contact/release), and imitation is achieved by replicating actions in these regions. The core question is: can a robot infer subtask boundaries and key interaction regions from unannotated demonstrations?
Method¶
Overall Architecture¶
Three core modules are organized around physical interaction events:
1. Interaction-aware task decomposition: decomposes a demonstration into a sequence of primitives, each comprising pre-contact, grasping, and post-contact phases.
2. Interaction-aware region prediction network: predicts functionally invariant interaction regions for each primitive.
3. Interaction-aware region matching network: aligns predicted regions with the current observation to compute the target end-effector pose.
Inference pipeline: decompose demonstration → predict invariant regions → match current scene → pose regression → motion planning → execution → iterate until task completion.
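The inference pipeline above can be sketched as a simple loop over interaction primitives. This is an illustrative skeleton, not the authors' code: every function below is a hypothetical stub standing in for the corresponding learned module (decomposition, region prediction, matching) or planner.

```python
"""Hypothetical sketch of the ManiLong-Shot inference loop.

All names below are illustrative placeholders; a real system would
replace each stub with the corresponding learned module and planner.
"""

def decompose(demo):
    # Stub: split the demonstration into interaction primitives
    # (pre-contact / grasping / post-contact cycles).
    return demo["primitives"]

def predict_invariant_region(primitive):
    # Stub: the region prediction network would return a 3D point subset.
    return {"points": primitive["region_points"]}

def match_and_regress(region, observation):
    # Stub: the matching network aligns the invariant region with the
    # current scene and regresses a target end-effector pose.
    return {"pose": observation["scene_offset"]}

def plan_motion(target_pose):
    # Stub for a collision-free planner such as RRT-Connect.
    return [target_pose]

def run_one_shot(demo, get_observation, execute):
    """Decompose once, then predict → match → plan → execute per primitive."""
    executed = []
    for prim in decompose(demo):
        region = predict_invariant_region(prim)
        obs = get_observation()
        pose = match_and_regress(region, obs)
        for waypoint in plan_motion(pose):
            execute(waypoint)
            executed.append(waypoint)
    return executed
```

The key structural point is that decomposition happens once per demonstration, while region prediction and matching are re-run per primitive against a fresh observation, which is what lets execution recover from scene changes between subtasks.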
Key Designs¶
1. Interaction-Aware Task Decomposition¶
The demonstration trajectory is organized into a primitive sequence based on physical interaction phases:
- Pre-contact phase: the gripper opens and approaches the object, ending when joint velocity drops to zero (alignment in place).
- Grasping phase: the gripper transitions from open to closed, grasping the object.
- Post-contact phase: after a successful grasp, the gripper transitions from closed to open, placing the object or interacting with another object.
Two interchangeable decomposition strategies are provided:
- Rule-based: infers phase boundaries by analyzing joint velocity and gripper state changes; stable and reliable.
- VLM-based: employs GPT-4o with structured trajectory representations to automatically identify interaction phases; semantically aware.
Short-horizon tasks contain one interaction cycle (3 phases); long-horizon tasks repeat multiple cycles.
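The rule-based strategy can be sketched as a scan over the trajectory's gripper state and joint speed. This is a minimal illustration under assumed inputs (per-timestep gripper open/closed flags and a joint-speed norm); the threshold and field names are not from the paper.

```python
"""Minimal sketch of rule-based interaction decomposition, assuming
per-timestep gripper open/closed state and a joint-speed norm are
available. Thresholds and phase boundaries are illustrative."""

def segment_phases(gripper_open, joint_speed, v_eps=1e-3):
    """Return one dict of (start, end) index ranges per interaction cycle.

    gripper_open: list[bool] per timestep (True = open)
    joint_speed:  list[float] per timestep (norm of joint velocities)
    """
    cycles, t, T = [], 0, len(gripper_open)
    while t < T:
        # Pre-contact: gripper open and moving; ends when speed ~ 0.
        start = t
        while t < T and gripper_open[t] and joint_speed[t] > v_eps:
            t += 1
        pre = (start, t)
        # Grasping: remaining open steps until the gripper closes.
        g_start = t
        while t < T and gripper_open[t]:
            t += 1
        grasp = (g_start, t)
        # Post-contact: gripper closed until it releases (opens) again.
        p_start = t
        while t < T and not gripper_open[t]:
            t += 1
        post = (p_start, t)
        if pre[1] > pre[0] or grasp[1] > grasp[0] or post[1] > post[0]:
            cycles.append({"pre_contact": pre, "grasping": grasp,
                           "post_contact": post})
        else:
            break
    return cycles
```

A short-horizon trajectory yields one cycle; a long-horizon one yields several, matching the paper's repeated-cycle structure.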
Design Motivation: Physical interaction events serve as natural subtask boundaries that are more robust than semantic segmentation and are transferable across tasks and environments.
2. Interaction-Aware Region Prediction Network¶
Identifies functionally and semantically invariant interaction regions for each interaction phase. An invariant region is defined as a 3D geometric subset that maintains \(SE(3)\)-equivariant structure across states sharing the same optimal policy.
Architecture:
- Input: RGB-D dense point clouds of consecutive state pairs \(\{s_i, s_{i+1}\}\)
- Backbone: Point Cloud Transformer V3 (PTV3) with progressive downsampling, cross-scene cross-attention, and within-scene self-attention
- Output: interaction probability distribution → activated region \(\mathcal{I}(s_i)\)
Key distinctions:
- Pre-contact + grasping phases: jointly trained (functionally similar); predict grasp surface regions.
- Post-contact phase: an additional Positioning Network is activated, using attention mechanisms to align the grasped object with the target region.
Training supervision: instance segmentation masks from simulation are used as ground truth. Training is performed exclusively on short-horizon tasks \(\mathcal{T}^{\text{sh}}\).
Design Motivation: The invariant region concept enables knowledge transfer from short-horizon to long-horizon tasks—the same "grasp the cup rim" region remains consistent across different tasks.
3. Interaction-Aware Region Matching Network¶
Aligns the predicted invariant regions from the demonstration with the current execution state to compute the target pose.
Pipeline:
1. State routing network: selects the frame in the demonstration trajectory most similar to the current state.
2. Feature fusion: the cropped invariant-region point cloud \(\mathcal{I}(s_i)\) and the current scene point cloud are jointly downsampled.
3. Dual-stage attention: a cross-self-cross attention module enhances spatial and geometric feature alignment.
4. Correspondence matrix computation: a dual-softmax matching algorithm computes the correspondence matrix \(\mathbf{C}\).
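Dual-softmax matching, as commonly used in correspondence estimation, can be illustrated in a few lines of numpy; the temperature value and normalization details here are assumptions, not the paper's exact configuration.

```python
"""Illustrative dual-softmax correspondence matrix between two
feature sets. Temperature and normalization are assumed values."""

import numpy as np

def dual_softmax(feat_a, feat_b, temperature=0.1):
    """feat_a: (N, D), feat_b: (M, D) L2-normalized descriptors.
    Returns an (N, M) correspondence matrix: row-wise softmax times
    column-wise softmax, so high scores require mutual agreement."""
    sim = feat_a @ feat_b.T / temperature          # (N, M) similarities
    # Softmax over candidate matches for each point in A (rows)...
    pa = np.exp(sim - sim.max(axis=1, keepdims=True))
    pa /= pa.sum(axis=1, keepdims=True)
    # ...and for each point in B (columns).
    pb = np.exp(sim - sim.max(axis=0, keepdims=True))
    pb /= pb.sum(axis=0, keepdims=True)
    return pa * pb
```

Matching a feature set against itself should recover the identity correspondence, which is a quick sanity check on any implementation.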
The target end-effector pose is then regressed from the correspondence matrix \(\mathbf{C}\).
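The paper's exact regression formulation is not reproduced here; a standard stand-in for recovering a rigid pose from soft correspondences is weighted SVD (Kabsch / weighted Procrustes), sketched below.

```python
"""Sketch of correspondence-based pose regression via weighted SVD
(Kabsch / weighted Procrustes). A standard stand-in, not necessarily
the paper's exact formulation."""

import numpy as np

def weighted_kabsch(src, dst, C):
    """src: (N, 3) demo-region points, dst: (M, 3) scene points,
    C: (N, M) soft correspondence weights. Returns (R, t) such that
    dst ~ R @ src + t in a weighted least-squares sense."""
    w = C.sum(axis=1)                                   # per-point weight
    tgt = (C @ dst) / np.clip(w, 1e-9, None)[:, None]   # soft target points
    mu_s = (w[:, None] * src).sum(0) / w.sum()          # weighted centroids
    mu_t = (w[:, None] * tgt).sum(0) / w.sum()
    H = ((src - mu_s) * w[:, None]).T @ (tgt - mu_t)    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))              # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t
```

This closed-form solve is what makes correspondence-based pose estimation robust to pose variation: the transform is recovered geometrically from matched points rather than predicted directly as an action.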
Motion planning uses the RRT-Connect algorithm to find collision-free trajectories.
Design Motivation: Pose estimation based on correspondences rather than direct action prediction is more robust to object pose variations and scene layout changes.
Loss & Training¶
- Training uses only 10 short-horizon tasks (100 demonstration trajectories per task).
- Observations from front, side, and wrist RGB-D cameras are used.
- Backbone: PTV3.
- Supervised learning with ground-truth instance masks and correspondence matrices.
Key Experimental Results¶
Experimental Setup¶
- Simulation benchmark: RLBench-Oneshot (30 tasks: 10 SH + 20 LH)
- LH tasks at three difficulty levels: Level 1 (13 tasks, 6 interactions), Level 2 (4 tasks, 9 interactions), Level 3 (3 tasks, 12 interactions)
- Real robot: UFactory xArm7 + RealSense D435/D415, 3 LH tasks
- Baselines: ARP, 3DDA, RVT2 (SOTA IL models), IMOP (OSIL SOTA)
- 25 trials per task (5 random seeds); mean ± standard deviation reported
Main Results — Short-Horizon Tasks¶
| Model | Avg. Success Rate (%) | Avg. Rank | # Best Tasks |
|---|---|---|---|
| IMOP | 65.24 | 3.9 | 1 |
| RVT2 | 79.2 | 3.1 | 0 |
| 3DDA | 85.0 | 2.9 | 2 |
| ARP | 86.6 | 2.7 | 3 |
| ManiLong-Shot | 90.4 (+3.8%) | 1.9 | 6 |
Main Results — Unseen Long-Horizon Tasks (OSIL)¶
| Model | Avg. Success Rate (%) | Avg. Rank |
|---|---|---|
| RVT2+FT | 4.1 | 3.7 |
| 3DDA+FT | 4.5 | 3.3 |
| ARP+FT | 4.7 | 3.3 |
| IMOP | 7.4 | 2.6 |
| ManiLong-Shot | 30.2 (+22.8%) | 1.0 |
Representative task comparisons:
| Task | IMOP | ManiLong-Shot |
|---|---|---|
| Empty Container | 4.0% | 28.0% |
| Empty Dishwasher | 1.3% | 42.7% |
| Put Item in Drawer | 40.0% | 65.3% |
| Take Item out Drawer | 38.7% | 76.0% |
| Set Table | 2.7% | 17.3% |
| Stack Blocks | 1.3% | 8.0% |
Ablation Study¶
| Configuration | Level 1 | Level 2 | Level 3 | Notes |
|---|---|---|---|---|
| ManiLong-Shot (Rule) | Highest | Highest | Highest | Rule-based decomposition |
| ManiLong-Shot (VLM) | Below Rule | Below Rule | Below Rule | VLM reasoning is unstable |
| w/o Positioning | Degraded | Notably degraded | Largest degradation | Inaccurate placement in post-contact phase |
Real Robot Experiments¶
| Task | IMOP | ManiLong-Shot |
|---|---|---|
| Stack Blocks | 60% | 80% |
| Stack Cups | 20% | 60% |
| Place Cups | 20% | 40% |
| Average | 33.3% | 60.0% (+26.7%) |
Key Findings¶
- Consistent superiority on short-horizon tasks: 90.4% average success rate; best on 6 out of 10 training tasks.
- Substantial advantage on long-horizon tasks: 30.2% vs. 7.4% (IMOP), an absolute gain of 22.8 percentage points; fine-tuned SOTA models achieve only 4–5%.
- Rule-based decomposition outperforms VLM-based: VLM reasoning is unstable, and the gap widens with task complexity; the VLM variant still significantly outperforms all baselines.
- Positioning network is critical: its removal causes inaccurate placement in the post-contact phase, disrupting the subsequent subtask execution chain.
- Effective sim-to-real transfer: 60% average success rate on the real robot, 26.7 percentage points above IMOP.
Highlights & Insights¶
- Physical interaction as universal decomposition primitives: the pre-contact/grasping/post-contact three-phase structure is a natural organization of grasping-based manipulation, more robust than semantic segmentation.
- Generalizing to long-horizon tasks trained only on short-horizon ones: 10 short tasks → 20 unseen long tasks, demonstrating the compositional generalization capacity of interaction primitives.
- Elegant abstraction via invariant regions: the functional property that "the cup rim is suitable for grasping" is formalized as an \(SE(3)\)-equivariant invariant region.
- Flexibility of dual decomposition strategies: the rule-based approach offers stability while the VLM-based approach can capture semantic patterns; either can be selected as needed.
- End-to-end one-shot generalization: no task-specific fine-tuning is required, enabling truly zero-shot transfer.
Limitations & Future Work¶
- Restricted to grasping operations: the three-phase decomposition is grounded in physical contact and does not apply to non-grasping behaviors (e.g., wiping, pouring, and other continuous interactions).
- Parallel gripper assumption: the framework is designed around parallel grippers; dexterous hand manipulation is not addressed.
- Tabletop environment constraint: experiments are conducted in tabletop settings; more complex environments (mobile manipulation, multi-room scenarios) remain unvalidated.
- Absolute success rate remains low: the 30.2% average on long-horizon tasks, while far exceeding baselines, still limits practical applicability.
- VLM reasoning instability: the inconsistency of GPT-4o as a task decomposer is amplified in more complex tasks.
Related Work & Insights¶
- IMOP (zhang2024oneshot): the pioneering work introducing the invariant region concept; the direct foundation of this paper.
- GravMAD/DECO (chen2024/2025): subtask decomposition methods based on physical interactions.
- RLBench (james2020): robotic manipulation simulation benchmark; this paper constructs the RLBench-Oneshot subset.
- PTV3: a point cloud Transformer used as the geometric feature extraction backbone.
- Insight for long-horizon manipulation research: composing short-horizon primitives is more effective than directly learning long-horizon sequences.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of interaction-aware three-phase decomposition and invariant region prediction is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across a 30-task benchmark, ablations, VLM comparisons, and real robot experiments.
- Writing Quality: ⭐⭐⭐⭐ — Clear figures, rigorous problem formulation, and detailed method description.
- Value: ⭐⭐⭐⭐ — Provides a practical framework for long-horizon one-shot imitation learning.