ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation¶
Conference: AAAI 2026 arXiv: 2512.16302 Code: Website Area: Reinforcement Learning Keywords: One-shot imitation learning, long-horizon manipulation, interaction-awareness, invariant regions, task decomposition
TL;DR¶
This paper proposes ManiLong-Shot, a framework comprising three modules—interaction-aware task decomposition, invariant region prediction, and region matching—that generalizes to 20 unseen long-horizon manipulation tasks after training on only 10 short-horizon tasks, achieving a one-shot imitation success rate of 30.2%, an absolute improvement of 22.8 percentage points over the prior state of the art.
Background & Motivation¶
Problem Definition¶
One-Shot Imitation Learning (OSIL): learning new skills from a single demonstration without additional training. Robots must rapidly learn and execute diverse long-horizon manipulation tasks in everyday settings (e.g., "set the table," "tidy the kitchen"), which involve sequential interactions with multiple objects.
Limitations of Prior Work¶
- Short-horizon restriction: Most OSIL methods are designed for short-horizon skills only (e.g., IMOP, zhang2024oneshot) and cannot scale to multi-step long-horizon tasks.
- Task-variant dependency: Some methods require new tasks to be minor variants of training tasks, or rely on known 3D object models.
- Predefined primitive libraries: wu2024one depends on a predefined primitive library to compose long-horizon manipulation, limiting flexibility.
Core Motivation — Inspiration from Human Learning¶
When faced with an unseen task (e.g., arranging tableware), humans naturally decompose it into short-horizon primitives and infer the key interaction regions: 1. Pick up the plate (at the edge) → 2. Place the plate (target location on the table) → 3. Pick up the fork (at the handle)
Each primitive is delimited by a physical interaction event (contact/release), and imitation is achieved by replicating actions in these regions. The core question is: can a robot infer subtask boundaries and key interaction regions from unannotated demonstrations?
Method¶
Overall Architecture¶
Three core modules are organized around physical interaction events:
1. Interaction-aware task decomposition: decomposes a demonstration into a sequence of primitives, each comprising pre-contact, grasping, and post-contact phases.
2. Interaction-aware region prediction network: predicts functionally invariant interaction regions for each primitive.
3. Interaction-aware region matching network: aligns predicted regions with the current observation to compute the target end-effector pose.
Inference pipeline: decompose demonstration → predict invariant regions → match current scene → pose regression → motion planning → execution → iterate until task completion.
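The inference pipeline above can be sketched as a simple loop over interaction primitives. This is an illustrative skeleton, not the authors' code: every function below is a hypothetical stub standing in for the corresponding learned module (decomposition, region prediction, matching) or planner.

```python
"""Hypothetical sketch of the ManiLong-Shot inference loop.

All names below are illustrative placeholders; a real system would
replace each stub with the corresponding learned module and planner.
"""

def decompose(demo):
    # Stub: split the demonstration into interaction primitives
    # (pre-contact / grasping / post-contact cycles).
    return demo["primitives"]

def predict_invariant_region(primitive):
    # Stub: the region prediction network would return a 3D point subset.
    return {"points": primitive["region_points"]}

def match_and_regress(region, observation):
    # Stub: the matching network aligns the invariant region with the
    # current scene and regresses a target end-effector pose.
    return {"pose": observation["scene_offset"]}

def plan_motion(target_pose):
    # Stub for a collision-free planner such as RRT-Connect.
    return [target_pose]

def run_one_shot(demo, get_observation, execute):
    """Decompose once, then predict → match → plan → execute per primitive."""
    executed = []
    for prim in decompose(demo):
        region = predict_invariant_region(prim)
        obs = get_observation()
        pose = match_and_regress(region, obs)
        for waypoint in plan_motion(pose):
            execute(waypoint)
            executed.append(waypoint)
    return executed
```

The key structural point is that decomposition happens once per demonstration, while region prediction and matching are re-run per primitive against a fresh observation, which is what lets execution recover from scene changes between subtasks.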
Key Designs¶
1. Interaction-Aware Task Decomposition¶
The demonstration trajectory is organized into a primitive sequence based on physical interaction phases:
- Pre-contact phase: the gripper opens and approaches the object, ending when joint velocity drops to zero (alignment in place).
- Grasping phase: the gripper transitions from open to closed, grasping the object.
- Post-contact phase: after a successful grasp, the gripper transitions from closed to open, placing the object or interacting with another object.
Two interchangeable decomposition strategies are provided:
- Rule-based: infers phase boundaries by analyzing joint velocity and gripper state changes; stable and reliable.
- VLM-based: employs GPT-4o with structured trajectory representations to automatically identify interaction phases; semantically aware.
Short-horizon tasks contain one interaction cycle (3 phases); long-horizon tasks repeat multiple cycles.
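The rule-based strategy can be sketched as a scan over the trajectory's gripper state and joint speed. This is a minimal illustration under assumed inputs (per-timestep gripper open/closed flags and a joint-speed norm); the threshold and field names are not from the paper.

```python
"""Minimal sketch of rule-based interaction decomposition, assuming
per-timestep gripper open/closed state and a joint-speed norm are
available. Thresholds and phase boundaries are illustrative."""

def segment_phases(gripper_open, joint_speed, v_eps=1e-3):
    """Return one dict of (start, end) index ranges per interaction cycle.

    gripper_open: list[bool] per timestep (True = open)
    joint_speed:  list[float] per timestep (norm of joint velocities)
    """
    cycles, t, T = [], 0, len(gripper_open)
    while t < T:
        # Pre-contact: gripper open and moving; ends when speed ~ 0.
        start = t
        while t < T and gripper_open[t] and joint_speed[t] > v_eps:
            t += 1
        pre = (start, t)
        # Grasping: remaining open steps until the gripper closes.
        g_start = t
        while t < T and gripper_open[t]:
            t += 1
        grasp = (g_start, t)
        # Post-contact: gripper closed until it releases (opens) again.
        p_start = t
        while t < T and not gripper_open[t]:
            t += 1
        post = (p_start, t)
        if pre[1] > pre[0] or grasp[1] > grasp[0] or post[1] > post[0]:
            cycles.append({"pre_contact": pre, "grasping": grasp,
                           "post_contact": post})
        else:
            break
    return cycles
```

A short-horizon trajectory yields one cycle; a long-horizon one yields several, matching the paper's repeated-cycle structure.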
Design Motivation: Physical interaction events serve as natural subtask boundaries that are more robust than semantic segmentation and are transferable across tasks and environments.
2. Interaction-Aware Region Prediction Network¶
Identifies functionally and semantically invariant interaction regions for each interaction phase. An invariant region is defined as a 3D geometric subset that maintains \(SE(3)\)-equivariant structure across states sharing the same optimal policy.
Architecture:
- Input: RGB-D dense point clouds of consecutive state pairs \(\{s_i, s_{i+1}\}\)
- Backbone: Point Cloud Transformer V3 (PTV3) with progressive downsampling, cross-scene cross-attention, and within-scene self-attention
- Output: interaction probability distribution → activated region \(\mathcal{I}(s_i)\)
Key distinctions:
- Pre-contact + grasping phases: jointly trained (functionally similar); predict grasp surface regions.
- Post-contact phase: an additional Positioning Network is activated, using attention mechanisms to align the grasped object with the target region.
Training supervision: instance segmentation masks from simulation are used as ground truth. Training is performed exclusively on short-horizon tasks \(\mathcal{T}^{\text{sh}}\).
Design Motivation: The invariant region concept enables knowledge transfer from short-horizon to long-horizon tasks—the same "grasp the cup rim" region remains consistent across different tasks.
3. Interaction-Aware Region Matching Network¶
Aligns the predicted invariant regions from the demonstration with the current execution state to compute the target pose.
Pipeline:
1. State routing network: selects the frame in the demonstration trajectory most similar to the current state.
2. Feature fusion: the cropped invariant-region point cloud \(\mathcal{I}(s_i)\) and the current scene point cloud are jointly downsampled.
3. Dual-stage attention: a cross-self-cross attention module enhances spatial and geometric feature alignment.
4. Correspondence matrix computation: a dual-softmax matching algorithm computes the correspondence matrix \(\mathbf{C}\).
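Dual-softmax matching, as commonly used in correspondence estimation, can be illustrated in a few lines of numpy; the temperature value and normalization details here are assumptions, not the paper's exact configuration.

```python
"""Illustrative dual-softmax correspondence matrix between two
feature sets. Temperature and normalization are assumed values."""

import numpy as np

def dual_softmax(feat_a, feat_b, temperature=0.1):
    """feat_a: (N, D), feat_b: (M, D) L2-normalized descriptors.
    Returns an (N, M) correspondence matrix: row-wise softmax times
    column-wise softmax, so high scores require mutual agreement."""
    sim = feat_a @ feat_b.T / temperature          # (N, M) similarities
    # Softmax over candidate matches for each point in A (rows)...
    pa = np.exp(sim - sim.max(axis=1, keepdims=True))
    pa /= pa.sum(axis=1, keepdims=True)
    # ...and for each point in B (columns).
    pb = np.exp(sim - sim.max(axis=0, keepdims=True))
    pb /= pb.sum(axis=0, keepdims=True)
    return pa * pb
```

Matching a feature set against itself should recover the identity correspondence, which is a quick sanity check on any implementation.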
The target end-effector pose is then regressed from the correspondence matrix \(\mathbf{C}\).
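The paper's exact regression formulation is not reproduced here; a standard stand-in for recovering a rigid pose from soft correspondences is weighted SVD (Kabsch / weighted Procrustes), sketched below.

```python
"""Sketch of correspondence-based pose regression via weighted SVD
(Kabsch / weighted Procrustes). A standard stand-in, not necessarily
the paper's exact formulation."""

import numpy as np

def weighted_kabsch(src, dst, C):
    """src: (N, 3) demo-region points, dst: (M, 3) scene points,
    C: (N, M) soft correspondence weights. Returns (R, t) such that
    dst ~ R @ src + t in a weighted least-squares sense."""
    w = C.sum(axis=1)                                   # per-point weight
    tgt = (C @ dst) / np.clip(w, 1e-9, None)[:, None]   # soft target points
    mu_s = (w[:, None] * src).sum(0) / w.sum()          # weighted centroids
    mu_t = (w[:, None] * tgt).sum(0) / w.sum()
    H = ((src - mu_s) * w[:, None]).T @ (tgt - mu_t)    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))              # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t
```

This closed-form solve is what makes correspondence-based pose estimation robust to pose variation: the transform is recovered geometrically from matched points rather than predicted directly as an action.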
Motion planning uses the RRT-Connect algorithm to find collision-free trajectories.
Design Motivation: Pose estimation based on correspondences rather than direct action prediction is more robust to object pose variations and scene layout changes.
Loss & Training¶
- Training uses only 10 short-horizon tasks (100 demonstration trajectories per task).
- Observations from front, side, and wrist RGB-D cameras are used.
- Backbone: PTV3.
- Supervised learning with ground-truth instance masks and correspondence matrices.
Key Experimental Results¶
Experimental Setup¶
- Simulation benchmark: RLBench-Oneshot (30 tasks: 10 SH + 20 LH)
- LH tasks at three difficulty levels: Level 1 (13 tasks, 6 interactions), Level 2 (4 tasks, 9 interactions), Level 3 (3 tasks, 12 interactions)
- Real robot: UFactory xArm7 + RealSense D435/D415, 3 LH tasks
- Baselines: ARP, 3DDA, RVT2 (SOTA IL models), IMOP (OSIL SOTA)
- 25 trials per task (5 random seeds); mean ± standard deviation reported
Main Results — Short-Horizon Tasks¶
| Model | Avg. Success Rate (%) | Avg. Rank | # Best Tasks |
|---|---|---|---|
| IMOP | 65.24 | 3.9 | 1 |
| RVT2 | 79.2 | 3.1 | 0 |
| 3DDA | 85.0 | 2.9 | 2 |
| ARP | 86.6 | 2.7 | 3 |
| ManiLong-Shot | 90.4 (+3.8%) | 1.9 | 6 |
Main Results — Unseen Long-Horizon Tasks (OSIL)¶
| Model | Avg. Success Rate (%) | Avg. Rank |
|---|---|---|
| RVT2+FT | 4.1 | 3.7 |
| 3DDA+FT | 4.5 | 3.3 |
| ARP+FT | 4.7 | 3.3 |
| IMOP | 7.4 | 2.6 |
| ManiLong-Shot | 30.2 (+22.8%) | 1.0 |
Representative task comparisons:
| Task | IMOP | ManiLong-Shot |
|---|---|---|
| Empty Container | 4.0% | 28.0% |
| Empty Dishwasher | 1.3% | 42.7% |
| Put Item in Drawer | 40.0% | 65.3% |
| Take Item out Drawer | 38.7% | 76.0% |
| Set Table | 2.7% | 17.3% |
| Stack Blocks | 1.3% | 8.0% |
Ablation Study¶
| Configuration | Level 1 | Level 2 | Level 3 | Notes |
|---|---|---|---|---|
| ManiLong-Shot (Rule) | Highest | Highest | Highest | Rule-based decomposition |
| ManiLong-Shot (VLM) | Below Rule | Below Rule | Below Rule | VLM reasoning is unstable |
| w/o Positioning | Degraded | Notably degraded | Largest degradation | Inaccurate placement in post-contact phase |
Real Robot Experiments¶
| Task | IMOP | ManiLong-Shot |
|---|---|---|
| Stack Blocks | 60% | 80% |
| Stack Cups | 20% | 60% |
| Place Cups | 20% | 40% |
| Average | 33.3% | 60.0% (+26.7%) |
Key Findings¶
- Consistent superiority on short-horizon tasks: 90.4% average success rate; best on 6 out of 10 training tasks.
- Substantial advantage on long-horizon tasks: 30.2% vs. 7.4% (IMOP), an absolute gain of 22.8 percentage points; fine-tuned SOTA models achieve only 4–5%.
- Rule-based decomposition outperforms VLM-based: VLM reasoning is unstable, and the gap widens with task complexity; the VLM variant still significantly outperforms all baselines.
- Positioning network is critical: its removal causes inaccurate placement in the post-contact phase, disrupting the subsequent subtask execution chain.
- Effective sim-to-real transfer: 60% average success rate on the real robot, 26.7 percentage points above IMOP.
Highlights & Insights¶
- Physical interaction as universal decomposition primitives: the pre-contact/grasping/post-contact three-phase structure is a natural organization of grasping-based manipulation, more robust than semantic segmentation.
- Generalizing to long-horizon tasks trained only on short-horizon ones: 10 short tasks → 20 unseen long tasks, demonstrating the compositional generalization capacity of interaction primitives.
- Elegant abstraction via invariant regions: the functional property that "the cup rim is suitable for grasping" is formalized as an \(SE(3)\)-equivariant invariant region.
- Flexibility of dual decomposition strategies: the rule-based approach offers stability while the VLM-based approach can capture semantic patterns; either can be selected as needed.
- End-to-end one-shot generalization: no task-specific fine-tuning is required, enabling truly zero-shot transfer.
Limitations & Future Work¶
- Restricted to grasping operations: the three-phase decomposition is grounded in physical contact and does not apply to non-grasping behaviors (e.g., wiping, pouring, and other continuous interactions).
- Parallel gripper assumption: the framework is designed around parallel grippers; dexterous hand manipulation is not addressed.
- Tabletop environment constraint: experiments are conducted in tabletop settings; more complex environments (mobile manipulation, multi-room scenarios) remain unvalidated.
- Absolute success rate remains low: the 30.2% average on long-horizon tasks, while far exceeding baselines, still limits practical applicability.
- VLM reasoning instability: the inconsistency of GPT-4o as a task decomposer is amplified in more complex tasks.
Related Work & Insights¶
- IMOP (zhang2024oneshot): the pioneering work introducing the invariant region concept; the direct foundation of this paper.
- GravMAD/DECO (chen2024/2025): subtask decomposition methods based on physical interactions.
- RLBench (james2020): robotic manipulation simulation benchmark; this paper constructs the RLBench-Oneshot subset.
- PTV3: a point cloud Transformer used as the geometric feature extraction backbone.
- Insight for long-horizon manipulation research: composing short-horizon primitives is more effective than directly learning long-horizon sequences.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of interaction-aware three-phase decomposition and invariant region prediction is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across a 30-task benchmark, ablations, VLM comparisons, and real robot experiments.
- Writing Quality: ⭐⭐⭐⭐ — Clear figures, rigorous problem formulation, and detailed method description.
- Value: ⭐⭐⭐⭐ — Provides a practical framework for long-horizon one-shot imitation learning.