DexMove: Learning Tactile-Guided Non-Prehensile Manipulation with Dexterous Hands
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=dT3ZciXvNX
Code: Project Page https://peilin-666.github.io/projects/DexMove/
Area: Robotics / Dexterous Hands / Non-prehensile Manipulation / Tactile Sensing
Keywords: Non-prehensile manipulation, dexterous hands, visuo-tactile sensors, flow matching policy, wrist-finger coordination, hybrid sim+human data
TL;DR
DexMove adopts a hybrid data paradigm combining "large-scale simulation trajectories + a small amount of human tactile demonstrations" to train a flow matching policy for non-prehensile relocation: a multi-fingered dexterous hand pushes and rotates tabletop objects through wrist-finger coordination under closed-loop tactile control. On a real robot, it achieves an average success rate of 77.8% across 6 object categories, surpassing ablation baselines by 36.6% and improving efficiency by nearly 300%.
Background & Motivation
- Background: Non-prehensile manipulation (moving objects by pushing/pressing without lifting them) is a more robust alternative to pick-and-place for relocating large, heavy, fragile, or irregular objects. However, most existing works utilize two-fingered grippers or pushers for single-point contact, leaving dexterous multi-fingered hands under-explored in this scenario.
- Limitations of Prior Work: ① Data scarcity: training a generalizable policy requires large-scale, physically plausible contact data covering variations in geometry, mass distribution, and surface friction. Teleoperation is inefficient and lacks high-fidelity force feedback, while pure simulation suffers from a significant sim-to-real gap, especially for tactile signals. ② Controller deficiencies: multi-contact interaction couples the forces and motions of several fingers through hand-object dynamics, and whole-hand motion planners that coordinate such interactions are still lacking.
- Key Challenge: While dexterous hands are inherently suited for non-prehensile manipulation (distributed multi-point contact is more stable than single-point and manages objects with difficult dynamics like thin plates or cylinders), the "scarcity of scalable high-fidelity data" and the "absence of force-aware multi-contact coordination strategies" hinder progress.
- Goal: Develop a non-prehensile manipulation framework for tactile-enabled dexterous hands that can scale force-conditioned wrist-finger trajectories and utilize real tactile feedback for closed-loop control, generalizing to unseen objects, frictions, and language-conditioned long-horizon tasks.
- Core Idea: Hybrid Data Synthesis + Decoupled Tactile Force Planning. Large-scale force-conditioned wrist-finger trajectories are generated in simulation (addressing scale), and fingertip force distributions are collected from human demonstrations with wearable visuo-tactile devices (addressing tactile fidelity). A flow matching policy fuses the two, with a standalone TaFo-Net predicting the desired future finger forces that condition the trajectory policy.
Method
Overall Architecture
DexMove is divided into "Data Acquisition" and "Policy Learning." On the data side, 2M force-conditioned wrist-finger trajectories are synthesized in simulation via optimization and rejection sampling, and approximately 300k frames of real tactile vector fields are collected from human demonstrations using a wearable exoskeleton fitted with R-Tac visuo-tactile sensors. The policy side is a pipeline of three Flow Matching (FM)/Transformer components: ① a contact-building FM policy predicts the initial hand pose → ② TaFo-Net predicts future desired finger forces from the historical tactile fields → ③ DexMove-Policy rolls out future wrist-finger trajectories conditioned on historical states, the target pose, and the desired forces.
flowchart LR
A[Object Point Cloud + Goal Pose] --> B[Contact-Building FM Policy<br/>Predict Initial Hand Pose]
B --> C[DexMove-Policy<br/>Flow Matching Trajectory]
D[Historical Tactile Vector Field V<br/>Historical/Goal Object Pose] --> E[TaFo-Net<br/>Predict Future Tactile Field → Force G]
E -->|Desired Force G_1:Tf| C
C --> F[Wrist-Finger Coordination<br/>Non-prehensile Relocation]
F -->|Real-time Tactile/Pose Feedback| D
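The closed loop in the diagram can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: `ToyEnv`, `run_episode`, and the three lambda stubs are hypothetical stand-ins for the robot and the three learned components, and executing only the first action of each predicted trajectory before re-planning is an assumption made here to keep the loop simple.

```python
from collections import deque


class ToyEnv:
    """Hypothetical 1-D stand-in for the robot: the object pose is one number."""

    def __init__(self, start=0.0, goal=1.0, tol=1e-3):
        self.pos, self.goal, self.tol = start, goal, tol

    def _obs(self):
        return {"point_cloud": None, "goal_pose": self.goal,
                "object_pose": self.pos, "tactile": [0.0]}

    def reset(self):
        return self._obs()

    def establish_contact(self, hand_pose):
        return self._obs()

    def step(self, action):
        self.pos += action  # the push moves the object directly in this toy
        return self._obs()

    def reached_goal(self):
        return abs(self.pos - self.goal) < self.tol


def run_episode(env, contact_policy, tafo_net, dexmove_policy,
                history_len=4, max_steps=50):
    """Plan forces from tactile history, roll out motion, then re-plan."""
    obs = env.reset()
    # Stage 1: the contact-building policy proposes the initial hand pose.
    hand_pose = contact_policy(obs["point_cloud"], obs["goal_pose"])
    obs = env.establish_contact(hand_pose)
    history = deque([obs] * history_len, maxlen=history_len)
    for step in range(max_steps):
        if env.reached_goal():
            return step
        # Stage 2: the force planner maps history and goal to desired forces.
        desired_forces = tafo_net(list(history), obs["goal_pose"])
        # Stage 3: the trajectory policy is conditioned on those forces.
        trajectory = dexmove_policy(list(history), obs["goal_pose"], desired_forces)
        # Assumption: execute the first action only, then use fresh feedback.
        obs = env.step(trajectory[0])
        history.append(obs)
    return None


env = ToyEnv()
steps_taken = run_episode(
    env,
    contact_policy=lambda cloud, goal: 0.0,
    tafo_net=lambda history, goal: [1.0],
    dexmove_policy=lambda history, goal, forces: [
        0.5 * forces[0] * (goal - history[-1]["object_pose"])
    ],
)
print("steps to goal:", steps_taken)
```

The stubs halve the remaining pose error at every step, so the toy loop converges quickly; the point is only the data flow between the three stages and the feedback path.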
Key Designs
1. Force-conditioned Trajectory Synthesis: Using "Penetration Depth" as a Force Proxy. To generate trajectories at scale in rigid-body simulation, initial wrist poses \((R^{wrist}_0, T^{wrist}_0)\) are sampled uniformly. Each fingertip is then pushed toward the nearest object surface along a displacement vector \(d\), perturbed with Gaussian noise (\(\hat{d}=d+\varepsilon\)) for diversity. An optimization (Eq. 1-2) solves for the joint and wrist configuration that reaches the fingertip targets while a region term \(L_{region}\) keeps contact inside the sensor area. After contact, the hand is translated incrementally in random directions in MuJoCo, and a direction is accepted by rejection sampling only if stable contact holds over 50 cm. Under the non-slip assumption, each fingertip moves rigidly with the object, so its trajectory follows from the initial offset and the object transformation: \(P^{tip}_t = P^{obj}_t + R_z(\omega^{obj}_t)(P^{tip}_0 - P^{obj}_0)\). Crucially, normal force is approximated by penetration depth, \(G \approx D_{sensor} = r - \text{distance}(P^{tip}_t, \text{surface})\). Fingertip positions are perturbed along the contact normal \(\vec{n}\), \(\hat{P}^{tip}_t = P^{tip}_t + \vec{n}\cdot N(0,\sigma)\), to augment trajectories with varying force magnitudes. Final configurations come from inverse kinematics with a wrist regularizer \(L_{wrist}\) that prioritizes finger motion over arm motion. The resulting dataset expands 88 YCB objects into 2M sequences across 412k grasp configurations.
2. Human Demonstrations for Real Tactility: Isomorphic Visuo-tactile Sensors. Because rigid-body simulation cannot reproduce high-fidelity contact dynamics or real tactile output, force data is supplemented with human demonstrations. A wearable exoskeleton places R-Tac visuo-tactile sensors on the human fingertips, so the collected data transfers directly to the robotic hand; this isomorphic design minimizes the domain gap. Each trial records the goal pose, real-time poses, and tactile data: normal force \(G\) derived from penetration and shear forces derived from 2D marker displacements, which together form a tactile vector field \(V \in \mathbb{R}^{v\times 4}\) (\(v=33\) markers). Data is collected at 30 FPS across 20 objects.
3. TaFo-Net Force Planning: Implicit Environment Encoding via Historical Tactile Fields. The trajectory policy is conditioned on desired finger forces \(G_{1:T_f}\), which TaFo-Net predicts. The core insight is that historical tactile vector fields implicitly encode environment properties (e.g., friction, contact state), while the historical and goal object poses provide the error signal. The network has three stages: (i) Per-finger spatial encoding: each tactile field is encoded into tokens by a lightweight Transformer with geometric position embeddings; (ii) Cross-finger attention: the tokens of all \(F\) fingers in frame \(i\) pass through multi-head self-attention, \(\tilde{U}_{i,1:F}=\mathrm{CF}(U_{i,1:F}+g_{1:F})\), to model coordination constraints; (iii) Per-finger causal temporal attention: a causal mask ensures that the query at time \(i\) attends only to tokens at times \(\leq i\). Together these stages give goal-conditioned predictions that are consistent over time and across fingers. Training minimizes the reconstruction loss \(L_{rec}=\sum_t\sum_f \|\hat{V}_{t,f}-V_{t,f}\|^2\) between predicted and recorded tactile fields.
4. DexMove-Policy: Goal-Conditioned Trajectory Rollout. Both contact building and trajectory generation use Flow Matching (FM), which is faster than diffusion policies. FM learns a time-dependent velocity field \(u(\cdot)\) from interpolated samples \(X_t=(1-t)X_0+tX_1\), where \(X_0\) is a noise sample, \(X_1\) is a data sample, and \(t\in[0,1]\), with objective \(L=\mathbb{E}\|(X_1-X_0)-u(X_t,t,\text{cond})\|^2\). DexMove-Policy is conditioned on the past \(T_p\) frames of system state (joints, wrist, object pose, contact \(C\), forces \(G\)), the target pose, and TaFo-Net's desired forces \(G_{1:T_f}\). These conditions are fused via cross-attention and fed into a Transformer decoder that predicts the velocity field, and integrating that field yields the next \(T_f\) frames of hand states \(X_1=(P^{hand},A^{hand},R^{wrist},T^{wrist})_{1:T_f}\).
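The flow matching objective and sampling procedure from item 4 can be illustrated with a small pure-Python sketch. It is a toy under stated assumptions, not the paper's implementation: `oracle_velocity` is a hypothetical closed-form velocity field standing in for the Transformer decoder, the condition is simply the target vector itself, and Euler integration with 10 steps is an arbitrary choice.

```python
import random


def interpolate(x0, x1, t):
    """X_t = (1 - t) * X_0 + t * X_1, element-wise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def fm_loss(velocity_fn, x0, x1, t, cond):
    """Squared error between the target velocity X_1 - X_0 and the prediction."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, cond)
    return sum(((b - a) - p) ** 2 for a, b, p in zip(x0, x1, pred))


def sample(velocity_fn, x0, cond, steps=10):
    """Generate X_1 by Euler-integrating dX/dt = u(X, t, cond) from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x_t, t, cond):
    """Toy optimum when cond pins down X_1 exactly: (X_1 - X_t) / (1 - t)."""
    return [(c - x) / (1.0 - t) for c, x in zip(cond, x_t)]


rng = random.Random(0)
target = [0.3, -0.2, 0.1]                      # stands in for a future hand state X_1
noise = [rng.gauss(0.0, 1.0) for _ in target]  # X_0 drawn from a standard Gaussian

loss = fm_loss(oracle_velocity, noise, target, t=rng.uniform(0.0, 0.9), cond=target)
generated = sample(oracle_velocity, noise, cond=target)
print("loss:", loss)
print("generated:", generated)
```

With the oracle field the loss is zero up to rounding and the integrated sample lands on the target. A learned network would be trained by minimizing `fm_loss` over many (noise, data, t) triples and would replace `oracle_velocity` at sampling time.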
Key Experimental Results
Main Results: Success Rate (%) by Initial Yaw Error × Friction Surface
| Method | 0–30° Fric.A | 0–30° Fric.B | 30–60° Fric.A | 30–60° Fric.B | 60–90° Fric.A | 60–90° Fric.B |
|---|---|---|---|---|---|---|
| Open-loop | 36.7 | 10.0 | 23.3 | 0.0 | 3.3 | 0.0 |
| DyWA (Gripper) | 50.0 | 36.7 | 46.7 | 30.0 | 50.0 | 33.3 |
| CORN (Gripper) | 43.3 | 36.7 | 46.7 | 40.0 | 43.3 | 43.3 |
| DexMove | 86.7 | 86.7 | 80.0 | 83.3 | 70.0 | 60.0 |
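Fric.B is a friction surface unseen during training. DexMove stays within 10 points of its Fric.A success rate in every yaw bin, whereas DyWA loses 13 to 17 points and the open-loop baseline drops to 10% or below. CORN degrades less (at most about 7 points) but starts from a much lower success rate.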
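The robustness comparison can be checked directly from the table. The following sketch only re-computes per-method means over the three yaw bins and the drop from Fric.A to Fric.B; the numbers are copied from the table, and the variable names are illustrative.

```python
# Success rates (%) from the table: (Fric.A, Fric.B) for each yaw-error bin.
results = {
    "Open-loop": [(36.7, 10.0), (23.3, 0.0), (3.3, 0.0)],
    "DyWA (Gripper)": [(50.0, 36.7), (46.7, 30.0), (50.0, 33.3)],
    "CORN (Gripper)": [(43.3, 36.7), (46.7, 40.0), (43.3, 43.3)],
    "DexMove": [(86.7, 86.7), (80.0, 83.3), (70.0, 60.0)],
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


summary = {}
for method, bins in results.items():
    mean_a = mean(a for a, _ in bins)  # seen friction surface
    mean_b = mean(b for _, b in bins)  # unseen friction surface
    summary[method] = (mean_a, mean_b, mean_a - mean_b)
    print(f"{method:15s} A={mean_a:5.1f}  B={mean_b:5.1f}  drop={mean_a - mean_b:5.1f}")

# Mean over all six cells, which is the headline average success rate.
overall_dexmove = mean(v for pair in results["DexMove"] for v in pair)
print(f"DexMove overall: {overall_dexmove:.1f}")
```

Averaged over yaw bins, DexMove loses about 2 points on the unseen surface, CORN about 4, DyWA about 16, and the open-loop baseline about 18.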
Efficiency: Average Completion Time (s, lower is better)
| Method | 0–15 cm | 15–30 cm | 30–45 cm |
|---|---|---|---|
| DyWA | 36.1 | 52.2 | 60.6 |
| CORN | 41.4 | 54.5 | 62.1 |
| DexMove | 8.3 | 10.9 | 12.4 |
DexMove completes tasks in roughly one-fifth to one-quarter of the time taken by the gripper baselines (the "nearly 300%" efficiency gain cited in the TL;DR), owing to multi-finger contact and fewer motion primitives.
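As a quick arithmetic check on the timing table, the sketch below computes how many times faster DexMove is than each gripper baseline in each column. How the "nearly 300%" figure was derived is not stated in this summary, so the code reports only the raw time ratios; the dictionary layout and the plain-hyphen column labels are illustrative.

```python
# Average completion times in seconds, copied from the table above.
times = {
    "DyWA": [36.1, 52.2, 60.6],
    "CORN": [41.4, 54.5, 62.1],
    "DexMove": [8.3, 10.9, 12.4],
}
columns = ["0-15 cm", "15-30 cm", "30-45 cm"]

speedups = {}
for baseline in ("DyWA", "CORN"):
    # Ratio > 1 means the baseline needs that many times longer than DexMove.
    speedups[baseline] = [
        base_t / dex_t for base_t, dex_t in zip(times[baseline], times["DexMove"])
    ]

for baseline, ratios in speedups.items():
    cells = ", ".join(f"{col}: {r:.2f}x" for col, r in zip(columns, ratios))
    print(f"{baseline} vs DexMove -> {cells}")

all_ratios = [r for ratios in speedups.values() for r in ratios]
print(f"range: {min(all_ratios):.2f}x to {max(all_ratios):.2f}x")
```

Across all six baseline/column pairs the ratio falls between roughly 4.3x and 5.0x.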
Ablation Study: Success Rate per Object (%)
| Method | Lego | Mouse | Book | Keyboard | Large Can | Small Can |
|---|---|---|---|---|---|---|
| Wrist-Only (Locked Fingers) | 13.3 | 0.0 | 33.3 | 20.0 | 0.0 | 0.0 |
| w/o Cross-Finger | 13.3 | 3.3 | 63.3 | 50.0 | 0.0 | 3.3 |
| w/o Shear-Force | 70.0 | 66.7 | 33.3 | 13.3 | 0.0 | 0.0 |
| w Heuristic Force | 36.7 | 43.3 | 66.7 | 0.0 | 0.0 | 0.0 |
| DexMove | 66.7 | 86.7 | 90.0 | 90.0 | 63.3 | 70.0 |
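To make the ablation table easier to read at a glance, this small sketch averages each row over the six objects and lists any object on which an ablated variant matches or beats the full model. The values are copied from the table; the helper names are illustrative.

```python
objects = ["Lego", "Mouse", "Book", "Keyboard", "Large Can", "Small Can"]
ablation = {
    "Wrist-Only (Locked Fingers)": [13.3, 0.0, 33.3, 20.0, 0.0, 0.0],
    "w/o Cross-Finger": [13.3, 3.3, 63.3, 50.0, 0.0, 3.3],
    "w/o Shear-Force": [70.0, 66.7, 33.3, 13.3, 0.0, 0.0],
    "w Heuristic Force": [36.7, 43.3, 66.7, 0.0, 0.0, 0.0],
    "DexMove": [66.7, 86.7, 90.0, 90.0, 63.3, 70.0],
}

# Mean success rate per method over the six objects.
averages = {method: sum(row) / len(row) for method, row in ablation.items()}
for method, avg in averages.items():
    print(f"{method:28s} {avg:5.1f}")

# Objects where an ablated variant is at least as good as the full model.
full = ablation["DexMove"]
exceptions = [
    (method, obj)
    for method, row in ablation.items()
    if method != "DexMove"
    for obj, value, reference in zip(objects, row, full)
    if value >= reference
]
print("ablation >= full model:", exceptions)
```

The full model's row averages to the 77.8% quoted in the TL;DR, and the only cell where an ablation is not worse is the Lego for the variant without shear force (70.0% vs 66.7%).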
Key Findings
- Multi-finger > Single-point: Gripper baselines fail primarily during rotation (especially cylinders) because they rely on single contact points. Continuous multi-surface contact by dexterous hands enables precise rotation.
- Removing Cross-finger Attention → Flat Objects Only: The ablated model retains moderate success only on the flat objects (Book 63.3%, Keyboard 50.0%) and stays at or below 13.3% elsewhere, indicating that it fails to capture inter-finger coordination.
- Removing Shear Force → Heavy Object Failure: The model degrades to predicting smooth mean forces. It still handles the Lego and Mouse (70.0% and 66.7%) but fails on heavy objects, where shear feedback is vital for slip detection (Keyboard 13.3%, both cans 0%).
- Heuristic Force < Learned Force: A hand-crafted "add force upon slip" strategy performs poorly on most objects, scoring 0% on the Keyboard and both cans.
- Strong Generalization: Success rates on deformable objects are high (Plushie 96.7%, Tissue 100%), and TaFo-Net can recover performance on uneven surfaces with 15 minutes of fine-tuning.
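The cross-finger ablation is easier to picture by looking at which tokens may attend to which. The sketch below builds the two attention patterns described for TaFo-Net as boolean masks over a flattened token sequence. It is an assumption-laden illustration: the flattening order, the function name `build_masks`, and the use of explicit masks (rather than reshaping tensors per frame or per finger) are choices made here, not details from the paper.

```python
def build_masks(num_frames, num_fingers):
    """Return (cross_finger, causal_temporal) boolean attention masks.

    Token (i, f) is frame i, finger f, flattened to index i * num_fingers + f.
    mask[q][k] is True when query token q may attend to key token k.
    """
    n = num_frames * num_fingers
    cross = [[False] * n for _ in range(n)]
    causal = [[False] * n for _ in range(n)]
    for q in range(n):
        q_frame, q_finger = divmod(q, num_fingers)
        for k in range(n):
            k_frame, k_finger = divmod(k, num_fingers)
            # Cross-finger stage: all fingers within the same frame.
            cross[q][k] = q_frame == k_frame
            # Temporal stage: same finger, current and earlier frames only.
            causal[q][k] = q_finger == k_finger and k_frame <= q_frame
    return cross, causal


cross, causal = build_masks(num_frames=3, num_fingers=2)
query = 1 * 2 + 0  # frame 1, finger 0
cross_keys = [k for k, allowed in enumerate(cross[query]) if allowed]
causal_keys = [k for k, allowed in enumerate(causal[query]) if allowed]
print("cross-finger keys:", cross_keys)
print("causal temporal keys:", causal_keys)
```

For the query at frame 1, finger 0, the cross-finger mask exposes both fingers of frame 1, while the causal mask exposes only finger 0 at frames 0 and 1. Dropping the first mask's stage removes the only path by which one finger's tokens see another's.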
Highlights & Insights
- "Penetration depth as force proxy" is a key trick bridging rigid-body simulation and tactile sensing: it lets inexpensive simulation produce force-conditioned trajectories and enables force augmentation via normal-direction perturbations.
- Isomorphic wearable tactile exoskeleton: Using the same sensors on both the human and the robot is an effective engineering choice that injects low-domain-gap real tactile data into the policy at low cost.
- Decoupled Force Planning and Trajectory Generation: TaFo-Net handles "how much force" while DexMove-Policy handles "how to move," with desired force as the bridge—this makes tactile closed-loop learning interpretable.
- Implicit Environment Encoding: Not explicitly estimating friction but letting the network infer it from tactile history naturally supports generalization to unseen surfaces.
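The penetration-depth proxy and the non-slip fingertip formula from the method section can be made concrete with a planar sketch. Several things here are assumptions for illustration: the fingertip is treated in 2D, the object surface is a locally flat patch given by a point and an outward normal, \(r\) is interpreted as the fingertip sensor radius, and all function names and numeric values are hypothetical.

```python
import math
import random


def fingertip_under_no_slip(p_obj_0, p_tip_0, p_obj_t, yaw_t):
    """P_tip_t = P_obj_t + R_z(yaw_t) (P_tip_0 - P_obj_0), in the table plane."""
    dx, dy = p_tip_0[0] - p_obj_0[0], p_tip_0[1] - p_obj_0[1]
    c, s = math.cos(yaw_t), math.sin(yaw_t)
    return (p_obj_t[0] + c * dx - s * dy, p_obj_t[1] + s * dx + c * dy)


def penetration_depth(p_tip, surface_point, outward_normal, r):
    """D = r - distance(tip center, surface) for a locally flat surface patch."""
    dist = sum((p - q) * n for p, q, n in zip(p_tip, surface_point, outward_normal))
    return r - dist


def perturb_along_normal(p_tip, outward_normal, sigma, rng):
    """Shift the tip along the normal by a N(0, sigma) draw to vary the proxy."""
    eps = rng.gauss(0.0, sigma)
    return tuple(p + eps * n for p, n in zip(p_tip, outward_normal))


# Object starts at the origin with the fingertip 5 cm to its right, then
# translates to (0.1, 0.2) while rotating 90 degrees about the vertical axis.
tip_t = fingertip_under_no_slip((0.0, 0.0), (0.05, 0.0), (0.1, 0.2), math.pi / 2)

# Flat patch at x = 0.04 facing +x; tip center at x = 0.05; assumed radius 12 mm.
tip_0 = (0.05, 0.0)
patch_point, normal, radius = (0.04, 0.0), (1.0, 0.0), 0.012
depth = penetration_depth(tip_0, patch_point, normal, radius)

# Moving the tip against the outward normal presses deeper (larger proxy force).
moved = perturb_along_normal(tip_0, normal, sigma=0.001, rng=random.Random(0))
new_depth = penetration_depth(moved, patch_point, normal, radius)
print(tip_t, depth, new_depth)
```

In this toy setup the tip center sits 10 mm from the surface, so a 12 mm radius gives a 2 mm depth, and each normal-direction perturbation changes the depth by exactly the sampled offset.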
Limitations & Future Work
- Accuracy is still sensitive to tactile quality; success rates for large objects drop significantly when tactile noise \(\sigma\) reaches 0.2.
- Object motion is modeled as 3 DoF (x/y translation + yaw); more complex reorientations like flipping or uprighting are not addressed.
- Simulation trajectories rely on non-slip assumptions and penetration-based force approximations, leaving the fidelity of highly dynamic or high-slip contacts unknown.
- Evaluation is limited to a small set of objects; further validation on larger scale, open-world scenarios is needed.
Related Work & Insights
- Non-prehensile Manipulation: Evolves from planar pushing (Mason 1986) to controlled contact breaking/rebuilding (Chi 2024). This work extends it to dexterous hands + tactile closed-loop, demonstrating that multi-point contact is significantly more stable.
- Tactile Data Collection: Teleoperation often lacks feedback for dexterous hands; tactile gloves have marker-mismatch domain gaps. Isomorphic sensors (Zhu 2025) represent the current frontier for low-domain-gap data collection.
- Insight: For "contact-rich yet hard-to-simulate" tasks, the paradigm of "using cheap simulation for motion scale + small amounts of real data for physical fidelity (tactile/force)" is a highly reusable data strategy.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First non-prehensile policy for tactile dexterous hands; hybrid paradigm and isomorphic exoskeleton are highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Solid real-robot benchmarks + detailed ablations + robustness tests; however, lacks more diverse public baselines.
- Writing Quality: ⭐⭐⭐⭐ Clear correspondence between motivation, challenges, and methods.
- Value: ⭐⭐⭐⭐⭐ Provides a comprehensive hardware/software/data suite with real-world verification, offering significant utility for the dexterous manipulation community.