DexFlyWheel: A Scalable Self-Improving Data Generation Framework for Dexterous Manipulation

Conference: NeurIPS 2025 | arXiv: 2509.23829 | Code: Project Page | Area: Robotics | Keywords: Dexterous Manipulation, Data Flywheel, Imitation Learning, Residual Reinforcement Learning, Sim-to-Real

TL;DR

This paper proposes DexFlyWheel, a dexterous manipulation data generation framework that starts from a single human demonstration and progressively scales data diversity through a self-improving loop composed of IL, residual RL, and data augmentation. The framework generates 2,000+ demonstrations across 4 tasks, achieving an average policy success rate of 81.9% and a real-world transfer success rate of 78.3%.

Background & Motivation

Dexterous manipulation is a core challenge in robotics. The high degrees of freedom of multi-fingered hands and rich contact interactions demand larger, more diverse, and higher-quality datasets. Existing data collection methods suffer from the following bottlenecks:

Human teleoperation: Requires substantial human effort, is typically confined to laboratory settings, and scales poorly.

Motion planning: Effective for parallel-jaw grippers, but struggles with the high-dimensional action spaces and complex contact dynamics of multi-fingered hands.

LLM-driven methods: Capable of generating high-level instructions, but unable to provide finger-level fine-grained control.

Pure RL methods: Suffer from exploration difficulty, dependence on reward engineering, and produce non-human-like behaviors that hinder sim-to-real transfer.

Replay-and-edit methods (e.g., DexMimicGen): Only apply spatial transformations to existing demonstrations, failing to explore new manipulation strategies and breaking down when object geometry varies significantly.

Core insight: Manipulating different objects typically requires only minor adjustments to the manipulation trajectory — human demonstrations should be treated as behavioral priors rather than mere replay data.

Method

Overall Architecture

DexFlyWheel operates in two phases:

  1. Warm-up Phase: Starting from a single VR teleoperation demonstration, an initial dataset \(\mathcal{D}_1\) is generated via data augmentation.
  2. Self-Improving Data Flywheel Phase: Multiple iterations \(i = \{1, 2, ..., n-1\}\), each executing a closed-loop pipeline.

Each flywheel iteration consists of four steps:

  1. Train a base policy \(\pi_{\text{base}}^i\) via imitation learning.
  2. Train a residual policy \(\pi_{\text{res}}^i\) via residual RL.
  3. Roll out the combined policy to collect new trajectories \(\mathcal{D}_O^i\).
  4. Apply data augmentation to expand diversity, yielding \(\mathcal{D}_{i+1}\).
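A minimal sketch of this outer loop, assuming hypothetical interfaces for the four components (the function names below are illustrative placeholders, not the authors' code):

```python
# Hypothetical sketch of the DexFlyWheel outer loop; the callables stand in for
# Diffusion Policy IL, PPO residual RL, success-filtered rollouts, and the A_EP
# augmentation module. They are assumptions, not the paper's actual API.
from typing import Any, Callable, List

Trajectory = Any  # placeholder type for one demonstration trajectory


def dexflywheel_loop(
    seed_demo: Trajectory,
    train_imitation: Callable[[List[Trajectory]], Any],
    train_residual: Callable[[Any], Any],
    rollout_and_filter: Callable[[Any, Any], List[Trajectory]],
    augment: Callable[[List[Trajectory]], List[Trajectory]],
    num_iterations: int = 3,
) -> List[Trajectory]:
    # Warm-up phase: expand the single seed demonstration into D_1 via A_EP.
    dataset = augment([seed_demo])

    for _ in range(num_iterations):
        base_policy = train_imitation(dataset)                         # step 1: pi_base (IL)
        residual_policy = train_residual(base_policy)                  # step 2: pi_res (residual RL)
        new_trajs = rollout_and_filter(base_policy, residual_policy)   # step 3: success-filtered D_O
        dataset = dataset + augment(new_trajs)                         # step 4: D_{i+1} for the next round
    return dataset
```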

Key Designs

1. VR Teleoperation + Data Augmentation (Warm-up)

Apple Vision Pro is used to track hand/wrist/head poses, collecting a single seed demonstration \(d_{\text{seed}}\) in simulation.

The data augmentation module \(\mathcal{A}_{\text{EP}}\), extended from MimicGen, supports multi-dimensional augmentation:

  • Environment diversity: lighting conditions and tabletop appearance variations.
  • Spatial diversity: object pose variations.

Both are implemented via trajectory editing and simulation domain randomization.
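A sketch of the spatial part in the MimicGen style: the demonstrated end-effector trajectory is re-expressed relative to the object and replayed against a randomized object pose. The function below is a simplified assumption; the actual \(\mathcal{A}_{\text{EP}}\) module also randomizes lighting and tabletop appearance in simulation, which is omitted here.

```python
import numpy as np


def augment_spatial(traj_ee_poses: np.ndarray,
                    obj_pose_old: np.ndarray,
                    obj_pose_new: np.ndarray) -> np.ndarray:
    """Re-target end-effector poses (N x 4 x 4 homogeneous matrices) from the
    demonstrated object pose to a newly sampled one (hypothetical helper)."""
    # Rigid transform mapping the old object frame onto the new object frame.
    delta = obj_pose_new @ np.linalg.inv(obj_pose_old)
    # Apply the same transform to every pose along the trajectory.
    return np.einsum("ij,njk->nik", delta, traj_ee_poses)
```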

2. Base Policy Training (Imitation Learning)

Diffusion Policy is adopted as the base policy. The input state is \(s_t = \{s_t^{\text{vis}}, s_t^{\text{obj}}, s_t^{\text{prop}}\}\):

  • \(s_t^{\text{vis}}\): camera visual input.
  • \(s_t^{\text{obj}}\): object state (6D pose + velocity).
  • \(s_t^{\text{prop}}\): robot proprioception (joint angles/velocities + end-effector pose).

The policy outputs an action sequence \((a_t, a_{t+1}, ..., a_{t+H})\), where \(H\) is the prediction horizon.
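For concreteness, a minimal illustration of this state/action interface; the dimensions and the stub policy below are assumptions chosen for clarity, not figures from the paper.

```python
import numpy as np

H = 16            # assumed prediction horizon
ACTION_DIM = 26   # assumed: dual-arm plus dexterous-hand joints

# s_t = {s_vis, s_obj, s_prop}, here filled with dummy values.
state = {
    "vis":  np.zeros((240, 320, 3), dtype=np.uint8),   # camera image
    "obj":  np.zeros(13),                              # object pose (pos + quat) + velocities
    "prop": np.zeros(2 * ACTION_DIM + 7),              # joint angles/velocities + EE pose
}


def diffusion_policy_stub(state: dict) -> np.ndarray:
    """Stand-in for the Diffusion Policy: returns an action chunk (a_t, ..., a_{t+H})."""
    return np.zeros((H + 1, ACTION_DIM))


action_chunk = diffusion_policy_stub(state)   # executed (or partially executed) before re-planning
```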

3. Residual Reinforcement Learning

This is the core innovation of the flywheel mechanism. The base policy \(\pi_{\text{base}}\) is frozen, and a residual policy \(\pi_{\text{res}}\) is trained to generate corrective actions:

\[\tilde{a}_t = a_t + \alpha \cdot \Delta a_t,\]

where \(a_t = \pi_{\text{base}}(s_t)\), \(\Delta a_t = \pi_{\text{res}}(s_t)\), and \(\alpha\) is a scaling coefficient.

The combined policy is \(\pi_{\text{combined}} = \pi_{\text{base}} + \alpha \cdot \pi_{\text{res}}\).

A progressive scheduling scheme is employed during training:

\[\pi_{\text{combined}}(s) = \begin{cases} \pi_{\text{base}}(s) + \alpha \cdot \pi_{\text{res}}(s) & \text{with probability } \epsilon \\ \pi_{\text{base}}(s) & \text{with probability } 1 - \epsilon \end{cases}\]

\(\epsilon\) increases linearly from 0 to 1, gradually shifting control from the base policy alone to the combined policy. The residual policy receives only \(s_t^{\text{obj}}\) and \(s_t^{\text{prop}}\), focusing on object adaptation rather than learning manipulation from scratch.
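A minimal sketch of the combined policy and the progressive schedule, with the two policies passed in as generic callables (the function names and the linear-ramp granularity are assumptions):

```python
import numpy as np


def combined_action(state, base_policy, residual_policy, alpha: float, epsilon: float,
                    rng: np.random.Generator) -> np.ndarray:
    """With probability epsilon, add the scaled residual correction; otherwise
    act with the frozen base policy alone."""
    a_base = base_policy(state)                    # pi_base(s), frozen during residual RL
    if rng.random() < epsilon:
        # The residual policy only sees object state and proprioception, not vision.
        delta_a = residual_policy({"obj": state["obj"], "prop": state["prop"]})
        return a_base + alpha * delta_a            # a~_t = a_t + alpha * Delta a_t
    return a_base


def epsilon_schedule(step: int, total_steps: int) -> float:
    """Linear ramp of epsilon from 0 to 1 over residual-RL training."""
    return min(1.0, step / max(1, total_steps))
```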

4. Rollout + Augmentation Loop

The combined policy is rolled out under randomized object configurations. High-quality trajectories are filtered by task success and then augmented via \(\mathcal{A}_{\text{EP}}\) to cover more environment and spatial configurations, generating training data for the next iteration.
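A hypothetical sketch of this rollout-and-filter step; `env.reset`/`env.step` and the `success` flag are assumed, gym-style interfaces rather than the paper's actual API.

```python
def collect_successful_trajectories(env, combined_policy, num_rollouts: int) -> list:
    kept = []
    for _ in range(num_rollouts):
        state = env.reset()                      # randomized object pose / category
        trajectory, done, info = [], False, {}
        while not done:
            action = combined_policy(state)
            next_state, _reward, done, info = env.step(action)
            trajectory.append((state, action))
            state = next_state
        if info.get("success", False):           # keep only task-successful rollouts
            kept.append(trajectory)
    return kept                                  # D_O^i, later expanded by A_EP
```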

Loss & Training

  • Base policy: Standard denoising loss of Diffusion Policy.
  • Residual policy: Standard RL loss (PPO), with task success as the reward signal.
  • Iteration settings: \(i = \{1, 2, 3\}\), generating 20, 100, and 500 trajectories respectively.
  • Simulation platform: OmniGibson (photorealistic rendering), 80 object categories, 12 environments.

Key Experimental Results

Main Results: Data Flywheel Effectiveness

| Task | i=1 Success | i=2 Success | i=3 Success | # Scenes |
|---|---|---|---|---|
| Grasp | 15.0% | 58.0% | 90.0% | 3960 |
| Pour | 36.1% | 55.6% | 85.8% | 1440 |
| Lift | 13.9% | 44.4% | 79.4% | 1560 |
| Handover | 0.8% | 17.5% | 72.5% | 1200 |
| Average | 16.5% | 43.9% | 81.9% | 2040 |

From i=1 to i=3, the number of object categories increases by 20×, the number of scenes by 214.7×, and the average success rate rises from 16.5% to 81.9%, a relative gain of 396.4%.
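The success-rate figure is a relative improvement over the first iteration:

\[\frac{81.9\% - 16.5\%}{16.5\%} \approx 3.96 \;\Rightarrow\; +396.4\%\]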

Comparison with Baselines

| Method | Grasp | Pour | Lift | Handover | Average |
|---|---|---|---|---|---|
| Human Demo (Default, 20 demos) | 6.1% | 16.7% | 13.9% | 0.8% | 9.4% |
| Human Demo (Enhanced) | 15.0% | 36.1% | 2.5% | 0.0% | 13.4% |
| DexMimicGen (Default, 1 seed) | 30.3% | 38.9% | 28.2% | 28.3% | 31.4% |
| DexMimicGen (Enhanced, 10 seeds) | 50.3% | 44.4% | 43.7% | 42.5% | 45.2% |
| DexFlyWheel (1 seed) | 90.0% | 85.8% | 79.4% | 72.5% | 81.9% |

Using only a single seed demonstration, DexFlyWheel achieves 81.9%, substantially outperforming DexMimicGen (45.2%), which starts from 10× more seed demonstrations.

Data Generation Efficiency

| Method | Time per Trajectory | Total Time for 500 Successful Trajectories | Generation Success Rate |
|---|---|---|---|
| Human Teleoperation | 60 s | 12.5 h | - |
| DexMimicGen | 15 s | 4.4 h | 63.0% |
| DexFlyWheel | 15 s | 2.4 h | 89.8% |
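As a rough sanity check, if every attempt takes the listed per-trajectory time, DexFlyWheel needs about \(500 / 0.898 \approx 557\) attempts to obtain 500 successes:

\[557 \times 15\,\text{s} \approx 8{,}350\,\text{s} \approx 2.3\,\text{h},\]

which is roughly consistent with the reported 2.4 h; the remainder presumably covers resets and other overhead.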

Ablation Study

| Variant | Impact |
|---|---|
| w/o Residual Policy | Largest performance drop; object generalization degrades sharply (20 → 8.25 object categories) |
| w/o Data Augmentation \(\mathcal{A}_{\text{EP}}\) | Limited environment and spatial diversity |
| w/o Residual + w/o Augmentation | Base policy only; worst overall performance |

Real-World Deployment

The policies are transferred to a dual-arm Real-Man robot via a Digital Twin:

  • Dual-arm Lift: 78.3% success rate
  • Handover: 63.3% success rate

Key Findings

  1. Pronounced flywheel effect: Data diversity and policy performance improve jointly in each iteration, forming a virtuous cycle.
  2. Residual RL is the cornerstone: On average, it improves object generalization by 32.1% (\(\pi_{\text{base}} \rightarrow \pi_{\text{combined}}\)).
  3. A single demonstration suffices: The flywheel can be bootstrapped from one human demonstration without requiring large-scale human data collection.
  4. Successful sim-to-real transfer: Reliable simulation-to-real transfer is achieved via the Digital Twin approach.

Highlights & Insights

  1. Transplanting LLM self-improvement to robotics: Inspired by iterative self-improvement in LLMs (RLHF loops), DexFlyWheel applies this paradigm to robotic data generation — an elegant cross-domain knowledge transfer.
  2. Elegant design of residual RL: Rather than learning manipulation skills from scratch, the residual policy fine-tunes atop imitation learning, aligning with the physical intuition that manipulating different objects requires only minor adjustments.
  3. Stability through progressive scheduling: Linear scheduling of \(\epsilon\) from 0 to 1 prevents the residual policy's early-stage exploration from disrupting the base policy.
  4. Minimal human effort: Each task requires only 1 demonstration → 2,000+ diverse demonstrations, achieving extremely high data efficiency.

Limitations & Future Work

  1. Manually designed reward functions: The residual RL relies on hand-crafted task rewards; future work could explore LLM-driven reward generation.
  2. Absence of tactile feedback: The current policy and simulation do not incorporate tactile signals, limiting performance on contact-rich tasks.
  3. Limited number of iterations: Only 3 iterations are demonstrated; whether additional iterations lead to convergence or degradation remains an open question.
  4. OmniGibson sim-to-real gap: Despite photorealistic rendering, the sim-to-real gap may still constrain transfer for more complex tasks.
  • Core distinction from DexMimicGen: DexMimicGen only performs replay-and-edit (spatial transformations) on trajectories and cannot explore new strategies; DexFlyWheel achieves policy-level adaptation via residual RL.
  • Complementarity with pure RL: Rather than learning from scratch with RL (which suffers from exploration difficulty), the framework combines IL as a prior with RL for fine-tuning — a mutually complementary design.
  • Generality of the data flywheel: The closed-loop paradigm of IL → residual RL → rollout → augmentation is generalizable to other robotic manipulation tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ (Data flywheel concept is novel; the residual RL + IL combination is elegantly designed)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (4 tasks × multiple baselines × ablations × real-world deployment — exceptionally comprehensive)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, rich figures and tables)
  • Value: ⭐⭐⭐⭐⭐ (Addresses the core bottleneck of data scarcity in dexterous manipulation with practical deployment value)