Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training¶
Conference: NeurIPS 2025 arXiv: 2509.18631 Code: Project Page Area: Robotics / Sim-to-Real / Domain Adaptation Keywords: sim-to-real, optimal transport, domain adaptation, behavior cloning, robotic manipulation
TL;DR¶
This paper proposes a sim-and-real policy co-training framework based on Unbalanced Optimal Transport (UOT). It aligns the joint observation-action distribution rather than only the marginal observation distribution, and adds a temporally aligned sampling strategy to handle sim-real data imbalance, achieving roughly a 30% improvement in OOD generalization on robotic manipulation tasks.
Background & Motivation¶
Background: Behavior cloning (BC) requires large amounts of costly real-world demonstrations, while simulators can generate data at low cost but suffer from visual/sensor gaps between simulation and reality.
Limitations of Prior Work: Naively mixing sim and real data for co-training lacks explicit feature-space constraints, leading to severe performance degradation in OOD scenarios; marginal distribution alignment methods such as MMD are too coarse and disrupt task-relevant feature structure.
Key Challenge: Simulated data is abundant but distribution-mismatched, real data is scarce but distribution-accurate, and the two sources are severely imbalanced in quantity (\(N_{sim} \gg N_{real}\)).
Key Insight: Align the joint observation-action distribution via optimal transport rather than aligning observations alone, thereby preserving action-relevant feature structure.
Core Idea: Apply unbalanced OT to handle partial overlap and data imbalance between sim and real joint distributions, augmented with DTW-based temporal alignment sampling to improve mini-batch quality.
Method¶
Overall Architecture¶
Behavior cloning and UOT regularization are trained in parallel: \(L = L_{BC}(f_\phi, \pi_\theta) + \lambda \cdot L_{UOT}(f_\phi)\). The encoder \(f_\phi\) is shared; the BC loss trains the policy while the UOT loss aligns the sim-real feature space.
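As a concrete sketch of this objective, the snippet below wires a shared encoder into both loss terms; the linear encoder, MSE behavior-cloning loss, and all names here are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: o = raw observation (16-d), z = f_phi(o) (8-d), a = action (4-d).
W_enc = rng.normal(scale=0.1, size=(16, 8))   # shared encoder f_phi (linear sketch)
W_pol = rng.normal(scale=0.1, size=(8, 4))    # policy head pi_theta

def f_phi(obs):
    """Shared feature encoder used by both the BC loss and the UOT loss."""
    return obs @ W_enc

def pi_theta(z):
    """Policy head mapping features to actions."""
    return z @ W_pol

def bc_loss(obs, expert_actions):
    """Behavior-cloning regression loss (MSE for illustration)."""
    return float(np.mean((pi_theta(f_phi(obs)) - expert_actions) ** 2))

def total_loss(obs_real, act_real, uot_cost, lam=0.1):
    """L = L_BC(f_phi, pi_theta) + lambda * L_UOT(f_phi).

    uot_cost stands in for the transport cost computed on encoder
    features of a mixed sim/real batch (a scalar here)."""
    return bc_loss(obs_real, act_real) + lam * uot_cost
```

Because the encoder is shared, gradients from the UOT term reshape the feature space that the policy head consumes; in a real implementation both terms would be differentiated jointly in an autodiff framework.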
Key Designs¶
- Action-Aware Feature Alignment via Optimal Transport:
  - Function: Aligns the joint distribution \((z, x)\) rather than \(z\) alone (\(z\) = encoder features, \(x\) = proprioceptive state/action).
  - Ground cost: \(C_\phi = \alpha_1 \cdot d_\mathcal{Z}(z^i_{src}, z^j_{tgt}) + \alpha_2 \cdot d_\mathcal{A}(x^i_{src}, x^j_{tgt})\)
  - Design Motivation: The joint distribution preserves the structure of "which features correspond to which actions," providing finer-grained alignment than marginal matching.
- Unbalanced Optimal Transport (UOT):
  - Function: Relaxes the strict marginal constraints of standard OT to handle the \(N_{sim} \gg N_{real}\) data imbalance.
  - Mechanism: \(L_{UOT} = \min_\Pi \langle\Pi, \hat{C}\rangle_F + \epsilon\Omega(\Pi) + \tau \text{KL}(\Pi\mathbf{1}\,\|\,\mathbf{p}) + \tau \text{KL}(\Pi^\top\mathbf{1}\,\|\,\mathbf{q})\), where \(\mathbf{p}, \mathbf{q}\) are the sim and real batch marginals and \(\Omega\) is the entropic regularizer.
  - Design Motivation: Standard OT forces every simulated sample to be matched to a real sample, yet many simulated states are not covered in the real domain; UOT allows part of the mass to remain untransported.
- Temporally Aligned Sampling Strategy:
  - Function: Uses DTW to compute trajectory similarity and constructs mini-batches via similarity-weighted sampling.
  - Weight: \(w(\xi_{src}, \xi_{tgt}) = \frac{1}{1+e^{10\cdot(\bar{d}-0.01)}}\), where \(\bar{d}\) is the mean DTW distance between the two trajectories.
  - Design Motivation: Randomly sampled mini-batches yield poorly matched sim-real pairs; DTW-based alignment produces more semantically coherent sample pairings.
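The three designs above can be sketched in NumPy: a joint (feature + action) ground cost, an entropic unbalanced Sinkhorn solver with KL-relaxed marginals, and the sigmoid DTW pair weight. Function names, hyperparameter values, and the Euclidean distances are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def joint_ground_cost(z_src, z_tgt, x_src, x_tgt, a1=1.0, a2=1.0):
    """Pairwise joint cost: feature distance plus action/state distance."""
    dz = np.linalg.norm(z_src[:, None, :] - z_tgt[None, :, :], axis=-1)
    dx = np.linalg.norm(x_src[:, None, :] - x_tgt[None, :, :], axis=-1)
    return a1 * dz + a2 * dx

def uot_sinkhorn(C, p, q, eps=0.1, tau=1.0, n_iter=500):
    """Entropic unbalanced Sinkhorn with KL-relaxed marginals.

    Uses the standard damped scaling updates: the exponent fi = tau/(tau+eps)
    softens the marginal constraints, so mass may stay untransported.
    Returns the plan P and the transport cost <P, C>."""
    K = np.exp(-C / eps)
    fi = tau / (tau + eps)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = (p / (K @ v)) ** fi
        v = (q / (K.T @ u)) ** fi
    P = u[:, None] * K * v[None, :]
    return P, float((P * C).sum())

def dtw_pair_weight(d_bar, k=10.0, d0=0.01):
    """Sigmoid sampling weight from a mean DTW distance d_bar:
    smaller distance -> weight near 1, larger distance -> weight near 0."""
    return 1.0 / (1.0 + np.exp(k * (d_bar - d0)))
```

Note that as tau grows, the KL penalties harden and the solver approaches balanced Sinkhorn, whose plan marginals match p and q; for moderate tau, total transported mass can drop below one, which is exactly the behavior that absorbs the sim-real imbalance.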
Key Experimental Results¶
Sim-to-Sim Domain Adaptation¶
| Method | In-Distribution Success Rate | OOD Success Rate |
|---|---|---|
| Co-training | 0.71 | 0.28 |
| MMD Alignment | — | 0.50 (but hurts source domain) |
| Ours (UOT) | 0.78 | 0.36 |
| Target-only | — | 0.0 |
Sim-to-Real Domain Adaptation¶
| Method | Image Policy | Point Cloud Policy |
|---|---|---|
| Co-training | 0.55 | 0.60 |
| Ours | 0.73 | 0.77 |
Key Findings¶
- Joint distribution alignment outperforms marginal alignment (MMD); t-SNE visualizations clearly show better mixing of source and target features.
- Scaling from 100 to 1000 simulated trajectories yields a 25% improvement in OOD performance, confirming the value of additional simulated data.
- The framework supports both RGB image and point cloud modalities.
Highlights & Insights¶
- Theoretical Advantage of Joint Distribution Alignment: Preserving the action-feature structure yields richer alignment signal than marginal matching.
- Elegant Handling of Data Imbalance via UOT: UOT does not force every simulated sample to find a real counterpart, allowing partial mass to remain untransported.
- Cross-Modal Validation: The same framework is effective for both RGB and point cloud inputs, demonstrating the generality of the approach.
Limitations & Future Work¶
- Only addresses the visual gap; dynamic discrepancies are not handled (quasi-static assumption).
- Requires a small amount of structured real demonstrations rather than unlabeled data.
- Tasks are limited to quasi-static grasping and manipulation; dynamic, contact-rich tasks remain untested.
Related Work & Insights¶
- vs. DeepJDOT: DeepJDOT uses pseudo-labels for joint distribution alignment, whereas this work uses ground-truth actions/states combined with UOT to handle imbalance.
- vs. Co-training: Co-training lacks explicit constraints and suffers severe OOD collapse; the proposed method resolves this via explicit OT alignment.
- vs. MMD: MMD aligns only marginal distributions, which is insufficiently fine-grained; joint distribution alignment achieves higher precision.
Rating¶
- Novelty: ⭐⭐⭐⭐ The UOT + DTW sampling combination is novel, and the joint distribution alignment perspective is theoretically grounded.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers sim-to-sim and sim-to-real settings, image and point cloud modalities, with sufficient ablations and visualizations.
- Writing Quality: ⭐⭐⭐⭐ Method motivation is clear and the modular design is easy to follow.
- Value: ⭐⭐⭐⭐ Offers significant practical value for sim-to-real robot learning.