Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

Conference: NeurIPS 2025 arXiv: 2509.18631 Code: Project Page Area: Robotics / Sim-to-Real / Domain Adaptation Keywords: sim-to-real, optimal transport, domain adaptation, behavior cloning, robotic manipulation

TL;DR

This paper proposes a sim-and-real policy co-training framework based on Unbalanced Optimal Transport (UOT). It aligns the joint observation-action distribution (rather than only the marginal observation distribution) and adds a temporally aligned sampling strategy to handle data imbalance, achieving roughly a 30% improvement in OOD generalization on robotic manipulation tasks.

Background & Motivation

Background: Behavior cloning (BC) requires large amounts of costly real-world demonstrations, while simulators can generate data at low cost but suffer from visual/sensor gaps between simulation and reality.

Limitations of Prior Work: Naively mixing sim and real data for co-training lacks explicit feature-space constraints, leading to severe performance degradation in OOD scenarios; marginal distribution alignment methods such as MMD are too coarse and disrupt task-relevant feature structure.

Key Challenge: Simulated data is abundant but distribution-mismatched, while real data is scarce but distribution-accurate; the two sources are also severely imbalanced in quantity (\(N_{sim} \gg N_{real}\)).

Key Insight: Align the joint observation-action distribution via optimal transport rather than aligning observations alone, thereby preserving action-relevant feature structure.

Core Idea: Apply unbalanced OT to handle partial overlap and data imbalance between sim and real joint distributions, augmented with DTW-based temporal alignment sampling to improve mini-batch quality.

Method

Overall Architecture

Behavior cloning and UOT regularization are trained in parallel: \(L = L_{BC}(f_\phi, \pi_\theta) + \lambda \cdot L_{UOT}(f_\phi)\). The encoder \(f_\phi\) is shared; the BC loss trains the policy while the UOT loss aligns the sim-real feature space.
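The combined objective can be sketched as follows. This is a minimal illustration, not the paper's implementation: the MSE behavior-cloning term, the weighting value `lam`, and the function names are assumptions.

```python
import numpy as np

def bc_loss(pred_actions, demo_actions):
    # Behavior-cloning term: MSE between policy outputs and demonstrated
    # actions (assumes a continuous action space; the paper's policy head
    # may differ).
    return float(np.mean((pred_actions - demo_actions) ** 2))

def total_loss(pred_actions, demo_actions, uot_term, lam=0.1):
    # Combined objective L = L_BC + lambda * L_UOT; lam is a hypothetical
    # weighting value, not taken from the paper.
    return bc_loss(pred_actions, demo_actions) + lam * uot_term
```

Both terms backpropagate through the shared encoder \(f_\phi\) in the actual framework; here the UOT term is simply passed in as a scalar.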

Key Designs

  1. Action-Aware Feature Alignment via Optimal Transport:

    • Function: Aligns the joint distribution \((z, x)\) rather than \(z\) alone (\(z\) = features, \(x\) = proprioceptive state/action).
    • Ground cost: \(C_\phi = \alpha_1 \cdot d_\mathcal{Z}(z^i_{src}, z^j_{tgt}) + \alpha_2 \cdot d_\mathcal{A}(x^i_{src}, x^j_{tgt})\)
    • Design Motivation: The joint distribution preserves the structure of "which features correspond to which actions," providing finer-grained alignment than marginal matching.
  2. Unbalanced Optimal Transport (UOT):

    • Function: Relaxes the strict marginal constraints of standard OT to handle the \(N_{sim} \gg N_{real}\) data imbalance.
    • Mechanism: \(L_{UOT} = \min_\Pi \langle\Pi, C_\phi\rangle_F + \epsilon\Omega(\Pi) + \tau \text{KL}(\Pi\mathbf{1}\|\mathbf{p}) + \tau \text{KL}(\Pi^\top\mathbf{1}\|\mathbf{q})\)
    • Design Motivation: Standard OT forces every simulated sample to be matched to a real sample; however, many simulated states are not covered in the real domain. UOT allows partial mass to remain untransported.
  3. Temporally Aligned Sampling Strategy:

    • Function: Uses DTW to compute trajectory similarity and constructs mini-batches via similarity-weighted sampling.
    • Weight: \(w(\xi_{src}, \xi_{tgt}) = \frac{1}{1+e^{10\cdot(\bar{d}-0.01)}}\), where \(\bar{d}\) is the normalized DTW distance between the two trajectories.
    • Design Motivation: Randomly sampled mini-batches yield poorly matched sim-real pairs; DTW alignment produces more semantically coherent sample pairings.
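Designs 1 and 2 above can be sketched together: a pairwise ground cost over (feature, action) pairs, fed to a generalized Sinkhorn loop for entropic unbalanced OT. The solver form, Euclidean metrics, and the values of \(\epsilon\), \(\tau\), \(\alpha_1\), \(\alpha_2\) are assumptions here, not the paper's settings.

```python
import numpy as np

def ground_cost(z_src, z_tgt, x_src, x_tgt, a1=1.0, a2=1.0):
    # C_phi[i, j] = a1 * d_Z(z_src[i], z_tgt[j]) + a2 * d_A(x_src[i], x_tgt[j])
    # Euclidean distances are an assumption; the paper may use other metrics.
    dz = np.linalg.norm(z_src[:, None, :] - z_tgt[None, :, :], axis=-1)
    dx = np.linalg.norm(x_src[:, None, :] - x_tgt[None, :, :], axis=-1)
    return a1 * dz + a2 * dx

def uot(C, p, q, eps=0.05, tau=1.0, n_iter=300):
    # Generalized Sinkhorn iterations for entropy-regularized unbalanced OT:
    # the KL marginal penalties soften the scaling updates by the exponent
    # tau / (tau + eps), which lets some mass remain untransported.
    K = np.exp(-C / eps)                      # Gibbs kernel
    u, v = np.ones(len(p)), np.ones(len(q))
    fi = tau / (tau + eps)
    for _ in range(n_iter):
        u = (p / (K @ v)) ** fi
        v = (q / (K.T @ u)) ** fi
    Pi = u[:, None] * K * v[None, :]          # transport plan
    return float(np.sum(Pi * C)), Pi
```

As \(\tau \to \infty\) the KL penalties become hard constraints and this reduces to standard entropic OT; smaller \(\tau\) tolerates the \(N_{sim} \gg N_{real}\) mass mismatch.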
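Design 3 can be sketched as a plain dynamic-programming DTW, the sigmoid weight from the formula above, and weighted sampling over candidate sim-real trajectory pairs. The length normalization and the normalize-then-sample step are assumptions for illustration; the constants k=10 and d0=0.01 match the weight formula.

```python
import numpy as np

def dtw_distance(a, b):
    # Classic dynamic-programming DTW between trajectories a, b of shape
    # [T, d]; normalized by path length so different-length trajectories
    # are comparable (normalization scheme is an assumption).
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def pair_weight(d_bar, k=10.0, d0=0.01):
    # Sigmoid weight w = 1 / (1 + exp(k * (d_bar - d0))): well-aligned
    # pairs (small d_bar) get weight near 1, poorly aligned pairs near 0.
    return 1.0 / (1.0 + np.exp(k * (d_bar - d0)))

def sample_pairs(weights, n, rng):
    # Similarity-weighted mini-batch sampling over candidate pairs.
    p = weights / weights.sum()
    return rng.choice(len(weights), size=n, p=p)
```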

Key Experimental Results

Sim-to-Sim Domain Adaptation

| Method | In-Distribution Success Rate | OOD Success Rate |
| --- | --- | --- |
| Co-training | 0.71 | 0.28 |
| MMD Alignment | — | 0.50 (but hurts source domain) |
| Ours (UOT) | 0.78 | 0.36 |
| Target-only | — | 0.0 |

Sim-to-Real Domain Adaptation

| Method | Image Policy | Point Cloud Policy |
| --- | --- | --- |
| Co-training | 0.55 | 0.60 |
| Ours | 0.73 | 0.77 |

Key Findings

  • Joint distribution alignment outperforms marginal alignment (MMD); t-SNE visualizations clearly show better mixing of source and target features.
  • Scaling from 100 to 1000 simulated trajectories yields a 25% improvement in OOD performance, confirming the value of additional simulated data.
  • The framework supports both RGB image and point cloud modalities.

Highlights & Insights

  • Theoretical Advantage of Joint Distribution Alignment: Preserving the action-feature structure yields richer alignment signal than marginal matching.
  • Elegant Handling of Data Imbalance via UOT: UOT does not force every simulated sample to find a real counterpart, allowing partial mass to remain untransported.
  • Cross-Modal Validation: The same framework is effective for both RGB and point cloud inputs, demonstrating the generality of the approach.

Limitations & Future Work

  • Only addresses the visual gap; dynamics discrepancies are not handled (quasi-static assumption).
  • Requires a small amount of structured real demonstrations rather than unlabeled data.
  • Tasks are limited to quasi-static grasping and manipulation; dynamic, contact-rich tasks remain untested.

Comparison with Related Methods

  • vs. DeepJDOT: DeepJDOT uses pseudo-labels for joint distribution alignment, whereas this work uses ground-truth actions/states combined with UOT to handle imbalance.
  • vs. Co-training: Co-training lacks explicit constraints and suffers severe OOD collapse; the proposed method resolves this via explicit OT alignment.
  • vs. MMD: MMD aligns only marginal distributions, which is insufficiently fine-grained; joint distribution alignment achieves higher precision.

Rating

  • Novelty: ⭐⭐⭐⭐ The UOT + DTW sampling combination is novel, and the joint distribution alignment perspective is theoretically grounded.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers sim-to-sim and sim-to-real settings, image and point cloud modalities, with sufficient ablations and visualizations.
  • Writing Quality: ⭐⭐⭐⭐ Method motivation is clear and the modular design is easy to follow.
  • Value: ⭐⭐⭐⭐ Offers significant practical value for sim-to-real robot learning.