Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation¶

Conference: CVPR 2025
arXiv: 2406.14235
Code: Project Page
Area: Robotics
Keywords: robotic manipulation, visual pre-training, human-robot domain gap, contrastive alignment, parameter-efficient adapter

TL;DR¶

This paper proposes the HR-Align adaptation paradigm, which leverages paired human-robot video data and a contrastive alignment loss to bridge the semantic discrepancy between models pre-trained on human data and the robot domain in a parameter-efficient manner. It improves the average success rate by 7%+ across 20 simulation tasks and 5 real-world tasks.

Background & Motivation¶

Background: Visual representation learning in the field of robotic manipulation faces severe data scarcity. Existing approaches leverage large-scale human activity datasets (such as Ego4D and Kinetics) to pre-train visual models, which are then used as frozen visual backbones for downstream robot policy learning.

Key Challenge: Humans and robots differ substantially in morphology, which produces a "human-robot domain gap": representations that pre-trained models learn from human data transfer poorly to the robot domain.

Limitations of Prior Work:

Manipulation-oriented pretext tasks (e.g., hand detection): These methods define pretext tasks on human data to indirectly adapt pre-trained models, but they lack explicit exposure to robot data and fail to directly mitigate the domain gap.
Downstream fine-tuning (fine-tuning in each downstream environment): This requires customizing pre-trained models for each different environment, which sacrifices the generality of the models.

Key Insight: This paper proposes a new "adaptation paradigm" that leverages existing paired human-robot demonstration datasets (such as the RH20T dataset) as a bridge to mitigate the domain gap while maintaining model generality. The core insight is that the dynamic semantics of human and robot demonstrations in paired data are aligned, and this alignment can be used to guide the adaptation.

Method¶

Overall Architecture¶

HR-Align (Human-Robot Semantic Alignment) adopts a three-stream architecture:

Frozen Human Stream: The frozen pre-trained model \(\mathcal{F}\) extracts human video features \(h^f\).
Frozen Robot Stream: The same frozen model extracts robot video features \(r^f\) (unadapted, serving as a negative reference).
Adapted Robot Stream: Learnable adapter modules are injected into the pre-trained model to extract adapted robot video features \(r^t\).

After aggregating the three-stream features via task-aware attention, the adapter parameters are trained using a contrastive alignment loss.

Key Designs¶

1. Parameter-Efficient Adapter Module¶

Lightweight adapters are inserted into the intermediate layers of the pre-trained model, using a residual structure to perform feature adaptation:

\[r^{t,next} = r^{f,inter} + \text{Conv}_{up}(g(\text{Conv}_{down}(r^{f,inter})))\]

\(\text{Conv}_{down}\): Channel dimension reduction convolution.
\(g\): Activation function.
\(\text{Conv}_{up}\): Channel dimension expansion convolution.
Only the adapter parameters are trained while keeping the pre-trained backbone frozen, achieving parameter-efficient adaptation.
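
The residual update above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation. It assumes a kernel size of 1 (so each convolution reduces to a linear map over channels at every position), ReLU for \(g\), and a zero-initialised up-projection. The names `adapter`, `w_down`, and `w_up` are hypothetical.

```python
def matvec(weight, vec):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def relu(vec):
    """Element-wise ReLU; an assumed choice for the activation g."""
    return [max(0.0, v) for v in vec]


def adapter(features, w_down, w_up):
    """Residual bottleneck adapter applied independently at every position.

    features: list of per-position channel vectors (length C each).
    w_down:   (C_bottleneck x C) matrix, stands in for Conv_down with kernel size 1.
    w_up:     (C x C_bottleneck) matrix, stands in for Conv_up with kernel size 1.
    """
    adapted = []
    for x in features:
        delta = matvec(w_up, relu(matvec(w_down, x)))
        adapted.append([xi + di for xi, di in zip(x, delta)])  # residual connection
    return adapted


# Toy example: C = 4 channels, bottleneck of 2, three positions.
features = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0], [0.0, 0.0, 1.0, 1.0]]
w_down = [[0.1, 0.0, -0.1, 0.2], [0.0, 0.3, 0.0, -0.1]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]  # assumed zero init: adapter starts as identity

identity_out = adapter(features, w_down, w_up_zero)
```

With the zero-initialised up-projection, the adapted stream starts out identical to the frozen stream, so any later difference between \(r^t\) and \(r^f\) comes only from training the adapter weights.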

2. Task-Aware Feature Modeling¶

Task description text is introduced as a query, and task-relevant semantics are extracted from video spatiotemporal features through an attention mechanism:

A frozen DistilBert is used to encode the task description \(L\), obtaining the query \(l\).
Attention weights \(\mathcal{A}^t = \text{softmax}(r^t \cdot l)\) are computed for the video features of each stream (shown here for the adapted robot stream).
Task-aware features are obtained through weighted aggregation: \(\bar{r}^t = (r^t)^T \cdot \mathcal{A}^t\).
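
The pooling step can be illustrated with a small sketch. It assumes the video features are flattened into a list of spatiotemporal tokens and that the text query already has the same dimension as each token; how the actual model matches those dimensions is not specified here. The function name `task_aware_pool` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def task_aware_pool(tokens, query):
    """Pool N token features (each of length C) into one C-vector using a text query."""
    # One relevance score per token: dot product with the task query l.
    scores = [sum(t * q for t, q in zip(token, query)) for token in tokens]
    weights = softmax(scores)  # attention weights over tokens, summing to 1
    channels = len(query)
    # Weighted sum of tokens, computed channel by channel.
    pooled = [sum(w * token[c] for w, token in zip(weights, tokens)) for c in range(channels)]
    return pooled, weights


# Toy example: three tokens with two channels each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
pooled, weights = task_aware_pool(tokens, query)
```

In this toy case the first and third tokens align with the query and receive equal, larger weights than the second. A zero query would reduce the operation to plain mean pooling.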

3. Human-Robot Contrastive Alignment Loss¶

A bidirectional contrastive loss is designed to constrain the adaptation process, incorporating two core principles:

Principle 1: For paired human-robot videos, the adapted robot feature \(\bar{r}_i^t\) should be more consistent with the human feature \(\bar{h}_i^f\) than the unadapted one \(\bar{r}_i^f\).

Principle 2: Paired human-robot features should be more similar than unpaired ones within a batch (the standard contrastive learning paradigm).

\[\mathcal{L} = \frac{1}{2M}\sum_{i=1}^{M} -\log\frac{\mathcal{S}(\bar{h}_i^f, \bar{r}_i^t)}{\mathcal{S}(\bar{h}_i^f, \bar{r}_i^t) + \mathcal{S}(\bar{h}_i^f, \bar{r}_i^f) + \sum_{j \neq i}\mathcal{S}(\bar{h}_i^f, \bar{r}_j^t)} + \text{symmetric term}\]

where \(\mathcal{S}(x,y) = \exp(x^T y / \tau)\), \(\tau=0.1\). The unique aspect of this loss is that the "unadapted robot feature" \(\bar{r}_i^f\) is also included in the denominator as a negative sample, directly penalizing the domain gap.
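
To make the role of the extra negative concrete, the sketch below evaluates only the human-to-robot term that is written out in the formula. The symmetric term is not spelled out in this summary, so it is left out rather than guessed. The toy features are assumed to be unit-normalised, and the function names are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def sim(x, y, tau=0.1):
    """S(x, y) = exp(x^T y / tau)."""
    return math.exp(dot(x, y) / tau)


def human_to_robot_term(h_f, r_t, r_f, tau=0.1):
    """Batch mean of -log(pos / (pos + gap_neg + batch_negs)).

    h_f: frozen human features, r_t: adapted robot features,
    r_f: frozen (unadapted) robot features; each is a list of M vectors.
    """
    total = 0.0
    for i, h in enumerate(h_f):
        pos = sim(h, r_t[i], tau)      # paired, adapted robot feature
        gap_neg = sim(h, r_f[i], tau)  # same pair, but unadapted (Principle 1)
        batch_negs = sum(sim(h, r_t[j], tau) for j in range(len(r_t)) if j != i)  # Principle 2
        total += -math.log(pos / (pos + gap_neg + batch_negs))
    return total / len(h_f)


# Toy batch of M = 2 pairs with unit-norm 2-D features.
h_f = [[1.0, 0.0], [0.0, 1.0]]
r_f = [[0.6, 0.8], [0.8, 0.6]]      # unadapted robot features, poorly aligned with h_f
r_t_aligned = [[1.0, 0.0], [0.0, 1.0]]
r_t_unchanged = [list(v) for v in r_f]

loss_aligned = human_to_robot_term(h_f, r_t_aligned, r_f)
loss_unchanged = human_to_robot_term(h_f, r_t_unchanged, r_f)
```

When the adapter leaves the features unchanged, the positive and the unadapted negative have equal similarity, so each per-sample term is at least \(\log 2\). The loss can only fall below that by moving \(\bar{r}_i^t\) closer to \(\bar{h}_i^f\) than \(\bar{r}_i^f\) is.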

Loss & Training¶

The total loss consists solely of the human-robot contrastive alignment loss described above. Training optimizes only the adapter parameters and the parameters of the task-aware linear layer.

Training Configuration: Adam optimizer, lr=\(1 \times 10^{-4}\), batch size=200, approximately 8k steps, 4×NVIDIA A6000.
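
For reference, the stated hyperparameters can be collected into a small config object. The class and field names below are hypothetical, and the sketch assumes that 200 is the global batch size and that every step draws a full batch of human-robot pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationConfig:
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    batch_size: int = 200      # assumed to be the global batch size
    train_steps: int = 8_000   # "approximately 8k steps"
    num_gpus: int = 4          # NVIDIA A6000
    temperature: float = 0.1   # tau in the contrastive loss


cfg = AdaptationConfig()

# Rough training budget implied by these numbers.
pairs_seen = cfg.batch_size * cfg.train_steps
# Assumes all 56k paired RH20T videos mentioned in this summary are used for adaptation.
approx_passes = pairs_seen / 56_000
```

Under these assumptions the adapter sees about 1.6 million pair samples, or a little under 29 passes over the paired data.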

Key Experimental Results¶

Main Results¶

Setting	Model	Baseline	+HR-Align	Gain
Adroit Single-task (2 tasks)	D4R	63.0%	65.0%	+2.0%
Adroit Single-task (2 tasks)	R3M	74.0%	81.3%	+7.3%
RLBench Multi-task (18 tasks)	D4R	55.3%	59.9%	+4.6%
RLBench Multi-task (18 tasks)	R3M	50.3%	59.2%	+8.9%
Real-world (5 tasks)	D4R	—	—	+13%
Real-world (5 tasks)	R3M	—	—	+11%

Ablation Study¶

Method	learned params	pen	relocate	Avg
R3M (Frozen)	0M	78.0	70.0	74.0
R3M-PreT (Continued Human Pre-training)	25M	78.0	77.3	77.7
R3M-ClS (Action Classification Fine-tuning)	25M	—	—	Worse
R3M-Align (Ours)	Small	81.3	81.3	81.3

Key Findings¶

Consistently Effective Across Models: Significant improvements are achieved on both R3M and D4R, two models with completely different pre-training methods, validating the generality of the proposed method.
Greater Improvement in Multi-task Settings: R3M improves by 8.9% on average over the 18 RLBench tasks in the multi-task setting, indicating that the adapted model generalizes better across diverse tasks.
Significant Improvement in Real-world Environments: Success rates improve by 11-13% on real-world tasks, more than in the simulation benchmarks, which suggests that the domain gap is more severe in real-world scenarios.
Parameter-efficient: Significant improvements are achieved by training only a small number of adapter parameters, without requiring full-parameter fine-tuning.

Highlights & Insights¶

Paradigm Innovation: This work is the first to propose an adaptation paradigm using "paired data bridging", which sits between frozen utilization and downstream fine-tuning, balancing both generality and domain adaptation.
Ingenious Negative Sample Design: The unadapted robot features are incorporated as additional negative samples in contrastive learning, directly quantifying and penalizing the domain gap.
Low Cost, High Return: Adaptation is completed using only 56k paired videos from an existing community dataset (RH20T), without requiring any additional data collection.
Non-intrusive to Downstream Tasks: The adapted model serves as a general visual backbone, eliminating the need for customization for each downstream environment.

Limitations & Future Work¶

Dependency on Paired Demonstration Data: The method requires paired videos of humans and robots performing the same tasks. Although public datasets exist, data availability remains limited.
Visual Discrepancy Between Adaptation and Downstream Environments: Robot demonstration data in the adaptation stage differs in visual appearance from downstream robot environments, relying solely on "isomorphic robot morphology" to close the gap.
Adapter Position Limited to the Last Layer: The study only validates inserting the adapter in the last layer, without fully exploring the effects of inserting adapters at different layer depths.
Image Resolution and Frame Constraints: The resolution and frame count used in the experiments are relatively low (5 frames), which may limit temporal modeling capabilities.

Related Work & Insights¶

R3M, MVP, data4robotics: Three major human-data pre-training baselines. This work performs domain adaptation on top of them.
RH20T Dataset: Provides high-quality paired human-robot demonstration data, making the proposed method possible.
Parameter-Efficient Fine-Tuning (PEFT): The adapter design draws inspiration from the PEFT methodology in NLP and CV.
Insight: The "domain gap" is a neglected yet critical challenge in embodied AI. Using paired data as a bridge is a general strategy that can be extended to transfer learning across different embodiments.

Rating ⭐¶

Dimension	Score
Novelty	⭐⭐⭐⭐
Technical Depth	⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐⭐
Engineering Practicality	⭐⭐⭐⭐
Overall Recommendation	⭐⭐⭐⭐