Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents

Conference: NeurIPS 2025
arXiv: 2502.01218
Code: https://actol-pretrain.github.io/
Area: Reinforcement Learning
Keywords: Vision-language pretraining, embodied intelligence, imitation learning, temporal consistency, Brownian bridge

TL;DR

This paper proposes AcTOL, which learns ordered and continuous vision-language representations through a vision-language ordering (VLO) loss and a Brownian bridge constraint, without relying on a rigid goal-reaching assumption. In simulation it improves over the strongest baseline by 43% on Franka Kitchen and 15% on Metaworld, and it raises real-robot success rates on all three tested manipulation tasks.

Background & Motivation

State of the Field

Background: Pretraining vision-language representations on human action videos to reduce dependence on expert robot demonstrations is a promising direction. Methods such as R3M, LIV, and DecisionNCE employ temporal contrastive learning.

Limitations of Prior Work: Existing methods rely on a "goal-reaching" assumption: the semantic alignment between the language instruction and the video frames is taken to improve monotonically toward later frames. In real videos, however, the action may finish early or be followed by irrelevant content, which leads to erroneous vision-language associations.

Key Challenge: Real human action videos are coarsely annotated and noisy, rendering rigid assumptions invalid.

Core Idea: Exploiting the intrinsic temporal consistency of videos, the learned representations are required to satisfy ordering (frames closer in time exhibit smaller semantic divergence) and continuity (representations of adjacent frames transition smoothly).

Method

Key Designs

Vision-Language Ordering (VLO) Loss:
- Mechanism: For an anchor frame \(o_i\) and any other frame \(o_j\), the semantic alignment differential is defined as \(\mathfrak{R}(\mathbf{v}_i, \mathbf{v}_j, \mathbf{l}) = -\|\text{sim}(\mathbf{v}_i, \mathbf{l}) - \text{sim}(\mathbf{v}_j, \mathbf{l})\|_2\). For a frame pair \((o_j, o_k)\), ordering requires the frame that is temporally closer to the anchor to have the larger (less negative) differential.
- The negative sample set \(\mathcal{N}_{i,j}\) selects frames that are temporally farther away, and contrastive training is performed using an InfoNCE-style loss.
- Theoretical guarantee: When \(\mathcal{L}_{VLO}\) approaches its lower bound \(\mathcal{L}^*\), the learned representations provably satisfy the VLO property.
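The VLO bullets above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the cosine similarity, the `temperature` parameter, and the exact form of the negative set (all frames at least as far from the anchor as the positive, which includes the positive itself) are assumptions made here for concreteness.

```python
import numpy as np

def cosine_sim(frames, text):
    """Cosine similarity between each frame embedding (T, D) and one text embedding (D,)."""
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    text = text / np.linalg.norm(text)
    return frames @ text

def vlo_loss(frames, text, temperature=1.0):
    """InfoNCE-style ordering loss over all (anchor i, positive j) pairs (hypothetical sketch)."""
    T = frames.shape[0]
    s = cosine_sim(frames, text)                          # (T,) alignment scores
    # Semantic alignment differential R(v_i, v_j, l) = -|sim_i - sim_j|.
    R = -np.abs(s[:, None] - s[None, :]) / temperature    # (T, T)
    t = np.arange(T)
    dist = np.abs(t[:, None] - t[None, :])                # temporal distances
    total, count = 0.0, 0
    for i in range(T):
        for j in range(T):
            if i == j:
                continue
            # Assumed negative set: frames at least as far from the anchor as j.
            mask = (dist[i] >= dist[i, j]) & (t != i)
            log_denom = np.log(np.sum(np.exp(R[i, mask])))
            total += -(R[i, j] - log_denom)
            count += 1
    return total / count
```

Because the positive term also appears in the denominator under this assumption, each per-pair term is non-negative, and the loss shrinks as temporally closer frames obtain differentials nearer to zero than farther ones.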
Brownian Bridge Constraint:
- The trajectory of frame representations between two endpoint frames is modeled as a Brownian bridge process: the mean follows linear interpolation between the endpoints, and the variance vanishes at the endpoints and peaks at the midpoint.
- Loss: \(\mathcal{L}_{BB} = \frac{1}{T}\sum_{t} \frac{1}{2\text{Var}[\mathbf{B}(t)]}\|\mathbf{v}_t - \mathbb{E}[\mathbf{B}(t)]\|^2\)
- Enforces local smoothness in the visual representation space.
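A minimal NumPy sketch of the Brownian bridge loss above. It assumes the bridge is pinned at the first and last frames of a sampled clip, uses the standard bridge variance \((t - t_s)(t_e - t)/(t_e - t_s)\), and averages over interior frames only (the variance is zero at the endpoints). The function name and these choices are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np

def brownian_bridge_loss(frames):
    """Penalize interior frames for deviating from the bridge mean, weighted by 1 / (2 Var)."""
    T = frames.shape[0]
    if T < 3:
        return 0.0                                   # no interior frames to constrain
    t = np.arange(1, T - 1, dtype=float)             # interior time steps
    span = float(T - 1)                              # t_e - t_s with t_s = 0
    alpha = (t / span)[:, None]
    # Bridge mean: linear interpolation between the two endpoint embeddings.
    mean = (1.0 - alpha) * frames[0] + alpha * frames[-1]
    # Bridge variance: zero at the endpoints, largest at the midpoint.
    var = t * (span - t) / span
    sq_err = np.sum((frames[1:-1] - mean) ** 2, axis=1)
    return float(np.mean(sq_err / (2.0 * var)))
```

Under these assumptions, a perfectly linear trajectory gives zero loss, and a deviation near the midpoint is penalized less than the same deviation next to an endpoint, because the variance weight is larger there.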
Language Robustness: It is theoretically proven that, under language perturbations satisfying \(\|\mathbf{l} - \mathbf{l}'\| \leq \delta_l\), the change in semantic alignment is bounded by \(2C\delta_l\).
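One way to see where a bound of this form can come from, under the assumption (not stated in this summary) that \(C\) is a Lipschitz constant of \(\text{sim}(\mathbf{v}, \cdot)\) in its language argument: write \(a = \text{sim}(\mathbf{v}_i, \mathbf{l}) - \text{sim}(\mathbf{v}_j, \mathbf{l})\) and \(a' = \text{sim}(\mathbf{v}_i, \mathbf{l}') - \text{sim}(\mathbf{v}_j, \mathbf{l}')\). The reverse triangle inequality followed by the triangle inequality then gives

\[
\left|\mathfrak{R}(\mathbf{v}_i, \mathbf{v}_j, \mathbf{l}) - \mathfrak{R}(\mathbf{v}_i, \mathbf{v}_j, \mathbf{l}')\right|
= \big|\,|a| - |a'|\,\big|
\leq |a - a'|
\leq \left|\text{sim}(\mathbf{v}_i, \mathbf{l}) - \text{sim}(\mathbf{v}_i, \mathbf{l}')\right| + \left|\text{sim}(\mathbf{v}_j, \mathbf{l}) - \text{sim}(\mathbf{v}_j, \mathbf{l}')\right|
\leq 2C\|\mathbf{l} - \mathbf{l}'\|
\leq 2C\delta_l .
\]

This is only a sketch of the argument's shape; the paper's own proof and its definition of \(C\) may differ.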

Key Experimental Results

Main Results: Simulation Success Rate (15 demos)

Method	Franka Kitchen	Metaworld
CLIP	27.47	60.33
R3M	42.20	56.50
LIV	42.73	64.33
DecisionNCE	43.20	59.08
AcTOL w/o BB	54.20	70.83
AcTOL	61.80 (+43%)	74.13 (+15%)
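The percentages in parentheses appear to be relative gains over the strongest baseline in each column, which is an inference from the numbers rather than something this summary states. The short check below reproduces them from the table values; the dictionary layout and helper names are illustrative.

```python
# Success rates (%) copied from the table above.
success = {
    "CLIP":         {"Franka Kitchen": 27.47, "Metaworld": 60.33},
    "R3M":          {"Franka Kitchen": 42.20, "Metaworld": 56.50},
    "LIV":          {"Franka Kitchen": 42.73, "Metaworld": 64.33},
    "DecisionNCE":  {"Franka Kitchen": 43.20, "Metaworld": 59.08},
    "AcTOL w/o BB": {"Franka Kitchen": 54.20, "Metaworld": 70.83},
    "AcTOL":        {"Franka Kitchen": 61.80, "Metaworld": 74.13},
}
BASELINES = ["CLIP", "R3M", "LIV", "DecisionNCE"]

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

def best_baseline(benchmark):
    """Highest success rate among the prior methods on one benchmark."""
    return max(success[m][benchmark] for m in BASELINES)

gains = {
    b: relative_gain(success["AcTOL"][b], best_baseline(b))
    for b in ("Franka Kitchen", "Metaworld")
}
# Effect of the Brownian bridge term on Franka Kitchen.
bb_points = success["AcTOL"]["Franka Kitchen"] - success["AcTOL w/o BB"]["Franka Kitchen"]
bb_relative = relative_gain(success["AcTOL"]["Franka Kitchen"],
                            success["AcTOL w/o BB"]["Franka Kitchen"])
```

Here `gains` rounds to 43 and 15, matching the table. The ablation gap is 7.6 percentage points in absolute terms, or roughly 14% relative.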

Real Robot: Unitree D1

Method	Pick Cup	Open Drawer	Close Drawer
DecisionNCE	20%	40%	60%
AcTOL	50%	80%	90%

Key Findings

The Brownian bridge constraint contributes substantially (AcTOL vs. w/o BB: 61.80 vs. 54.20 on Franka Kitchen, a gain of 7.6 percentage points).
AcTOL exhibits virtually no performance degradation under language perturbations, whereas LIV degrades by 11.9%.
With as few as 5 demonstrations, AcTOL surpasses competing methods that use 15–25 demonstrations.

Highlights & Insights

The key innovation is dropping the assumption that the final frame is the goal: only relative temporal distances between frames constrain the representations, which yields greater robustness.
The idea of using a Brownian bridge as a continuity regularizer is elegant: it naturally introduces uncertainty modeling into temporal representations.

Limitations & Future Work

The approach may not generalize to cyclic or repetitive actions (e.g., stirring), where the temporal ordering assumption does not hold.
Pretraining is conducted solely on EPIC-KITCHENS-100; performance on larger-scale datasets remains unvalidated.

Rating

Novelty: ⭐⭐⭐⭐⭐ The combination of VLO and Brownian bridge is novel and theoretically grounded.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers simulation, real-world, robustness, ablation, and fine-tuning.
Writing Quality: ⭐⭐⭐⭐⭐ Theoretical analysis is rigorous and experimental design is meticulous.
Value: ⭐⭐⭐⭐⭐ Significantly advances the frontier of embodied pretraining.