LAOF: Robust Latent Action Learning with Optical Flow Constraints¶
Conference: CVPR 2026 · arXiv: 2511.16407 · Code: GitHub · Area: Video Understanding · Keywords: Latent action learning, optical flow constraints, embodied intelligence, imitation learning, video pretraining
TL;DR¶
This paper proposes the LAOF framework, which leverages agent optical flow as a pseudo-supervision signal to constrain latent action learning, yielding latent action representations that are more robust to distractors. LAOF substantially outperforms unsupervised baselines on LIBERO and PROCGEN, and matches or surpasses supervised methods that use 1% action labels, while requiring no labels at all.
Background & Motivation¶
Learning latent action representations from large-scale, action-label-free videos is a key pathway toward scalable embodied foundation models. The LAPO paradigm jointly trains latent actions via an autoencoding framework combining an Inverse Dynamics Model (IDM) and a Forward Dynamics Model (FDM), and has been adopted in large-scale embodied models such as LAPA and GR00T N1.
Core Problem: LAPO implicitly assumes that all changes between consecutive frames are caused by the agent's actions. However, real-world videos contain substantial action-irrelevant distractors (e.g., moving background objects, stochastic environmental changes), and a purely reconstruction-based objective may entangle latent actions with visual appearance.
Existing solutions:

- Sparse action-label supervision (LAOM, villa-X): alternating training under extreme label scarcity is unstable and prone to overfitting.
- Discretization via VQ-VAE: the information bottleneck helps suppress distractors but limits expressiveness.
Core Insight: Optical flow provides pixel-level inter-frame motion information, naturally suppressing static backgrounds while emphasizing moving objects. Pretrained optical flow models already exhibit strong cross-scene generalization. Optical flow thus serves as a pseudo-supervision signal highly correlated with actions, requiring no manual annotation.
Method¶
Overall Architecture¶
Three-stage training pipeline:

1. Pretraining: Jointly train the IDM, FDM, and optical flow decoder on unlabeled videos.
2. Distillation: Distill IDM knowledge into a latent action policy that takes only the current frame.
3. Fine-tuning: Train an action decoder (latent action → physical action) on a small number of action labels.
Key Designs¶
- Optical Flow Pseudo-Supervision Constraint (see the first sketch after this list)
    - Function: Constrains latent actions to capture genuine physical motion via an optical flow decoder.
    - Mechanism: A dedicated optical flow decoder \(d_{flow}: \mathcal{Z} \rightarrow \mathcal{F}_{rgb}\) maps latent actions directly to optical flow features. Pseudo-labels are generated by a pretrained RAFT model, converted to RGB format, and encoded by DINOv2. Pretraining loss: \(\mathcal{L}_{pretrain} = \mathcal{L}_{reconstruction} + \mathcal{L}_{flow}\).
    - Design Motivation: Optical flow is strongly correlated with actions (moving objects are typically action outcomes). Using it as an auxiliary decoding target constrains the latent action space to be physically consistent and prevents latent actions from degenerating into appearance encodings.
- RGB-Format Optical Flow Processing (second sketch below)
    - Function: Makes optical flow compatible with the DINOv2 visual encoder.
    - Mechanism: Flow vectors \((u, v)\) are converted to polar coordinates; direction is mapped to HSV hue and magnitude to saturation and value, followed by a standard HSV→RGB conversion. Magnitude normalization: \(m_{norm} = \min(1.0,\ m/(\sigma\sqrt{H^2+W^2}))\), where \(H, W\) are the frame height and width and \(\sigma\) is a scale hyperparameter.
    - Design Motivation: Encoding both observations and optical flow with the same DINOv2 encoder eliminates the need for an additional flow encoder.
- Object-Centric Optical Flow (third sketch below)
    - Function: Extracts agent-relevant optical flow in scenes with dynamic distractors.
    - Mechanism: For static-background scenes (e.g., robotic manipulation), global optical flow naturally focuses on agent motion. For scenes with dynamic distractors (e.g., games), LangSAM generates object masks to filter out irrelevant motion: \(f_{rgb,t}^{sam} = \mathrm{mask}_t \odot f_{rgb,t}^{all}\).
    - Design Motivation: Adaptive selection between global and object-centric optical flow accommodates diverse scene types while maintaining generality.
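A minimal PyTorch-style sketch of the constrained pretraining step (first sketch). The module names (`idm`, `fdm`, `flow_decoder`) and the use of MSE for both terms are illustrative assumptions, not the authors' implementation:

```python
import torch.nn.functional as F

def pretrain_step(idm, fdm, flow_decoder, obs_t, obs_t1, flow_feat_t):
    """One LAOF-style pretraining step (names and loss form are assumptions).

    obs_t, obs_t1 : encoded observations (e.g., DINOv2 features) at t and t+1
    flow_feat_t   : DINOv2 features of the RGB-rendered RAFT flow (pseudo-label)
    """
    z_t = idm(obs_t, obs_t1)        # IDM infers the latent action from the frame pair
    obs_t1_hat = fdm(obs_t, z_t)    # FDM reconstructs the next observation from (obs_t, z_t)
    flow_hat = flow_decoder(z_t)    # flow decoder reads motion from z_t alone

    loss_rec = F.mse_loss(obs_t1_hat, obs_t1)      # L_reconstruction
    loss_flow = F.mse_loss(flow_hat, flow_feat_t)  # L_flow
    return loss_rec + loss_flow                    # L_pretrain
```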
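The flow-to-RGB rendering (second sketch) is straightforward with OpenCV; \(\sigma\) is treated here as a hyperparameter whose paper value is not assumed:

```python
import numpy as np
import cv2

def flow_to_rgb(flow: np.ndarray, sigma: float = 0.5) -> np.ndarray:
    """Render an (H, W, 2) flow field as an RGB image for the DINOv2 encoder.

    Direction -> hue; normalized magnitude -> saturation and value, following
    the mapping described above (sigma=0.5 is a placeholder default).
    """
    u = flow[..., 0].astype(np.float32)
    v = flow[..., 1].astype(np.float32)
    mag, ang = cv2.cartToPolar(u, v, angleInDegrees=True)  # polar coordinates

    H, W = flow.shape[:2]
    mag_norm = np.minimum(1.0, mag / (sigma * np.sqrt(H**2 + W**2)))

    hsv = np.zeros((H, W, 3), dtype=np.uint8)
    hsv[..., 0] = (ang / 2.0).astype(np.uint8)       # OpenCV 8-bit hue range is [0, 179]
    hsv[..., 1] = (mag_norm * 255).astype(np.uint8)  # magnitude -> saturation
    hsv[..., 2] = (mag_norm * 255).astype(np.uint8)  # magnitude -> value
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```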
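Object-centric filtering (third sketch) reduces to an elementwise product; the binary mask is assumed to come from a segmentation model (the paper uses LangSAM; mask extraction is elided here):

```python
import numpy as np

def object_centric_flow(flow_rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Implements f_sam = mask ⊙ f_all: zero out motion outside the object mask.

    flow_rgb : (H, W, 3) RGB-rendered flow
    mask     : (H, W) binary map from a segmentation model
    """
    return flow_rgb * mask[..., None]  # broadcast the mask across RGB channels
```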
Loss & Training¶
- Pure LAOF: \(\mathcal{L}_{pretrain} = \mathcal{L}_{reconstruction} + \mathcal{L}_{flow}\)
- LAOF-Action (with sparse labels): \(\mathcal{L}_{pretrain} = \mathcal{L}_{reconstruction} + (1-\lambda)\mathcal{L}_{flow} + \lambda\mathcal{L}_{action}\), where \(\lambda = M/(N+M)\)
- Distillation: \(\mathcal{L}_{distillation} = \|\hat{z}_t - z_t\|_2\), where \(\hat{z}_t = \pi(s_t, l_t)\) is the latent action predicted by the policy from the current observation \(s_t\) (and instruction \(l_t\)), and \(z_t\) is the IDM's output.
- Fine-tuning: \(\mathcal{L}_{action} = \|\hat{a}_t - a_t\|_2\), where \(\hat{a}_t = d_{action}(z_t)\) decodes the latent action into a physical action.
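A sketch of the adaptive weighting; reading \(M\) as the labeled and \(N\) as the unlabeled sample count is an interpretation consistent with the label-ratio behavior reported below, not something stated explicitly in this summary:

```python
def laof_action_loss(loss_rec, loss_flow, loss_action,
                     num_labeled: int, num_unlabeled: int):
    """Combine the three terms with lambda = M / (N + M).

    M = num_labeled, N = num_unlabeled (our interpretation). With no labels,
    lambda = 0 and this reduces to the pure-LAOF pretraining objective.
    """
    lam = num_labeled / (num_unlabeled + num_labeled)
    return loss_rec + (1 - lam) * loss_flow + lam * loss_action
```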
Key Experimental Results¶
Main Results — LIBERO Imitation Learning¶
| Method | SPATIAL | OBJECT | GOAL | LONG | Avg. Gain vs. LAPO (pp) |
|---|---|---|---|---|---|
| LAPO | 80.4% | 81.2% | 84.0% | 44.7% | Baseline |
| CoMo | 74.1% | 87.6% | 80.8% | 49.9% | +0.5 |
| CoMo w/ OF | 76.2% | 89.7% | 82.6% | 57.9% | +4.0 |
| LAOF | 82.5% | 85.3% | 87.2% | 52.0% | +4.2 |
| LAOM-Action (1% labels) | 86.0% | 91.1% | 86.3% | 61.6% | +8.7 |
| LAOF-Action (1% labels) | 88.2% | 95.9% | 88.6% | 63.7% | +11.5 |
Ablation Study — Optical Flow Constraint Placement¶
| Configuration | Performance | Note |
|---|---|---|
| Directly connected to latent action | Best | Flow decoder decodes directly from \(z\) |
| Constrained via FDM | Second best | Indirect constraint reduces effectiveness |
| No optical flow constraint | Baseline | Original LAPO |
Label Ratio Scaling Experiments¶
| Action Label Ratio | LAOF(-Action) vs. LAOM-Action |
|---|---|
| 0% | Unsupervised LAOF matches or exceeds LAOM-Action at 1% labels |
| 1% | LAOF-Action significantly outperforms |
| 5% | Further gains persist |
| 10% | Optical flow constraint remains effective |
Key Findings¶
- Unsupervised LAOF matches or exceeds LAOM-Action using 1% labels, demonstrating the strong effectiveness of optical flow pseudo-supervision.
- The optical flow constraint remains beneficial even when the label ratio increases to 10%, indicating that the two signals are complementary rather than redundant.
- Continuous latent actions consistently outperform discrete VQ-VAE representations on both benchmarks.
- The proposed latent action evaluation metric is strongly correlated with downstream task performance (Pearson correlation: 0.83 / 0.73).
Highlights & Insights¶
- The choice of optical flow as a pseudo-supervision signal is both natural and effective: pixel-level motion is the direct visual consequence of actions.
- RGB-format optical flow unifies the processing pipeline for observations and motion, requiring only a single visual encoder.
- The adaptive weighting scheme in LAOF-Action (\(\lambda = M/(N+M)\)) automatically balances the two signals as a function of label ratio.
- As an extension of the LAPO paradigm, LAOF can be directly integrated into existing embodied foundation model training pipelines.
Limitations & Future Work¶
- Relies on a pretrained optical flow model (RAFT); estimation errors propagate as noisy labels.
- Object-centric optical flow depends on LangSAM segmentation quality, which may degrade in complex scenes.
- Validation is limited to LIBERO (robotic manipulation) and PROCGEN (2D games); real-world scenarios remain untested.
- The three-stage training pipeline increases engineering complexity.
Related Work & Insights¶
- vs. LAPO: LAPO implicitly assumes a static background; LAOF explicitly handles dynamic distractors via optical flow.
- vs. LAOM: LAOM requires action labels and suffers from unstable alternating training; LAOF achieves more stable training using label-free optical flow.
- vs. FlowVLA (concurrent): FlowVLA discretizes optical flow into tokens for world model training; LAOF uses continuous optical flow to constrain latent action learning, pursuing a distinct objective.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The idea of using optical flow as pseudo-supervision for latent actions is natural and effective, though the primary contribution lies in empirical validation rather than a conceptual breakthrough.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers LIBERO + PROCGEN, continuous vs. discrete representations, label ratio sweeps, and detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ — Method is clearly presented with precise problem formulation and a well-organized three-stage pipeline.
- Value: ⭐⭐⭐⭐ — Offers practical guidance for pretraining embodied foundation models.