Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild¶
Conference: CVPR 2026 · arXiv: 2602.21736 · Code: https://research.beingbeyond.com/jala · Area: Multimodal VLM · Keywords: VLA pretraining, latent action, human video, hand motion, robot manipulation
TL;DR¶
This paper proposes the JALA framework, which constructs a unified latent action space via joint alignment between predictive embeddings and latent actions inferred by an inverse dynamics model, enabling VLAs to learn simultaneously from labeled data and unlabeled in-the-wild human videos. Combined with the 7.5M-sample UniHand-Mix dataset, JALA significantly improves the generalization of robot manipulation policies.
Background & Motivation¶
Background: Vision-Language-Action (VLA) models adapt vision-language models to robotic data to learn manipulation policies, yet the scale and diversity of robot data remain far behind those of vision and language domains.
Limitations of Prior Work: Exploiting human video data involves a quality–diversity trade-off—laboratory data provides accurate hand tracking but limited scene diversity, while in-the-wild videos offer rich diversity but lack action annotations.
Key Challenge: Prior latent action methods (e.g., LAPA) rely on an inverse dynamics model (IDM) to infer latent actions and a forward dynamics model (FDM) to reconstruct future frames. However, reconstructing fine-grained hand manipulation in video is extremely difficult, and the quality bottleneck of the FDM in turn degrades the quality of the inferred latent actions.
Goal: To extract useful action signals from heterogeneous annotated and unannotated human videos for VLA pretraining, without relying on visual reconstruction.
Key Insight: Humans learn manipulation through transferable action patterns rather than by memorizing visual details. Latent actions should be predictable from context and consistent with inverse dynamics, but need not reconstruct pixels.
Core Idea: Replace reconstruction with joint alignment—the intermediate hidden states (predictive embeddings) of the VLA are simultaneously aligned with both action labels and IDM-inferred latent actions.
Method¶
Overall Architecture¶
A Transformer-based VLA processes visual inputs, instructions, and motion tokens → applies Masked Chunk Prediction (MCP) on motion chunks to learn action patterns → infers latent actions from boundary frames via a Latent Action Perceiver (LAP) → aligns VLA hidden states (predictive embeddings) with latent actions → transfers to robot tasks via a flow-matching head during post-training.
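To make the flow concrete, here is a minimal PyTorch-style sketch of one pretraining step under this architecture. All module, attribute, and batch-key names (`vla`, `lap`, `motion_head`, `frame_t`, `frame_t_delta`) are illustrative assumptions, not the authors' released code.

```python
import torch

def jala_pretrain_step(vla, lap, batch):
    """One pretraining step (illustrative sketch; names are hypothetical)."""
    # VLA backbone encodes vision + instruction + partially masked motion
    # tokens into per-token hidden states (the "predictive embeddings").
    h = vla(batch["frames"], batch["instruction"],
            batch["motion_tokens"], batch["mask"])            # (B, K, D)

    # Masked Chunk Prediction: cross-entropy on masked motion tokens,
    # available only where action labels exist (labeled subset).
    logits = vla.motion_head(h)                               # (B, K, V)
    loss_mcp = torch.nn.functional.cross_entropy(
        logits[batch["mask"]], batch["motion_tokens"][batch["mask"]])

    # Latent Action Perceiver infers latent actions from the motion
    # boundary frames (v_t, v_{t+delta}); applicable to any video.
    z = lap(batch["frame_t"], batch["frame_t_delta"])         # (B, K, D)

    # Joint alignment: L1 between predictive embeddings and latent
    # actions (the stop-gradient on z is an assumption of this sketch).
    loss_align = (h - z.detach()).abs().mean()

    return loss_mcp + loss_align
```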
Key Designs¶
- Joint Alignment:
- Function: Aligns predictive embeddings simultaneously with motion labels and visual dynamics.
- Mechanism: The VLA hidden state \(h_{i,k}\) must satisfy two constraints: (a) predicting the correct motion token \(a_{i,k}\) via MCP, and (b) aligning with the latent action \(z_{i,k}\) generated by LAP: \(\mathcal{L}_{Align} = \sum_{i,k} \|h_{i,k} - z_{i,k}\|_1\)
- Design Motivation: MCP provides supervision when action labels are available; LAP provides visual dynamics signals applicable to any video. The two are complementary and together form a unified action space (the combined objective is written out after this list).
- Latent Action Perceiver (LAP) + Latent State Perceiver (LSP) with Decoupled Updates:
- Function: Stably extracts latent actions from visual features and aligns them with VLA context.
- Mechanism: LAP and LSP share a Perceiver architecture. LAP processes motion boundary frames \((v_t, v_{t+\delta})\) to generate latent actions; LSP processes initial frames to map VLA context into the same space. The two are decoupled via asymmetric EMA updates: backbone weights propagate from LSP→LAP, and query weights propagate from LAP→LSP.
- Design Motivation: Directly connecting feature spaces from different visual encoders leads to instability; decoupled EMA allows action anchoring and context prediction to converge progressively and independently (see the EMA sketch after this list).
- Hybrid Masked Chunk Prediction (Hybrid MCP):
- Function: Chunk-level action token prediction.
- Mechanism: One chunk is randomly selected as the primary prediction target; preceding chunks remain unmasked; tokens within the target chunk are masked at a random ratio; subsequent chunks are masked with 5% probability. At inference, multiple decoding passes are performed and ensembled.
- Design Motivation: Naive full masking causes a train–inference mismatch; the hybrid strategy keeps training contexts consistent with what the model sees at inference (a masking sketch follows this list).
- UniHand-Mix Dataset (7.5M samples):
- Function: Mixed pretraining data combining laboratory annotations and in-the-wild videos.
- Mechanism: 5M+ laboratory annotated samples (with precise MANO hand tracking) + 2.5M in-the-wild Ego4D samples (validated via hand detection and filtered by Gemini activity recognition).
- Data Scale: Over 2,000 hours of video, roughly 2.5× the size of UniHand, the previous largest dataset of its kind.
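Putting the two supervision signals together, the pretraining objective implied by the Joint Alignment item can be written as below; the loss weight \(\lambda\) and the masked-index set \(\mathcal{M}\) are notational assumptions, not taken from the paper:

\[
\mathcal{L}_{\text{pretrain}}
  = \underbrace{-\sum_{(i,k)\in\mathcal{M}} \log p_\theta\!\left(a_{i,k}\mid h_{i,k}\right)}_{\mathcal{L}_{\text{MCP}}\ \text{(labeled data only)}}
  \;+\; \lambda \underbrace{\sum_{i,k}\bigl\|h_{i,k} - z_{i,k}\bigr\|_1}_{\mathcal{L}_{\text{Align}}\ \text{(any video)}}
\]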
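The asymmetric EMA update can be sketched as follows; the split into `backbone` submodules and `queries` parameter tensors, and the decay `tau`, are assumptions for illustration.

```python
import torch

@torch.no_grad()
def decoupled_ema_update(lap, lsp, tau=0.999):
    """Asymmetric EMA between the two Perceivers (illustrative sketch).

    Backbone weights flow LSP -> LAP, so the action extractor tracks the
    trainable context branch; query weights flow LAP -> LSP, so context
    prediction stays anchored to the inverse-dynamics queries.
    """
    # Backbone: LAP's backbone becomes an EMA of LSP's backbone.
    for p_lap, p_lsp in zip(lap.backbone.parameters(),
                            lsp.backbone.parameters()):
        p_lap.mul_(tau).add_(p_lsp, alpha=1.0 - tau)

    # Queries (assumed stored as single Parameter tensors): LAP -> LSP.
    lsp.queries.mul_(tau).add_(lap.queries, alpha=1.0 - tau)
```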
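And a sketch of the hybrid masking schedule; the chunk layout, the uniform draw of the mask ratio, and the `p_future` name are assumptions:

```python
import torch

def hybrid_mcp_mask(num_chunks, chunk_len, p_future=0.05):
    """Boolean mask over motion tokens (True = masked prediction target)."""
    mask = torch.zeros(num_chunks, chunk_len, dtype=torch.bool)

    # Pick one chunk as the primary prediction target; chunks before it
    # stay fully visible as clean context.
    target = torch.randint(num_chunks, (1,)).item()

    # Mask tokens inside the target chunk at a random ratio.
    ratio = torch.rand(()).item()
    mask[target] = torch.rand(chunk_len) < ratio

    # Mask subsequent chunks sparsely (5%) so training resembles
    # inference, where future chunks are unknown.
    if target + 1 < num_chunks:
        mask[target + 1:] = (
            torch.rand(num_chunks - target - 1, chunk_len) < p_future)

    return mask
```

At inference, several such decoding passes can be run and their predictions ensembled, as noted in the item above.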
Post-Training Transfer¶
A Diffusion Transformer (DiT) flow-matching head converts predictive embeddings into robot actions, integrating pretrained knowledge via cross-attention.
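As a rough illustration of this transfer head, a stripped-down flow-matching module with cross-attention to the predictive embeddings might look like the following; the layer sizes, the single attention layer standing in for a full DiT, and the linear interpolation path are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Minimal flow-matching action head (sketch, not the paper's DiT)."""
    def __init__(self, act_dim, embed_dim, hidden=512):
        super().__init__()
        self.in_proj = nn.Linear(act_dim + 1, hidden)   # noisy action + time t
        self.ctx_proj = nn.Linear(embed_dim, hidden)    # predictive embeddings
        self.attn = nn.MultiheadAttention(hidden, 8, batch_first=True)
        self.out = nn.Linear(hidden, act_dim)

    def forward(self, noisy_action, t, context):
        x = self.in_proj(torch.cat([noisy_action, t], dim=-1))
        ctx = self.ctx_proj(context)
        x, _ = self.attn(x, ctx, ctx)                   # cross-attention
        return self.out(x)                              # predicted velocity

def flow_matching_loss(head, action, context):
    # Linear path x_t = (1 - t) * noise + t * action;
    # the regression target is the constant velocity (action - noise).
    noise = torch.randn_like(action)
    t = torch.rand(action.shape[0], action.shape[1], 1)
    x_t = (1 - t) * noise + t * action
    v = head(x_t, t, context)
    return ((v - (action - noise)) ** 2).mean()
```

During post-training, the head regresses this velocity field while the pretrained backbone supplies `context`; at test time, actions are produced by integrating the predicted velocity from noise.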
Key Experimental Results¶
Robot Manipulation (Libero Benchmark)¶
| Method | Params | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Avg. |
|---|---|---|---|---|---|---|
| OpenVLA | 7B | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| π0 | 3B | 76.9 | 96.0 | 89.4 | 68.2 | 82.6 |
| JALA | ~2B | Superior | Superior | Superior | Superior | Surpasses same-scale baselines |
Hand Motion Generation (Laboratory vs. In-the-Wild)¶
| Method | Lab FID↓ | In-the-Wild FID↓ |
|---|---|---|
| Being-H0 (lab only) | Better | Worse |
| JALA | Maintained | Significantly improved |
Key Findings¶
- JALA generates more realistic hand motion in in-the-wild scenarios while maintaining laboratory-level performance.
- Compared to training on laboratory data alone, mixed training consistently improves performance across all Libero sub-tasks.
- Joint alignment outperforms using MCP or LAP individually.
- JALA achieves strong performance on real-world robot tasks, particularly in out-of-distribution settings.
Highlights & Insights¶
- Bypassing FDM reconstruction is the key innovation: aligning embeddings rather than reconstructing pixels eliminates the primary quality bottleneck.
- The decoupled EMA update is an elegant design: the backbone handles context while the queries anchor action representations, playing to the strengths of each.
- UniHand-Mix is currently the largest pretraining dataset for hand manipulation, at 7.5M samples.
- The transfer pipeline from human video to robot manipulation (pretraining → flow-matching post-training) is concise and efficient.
Limitations & Future Work¶
- Pseudo hand-pose annotations for in-the-wild videos are filtered at a confidence threshold of 0.65, which may still admit noisy labels.
- The MANO parametric representation limits modeling of non-hand manipulations (e.g., tool use).
- Videos in UniHand-Mix are predominantly egocentric; third-person-view human manipulation videos are not included.
- The variety and scale of real-world robot experiments leave room for further expansion.
Related Work & Insights¶
- vs. LAPA: LAPA constrains the latent action space via FDM reconstruction; JALA bypasses the reconstruction bottleneck through joint alignment.
- vs. Being-H0: Being-H0 relies solely on laboratory annotated data; JALA extends to in-the-wild videos via LAP.
- vs. OpenVLA/RoboVLM: These methods train directly on robot data; JALA acquires richer manipulation priors through human video pretraining.
- The joint alignment paradigm for latent actions generalizes to other settings that require learning actions from heterogeneous data sources.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The joint alignment paradigm fundamentally advances latent action learning.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated across hand generation, simulation, and real-world settings.
- Writing Quality: ⭐⭐⭐⭐ Method motivation is clearly derived and figures are intuitive.
- Value: ⭐⭐⭐⭐⭐ Provides a critical methodological foundation for scalable VLA pretraining from human video.