LaDA: Language-Grounded Decoupled Action Representation for Robotic Manipulation

Conference: CVPR 2025
arXiv: 2603.12967
Code: TBD
Area: Robotic Manipulation
Keywords: Language-conditioned manipulation, action decoupling, motion primitives, contrastive learning, action representation, compositional generalization

TL;DR

LaDA decouples 7-DoF robot actions into three types of motion primitives (translation, rotation, and gripper) and aligns each with language semantics. With soft-label contrastive learning and adaptive loss weighting, it reaches a 93.6% average success rate on LIBERO with only 1.3B parameters.

Background & Motivation

Heterogeneity of Vision-Language-Action: Robotic manipulation faces a large semantic gap between high-level vision-language understanding ("pick up the red cup") and low-level motion control (joint angles/velocities).

Limitations of Prior Work: Most VLA methods directly predict actions as uninterpretable 7-dimensional vectors, ignoring the structured semantics within actions.

Unexploited Shared Motion Primitives: Different tasks share similar motion primitives (e.g., "push forward", "rotate right", "close gripper"), but existing methods fail to reuse these primitives across tasks.

Demand for Compositional Generalization: Recombining learned motion primitives in new tasks could substantially improve generalization to unseen scenarios.

Natural Correspondence Between Language and Motion: Verbs in language descriptions (push, pull, rotate) naturally correspond to specific types of motion primitives, a relationship that can be explicitly modeled.

Core Idea: Decouple 7-DoF actions into three types of primitives (translation/rotation/gripper) and ground each primitive in language semantics to achieve compositional generalization.

Method

Overall Architecture

LaDA consists of three core modules:

Action Decoupling: Decomposes the 7-DoF action (\(\Delta x, \Delta y, \Delta z, \Delta r_x, \Delta r_y, \Delta r_z, g\)) into three types of motion primitives: translation, rotation, and gripper.
Language Alignment: Aligns each type of motion primitive with the language semantic space via soft-label contrastive learning.
Adaptive Loss Weighting: Balances the training progress of the three decoupled branches.

Key Design 1: Motion Primitive Decoupling

Translation primitive: \([\Delta x, \Delta y, \Delta z]\), corresponding to spatial displacement.
Rotation primitive: \([\Delta r_x, \Delta r_y, \Delta r_z]\), corresponding to orientation changes.
Gripper primitive: \([g]\), corresponding to grasping/releasing actions.
Each primitive type is predicted by its own action head; all three heads share one vision-language backbone.
Decoupling lets the model learn finer-grained action semantics than a single 7-dimensional output.
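
The split described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Primitives`, `decouple`, `recompose`, `linear_head`, and `predict` are hypothetical, and each head is assumed to be a single linear layer over shared backbone features.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Primitives(NamedTuple):
    translation: List[float]  # [dx, dy, dz]
    rotation: List[float]     # [drx, dry, drz]
    gripper: List[float]      # [g]


def decouple(action: Sequence[float]) -> Primitives:
    """Split a 7-DoF action into translation, rotation, and gripper parts."""
    if len(action) != 7:
        raise ValueError("expected a 7-DoF action")
    a = list(action)
    return Primitives(a[0:3], a[3:6], a[6:7])


def recompose(p: Primitives) -> List[float]:
    """Concatenate the three primitives back into one 7-DoF action."""
    return p.translation + p.rotation + p.gripper


def linear_head(features: Sequence[float],
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """Hypothetical per-primitive head: one linear layer (assumption)."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


HeadParams = Tuple[Sequence[Sequence[float]], Sequence[float]]


def predict(features: Sequence[float], heads: Dict[str, HeadParams]) -> Primitives:
    """Run three independent heads on the same shared backbone features."""
    return Primitives(*(linear_head(features, *heads[name])
                        for name in Primitives._fields))


# Round trip: decoupling and recomposing leaves the action unchanged.
action = [0.02, 0.0, -0.01, 0.0, 0.1, 0.0, 1.0]
parts = decouple(action)
assert recompose(parts) == action
```

Because the heads only meet at the shared features, each branch can receive its own loss and its own language-alignment objective.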

Key Design 2: Soft-Label Contrastive Learning

Unlike traditional contrastive learning, which uses 0/1 hard labels, LaDA uses the semantic similarity between language instructions as soft labels.
For example, the translation primitives for "push the red cup" and "push the blue cup" should be similar, as their "push" semantics are close.
Semantic similarity is calculated via the cosine similarity of sentence embeddings.
This design allows the model to learn that semantically similar instructions should generate similar motion primitives.
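
A rough sketch of how such soft labels could enter a contrastive loss is given below. Only the use of sentence-embedding cosine similarity as the soft label comes from the summary above; the softmax normalization, the shared temperature of 0.1, the cross-entropy form, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(sentence_embs: Sequence[Vector], i: int,
                 temperature: float = 0.1) -> List[float]:
    """Soft labels for sample i from instruction-to-instruction similarity."""
    sims = [cosine(sentence_embs[i], s) for s in sentence_embs]
    return softmax(sims, temperature)


def soft_label_contrastive_loss(primitive_embs: Sequence[Vector],
                                text_embs: Sequence[Vector],
                                sentence_embs: Sequence[Vector],
                                temperature: float = 0.1) -> float:
    """Mean cross-entropy between soft targets and primitive-to-text matches."""
    n = len(primitive_embs)
    loss = 0.0
    for i in range(n):
        targets = soft_targets(sentence_embs, i, temperature)
        sims = [cosine(primitive_embs[i], t) for t in text_embs]
        probs = softmax(sims, temperature)
        loss -= sum(t * math.log(p) for t, p in zip(targets, probs))
    return loss / n


# Toy 2-D sentence embeddings: two "push" instructions and one "rotate".
sentences = [[1.0, 0.1], [1.0, -0.1], [0.0, 1.0]]
labels = soft_targets(sentences, 0)
# The other "push" instruction receives far more target mass than "rotate".
assert labels[1] > labels[2]
```

With hard labels the target row would be one-hot, so the second "push" instruction would be treated as a pure negative; the soft targets avoid penalizing that pair.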

Key Design 3: Adaptive Loss Weighting

The loss scales and training difficulties of the three primitive branches differ (e.g., rotation is typically harder to predict than translation).
Dynamically adjusting each branch's loss weight keeps any single branch from dominating training.
This keeps training progress balanced across the translation, rotation, and gripper primitives.
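
The summary does not state the exact weighting rule. The sketch below shows one plausible scheme, chosen here purely for illustration: each branch is weighted by the inverse of a running average of its loss, so that the weighted losses end up on a similar scale. The class name `AdaptiveLossWeights`, the momentum value, and the normalization are all assumptions, not the paper's confirmed method.

```python
from typing import Dict, Iterable


class AdaptiveLossWeights:
    """Inverse running-average loss weighting (illustrative assumption)."""

    def __init__(self, names: Iterable[str], momentum: float = 0.9,
                 eps: float = 1e-8) -> None:
        self.names = list(names)
        self.momentum = momentum
        self.eps = eps
        self.ema: Dict[str, float] = {}

    def update(self, losses: Dict[str, float]) -> Dict[str, float]:
        """Update running averages; return weights summing to len(names)."""
        for name in self.names:
            prev = self.ema.get(name, losses[name])
            self.ema[name] = (self.momentum * prev
                              + (1.0 - self.momentum) * losses[name])
        inverse = {n: 1.0 / (self.ema[n] + self.eps) for n in self.names}
        scale = len(self.names) / sum(inverse.values())
        return {n: scale * v for n, v in inverse.items()}


def total_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-branch losses."""
    return sum(weights[name] * losses[name] for name in losses)


# Branch losses on different scales, e.g. rotation harder than translation.
weigher = AdaptiveLossWeights(["translation", "rotation", "gripper"])
step_losses = {"translation": 0.1, "rotation": 1.0, "gripper": 0.5}
weights = weigher.update(step_losses)
# The large-loss rotation branch gets the smallest weight.
assert weights["rotation"] < weights["gripper"] < weights["translation"]
```

Other choices, such as learned uncertainty-based weights, would fit the same description; the point is only that the weights change during training rather than being fixed.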

Key Experimental Results

Main Results (LIBERO Benchmark)

Task Suite	LaDA (1.3B)	CLIP-RT (2.6B)	OpenVLA	RoboFlamingo
Spatial	96.4%	—	—	—
Object	97.8%	—	—	—
Goal	88.4%	—	—	—
Long-horizon	86.4%	—	—	—
Average	93.6%	~89%	~85%	~82%

LaDA has only 1.3B parameters, half as many as CLIP-RT (2.6B).

Ablation Study

Configuration	Average success rate (SR)	Description
Full LaDA	93.6%	All components
W/o Decoupling (Unified Prediction)	~88%	Unified 7-DoF prediction
W/o Soft Labels (Hard Label Contrastive)	~90%	0/1 hard labels
W/o Adaptive Weighting	~91%	Fixed weights
W/o Language Alignment	~87%	Regression loss only

Key Findings

Language alignment is the largest contributor: removing it costs about 6.6 pp (93.6% vs. ~87%).
Action decoupling contributes about 5.6 pp (93.6% vs. ~88% for unified prediction), showing that structured decomposition benefits performance.
Soft labels outperform hard labels by about 3.6 pp (93.6% vs. ~90%), supporting the use of linguistic similarity as a supervisory signal.
LaDA outperforms CLIP-RT (~89%) with half the parameters, suggesting that action decoupling combined with language alignment can be more effective than simply scaling up the model.
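
The percentage-point figures follow directly from the ablation table. A small check, treating the approximate ("~") table entries as exact values for the arithmetic (the function name `ablation_drops` is hypothetical):

```python
from typing import Dict


def ablation_drops(full: float, ablations: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point drop of each ablation relative to the full model."""
    return {name: round(full - sr, 1) for name, sr in ablations.items()}


# Average success rates (%) taken from the ablation table above.
full_sr = 93.6
ablated_sr = {
    "w/o language alignment": 87.0,
    "w/o decoupling": 88.0,
    "w/o soft labels": 90.0,
    "w/o adaptive weighting": 91.0,
}

drops = ablation_drops(full_sr, ablated_sr)
# Rank components by how much the success rate falls when they are removed.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
for name, drop in ranked:
    print(f"{name}: -{drop:.1f} pp")
```

Since the ablated numbers are approximate, the resulting drops (6.6, 5.6, 3.6, and 2.6 pp) are approximate as well.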

Highlights & Insights

Natural Correspondence Between Motion Primitives and Language: LaDA exploits the way human language already describes motion (verbs map onto primitive types), which leads to an elegant design.
Ingenious Soft Label Design: Using semantic similarity as a soft supervision signal for contrastive learning provides finer granularity than hard labels.
High Efficiency: Achieving SOTA performance with 1.3B parameters supports the claim that structured design can be more effective than brute-force model scaling.
Potential for Compositional Generalization: The three decoupled primitives can be "assembled" into actions required for new tasks.

Limitations & Future Work

Verified only in the LIBERO simulation environment; real-world robot experiments are missing.
The tripartite decomposition into translation, rotation, and gripper is specifically designed for 7-DoF; more complex robots (such as dexterous hands) require redesigned primitives.
Long-horizon success rate (86.4%) still has room for improvement, indicating that compositional reasoning over long sequences remains a challenge.
Negative transfer has not been explored: motion primitives that are similar yet distinct might interfere with each other.

Related Work & Insights

CLIP-RT: Uses CLIP to align vision, language, and action but does not decouple the action structure; it has 2.6B parameters.
OpenVLA: A generalist VLA model that directly predicts raw action vectors.
Insight: The concept of action decoupling can be extended to broader embodied AI tasks, such as navigation and realistic character animation.

Rating

Novelty: ⭐⭐⭐⭐ — The combination of motion primitive decoupling and language alignment is highly novel.
Experimental Thoroughness: ⭐⭐⭐☆ — Detailed ablation studies, but lacking real-world experiments.
Writing Quality: ⭐⭐⭐⭐ — Clear motivation and intuitive methodology.
Value: ⭐⭐⭐☆ — The model is small and highly efficient, but currently limited to simulation.
Overall Recommendation: ⭐⭐⭐⭐