
Human-assisted Robotic Policy Refinement via Action Preference Optimization

Conference: NeurIPS 2025 | arXiv: 2506.07127 | Code: GitHub | Area: LLM Alignment | Keywords: VLA models, preference alignment, human-robot collaboration, robotic manipulation, adaptive reweighting

TL;DR

This paper proposes Action Preference Optimization (APO), a human-robot collaboration framework for refining VLA models after deployment. Interactive trajectories collected under human intervention are used for preference alignment with binary desirability signals, grounded in prospect theory and combined with an adaptive reweighting scheme, so that the model learns from failures and improves iteratively.

Background & Motivation

Vision-Language-Action (VLA) models, as foundation models for robotic deployment, rely heavily on offline expert demonstration data and lack the capacity for continuous improvement after deployment. Existing approaches face two fundamental challenges:

Behavioral Cloning (BC): Cannot effectively exploit informative signals in failure trajectories, and suffers from distribution shift between expert data and interactive data when fine-tuning large-scale VLA models.

Reinforcement Learning (RL): Encounters gradient instability and difficulty in generalizing value functions when training large-scale VLA models.

Furthermore, transferring mature preference alignment methods from LLMs to VLA models introduces two core technical obstacles:

  • Irreversible interactions: The irreversibility of robotic manipulation makes it extremely difficult to collect paired positive and negative samples under identical conditions.
  • Token probability mismatch: Autoregressive VLA models discretize continuous actions into tokens, causing a fundamental mismatch between token probabilities and the true action regression loss.

Method

Overall Architecture

APO consists of two core components:

  1. Human-robot collaborative deployment framework: A human operator monitors robot execution in real time, intervenes to correct difficult scenarios, and simultaneously collects interactive trajectories.
  2. Action preference optimization process: An adaptive reweighting-based preference alignment algorithm that refines the VLA model.

The overall pipeline follows an iterative deploy-optimize loop: an initial policy \(\pi_\theta^0\) is obtained via behavioral cloning on expert data, followed by repeated cycles of deployment for interactive data collection and preference optimization for model updates.
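A minimal sketch of this loop, assuming hypothetical helpers `behavioral_cloning`, `deploy_and_collect`, and `preference_optimize` in place of the actual pipeline:

```python
import copy

# Sketch of APO's deploy-optimize loop; all helper names are hypothetical.
policy = behavioral_cloning(expert_data)            # initial policy pi_theta^0
for round_idx in range(num_rounds):
    # Deployment: a human monitors execution, intervenes in difficult
    # scenarios, and the interactive trajectories are recorded.
    trajectories = deploy_and_collect(policy)
    # Optimization: preference alignment against a frozen reference copy.
    reference = copy.deepcopy(policy)
    policy = preference_optimize(policy, reference, trajectories)
```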

Key Designs

Interactive trajectory annotation: Each action in a collected trajectory is annotated with a label \(c_t\):

  • \(c_t = 1\): action autonomously executed by the policy
  • \(c_t = 2\): action corrected by human intervention
  • \(c_t = 0\): actions within \(K\) steps prior to a human intervention (marked as undesirable)
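A minimal sketch of this labeling rule, assuming each trajectory records a per-step boolean intervention flag:

```python
def annotate_trajectory(intervened, K=10):
    """Assign c_t to each step of a collected trajectory.

    intervened: list of bools, True where the human corrected the action.
    Returns labels: 1 = autonomous, 2 = human-corrected,
    0 = autonomous step within K steps before an intervention (undesirable).
    """
    T = len(intervened)
    labels = [2 if intervened[t] else 1 for t in range(T)]
    for t in range(T):
        if intervened[t]:
            # Mark the autonomous steps leading up to this intervention.
            for s in range(max(0, t - K), t):
                if not intervened[s]:
                    labels[s] = 0
    return labels

# Example: an intervention at step 4 marks steps 0-3 as undesirable (K=10).
print(annotate_trajectory([False, False, False, False, True, False]))
# -> [0, 0, 0, 0, 2, 1]
```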

Prospect theory-based preference alignment: Drawing on Kahneman & Tversky's prospect theory, preference learning is performed using binary desirability signals rather than paired preference data. The reward function is estimated as:

\[r_\theta(o, \hat{a}) = \log \frac{\pi_\theta(\hat{a}|o)}{\pi_{ref}(\hat{a}|o)}\]

The utility function is then defined as:

\[v(o, \hat{a}) = \begin{cases} \lambda_D \sigma(r_\theta(o,\hat{a}) - z_0) & \text{if } \hat{a} \sim \hat{a}_{desirable} \\ \lambda_U \sigma(z_0 - r_\theta(o,\hat{a})) & \text{if } \hat{a} \sim \hat{a}_{undesirable} \end{cases}\]

where \(z_0 = KL(\pi_\theta || \pi_{ref})\) serves as a penalty term to prevent excessive deviation from the reference model.
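The following sketch computes the utility for a batch, given per-sample log-probabilities. Treating \(z_0\) as the detached batch mean of the log-ratio is an assumption about how the KL penalty is estimated, and `lam` is the per-sample coefficient produced by the reweighting step described below:

```python
import torch

def utility(logp_theta, logp_ref, desirable, lam):
    """Prospect-theory utility v(o, a_hat) for a batch of actions.

    logp_theta, logp_ref: log pi_theta(a_hat|o) and log pi_ref(a_hat|o), shape (B,)
    desirable: bool tensor marking desirable actions, shape (B,)
    lam: per-sample lambda_D or lambda_U coefficient, shape (B,)
    """
    r = logp_theta - logp_ref              # implicit reward log(pi_theta / pi_ref)
    z0 = r.mean().detach()                 # batch estimate of the KL penalty (assumption)
    return torch.where(desirable,
                       lam * torch.sigmoid(r - z0),   # desirable branch
                       lam * torch.sigmoid(z0 - r))   # undesirable branch
```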

Adaptive reweighting: To address the mismatch between token probabilities and continuous action regression, the L1 continuous action loss is computed for each sample and normalized at the batch level:

\[w_i = \frac{l_i}{\sum_{j=1}^{B} l_j}\]

The normalized weights adaptively modulate \(\lambda_D\) and \(\lambda_U\):

  • Desirable data: \(\lambda_D = 1 - e^{-\beta_D \cdot w}\) (samples with larger action prediction error receive higher weight)
  • Undesirable data: \(\lambda_U = e^{-\beta_U \cdot w}\) (samples closer to failure actions receive higher weight)
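A sketch of this reweighting step under the definitions above; the \(\beta_D, \beta_U\) values are unspecified hyperparameters, so the defaults below are placeholders:

```python
import torch

def adaptive_lambdas(l1_losses, desirable, beta_d=1.0, beta_u=1.0):
    """Map batch-normalized L1 action errors to per-sample lambda coefficients.

    l1_losses: continuous L1 action loss per sample, shape (B,)
    beta_d, beta_u: placeholder defaults; the paper's settings are not given here.
    """
    w = l1_losses / l1_losses.sum()        # w_i = l_i / sum_j l_j
    lam_d = 1.0 - torch.exp(-beta_d * w)   # larger action error -> higher weight
    lam_u = torch.exp(-beta_u * w)         # closer to the failure action -> higher weight
    return torch.where(desirable, lam_d, lam_u)
```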

Loss & Training

The final loss function is:

\[L(\pi_\theta, \pi_{ref}) = \mathbb{E}_{(o,\hat{a}) \sim D^h}\left[-v(o, \hat{a})\right]\]
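Composing the two sketches above, a single training step would then look roughly like this (assuming the per-sample log-probabilities and L1 losses have already been computed):

```python
lam = adaptive_lambdas(l1_losses, desirable)        # adaptive reweighting
v = utility(logp_theta, logp_ref, desirable, lam)   # prospect-theory utility
loss = -v.mean()                                    # L = E[-v(o, a_hat)]
loss.backward()
```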

Training employs balanced sampling: each batch contains 50% expert actions, 25% human-intervention actions, and 25% failure actions, ensuring balanced representation across data types.
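A simple sampler realizing this mix might look as follows (a sketch assuming each pool contains at least its share of the batch):

```python
import random

def balanced_batch(expert, corrected, failures, batch_size=8):
    """Draw a batch with 50% expert, 25% human-corrected, 25% failure samples."""
    n_expert = batch_size // 2
    n_corrected = batch_size // 4
    n_failure = batch_size - n_expert - n_corrected
    batch = (random.sample(expert, n_expert)
             + random.sample(corrected, n_corrected)
             + random.sample(failures, n_failure))
    random.shuffle(batch)
    return batch
```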

Implementation details: OpenVLA is fine-tuned using LoRA (rank=32) with a learning rate of 5e-5, batch size of 8, on 4 A100 GPUs; \(K=10\) is used for labeling undesirable behaviors.
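For concreteness, the reported settings map onto a standard PEFT setup roughly as below; `target_modules` and the LoRA arguments beyond the rank are assumptions, not values from the paper:

```python
import torch
from peft import LoraConfig, get_peft_model

# Reported: LoRA rank 32, lr 5e-5, batch size 8. Everything else here
# (alpha, dropout, target modules) is an assumed placeholder.
lora_cfg = LoraConfig(r=32, lora_alpha=32, lora_dropout=0.0,
                      target_modules="all-linear")
model = get_peft_model(base_vla_model, lora_cfg)    # base_vla_model: OpenVLA backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```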

Key Experimental Results

Main Results

RoboMimic simulation environment (4 long-horizon manipulation tasks, 50 trajectories/task, 50 evaluation episodes):

| Method | Coffee | StackThree | ThreePiece | Square | Avg. |
| --- | --- | --- | --- | --- | --- |
| Base policy | 44% | 46% | 44% | 28% | 40.5% |
| DAgger | 42% | 50% | 36% | 28% | 39.0% |
| DPO | 52% | 46% | 28% | 22% | 37.0% |
| KTO | 48% | 52% | 46% | 32% | 43.5% |
| APO | 60% | 54% | 46% | 32% | 48.0% |

Perturbation generalization (only 20 interactive trajectories + 20 expert demonstrations from the original task):

| Method | Position Perturb. | Background Perturb. | Texture Perturb. | Avg. |
| --- | --- | --- | --- | --- |
| Base policy | 12% | 42% | 10% | 21.3% |
| KTO | 20% | 46% | 6% | 24.0% |
| APO | 26% | 46% | 12% | 28.0% |

Real-world experiment (block insertion task):

| Method | In-distribution | Position Perturb. | Background Perturb. | Texture Perturb. |
| --- | --- | --- | --- | --- |
| Base policy | 65% | 25% | 10% | 25% |
| TPO | 75% | 40% | 20% | 45% |
| APO | 85% | 55% | 30% | 55% |

Ablation Study

Generalization across VLA architectures (π0-FAST model):

| Method | Coffee | StackThree | Insert Square |
| --- | --- | --- | --- |
| Base policy | 68% | 64% | 85% |
| DAgger | 64% | 66% | 85% |
| APO | 76% | 74% | 95% |

Lifelong learning experiment: The model is updated every 20 interactions. APO continues to improve performance even after gains from expert data saturate, while the human intervention rate gradually decreases.

Key Findings

  1. Behavioral cloning methods fail to surpass the base policy on VLA models, primarily due to distribution shift between expert and interactive trajectories.
  2. DPO, which relies on synthetic negative samples rather than real failure interactions, yields the worst performance.
  3. TPO uses randomly sampled negative examples, leading to training instability.
  4. Under perturbation scenarios, APO not only adapts to new settings but also improves performance on the original task, whereas other methods exhibit catastrophic forgetting.
  5. APO learns to autonomously recover from failures (e.g., re-grasping, iteratively adjusting positions).

Highlights & Insights

  1. Elegant problem decomposition: The two core obstacles to transferring LLM preference learning to VLA models—irreversible interactions and token probability mismatch—are addressed separately via prospect theory and adaptive reweighting, respectively.
  2. Practical human-robot collaboration paradigm: Human intervention simultaneously ensures deployment reliability and naturally generates the data required for preference learning.
  3. No paired preference data required: Following the spirit of KTO, only binary signals (desirable/undesirable) are needed rather than pairwise comparisons, substantially reducing annotation cost.
  4. Learning from failure: APO not only avoids undesirable actions but also actively self-corrects after encountering failures.

Limitations & Future Work

  1. Experiments are limited to autoregressive VLA models (OpenVLA and π0-FAST); generalization to regression-based or diffusion policy models remains unverified.
  2. Human intervention still requires real-time monitoring; reducing the cost of human involvement is an open problem.
  3. The undesirable behavior annotation window \(K=10\) is fixed and may need to be dynamically adjusted per task.
  4. Gains under perturbation scenarios are modest (e.g., texture perturbation improves from 10% to 12%), indicating substantial room for further robustness improvement.

Related Work

  • KTO: The core preference alignment idea in this work derives from KTO's prospect-theory-based optimization, augmented here with adaptive reweighting.
  • DPO/RLHF: Classic preference alignment methods that face the challenge of obtaining paired data in robotic settings.
  • GRAPE (TPO): A trajectory-level preference alignment method that requires paired trajectories under identical conditions, which is infeasible in real-world deployments.
  • Overall, this work transfers preference alignment from NLP and provides a transferable continual learning paradigm for VLA models in embodied intelligence.

Rating

  • Novelty: ⭐⭐⭐⭐ (transfers preference alignment to VLA while resolving key technical barriers)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (simulation + real world, multi-task multi-perturbation, lifelong learning, cross-model validation)
  • Writing Quality: ⭐⭐⭐⭐ (clear problem formulation, well-motivated)
  • Value: ⭐⭐⭐⭐ (provides a practical solution for continuous post-deployment improvement of VLA models)