
Learning Parameterized Skills from Demonstrations

Conference: NeurIPS 2025 | arXiv: 2510.24095 | Code: GitHub | Area: Optimization (Robot Learning / Skill Discovery) | Keywords: parameterized skills, learning from demonstrations, hierarchical policy, variational inference, robot manipulation

TL;DR

This paper proposes DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Using a three-level hierarchical policy (discrete skill selection → continuous parameter selection → low-level actions) together with an information-bottleneck design, DEPS learns interpretable and generalizable skill abstractions and achieves significant improvements over baselines on LIBERO and MetaWorld.

Background & Motivation

Background: Standard RL applied to long-horizon sequential decision-making fails to exploit intrinsic behavioral patterns, resulting in poor sample efficiency. The Options framework aims to discover modular, temporally extended skills, but existing methods learn either purely discrete or purely continuous skills.

Limitations of Prior Work: (i) Discrete skills lack flexibility and generalize poorly to new contexts; (ii) continuous skills are poorly structured and hard to interpret; (iii) existing parameterized-skill methods either require annotated task parameters (e.g., da Silva et al.) or rely on clustering with pre-trained vision(-language) models (e.g., LOTUS, EXTRACT), which assumes the same skill occurs in visually similar environments; (iv) latent-variable models are prone to degeneration: the low-level policy simply memorizes behaviors without learning meaningful skill abstractions.

Key Challenge: How can parameterized skills with both discrete structure and continuous modulation be discovered end-to-end from demonstrations, without relying on additional annotations or predefined skill libraries?

Goal: Automatically discover parameterized skills from multi-task expert demonstrations, enabling rapid generalization to unseen tasks.

Key Insight: Skills are modeled as parameterized trajectory manifolds, and extreme state compression (to 1 dimension) is employed to force latent variables to encode meaningful skill information.

Core Idea: By compressing the observation space to a 1-dimensional "index," an information asymmetry is created that forces discrete skills and continuous parameters to carry critical task-semantic information.

Method

Overall Architecture

DEPS trains a three-level hierarchy:

  • Level 1 – Discrete skill policy \(\pi^K(k_t \mid s_{1:t}, a_{1:t-1}, l)\): selects a skill from the skill library.
  • Level 2 – Continuous parameter policy \(\pi^Z(z_t \mid s_{1:t}, a_{1:t-1}, k_t, l)\): outputs continuous parameters for the selected skill.
  • Level 3 – Low-level sub-policy \(\pi^A(a_t \mid s'_t, k_t, z_t)\): generates actions conditioned on the compressed state and the parameterized skill.
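To make the data flow concrete, here is a minimal PyTorch sketch of the three levels. The module layouts, hidden sizes, and the categorical/Gaussian output heads are illustrative assumptions, not the authors' exact architecture; the paper only specifies that the high-level policies are unidirectional GRUs and that the sub-policy sees just the compressed state and the skill variables.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class SkillPolicy(nn.Module):
    """Level 1: distribution over discrete skills k_t, given history and task l."""
    def __init__(self, in_dim, n_skills, hidden=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)  # unidirectional, per the paper
        self.head = nn.Linear(hidden, n_skills)

    def forward(self, hist):                        # hist: (B, T, in_dim) = [s, a, l] features
        h, _ = self.gru(hist)
        return D.Categorical(logits=self.head(h))

class ParamPolicy(nn.Module):
    """Level 2: Gaussian over continuous parameters z_t, conditioned on the chosen skill."""
    def __init__(self, in_dim, n_skills, z_dim, hidden=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.skill_emb = nn.Embedding(n_skills, hidden)
        self.mu = nn.Linear(hidden, z_dim)
        self.log_std = nn.Linear(hidden, z_dim)

    def forward(self, hist, k):                     # k: (B, T) skill indices
        h, _ = self.gru(hist)
        h = h + self.skill_emb(k)
        return D.Normal(self.mu(h), self.log_std(h).exp())

class SubPolicy(nn.Module):
    """Level 3: actions from only the 1-D compressed state s'_t and (k_t, z_t)."""
    def __init__(self, n_skills, z_dim, act_dim, hidden=256):
        super().__init__()
        self.skill_emb = nn.Embedding(n_skills, hidden)
        self.net = nn.Sequential(nn.Linear(1 + hidden + z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim))

    def forward(self, s_comp, k, z):                # s_comp: (B, T, 1) compressed state
        return self.net(torch.cat([s_comp, self.skill_emb(k), z], dim=-1))
```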

Key Designs

  1. Temporal Variational Inference:

    • Function: Maximizes the log-likelihood of demonstration trajectories via a variational lower bound.
    • Design Motivation: Directly computing \(\log p(\tau, l)\) requires marginalizing over all possible skill sequences, which is intractable.
    • Mechanism: A variational distribution \(q(\kappa, \zeta \mid \tau, l)\) is introduced; leveraging the non-negativity of the KL divergence yields the ELBO:
\[
\mathcal{L} = \mathbb{E}\Big[\sum_t \log \pi^A(a_t \mid s'_t, k_t, z_t)\Big] - \mathbb{E}\Big[\sum_t D_{\mathrm{KL}}\big(q(k_t \mid \tau, l) \,\|\, \pi^K(\cdot)\big) + \mathbb{E}_{k_t}\big[D_{\mathrm{KL}}\big(q(z_t \mid \tau, k_t, l) \,\|\, \pi^Z(\cdot)\big)\big]\Big]
\]
    • Novelty: Unlike Shankar & Gupta, DEPS handles both discrete and continuous latent variables simultaneously and supports high-dimensional state spaces.
  2. Projective State Compression:

    • Function: Compresses the input state of the low-level sub-policy to a scalar.
    • Design Motivation: (a) Increases overlap in the state space across tasks, improving generalization; (b) the compressed state alone is insufficient to determine actions, forcing the sub-policy to rely on \((k_t, z_t)\) to encode critical information.
    • Mechanism: \(s'_t = \tanh(\mathbf{w}_{(k_t, z_t)} \cdot s_t^{\text{proj}} + b_{(k_t, z_t)})\), where the projection vector \(\mathbf{w}\) and bias \(b\) are generated by a skill-conditioned MLP; \(\tanh\) normalizes the output to \([-1, 1]\).
    • Novelty: The extreme 1-dimensional compression is the paper's primary innovation, conceptualizing skills as indices on a "parameterized trajectory manifold" (a sketch of the compression step follows this list).
  3. Information Asymmetry Design:

    • Function: High-level policies receive full observations (images + proprioception), while the low-level sub-policy receives only the compressed proprioceptive state.
    • Design Motivation: Prevents the sub-policy from overfitting to visual details, forcing task information to be conveyed through skill variables.
    • Additional Constraints: Continuous parameters are predicted at the granularity of discrete skills (rather than at each step), preventing continuous parameters from degenerating into shortcuts encoding per-step actions; a norm penalty on skill parameters is introduced to prevent overfitting.
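As referenced in the projective-compression item above, here is a minimal PyTorch sketch of that step under the stated formula: a hypernetwork conditioned on \((k_t, z_t)\) emits the projection vector \(\mathbf{w}\) and bias \(b\), and the proprioceptive state collapses to a single scalar in \([-1, 1]\). The hypernetwork layout is an assumption; only the \(\tanh(\mathbf{w} \cdot s + b)\) form and the skill-conditioned MLP are given in the paper.

```python
import torch
import torch.nn as nn

class ProjectiveCompression(nn.Module):
    """Collapse the proprioceptive state to one scalar, s'_t = tanh(w·s_t + b),
    with w and b produced by a skill-conditioned MLP (a hypernetwork)."""
    def __init__(self, state_dim, n_skills, z_dim, hidden=128):
        super().__init__()
        self.skill_emb = nn.Embedding(n_skills, hidden)
        # Hypernetwork: (k_t, z_t) -> projection vector w (state_dim) and bias b (1).
        self.hyper = nn.Sequential(nn.Linear(hidden + z_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, state_dim + 1))

    def forward(self, s, k, z):                    # s: (B, T, state_dim) proprioception only
        wb = self.hyper(torch.cat([self.skill_emb(k), z], dim=-1))
        w, b = wb[..., :-1], wb[..., -1:]
        # Information bottleneck: the sub-policy sees only this 1-D index,
        # so task semantics must flow through (k_t, z_t) instead.
        return torch.tanh((w * s).sum(-1, keepdim=True) + b)
```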

Loss & Training

  • Maximizes the variational lower bound (ELBO), comprising three terms: behavior cloning loss + discrete KL term + continuous KL term.
  • Variational networks use bidirectional GRUs; discrete/continuous policies use unidirectional GRUs.
  • After pre-training, models are fine-tuned for 500 steps on unseen tasks to evaluate generalization.
  • LIBERO: 80 tasks for pre-training, 20 epochs; MetaWorld: 10 tasks, 40 epochs.
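A schematic of the training objective as a (negative) ELBO with the three stated terms plus the norm penalty on skill parameters. The distribution inputs and the penalty weight `lam` are placeholders rather than the authors' exact implementation.

```python
import torch
import torch.distributions as D

def deps_loss(actions, pi_a_pred, q_k, pi_k, q_z, pi_z, z_samples, lam=1e-3):
    """Negative ELBO: behavior-cloning loss + discrete KL + continuous KL,
    plus a norm penalty on skill parameters (lam is an assumed weight)."""
    # 1) Behavior cloning: log-likelihood of demo actions under the sub-policy
    #    (pi_a_pred: an action distribution, e.g., a Gaussian head).
    bc = -pi_a_pred.log_prob(actions).sum(-1).mean()
    # 2) Discrete KL: variational skill posterior vs. the skill policy prior.
    kl_k = D.kl_divergence(q_k, pi_k).mean()
    # 3) Continuous KL: variational parameter posterior vs. the parameter policy,
    #    here evaluated at the sampled skill (a Monte Carlo stand-in for E_{k_t}).
    kl_z = D.kl_divergence(q_z, pi_z).sum(-1).mean()
    # 4) Norm penalty discourages z from degenerating into per-step shortcuts.
    reg = lam * z_samples.pow(2).sum(-1).mean()
    return bc + kl_k + kl_z + reg
```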

Key Experimental Results

Main Results

Average success rate across evaluation settings (LIBERO + MetaWorld):

| Evaluation Set | Algorithm | Mean Success | Mean Highest Success |
|---|---|---|---|
| LIBERO-OOD | DEPS | 0.34±0.08 | 0.66±0.12 |
| | PRISE | 0.10±0.09 | 0.27±0.23 |
| | BC | 0.15±0.04 | 0.36±0.08 |
| LIBERO-3-shot | DEPS | 0.26±0.03 | 0.49±0.03 |
| | PRISE | 0.07±0.07 | 0.19±0.14 |
| | BC | 0.11±0.05 | 0.22±0.08 |
| MW-Vanilla | DEPS | 0.45±0.03 | 0.65±0.03 |
| | PRISE | 0.21±0.07 | 0.33±0.10 |
| | BC | 0.35±0.02 | 0.51±0.01 |

Ablation Study

Robustness to the amount of pre-training (LIBERO-OOD, Mean Highest Success):

| Pre-training Epochs | DEPS | BC | PRISE |
|---|---|---|---|
| 5 | 0.64±0.09 | 0.30±0.08 | 0.11±0.13 |
| 10 | 0.75±0.01 | 0.30±0.12 | 0.27±0.33 |
| 15 | 0.74±0.04 | 0.32±0.07 | 0.33±0.26 |

Key Ablation Findings:

  • 1D state compression is critical to DEPS performance.
  • Learning only discrete skills or only continuous skills fails to reproduce DEPS performance.
  • Varying the maximum number of discrete skills can further improve performance.

Key Findings

  • DEPS achieves over 2× the Mean Success of BC and over 3× that of PRISE on LIBERO-OOD.
  • DEPS maintains strong performance under extreme data scarcity (3-shot: 0.26 vs. BC 0.11).
  • The advantage of DEPS grows as pre-training data decreases, indicating that parameterized skill learning also improves data efficiency.
  • Learned discrete skills are interpretable, corresponding to primitive operations such as grasping, moving, and releasing.
  • Variations in continuous parameters produce smooth changes in policy behavior (e.g., continuous variation in grasp position).
  • The compressed 1D state varies monotonically within a single skill, confirming its function as a "trajectory index."

Highlights & Insights

  • Skills as parameterized trajectory manifolds: A novel and intuitive concept—different executions of the same skill correspond to different points on the manifold, and a 1-dimensional index suffices to locate a trajectory.
  • Counterintuitive effectiveness of extreme compression: Compressing the state to a single scalar proves effective and is the key driver of performance.
  • Careful information asymmetry design: Rich observations at the high level combined with extreme compression at the low level force each layer of the hierarchy to fulfill a distinct role.
  • Interpretability: Learned skills correspond to intuitively plausible behavioral primitives, enhancing the credibility of the approach.

Limitations & Future Work

  • Validation is limited to robot manipulation tasks; more complex environments (navigation, bimanual manipulation) remain untested.
  • The method assumes all tasks share the same state and action spaces.
  • The number of discrete skills must be specified as a hyperparameter.
  • 1D compression may fail in tasks requiring richer state information.
  • The method learns solely from offline demonstrations and is not combined with online RL.
  • Analysis of computational efficiency (training time, inference latency) is absent.
  • The scalability to larger task sets and longer-horizon problems remains to be verified.

Connections & Takeaways

  • The work is in line with the classical Options framework (Sutton et al.) but achieves end-to-end parameterized skill discovery.
  • Comparison with PRISE (an action-tokenization baseline) demonstrates the superiority of the end-to-end approach.
  • The information bottleneck / state compression idea is generalizable to other hierarchical learning settings.
  • The work motivates a general strategy of combining variational inference with information-theoretic regularization to prevent degeneration.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The parameterized trajectory manifold concept and 1D compression design are highly innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across multiple benchmarks and settings with excellent qualitative visualizations, though more complex environments are absent.
  • Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous, concepts are clearly articulated, and figures are intuitive.
  • Value: ⭐⭐⭐⭐ Makes an important contribution to robot skill learning and hierarchical policy learning.