ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics¶

Conference: ICML 2025
arXiv: 2411.18825
Code: None
Area: Multimodal / VLM
Keywords: Reward Design, VLM, Inverse Reinforcement Learning, Learning from Demonstrations, Robotics

TL;DR¶

ELEMENTAL combines vision-language models (VLMs) with inverse reinforcement learning (IRL): a VLM writes the feature functions, IRL optimizes the feature weights, and a self-reflection loop iteratively refines the features. It achieves a 42.3% improvement over EUREKA across 9 IsaacGym tasks.

Background & Motivation¶

Background: RL has achieved strong results on robotics tasks, but reward function design remains the core bottleneck: it requires extensive domain knowledge and manual parameter tuning.

Limitations of Prior Work: (a) LLM-based methods like EUREKA rely solely on text descriptions to generate reward functions, failing to precisely capture complex spatial tasks; (b) LLMs struggle to balance the weights of different features; (c) text-only task specifications cannot capture implicit user preferences.

Key Challenge: LLMs excel at semantic understanding and feature recognition but struggle with mathematical optimization and weight allocation; IRL excels at matching behaviors from demonstrations but requires predefined features. The two are highly complementary.

Key Insight: Let the VLM handle feature extraction and IRL handle weight optimization, while introducing visual demonstrations as auxiliary information.

Core Idea: A three-phase loop: (1) an initial prompt asks the VLM to generate feature functions, (2) approximate MaxEnt-IRL learns the reward weights and a policy, and (3) self-reflection compares the feature counts of the policy and the demonstrations and feeds the gap back to the VLM for the next iteration.

Method¶

Overall Architecture¶

Input: Environment code + task text description + visual demonstration \(\to\) Phase 1: VLM generates feature function \(\phi(s)\) \(\to\) Phase 2: IRL learns \(R_\theta(s) = \theta^T \phi(s)\) \(\to\) Phase 3: Compare feature count differences between policy and demonstrations \(\to\) Feedback to VLM to modify features \(\to\) Iterate.

Key Designs¶

Phase 1 - Initial Prompt (VLM Feature Extraction):
- Inputs include: environment MDP code, task text description, and visual demonstrations (overlays / keyframes)
- The VLM (GPT-4o) outputs feature functions \(\phi: \mathcal{S} \to \mathbb{R}^n\) in the format of Python code
- Design Motivation: Visual demonstrations compensate for the inadequacy of text descriptions; the VLM's code capabilities are restricted to "feature extraction" rather than "complete reward design".
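To make the Phase 1 output concrete, here is a minimal sketch of the kind of feature function \(\phi(s)\) a VLM might return. It is illustrative only: the function name `compute_features`, the cartpole-like state layout, and the four features are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def compute_features(state: np.ndarray) -> np.ndarray:
    """Hypothetical feature function phi(s) for a cartpole-like state.

    Assumed state layout (illustrative, not from the paper):
    [cart_pos, cart_vel, pole_angle, pole_ang_vel].
    """
    cart_pos, cart_vel, pole_angle, pole_ang_vel = state
    return np.array([
        np.cos(pole_angle),   # uprightness: 1.0 when the pole is vertical
        -abs(cart_pos),       # penalize distance from the track centre
        -abs(cart_vel),       # penalize fast cart motion
        -abs(pole_ang_vel),   # penalize fast pole rotation
    ])


# Features only; the relative weights are left to the IRL phase.
phi = compute_features(np.array([0.0, 0.0, 0.0, 0.0]))
```

Note that the function returns unweighted features; balancing them is deliberately left to Phase 2.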

Phase 2 - Learning (Approximate MaxEnt-IRL):
- Reward model: \(R_\theta(s) = \theta^T \phi(s)\), initial \(\theta = \{1/n\}^n\)
- Gradient of the MaxEnt-IRL objective with respect to \(\theta\), written \(\nabla_\theta\): \(\nabla_\theta \approx \mathbb{E}_{\tau \sim \mathcal{D}}[\sum_s \phi(s)] - \mathbb{E}_{\tau \sim \pi_\psi}[\sum_s \phi(s)]\), i.e. demonstration feature counts minus policy feature counts
- Alternately optimize \(\theta\) (reward weights) and \(\psi\) (PPO policy)
- Key trick: L1-normalize both the gradient and \(\theta\) to keep training stable
- Design Motivation: Computing the partition function directly is intractable, so trajectories sampled from the current policy are used to approximate it.
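The weight update described in the Phase 2 bullets can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the helper names `l1_normalize` and `irl_weight_step`, the array shapes, the step size, and the small `eps` constant are all hypothetical.

```python
import numpy as np


def l1_normalize(v: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale v so that its absolute values sum to (approximately) one."""
    return v / (np.abs(v).sum() + eps)


def irl_weight_step(theta, demo_features, policy_features, lr=0.1):
    """One approximate MaxEnt-IRL step on the reward weights.

    demo_features, policy_features: arrays of shape (trajectories, steps, n)
    holding phi(s) for every visited state (shapes are an assumption).
    """
    # Feature counts: sum over time steps, then average over trajectories.
    demo_counts = demo_features.sum(axis=1).mean(axis=0)
    policy_counts = policy_features.sum(axis=1).mean(axis=0)
    # Gradient estimate = demonstration counts minus policy counts.
    grad = l1_normalize(demo_counts - policy_counts)    # gradient normalization
    return l1_normalize(theta + lr * grad)              # weight normalization


n = 3
theta = np.full(n, 1.0 / n)                  # uniform initialization, 1/n each
demo = np.zeros((2, 4, n))
demo[..., 0] = 1.0                           # demonstrations exercise feature 0
rollouts = np.zeros((2, 4, n))
rollouts[..., 1] = 1.0                       # current policy exercises feature 1
theta_new = irl_weight_step(theta, demo, rollouts)
```

In this toy call the weight of feature 0 rises and the weight of feature 1 falls, while the weights keep an L1 norm of one.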

Phase 3 - Reflection (Self-Reflection):
- Compute feature count vectors for policy trajectories and demonstration trajectories: \(\vec{\Phi}_{\pi_\psi}\) vs \(\vec{\Phi}_\mathcal{D}\)
- Feed the differences back to the VLM so that it can revise the feature functions
- Automatically completed without requiring additional human input
- Design Motivation: Simulates the "observe \(\to\) execute \(\to\) reflect \(\to\) improve" loop in human learning.
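The reflection step amounts to turning two feature-count vectors into text the VLM can act on. The sketch below shows one plausible way to do that; the function `reflection_feedback`, the feature names, and the message format are assumptions for illustration, since the exact prompt wording is not given here.

```python
import numpy as np


def reflection_feedback(names, policy_counts, demo_counts):
    """Format per-feature count gaps, largest mismatch first (hypothetical format)."""
    gaps = np.asarray(policy_counts) - np.asarray(demo_counts)
    order = np.argsort(-np.abs(gaps))          # biggest absolute gap first
    lines = []
    for i in order:
        lines.append(
            f"{names[i]}: policy={policy_counts[i]:.2f}, "
            f"demo={demo_counts[i]:.2f}, gap={gaps[i]:+.2f}"
        )
    return "\n".join(lines)


feedback = reflection_feedback(
    names=["upright", "centered"],
    policy_counts=[3.0, -2.5],
    demo_counts=[4.0, -0.5],
)
```

Sorting by absolute gap is a design choice of this sketch: it puts the feature the policy matches worst at the top of the message.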

Loss & Training¶

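Reward weights: Gradient ascent \(\theta \leftarrow \theta + \alpha \nabla_\theta'\), where \(\nabla_\theta'\) is the L1-normalized gradient; \(\theta\) is L1-normalized again after each step
Policy: Optimize \(\pi_\psi\) with PPO to maximize the expected return \(J(\pi_\psi)\) under the current reward \(R_\theta\)
Schedule: Alternate the two updates for \(m\) IRL iterations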
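The alternation between the reward update and the policy update can be illustrated on a toy problem. The sketch below makes a loud simplification: PPO is replaced by the closed-form softmax policy of a one-step problem with three actions, so it shows only the structure of the loop, not the paper's training setup. All names (`alternate_irl`, `softmax_policy`, `action_features`) and the values of `m` and `alpha` are hypothetical.

```python
import numpy as np


def l1_normalize(v, eps=1e-8):
    return v / (np.abs(v).sum() + eps)


def softmax_policy(theta, action_features):
    """Stand-in for the PPO step: MaxEnt-optimal policy of a one-step problem."""
    logits = action_features @ theta
    probs = np.exp(logits - logits.max())      # subtract max for stability
    return probs / probs.sum()


def alternate_irl(action_features, demo_counts, m=20, alpha=0.1):
    n = action_features.shape[1]
    theta = np.full(n, 1.0 / n)                          # uniform start
    for _ in range(m):
        policy = softmax_policy(theta, action_features)  # policy update
        policy_counts = policy @ action_features         # expected features
        grad = l1_normalize(demo_counts - policy_counts) # normalized gradient
        theta = l1_normalize(theta + alpha * grad)       # reward update
    return theta, softmax_policy(theta, action_features)


action_features = np.eye(3)          # one indicator feature per action
demo_counts = action_features[2]     # the demonstrator always picks action 2
theta, policy = alternate_irl(action_features, demo_counts)
```

After the loop, the learned weights favor the demonstrated action, and the policy assigns it the highest probability.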

Key Experimental Results¶

Main Results¶

Task	ELEMENTAL	EUREKA	BC	IRL	GT Reward
Cartpole	233.92	215.91	149.85	28.15	260.14
Ant	8.49	6.88	-0.05	0.88	7.00
Humanoid	4.70	3.78	-0.43	2.13	5.07
FrankaCabinet	0.36	0.21	0.01	0.00	0.40
AllegroHand	22.97	11.12	0.04	0.01	23.70
ShadowHand	2.71	0.001	0.03	0.01	0.15
Overall Gain	+42.3%	baseline	—	—	upper bound

Ablation Study¶

Configuration	Average Performance	Explanation
Full ELEMENTAL	Optimal	Full three-phase workflow
w/o Self-Reflection	Degraded	Lacks iterative improvement
w/o Visual Input	Degraded	Text only is insufficient for complex tasks
w/o Norm 1 (Gradient Normalization)	Degraded	Unstable training
w/o Norm 2 (Weight Normalization)	Degraded	Inconsistent reward scale

Key Findings¶

The execution rate of GPT-4o's feature code (~80%) is significantly higher than EUREKA's reward code execution rate (<50%)
Generalization experiments: ELEMENTAL improves performance by 41.3% over EUREKA across 4 Ant variants, which suggests that the LLM behind EUREKA may have memorized the standard IsaacGym reward functions
This is the first successful application of IRL to high-dimensional tasks in IsaacGym

Highlights & Insights¶

Complementary Architecture: VLM for feature recognition + IRL for weight optimization, playing to their respective strengths
First successful IRL application in IsaacGym: Enabled by the structured feature space provided by the VLM
Self-Reflection Mechanism: Feature count differences provide a more precise improvement signal than textual feedback

Limitations & Future Work¶

Runtime is approximately 2.5 times that of EUREKA (168 vs 68 minutes)
Not yet validated on real hardware
The format of visual demonstrations (overlays/keyframes) needs manual selection based on the task type

Related Work & Insights¶

EUREKA (Ma et al. 2023) is a direct competitor
RL-VLM-F uses VLM as a proxy reward but lacks interaction
AIRL (Fu et al. 2018) provides the IRL training paradigm
Insight: LLMs/VLMs should not be expected to perform tasks they struggle with (such as mathematical optimization), but should instead focus on semantic understanding

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The integration of VLM+IRL and the self-reflection mechanism is very clever
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 IsaacGym tasks + 4 generalization variants + comprehensive ablation study
Writing Quality: ⭐⭐⭐⭐ Clear methodology explanation, sound experimental design
Value: ⭐⭐⭐⭐⭐ Provides a practical and powerful solution for robot reward design