Human-Inspired Multi-Level Reinforcement Learning

Conference: NeurIPS 2025 arXiv: 2501.07502 Code: None Area: Reinforcement Learning Keywords: rating-based RL, KL divergence, human feedback, multi-level learning, reward-free RL

TL;DR

This paper proposes RbRL-KL, which augments rating-based RL (RbRL) with a KL-divergence-driven policy loss term. By using failure experiences at different rating levels, weighted by severity, to repel the current policy, RbRL-KL outperforms standard RbRL across six DeepMind Control environments.

Background & Motivation

Background: In reward-free settings, RLHF infers rewards from human feedback. PbRL employs preference comparisons, while RbRL uses rating annotations for reward learning.

Limitations of Prior Work: RbRL utilizes ratings solely for reward learning, discarding the policy-directional information embedded in different rating levels.

Key Challenge: Failure experiences of varying performance levels are treated uniformly, whereas humans naturally distinguish between them—completely missing a ball versus going out of bounds represent errors of different severity.

Goal: Directly exploit multi-level rating information during policy learning, enabling the policy to distance itself from failure experiences of different performance levels by varying degrees.

Key Insight: KL divergence is used to measure the distributional similarity between the current policy and experiences at different rating levels, with penalties applied via decreasing weights.

Core Idea: A hierarchical policy loss based on KL divergence allows RL agents to extract directional information from multi-level failure experiences in a manner analogous to human learning.

Method

Overall Architecture

RbRL-KL extends standard RbRL with a second, high-level information channel, yielding a three-part pipeline: low-level information extraction (RbRL reward learning), high-level information extraction (the KL-divergence policy-direction term), and joint training of both.

Key Designs

  1. Tiered Rating Buffer Storage:

    • \(n\) rating classes are stored in separate buffers \(R_0, \ldots, R_{n-1}\)
    • The highest rating level is excluded from the KL loss; all remaining levels are treated as "failures" of varying performance
  2. Multivariate Gaussian Representation:

    • Trajectory sets at each rating level and the current policy trajectories are parameterized as multivariate Gaussians \(\mathcal{N}(\mu, \Sigma)\)
  3. Hierarchical KL Divergence Policy Loss:

    • Core formula: \(\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log(\pi_\theta) \hat{R}(\sigma_\theta)] - \nabla_\theta \sum_{i=0}^{n-2} \omega_i D_{KL}(D_i \| D_{\pi_\theta})\)
    • KL divergence computed in closed form between multivariate Gaussians: \(D_{KL}(P\|Q) = \frac{1}{2}\left(\text{Tr}(\Sigma_Q^{-1}\Sigma_P) + (\mu_Q-\mu_P)^T \Sigma_Q^{-1}(\mu_Q-\mu_P) - k + \ln\frac{\det\Sigma_Q}{\det\Sigma_P}\right)\), where \(k\) is the dimensionality
    • Weights satisfy \(\omega_0 > \omega_1 > \cdots > \omega_{n-2}\), imposing larger penalties for lower rating levels
    • The first term is the standard policy gradient; the second repels the policy from suboptimal behaviors to varying degrees (see the NumPy sketch after this list)

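A minimal NumPy sketch of the closed-form Gaussian KL and the weighted penalty over the failure-level buffers. The function names, the trajectory-feature representation (one row per trajectory), and the lack of covariance regularization are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(P || Q) for multivariate Gaussians P = N(mu_p, cov_p), Q = N(mu_q, cov_q)."""
    k = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    trace_term = np.trace(cov_q_inv @ cov_p)
    quad_term = float(diff @ cov_q_inv @ diff)
    # slogdet is numerically safer than det for the log-ratio term
    logdet_term = np.linalg.slogdet(cov_q)[1] - np.linalg.slogdet(cov_p)[1]
    return 0.5 * (trace_term + quad_term - k + logdet_term)

def hierarchical_kl_penalty(failure_buffers, policy_trajs, weights):
    """Weighted sum of KL(D_i || D_pi) over failure rating levels 0..n-2.

    failure_buffers: list of (num_trajs_i, feat_dim) arrays, worst rating first
    policy_trajs:    (num_trajs, feat_dim) array sampled from the current policy
    weights:         decreasing weights w_0 > w_1 > ... > w_{n-2}
    """
    mu_q = policy_trajs.mean(axis=0)
    cov_q = np.cov(policy_trajs, rowvar=False)
    penalty = 0.0
    for w, buf in zip(weights, failure_buffers):
        mu_p = buf.mean(axis=0)
        cov_p = np.cov(buf, rowvar=False)
        penalty += w * gaussian_kl(mu_p, cov_p, mu_q, cov_q)
    return penalty
```
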
Loss & Training

  • An initial phase of \(M\) episodes collects ratings to train the reward predictor; subsequent updates jointly optimize both loss terms (see the training-loop sketch after this list)
  • The KL loss module is added in a plug-and-play fashion without modifying the original RbRL framework

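The two-phase schedule can be summarized as follows. This reuses hierarchical_kl_penalty from the sketch above; rollout_fn, rating_fn, and the policy/predictor interfaces are hypothetical placeholders, not the authors' API:

```python
import numpy as np

def train_rbrl_kl(rollout_fn, rating_fn, reward_predictor, policy,
                  n_ratings, weights, M, total_updates, batch_size=10):
    """Two-phase schedule: collect ratings and fit the reward predictor,
    then jointly optimize the policy-gradient and hierarchical-KL terms.
    rollout_fn() -> trajectory feature vector from the current policy;
    rating_fn(traj) -> integer rating level in [0, n_ratings - 1]."""
    buffers = [[] for _ in range(n_ratings)]  # tiered storage R_0 .. R_{n-1}

    # Phase 1: M rated episodes train the reward predictor.
    for _ in range(M):
        traj = rollout_fn()
        buffers[rating_fn(traj)].append(traj)
    reward_predictor.fit(buffers)

    # Highest rating level is excluded; assumes each failure level is non-empty.
    failure_buffers = [np.asarray(b) for b in buffers[:-1]]

    # Phase 2: joint updates; minimizing pg_loss + kl_loss maximizes
    # J = E[R_hat] - sum_i w_i * KL(D_i || D_pi).
    for _ in range(total_updates):
        policy_trajs = np.asarray([rollout_fn() for _ in range(batch_size)])
        pg_loss = policy.policy_gradient_loss(policy_trajs, reward_predictor)
        kl_loss = hierarchical_kl_penalty(failure_buffers, policy_trajs, weights)
        policy.update(pg_loss + kl_loss)
```
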
Key Experimental Results

Main Results (6 DeepMind Control Environments)

| Environment | RbRL (n=4) | RbRL-KL (n=4) | Gain (%) | RbRL (n=6) | RbRL-KL (n=6) | Gain (%) |
|---|---|---|---|---|---|---|
| Cartpole | 402.55 | 417.54 | +3.7 | 306.92 | 381.79 | +24.4 |
| Ball-in-cup | 789.30 | 861.47 | +9.1 | 828.62 | 873.92 | +5.5 |
| Finger-spin | 511.55 | 579.27 | +13.2 | 559.73 | 646.37 | +15.5 |
| HalfCheetah | 238.99 | 337.04 | +41.0 | 235.46 | 303.88 | +29.1 |
| Walker | 606.14 | 742.05 | +22.4 | 797.90 | 825.18 | +3.4 |
| Quadruped | 308.48 | 477.29 | +54.7 | 199.83 | 306.78 | +53.5 |

Gain Percentage Across Different Numbers of Rating Classes

| Environment | n=3 | n=4 | n=5 | n=6 |
|---|---|---|---|---|
| Cartpole | +15.5% | +3.7% | +22.5% | +24.4% |
| HalfCheetah | +60.0% | +41.0% | +45.2% | +29.1% |
| Quadruped | -7.5% | +54.7% | +226.0% | +53.5% |

Key Findings

  • Significant gains are observed in high-complexity environments (HalfCheetah, Walker, Quadruped)
  • Occasional negative gains at low rating class counts (n=3): coarse failure groupings lead to overly uniform KL penalties
  • Uniform hyperparameters (\(\omega_i\) decaying as \(2^{-i}\)) generalize across all environments (worked out below)

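For concreteness, the reported \(2^{-i}\) decay yields the following weights for \(n = 4\) rating classes:

```python
n = 4                                        # number of rating classes
weights = [2.0 ** -i for i in range(n - 1)]  # only levels 0..n-2 enter the KL term
# -> [1.0, 0.5, 0.25]: matches omega_0..omega_2 in the hyperparameter table
```
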
Highlights & Insights

  • Human Learning Analogy: The hierarchical KL penalty formalizes the intuition of "learning different lessons from different mistakes"
  • Modular Design: Plug-and-play compatibility with PPO/DDPG/SAC
  • Multivariate Gaussian Approximation is concise and effective, avoiding complex distributional estimation
  • The approach is transferable to graded penalties in preference-based RL

Limitations & Future Work

  • The weights \(\omega_i\) are set manually, lacking an adaptive mechanism
  • The multivariate Gaussian assumption may be inaccurate for high-dimensional multimodal distributions
  • The optimal choice of rating class count \(n\) is environment-dependent
  • Experimental environments are relatively simple

Comparison with Related Work

  • vs. RbRL (White et al. 2024): Original RbRL uses ratings only for reward learning; this work adds a policy-direction learning channel, making the two complementary
  • vs. PbRL (Christiano et al. 2017): PbRL relies on pairwise preferences and cannot assess the absolute quality of individual samples; RbRL-KL exploits the multi-level information in absolute ratings
  • vs. Wu et al. (2024) negative experience RL: Their approach applies a uniform penalty for all failure experiences; this work distinguishes between failure levels for finer-grained shaping
  • vs. DQfD/DDPGfD: These methods incorporate expert demonstrations into replay; this work uses multi-level non-expert failure experiences for policy shaping
  • vs. NAC (Gao et al. 2018): NAC initializes policies with noisy demonstrations and fine-tunes; this work continuously exploits multi-level information throughout training

Hyperparameter Settings

| Parameter | Value | Description |
|---|---|---|
| Clip \(\epsilon\) | 0.4 | PPO clip parameter |
| Learning rate \(\alpha\) | 5e-5 | Unified across all environments |
| Batch size | 128 | Unified across all environments |
| Hidden layers | 3 | Unified across all environments |
| \(\omega_0\) | 1.0 | Weight for the lowest rating level |
| \(\omega_1\) | 0.5 | Exponential decay (\(2^{-1}\)) |
| \(\omega_2\) | 0.25 | Exponential decay (\(2^{-2}\)) |

Rating

  • Novelty: ⭐⭐⭐⭐ The hierarchical KL loss is intuitively novel, though technically a combination of existing components
  • Experimental Thoroughness: ⭐⭐⭐ 6 environments with 10 seeds is acceptable, but ablation studies and more complex benchmarks are lacking
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, complete formulations, and well-motivated presentation
  • Value: ⭐⭐⭐⭐ Concisely and effectively exploits multi-level feedback, offering meaningful reference for RLHF research