Human-Inspired Multi-Level Reinforcement Learning¶
Conference: NeurIPS 2025 | arXiv: 2501.07502 | Code: None | Area: Reinforcement Learning | Keywords: rating-based RL, KL divergence, human feedback, multi-level learning, reward-free RL
TL;DR¶
This paper proposes RbRL-KL, which augments rating-based RL (RbRL) with a KL-divergence-driven policy loss term. By using failure experiences from different rating levels, weighted differently, to repel the current policy, RbRL-KL outperforms standard RbRL across six DeepMind Control environments.
Background & Motivation¶
Background: In reward-free settings, RLHF infers rewards from human feedback: preference-based RL (PbRL) relies on pairwise preference comparisons, while rating-based RL (RbRL) uses absolute rating annotations for reward learning.
Limitations of Prior Work: RbRL utilizes ratings solely for reward learning, discarding the policy-directional information embedded in different rating levels.
Key Challenge: Existing methods treat failure experiences of different quality uniformly, whereas humans naturally distinguish between them; completely missing the ball and hitting it out of bounds are errors of different severity.
Goal: Directly exploit multi-level rating information during policy learning, enabling the policy to distance itself from failure experiences of different performance levels by varying degrees.
Key Insight: KL divergence is used to measure the distributional similarity between the current policy and experiences at different rating levels, with penalties applied via decreasing weights.
Core Idea: A hierarchical policy loss based on KL divergence allows RL agents to extract directional information from multi-level failure experiences in a manner analogous to human learning.
Method¶
Overall Architecture¶
RbRL-KL adds a second information channel to standard RbRL, giving three components: low-level information extraction (RbRL reward learning from ratings), high-level information extraction (KL-divergence-based policy direction), and joint training of both.
Key Designs¶
- Tiered Rating Buffer Storage:
  - The \(n\) rating classes are stored in separate buffers \(R_0, \ldots, R_{n-1}\)
  - The highest rating level is excluded from the KL loss; all remaining levels are treated as "failures" of varying performance
- Multivariate Gaussian Representation:
  - The trajectory set at each rating level and the current policy's trajectories are each parameterized as a multivariate Gaussian \(\mathcal{N}(\mu, \Sigma)\)
- Hierarchical KL Divergence Policy Loss (see the sketch after this list):
  - Core formula: \(\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log(\pi_\theta) \hat{R}(\sigma_\theta)] - \nabla_\theta \sum_{i=0}^{n-2} \omega_i D_{KL}(D_i \| D_{\pi_\theta})\)
  - The KL divergence is computed in closed form between multivariate Gaussians: \(D_{KL}(P\|Q) = \frac{1}{2}\big(\text{Tr}(\Sigma_Q^{-1}\Sigma_P) + (\mu_Q-\mu_P)^T \Sigma_Q^{-1}(\mu_Q-\mu_P) - k + \ln\frac{\det\Sigma_Q}{\det\Sigma_P}\big)\), where \(k\) is the feature dimension
  - Weights satisfy \(\omega_0 > \omega_1 > \cdots > \omega_{n-2}\), imposing larger penalties for lower rating levels
  - The first term is the standard policy gradient with the learned reward; the second repels the policy from suboptimal behaviors by degrees that depend on their rating
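Below is a minimal NumPy sketch of this penalty. It assumes trajectories are summarized as fixed-length feature vectors and that each rating level holds a batch of such vectors; the function names (`fit_gaussian`, `gaussian_kl`, `hierarchical_kl_penalty`) are illustrative, since the paper releases no code.

```python
# Minimal sketch of the hierarchical KL penalty (illustrative names only).
import numpy as np


def fit_gaussian(batch, eps=1e-6):
    """Fit N(mu, Sigma) to a batch of trajectory features, shape (num_traj, dim)."""
    mu = batch.mean(axis=0)
    sigma = np.cov(batch, rowvar=False) + eps * np.eye(batch.shape[1])  # jitter keeps Sigma invertible
    return mu, sigma


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(P || Q) between two multivariate Gaussians."""
    k = mu_p.shape[0]
    sigma_q_inv = np.linalg.inv(sigma_q)
    diff = mu_q - mu_p
    return 0.5 * (
        np.trace(sigma_q_inv @ sigma_p)
        + diff @ sigma_q_inv @ diff
        - k
        + np.log(np.linalg.det(sigma_q) / np.linalg.det(sigma_p))
    )


def hierarchical_kl_penalty(rating_buffers, policy_batch, weights):
    """sum_i w_i * KL(D_i || D_pi) over failure levels 0..n-2 (worst level first)."""
    mu_pi, sigma_pi = fit_gaussian(policy_batch)
    penalty = 0.0
    for w, buf in zip(weights, rating_buffers):
        mu_i, sigma_i = fit_gaussian(buf)
        penalty += w * gaussian_kl(mu_i, sigma_i, mu_pi, sigma_pi)
    return penalty


# Toy usage: three failure levels, 8-dimensional trajectory features.
rng = np.random.default_rng(0)
buffers = [rng.normal(loc=-(3 - i), size=(64, 8)) for i in range(3)]
policy_batch = rng.normal(size=(64, 8))
print(hierarchical_kl_penalty(buffers, policy_batch, weights=[1.0, 0.5, 0.25]))
```

Because both the rating-level data and the policy trajectories are modeled as Gaussians, the penalty is available in closed form, which is what lets the method avoid heavier density estimation.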
Loss & Training¶
- An initial phase of \(M\) episodes collects ratings to train the reward predictor; subsequent updates jointly optimize both loss terms (see the training-loop sketch below)
- The KL loss module is added in a plug-and-play fashion without modifying the original RbRL framework
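As a rough picture of this two-phase schedule, here is a schematic Python skeleton. The callable names (`collect_episode`, `query_rating`, `train_reward_predictor`, `update_policy`) are hypothetical placeholders rather than the authors' API, and details such as when rating queries stop are simplified.

```python
# Schematic two-phase training loop for RbRL-KL; placeholder callables stand in for
# the rollout collection, human rater, reward predictor, and PPO policy update.
from typing import Callable, List


def train_rbrl_kl(
    num_episodes: int,
    rating_episodes: int,                      # the initial M rating-collection episodes
    collect_episode: Callable[[], object],     # roll out the current policy
    query_rating: Callable[[object], int],     # human rating in {0, ..., n-1}
    train_reward_predictor: Callable[[List[List[object]]], None],
    update_policy: Callable[[object, List[List[object]]], None],
    num_ratings: int,
) -> List[List[object]]:
    # One buffer per rating class R_0 ... R_{n-1}; the top class is excluded
    # from the KL penalty inside update_policy.
    buffers: List[List[object]] = [[] for _ in range(num_ratings)]

    for episode in range(num_episodes):
        trajectory = collect_episode()
        if episode < rating_episodes:
            # Phase 1: collect human ratings and fit the reward predictor (standard RbRL).
            buffers[query_rating(trajectory)].append(trajectory)
            train_reward_predictor(buffers)
        else:
            # Phase 2: jointly optimize the predicted-reward policy-gradient term and
            # the weighted KL penalty against the failure-level buffers.
            update_policy(trajectory, buffers)
    return buffers
```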
Key Experimental Results¶
Main Results (6 DeepMind Control Environments)¶
| Environment | RbRL(n=4) | RbRL-KL(n=4) | Gain% | RbRL(n=6) | RbRL-KL(n=6) | Gain% |
|---|---|---|---|---|---|---|
| Cartpole | 402.55 | 417.54 | +3.7 | 306.92 | 381.79 | +24.4 |
| Ball-in-cup | 789.30 | 861.47 | +9.1 | 828.62 | 873.92 | +5.5 |
| Finger-spin | 511.55 | 579.27 | +13.2 | 559.73 | 646.37 | +15.5 |
| HalfCheetah | 238.99 | 337.04 | +41.0 | 235.46 | 303.88 | +29.1 |
| Walker | 606.14 | 742.05 | +22.4 | 797.90 | 825.18 | +3.4 |
| Quadruped | 308.48 | 477.29 | +54.7 | 199.83 | 306.78 | +53.5 |
Gain Percentage Across Different Numbers of Rating Classes¶
| Environment | n=3 | n=4 | n=5 | n=6 |
|---|---|---|---|---|
| Cartpole | +15.5% | +3.7% | +22.5% | +24.4% |
| HalfCheetah | +60.0% | +41.0% | +45.2% | +29.1% |
| Quadruped | -7.5% | +54.7% | +226.0% | +53.5% |
Key Findings¶
- Significant gains are observed in high-complexity environments (HalfCheetah, Walker, Quadruped)
- Occasional negative gains at low rating class counts (n=3): coarse failure groupings lead to overly uniform KL penalties
- The same hyperparameters, with \(\omega_i\) decaying as \(2^{-i}\), generalize across all environments
Highlights & Insights¶
- Human Learning Analogy: The hierarchical KL penalty formalizes the intuition of "learning different lessons from different mistakes"
- Modular Design: Plug-and-play compatibility with PPO/DDPG/SAC
- Multivariate Gaussian Approximation: concise and effective, avoiding complex distribution estimation
- The approach is transferable to graded penalties in preference-based RL
Limitations & Future Work¶
- The weights \(\omega_i\) are set manually, lacking an adaptive mechanism
- The multivariate Gaussian assumption may be inaccurate for high-dimensional multimodal distributions
- The optimal choice of rating class count \(n\) is environment-dependent
- Experimental environments are relatively simple
Related Work & Insights¶
- vs. RbRL (White et al. 2024): Original RbRL uses ratings only for reward learning; this work adds a policy-direction learning channel, making the two complementary
- vs. PbRL (Christiano et al. 2017): PbRL relies on pairwise preferences and cannot assess the absolute quality of individual samples; RbRL-KL exploits the multi-level information in absolute ratings
- vs. Wu et al. (2024) negative experience RL: Their approach applies a uniform penalty to all failure experiences; this work distinguishes between failure levels for finer-grained shaping
- vs. DQfD/DDPGfD: These methods incorporate expert demonstrations into replay; this work uses multi-level non-expert failure experiences for policy shaping
- vs. NAC (Gao et al. 2018): NAC initializes policies with noisy demonstrations and fine-tunes; this work continuously exploits multi-level information throughout training
Hyperparameter Settings¶
| Parameter | Value | Description |
|---|---|---|
| Clip \(\epsilon\) | 0.4 | PPO clip parameter |
| Learning rate \(\alpha\) | 5e-5 | Unified across all environments |
| Batch size | 128 | Unified across all environments |
| Hidden layers | 3 | Unified across all environments |
| \(\omega_0\) | 1.0 | Weight for lowest rating level |
| \(\omega_1\) | 0.5 | Exponential decay |
| \(\omega_2\) | 0.25 | Exponential decay |
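For reference, the table above can be mirrored as a small configuration dictionary; the key names are made up for illustration, and the \(\omega_i\) values follow the reported \(2^{-i}\) decay, shown here for \(n = 4\) rating classes.

```python
# Hypothetical hyperparameter dictionary mirroring the table above
# (key names are illustrative, not from an official config file).
NUM_RATINGS = 4  # n rating classes; n - 1 levels enter the KL penalty

config = {
    "ppo_clip_epsilon": 0.4,
    "learning_rate": 5e-5,       # shared across all environments
    "batch_size": 128,
    "hidden_layers": 3,
    # omega_i = 2^{-i}: the largest weight goes to the lowest (worst) rating level.
    "omega": [2.0 ** -i for i in range(NUM_RATINGS - 1)],  # [1.0, 0.5, 0.25]
}
```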
Rating¶
- Novelty: ⭐⭐⭐⭐ The hierarchical KL loss is intuitively novel, though technically a combination of existing components
- Experimental Thoroughness: ⭐⭐⭐ 6 environments with 10 seeds is acceptable, but ablation studies and more complex benchmarks are lacking
- Writing Quality: ⭐⭐⭐⭐ Clear structure, complete formulations, and well-motivated presentation
- Value: ⭐⭐⭐⭐ Concisely and effectively exploits multi-level feedback, offering meaningful reference for RLHF research