
Reward Redistribution via Gaussian Process Likelihood Estimation

Conference: AAAI 2026 · arXiv: 2503.17409 · Code: GitHub · Area: Other · Keywords: Sparse Reward, Reward Redistribution, Gaussian Process, Likelihood Estimation, Credit Assignment

TL;DR

This paper proposes GP-LRR, a reward redistribution framework based on Gaussian process likelihood estimation. It explicitly models correlations among state-action pairs via kernel functions, and learns a step-wise reward function by maximizing the marginal likelihood of trajectory returns using a leave-one-out strategy. Theoretical analysis demonstrates that conventional MSE-based methods are a degenerate special case of GP-LRR. When combined with SAC, GP-LRR shows superior sample efficiency and policy performance on MuJoCo benchmarks.

Background & Motivation

In many real-world reinforcement learning tasks, feedback is only provided at the end of a long sequence of actions (i.e., at episode termination), resulting in extremely sparse and delayed reward signals. For example, in aerospace design, the quality of an aircraft component can only be assessed after the entire manufacturing process is complete. Such sparse feedback makes it difficult to identify which actions most significantly influence the final outcome, causing slow convergence or convergence to suboptimal policies.

Limitations of Prior Work:

RUDDER: Uses an LSTM to predict step-wise returns and redistributes credit via backpropagation, but backpropagation through time (BPTT) over long sequences is unstable and computationally expensive.

IRCR: Uniformly distributes episode return across all time steps; computationally simple but entirely ignores temporal structure.

RRD: Approximates optimal decomposition by sampling random trajectory subsequences, but still ignores dependencies among state-action pairs.

GRD: Learns a causal generative model to produce interpretable proxy rewards, but involves a complex architecture.

Core Problem: All of the above methods assume per-step rewards are independent, thereby ignoring inter-dependencies among state-action pairs. However, the authors empirically observe significant lag-1 autocorrelation across multiple environments (e.g., adjacent step rewards \((r_t, r_{t+1})\) are highly correlated in HalfCheetah-v4), indicating that rewards at consecutive steps exhibit temporal dependence due to correlated state-action pairs. Ignoring such dependence leads to ineffective reward redistribution, overlooks important interactions among actions, and ultimately degrades learning efficiency.
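
The lag-1 check described above is straightforward to reproduce; a minimal sketch, assuming `step_rewards` holds logged per-step rewards from a dense-reward rollout (the random array here is only a placeholder):

```python
import numpy as np

# Minimal lag-1 autocorrelation check on per-step rewards (illustrative;
# step_rewards is a random placeholder standing in for logged rewards).
step_rewards = np.random.randn(1000)
r_t, r_next = step_rewards[:-1], step_rewards[1:]
lag1 = np.corrcoef(r_t, r_next)[0, 1]   # Pearson correlation between (r_t, r_{t+1})
print(f"lag-1 autocorrelation: {lag1:.3f}")
```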

Suitability of GPs: Gaussian processes (GPs) are a natural tool for modeling such dependence — as a non-parametric method, GPs treat the unknown reward function as a black-box function and explicitly model reward correlations between different state-action pairs via kernel (covariance) functions. Although GPs have been used in RL for value function approximation and transition dynamics modeling, to the best of the authors' knowledge, this is the first application of GPs to reward redistribution.

Method

Overall Architecture

The core mechanism of the GP-LRR framework:

  1. Model each step's reward \(R(s,a)\) as a sample from a Gaussian process.
  2. Construct the training objective using a leave-one-out (LOO) strategy.
  3. Maximize the marginal likelihood of the observed episode returns.
  4. Use the learned mean function as a dense reward signal for downstream policy optimization.

Key Designs

  1. Gaussian Process Reward Modeling: The ground-truth reward function \(R_b(s,a)\) at each step is modeled as a GP sample:

    \(R_b(s,a) \sim \mathcal{GP}(\mu_{\boldsymbol{\theta}}(s,a), k_{\boldsymbol{\phi}}((s,a),(s',a')))\)

where \(\mu_{\boldsymbol{\theta}}(s,a)\) is a parameterized mean function (represented by a neural network) and \(k_{\boldsymbol{\phi}}\) is a kernel function. For a trajectory \(\tau\), the ground-truth reward vector follows a multivariate Gaussian distribution: \(\mathbf{r}_b \sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{\theta}}, \mathbf{K}_{\boldsymbol{\phi}})\).

The kernel function measures reward correlations between different state-action pairs. The core assumption is that similar state-action pairs should yield similar rewards. The default kernel is the RBF kernel:

    \(k_{\boldsymbol{\phi}}((s,a),(s',a')) = \sigma_f^2 \exp\left(-\frac{\|(s,a)-(s',a')\|^2}{2\ell_{\text{rbf}}^2}\right)\)

  2. Leave-One-Out Objective Construction: In settings where only the episode return \(R_{ep}(\tau)\) is available without per-step rewards, a LOO target is constructed for each time step \(i\):

    \(\tilde{r}(s_i, a_i) = R_{ep}(\tau) - \sum_{t=0, t \neq i}^{T-1} \mu_{\boldsymbol{\theta}}(s_t, a_t)\)

This subtracts the current estimated rewards of all other steps from the total return, yielding a proxy target for the reward at the current step. These LOO targets serve as "noisy observations" for constructing the likelihood function.

  3. Marginal Likelihood Maximization: Given the LOO targets, the following log marginal likelihood is optimized (a code sketch combining the kernel, the LOO targets, and this likelihood follows the list below):

    \(\log p(\tilde{\mathbf{r}} \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \sigma_\epsilon) = \underbrace{-\frac{1}{2}(\tilde{\mathbf{r}}-\boldsymbol{\mu}_{\boldsymbol{\theta}})^\top \mathbf{K}_\sigma^{-1}(\tilde{\mathbf{r}}-\boldsymbol{\mu}_{\boldsymbol{\theta}})}_{\text{data fit term}} \underbrace{- \frac{1}{2}\log\det(\mathbf{K}_\sigma)}_{\text{Occam factor}} - \frac{|\tau|}{2}\log(2\pi)\)

where \(\mathbf{K}_\sigma = \mathbf{K}_{\boldsymbol{\phi}} + \sigma_\epsilon^2 \mathbf{I}\). The data fit term measures how well the model explains the data, while the Occam factor penalizes overfitting.

  4. Integration with SAC (Algorithm 2):

    • Maintains two buffers: a transition buffer \(\mathcal{D}\) (for SAC updates) and a full trajectory buffer \(\mathcal{D}_\tau\) (for GP training).
    • Updates the GP model every \(n_{\text{update}}\) episodes.
    • Uses the learned mean function \(\mu_{\boldsymbol{\theta}}(s,a)\) as a dense reward signal in place of the sparse episode reward during SAC updates.
    • GP training uses a Cholesky decomposition of \(\mathbf{K}_\sigma\) for numerical stability.
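
Pulling designs 1–3 together, a minimal PyTorch sketch of the per-trajectory objective is given below. The names (`MeanNet`, `rbf_kernel`, `gp_lrr_nll`) are illustrative stand-ins rather than the authors' released code, and detaching the LOO targets is an assumption chosen to be consistent with the gradient stated in Proposition 2 below.

```python
import math
import torch
import torch.nn as nn

class MeanNet(nn.Module):
    """Parameterized mean function mu_theta(s, a): a small MLP over concatenated (s, a)."""
    def __init__(self, sa_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sa_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, sa):                  # sa: (T, sa_dim)
        return self.net(sa).squeeze(-1)     # per-step mean rewards, shape (T,)


def rbf_kernel(sa, log_sigma_f, log_lengthscale):
    """RBF kernel k_phi over state-action vectors (Design 1)."""
    sq_dist = torch.cdist(sa, sa).pow(2)
    return torch.exp(2 * log_sigma_f) * torch.exp(-0.5 * sq_dist / torch.exp(2 * log_lengthscale))


def gp_lrr_nll(sa, episode_return, mean_net, log_sigma_f, log_lengthscale, log_sigma_eps):
    """Negative log marginal likelihood of the LOO targets for a single trajectory."""
    T = sa.shape[0]
    mu = mean_net(sa)                                          # (T,)
    # Design 2: leave-one-out targets, r_tilde_i = R_ep - sum_{t != i} mu_t,
    # treated as fixed "noisy observations" (hence the detach).
    r_tilde = (episode_return - (mu.sum() - mu)).detach()
    # Design 3: marginal likelihood with K_sigma = K_phi + sigma_eps^2 I.
    K_sigma = rbf_kernel(sa, log_sigma_f, log_lengthscale) + torch.exp(2 * log_sigma_eps) * torch.eye(T)
    L = torch.linalg.cholesky(K_sigma)                         # Cholesky factor for numerical stability
    resid = (r_tilde - mu).unsqueeze(-1)                       # (T, 1)
    alpha = torch.cholesky_solve(resid, L)                     # K_sigma^{-1} (r_tilde - mu)
    data_fit = 0.5 * (resid * alpha).sum()
    occam = torch.log(torch.diagonal(L)).sum()                 # 0.5 * log det(K_sigma)
    return data_fit + occam + 0.5 * T * math.log(2 * math.pi)
```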

Theoretical Analysis

The paper provides four important theoretical propositions:

Proposition 1 (MSE as a Special Case): When the kernel matrix degenerates to the identity \(\mathbf{K}_{\boldsymbol{\phi}} = \mathbf{I}\) and observation noise vanishes \(\sigma_\epsilon = 0\), the objective of GP-LRR reduces to the standard MSE reward redistribution loss:

\[\mathcal{L}(\tau; \boldsymbol{\theta}) \propto \frac{|\tau|}{2}\left(R_{ep}(\tau) - \sum_{t=0}^{|\tau|-1}\mu_{\boldsymbol{\theta}}(s_t, a_t)\right)^2\]

This formally establishes that GP-LRR strictly subsumes conventional methods.
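
To make the reduction concrete (filling in the intermediate step from the definitions above): with \(\mathbf{K}_\sigma = \mathbf{I}\) the log-determinant vanishes, and every LOO residual equals the same return error, \(\tilde{r}(s_i,a_i) - \mu_{\boldsymbol{\theta}}(s_i,a_i) = R_{ep}(\tau) - \sum_{t}\mu_{\boldsymbol{\theta}}(s_t,a_t)\), so

\[-\log p(\tilde{\mathbf{r}} \mid \cdot) = \frac{1}{2}\sum_{i=0}^{|\tau|-1}\left(\tilde{r}(s_i,a_i) - \mu_{\boldsymbol{\theta}}(s_i,a_i)\right)^2 + \text{const} = \frac{|\tau|}{2}\left(R_{ep}(\tau) - \sum_{t=0}^{|\tau|-1}\mu_{\boldsymbol{\theta}}(s_t,a_t)\right)^2 + \text{const}.\]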

Proposition 2 (Gradient Flow and Correlation): The gradient of GP-LRR with respect to the mean function parameters is:

\[\frac{\partial\mathcal{L}}{\partial\boldsymbol{\theta}} = -\frac{\partial\boldsymbol{\mu}_{\boldsymbol{\theta}}^\top}{\partial\boldsymbol{\theta}} \cdot \mathbf{K}_\sigma^{-1}(\tilde{\mathbf{r}} - \boldsymbol{\mu}_{\boldsymbol{\theta}})\]

For a specific \((s_i, a_i)\), the gradient contribution aggregates prediction errors from all state-action pairs (weighted by the precision matrix \(\mathbf{K}_\sigma^{-1}\)), rather than relying solely on its own error. This creates a "credit assignment network" — errors from highly correlated state-action pairs are pooled together during updates.
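
A tiny numerical illustration of this pooling (synthetic numbers, not taken from the paper): two of three steps are highly correlated under the kernel, and the weighted residual of the second step is nonzero even though its own residual is zero.

```python
import numpy as np

# Toy precision-matrix pooling: steps 0 and 1 are strongly correlated,
# step 2 is independent; residuals are r_tilde - mu at each step.
K_sigma = np.array([[1.0, 0.9, 0.0],
                    [0.9, 1.0, 0.0],
                    [0.0, 0.0, 1.0]]) + 0.1 * np.eye(3)
residual = np.array([1.0, 0.0, 1.0])

print(np.linalg.solve(K_sigma, residual))          # pooled errors: K_sigma^{-1} residual
print(np.linalg.solve(1.1 * np.eye(3), residual))  # identity kernel (MSE case): each step sees only its own error
```

Here step 1 receives a sizeable negative weighted error purely through its correlation with step 0, whereas under the identity kernel its update signal would be zero.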

Proposition 3 (Length-Scale and Smoothness Trade-off): The GP automatically adjusts \(\ell_{\text{rbf}}\) by maximizing the marginal likelihood, balancing data fit against model complexity.

Proposition 4 (Adaptive Noise Level): The gradient with respect to \(\sigma_\epsilon^2\) includes a regularization term \(\frac{1}{2}\text{tr}(\mathbf{K}_\sigma^{-1})\), which prevents noise collapse and avoids overfitting.
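
Using the standard identities for differentiating a GP marginal likelihood (and writing the loss as the negative log likelihood, with \(\partial\mathbf{K}_\sigma / \partial\sigma_\epsilon^2 = \mathbf{I}\)), this gradient reads

\[\frac{\partial\mathcal{L}}{\partial\sigma_\epsilon^2} = -\frac{1}{2}\left\|\mathbf{K}_\sigma^{-1}(\tilde{\mathbf{r}} - \boldsymbol{\mu}_{\boldsymbol{\theta}})\right\|^2 + \frac{1}{2}\operatorname{tr}(\mathbf{K}_\sigma^{-1}),\]

so at a stationary point the squared weighted residuals balance the trace term, and the learned noise level adapts to how much of the LOO targets the kernel leaves unexplained.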

Loss & Training

  • The GP model is trained by maximizing the log marginal likelihood; parameters \(\boldsymbol{\theta}, \sigma_f^2, \ell_{\text{rbf}}, \sigma_\epsilon^2\) are updated via gradient descent.
  • The SAC component uses the standard soft Bellman residual (critic), entropy-augmented policy loss (actor), and adaptive temperature.
  • GP training complexity is \(\mathcal{O}(M|\tau|^3)\); cost is amortized by updating every 100 steps with a batch size of 4.
  • The reward model uses a 2-hidden-layer MLP (64 or 256 neurons) with learning rate \(10^{-3}\).
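
For orientation, the sketch below wires these pieces together, reusing `MeanNet` and `gp_lrr_nll` from the sketch after the Key Designs list. The optimizer choice and input dimensions are illustrative assumptions; only the hyperparameters (2-hidden-layer MLP, learning rate \(10^{-3}\), trajectory batch of 4) come from the bullets above.

```python
import torch

# Illustrative wiring of the reward-model update (reuses MeanNet / gp_lrr_nll
# from the earlier sketch); dimensions and the Adam optimizer are assumptions.
mean_net = MeanNet(sa_dim=23, hidden=64)                 # e.g., 17-dim state + 6-dim action
log_sigma_f = torch.zeros((), requires_grad=True)
log_lengthscale = torch.zeros((), requires_grad=True)    # ell_rbf = 1.0 at initialization
log_sigma_eps = torch.zeros((), requires_grad=True)
opt = torch.optim.Adam(
    list(mean_net.parameters()) + [log_sigma_f, log_lengthscale, log_sigma_eps], lr=1e-3
)

def reward_model_update(trajectory_batch):
    """One amortized GP update over a small batch of full trajectories (O(M |tau|^3))."""
    loss = sum(
        gp_lrr_nll(sa, ep_return, mean_net, log_sigma_f, log_lengthscale, log_sigma_eps)
        for sa, ep_return in trajectory_batch            # batch of M = 4 trajectories
    )
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# During SAC updates, the sparse episode reward is replaced by the dense surrogate:
#   r_dense = mean_net(torch.cat([state, action], dim=-1))
```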

Key Experimental Results

Main Results

Comparison under sparse episode reward settings across four MuJoCo environments (mean return over 5 independent runs):

| Environment | SAC (sparse) | IRCR (uniform) | RRD (random decomp.) | GP-LRR (Ours) |
|---|---|---|---|---|
| HalfCheetah-v4 | Low, stagnates | Slow improvement | Moderate | Highest, fastest convergence |
| Hopper-v4 | Low, high variance | Moderate | Moderate | Highest |
| Swimmer-v4 | Low | Close to GP-LRR | Moderate | Highest |
| Walker2d-v4 | Low | Low | Moderate | Highest |

GP-LRR consistently achieves the best performance across all four environments, with the most pronounced advantage in HalfCheetah-v4, where smooth dynamics and spatially correlated rewards closely match GP's correlation modeling.

Ablation Study

Kernel Function Selection:

| Kernel | HalfCheetah | Hopper | Swimmer | Walker2d |
|---|---|---|---|---|
| RBF | Best | Best | Best | Worst |
| Matérn-3/2 | Moderate | Moderate | Moderate | Best |
| Rational Quadratic | Moderate | Moderate | Moderate | Second best |

Kernel performance is task-dependent: RBF suits smooth reward landscapes, while Matérn is more appropriate for Walker2d, which involves discontinuities.

Length-Scale Initialization:

| Init. \(\ell\) | HalfCheetah | Hopper | Swimmer | Walker2d |
|---|---|---|---|---|
| Small (0.1) | Poor | High variance | Insensitive | Moderate |
| Medium (1.0) | Moderate | Moderate | Insensitive | Best |
| Large (10.0) | Best | Moderate | Insensitive | Moderate |
| Adaptive | Moderate | High variance | Insensitive | Moderate |

No universally optimal initialization exists — HalfCheetah favors large values, Walker2d favors medium values, and Swimmer is insensitive.

Key Findings

  1. Value of Correlation Modeling: The advantage of GP-LRR primarily stems from pooling error information across correlated state-action pairs through the kernel function, rather than treating each step's reward independently.
  2. Implicit Uncertainty Regularization: The Occam factor in the marginal likelihood and the adaptive noise mechanism prevent overconfident reward assignment in under-explored regions.
  3. Task Dependency: The optimal kernel function and hyperparameter settings are highly dependent on the reward structure of the specific environment; there is no universally optimal configuration.
  4. Off-Policy Compatibility: Seamless integration with SAC enables GP-LRR to achieve superior sample efficiency compared to methods requiring on-policy data.

Highlights & Insights

  • A novel perspective on reward redistribution via non-parametric methods: Treating the reward function as a black-box and modeling it with a GP avoids the constraints of parametric family assumptions, while naturally incorporating correlation structure through the kernel function.
  • Tight coupling of theory and practice: The four propositions clearly reveal the relationship between GP-LRR and conventional methods (the subsumption result in Proposition 1), the source of its advantage (the credit assignment network in Proposition 2), and the built-in regularization mechanisms (Propositions 3–4).
  • Elegant design of the LOO strategy: Under the constraint of only observing episode-level returns, the LOO target leverages the full trajectory while providing well-defined "observations" for the GP likelihood.
  • The proof that MSE is a degenerate special case is particularly elegant: \(\mathbf{K}_\phi = \mathbf{I}\) (no correlation) \(+ \sigma_\epsilon = 0\) (no noise) \(=\) conventional MSE, offering a clear explanation of the limitations of prior methods.

Limitations & Future Work

  1. Computational Complexity: The \(\mathcal{O}(M|\tau|^3)\) training complexity of GPs limits applicability to tasks with long trajectories. Although Cholesky decomposition accelerates matrix inversion, it remains a bottleneck as trajectory length increases.
  2. Limited Experimental Scope: Evaluation is restricted to four MuJoCo environments, and the authors acknowledge that not all environments are well-suited to modeling state-action correlations. Validation on discrete action spaces and high-dimensional observation spaces is absent.
  3. Lack of Guidance for Kernel Selection: The optimal kernel varies across environments, and no mechanism for automatic kernel selection or combination is provided.
  4. Gaussian Noise Assumption Only: The framework may fail when reward noise is non-Gaussian; the authors list non-Gaussian extensions as future work.
  5. Missing Comparisons with Recent Baselines: No experimental comparison is made against more recent methods such as GRD and DIASter.
  6. Hyperparameter Sensitivity: Length-scale initialization has a notable impact on results, increasing the tuning burden.

  • The application of GPs in RL has expanded from value function approximation and transition dynamics to reward redistribution, demonstrating the broad applicability of GPs as flexible non-parametric tools.
  • The LOO strategy may inspire other scenarios that require inferring local contributions from global signals, such as multi-agent credit assignment and feature attribution.
  • The concept of a "credit assignment network" induced by the precision matrix may inspire graph-structured reward redistribution methods.
  • Future work could consider sparse GP approximations (e.g., inducing point methods) to reduce computational complexity and enable application to longer trajectories.

Rating

| Dimension | Score (1–5) | Remarks |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | First application of GP to reward redistribution; fresh perspective |
| Practicality | ⭐⭐⭐ | Computational complexity and hyperparameter sensitivity limit real-world use |
| Theoretical Depth | ⭐⭐⭐⭐⭐ | Four propositions are complete and elegant; MSE special-case proof is particularly strong |
| Experimental Thoroughness | ⭐⭐⭐ | Limited environments; lacks comparison with more recent methods |
| Writing Quality | ⭐⭐⭐⭐ | Clear structure; theory and experiments are tightly integrated |
| Overall | ⭐⭐⭐⭐ | Strong theoretical contribution; experimental validation scope needs expansion |