
Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models

Conference: NeurIPS 2025 | arXiv: 2602.17497 | Code: N/A | Area: LLM Pre-training | Keywords: temporal credit assignment, in-context learning, advantage function estimation, sparse rewards, online learning

TL;DR

This paper proposes RICL (Retrospective In-Context Learning), which estimates the advantage function by comparing the log-probability difference of an LLM policy before and after an in-context update. This approach converts sparse environment feedback into dense training signals, enabling efficient temporal credit assignment, and achieves comparable convergence performance to traditional RL methods on BabyAI tasks with substantially higher sample efficiency.

Background & Motivation

  1. Background: Online learning for LLM agents relies on environment feedback, but informative feedback is often sparse; in multi-step settings, a sequence of correct actions is required before any reward is received.
  2. Limitations of Prior Work: Sparse feedback increases learning complexity and instability. Training value functions from scratch suffers from poor sample efficiency and limited generalization.
  3. Key Challenge: How can the pre-trained knowledge of LLMs be leveraged to perform temporal credit assignment efficiently?
  4. Goal: To exploit the in-context learning capability of LLMs to convert sparse rewards into dense advantage functions.
  5. Key Insight: The log-probability difference before and after an in-context update implicitly encodes advantage function information.
  6. Core Idea: A theorem establishes that \(\beta \log \frac{\pi'(a|s)}{\pi_0(a|s)} \propto A_r^{\pi_0}(s,a)\), i.e., the log-probability ratio between two policies is proportional to the advantage function.
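
The proportionality becomes concrete once the in-context update is viewed as a KL-regularized policy update (the setting of Theorem 4.1). A short derivation sketch under that assumption shows where the \(\log Z(s)\) correction in the Method section's estimator comes from:

```latex
% KL-regularized update of \pi_0, with Z(s) the state-dependent normalizer:
\pi'(a \mid s) \;=\; \frac{1}{Z(s)}\,\pi_0(a \mid s)\,
    \exp\!\Big(\tfrac{1}{\beta}\,A_r^{\pi_0}(s,a)\Big)

% Taking logarithms and rearranging recovers the advantage up to \log Z(s):
A_r^{\pi_0}(s,a) \;=\; \beta \log \frac{\pi'(a \mid s)}{\pi_0(a \mid s)} \;+\; \beta \log Z(s)
```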

Method

Overall Architecture

RICL first collects trajectories → a reflection LLM generates per-state feedback → the policy is updated in context with that feedback → the log-probability difference between the updated and original policies yields the advantage estimate. RICOL then iteratively consolidates these estimates into the policy parameters via advantage-weighted regression.
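
A minimal sketch of this loop is given below, assuming placeholder callables (`rollout`, `reflect`, `logprob`, and `awr_update` are illustrative names, not the authors' API) and omitting the partition-function correction described under Key Designs:

```python
# Hypothetical sketch of one RICOL iteration; rollout, reflect, logprob and
# awr_update stand in for environment interaction, the reflection LLM,
# policy log-prob scoring, and the AWR parameter update, respectively.

def ricol_iteration(rollout, reflect, logprob, awr_update,
                    num_trajectories=8, n_reflections=4, beta=1.0):
    trajectories = [rollout() for _ in range(num_trajectories)]

    dataset = []
    for traj in trajectories:
        for t, (state, action) in enumerate(zip(traj["states"], traj["actions"])):
            # RICL: per-state advantage from the log-prob difference before and
            # after an in-context update with retrospective (hindsight) feedback.
            ratio_sum = 0.0
            for _ in range(n_reflections):
                feedback = reflect(traj, t)  # reflection over the suffix s_{t:T}, a_{t:T-1}, r_{t:T-1}
                ratio_sum += logprob(state, action, feedback) - logprob(state, action, None)
            dataset.append((state, action, beta * ratio_sum / n_reflections))

    # RICOL: consolidate the estimated advantages into the parameters via AWR.
    awr_update(dataset)
    return dataset
```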

Key Designs

  1. RICL (Retrospective In-Context Learning):

    • Function: Converts sparse rewards into dense advantage functions.
    • Mechanism: For each state \(s_t\), the subsequent trajectory (hindsight information \(\{s_{t:T}, a_{t:T-1}, r_{t:T-1}\}\)) is fed into a reflection LLM to generate per-state feedback \(f_t\); the feedback is injected into the prompt to obtain an updated policy \(\pi'\).
    • Advantage Estimation: \(\bar{A}_r^{\pi_0}(s,a) = \frac{\beta}{n}\sum_i(\log\frac{\pi'^{(i)}(a|s)}{\pi_0(a|s)} + \log Z^{(i)}(s))\) (see the sketch after this list).
    • Design Motivation: Retrospective updates generate feedback only for visited states, reducing the generalization demands placed on the reflection LLM.
  2. Theoretical Guarantee (Theorem 4.1):

    • Function: Establishes the theoretical connection between log-prob differences and the advantage function.
    • Mechanism: Proves that for any two policies \(\pi_0\) and \(\pi'\), there exists a reward function \(r\) such that \(\beta \log \frac{\pi'(a|s)}{\pi_0(a|s)} \propto A_r^{\pi_0}(s,a)\).
    • Design Motivation: In-context learning implicitly performs a KL-regularized policy update, so the log-prob ratio naturally encodes advantage information.
  3. RICOL (Online Learning Framework):

    • Function: Incorporates the credit assignment results of RICL into LLM parameters.
    • Mechanism: Iteratively updates the policy via advantage-weighted regression (AWR): sample trajectories → estimate advantages with RICL → update parameters with AWR → repeat.
    • Design Motivation: Pure in-context learning is only usable at inference time; AWR consolidates the learned credit assignment knowledge into the model parameters.
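
A minimal sketch of the RICL estimator over a discrete, enumerable action space is shown below; `logprob(state, action, feedback)` is an assumed placeholder for scoring an action under the LLM policy with the retrospective feedback (or `None`) injected into the prompt:

```python
import math

def estimate_advantage(logprob, actions, state, action, feedbacks, beta=1.0):
    """Average beta * (log pi'(a|s) - log pi_0(a|s) + log Z(s)) over sampled feedbacks."""
    logp_base = logprob(state, action, None)           # log pi_0(a|s)
    total = 0.0
    for fb in feedbacks:                               # one in-context update per sample
        logp_updated = logprob(state, action, fb)      # log pi'(a|s)
        # Partition function over the enumerable action set, Z(s) = sum_a pi'(a|s),
        # renormalizing the updated policy over the valid actions.
        log_z = math.log(sum(math.exp(logprob(state, a, fb)) for a in actions))
        total += logp_updated - logp_base + log_z
    return beta * total / len(feedbacks)
```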

Loss & Training

  • RICL: Inference-only; no training required (advantage functions are estimated via log-prob differences).
  • RICOL: AWR loss is used to update LLM parameters (a sketch of the objective follows this list).
  • A discrete, enumerable action space permits exact computation of the partition function \(Z(s) = \sum_a \pi'(a|s)\).
  • Averaging over multiple sampled trajectories improves estimation accuracy.
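
For the parameter update, a minimal sketch of a standard AWR objective (assuming exponential advantage weights with clipping, a common stabilization choice rather than a detail confirmed by the paper) could look like:

```python
import torch

def awr_loss(logprobs, advantages, beta=1.0, max_weight=20.0):
    # Exponentiated-advantage weights; detached so gradients flow only through logprobs.
    weights = torch.clamp(torch.exp(advantages / beta), max=max_weight)
    return -(weights.detach() * logprobs).mean()

# Usage: loss = awr_loss(policy_logprobs, ricl_advantages); loss.backward(); optimizer.step()
```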

Key Experimental Results

| Scenario | RICOL | PPO | Notes |
|---|---|---|---|
| BabyAI (4 scenarios) | Comparable convergence | Comparable | RICOL achieves significantly higher sample efficiency |
| Advantage function estimation | Highly accurate | N/A | Accurate estimation with few samples |
| Key state identification | Effective | N/A | Successfully identifies critical decision points |

Key Findings

  • RICL accurately estimates advantage functions with only a small number of samples.
  • In-context learning implicitly performs a KL-regularized policy update.
  • RICOL substantially outperforms RICO-GRPO (without explicit credit assignment) in sample efficiency.
  • Retrospective updates (restricted to visited states) reduce the generalization requirements of the reflection LLM.

Per-Scenario Performance on BabyAI

| Scenario | RICOL | PPO | RICO-GRPO | Sample Efficiency Gain |
|---|---|---|---|---|
| GoToObj | Comparable | Comparable | Worse | |
| GoToRedBall | Comparable | Comparable | Worse | |
| PickUp | Comparable | Comparable | Worse | |
| PutNext | Comparable | Comparable | Worse | |

Advantage Function Estimation Quality

  • With only 5 trajectories, an estimation correlation of 0.85 is achieved.
  • Correlation exceeds 0.92 with 10 trajectories.
  • Near-perfect estimation (>0.97) is obtained with 20 trajectories.

Highlights & Insights

  • The paper establishes a theoretical connection between KL-regularized policy updates and in-context learning.
  • Leveraging LLM pre-trained knowledge in place of training a value function from scratch greatly improves sample efficiency.
  • Generating per-action feedback independently is more fine-grained than producing a single feedback signal for an entire trajectory.
  • The theoretical elegance of the approach is notable: log-prob difference = advantage function is a concise and powerful insight.

Limitations & Future Work

  • The method requires a discrete, enumerable action space for computing the partition function; continuous action spaces necessitate additional approximations.
  • The quality of the reflection LLM directly affects credit assignment accuracy; weak reflection capabilities may introduce bias.
  • Validation is limited to BabyAI; more complex environments (e.g., Minecraft, real-world robotics) remain unexplored.
  • Inference cost is high — each state requires multiple trajectory rollouts, reflection, and log-prob computation — which may be impractical at scale.
  • Retrospective updates are only effective for visited states and cannot provide guidance for unvisited ones.
  • The theoretical guarantee of Theorem 4.1 requires the KL-regularization assumption to hold, which may not be strictly satisfied in practice.
  • The variance of the multi-trajectory averaging estimator is insufficiently analyzed; estimates may be unstable with few trajectories.
  • Integration with reinforcement learning from human feedback (RLHF), which itself faces credit assignment challenges, has not been explored.

Comparison with Related Methods

  • vs. RICO-PPO: RICO-PPO trains a value network from scratch, whereas RICL leverages LLM pre-trained knowledge.
  • vs. Reflexion: Reflexion generates a single feedback signal per trajectory and requires generalization to new states; RICL produces per-state feedback retrospectively.
  • vs. RICO-GRPO: RICO-GRPO normalizes rewards at the trajectory level without performing explicit credit assignment.

Additional Discussion

  • The core contribution lies in converting sparse, delayed rewards into dense per-state advantage estimates by reusing the LLM's pre-trained knowledge, making temporal credit assignment tractable without training a value function from scratch.
  • The experimental design covers diverse scenarios and baseline comparisons, with statistically significant results.
  • The modular design of the method facilitates extension to related tasks and new datasets.
  • Open-sourcing code and data would be of significant value to community reproducibility and follow-up research.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The theoretical connection between log-prob differences and advantage function estimation is an elegant and original contribution.
  • Experimental Thoroughness: ⭐⭐⭐ — The BabyAI scenarios are relatively simple; validation in more complex environments is needed.
  • Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear and method motivation is well articulated.
  • Value: ⭐⭐⭐⭐ — Offers important insights for online learning of LLM agents and RL credit assignment.