Training a Generally Curious Agent (Paprika)

Conference: ICML 2025
arXiv: 2502.17543
Code: -
Area: Model Compression
Keywords: in-context RL, curious agent, curriculum learning, sequential decision making, Paprika

TL;DR

Proposes the Paprika framework, which fine-tunes LLMs on diverse text-based decision-making tasks, enabling the model to learn general information-gathering and decision-making capabilities and transfer zero-shot to completely unseen tasks.

Background & Motivation

  • LLMs acting as autonomous agents must interact with environments and collect information to achieve goals.
  • Direct deployment in the real world to collect data is highly costly and risky.
  • Synthetic data generation cannot cover all tasks, but the in-context learning capability of LLMs supports learning generalized policies from a small set of tasks.
  • Core idea: rather than training the model to solve every task, train it to learn the general process of solving tasks through interaction (i.e., in-context RL).
  • This is analogous to the SFT/RLHF phases where only a small number of examples are needed to yield models capable of generalizing to diverse queries.

Method

1. Task Design (10 Task Groups)

All tasks are text-based, multi-turn, and partially observable:

| Task Group | Training Tasks | Max Turns | Environment Feedback |
|---|---|---|---|
| 20 Questions | 1499 | 20 | LLM-generated |
| Guess My City | 500 | 20 | LLM-generated |
| Wordle | 1515 | 6 | Hardcoded |
| Cellular Automata | 1000 | 6 | Hardcoded |
| Customer Service | 628 | 20 | LLM-generated |
| Murder Mystery | 203 | 20 | LLM-generated |
| Mastermind | 1000 | 12 | Hardcoded |
| Battleship | 1000 | 20 | Hardcoded |
| Minesweeper | 1000 | 20 | Hardcoded |
| Bandit Best Arm | 81 | 21 | Hardcoded |

2. Dataset Construction

  • Diverse trajectories are generated using Min-p sampling (temperature 1.5, p=0.3).
  • Each task generates \(n_\text{sample}=20\) trajectories.
  • Preference pairs \((h_w, h_l)\) are constructed: \(h_w\) is the highest-scoring trajectory, while \(h_l\) is randomly sampled from low-scoring trajectories.
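The filtering and pairing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: `min_p_filter` and `build_preference_pair` are hypothetical helper names, the filter follows the usual min-p rule (drop tokens whose probability is below p times the top probability, then renormalize), temperature scaling is not shown, and "low-scoring" is assumed here to mean any trajectory that scores strictly below the best one.

```python
import random

def min_p_filter(probs, min_p=0.3):
    """Zero out tokens with probability below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token is kept
    return [p / total for p in kept]

def build_preference_pair(trajectories, scores, rng=random):
    """Return (h_w, h_l): the top-scoring trajectory and a random lower-scoring one.

    Returns None when every trajectory has the same score, because no
    preference can be formed (assumption: such tasks are simply skipped).
    """
    best = max(scores)
    best_idx = scores.index(best)
    losers = [i for i, s in enumerate(scores) if s < best]
    if not losers:
        return None
    return trajectories[best_idx], trajectories[rng.choice(losers)]
```

With 20 sampled trajectories per task, a task only contributes a pair under this sketch when at least two of its trajectories receive different scores.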

3. Optimization Objectives

SFT: Maximizing likelihood on winning trajectories:

\[\mathcal{L}_\text{SFT} = -\mathbb{E}\left[\frac{1}{\sum_t |a_t^w|}\sum_t \log \pi_\theta(a_t^w | h_{:t}^w)\right]\]
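As a rough illustration of the normalization in this loss, the sketch below computes the per-trajectory term from already-computed token log-probabilities. The function name `sft_loss` and the nested-list input layout (one inner list of token log-probs per agent turn) are assumptions made for this example.

```python
def sft_loss(action_token_logprobs):
    """Length-normalized negative log-likelihood of one winning trajectory.

    action_token_logprobs: list of turns; each turn is a list of log-probs,
    one per token of the agent's action a_t (environment tokens excluded).
    """
    total_tokens = sum(len(turn) for turn in action_token_logprobs)
    total_logprob = sum(sum(turn) for turn in action_token_logprobs)
    return -total_logprob / total_tokens
```

For example, a two-turn trajectory with action log-probs `[[-1.0, -2.0], [-3.0]]` has three action tokens and a summed log-probability of -6.0, so the loss is 2.0.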

Multi-turn DPO:

\[\mathcal{L}_\text{DPO} = -\mathbb{E}\left[\log\sigma\left(\sum_t \beta\log\frac{\pi_\theta(a_t^w|h_{:t}^w)}{\pi_\text{ref}(a_t^w|h_{:t}^w)} - \sum_t \beta\log\frac{\pi_\theta(a_t^l|h_{:t}^l)}{\pi_\text{ref}(a_t^l|h_{:t}^l)}\right)\right]\]

Loss is computed only on the agent's action tokens (excluding environment-generated tokens).
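The masking described above can be sketched for a single preference pair as follows. This is an illustrative sketch only: the function names, the 0/1 `mask` convention (1 for agent action tokens, 0 for environment tokens), and the default `beta=0.1` are assumptions, and the token log-probs are taken as precomputed inputs.

```python
import math

def masked_logprob(token_logprobs, action_mask):
    """Sum log-probs over agent action tokens only (mask value 1)."""
    return sum(lp for lp, m in zip(token_logprobs, action_mask) if m)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def multi_turn_dpo_loss(pol_w, ref_w, mask_w, pol_l, ref_l, mask_l, beta=0.1):
    """DPO loss for one (winner, loser) trajectory pair."""
    ratio_w = masked_logprob(pol_w, mask_w) - masked_logprob(ref_w, mask_w)
    ratio_l = masked_logprob(pol_l, mask_l) - masked_logprob(ref_l, mask_l)
    logits = beta * (ratio_w - ratio_l)
    # -log(sigmoid(z)) equals softplus(-z)
    return softplus(-logits)
```

When the policy and reference agree on every action token, the loss equals log 2 regardless of how they differ on environment tokens, which is the point of the mask.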

RPO: Adds an SFT term to the DPO loss to mitigate the "unintentional unalignment" of DPO, where the likelihood of the preferred trajectory can decrease during training:

\[\mathcal{L}_\text{RPO} = \mathcal{L}_\text{DPO} + \alpha \mathcal{L}_\text{SFT}\]
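For reference, substituting the SFT and DPO definitions given earlier into this sum yields the expanded form below. It is a direct algebraic expansion under the assumption that both expectations are taken over the same preference pairs \((h_w, h_l)\); \(\Delta_\theta\) is shorthand introduced here for the DPO margin.

\[\Delta_\theta = \sum_t \beta\log\frac{\pi_\theta(a_t^w|h_{:t}^w)}{\pi_\text{ref}(a_t^w|h_{:t}^w)} - \sum_t \beta\log\frac{\pi_\theta(a_t^l|h_{:t}^l)}{\pi_\text{ref}(a_t^l|h_{:t}^l)}\]

\[\mathcal{L}_\text{RPO} = -\mathbb{E}\left[\log\sigma\left(\Delta_\theta\right) + \frac{\alpha}{\sum_t |a_t^w|}\sum_t \log \pi_\theta(a_t^w | h_{:t}^w)\right]\]

The second term rewards raising the absolute likelihood of the winning trajectory, whereas the first only rewards widening the gap between winner and loser.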

4. Curriculum Learning: Scalable Task Selection

Learning Potential Metric: Measures the learning signal strength of tasks using the coefficient of variation:

\[\nu_\pi(\tau) = \frac{\sqrt{\sigma^2_\pi(\tau)}}{R_\pi(\tau)}\]

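Here \(\sigma^2_\pi(\tau)\) is the variance and \(R_\pi(\tau)\) the mean of the episode reward that policy \(\pi\) obtains on task \(\tau\). High variance means the model is more likely to sample both good and bad trajectories for the task, so the resulting preference pairs carry a useful gradient signal for DPO. Dividing by the mean reward makes the metric comparable across tasks with different reward scales.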
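A minimal sketch of this metric, estimated from a handful of sampled episode rewards for one task. The function name `learning_potential`, the use of the population standard deviation, and the small `eps` guard against a zero mean are assumptions for illustration; rewards are assumed to be non-negative.

```python
import statistics

def learning_potential(rewards, eps=1e-8):
    """Coefficient of variation (std / mean) of sampled episode rewards."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return std / (mean + eps)         # eps avoids division by zero
```

A task the model always solves or always fails has zero reward variance and therefore zero learning potential in this sketch, while a task solved about half the time scores close to 1.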

UCB Algorithm for Task Group Selection: Treats each task group as an arm and utilizes UCB to balance exploration and exploitation.
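The sketch below shows one way such a selector could look, using the textbook UCB1 rule with the observed learning potential as the arm reward. It is not the paper's exact procedure: the class name `UCBTaskSelector`, the exploration constant `c`, and the choice to feed learning-potential estimates in as rewards are all assumptions for illustration.

```python
import math

class UCBTaskSelector:
    """Textbook UCB1 over task groups (illustrative sketch)."""

    def __init__(self, groups, c=1.0):
        self.groups = list(groups)
        self.c = c  # exploration strength (assumed value)
        self.counts = {g: 0 for g in self.groups}
        self.means = {g: 0.0 for g in self.groups}
        self.total = 0

    def select(self):
        # Try every group once before relying on confidence bounds.
        for g in self.groups:
            if self.counts[g] == 0:
                return g

        def ucb(g):
            bonus = self.c * math.sqrt(2.0 * math.log(self.total) / self.counts[g])
            return self.means[g] + bonus

        return max(self.groups, key=ucb)

    def update(self, group, observed_potential):
        # Incremental running mean of the observed learning potential.
        self.counts[group] += 1
        self.total += 1
        n = self.counts[group]
        self.means[group] += (observed_potential - self.means[group]) / n
```

In use, each round would call `select()`, sample tasks from the chosen group, estimate their learning potential, and pass that estimate to `update()`. Groups with consistently higher potential get sampled more often, while the bonus term keeps rarely visited groups from being ignored.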

Experimental Results

Main Results: All-Task Training

Paprika improves the average success rate of Llama-3.1-8B-Instruct by 47% (relative gain) across all 10 task groups, using only approximately 22,500 trajectories.

Zero-Shot Transfer (Leave-One-Out)

| Task Group | Baseline | Paprika (LOO) | Paprika (All) | Single-Task Training |
|---|---|---|---|---|
| Bandit | 42.25% | 62.25% | 65.0% | 58.0% |
| 20 Questions | ~25% | ~38% | ~40% | ~35% |
| Murder Mystery | ~15% | ~28% | ~32% | ~30% |

  • The LOO model outperforms the initial model on 9 out of 10 task groups.
  • On 7 out of 10 task groups, training on all tasks outperforms single-task training.
  • This indicates that cross-task policy transfer indeed occurs.

Curriculum Learning Effects

  • Curriculum learning improves the average success rate by 1.4% and pass@4 by 3.3% compared to uniform sampling.
  • The benefits are mainly observed in moderately difficult tasks.

Highlights & Insights

  • Demonstrates that LLMs can learn transferable, general exploration strategies from text-based decision-making tasks.
  • Eliminates the need for known optimal algorithms (such as UCB) to generate training data; the model's own diverse sampling suffices.
  • The coefficient of variation serves as an intuitive and effective metric for learning potential.
  • The task designs exhibit good diversity, spanning reasoning, search, and planning strategies.
  • The connection established with meta-RL builds a theoretical framework for LLM agent training.

Limitations & Future Work

  • Data generation remains the main bottleneck (requiring more resources than model updates).
  • Negative transfer is observed in Wordle, indicating that not all strategies are transferable across tasks.
  • Experiments are conducted only on 8B and 12B models; the effectiveness on larger or smaller models remains unexplored.
  • The environments partially rely on LLMs to generate feedback, which may introduce noise.
  • Comparison with online RL methods is missing (only offline DPO is utilized).

Rating

⭐⭐⭐⭐: An exciting research direction, demonstrating that general decision-making capabilities can be acquired through synthetic data training and transferred zero-shot. However, the bottleneck lies in data generation efficiency.