Training a Generally Curious Agent (Paprika)
Conference: ICML 2025
arXiv: 2502.17543
Code: -
Area: Model Compression
Keywords: in-context RL, curious agent, curriculum learning, sequential decision making, Paprika
TL;DR
Proposes the Paprika framework, which fine-tunes LLMs on diverse text-based decision-making tasks, enabling the model to learn general information-gathering and decision-making capabilities and transfer zero-shot to completely unseen tasks.
Background & Motivation
- LLMs acting as autonomous agents must interact with environments and collect information to achieve goals.
- Direct deployment in the real world to collect data is highly costly and risky.
- Synthetic data generation cannot cover all tasks, but the in-context learning capability of LLMs supports learning generalized policies from a small set of tasks.
- Mechanism: Instead of training the model to solve all tasks, train the model to learn the general process of performing tasks (i.e., in-context RL).
- This is analogous to the SFT/RLHF phases where only a small number of examples are needed to yield models capable of generalizing to diverse queries.
Method
1. Task Design (10 Task Groups)
All tasks are text-based, multi-turn, and partially observable:
| Task Group | Number of Training Tasks | Max Turns | Environment Feedback |
|---|---|---|---|
| 20 Questions | 1499 | 20 | LLM-generated |
| Guess My City | 500 | 20 | LLM-generated |
| Wordle | 1515 | 6 | Hardcoded |
| Cellular Automata | 1000 | 6 | Hardcoded |
| Customer Service | 628 | 20 | LLM-generated |
| Murder Mystery | 203 | 20 | LLM-generated |
| Mastermind | 1000 | 12 | Hardcoded |
| Battleship | 1000 | 20 | Hardcoded |
| Minesweeper | 1000 | 20 | Hardcoded |
| Bandit Best Arm | 81 | 21 | Hardcoded |
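To make "Hardcoded" environment feedback concrete, here is a minimal sketch of a rule-based Wordle feedback function. The function name `wordle_feedback` and the `G`/`Y`/`B` encoding are hypothetical choices for illustration; the textual feedback format used by the actual environment is not described here.

```python
from collections import Counter


def wordle_feedback(guess: str, secret: str) -> str:
    """Return per-letter feedback: 'G' = right letter, right position,
    'Y' = letter present elsewhere, 'B' = letter absent.
    (Hypothetical encoding, chosen only for this illustration.)"""
    feedback = ["B"] * len(guess)
    remaining = Counter()  # secret letters not matched exactly

    # First pass: mark exact matches and count the unmatched secret letters.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "G"
        else:
            remaining[s] += 1

    # Second pass: mark misplaced letters, respecting duplicate counts.
    for i, g in enumerate(guess):
        if feedback[i] == "B" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1

    return "".join(feedback)
```

The agent only ever sees this feedback, never the secret word, which is what makes the task partially observable; an episode would end after a correct guess or after the 6-turn limit listed in the table.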
2. Dataset Construction
- Diverse trajectories are generated with min-p sampling (temperature 1.5, min-p parameter 0.3).
- For each task, \(n_\text{sample}=20\) trajectories are sampled.
- Preference pairs \((h_w, h_l)\) are constructed: \(h_w\) is the highest-scoring trajectory, while \(h_l\) is randomly sampled from low-scoring trajectories.
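The construction steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `min_p_sample` follows the usual definition of min-p sampling (keep tokens whose probability is at least `min_p` times the top probability), applying temperature before the filter is an assumption, and treating every trajectory that scores strictly below the best one as "low-scoring" is also an assumption. The `Trajectory` class and both function names are hypothetical.

```python
import math
import random
from dataclasses import dataclass


def min_p_sample(logits, temperature=1.5, min_p=0.3, rng=random):
    """Sample one token index with min-p filtering.
    Assumption: temperature is applied before the min-p filter."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    indices, weights = zip(*kept)
    # random.choices renormalizes the surviving weights internally.
    return rng.choices(indices, weights=weights, k=1)[0]


@dataclass
class Trajectory:
    text: str     # full multi-turn interaction history
    score: float  # task-specific score, higher is better


def build_preference_pair(trajectories, rng=random):
    """Return (h_w, h_l), or None when no trajectory scores below the best."""
    h_w = max(trajectories, key=lambda t: t.score)
    # Assumption: "low-scoring" means strictly worse than the best trajectory.
    worse = [t for t in trajectories if t.score < h_w.score]
    if not worse:
        return None  # all samples tie, so the task yields no preference pair
    return h_w, rng.choice(worse)
```

A task whose 20 samples all receive the same score produces no pair under this sketch, which is one reason sampling diversity matters for the preference data.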
3. Optimization Objectives
SFT: Maximizing likelihood on winning trajectories:
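Under assumed notation (not taken from the source), let \(a_t^w\) be the agent's action at turn \(t\) of a winning trajectory \(h_w\) and \(h_{<t}^w\) the interaction history preceding it. A standard way to write this objective is then:

\[
\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{h_w \sim \mathcal{D}} \left[ \sum_{t} \log \pi_\theta\!\left(a_t^w \mid h_{<t}^w\right) \right]
\]

Here \(\pi_\theta\) is the model being fine-tuned and \(\mathcal{D}\) is the set of winning trajectories.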
Multi-turn DPO:
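A sketch of the objective in the standard DPO form, summed over turns. The symbols \(a_t\), \(h_{<t}\), \(\beta\), \(\pi_{\text{ref}}\) and the shorthand \(r_\theta\) are assumed notation rather than the source's. Writing the trajectory-level log-ratio as

\[
r_\theta(h) = \sum_{t} \log \frac{\pi_\theta\!\left(a_t \mid h_{<t}\right)}{\pi_{\text{ref}}\!\left(a_t \mid h_{<t}\right)},
\]

the loss over preference pairs \((h_w, h_l)\) would read

\[
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(h_w, h_l) \sim \mathcal{D}} \left[ \log \sigma\!\Big( \beta \big( r_\theta(h_w) - r_\theta(h_l) \big) \Big) \right],
\]

where \(\sigma\) is the logistic function, \(\pi_{\text{ref}}\) is the frozen initial model, and \(\beta\) controls how far the policy may move from it.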
Loss is computed only on the agent's action tokens (excluding environment-generated tokens).
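The token masking can be illustrated with a small sketch that operates on precomputed per-token log-probabilities. Everything here is an illustrative assumption: the function names, the boolean `agent_mask` representation, and the default `beta=0.1` are hypothetical, and a real implementation would work on batched tensors.

```python
import math


def masked_logprob(token_logprobs, agent_mask):
    """Sum log-probabilities over agent-generated tokens only."""
    return sum(lp for lp, is_agent in zip(token_logprobs, agent_mask) if is_agent)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def multi_turn_dpo_loss(pol_w, ref_w, mask_w, pol_l, ref_l, mask_l, beta=0.1):
    """DPO loss for one (winner, loser) trajectory pair.
    pol_* / ref_* are per-token log-probs under the policy / reference model;
    mask_* is True for agent tokens and False for environment tokens."""
    ratio_w = masked_logprob(pol_w, mask_w) - masked_logprob(ref_w, mask_w)
    ratio_l = masked_logprob(pol_l, mask_l) - masked_logprob(ref_l, mask_l)
    margin = beta * (ratio_w - ratio_l)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

Because environment tokens are masked out, the policy is never rewarded or penalized for text it did not generate, such as the feedback returned by the environment.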
RPO: Combines the SFT and DPO objectives to mitigate the "unintentional unalignment" of DPO, where the likelihood of preferred trajectories can decrease during training:
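As a sketch, with \(\alpha\) an assumed weighting coefficient (its value, and whether the SFT term is length-normalized, are not stated here), the combined objective takes the form:

\[
\mathcal{L}_{\text{RPO}}(\theta) = \mathcal{L}_{\text{DPO}}(\theta) + \alpha\, \mathcal{L}_{\text{SFT}}(\theta)
\]

The SFT term keeps the likelihood of winning trajectories from drifting down while the DPO term widens the margin between winners and losers.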
4. Curriculum Learning: Scalable Task Selection
Learning Potential Metric: Measures the learning signal strength of tasks using the coefficient of variation:
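With \(\bar{R}(\tau)\) and \(\sigma^2(\tau)\) denoting the mean and variance of the reward over trajectories sampled from the current policy on task \(\tau\) (the symbols are assumed notation), the coefficient of variation is:

\[
\nu(\tau) = \frac{\sqrt{\sigma^2(\tau)}}{\bar{R}(\tau)}
\]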
High reward variance means a task is likely to yield both good and bad trajectories, which is what gives DPO a usable gradient signal from a preference pair; dividing by the average reward makes the metric comparable across tasks.
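A minimal sketch of the metric for a single task, given the rewards of its sampled trajectories. The function name, the use of the population standard deviation, and returning 0 when the mean reward is (near) zero are assumptions made for this illustration.

```python
import statistics


def learning_potential(rewards, eps=1e-8):
    """Coefficient of variation (std / mean) of a task's sampled rewards.
    Assumes non-negative rewards; returns 0.0 when the mean is near zero,
    since a task that never succeeds offers no preference signal."""
    mean = statistics.fmean(rewards)
    if mean <= eps:
        return 0.0
    return statistics.pstdev(rewards) / mean
```

Under this sketch, a task that always succeeds or always fails scores 0, while a task with mixed outcomes scores higher, matching the intuition that mixed outcomes are what produce informative preference pairs.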
UCB Algorithm for Task Group Selection: Treats each task group as an arm and utilizes UCB to balance exploration and exploitation.
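The arm-selection idea can be sketched with the classic UCB1 rule. This is an illustration under assumptions, not the paper's exact procedure: the class name, the exploration constant `c`, and the use of an estimated learning potential as the arm reward are all hypothetical choices.

```python
import math


class UCBTaskSelector:
    """UCB1 over task groups; each group is one bandit arm."""

    def __init__(self, groups, c=1.0):
        self.groups = list(groups)
        self.c = c  # exploration strength (assumed hyperparameter)
        self.counts = {g: 0 for g in self.groups}
        self.means = {g: 0.0 for g in self.groups}
        self.total = 0

    def select(self):
        # Try every group once before applying the UCB rule.
        for g in self.groups:
            if self.counts[g] == 0:
                return g

        def score(g):
            bonus = self.c * math.sqrt(2.0 * math.log(self.total) / self.counts[g])
            return self.means[g] + bonus

        return max(self.groups, key=score)

    def update(self, group, reward):
        """Record the observed reward (e.g. an estimated learning potential)."""
        self.counts[group] += 1
        self.total += 1
        # Incremental update of the running mean.
        self.means[group] += (reward - self.means[group]) / self.counts[group]
```

In a training loop, one would call `select()`, sample a few tasks from the chosen group, estimate their learning potential, and pass that estimate to `update()`. The exploration bonus ensures that rarely sampled groups are revisited as the policy changes.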
Experimental Results
Main Results: All-Task Training
Paprika improves the average success rate of Llama-3.1-8B-Instruct by 47% (relative gain) across all 10 task groups, using only approximately 22,500 trajectories.
Zero-Shot Transfer (Leave-One-Out)
| Task Group | Baseline | Paprika (LOO) | Paprika (All) | Single-Task Training |
|---|---|---|---|---|
| Bandit | 42.25% | 62.25% | 65.0% | 58.0% |
| 20 Questions | ~25% | ~38% | ~40% | ~35% |
| Murder Mystery | ~15% | ~28% | ~32% | ~30% |
- The LOO model outperforms the initial model on 9 out of 10 task groups.
- On 7 out of 10 task groups, training on all tasks outperforms single-task training.
- This indicates that cross-task policy transfer indeed occurs.
Curriculum Learning Effects
- Curriculum learning improves the average success rate by 1.4% and pass@4 by 3.3% compared to uniform sampling.
- The benefits are mainly observed in moderately difficult tasks.
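Assuming pass@4 carries its usual meaning (at least one success among 4 sampled attempts on a task), it can be estimated per task with the widely used unbiased estimator sketched below. Whether the paper computes it this way is an assumption; `n` (attempts sampled) and `c` (successful attempts) are illustrative parameter names.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without replacement
    from n sampled attempts containing c successes, is a success."""
    if k > n:
        raise ValueError("k must not exceed the number of sampled attempts n")
    # math.comb returns 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

Averaging this quantity over tasks gives a group-level pass@k; with `k = 1` it reduces to the plain success rate `c / n`.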
Highlights & Insights
- Demonstrates that LLMs can learn transferable, general exploration strategies from text-based decision-making tasks.
- Eliminates the need for known optimal algorithms (such as UCB) to generate training data; the model's own diverse sampling suffices.
- The coefficient of variation serves as an intuitive and effective metric for learning potential.
- The task designs exhibit good diversity, spanning reasoning, search, and planning strategies.
- The connection established with meta-RL builds a theoretical framework for LLM agent training.
Limitations & Future Work
- Data generation remains the main bottleneck (requiring more resources than model updates).
- Negative transfer is observed in Wordle, indicating that not all strategies are transferable across tasks.
- Experiments are conducted only on 8B and 12B models; the effectiveness on larger or smaller models remains unexplored.
- The environments partially rely on LLMs to generate feedback, which may introduce noise.
- Comparison with online RL methods is missing; only offline objectives (SFT, DPO, RPO) trained on pre-collected trajectories are used.
Rating
⭐⭐⭐⭐ — An exciting research direction, demonstrating that general decision-making capabilities can be acquired through synthetic data training and transferred zero-shot. However, the bottleneck lies in data generation efficiency.