# Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning
**Conference:** NeurIPS 2025 · **arXiv:** 2501.15228 · **Code:** GitHub · **Area:** Reinforcement Learning / NLP · **Keywords:** RAG, multi-agent reinforcement learning, MAPPO, joint optimization, question answering
## TL;DR
This work models multiple components of a complex RAG pipeline (Query Rewriter, Selector, Generator) as a cooperative multi-agent system and jointly optimizes them via MAPPO, using the F1 score of the final answer as a shared reward. The proposed method outperforms existing single-module optimization approaches on multiple QA benchmarks.
## Background & Motivation

- **Background:** RAG systems augment LLMs with retrieved external knowledge. Modern RAG pipelines consist of multiple components: query rewriting, document retrieval, document selection, and answer generation.
- **Limitations of Prior Work:** Individual components are typically optimized independently via SFT, leading to a misalignment between module-level objectives and the global objective of generating accurate answers. For instance, documents deemed highly relevant under nDCG-optimized retrieval may not contribute to correct answer generation.
- **Key Challenge:** Existing end-to-end optimization methods (e.g., applying PPO or DPO to individual RAG components) either address only simple two-component pipelines or optimize modules in isolation, failing to adequately model the cooperative relationships among multiple components.
- **Goal:** Jointly optimize the parameters of multiple components in a RAG system so that each module's optimization objective is aligned with final answer quality.
- **Key Insight:** Model RAG as a cooperative multi-agent reinforcement learning (Co-MARL) problem and leverage MAPPO for joint multi-agent optimization.
- **Core Idea:** Treat RAG as a cooperative game in which each component serves as an agent, all sharing a global reward based on final-answer F1, and optimize all agents synchronously via MAPPO.
## Method
### Overall Architecture
MMOA-RAG models RAG as a multi-agent system \(\langle \mathcal{G}, \mathcal{O}, \mathcal{A}, \mathcal{R} \rangle\). The four-module pipeline proceeds as: Query Rewriter → Retriever (fixed, not trained) → Selector → Generator. The three trainable agents share a single LLM through parameter sharing, reducing training overhead.
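To make the control flow concrete, here is a minimal Python sketch of one pipeline rollout. The `llm.generate` and `retriever.search` calls, the prompt strings, and the `parse_ids` helper are hypothetical stand-ins for illustration, not the authors' API.

```python
import re

def parse_ids(id_string: str) -> list[int]:
    """Pull document indices out of Selector output such as 'Document 0, 3, 7'."""
    return [int(tok) for tok in re.findall(r"\d+", id_string)]

def rag_pipeline(llm, retriever, question: str) -> str:
    """One rollout through the four-module pipeline. A single shared LLM plays
    all three trainable roles, distinguished only by its prompt."""
    # Agent 1: Query Rewriter decomposes the question into sub-questions.
    sub_questions = llm.generate(f"Rewrite into sub-questions: {question}")

    # Fixed component: the retriever is frozen and receives no gradient.
    candidate_docs = retriever.search(sub_questions, top_k=10)

    # Agent 2: Selector emits document IDs rather than free text.
    id_string = llm.generate(
        f"Question: {question}\nCandidates: {candidate_docs}\nSelect document IDs:")
    selected_docs = [candidate_docs[i] for i in parse_ids(id_string)]

    # Agent 3: Generator answers from the selected evidence only.
    return llm.generate(f"Question: {question}\nEvidence: {selected_docs}\nAnswer:")
```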
### Key Designs
- **Multi-Agent Modeling (Co-MARL):**
    - Function: Treats each RAG component as an RL agent sharing a global reward.
    - Design Motivation: Independent module optimization leads to objective misalignment; the multi-agent framework naturally captures inter-component cooperation.
    - Mechanism: Three agents are defined: the Query Rewriter (QR) receives the question \(q\) and outputs sub-questions \(subq\); the Selector (S) receives \(q\) and candidate documents \(D\) and outputs a selected subset of document IDs \(D_{\text{selected}}\); the Generator (G) receives \(q\) and \(D_{\text{selected}}\) and produces the final answer.
    - Novelty: Unlike Rewrite-Retrieve-Read (which optimizes only the rewriter) or BGM (which optimizes only the bridge module), MMOA-RAG jointly optimizes all three components.
- **Observation / Action / Reward Design per Agent:**
    - Function: Defines precise MDP elements for each agent.
    - Design Motivation: The distinct roles of the agents call for differentiated action spaces and penalty terms.
    - Mechanism:
        - QR's action space is the full vocabulary \(\mathcal{V}\); its reward is \(R_{QR} = R_{\text{shared}} + P_{QR}\), with a penalty of \(-0.5\) when the number of sub-questions exceeds 4.
        - The Selector's action space is restricted to {"0", "1", ..., "K-1", "Document", ","}, substantially reducing the exploration space; format errors or duplicate IDs incur a penalty of \(-1\).
        - The Generator's action space is \(\mathcal{V}\), with a penalty of \(-0.5\) for excessively long outputs.
        - The shared reward \(R_{\text{shared}}\) is the token-level F1 score of the predicted answer (a reward-computation sketch follows this list).
    - Novelty: The constrained action space for the Selector is an elegant design choice: it converts free-text generation into structured ID selection, which significantly improves training stability.
- **MAPPO Joint Optimization:**
    - Function: Jointly updates all agents using the Multi-Agent PPO (MAPPO) algorithm.
    - Design Motivation: In fully cooperative settings, MAPPO's shared global reward fosters inter-agent collaboration, making it better suited than independent PPO.
    - Mechanism: The actor loss adopts the standard PPO clipping objective, extended to multiple agents:

      $$\mathcal{L}_{\text{Actor}}(\theta) = \sum_i \sum_t \min\left(r_t^i\, \hat{A}_{\pi_\theta}^{i,t},\ \operatorname{clip}(r_t^i,\ 1-\epsilon,\ 1+\epsilon)\, \hat{A}_{\pi_\theta}^{i,t}\right)$$

      The final reward includes a KL penalty that prevents deviation from the SFT baseline:

      $$R(s_t^i, a_t^i) = R_i - \beta \log \frac{\pi_\theta(\text{Answer}_i \mid O_i)}{\pi_{\theta_{\text{SFT}}}(\text{Answer}_i \mid O_i)}$$

    - Novelty: All three agents share a single set of LLM parameters (via parameter sharing), so training efficiency is comparable to single-agent PPO.
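A minimal sketch of the reward plumbing, assuming the standard token-level F1 used in QA evaluation. The penalty magnitudes (−0.5, −1, −0.5) follow the paper, but the triggering checks and the `max_answer_tokens` threshold are simplified illustrations rather than the authors' implementation.

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between predicted and gold answers (the shared reward)."""
    pred, gold = prediction.split(), ground_truth.split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def agent_rewards(pred: str, gold: str, sub_questions: list[str],
                  selected_ids: list[int], valid_ids: set[int],
                  answer_len: int, max_answer_tokens: int = 100) -> dict[str, float]:
    """Shared F1 reward plus the per-agent penalties described above."""
    r_shared = f1_score(pred, gold)
    p_qr = -0.5 if len(sub_questions) > 4 else 0.0              # too many sub-questions
    bad_format = (len(selected_ids) != len(set(selected_ids))   # duplicate IDs
                  or not set(selected_ids) <= valid_ids)        # out-of-range IDs
    p_sel = -1.0 if bad_format else 0.0
    p_gen = -0.5 if answer_len > max_answer_tokens else 0.0     # over-long answer
    return {"query_rewriter": r_shared + p_qr,
            "selector": r_shared + p_sel,
            "generator": r_shared + p_gen}
```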
### Loss & Training
- Warm Start (SFT): Each agent is first fine-tuned via SFT to acquire basic instruction-following capability.
- MAPPO Joint Training: Starting from the SFT checkpoint, rollouts are collected by passing sequentially through QR → Retriever → S → G; the shared reward and per-agent penalties are computed, advantages are estimated via GAE, and both the actor and critic are updated (a minimal sketch follows this list).
- Mini-batch parallelism is employed to accelerate training.
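A minimal PyTorch sketch of the update step, assuming per-agent trajectories (log-probabilities, rewards, value estimates) have already been collected; `beta`, `gamma`, `lam`, and `eps` are illustrative defaults, not the paper's reported hyperparameters.

```python
import torch

def kl_penalized_reward(r_agent, logp_policy, logp_sft, beta=0.1):
    """Shape the per-agent reward with a KL-style penalty toward the SFT policy."""
    return r_agent - beta * (logp_policy - logp_sft)

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation over a single trajectory."""
    advantages, running = torch.zeros_like(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def mappo_actor_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO objective summed over agents, negated for minimization.
    Because the agents share one set of LLM parameters, a single backward
    pass through this loss updates every role at once."""
    total = torch.tensor(0.0)
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):  # one entry per agent
        ratio = torch.exp(lp_new - lp_old)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        total = total - torch.min(ratio * adv, clipped * adv).sum()
    return total
```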
## Key Experimental Results
### Main Results (Contriever Retriever + Llama-3-8B-Instruct)
| Method | HotpotQA F1 | 2Wiki F1 | AmbigQA F1 |
|---|---|---|---|
| LLM w/o RAG | 31.18 | 29.47 | 33.42 |
| Vanilla RAG w/o train | 30.67 | 22.84 | 33.56 |
| Vanilla RAG w SFT | 44.49 | 43.36 | 44.36 |
| SELF-RAG | 38.93 | 38.86 | 39.04 |
| RetRobust | 46.49 | 44.51 | 44.78 |
| Rewrite-Retrieve-Read | 46.32 | 44.17 | 45.92 |
| BGM | 44.54 | 43.29 | 45.76 |
| RAG-DDR | 44.26 | 44.18 | 45.83 |
| MMOA-RAG | 48.29 | 46.40 | 48.59 |
| Δ vs. best baseline | +1.80 | +1.89 | +2.67 |
### Ablation Study (Removing Individual Agents from Joint Optimization)
| Configuration | HotpotQA F1 | 2Wiki F1 | AmbigQA F1 |
|---|---|---|---|
| MMOA-RAG (QR+S+G) | 48.29 | 46.40 | 48.59 |
| MMOA-RAG w/o QR | 47.07 | 45.25 | 47.19 |
| MMOA-RAG w/o S | 47.94 | 46.19 | 47.53 |
| MMOA-RAG w/o G | lowest (reported only in a figure) | lowest | lowest |
### Generalization (SFT → MAPPO Gains across RAG Configurations)
| Configuration | HotpotQA F1 (SFT→MAPPO) | 2Wiki F1 (SFT→MAPPO) | AmbigQA F1 (SFT→MAPPO) |
|---|---|---|---|
| QR+S+G | 44.69→48.29 (+3.60) | 42.97→46.40 (+3.43) | 46.71→48.59 (+1.88) |
| S+G | 43.14→47.07 (+3.93) | 42.40→45.25 (+2.85) | 45.82→47.19 (+1.37) |
| QR+G | 45.00→47.94 (+2.94) | 42.91→46.19 (+3.28) | 45.31→47.53 (+2.22) |
### Key Findings
- Joint optimization outperforms isolated optimization: The full three-agent configuration (QR+S+G) achieves the best performance across all datasets.
- Generator is the most critical agent: Removing G from joint optimization causes the largest performance drop, particularly on the single-hop AmbigQA benchmark.
- Selector can be partially substituted by the Generator: The performance drop when removing S is smallest, as the jointly trained Generator develops a degree of noise-filtering capability.
- Multi-hop datasets benefit more: MAPPO yields larger gains on HotpotQA/2Wiki (multi-hop, ~3.5 F1) than on AmbigQA (single-hop, ~1.9 F1), indicating that multi-module cooperation is more critical for complex reasoning.
- MAPPO consistently effective: Across all three configurations (QR+S+G, S+G, QR+G), MAPPO delivers consistent and significant improvements over SFT, demonstrating the generality of the framework.
## Highlights & Insights
- Novel modeling perspective: This work is the first to model a RAG system as a cooperative multi-agent task, offering a new paradigm for end-to-end optimization of complex AI pipelines.
- Selector action space design: Constraining document selection from free-text generation to structured ID output is an engineering insight that substantially reduces the exploration space and training instability (one possible realization via logit masking is sketched after this list).
- Parameter sharing: All three agents share a single LLM (distinguished only by different prompts), bringing training compute overhead close to that of single-agent PPO.
- Penalty term design: Lightweight per-agent penalties (number of sub-questions, format compliance, answer length) constrain output quality without interfering with the primary optimization objective.
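One plausible way to realize the restricted Selector action space is logit masking at decode time. The restricted vocabulary is from the paper; the masking implementation below is our assumption, with `allowed_token_ids` supplied by the tokenizer.

```python
import torch

def mask_to_selector_vocab(logits: torch.Tensor,
                           allowed_token_ids: list[int]) -> torch.Tensor:
    """Restrict next-token sampling to the Selector's small action space
    (digit IDs, the word "Document", and commas) by masking all other logits."""
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_token_ids] = 0.0  # allowed tokens keep their original logits
    return logits + mask

# Usage: probs = torch.softmax(mask_to_selector_vocab(logits, allowed_ids), dim=-1)
```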
## Limitations & Future Work
- The Retriever is frozen and excluded from optimization; joint optimization of the retrieval module could potentially yield further gains.
- Experiments are conducted solely on Llama-3-8B-Instruct; larger-scale models and closed-source LLMs remain untested.
- The shared reward is based exclusively on F1 score, without considering multi-objective optimization targets such as latency or cost.
- The training overhead of MAPPO is not quantitatively compared against other methods.
- More complex RAG workflows, such as DAG-structured pipelines or iterative retrieval calls, are not explored.
## Related Work & Insights
- MAPPO (Yu et al., 2022): A multi-agent PPO algorithm validated on StarCraft II; this paper transfers it to NLP pipelines.
- Rewrite-Retrieve-Read / BGM: Pioneering works that apply PPO to optimize individual RAG modules; this paper extends the paradigm to joint multi-module optimization.
- Search-R1 / R1-Searcher: Concurrent works applying RL to RAG reasoning, but focused on single-agent settings.
- InstructGPT: The source of inspiration for the KL penalty term.
- Broader Implications: Co-MARL modeling offers a general approach to optimizing any multi-module AI system, with potential applicability to multi-agent coding, tool-use pipelines, and beyond.
## Rating
- Novelty: ⭐⭐⭐⭐ Modeling RAG as Co-MARL is a fresh perspective; the Selector action space design is particularly elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, multiple baselines, agent ablations, configuration generalization experiments, and validation across different retrievers.
- Writing Quality: ⭐⭐⭐⭐ Rigorous mathematical formulations, clear architecture diagrams, and precise definitions of agent MDP elements.
- Value: ⭐⭐⭐⭐ Provides a reproducible general framework for optimizing complex RAG systems, with open-source code.