
OPERA: A Reinforcement Learning-Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

Conference: AAAI 2026 · arXiv: 2508.16438 · Code: Ameame1/OPERA · Area: Information Retrieval · Keywords: RAG, multi-hop retrieval, reinforcement learning, GRPO, multi-agent

TL;DR

This paper proposes OPERA, a hierarchical framework comprising a Goal Planning Module and a Reason-Execute Module, combined with MAPGRPO—a training algorithm specifically designed for multi-agent settings—to substantially improve performance on reasoning-oriented multi-hop retrieval tasks.

Background & Motivation

State of the Field

Existing RAG systems perform poorly on complex multi-hop questions; the primary bottleneck is the weak coupling between retrieval and reasoning.

Limitations of Prior Work

Static planning approaches (e.g., PlanRAG) cannot dynamically adapt to new information encountered during retrieval, and existing RL-based methods (e.g., BGM) only optimize the gap between the retriever and the LLM, without fine-grained credit assignment at the agent level.

Root Cause

Assigning both planning and execution responsibilities to a single LLM limits overall reliability.

Solution Direction

Decouple planning from execution into specialized agents, and train each agent with a role-specific reward so that credit can be assigned at the agent level.

Paper Goals

How can retrieval and reasoning be deeply coupled within a RAG framework, enabling effective plan decomposition, adaptive retrieval, and precise filtering for complex multi-hop questions?

Method

Overall Architecture

OPERA decouples the pipeline into two layers plus a shared memory component; a minimal control-flow sketch follows the list.

1. Goal Planning Module (GPM): a Plan Agent decomposes the complex question into sub-goals \(\mathcal{P}=\{p_1,\dots,p_m\}\), with dependencies among sub-goals expressed via placeholders.
2. Reason-Execute Module (REM): an Analysis-Answer Agent performs the information-sufficiency judgment \(\phi \in \{0,1\}\) and extracts answers; a Rewrite Agent rewrites the query when information is insufficient, improving subsequent retrieval.
3. Trajectory Memory Component (TMC): records all operational trajectories to enhance interpretability.
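The following is a hedged sketch of how the two modules and the TMC could interact at inference time; the callables `plan_agent`, `analysis_answer_agent`, `rewrite_agent`, and `retriever`, as well as the `{ans_i}` placeholder scheme, are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of OPERA's planner-executor loop; agent callables
# and the placeholder scheme are assumptions, not the paper's interface.

def opera_answer(question, plan_agent, retriever,
                 analysis_answer_agent, rewrite_agent, max_rewrites=3):
    trajectory = []                      # TMC: log every step for auditability
    sub_goals = plan_agent(question)     # GPM: decompose into p_1 ... p_m;
                                         # later goals may contain {ans_i} slots
    resolved = {}

    for i, goal in enumerate(sub_goals):
        query = goal.format(**resolved)  # fill placeholders from earlier answers
        for attempt in range(max_rewrites + 1):
            docs = retriever(query)
            # REM: sufficiency judgment (phi in {0, 1}) plus answer extraction
            sufficient, answer = analysis_answer_agent(query, docs)
            trajectory.append((i, attempt, query, sufficient, answer))
            if sufficient:
                resolved[f"ans_{i}"] = answer
                break
            # phi = 0: the Rewrite Agent is activated on demand
            query = rewrite_agent(query, docs)

    final = resolved.get(f"ans_{len(sub_goals) - 1}")
    return final, trajectory
```

Note how the Rewrite Agent only runs when \(\phi = 0\), which is the conditional activation discussed under limitations below.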

Key Designs: MAPGRPO

GRPO is extended into Multi-Agent Progressive Group Relative Policy Optimization (MAPGRPO); a sketch of the group-advantage computation follows the list.

  • The three agents are trained sequentially, each with its own heterogeneous reward function.
  • Plan Agent reward: \(r_{\text{plan}} = \lambda_1 f_{\text{logic}} + \lambda_2 f_{\text{struct}} + \lambda_3 f_{\text{exec}}\)
  • Analysis-Answer Agent reward: \(r_{\text{ana}} = \alpha \cdot \mathbb{I}[\phi=\phi^*] + \beta \cdot \text{EM}(a_i,a_i^*) + \gamma \cdot f_{\text{format}}\)
  • Rewrite Agent reward: \(r_{\text{rew}} = \omega_1 \sqrt{\text{NDCG@}k} + \omega_2 f_{\text{format}}\)
  • High-score sample selection: a pre-scored high-quality sample \(c_{\text{best}}\) is injected into each group to mitigate reward sparsity and reduce policy-gradient variance.
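A minimal sketch of the group-relative advantage with the high-score sample injected, using the Analysis-Answer reward as a worked example; the weight values \(\alpha, \beta, \gamma\) below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def analysis_answer_reward(phi, phi_star, em, format_ok,
                           alpha=0.5, beta=0.4, gamma=0.1):
    """r_ana = alpha * I[phi == phi*] + beta * EM + gamma * f_format.
    The weights alpha/beta/gamma are illustrative, not the paper's values."""
    return alpha * float(phi == phi_star) + beta * em + gamma * float(format_ok)

def group_advantages(sampled_rewards, r_best):
    """GRPO-style group-relative advantages. The pre-scored best sample
    c_best (reward r_best) is appended to the sampled group, so every
    group contains at least one high-reward completion; this densifies
    the learning signal and lowers policy-gradient variance."""
    group = np.append(np.asarray(sampled_rewards, dtype=float), r_best)
    return (group - group.mean()) / (group.std() + 1e-8)

# Example: even a group where all sampled rollouts failed still yields a
# non-degenerate, low-variance advantage signal once c_best is injected.
adv = group_advantages([0.0, 0.0, 0.1, 0.0], r_best=0.95)
```

Without the injected sample, an all-zero reward group would produce zero (or noise-dominated) advantages; the injection guarantees a usable baseline in every group.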

Key Experimental Results

Main Results

Method               HotpotQA EM   2WikiMHQA EM   MuSiQue EM
Adaptive-RAG (SFT)   45.7%         30.1%          24.3%
BGM (RL)             41.5%         44.3%          19.6%
OPERA (MAPGRPO)      57.3%         60.2%          39.7%
  • Relative improvement over the best baseline on MuSiQue: 63.4%.
  • Ablation: removing the Plan Agent drops MuSiQue EM from 39.7% to 17.1%, below the training-free CoT baseline of 21.2%.
  • Out-of-domain: on NQ (single-hop), MAPGRPO achieves 36.6% EM, whereas SFT degrades to 19.5%.

Highlights & Insights

  • Architecture design outweighs training methodology: the collapse in performance upon removing the Plan Agent confirms that the hierarchical architecture is the core contribution.
  • The expert injection strategy in MAPGRPO significantly reduces policy gradient variance.
  • The framework generalizes to out-of-domain tasks (single-hop QA), demonstrating that RL training does not overfit to fixed reasoning patterns.

Limitations & Future Work

  • Training of the Rewrite Agent is unstable due to reward sparsity caused by conditional activation.
  • Theoretical guarantees cover only local convergence; MuSiQue EM remains below 40%.
  • Inference latency is relatively high, with notable fluctuations in Analysis-Answer Agent latency.
  • Effectiveness on larger LLMs (>7B parameters) has not yet been validated.
  • The framework still struggles with ambiguous decompositions and long reasoning chains.

Comparison with Related Methods

Dimension          OPERA                     Adaptive-RAG               ReAct                   BGM
Planning           Dynamic sub-goals         Complexity-based routing   No explicit plan        None
Retrieval          Rewrite Agent adaptive    Fixed strategy             Reasoning-action loop   RL bridging
Training           MAPGRPO (role-specific)   SFT                        No training             GRPO
Interpretability   TMC trajectory logging    None                       Partial                 None

Takeaways

  • A multi-agent hierarchical architecture with role-specific rewards represents a promising design paradigm for RAG systems.
  • The sequential training and high-score sample injection strategy of MAPGRPO can be generalized to other multi-agent RL scenarios.
  • The TMC design philosophy can be leveraged to improve the auditability and debuggability of AI systems.
  • The finding that "architectural contributions outweigh training contributions" merits consideration in the design of other complex systems.
  • The conditional activation mechanism of the Rewrite Agent embodies the principle of on-demand allocation of computational resources.

Rating

  • Novelty: ⭐⭐⭐⭐ — The multi-agent architecture and MAPGRPO training method are original contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluated on three mainstream benchmarks with ablation and OOD experiments.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with reasonably complete theoretical analysis.
  • Value: ⭐⭐⭐⭐ — Provides practical guidance for the design of complex RAG systems.