
OPERA: A Reinforcement Learning-Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

Conference: AAAI 2026 · arXiv: 2508.16438 · Code: Ameame1/OPERA · Area: Information Retrieval · Keywords: RAG, multi-hop retrieval, reinforcement learning, GRPO, multi-agent

TL;DR

This paper proposes OPERA, a hierarchical framework comprising a Goal Planning Module and a Reason-Execute Module, combined with MAPGRPO—a training algorithm specifically designed for multi-agent settings—to substantially improve performance on reasoning-oriented multi-hop retrieval tasks.

Background & Motivation

State of the Field

Existing RAG systems perform poorly on complex multi-hop questions; the primary bottleneck is the weak coupling between retrieval and reasoning.

Limitations of Prior Work

Static planning approaches (e.g., PlanRAG) cannot dynamically adapt to new information encountered during retrieval, and existing RL-based methods (e.g., BGM) only optimize the gap between the retriever and the LLM, without fine-grained credit assignment at the agent level.

Root Cause

Assigning both planning and execution responsibilities to a single LLM limits overall reliability.

Solution Direction

Decouple planning from execution into specialized agents, and train each agent with a role-specific reward so that credit can be assigned at the agent level.

Paper Goals

How can retrieval and reasoning be deeply coupled within a RAG framework, enabling effective plan decomposition, adaptive retrieval, and precise filtering for complex multi-hop questions?

Method

Overall Architecture

OPERA decouples the pipeline into two layers plus a shared memory component; a minimal control-flow sketch follows the list.

1. Goal Planning Module (GPM): a Plan Agent decomposes the complex question into sub-goals \(\mathcal{P}=\{p_1,\dots,p_m\}\), with dependencies among sub-goals expressed via placeholders.
2. Reason-Execute Module (REM): an Analysis-Answer Agent performs the information-sufficiency judgment \(\phi \in \{0,1\}\) and extracts answers; a Rewrite Agent rewrites the query when information is insufficient, improving subsequent retrieval.
3. Trajectory Memory Component (TMC): records all operational trajectories to enhance interpretability.
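The following is a hedged sketch of how the two modules and the TMC could interact at inference time; the callables `plan_agent`, `analysis_answer_agent`, `rewrite_agent`, and `retriever`, as well as the `{ans_i}` placeholder scheme, are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of OPERA's planner-executor loop; agent callables
# and the placeholder scheme are assumptions, not the paper's interface.

def opera_answer(question, plan_agent, retriever,
                 analysis_answer_agent, rewrite_agent, max_rewrites=3):
    trajectory = []                      # TMC: log every step for auditability
    sub_goals = plan_agent(question)     # GPM: decompose into p_1 ... p_m;
                                         # later goals may contain {ans_i} slots
    resolved = {}

    for i, goal in enumerate(sub_goals):
        query = goal.format(**resolved)  # fill placeholders from earlier answers
        for attempt in range(max_rewrites + 1):
            docs = retriever(query)
            # REM: sufficiency judgment (phi in {0, 1}) plus answer extraction
            sufficient, answer = analysis_answer_agent(query, docs)
            trajectory.append((i, attempt, query, sufficient, answer))
            if sufficient:
                resolved[f"ans_{i}"] = answer
                break
            # phi = 0: the Rewrite Agent is activated on demand
            query = rewrite_agent(query, docs)

    final = resolved.get(f"ans_{len(sub_goals) - 1}")
    return final, trajectory
```

Note how the Rewrite Agent only runs when \(\phi = 0\), which is the conditional activation discussed under limitations below.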

Key Designs: MAPGRPO

GRPO is extended into Multi-Agent Progressive Group Relative Policy Optimization (MAPGRPO); a sketch of the group-advantage computation follows the list.

  • The three agents are trained sequentially, each with its own heterogeneous reward function.
  • Plan Agent reward: \(r_{\text{plan}} = \lambda_1 f_{\text{logic}} + \lambda_2 f_{\text{struct}} + \lambda_3 f_{\text{exec}}\)
  • Analysis-Answer Agent reward: \(r_{\text{ana}} = \alpha \cdot \mathbb{I}[\phi=\phi^*] + \beta \cdot \text{EM}(a_i,a_i^*) + \gamma \cdot f_{\text{format}}\)
  • Rewrite Agent reward: \(r_{\text{rew}} = \omega_1 \sqrt{\text{NDCG@}k} + \omega_2 f_{\text{format}}\)
  • High-score sample selection: a pre-scored high-quality sample \(c_{\text{best}}\) is injected into each group to mitigate reward sparsity and reduce policy-gradient variance.
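A minimal sketch of the group-relative advantage with the high-score sample injected, using the Analysis-Answer reward as a worked example; the weight values \(\alpha, \beta, \gamma\) below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def analysis_answer_reward(phi, phi_star, em, format_ok,
                           alpha=0.5, beta=0.4, gamma=0.1):
    """r_ana = alpha * I[phi == phi*] + beta * EM + gamma * f_format.
    The weights alpha/beta/gamma are illustrative, not the paper's values."""
    return alpha * float(phi == phi_star) + beta * em + gamma * float(format_ok)

def group_advantages(sampled_rewards, r_best):
    """GRPO-style group-relative advantages. The pre-scored best sample
    c_best (reward r_best) is appended to the sampled group, so every
    group contains at least one high-reward completion; this densifies
    the learning signal and lowers policy-gradient variance."""
    group = np.append(np.asarray(sampled_rewards, dtype=float), r_best)
    return (group - group.mean()) / (group.std() + 1e-8)

# Example: even a group where all sampled rollouts failed still yields a
# non-degenerate, low-variance advantage signal once c_best is injected.
adv = group_advantages([0.0, 0.0, 0.1, 0.0], r_best=0.95)
```

Without the injected sample, an all-zero reward group would produce zero (or noise-dominated) advantages; the injection guarantees a usable baseline in every group.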

Key Experimental Results

Main Results

Method               HotpotQA EM   2WikiMHQA EM   MuSiQue EM
Adaptive-RAG (SFT)   45.7%         30.1%          24.3%
BGM (RL)             41.5%         44.3%          19.6%
OPERA (MAPGRPO)      57.3%         60.2%          39.7%
  • Relative improvement over the best baseline on MuSiQue: 63.4%.
  • Ablation: removing the Plan Agent drops MuSiQue EM from 39.7% to 17.1%, below the training-free CoT baseline of 21.2%.
  • Out-of-domain: on NQ (single-hop), MAPGRPO achieves 36.6% EM, whereas SFT degrades to 19.5%.

Highlights & Insights

  • Architecture design outweighs training methodology: the collapse in performance upon removing the Plan Agent confirms that the hierarchical architecture is the core contribution.
  • The expert injection strategy in MAPGRPO significantly reduces policy gradient variance.
  • The framework generalizes to out-of-domain tasks (single-hop QA), demonstrating that RL training does not overfit to fixed reasoning patterns.

Limitations & Future Work

  • Training of the Rewrite Agent is unstable due to reward sparsity caused by conditional activation.
  • Theoretical guarantees cover only local convergence; MuSiQue EM remains below 40%.
  • Inference latency is relatively high, with notable fluctuations in Analysis-Answer Agent latency.
  • Effectiveness on larger LLMs (>7B parameters) has not yet been validated.
  • The framework still struggles with ambiguous decompositions and long reasoning chains.

Comparison with Related Methods

Dimension          OPERA                     Adaptive-RAG               ReAct                   BGM
Planning           Dynamic sub-goals         Complexity-based routing   No explicit plan        None
Retrieval          Rewrite Agent adaptive    Fixed strategy             Reasoning-action loop   RL bridging
Training           MAPGRPO (role-specific)   SFT                        No training             GRPO
Interpretability   TMC trajectory logging    None                       Partial                 None

Takeaways

  • A multi-agent hierarchical architecture with role-specific rewards represents a promising design paradigm for RAG systems.
  • The sequential training and high-score sample injection strategy of MAPGRPO can be generalized to other multi-agent RL scenarios.
  • The TMC design philosophy can be leveraged to improve the auditability and debuggability of AI systems.
  • The finding that "architectural contributions outweigh training contributions" merits consideration in the design of other complex systems.
  • The conditional activation mechanism of the Rewrite Agent embodies the principle of on-demand allocation of computational resources.

Rating

  • Novelty: ⭐⭐⭐⭐ — The multi-agent architecture and MAPGRPO training method are original contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluated on three mainstream benchmarks with ablation and OOD experiments.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with reasonably complete theoretical analysis.
  • Value: ⭐⭐⭐⭐ — Provides practical guidance for the design of complex RAG systems.