# Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning
Conference: ACL 2026 · arXiv: 2601.13115 · Code: None · Area: Conversational Search / LLM Agent · Keywords: conversational search, reinforcement learning, contextualized reasoning, mixed-initiative behavior, information gain reward
## TL;DR
This paper proposes ConvAgent, which trains a conversational search agent to alternate between retrieval and reasoning across multi-turn interactions by decomposing the RL training reward into three complementary components: outcome reward, information gain reward, and mixed-initiative action reward.
## Background & Motivation
Background: LLMs are becoming the primary interface for human-computer interaction; however, in multi-turn conversational search, user intent evolves across turns, requiring dynamic coordination between retrieval and generation.
Limitations of Prior Work: (1) Traditional methods adopt static "rewrite→retrieve→generate" pipelines with independently optimized modules, precluding joint optimization. (2) Emerging deep search agents (e.g., Search-R1) enable joint optimization of retrieval and generation but target only single-turn scenarios, lacking multi-turn conversational capabilities. (3) Existing methods neglect mixed-initiative behaviors, such as posing clarification questions at appropriate moments.
Key Challenge: Multi-turn conversational search simultaneously demands contextual understanding (decontextualization), search optimization (retrieval quality), and action decision-making (when to answer, clarify, or abstain), yet no existing method jointly optimizes all three dimensions.
Goal: Jointly optimize contextual understanding, retrieval quality, and action decision-making within a unified agent framework through contextualized reasoning.
Key Insight: The total reward is decomposed into three complementary components, and the agent is trained via the GRPO algorithm to alternately perform retrieval and reasoning across turns.
Core Idea: Intermediate process rewards (information gain + mixed-initiative behavior) compensate for the sparse supervision of outcome-only rewards, enabling the model to learn more strategic search and interaction behaviors.
## Method
### Overall Architecture
At each conversational turn, ConvAgent receives the dialogue history \(\mathcal{H}_n\) and the current query \(q_n\), generates search queries through reasoning, invokes a retriever, analyzes the results, decides on an action (answer / clarify / abstain), and produces a response. The entire trajectory is optimized via GRPO.
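A minimal sketch of this per-turn control flow, assuming hypothetical `llm` and `retriever` objects (`reason`, `decide_action`, `search`, and the other method names are illustrative, not the paper's API):

```python
# Minimal sketch of ConvAgent's per-turn loop. The `llm` and `retriever`
# objects and all method names are illustrative assumptions.

def conv_agent_turn(history, query, llm, retriever, k=5):
    # Contextualized reasoning: decontextualize the query given the history.
    rewritten_query = llm.reason(history, query)

    # Invoke the retriever on the rewritten query.
    passages = retriever.search(rewritten_query, k=k)

    # Analyze the results and decide on an action: answer / clarify / abstain.
    action = llm.decide_action(history, query, passages)

    if action == "<clarify>":
        # Pose a clarification question; its outcome feeds the next turn.
        response = llm.ask_clarification(history, query)
    elif action == "<noanswer>":
        # Abstain when the retrieved evidence is insufficient.
        response = "I cannot answer this reliably with the retrieved evidence."
    else:
        response = llm.answer(history, query, passages)
    return action, response
```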
### Key Designs
- **Information Gain Reward:**
  - Function: Optimizes search query quality and the utilization of retrieved results.
  - Mechanism: Measures the information overlap between retrieved passages and the ground-truth answer; F1-score is used for long answers and substring-match accuracy for short answers (see the sketch after this list): \(\mathcal{R}_{IG} = \mathcal{S}_{Info}(\{P_n\}_1^k, a_n^*)\)
  - Design Motivation: Final-answer-only rewards are too sparse; intermediate retrieval-quality signals help the model learn better query rewriting strategies.
- **Mixed-Initiative Action Reward:**
  - Function: Trains the model to adopt the appropriate action (answer / clarify / abstain) at the right moment.
  - Mechanism: Action decision-making is framed as a classification task. Rewards or penalties are assigned by detecting whether the generated sequence contains the correct action label (e.g., `<clarify>`, `<noanswer>`): +1 for correct, −0.5 for incorrect (also sketched below).
  - Design Motivation: In conversational settings, not every turn requires a direct answer; sometimes user queries are ambiguous and warrant clarification, while at other times insufficient evidence justifies abstention.
- **Utilization Mechanism for Clarification Outcomes:**
  - Function: Leverages model-generated clarification questions to improve downstream task performance.
  - Mechanism: The clarification question \(q_n^c\) is concatenated with the rewritten query \(q_n'\) for retrieval and also substitutes for the original query during final answer generation.
  - Design Motivation: Clarification should not merely be evaluated on whether a question was asked, but on whether asking it yields downstream benefit.
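As referenced above, here is a sketch of how the two process rewards could be computed. The token-level F1, the 5-token short-answer cutoff, and the exact label strings are my assumptions based on the description, not released code (the paper lists none):

```python
from collections import Counter

def token_f1(text: str, answer: str) -> float:
    """Token-level F1 overlap between retrieved text and the gold answer."""
    text_tokens = text.lower().split()
    answer_tokens = answer.lower().split()
    common = Counter(text_tokens) & Counter(answer_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(text_tokens)
    recall = overlap / len(answer_tokens)
    return 2 * precision * recall / (precision + recall)

def information_gain_reward(passages: list[str], gold_answer: str) -> float:
    """R_IG = S_Info({P_n}_1^k, a_n*): substring-match accuracy for short
    answers, token F1 for long ones (the 5-token cutoff is assumed)."""
    text = " ".join(passages)
    if len(gold_answer.split()) <= 5:
        return float(gold_answer.lower() in text.lower())
    return token_f1(text, gold_answer)

def mixed_initiative_reward(generation: str, gold_action: str) -> float:
    """R_MIA: +1 if the generated sequence contains the correct action
    label (e.g., "<clarify>", "<noanswer>"), -0.5 otherwise."""
    return 1.0 if gold_action in generation else -0.5
```

Under this reading, the clarification-utilization mechanism then amounts to retrieving on the next step with the concatenation of the rewritten query and the clarification question rather than the original query alone.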
### Loss & Training
The total reward is \(\mathcal{R}(\tau) = \mathcal{R}_{outcome} + 0.5 \times (\mathcal{R}_{IG} + \mathcal{R}_{MIA})\). Optimization is performed using the GRPO algorithm (Group Relative Policy Optimization), which requires neither an explicit reward model nor a value model. PPO is also explored as an alternative.
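Concretely, the total reward and GRPO's group-relative advantage look like this (standard mean/std normalization over a group of rollouts for the same query; the group size and reward values below are illustrative):

```python
import statistics

def total_reward(r_outcome: float, r_ig: float, r_mia: float) -> float:
    # R(tau) = R_outcome + 0.5 * (R_IG + R_MIA)
    return r_outcome + 0.5 * (r_ig + r_mia)

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO scores each trajectory against the other rollouts sampled for
    the same query, so no separate value or reward model is needed."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: four rollouts for one conversational query.
rewards = [total_reward(1.0, 0.6, 1.0), total_reward(0.0, 0.2, -0.5),
           total_reward(1.0, 0.4, 1.0), total_reward(0.0, 0.1, 1.0)]
print(grpo_advantages(rewards))  # higher-reward rollouts get positive advantage
```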
## Key Experimental Results
### Main Results
| Method | TopiOCQA F1 | INSCIT F1 | QReCC F1 | CORAL F1 |
|---|---|---|---|---|
| SFT-3b | 18.2 | 23.7 | 17.0 | 15.2 |
| Search-R1-3b | 26.1 | 5.8 | 5.9 | 3.9 |
| ConvAgent-3b | 25.2 | 23.5 | 24.1 | 22.4 |
| SFT-7b | 23.6 | 24.5 | 19.1 | 18.8 |
| Search-R1-7b | 37.0 | 9.1 | 8.6 | 3.8 |
| ConvAgent-7b | - | - | - | - |
### Ablation Study
| Configuration | Observed Effect | Note |
|---|---|---|
| Remove IG reward | F1 drops | The intermediate retrieval signal is critical for search quality |
| Remove MIA reward | Mixed-initiative behavior degrades | Action adaptation is important for conversational quality |
| PPO vs. GRPO | GRPO more stable | GRPO is simpler without requiring an additional reward model |
### Key Findings
- Search-R1 exhibits unstable performance in conversational settings—strong on TopiOCQA but collapses on the other three datasets—demonstrating that single-turn agents do not generalize to multi-turn scenarios.
- ConvAgent achieves consistent performance across all four datasets, validating the importance of intermediate rewards.
- The information gain reward effectively improves query rewriting quality, even without ground-truth rewritten queries as supervision.
## Highlights & Insights
- The reward decomposition strategy elegantly addresses the sparse reward problem in RL training without requiring manually annotated intermediate-step supervision.
- The information gain reward is particularly clever: retrieval-answer overlap serves as a proxy signal for query quality without requiring annotated rewrites.
- The incorporation of mixed-initiative behavior brings conversational agents closer to real user experiences by enabling them to know when to ask and when to answer.
## Limitations & Future Work
- Validation is currently limited to 3B and 7B models; performance on larger models remains untested.
- Mixed-initiative behavior covers only three action types, whereas real-world dialogues involve a richer behavioral repertoire.
- The quality of user simulation may affect training outcomes.
- Future work could extend the framework to multimodal conversational search and more complex interaction patterns.
## Related Work & Insights
- vs. Search-R1: Extends single-turn deep search to multi-turn dialogue, addressing multi-turn challenges through history-conditioned queries and intermediate rewards.
- vs. ChatR1: ChatR1 relies on ground-truth rewritten queries as training signals, whereas ConvAgent's IG reward eliminates this requirement.
- vs. Traditional Conversational Search: Unifies separate rewriting, retrieval, and generation modules into a single agent optimized end-to-end via RL.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of reward decomposition and mixed-initiative behavior constitutes a novel contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets, multiple baselines, and ablation analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear; method description is systematic.
- Value: ⭐⭐⭐⭐ Offers practical guidance for the development of conversational AI assistants.