# Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning
Conference: ACL 2026 · arXiv: 2601.13115 · Code: None · Area: Conversational Search / LLM Agent · Keywords: conversational search, reinforcement learning, contextualized reasoning, mixed-initiative behavior, information gain reward
## TL;DR
This paper proposes ConvAgent, which trains a conversational search agent to alternate between retrieval and reasoning across multi-turn interactions by decomposing the RL training reward into three complementary components: outcome reward, information gain reward, and mixed-initiative action reward.
## Background & Motivation
Background: LLMs are becoming the primary interface for human-computer interaction; however, in multi-turn conversational search, user intent evolves across turns, requiring dynamic coordination between retrieval and generation.
Limitations of Prior Work: (1) Traditional methods adopt static "rewrite→retrieve→generate" pipelines with independently optimized modules, precluding joint optimization. (2) Emerging deep search agents (e.g., Search-R1) enable joint optimization of retrieval and generation but target only single-turn scenarios, lacking multi-turn conversational capabilities. (3) Existing methods neglect mixed-initiative behaviors, such as posing clarification questions at appropriate moments.
Key Challenge: Multi-turn conversational search simultaneously demands contextual understanding (decontextualization), search optimization (retrieval quality), and action decision-making (when to answer, clarify, or abstain), yet no existing method jointly optimizes all three dimensions.
Goal: Jointly optimize contextual understanding, retrieval quality, and action decision-making within a unified agent framework through contextualized reasoning.
Key Insight: The total reward is decomposed into three complementary components, and the agent is trained via the GRPO algorithm to alternately perform retrieval and reasoning across turns.
Core Idea: Intermediate process rewards (information gain + mixed-initiative behavior) compensate for the sparse supervision of outcome-only rewards, enabling the model to learn more strategic search and interaction behaviors.
## Method
### Overall Architecture
At each conversational turn, ConvAgent receives the dialogue history \(\mathcal{H}_n\) and the current query \(q_n\), generates search queries through reasoning, invokes a retriever, analyzes the results, decides on an action (answer / clarify / abstain), and produces a response. The entire trajectory is optimized via GRPO.
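A minimal sketch of this per-turn control flow, assuming hypothetical `llm` and `retriever` objects (`reason`, `decide_action`, `search`, and the other method names are illustrative, not the paper's API):

```python
# Minimal sketch of ConvAgent's per-turn loop. The `llm` and `retriever`
# objects and all method names are illustrative assumptions.

def conv_agent_turn(history, query, llm, retriever, k=5):
    # Contextualized reasoning: decontextualize the query given the history.
    rewritten_query = llm.reason(history, query)

    # Invoke the retriever on the rewritten query.
    passages = retriever.search(rewritten_query, k=k)

    # Analyze the results and decide on an action: answer / clarify / abstain.
    action = llm.decide_action(history, query, passages)

    if action == "<clarify>":
        # Pose a clarification question; its outcome feeds the next turn.
        response = llm.ask_clarification(history, query)
    elif action == "<noanswer>":
        # Abstain when the retrieved evidence is insufficient.
        response = "I cannot answer this reliably with the retrieved evidence."
    else:
        response = llm.answer(history, query, passages)
    return action, response
```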
### Key Designs
- **Information Gain Reward:**
  - Function: Optimizes search query quality and the utilization of retrieved results.
  - Mechanism: Measures the information overlap between retrieved passages and the ground-truth answer; F1-score is used for long answers and substring-match accuracy for short answers (see the sketch after this list): \(\mathcal{R}_{IG} = \mathcal{S}_{Info}(\{P_n\}_1^k, a_n^*)\)
  - Design Motivation: Final-answer-only rewards are too sparse; intermediate retrieval-quality signals help the model learn better query rewriting strategies.
- **Mixed-Initiative Action Reward:**
  - Function: Trains the model to adopt the appropriate action (answer / clarify / abstain) at the right moment.
  - Mechanism: Action decision-making is framed as a classification task. Rewards or penalties are assigned by detecting whether the generated sequence contains the correct action label (e.g., `<clarify>`, `<noanswer>`): +1 for correct, −0.5 for incorrect (also sketched below).
  - Design Motivation: In conversational settings, not every turn requires a direct answer; sometimes user queries are ambiguous and warrant clarification, while at other times insufficient evidence justifies abstention.
- **Utilization Mechanism for Clarification Outcomes:**
  - Function: Leverages model-generated clarification questions to improve downstream task performance.
  - Mechanism: The clarification question \(q_n^c\) is concatenated with the rewritten query \(q_n'\) for retrieval and also substitutes for the original query during final answer generation.
  - Design Motivation: Clarification should not merely be evaluated on whether a question was asked, but on whether asking it yields downstream benefit.
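As referenced above, here is a sketch of how the two process rewards could be computed. The token-level F1, the 5-token short-answer cutoff, and the exact label strings are my assumptions based on the description, not released code (the paper lists none):

```python
from collections import Counter

def token_f1(text: str, answer: str) -> float:
    """Token-level F1 overlap between retrieved text and the gold answer."""
    text_tokens = text.lower().split()
    answer_tokens = answer.lower().split()
    common = Counter(text_tokens) & Counter(answer_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(text_tokens)
    recall = overlap / len(answer_tokens)
    return 2 * precision * recall / (precision + recall)

def information_gain_reward(passages: list[str], gold_answer: str) -> float:
    """R_IG = S_Info({P_n}_1^k, a_n*): substring-match accuracy for short
    answers, token F1 for long ones (the 5-token cutoff is assumed)."""
    text = " ".join(passages)
    if len(gold_answer.split()) <= 5:
        return float(gold_answer.lower() in text.lower())
    return token_f1(text, gold_answer)

def mixed_initiative_reward(generation: str, gold_action: str) -> float:
    """R_MIA: +1 if the generated sequence contains the correct action
    label (e.g., "<clarify>", "<noanswer>"), -0.5 otherwise."""
    return 1.0 if gold_action in generation else -0.5
```

Under this reading, the clarification-utilization mechanism then amounts to retrieving on the next step with the concatenation of the rewritten query and the clarification question rather than the original query alone.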
### Loss & Training
The total reward is \(\mathcal{R}(\tau) = \mathcal{R}_{outcome} + 0.5 \times (\mathcal{R}_{IG} + \mathcal{R}_{MIA})\). Optimization is performed using the GRPO algorithm (Group Relative Policy Optimization), which requires neither an explicit reward model nor a value model. PPO is also explored as an alternative.
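Concretely, the total reward and GRPO's group-relative advantage look like this (standard mean/std normalization over a group of rollouts for the same query; the group size and reward values below are illustrative):

```python
import statistics

def total_reward(r_outcome: float, r_ig: float, r_mia: float) -> float:
    # R(tau) = R_outcome + 0.5 * (R_IG + R_MIA)
    return r_outcome + 0.5 * (r_ig + r_mia)

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO scores each trajectory against the other rollouts sampled for
    the same query, so no separate value or reward model is needed."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: four rollouts for one conversational query.
rewards = [total_reward(1.0, 0.6, 1.0), total_reward(0.0, 0.2, -0.5),
           total_reward(1.0, 0.4, 1.0), total_reward(0.0, 0.1, 1.0)]
print(grpo_advantages(rewards))  # higher-reward rollouts get positive advantage
```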
## Key Experimental Results
### Main Results
| Method | TopiOCQA F1 | INSCIT F1 | QReCC F1 | CORAL F1 |
|---|---|---|---|---|
| SFT-3b | 18.2 | 23.7 | 17.0 | 15.2 |
| Search-R1-3b | 26.1 | 5.8 | 5.9 | 3.9 |
| ConvAgent-3b | 25.2 | 23.5 | 24.1 | 22.4 |
| SFT-7b | 23.6 | 24.5 | 19.1 | 18.8 |
| Search-R1-7b | 37.0 | 9.1 | 8.6 | 3.8 |
| ConvAgent-7b | - | - | - | - |
### Ablation Study
| Configuration | Observed Effect | Note |
|---|---|---|
| Remove IG reward | F1 drops | The intermediate retrieval signal is critical for search quality |
| Remove MIA reward | Mixed-initiative behavior degrades | Action adaptation is important for conversational quality |
| PPO vs. GRPO | GRPO more stable | GRPO is simpler without requiring an additional reward model |
### Key Findings
- Search-R1 exhibits unstable performance in conversational settings—strong on TopiOCQA but collapses on the other three datasets—demonstrating that single-turn agents do not generalize to multi-turn scenarios.
- ConvAgent achieves consistent performance across all four datasets, validating the importance of intermediate rewards.
- The information gain reward effectively improves query rewriting quality, even without ground-truth rewritten queries as supervision.
## Highlights & Insights
- The reward decomposition strategy elegantly addresses the sparse reward problem in RL training without requiring manually annotated intermediate-step supervision.
- The information gain reward is particularly clever: retrieval-answer overlap serves as a proxy signal for query quality without requiring annotated rewrites.
- The incorporation of mixed-initiative behavior brings conversational agents closer to real user experiences by enabling them to know when to ask and when to answer.
## Limitations & Future Work
- Validation is currently limited to 3B and 7B models; performance on larger models remains untested.
- Mixed-initiative behavior covers only three action types, whereas real-world dialogues involve a richer behavioral repertoire.
- The quality of user simulation may affect training outcomes.
- Future work could extend the framework to multimodal conversational search and more complex interaction patterns.
## Related Work & Insights
- vs. Search-R1: Extends single-turn deep search to multi-turn dialogue, addressing multi-turn challenges through history-conditioned queries and intermediate rewards.
- vs. ChatR1: ChatR1 relies on ground-truth rewritten queries as training signals, whereas ConvAgent's IG reward eliminates this requirement.
- vs. Traditional Conversational Search: Unifies separate rewriting, retrieval, and generation modules into a single agent optimized end-to-end via RL.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of reward decomposition and mixed-initiative behavior constitutes a novel contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets, multiple baselines, and ablation analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear; method description is systematic.
- Value: ⭐⭐⭐⭐ Offers practical guidance for the development of conversational AI assistants.