HARPO: Hierarchical Agentic Reasoning for User-Aligned Conversational Recommendation¶
Conference: ACL 2026 arXiv: 2604.10048 Code: https://anonymous.4open.science/r/HARPO-D881 Area: Recommender Systems Keywords: Conversational Recommendation, Agentic Reasoning, Preference Optimization, Tree Search, Recommendation Quality
TL;DR¶
This paper proposes HARPO, a framework that reformulates conversational recommendation as a structured decision-making problem explicitly optimizing for recommendation quality. HARPO integrates four components—hierarchical preference learning, value-network-guided tree search reasoning, virtual tool operation abstraction, and multi-agent refinement—achieving significant improvements over existing methods on three benchmarks: ReDial, INSPIRED, and MUSE.
Background & Motivation¶
State of the Field: Conversational Recommender Systems (CRS) aim to help users discover items matching their preferences through natural language interaction. Recent LLM-based CRS approaches have achieved strong performance on proxy metrics such as Recall@K and BLEU.
Limitations of Prior Work: High proxy metric scores do not necessarily reflect high-quality, user-aligned recommendations. Existing methods primarily optimize intermediate objectives—retrieval accuracy, generation fluency, or tool invocation—rather than recommendation quality itself. For example, the query "something casual for a summer wedding" may be misinterpreted as everyday casual wear rather than occasion-appropriate attire; such responses score well on automatic metrics yet yield low user satisfaction.
Root Cause: There is a fundamental misalignment between CRS training and evaluation objectives (proxy metrics) and actual recommendation quality. Proxy metrics correlate only weakly with user-aligned recommendation quality.
Paper Goals: To model conversational recommendation as a structured decision-making problem that explicitly optimizes recommendation quality, rather than treating quality as a byproduct of response generation.
Starting Point: From a decision-making and reasoning perspective, the authors argue that a system should explicitly reason over multiple candidate recommendation strategies, evaluate their expected quality, and select recommendations based on user-alignment criteria rather than proxy signals.
Core Idea: Decompose recommendation quality into interpretable dimensions (relevance, diversity, satisfaction, and engagement) via hierarchical preference learning, and employ a learned value network to guide tree search reasoning over candidate recommendation paths.
Method¶
Overall Architecture¶
HARPO comprises four components: STAR (Structured Tree-search Agentic Reasoning), CHARM (Contrastive Hierarchical Agentic Reward Modeling), BRIDGE (cross-domain transfer), and MAVEN (Multi-Agent refinement), all sharing a pretrained language model backbone. The input is the dialogue context; the output is a recommendation-bearing response; and the optimization target is recommendation quality \(\mathcal{Q}\) rather than proxy metrics.
Key Designs¶
- STAR: Structured Tree-Search Agentic Reasoning
- Function: Explicitly explores multiple candidate recommendation strategies via tree search and selects the highest-quality path.
- Mechanism: Each reasoning node is represented as \(s=(\mathbf{h}, \tau, \mathbf{v}, d)\), encoding dialogue context, current thought, predicted virtual tool operations, and search depth. A value network decomposes quality into four dimensions (relevance, diversity, satisfaction, engagement), each predicted by a dedicated head and combined via learnable weights. At each node, \(b\) candidate next steps are generated, and beam search selects the optimal path.
- Design Motivation: Unlike direct recommendation generation, tree search allows the system to explore, compare, and refine recommendation decisions before producing the final response.
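The STAR search described above can be sketched as value-guided beam search over reasoning nodes. This is an illustrative reconstruction, not the authors' code: the node fields follow \(s=(\mathbf{h}, \tau, \mathbf{v}, d)\), but the `expand` and `value` functions are stubs standing in for the LLM step generator and the learned value network.

```python
from dataclasses import dataclass

# Sketch of STAR's value-guided beam search; `expand` and `value` are
# illustrative stubs, not the paper's actual generator / value network.

@dataclass
class Node:
    h: str            # dialogue context (plain text here for illustration)
    tau: str          # current thought
    v: tuple          # predicted virtual tool operations
    d: int            # search depth
    score: float = 0.0

def expand(node: Node, b: int) -> list[Node]:
    """Generate b candidate next reasoning steps (stub)."""
    return [Node(node.h, f"{node.tau}->step{i}", node.v, node.d + 1)
            for i in range(b)]

def value(node: Node) -> float:
    """Toy stand-in for the value network that combines the four quality
    heads (relevance, diversity, satisfaction, engagement)."""
    return float(node.d)

def beam_search(root: Node, b: int = 3, beam: int = 2, max_depth: int = 3) -> Node:
    """Expand b children per frontier node, keep the top-`beam` by value,
    and return the best node after max_depth levels."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for n in frontier for child in expand(n, b)]
        for c in candidates:
            c.score = value(c)
        frontier = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam]
    return max(frontier, key=lambda n: n.score)
```

In the full system, `value` would be the learned four-head network described under CHARM, and `expand` would sample candidate thoughts and virtual tool operations from the backbone LLM.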
- CHARM: Contrastive Hierarchical Agentic Reward Modeling
- Function: Decomposes recommendation quality into interpretable dimensions and learns context-dependent dimension weights.
- Mechanism: Each quality dimension is modeled by a dedicated reward head \(R_k(\mathbf{h}) = \tanh(\mathbf{W}_k^{(2)} \cdot \text{GELU}(\mathbf{W}_k^{(1)} \cdot \mathbf{h}))\), with outputs constrained to \([-1,1]\). Context-dependent weights are learned via a meta-learning approach: \(\mathbf{w} = \text{softmax}(\mathbf{W}_{\text{meta}} \cdot [\text{Enc}(\mathbf{h}); \mathbf{e}_d] + \mathbf{b})\). Training uses a margin-based preference optimization loss.
- Design Motivation: The relative importance of quality dimensions varies across dialogue contexts and domains; adaptive weighting avoids the information loss inherent in a single scalar reward.
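The two CHARM equations above fit together as follows. The sketch below is a minimal NumPy forward pass under assumed shapes: each head implements \(R_k(\mathbf{h}) = \tanh(\mathbf{W}_k^{(2)} \cdot \text{GELU}(\mathbf{W}_k^{(1)} \cdot \mathbf{h}))\), and the meta-network mixes the heads via softmax weights. For simplicity the context encoder \(\text{Enc}(\cdot)\) is taken as the identity, and all dimensions and initializers are assumptions.

```python
import numpy as np

# Sketch of CHARM's per-dimension reward heads with context-dependent
# weighting; shapes, initializers, and the identity Enc() are assumptions.

rng = np.random.default_rng(0)
H, HID, K = 8, 16, 4   # hidden size, head width, number of quality dims

W1 = [rng.standard_normal((HID, H)) for _ in range(K)]
W2 = [rng.standard_normal((1, HID)) for _ in range(K)]
W_meta = rng.standard_normal((K, H + K))  # consumes [Enc(h); e_d]
b_meta = np.zeros(K)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def reward_heads(h):
    """R_k(h) = tanh(W2_k . GELU(W1_k . h)), each bounded in [-1, 1]."""
    return np.array([np.tanh(W2[k] @ gelu(W1[k] @ h)).item() for k in range(K)])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def quality(h, e_d):
    """Scalar quality: softmax(W_meta . [h; e_d] + b) weighted over heads."""
    w = softmax(W_meta @ np.concatenate([h, e_d]) + b_meta)
    return float(w @ reward_heads(h))
```

Because the weights form a convex combination of head outputs in \([-1,1]\), the combined quality score stays in \([-1,1]\); the margin-based preference loss then pushes `quality(chosen)` above `quality(rejected)` by a fixed margin.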
- BRIDGE: Cross-Domain Transfer and VTO Abstraction
- Function: Enables cross-domain transfer of recommendation reasoning via Virtual Tool Operations (VTO) and adversarial domain adaptation.
- Mechanism: A gradient reversal layer is used for adversarial domain adaptation to learn domain-invariant representations, while a learnable domain gate \(\mathbf{z}' = \sigma(\mathbf{g}_d) \odot \mathbf{z} + (1-\sigma(\mathbf{g}_d)) \odot \mathbf{h}\) preserves useful domain-specific signals. VTO decouples high-level reasoning operations from concrete tool implementations, enabling dynamic mapping at runtime.
- Design Motivation: Existing tool-augmented methods are tightly coupled to domain-specific tool implementations, limiting transferability.
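The domain gate and gradient reversal layer can be sketched as below. This is a hedged illustration, not the paper's implementation: the forward blend follows \(\mathbf{z}' = \sigma(\mathbf{g}_d) \odot \mathbf{z} + (1-\sigma(\mathbf{g}_d)) \odot \mathbf{h}\), and the gradient-reversal behavior is shown as a standalone backward hook (in an autograd framework it is the identity in the forward pass).

```python
import numpy as np

# Sketch of BRIDGE's domain gate and gradient reversal; the gate
# parameter g_d and all dimensionalities are assumptions.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def domain_gate(z, h, g_d):
    """z' = sigma(g_d) * z + (1 - sigma(g_d)) * h: per-dimension blend of
    domain-invariant features z (learned under the GRL) and the raw hidden
    state h, which preserves useful domain-specific signal."""
    g = sigmoid(g_d)
    return g * z + (1 - g) * h

def grad_reversal_backward(grad, lam=1.0):
    """Gradient reversal layer: identity in the forward pass, multiplies
    gradients by -lam in the backward pass, so the encoder learns to fool
    the domain classifier and produce domain-invariant representations."""
    return -lam * grad
```

With `g_d` initialized at zero, the gate starts as an even 50/50 blend and training moves each dimension toward the invariant or the domain-specific feature as needed.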
Loss & Training¶
The total loss comprises a preference optimization loss \(\mathcal{L}_{\text{pref}}\), a domain adaptation loss \(\mathcal{L}_{\text{domain}}\), a task preservation loss \(\mathcal{L}_{\text{task}}\), and an agent consistency loss \(\mathcal{L}_{\text{agree}}\).
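Since the notes list the four terms but not how they are combined, a plausible weighted-sum form is sketched below; the mixing coefficients `lam_*` are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of the composite training objective; the lambda
# coefficients are illustrative assumptions, not reported values.

def total_loss(l_pref: float, l_domain: float, l_task: float, l_agree: float,
               lam_domain: float = 0.1, lam_task: float = 1.0,
               lam_agree: float = 0.1) -> float:
    """L = L_pref + lam_domain * L_domain + lam_task * L_task
           + lam_agree * L_agree"""
    return l_pref + lam_domain * l_domain + lam_task * l_task + lam_agree * l_agree
```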
Key Experimental Results¶
Main Results¶
Recommendation performance on the ReDial dataset:
| Method | R@1 | R@10 | R@50 | MRR@10 | User Sat. | Engage. |
|---|---|---|---|---|---|---|
| KBRD | 2.9 | 16.7 | 36.2 | 7.4 | 0.42 | 0.38 |
| UniCRS | 3.8 | 18.1 | 37.4 | 8.4 | 0.45 | 0.41 |
| GPT-4 | — | — | — | — | — | — |
| HARPO | Best | Best | Best | Best | Best | Best |
Ablation Study¶
| Configuration | Key Metric | Note |
|---|---|---|
| Full HARPO | Best | Complete model |
| w/o STAR | Significant drop | Tree search reasoning removed |
| w/o CHARM | Noticeable drop | Hierarchical preference optimization removed |
| w/o BRIDGE | Cross-domain drop | Domain transfer module removed |
| w/o MAVEN | Slight drop | Multi-agent refinement removed |
Key Findings¶
- HARPO achieves an average improvement of 17–21% over the strongest baseline (GPT-4), with larger gains on user-alignment metrics.
- The largest gains are observed on the INSPIRED dataset (R@10 45.7% above GPT-4), where social dialogue requires reasoning about implicit preferences.
- Human evaluation confirms significant superiority over GPT-4 on recommendation quality, explanation quality, and overall rating (+0.55, +0.50, +0.55).
- The CHARM reward model achieves Pearson correlations of 0.64–0.73 with independent human judgments.
Highlights & Insights¶
- Identifying the misalignment between proxy metrics and recommendation quality as a fundamental problem in CRS represents a paradigm-shifting insight for the field.
- The VTO abstraction decouples reasoning logic from concrete tool implementations—analogous to interface design in software engineering—offering an elegant solution for transferable reasoning.
- Multi-dimensional quality decomposition combined with context-adaptive weighting avoids the information compression inherent in a single reward signal.
Limitations & Future Work¶
- The CHARM reward model may itself carry biases; although it correlates with human judgments, it is not a perfect substitute.
- Tree search reasoning introduces additional inference-time computational overhead, which must be considered for latency-sensitive deployment.
- The experimental datasets are limited in scale (ReDial: ~10,000 dialogues), leaving large-scale validation insufficient.
- Future work could explore extending quality dimensions to finer-grained user preference modeling.
Related Work & Insights¶
- vs. UniCRS: UniCRS unifies recommendation and generation but still optimizes proxy metrics, whereas HARPO directly optimizes recommendation quality.
- vs. RecMind: RecMind employs self-motivated reasoning but lacks quality-guided search; HARPO's value network provides explicit quality-oriented guidance.
- vs. GPT-4: Even a powerful general-purpose model performs well on proxy metrics but lags behind the specialized optimization of HARPO on user-alignment metrics.
Rating¶
- Novelty: ⭐⭐⭐⭐ Reformulating conversational recommendation as a quality-optimized decision problem is a novel contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, human evaluation, and ablation studies are all included.
- Writing Quality: ⭐⭐⭐⭐ Problem analysis is thorough and the framework design logic is clearly articulated.
- Value: ⭐⭐⭐⭐ Offers important methodological insights for the CRS community.