
R²ec: Towards Large Recommender Models with Reasoning

Conference: NeurIPS 2025 | arXiv: 2505.16994 | Code: GitHub | Area: Recommender Systems | Keywords: Recommender Systems, LLM Reasoning, Reinforcement Learning, Dual-Head Architecture, Test-Time Scaling

TL;DR

This paper proposes R²ec, the first unified large recommender model with endogenous reasoning: a dual-head architecture lets a single backbone generate a reasoning chain and then perform efficient single-step item prediction. The accompanying RecPO reinforcement learning framework jointly optimizes the reasoning and recommendation objectives without any annotated reasoning data.

Background & Motivation

Applications of large language models (LLMs) in recommender systems have converged on two dominant paradigms: encoding users and items via LLM-based embeddings, or reformulating item prediction as autoregressive generation of item IDs. These large recommender models have demonstrated strong generalization in cold-start, cross-domain, and long-tail scenarios.

The potential value of reasoning for recommendation: Models such as DeepSeek-R1 have demonstrated that test-time scaling—allowing the model more "thinking" time at inference—can substantially improve LLM performance on tasks such as mathematics and coding. Since large recommender models are themselves built upon pretrained LLMs, a natural question arises: how can recommender models also benefit from reasoning?

Critical limitations of existing approaches:

Excessive resource overhead: Maintaining a large reasoning model alongside a separate recommender model leads to compounded memory consumption and inference latency.

Difficulty of joint optimization: With separate reasoning and recommendation modules, each can only be trained while the other is frozen; gradients cannot flow across modules, preventing end-to-end alignment.

Technical challenges addressed in this work:

Model design: Most large recommender models rely on autoregressive decoding of item IDs, which is inherently slow; incorporating reasoning would further degrade latency. How can reasoning be integrated while preserving acceptable inference speed?

Training optimization: The recommendation domain lacks annotated reasoning data (unlike mathematics, which has step-by-step solutions), and the subjectivity of reasoning chains makes supervised learning infeasible at scale. Reinforcement learning (RL) is a natural alternative, yet reward design and objective coupling in recommendation settings present unique challenges.

Method

Overall Architecture

R²ec rests on two pillars:

  • Dual-head architecture: A single LLM backbone equipped with a language modeling head (for reasoning generation) and a recommendation head (for item prediction); the model autoregressively generates a reasoning chain before performing single-step item prediction.
  • RecPO training framework: Annotation-free RL-based training that jointly optimizes reasoning and recommendation via a fused reward.

Key Designs

  1. Unified Dual-Head Architecture:

    • Language modeling head (lm_head): A standard token embedding table \(\mathbf{H}_\mathcal{T} \in \mathbb{R}^{|\mathcal{T}| \times d}\), responsible for autoregressive generation of reasoning tokens.
    • Recommendation head (rec_head): An item embedding table \(\mathbf{H}_\mathcal{V} \in \mathbb{R}^{|\mathcal{V}| \times d}\), where each item embedding is obtained by encoding its textual description through the model itself. Item scores are computed via inner product: \(s(v) = \mathbf{h}_T^\top \mathbf{H}_\mathcal{V}[v]\).

Tight reasoning–recommendation coupling: Both heads share the same hidden state space; the reasoning process directly reshapes the final hidden state \(\mathbf{h}_T\), thereby influencing recommendation scores. This ensures that optimizing the reasoning process directly contributes to improving recommendation performance.

Efficiency advantage: Replacing autoregressive item-ID decoding with next-item prediction (single-step inner-product matching) substantially reduces inference latency. The item table supports flexible addition and removal of items, enabling zero-shot generalization.
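A minimal PyTorch sketch may help make this dual-head layout concrete. It is not the authors' code: it assumes a Hugging Face-style causal LM backbone, and names such as `DualHeadRecommender` and `build_item_table` (with last-token pooling of item descriptions) are illustrative choices.

```python
import torch
import torch.nn as nn


class DualHeadRecommender(nn.Module):
    """Sketch of R²ec's dual-head design: one shared causal-LM backbone, the
    original lm_head for reasoning tokens, and a rec_head (item embedding
    table) scored against the final hidden state h_T by inner product."""

    def __init__(self, backbone, item_table):
        super().__init__()
        self.backbone = backbone                          # pretrained causal LM
        self.lm_head = backbone.get_output_embeddings()   # token table, |T| x d
        self.rec_head = nn.Parameter(item_table)          # item table H_V, |V| x d

    @torch.no_grad()
    def generate_reasoning(self, input_ids, max_new_tokens=256, temperature=0.8):
        # The reasoning chain is produced autoregressively with the LM head only.
        return self.backbone.generate(
            input_ids, max_new_tokens=max_new_tokens,
            do_sample=True, temperature=temperature,
        )

    def score_items(self, sequence_ids):
        # Single-step item prediction: final hidden state h_T matched against
        # every item embedding in one inner product, s(v) = h_T^T H_V[v].
        out = self.backbone(sequence_ids, output_hidden_states=True)
        h_T = out.hidden_states[-1][:, -1, :]             # (B, d)
        return h_T @ self.rec_head.T                      # (B, |V|) item scores


@torch.no_grad()
def build_item_table(backbone, tokenizer, item_descriptions, device="cpu"):
    # Each item embedding is the backbone's final hidden state over the item's
    # textual description (last-token pooling here is an assumption).
    rows = []
    for text in item_descriptions:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        out = backbone(ids, output_hidden_states=True)
        rows.append(out.hidden_states[-1][0, -1, :])
    return torch.stack(rows)                              # (|V|, d)
```

Because the recommendation head is just a matrix of encoded item descriptions, rows can be added or removed without touching the backbone, which is what enables the flexible catalog updates and zero-shot generalization noted above.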

  2. RecPO Training Framework:

    • Trajectory sampling: Each trajectory covers the complete "reasoning → recommendation" process. For each user input \(x_u\), \(G\) distinct reasoning paths are sampled from the old policy \(\pi_{\theta_{old}}\), each followed by a single-step item recommendation.

    • Fused reward: Using ranking metrics (e.g., NDCG) alone as the reward is insufficient—many trajectories of varying quality may yield identical top-\(K\) rankings. A fused reward is therefore designed as:

    \(R = \beta R_c + (1 - \beta) R_d\)

where \(R_d = \text{NDCG}@k(\text{rank}(v^+))\) is the discrete ranking reward, and \(R_c = \frac{\exp(\mathbf{h}_T^\top \mathbf{h}_{v^+}/\tau)}{\sum_{v \in \mathcal{V}} \exp(\mathbf{h}_T^\top \mathbf{h}_v/\tau)}\) is the continuous similarity reward. Setting \(\beta \approx 0.05\) keeps the ranking term dominant while allowing the continuous term to provide discriminative signal among trajectories with identical rankings. A code sketch of both reward terms appears after this list.

    • Joint training objective: Token-level reasoning decisions and item-level recommendation decisions are unified within a single RL objective:

    \(\mathcal{J}(\theta) = \frac{1}{G}\sum_{i=1}^{G}\left[\sum_{t=1}^{T_i}\ell_\epsilon(r_{i,t}(\theta), A_i) + \delta_{i,i^\star}\,\ell_\epsilon(r_{i,T_i+1}(\theta), A_i)\right]\)

A key design choice: all trajectories contribute to policy updates for reasoning tokens, but only the trajectory with the highest advantage (\(i^\star = \arg\max_j A_j\)) contributes gradients for the recommendation action. This preserves diversity in reasoning exploration while focusing recommendation learning on the most promising reasoning paths.
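A schematic sketch of the fused reward and the selective-gradient update, under my reading of the formulas above (function names, the PPO-style clipping constant, and the single-relevant-item NDCG@k form are assumptions, not the released implementation):

```python
import math

import torch


def fused_reward(h_T, item_table, pos_item, beta=0.05, tau=1.0, k=20):
    """R = beta * R_c + (1 - beta) * R_d, with R_d the NDCG@k of the positive
    item's rank and R_c the softmax similarity of the positive item."""
    scores = item_table @ h_T                                  # (|V|,)
    rank = int((scores > scores[pos_item]).sum().item()) + 1   # 1-based rank of v+
    r_d = 1.0 / math.log2(rank + 1) if rank <= k else 0.0      # NDCG@k, one relevant item
    r_c = torch.softmax(scores / tau, dim=0)[pos_item].item()
    return beta * r_c + (1.0 - beta) * r_d


def recpo_loss(reason_ratios, rec_ratios, advantages, eps=0.2):
    """All G trajectories update the reasoning tokens; only the trajectory with
    the highest advantage (i*) updates the recommendation action.
    reason_ratios: list of per-token probability-ratio tensors, one per trajectory.
    rec_ratios: (G,) item-probability ratios. advantages: (G,) tensor.
    Returns a loss to minimize (the objective J is maximized, hence the sign flip)."""
    def clipped(ratio, adv):
        return torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    i_star = int(advantages.argmax())
    loss = torch.zeros(())
    for i, adv in enumerate(advantages):
        loss = loss - clipped(reason_ratios[i], adv).sum()      # token-level terms
        if i == i_star:
            loss = loss - clipped(rec_ratios[i], adv)           # item-level term
    return loss / len(advantages)
```

With \(\beta = 0.05\) the NDCG term dominates, but two trajectories that both rank \(v^+\) first still receive slightly different rewards through \(R_c\), which is exactly the tie-breaking role described above.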

Loss & Training

  • Base models: Gemma2-2B-Instruct and Qwen2.5-3B-Instruct
  • Advantage estimation: GRPO outperforms RLOO (faster initial learning, gradual growth in reasoning length, analogous to phenomena observed in LLM reasoning training)
  • Sampling: top-\(K\) sampling with temperature to control stochasticity
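For reference, the GRPO-style group normalization credited with the faster early learning can be sketched as follows (illustrative; the small `eps` stabilizer is an assumption):

```python
import torch


def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each group of G trajectory rewards
    to zero mean and unit variance, so updates keep a consistent scale even
    when fused rewards are tightly clustered."""
    r = torch.as_tensor(group_rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)


# Rewards that differ only slightly still yield well-separated advantages.
print(grpo_advantages([0.31, 0.28, 0.35, 0.30]))  # roughly [0.0, -1.0, 1.4, -0.3]
```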

Key Experimental Results

Main Results: Recommendation Performance on Three Amazon Datasets

| Method | Instruments H@5 | CDs H@5 | Games H@5 | Instruments N@20 | CDs N@20 |
|---|---|---|---|---|---|
| SASRec | 0.0175 | 0.0076 | 0.0129 | 0.0210 | 0.0141 |
| TIGER | 0.0171 | 0.0067 | 0.0123 | 0.0134 | 0.0069 |
| LangPTune | 0.0127 | 0.0074 | 0.0049 | 0.0145 | 0.0094 |
| D³ (Gemma) | 0.0072 | 0.0216 | 0.0117 | 0.0114 | 0.0194 |
| R²ec (Qwen) | 0.0237 | 0.0513 | 0.0288 | 0.0259 | 0.0457 |
| R²ec (Gemma) | 0.0264 | 0.0573 | 0.0326 | 0.0257 | 0.0527 |

R²ec improvements over the best baseline: H@5 on CDs +63.7%, N@10 on CDs +72.3%; H@10 on Instruments +67.0%.

Ablation Study

| Configuration | Instruments H@5 | CDs H@5 | Games H@5 | Note |
|---|---|---|---|---|
| w/ ClsHead (classification head) | 0.0044 | 0.0030 | 0.0012 | Reasoning and recommendation decoupled; severely degraded |
| w/o Reasoning | 0.0176 | 0.0469 | 0.0277 | No reasoning, contrastive learning only |
| w/o \(R_d\) (continuous reward only) | 0.0198 | 0.0521 | 0.0302 | Insufficient discrimination |
| w/o \(R_c\) (ranking reward only) | 0.0244 | 0.0543 | 0.0316 | Slightly below the fused reward |
| R²ec | 0.0264 | 0.0588 | 0.0326 | Fused reward is best |

Key Findings

  1. Reasoning substantially improves recommendation: Introducing reasoning yields an average improvement of approximately 15%, validating the effectiveness of test-time scaling in recommendation settings.
  2. Tight reasoning–recommendation coupling is critical: The classification head variant (w/ ClsHead) suffers a dramatic performance collapse, indicating that reasoning and recommendation must share the same hidden state space to benefit each other.
  3. The fused reward design is effective: The discrete ranking reward \(R_d\) is the primary signal; the continuous similarity reward \(R_c\) provides supplementary fine-grained information. Using \(R_c\) alone introduces noise.
  4. GRPO outperforms RLOO: GRPO's unit-variance normalization amplifies reward gradients in the recommendation setting, accelerating early-stage learning, and reasoning length grows progressively during training—a phenomenon analogous to that observed in DeepSeek-R1.
  5. Smaller models can excel: Gemma2-2B outperforms Qwen-3B on most tasks, suggesting that model selection matters more than parameter count.

Highlights & Insights

  • The dual-head architecture is particularly elegant: the shared backbone allows reasoning gradients to flow naturally into recommendation parameters, avoiding the gradient disconnection of two-stage approaches; using inner-product matching over an item embedding table is far more efficient than autoregressive item-ID generation.
  • The work successfully transfers insights from LLM reasoning training (e.g., GRPO) to the recommendation domain, bridging two rapidly advancing research directions.
  • The fused reward design reflects a deep understanding of challenges specific to recommendation—the discreteness of ranking metrics necessitates continuous signals as a complement.

Limitations & Future Work

  • Analysis of reasoning chain interpretability is primarily qualitative; systematic quantitative evaluation is lacking.
  • Validation is currently limited to three Amazon datasets; generalizability to larger-scale and more diverse settings remains to be confirmed.
  • The item embedding table must be constructed in advance, which may increase maintenance costs for rapidly changing item catalogs.
  • Automatic control of reasoning length and further efficiency optimization are important directions for future work.
  • The fundamental distinction from reasoning-augmented recommendation methods such as LangPTune is that R²ec is a genuinely end-to-end unified model, completing reasoning and recommendation within a single forward pass.
  • The RecPO framework provides a reusable paradigm for RL training in recommender systems, particularly the fused reward design and the selective gradient backpropagation strategy.
  • The work offers broader inspiration for "LLM + vertical domain" applications: how to design unified architectures that allow reasoning capabilities to naturally serve domain-specific tasks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First unified reasoning–recommendation model; both the dual-head architecture and RecPO are original contributions
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, multiple baselines, and nine analyses make for exceptionally comprehensive evaluation
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with well-motivated arguments
  • Value: ⭐⭐⭐⭐⭐ Opens a new paradigm for reasoning-enhanced recommendation with substantial empirical gains