RecLM: Recommendation Instruction Tuning

Conference: ACL 2025
arXiv: 2412.19302
Code: https://github.com/HKUDS/RecLM
Area: Recommendation Systems / LLM
Keywords: Recommendation Systems, Instruction Tuning, Collaborative Filtering, Cold-Start, Reinforcement Learning, User Profiling

TL;DR

This work proposes RecLM, a model-agnostic recommendation instruction tuning framework. It injects collaborative filtering signals into the user/item profiles generated by an LLM through two-round conversational instruction tuning, then refines profile quality with RLHF (PPO). As a plug-and-play component, it improves BiasMF, NCF, LightGCN, SGL, and SimGCL on MIND, Netflix, and an industrial dataset, with the improvement reported as most pronounced in cold-start scenarios.

Background & Motivation

Background: Recommendation systems primarily rely on ID-based collaborative filtering (CF), learning user/item ID embeddings with methods such as GNNs. These work well when interaction data is plentiful but degrade in cold-start and zero-shot scenarios.

Limitations of Prior Work: (1) ID-based embeddings cannot generate meaningful representations for new items; (2) relying on textual side information (item descriptions) as alternative embeddings is hindered by data incompleteness and quality issues (e.g., misleading tags, irrelevant descriptions); (3) although LLMs possess powerful language comprehension capabilities, they lack the ability to model user-item interaction behavior patterns.

Key Challenge: The textual understanding of LLMs and the interaction modeling of collaborative filtering are complementary, but it remains unclear how to make an LLM "comprehend" the behavioral context of a recommendation scenario.

Goal: (1) Design a mechanism for LLMs to generate accurate user profiles, especially for cold-start users/items; (2) distill high-quality profiles from noisy features.

Key Insight: Convert recommendation tasks into LLM instruction-tuning tasks, encoding high-order collaborative filtering relationships into conversational prompts to guide LLMs in generating user profiles infused with collaborative signals.

Core Idea: Two-round conversational instruction tuning (the first round generates profiles, and the second round predicts interactions) combined with RLHF profile refinement. The generated profile embeddings can then be plugged directly into any recommendation model.

Method

Overall Architecture

Textual side-information projection \(\rightarrow\) LLM collaborative instruction tuning (two-round dialogue) \(\rightarrow\) RL refinement \(\rightarrow\) user/item profile generation \(\rightarrow\) profile embedding integration into downstream recommendation models.

Key Designs

Text-driven user/item representation:
- Function: Use text as a substitute for ID embeddings to achieve zero-shot recommendation.
- Mechanism: Textual item descriptions are projected into a low-dimensional space via an MLP to serve as initial item embeddings: \(\hat{f}_v = T_{raw}(f)\). User embeddings combine ID embeddings with the LLM-generated profiles.
- Design Motivation: Textual features are available even for new items, overcoming the cold-start limitations of ID embeddings.
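As a rough illustration of the text-driven projection described above, the sketch below maps a raw text feature vector to a low-dimensional item embedding with a small MLP. The two-layer ReLU architecture, the dimensions (16, 8, 4), the random untrained weights, and the names `make_mlp` and `project_text_feature` are all assumptions made for illustration; the source only states that an MLP performs the projection.

```python
import random

def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialised two-layer MLP weights (hypothetical sizes)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2

def matvec(weights, x):
    """Multiply a row-major weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def project_text_feature(f, mlp):
    """Project a raw text feature f: linear -> ReLU -> linear."""
    w1, w2 = mlp
    hidden = [max(0.0, h) for h in matvec(w1, f)]
    return matvec(w2, hidden)

# Stand-in for a text-encoder output; a real feature would come from
# encoding the item description.
raw_feature = [0.5] * 16
mlp = make_mlp(in_dim=16, hidden_dim=8, out_dim=4)
item_embedding = project_text_feature(raw_feature, mlp)
```

Because the projection depends only on text, the same function can be applied to an item that has no interaction history, which is the property the design motivation relies on.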
Two-round collaborative instruction tuning (Core):
- Function: Inject collaborative filtering signals into the profile generation capacity of the LLM.
- Mechanism:
  - First round — Collaborative Profile Generation: Input the target user's interaction history plus the interaction histories of similar users (based on LightGCN embedding distance) to allow the LLM to generate a user profile reflecting collaborative relationships.
  - Second round — Interaction Prediction Supervision: Prompt the LLM with "Will user \(u\) interact with item \(v\)?" based on the first-round profile, where the ground truth is Yes/No. Positive samples are selected from items in the user's history that also appear in the similar users' histories, while negative samples are items from the similar users' histories that the target user has not interacted with.
  - Multi-round tuning strategy: After concatenating the two rounds of dialogue, loss functions are calculated for both \(\mathcal{R}_{fir.}\) (profile) and \(\mathcal{R}_{sec.}\) (Yes/No) to jointly optimize both profile generation and interaction prediction.
- Design Motivation: Relying solely on profile generation lacks direct supervision signals (as ground-truth profiles do not exist); the second-round interaction prediction provides an indirect yet explicit supervision.
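The neighbour lookup and the positive/negative sampling rule for the second round can be sketched as follows. This is a minimal sketch under stated assumptions: Euclidean distance stands in for "LightGCN embedding distance", the toy embeddings and histories are invented, and the prompt wording in `second_round_prompt` is hypothetical rather than the authors' template.

```python
import math

def nearest_users(target, user_emb, k):
    """Return the k users closest to `target` by Euclidean distance."""
    dists = [
        (math.dist(user_emb[target], emb), user)
        for user, emb in user_emb.items()
        if user != target
    ]
    return [user for _, user in sorted(dists)[:k]]

def second_round_samples(target, history, neighbours):
    """Split the neighbours' items into positive and negative candidates."""
    neighbour_items = set().union(*(history[u] for u in neighbours))
    positives = sorted(history[target] & neighbour_items)  # seen by target and neighbours
    negatives = sorted(neighbour_items - history[target])  # seen only by neighbours
    return positives, negatives

def second_round_prompt(user, item):
    """Hypothetical wording of the second-round question."""
    return f"Will user {user} interact with item {item}? Answer Yes or No."

# Toy data: these vectors stand in for pretrained LightGCN user embeddings.
user_emb = {"u1": [0.0, 0.0], "u2": [0.1, 0.0], "u3": [0.0, 0.2], "u4": [5.0, 5.0]}
history = {"u1": {"a", "b", "c"}, "u2": {"b", "d"}, "u3": {"c", "e"}, "u4": {"z"}}

neighbours = nearest_users("u1", user_emb, k=2)
positives, negatives = second_round_samples("u1", history, neighbours)
examples = [(second_round_prompt("u1", i), "Yes") for i in positives]
examples += [(second_round_prompt("u1", i), "No") for i in negatives]
```

Drawing negatives from the neighbours' histories rather than from the whole catalogue makes them hard negatives: they are items that similar users liked but the target user did not touch.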
RLHF profile refinement:
- Function: Use reinforcement learning to alleviate train-inference discrepancy and over-smoothing issues.
- Mechanism:
  - Reward Model: Build a reward model on top of the LLM. Positive profiles are generated with ChatGPT, while negative profiles are constructed using diverse prompts and profile substitution. The reward model is trained with a ranking loss.
  - PPO Optimization: Treat the LLM as the policy, optimize it against the reward model, and apply a KL-divergence constraint to prevent reward hacking.
- Design Motivation: Profiles after instruction tuning might over-rely on collaborative information (similar to GNN over-smoothing); RL refinement preserves personalized characteristics.
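The two ingredients of the refinement stage can be illustrated with scalar arithmetic. The source does not give the exact form of the ranking loss or the KL constraint, so the sketch assumes the forms commonly used in RLHF: a pairwise logistic loss on the reward gap, and a reward reduced by a scaled log-probability ratio between the policy and a frozen reference model. The coefficient `beta` and all numbers are hypothetical.

```python
import math

def pairwise_ranking_loss(r_pos, r_neg):
    """-log(sigmoid(r_pos - r_neg)), computed stably as softplus(r_neg - r_pos)."""
    margin = r_pos - r_neg
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def kl_shaped_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Subtract a KL estimate (sum of per-token log-ratios) from the reward."""
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return reward - beta * kl_estimate

# Reward model: the loss is small when the positive profile scores higher.
loss_good = pairwise_ranking_loss(r_pos=2.0, r_neg=0.0)
loss_bad = pairwise_ranking_loss(r_pos=0.0, r_neg=2.0)

# PPO: a profile that drifts from the reference model keeps less of its reward.
shaped = kl_shaped_reward(1.0, logp_policy=[-0.5, -1.0], logp_reference=[-1.0, -2.0])
```

The KL term is what limits reward hacking: the policy cannot raise its reward indefinitely by moving to text the reference model considers very unlikely.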

Loss & Training

Instruction Tuning: LLaMA-2-7B-Chat as the base model, LoRA fine-tuning, optimizing both dialogue rounds simultaneously.
RL: PPO with KL constraints; the reward model is trained using ranking loss.
RecModel Training: BPR loss, with plug-and-play embedding replacement.
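Optimizing both dialogue rounds simultaneously can be pictured as one language-modeling loss over the concatenated dialogue, computed only on the response tokens (the profile in round one and the Yes/No answer in round two). The sketch below assumes prompt tokens are masked out and that the supervised tokens are averaged jointly; both are assumptions, and the log-probabilities are invented.

```python
def joint_dialogue_loss(token_logprobs, supervised):
    """Mean negative log-likelihood over supervised (response) positions only."""
    picked = [-lp for lp, keep in zip(token_logprobs, supervised) if keep]
    if not picked:
        raise ValueError("no supervised tokens in this dialogue")
    return sum(picked) / len(picked)

# Token layout: [prompt1, prompt1, profile, profile, prompt2, answer]
token_logprobs = [-3.0, -2.0, -0.5, -1.5, -4.0, -1.0]
supervised = [False, False, True, True, False, True]

loss = joint_dialogue_loss(token_logprobs, supervised)
```

With this layout the gradient from the Yes/No token flows back through the profile tokens it is conditioned on, which is how the second round can supervise the first.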

Key Experimental Results

Main Results

Baseline Model	Dataset	Base R@20	+RecLM R@20	Gain
BiasMF	MIND	0.0683	0.0719	+5.3%
BiasMF	Netflix	0.0449	0.0531	+18.3%
BiasMF	Industrial	0.0078	0.0121	+55.1%
LightGCN	MIND	0.0822	0.0842	+2.4%
SimGCL	Netflix	0.0662	0.0683	+3.2%
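
The Gain column is the relative change in R@20, not an absolute difference. A short check that recomputes it from the two R@20 columns of the table (the function name `relative_gain` is just a label for this sketch):

```python
rows = [
    ("BiasMF", "MIND", 0.0683, 0.0719),
    ("BiasMF", "Netflix", 0.0449, 0.0531),
    ("BiasMF", "Industrial", 0.0078, 0.0121),
    ("LightGCN", "MIND", 0.0822, 0.0842),
    ("SimGCL", "Netflix", 0.0662, 0.0683),
]

def relative_gain(base, enhanced):
    """Relative improvement in percent."""
    return (enhanced - base) / base * 100.0

gains = [round(relative_gain(base, enhanced), 1) for _, _, base, enhanced in rows]
# gains == [5.3, 18.3, 55.1, 2.4, 3.2], matching the table
```

The large relative gain on the industrial dataset comes from a small base value: the absolute change there is 0.0043, smaller than the 0.0082 absolute change for BiasMF on Netflix.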

The improvement is even more significant in cold-start (zero-shot) scenarios.

Ablation Study

Configuration	MIND R@20	Netflix R@20
RecLM Full	0.0842	0.0683
w/o Profile (text only)	0.0809	0.0643
w/o Two-Round Tuning (single-round)	0.0823	0.0665
w/o RL Refinement	0.0831	0.0672
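
For readers unfamiliar with the metrics, R@20 and N@20 denote Recall@20 and NDCG@20. The sketch below uses the standard binary-relevance definitions; the exact evaluation code of the paper is not shown in the source, and some implementations normalize recall by min(K, number of relevant items) instead.

```python
import math

def recall_at_k(ranked, relevant, k=20):
    """Fraction of relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=20):
    """Position-discounted gain of the top-k list, normalized by the ideal ranking."""
    if not relevant:
        return 0.0
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / idcg

relevant = {"a", "d", "x"}
recall = recall_at_k(["a", "b", "c", "d"], relevant, k=3)  # only "a" is in the top 3
ndcg_perfect = ndcg_at_k(["a", "d", "x"], relevant, k=3)   # ideal ordering
```

Recall ignores where a hit sits inside the top K, while NDCG discounts hits at lower ranks, so a change that retrieves more relevant items but ranks them lower can raise Recall and lower NDCG at the same time.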

Key Findings

Largest improvement on the industrial dataset: BiasMF's R@20 on Industrial rises from 0.0078 to 0.0121 (+55.1%), suggesting that profile enhancement is most valuable in sparse real-world scenarios.
Strong model-agnosticism: Consistent improvements are achieved across five different recommendation models (ranging from simple MF to GNNs).
Two-round tuning outperforms single-round: Interaction prediction in the second round provides crucial supervision signals for profile generation.
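RL refinement is effective: Removing PPO lowers R@20 from 0.0842 to 0.0831 on MIND and from 0.0683 to 0.0672 on Netflix, so PPO contributes a relative gain of roughly 1-2%, consistent with mitigating the collaborative over-smoothing issue.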
Pronounced advantage in cold-start: RecLM provides meaningful initial representations in zero-shot scenarios.

Highlights & Insights

"Recommendation as Dialogue" paradigm: Encoding collaborative filtering relationships into LLM conversational instructions allows the LLM to "learn" behavioral semantics in recommendations. The two-round dialogue design closes the loop: the profile generated in the first round is what the second round uses to predict interactions.
Plug-and-play design: Profile embeddings can be integrated into any recommendation model via simple fusion, which makes the approach practical to adopt.
Industrial validation: Substantial improvements on an anonymous industrial dataset increase credibility.
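
The source does not specify what "simple fusion" means, so the following is only one plausible reading: a weighted sum of the ID embedding and the projected profile embedding, with a fallback to the profile alone when a user has no trained ID embedding. The function names, the `weight` parameter, and the dot-product scorer are assumptions for illustration.

```python
def fuse(id_emb, profile_emb, weight=0.5):
    """Weighted sum; falls back to the profile alone when no ID embedding exists."""
    if id_emb is None:
        return list(profile_emb)
    return [(1 - weight) * a + weight * b for a, b in zip(id_emb, profile_emb)]

def score(user_vec, item_vec):
    """Dot-product preference score, as used by MF-style backbones."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

warm_user = fuse([1.0, 0.0], [0.0, 1.0])  # has interaction history
cold_user = fuse(None, [1.0, 2.0])        # profile only
warm_score = score(warm_user, [2.0, 4.0])
```

Under this reading the backbone model is untouched: only the embedding it consumes changes, which is what makes the component model-agnostic.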

Limitations & Future Work

Large LLM inference overhead: LLaMA-2-7B-Chat must generate a profile for every user, so large-scale deployment requires efficiency optimizations.
Dependence on ChatGPT to generate instruction tuning data and RL positive samples leads to non-negligible construction costs.
The N@20 metric on MIND decreases for some backbones (BiasMF -12.5%, NCF -11.4%), suggesting that the profiles may introduce noise.
The evaluation is restricted to top-K recommendation, leaving other recommendation tasks (e.g., Click-Through Rate prediction) unassessed.
Similar-user selection relies on LightGCN embeddings, so collaborative neighbors cannot be retrieved for completely cold-start users.

Comparison with Related Work

vs RLMRec/LLMRec: Other LLM-enhanced recommendation approaches typically use raw LLM embeddings or generate textual features directly, lacking the injection of collaborative signals. RecLM explicitly introduces high-order collaborative relationships via two-round dialogue tuning.
vs KAR: KAR uses LLMs to enrich knowledge but does not involve user profiling or RL refinement.
vs P5/InstructRec: These methods model recommendation entirely as sequential generation tasks via LLMs, incurring high computational costs. RecLM only uses LLMs for profile generation, keeping the actual recommendation task handled by efficient CF models.

Rating

Novelty: ⭐⭐⭐⭐ The two-round collaborative instruction tuning + RLHF refinement design is innovative.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 3 datasets (including an industrial one), 5 recommendation models, zero-shot testing, and comprehensive ablation.
Writing Quality: ⭐⭐⭐⭐ Complete structure, clear formulation.
Value: ⭐⭐⭐⭐ A plug-and-play LLM-enhanced recommendation solution with direct practical value for cold-start problems.