Align³GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation¶
Conference: AAAI 2026 (Oral)
arXiv: 2511.11255v2
Code: None
Area: Recommender Systems / Information Retrieval
Keywords: Generative Recommendation, LLM Alignment, Collaborative Filtering, DPO, Semantic-Collaborative ID
TL;DR¶
This paper proposes Align³GR, a unified three-level alignment framework that systematically bridges the semantic-behavioral gap between LLMs and recommender systems at the token level (dual-side SCID), the behavior modeling level (multi-task SFT), and the preference level (progressive DPO).
Background & Motivation¶
Using LLMs as generative recommenders to produce recommendations end-to-end has become a recent trend. However, a fundamental gap exists between the language modeling objective of LLMs—focused on semantic information and next-token prediction—and the implicit user preference modeling objective of recommender systems, which centers on interaction behavior. Existing work typically performs alignment at only one of three stages: tokenization, SFT, or preference RL, lacking systematic multi-level joint optimization. Moreover, prior methods in the tokenization stage often encode only items while ignoring users, and preference alignment approaches largely rely on static offline data, making them ill-suited for dynamically evolving user preferences in real-world scenarios.
Core Problem¶
- How to jointly model both semantic and collaborative signals for users and items at the token level, rather than encoding them in isolation?
- How to enable LLMs during SFT to not only learn behavioral recommendation patterns but also understand the semantic meaning of user tokens?
- How to continuously improve the model through progressive preference optimization (from easy to hard), breaking the performance ceiling of static DPO?
Method¶
Overall Architecture¶
Align³GR is a unified three-level alignment pipeline: Token-Level Alignment → Behavior Modeling-Level Alignment → Preference-Level Alignment. It uses Llama2-7B as the backbone with LoRA for parameter-efficient fine-tuning.
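As a rough illustration of the parameter-efficient setup, the sketch below wires a Llama2-7B backbone to LoRA adapters with Hugging Face `peft`; the rank, scaling factor, dropout, and target modules are assumed values for illustration, not hyperparameters reported in the paper.

```python
# Hypothetical PEFT setup; rank, alpha, dropout, and target modules are assumed, not from the paper.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16,                    # assumed adapter rank
    lora_alpha=32,           # assumed scaling factor
    lora_dropout=0.05,       # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Llama attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable; the 7B backbone stays frozen
```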
Key Designs¶
- Token-Level Alignment: Dual-Side SCID Tokenization (see the RQ-VAE sketch after this list)
  - Semantic features (frozen T5 encoder) and collaborative features (frozen DIN encoder) are extracted separately for both users and items, then concatenated and fused via an SC Encoder (MLP) into a unified SC embedding.
  - A 3-layer RQ-VAE (codebook of 256 entries per layer, each of dimension 32) quantizes the SC embedding into discrete SCID tokens.
  - The training objective consists of two components: a user-item behavior alignment loss \(\mathcal{L}_{\text{U2I}}\) (sampled-softmax) and the RQ-VAE reconstruction/quantization losses, controlled by hyperparameters \(\alpha, \gamma\) in a two-stage switching schedule: first stabilizing behavior alignment (\(\alpha=1, \gamma=0\)), then focusing on quantization learning (\(\alpha=0.1, \gamma=1\)).
  - At inference time, the user and item modules operate independently, each generating their respective SCIDs.
- Behavior Modeling-Level Alignment: Augmented Multi-Task SFT (see the prompt-construction sketch after this list)
  - Built upon the multi-task SFT of LC-Rec (sequential prediction, asymmetric prediction, intent inference, and preference reasoning), with two key enhancements:
    - User SCID Injection: User SCID tokens are incorporated into the prompts of all tasks to provide richer contextual information.
    - Bidirectional Alignment Task (\(B_2\)): text→SCID (predicting SCID from user profiles) and SCID→text (reconstructing user profiles from SCIDs), explicitly establishing correspondence between SCID tokens and their semantic meanings.
- Preference-Level Alignment: Progressive DPO (see the curriculum sketch after this list)
  - Based on Softmax-DPO (1 positive + 20 negatives per sample), training proceeds in two progressive stages:
    - SP-DPO (Self-Play DPO): The model plays against itself to generate diverse training data. Leveraging the hierarchical structure of SCIDs, training is divided into three stages (Easy/Medium/Hard) by prefix n-gram overlap, progressively transitioning from completely distinct positive-negative pairs to pairs whose prefixes are highly overlapping yet still different.
    - RF-DPO (Real-world Feedback DPO): Real user feedback is used to construct preference data at three levels (disliked/neutral/liked), with the same progressive curriculum: the Easy stage uses strongly disliked items as negatives, while the Hard stage uses neutral (exposed but not clicked) items as harder negatives.
  - The fine-tuned model at each stage serves as the reference model for the next stage: \(\pi_\theta^i \to \pi_{\text{ref}}^{i+1}\).
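The token-level design can be pictured with the minimal sketch below, shown for the user side (the item side is symmetric). The 3 quantization levels, 256-entry codebooks, and code dimension 32 follow the numbers above; the feature dimensions, the MLP shape of the SC Encoder, and the straight-through residual quantization details are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RQVAE(nn.Module):
    """3-level residual quantizer: 256 codes per level, code dim 32 (as quoted above)."""
    def __init__(self, in_dim, levels=3, codebook_size=256, code_dim=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, code_dim)
        self.dec = nn.Linear(code_dim, in_dim)
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, code_dim) for _ in range(levels)
        )

    def forward(self, x):
        z = self.enc(x)
        residual, quantized, ids, commit_loss = z, 0.0, [], 0.0
        for cb in self.codebooks:
            dist = torch.cdist(residual, cb.weight)   # (B, codebook_size)
            idx = dist.argmin(dim=-1)                 # nearest code at this level
            code = cb(idx)
            quantized = quantized + code
            commit_loss = commit_loss + F.mse_loss(residual, code.detach()) \
                                      + F.mse_loss(residual.detach(), code)
            residual = residual - code.detach()       # quantize what is left over
            ids.append(idx)
        x_hat = self.dec(z + (quantized - z).detach())  # straight-through estimator
        recon_loss = F.mse_loss(x_hat, x)
        return torch.stack(ids, dim=-1), recon_loss + commit_loss

# Fuse semantic (frozen T5) and collaborative (frozen DIN) features into an SC embedding,
# then quantize it into discrete SCID tokens. Feature/hidden sizes here are placeholders.
sc_encoder = nn.Sequential(nn.Linear(768 + 64, 256), nn.ReLU(), nn.Linear(256, 128))
user_rq = RQVAE(in_dim=128)

sem_feat, cf_feat = torch.randn(8, 768), torch.randn(8, 64)    # placeholder features
sc_emb = sc_encoder(torch.cat([sem_feat, cf_feat], dim=-1))
user_scid, l_user_rq = user_rq(sc_emb)                          # user_scid: (8, 3) discrete tokens

# Two-stage token-level objective: L = alpha * L_U2I + gamma * (L_user_RQ + L_item_RQ),
# where L_U2I is a sampled-softmax user-to-item alignment loss (omitted here).
```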
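To make the behavior modeling-level alignment concrete, here is a hypothetical serialization of the user-SCID-injected prompts and the bidirectional task \(B_2\); the prompt wording and the `<u_*>`/`<i_*>` special-token format are illustrative, not the paper's exact templates.

```python
# Hypothetical prompt templates; the paper's exact wording and token vocabulary may differ.
def format_scid(prefix, codes):
    """Render a 3-level SCID as special tokens, e.g. <u_12><u_87><u_3>."""
    return "".join(f"<{prefix}_{c}>" for c in codes)

user_scid = format_scid("u", [12, 87, 3])
item_scids = [format_scid("i", c) for c in ([5, 201, 44], [17, 9, 130])]
user_profile = "Enjoys acoustic guitars and beginner-friendly accessories."

sft_examples = [
    # Sequential prediction with the user SCID injected into the prompt.
    {"prompt": f"User {user_scid} has interacted with {item_scids[0]}. Predict the next item.",
     "response": item_scids[1]},
    # B2, direction 1: text -> SCID.
    {"prompt": f"Profile: {user_profile} Output this user's SCID.",
     "response": user_scid},
    # B2, direction 2: SCID -> text.
    {"prompt": f"Describe the user represented by {user_scid}.",
     "response": user_profile},
]
```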
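For the preference level, the Easy/Medium/Hard split in SP-DPO can be sketched as bucketing self-play negatives by how long a SCID prefix they share with the positive; the overlap thresholds below are assumptions.

```python
def prefix_overlap(pos_scid, neg_scid):
    """Length of the shared SCID prefix between the positive and a candidate negative."""
    n = 0
    for p, q in zip(pos_scid, neg_scid):
        if p != q:
            break
        n += 1
    return n

def curriculum_stage(pos_scid, neg_scid):
    """Easy: no shared prefix; Medium: partial overlap; Hard: long shared prefix but not identical.
    Thresholds are illustrative, not from the paper."""
    if neg_scid == pos_scid:
        return None                       # identical to the positive: not a valid negative
    k = prefix_overlap(pos_scid, neg_scid)
    if k == 0:
        return "easy"
    elif k < len(pos_scid) - 1:
        return "medium"
    return "hard"

# Self-play: beam-search the current policy, keep non-ground-truth beams as negatives,
# and bucket them into the three progressive stages.
pos = [5, 201, 44]
beams = [[7, 13, 90], [5, 8, 62], [5, 201, 19], [5, 201, 44]]
buckets = {"easy": [], "medium": [], "hard": []}
for b in beams:
    stage = curriculum_stage(pos, b)
    if stage:
        buckets[stage].append(b)
# buckets -> {'easy': [[7, 13, 90]], 'medium': [[5, 8, 62]], 'hard': [[5, 201, 19]]}
```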
Loss & Training¶
- Token level: \(\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{U2I}} + \gamma \cdot (\mathcal{L}_{\text{User RQ}} + \mathcal{L}_{\text{Item RQ}})\), with two-stage training.
- Behavior modeling level: multi-task SFT loss + bidirectional alignment auxiliary loss.
- Preference level: Softmax-DPO loss with progressive SP-DPO → RF-DPO, iteratively updating the reference policy at each stage (a loss sketch follows this list).
- Backbone: Llama2-7B + LoRA, AdamW optimizer, batch size 1024, trained for 20,000 steps, beam width 20.
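A minimal sketch of a Softmax-DPO-style objective with one positive and multiple negatives per sample, following the standard formulation of log-likelihood margins against a frozen reference policy; the tensor shapes and β are illustrative.

```python
import torch
import torch.nn.functional as F

def softmax_dpo_loss(pos_logp, neg_logp, ref_pos_logp, ref_neg_logp, beta=1.0):
    """Softmax-DPO with 1 positive and N negatives per sample.
    *_logp: sequence log-probs of the generated SCID under the policy; ref_*: under the frozen reference.
    Shapes: pos_* (B,), neg_* (B, N)."""
    pos_margin = beta * (pos_logp - ref_pos_logp)              # (B,)
    neg_margin = beta * (neg_logp - ref_neg_logp)              # (B, N)
    # log sum_j exp(neg_margin_j - pos_margin); driven down when the positive outranks every negative
    inside = torch.logsumexp(neg_margin - pos_margin.unsqueeze(-1), dim=-1)
    return -F.logsigmoid(-inside).mean()

# Example batch of 4 samples, each with 1 positive and 20 negatives (as in the paper).
B, N = 4, 20
loss = softmax_dpo_loss(torch.randn(B), torch.randn(B, N),
                        torch.randn(B), torch.randn(B, N))
```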
Key Experimental Results¶
| Dataset | Metric | Align³GR | EAGER-LLM (Prev. SOTA) | Gain |
|---|---|---|---|---|
| Instruments | R@5 | 0.1103 | 0.0991 | +11.3% |
| Instruments | R@10 | 0.1442 | 0.1224 | +17.8% |
| Instruments | N@5 | 0.0947 | 0.0851 | +11.3% |
| Instruments | N@10 | 0.1113 | 0.0926 | +20.2% |
| Beauty | R@10 | 0.0994 | 0.0830 | +19.8% |
| Beauty | N@10 | 0.0529 | 0.0459 | +15.3% |
| Yelp | R@10 | 0.0679 | 0.0569 | +19.3% |
| Yelp | N@10 | 0.0403 | 0.0315 | +27.9% |
Industrial A/B Test (approximately 40 million users, multi-week deployment):
| Model | Recall@100 | Revenue Gain |
|---|---|---|
| TIGER | 0.229 | +0.555% |
| Align³GR | 0.242 | +1.432% |
Ablation Study¶
- Token level: Single-side → dual-side tokenization yields substantial improvements; incorporating collaborative features (CF) provides further gains; the U2I alignment loss works best in combination with dual-side + CF.
- Behavior modeling level: Injecting User SCID into prompts consistently improves performance; the bidirectional alignment task \(B_2\) contributes the most, indicating that LLMs require explicit supervision to establish semantic-to-structured-token mappings.
- Preference level: Self-Play improves R@10 from 0.1295 to 0.1356; the progressive curriculum further raises it to 0.1396; adding real-world feedback RF-DPO with the progressive strategy achieves the best result of 0.1442.
Highlights & Insights¶
- Systematic alignment design: The three-level alignment pipeline (token/behavior/preference) establishes clear alignment objectives at each stage, with experiments confirming that contributions from each level are complementary.
- Dual-side SCID: Unlike prior methods that encode only items, this work jointly models users and items within a unified semantic-collaborative representation and optimizes them with a U2I behavioral loss.
- Progressive DPO: The curriculum learning strategy from SP-DPO to RF-DPO, and from Easy to Hard, addresses the limitations of static DPO in dynamic recommendation scenarios.
- Industrial validation: Beyond offline experiments on three public datasets, the approach is validated through an online A/B test at the scale of 40 million users, achieving a 1.432% revenue improvement.
Limitations & Future Work¶
- Only Llama2-7B is used as the backbone; the impact of larger or more recent LLMs (e.g., Llama3) remains unexplored.
- The RQ-VAE codebook size is fixed at 256, which may cause codebook collisions for extremely large item catalogs.
- The feedback granularity in RF-DPO (disliked/neutral/liked) is coarse; finer-grained feedback signals may yield further improvements.
- On public datasets, LLM-based sentiment analysis is used as a proxy for real user feedback in RF-DPO, potentially introducing noise.
- User history is limited to the most recent 20 interactions, leaving long-sequence modeling capabilities largely unexplored.
Related Work & Insights¶
- vs. LC-Rec: LC-Rec performs only item tokenization and multi-task SFT; Align³GR adds dual-side SCID and progressive DPO, achieving comprehensive improvements.
- vs. EAGER-LLM: EAGER-LLM introduces collaborative signals at the token level but remains item-side only, with no preference alignment; Align³GR achieves all-around gains via dual-side tokenization, enhanced behavior modeling, and progressive DPO.
- vs. LETTER: LETTER proposes a learnable tokenizer but lacks user modeling and preference optimization; Align³GR significantly outperforms LETTER across all metrics.
- vs. Standard DPO: Conventional DPO relies on static offline data, whereas Align³GR's progressive SP-DPO + RF-DPO enables continuous self-improvement and real-world feedback adaptation.
The three-level alignment design (token/behavior/preference) is transferable to other scenarios where LLMs are adapted to downstream tasks such as LLM+search and LLM+advertising. The Easy-to-Hard curriculum strategy in progressive DPO is particularly valuable in settings with noisy preference labels, such as recommendation and advertising. The dual-side tokenization paradigm further suggests that users and items should be jointly modeled within a unified framework rather than processed independently when LLMs are used for recommendation.
Rating¶
- Novelty: ⭐⭐⭐⭐ (The systematic three-level alignment design is innovative, though individual modules such as RQ-VAE and DPO are not novel in themselves.)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Three public datasets + industrial A/B testing + detailed ablation studies.)
- Writing Quality: ⭐⭐⭐⭐ (Well-structured, though some sections contain dense notation.)
- Value: ⭐⭐⭐⭐ (High industrial applicability; academic novelty is moderate-to-high.)