
🎁 Recommender Systems

🤖 AAAI2026 · 27 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (24) · 💬 ACL2026 (22) · 🧪 ICML2026 (11) · 🧠 NeurIPS2025 (24)

🔥 Top topics: Recommendation ×13 · LLM ×6

Align³GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation: This paper proposes Align³GR, a unified three-level alignment framework that systematically bridges the semantic-behavioral gap between LLMs and recommender systems at the token level (dual-side SCID), the behavior modeling level (multi-task SFT), and the preference level (progressive DPO).
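The preference level of Align³GR builds on DPO. As a rough reference, the standard DPO loss for a single preference pair can be sketched as below. This is the generic formulation, not the paper's progressive variant, whose schedule is not described in this summary; the function name `dpo_loss` and the `beta` default are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    All inputs are sequence-level log-probabilities: the first two under the
    policy being trained, the last two under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), computed as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree (zero margin), the loss equals log 2; it decreases as the policy shifts probability toward the chosen response.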
AutoPP: Towards Automated Product Poster Generation and Optimization: This paper proposes AutoPP, the first pipeline to unify automated product poster generation with CTR-feedback-driven optimization in a single framework. It employs a unified design module to jointly design background, text, and layout; an element rendering module for efficient and controllable poster generation; and Isolated DPO (IDPO) to achieve element-level click-through rate optimization.
Behavior Tokens Speak Louder: Disentangled Explainable Recommendation with Behavior Vocabulary: This paper proposes BEAT, a framework that discretizes user/item behavior representations into interpretable behavior tokens via vector-quantized autoencoders, and aligns collaborative filtering signals to the semantic space of a frozen LLM through multi-level semantic supervision, enabling zero-shot explainable recommendation.
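The discretization step in BEAT relies on vector quantization. A minimal sketch of the core lookup, assuming a plain nearest-neighbor codebook under squared L2 distance, is shown below. The toy codebook, the vectors, and the `quantize` helper are hypothetical; the paper's autoencoder and its training losses are not reproduced here.

```python
import numpy as np

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    # dists[i, k] = squared distance between vectors[i] and codebook[k]
    diffs = vectors[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    token_ids = dists.argmin(axis=1)
    # Return both the discrete tokens and their quantized embeddings.
    return token_ids, codebook[token_ids]

# Toy codebook with three "behavior tokens" in a 2-D embedding space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
user_vecs = np.array([[0.9, 1.2], [0.1, -0.2]])

token_ids, quantized = quantize(user_vecs, codebook)
```

Each continuous representation is replaced by a discrete token id, which is what makes it possible to treat behaviors as a vocabulary.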
Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents: This paper proposes the HID framework, which constructs hybrid intents via attribute-aware spectral clustering to distinguish session-relevant from session-irrelevant tail items, and introduces a dual-constraint loss (ICLoss) targeting both long-tail coverage and recommendation accuracy. The framework achieves a "win-win" between long-tail promotion and accuracy, breaking the traditional seesaw dilemma where improving one metric inevitably harms the other.
CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search: This paper proposes CroPS, a data engine that enriches positive sample sets from three complementary perspectives—query reformulation behavior, recommender system interactions, and LLM world knowledge—combined with Hierarchical Label Assignment (HLA) and the H-InfoNCE loss function, to break the filter bubble effect in industrial-scale dense retrieval systems. CroPS has been fully deployed in Kuaishou Search.
Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios: This paper proposes PAS (Police Action Scenarios), an LLM evaluation framework for policing contexts. The framework comprises five stages: scenario definition, reference answer construction, LLM response generation, core metric extraction, and performance interpretation. An evaluation dataset is constructed from 8,000+ official Korean police documents. The study finds that commercial LLMs (GPT-4, Gemini, Claude) perform significantly below reference answers on policing tasks, particularly in factual accuracy and logical correctness.
FreqRec: Exploiting Inter-Session Information with Frequency-enhanced Dual-Path Networks for Sequential Recommendation: This paper proposes FreqRec, a dual-path architecture that applies frequency-domain transformations along the batch axis and the time axis to capture group-level consumption rhythms across sessions and fine-grained individual user interests, respectively. A frequency-domain consistency loss is introduced to explicitly align predicted and ground-truth frequency spectra. FreqRec achieves up to 7.38% improvement in NDCG@10 over the strongest baseline on three Amazon datasets.
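To make the two transform axes in the FreqRec summary concrete, the sketch below applies a real FFT along the time axis and along the batch axis of a toy embedding tensor, and defines one possible spectral consistency loss. The tensor shapes, the random data, and the exact loss form (mean squared spectral difference) are assumptions for illustration; the summary does not specify how the paper defines them.

```python
import numpy as np

def spectral_consistency_loss(pred, target):
    """Mean squared distance between time-axis spectra of (batch, time, dim) arrays."""
    pred_spec = np.fft.rfft(pred, axis=1)
    target_spec = np.fft.rfft(target, axis=1)
    return float(np.mean(np.abs(pred_spec - target_spec) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 4))  # toy (batch, time, hidden) sequence embeddings

# Time-axis path: one spectrum per user sequence (fine-grained individual interests).
time_spec = np.fft.rfft(x, axis=1)

# Batch-axis path: spectrum across the sequences in a batch (group-level rhythms).
batch_spec = np.fft.rfft(x, axis=0)
```

`np.fft.rfft` keeps only the non-negative frequencies, so an axis of length `n` becomes length `n // 2 + 1` in the output.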
From IDs to Semantics: A Generative Framework for Cross-Domain Recommendation with Adaptive Semantic Tokenization: This paper proposes GenCDR, a framework that introduces the generative semantic ID paradigm into LLM-driven cross-domain recommendation for the first time, via two core modules: domain-adaptive semantic tokenization and cross-domain autoregressive recommendation. GenCDR effectively addresses the non-transferability of item IDs and insufficient domain-personalized modeling in conventional approaches.
Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information: This paper proposes the first semi-supervised learning paradigm for matrix completion. Assuming that the sampling distribution \(P\) and the ground-truth matrix \(G\) share a low-rank subspace, and given a large number \(M\) of unlabeled samples and a small number \(N\) of labeled samples, it proves that the generalization error decomposes into two independent terms, \(\tilde{O}(\sqrt{nd/M}) + \tilde{O}(\sqrt{dr/N})\). The method achieves significant improvements over explicit-feedback-only baselines on the Douban and MovieLens datasets.
Hard vs. Noise: Resolving Hard-Noisy Sample Confusion in Recommender Systems via Large Language Models: This paper proposes the LLMHNI framework, which leverages two types of auxiliary signals generated by LLMs—semantic relevance and logical relevance—to resolve the confusion between hard samples and noisy samples in recommender systems, significantly improving denoising recommendation performance.
HyMoERec: Hybrid Mixture-of-Experts for Sequential Recommendation: This paper proposes HyMoERec, a hybrid mixture-of-experts architecture combining shared and specialized expert branches. By replacing the conventional feed-forward network in sequential recommendation models with an adaptive expert fusion mechanism, the model captures heterogeneous user behavior patterns and diverse item complexities, consistently outperforming state-of-the-art methods on the MovieLens-1M and Beauty datasets.
Inductive Generative Recommendation via Retrieval-based Speculation: This paper identifies a critical limitation of Generative Recommendation (GR) models — their inability to recommend items unseen during training — and proposes SpecGR, a plug-and-play framework in which an inductively capable drafter model proposes candidate items (including new ones) while the GR model serves as a verifier to rank and validate candidates. A guided re-drafting mechanism further improves verification efficiency, achieving state-of-the-art overall performance across three datasets.
Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models: This paper reveals a non-trivial interaction between prompt selection and inference strategies (Best-of-N, Majority Voting). It proposes the IAPO framework, which jointly optimizes prompt design and inference scaling as a contextual best-arm identification problem, and introduces PSST, a fixed-budget training algorithm. The approach achieves up to 50% improvement over inference-agnostic methods across 6 tasks.
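The two inference strategies named in the IAPO summary can be sketched in a few lines. This is a generic illustration rather than the paper's code: the sample answers and the scoring function are made up, and `best_of_n` and `majority_vote` are hypothetical helper names.

```python
from collections import Counter

def best_of_n(candidates, score):
    """Best-of-N: return the candidate with the highest score."""
    return max(candidates, key=score)

def majority_vote(candidates):
    """Majority Voting: return the most frequent candidate.

    Ties go to the candidate seen first, following Counter.most_common ordering.
    """
    return Counter(candidates).most_common(1)[0][0]

# Five sampled answers from one prompt (toy data).
samples = ["42", "41", "42", "40", "42"]

# A stand-in reward model that happens to prefer answers close to 40.
def toy_score(answer):
    return -abs(int(answer) - 40)

voted = majority_vote(samples)
best = best_of_n(samples, toy_score)
```

On the same set of samples the two strategies can select different answers, which is one reason the best prompt may depend on the inference strategy used with it.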
Interpretable Reward Model via Sparse Autoencoder: This paper proposes SARM (Sparse Autoencoder-enhanced Reward Model), which integrates a pretrained sparse autoencoder into a reward model to map hidden-layer activations into an interpretable, sparse, monosemantic feature space. This design enables feature-level reward attribution and dynamic preference steering, while achieving the highest overall score among all models on RewardBench 2.
Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction: This paper proposes LAIN, a framework that injects sequence length as an explicit conditional signal into CTR models to mitigate performance imbalance between long- and short-sequence users. LAIN comprises three lightweight, plug-and-play modules: a Spectral Length Encoder, Length-Conditioned Prompting, and Length-Modulated Attention.
Moral Change or Noise? On Problems of Aligning AI With Temporally Unstable Human Feedback: Through a longitudinal study involving 400+ participants across 3–5 sessions in the domain of kidney transplant allocation, this paper reveals significant temporal instability in human moral preferences (6–20% response change rate) and demonstrates that such instability substantially degrades the predictive performance of AI alignment models, thereby challenging the validity of current alignment approaches that assume static preferences.
MultiTab: A Scalable Foundation for Multitask Learning on Tabular Data: This paper proposes MultiTab-Net — the first multitask Transformer architecture for tabular data — which alleviates task competition via a multitask masked attention mechanism, and substantially outperforms existing MLP-based multitask models and single-task Transformer models across datasets from recommendation, census, and physics domains.
Preference is More Than Comparisons: Rethinking Dueling Bandits with Augmented Human Feedback: This paper proposes IPEA-HF, a model-free Dueling Bandit framework based on augmented human feedback. It integrates contextual similarity and dependency relations through Augmented Confidence Bounds to calibrate uncertainty, achieving superior performance across multiple benchmarks including recommendation, multi-objective optimization, and LLM response optimization.
Probabilistic Hash Embeddings for Online Learning of Categorical Features: This paper proposes Probabilistic Hash Embeddings (PHE), which models hash embedding tables as random variables and performs posterior inference via Bayesian online learning. PHE addresses the catastrophic forgetting problem caused by parameter sharing in deterministic hash embeddings under streaming data settings. It significantly outperforms deterministic baselines across classification, sequential modeling, and recommender system tasks, while requiring only 2%–4% of the memory needed by collision-free embedding tables.
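The problem PHE targets comes from deterministic hash embeddings, where several categories share one table row. The sketch below shows only that baseline behavior, not PHE itself: the table size, the CRC32 hash, and the helper names are assumptions chosen for illustration.

```python
import zlib

NUM_BUCKETS = 4   # far fewer rows than categories, as in the hashing trick
DIM = 3
table = [[0.0] * DIM for _ in range(NUM_BUCKETS)]

def bucket(category, num_buckets=NUM_BUCKETS):
    """Deterministically hash a category string to a table row index."""
    return zlib.crc32(category.encode("utf-8")) % num_buckets

def lookup(category):
    return table[bucket(category)]

def update(category, grad, lr=0.1):
    """One SGD step on the row that this category hashes to."""
    row = lookup(category)
    for i in range(DIM):
        row[i] -= lr * grad[i]

def first_collision(categories):
    """Return the first pair of categories that share a bucket, if any."""
    seen = {}
    for c in categories:
        b = bucket(c)
        if b in seen:
            return seen[b], c
        seen[b] = c
    return None

# With more categories than buckets, a collision is guaranteed (pigeonhole).
categories = [f"item_{i}" for i in range(NUM_BUCKETS + 1)]
old, new = first_collision(categories)

# Training on the newer category silently overwrites what the older one sees.
update(new, [1.0, 1.0, 1.0])
```

After the update, `lookup(old)` returns the modified row even though `old` never received a gradient. In a streaming setting this kind of interference is what leads to the forgetting described in the summary.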
RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems: This paper introduces RecToM, the first human-annotated benchmark for evaluating Theory of Mind (ToM) reasoning capabilities of LLMs in conversational recommender systems (CRS). It covers two dimensions—cognitive inference (desire/intention/belief) and behavioral prediction (strategy prediction/strategy judgment)—comprising 10 question types and 20,524 QA pairs, and exposes systematic deficiencies of current LLMs in fine-grained intention inference and strategy judgment.
Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling: This paper proposes SSRA (Semi-Supervised Relevance-Aware synthetic data pipeline), a two-stage framework that generates domain-adaptive short video data with controllable fine-grained relevance labels (4 levels) to enhance the semantic relevance modeling capability of embedding models. Online A/B testing on Douyin's dual-column feed achieves a 1.45% CTR improvement.
SlideTailor: Personalized Presentation Slide Generation for Scientific Papers: This paper defines a new task of preference-guided paper-to-slide generation and proposes the SlideTailor framework, which distills content preferences from user-provided paper–slide example pairs and aesthetic preferences from .pptx templates. A chain-of-speech mechanism aligns slide content with intended spoken narratives. On the self-constructed PSP benchmark, SlideTailor achieves an overall score of 75.8% and a human-evaluation win rate of 81.63%, significantly outperforming existing methods.
Tokenize Once, Recommend Anywhere: Unified Item Tokenization for Multi-domain LLM-based Recommendation: This paper proposes UniTok, a unified item tokenization framework that employs a customized Mixture-of-Experts architecture (TokenMoE) combined with shared codebooks to achieve efficient discrete item representations across multiple domains, eliminating the need to train a separate tokenizer per domain, while maintaining cross-domain semantic balance through a mutual information calibration mechanism.
Tool4POI: A Tool-Augmented LLM Framework for Next POI Recommendation: This paper is the first to introduce the tool-augmented LLM paradigm to the next POI recommendation task. Through three modules—preference extraction, multi-round candidate retrieval, and reranking—the framework enables LLMs to retrieve recommendations from the full POI pool. It achieves over 40% accuracy in Out-of-History (OOH) scenarios (where existing methods yield 0%), with average Acc@5/10 improvements of 20%/30%.
TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning: This paper presents TraveLLaMA, a multimodal language model system for travel assistance. By constructing the TravelQA dataset with 265K QA pairs and the Travel-CoT structured reasoning framework, the system achieves a 10.8% accuracy improvement on travel-related question answering and obtains a SUS usability score of 82.5 in a 500-participant user study.
Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation: This paper proposes WEARec, a model that employs Dynamic Frequency Filtering (DFF) to adaptively generate personalized frequency-domain filters conditioned on user context for capturing global preferences, and Wavelet Feature Enhancement (WFE) to compensate for the inability of global DFT to resolve short-term fluctuations. WEARec outperforms all 9 baselines on four datasets, achieving up to 11.4% improvement on long-sequence scenarios with 39–45% faster training speed.
When Top-ranked Recommendations Fail: Modeling Multi-Granular Negative Feedback for Explainable and Robust Video Recommendation: This paper proposes ENF (Explainable Negative Feedback), a framework comprising three collaborative MLLM Agents (Profile Agent, Video Agent, and Reason Agent) and a progressive S-GRPO reinforcement learning training strategy. ENF is the first approach to achieve explainable prediction and root-cause analysis of implicit negative feedback in video recommendation systems. Deployed on Tencent's news platform, it achieves a 6.2% increase in average watch duration and a 9.4% decrease in quick-skip rate.