🎁 Recommender Systems

🧠 NeurIPS2025 · 24 paper notes

ASAP: An Agentic Solution to Auto-Optimize Performance of Large-Scale LLM Training: ASAP is a multi-agent system (Coordinator + Analyzer + Proposal) that automatically diagnoses bottleneck types (compute/memory/communication) in large-scale LLM distributed training and proposes sharding configurations. Across 3 experimental scenarios, it matches human expert solutions and achieves up to 2.58× throughput improvement.
Balancing Performance and Costs in Best Arm Identification: This paper reformulates Best Arm Identification (BAI), moving from the fixed-budget/fixed-confidence paradigm to the minimization of a risk functional that combines misidentification probability (or simple regret) with sampling cost. It derives lower bounds exhibiting a phase transition (when the gap is too small, the optimal strategy is to guess directly) and designs the DBCARE algorithm, which achieves optimality within logarithmic factors under a dynamic budget.
EMPATHIA: Multi-Faceted Human-AI Collaboration for Refugee Integration: This paper proposes EMPATHIA, a multi-agent framework grounded in Kegan's constructive-developmental theory. Three specialized agents—emotional, cultural, and ethical—engage in selector-validator negotiation to evaluate refugee resettlement recommendations. On real-world data from 6,359 refugees, the framework achieves an 87.4% convergence rate and 92.1% cultural expert agreement rate.
Estimating Hitting Times Locally At Scale: This paper proposes two local (sublinear) algorithms for estimating hitting times on graphs: Algorithm 1, based on meeting times, and Algorithm 3, based on spectral truncation. Both require only short random walks centered at the two query nodes \(u\) and \(v\), without full graph access, and achieve relative error below 1.4% on synthetic and real-world graphs. The paper also establishes an optimal sample complexity lower bound for walk-based estimation.
FACE: A General Framework for Mapping Collaborative Filtering Embeddings into LLM Tokens: FACE proposes mapping collaborative filtering (CF) embeddings into LLM pre-trained tokens (descriptors) via disentangled projection and residual quantization, followed by contrastive learning for semantic alignment — enabling semantic interpretation and recommendation enhancement of CF embeddings without fine-tuning the LLM.
Inference-Time Reward Hacking in Large Language Models: This paper mathematically proves that inference-time alignment methods (e.g., BoN) inevitably exhibit reward hacking (true reward first increases then decreases) when optimizing a proxy reward. It proposes Best-of-Poisson (BoP) sampling to approximate the optimal KL-reward trade-off distribution, and designs the HedgeTune algorithm to locate the optimal inference-time parameter via one-dimensional root-finding, effectively mitigating reward hacking in both mathematical reasoning and human preference settings.
Measuring What Matters: Construct Validity in Large Language Model Benchmarks: This paper presents a systematic review of 445 LLM benchmark papers conducted by 29 experts, examining existing LLM evaluation benchmarks through the lens of construct validity across four dimensions — phenomenon definition, task design, scoring metrics, and conclusion claims — and proposes 8 actionable recommendations for improvement.
MMPB: It's Time for Multi-Modal Personalization: This paper introduces MMPB, the first VLM personalization evaluation benchmark, comprising 111 personalizable concepts, 10k+ image-text QA pairs, and 15 task types. Evaluation of 23 VLMs reveals that even the strongest model, GPT-4o, performs poorly on personalization tasks, exposing critical limitations in preference reasoning, visual cue utilization, and conflicts between safety alignment and personalization.
NeurIPS Should Lead Scientific Consensus on AI Policy: This position paper argues that NeurIPS should proactively assume the role of facilitating scientific consensus in AI policy, drawing on the successful experience of the IPCC (Intergovernmental Panel on Climate Change) in climate science to fill the current gap in AI policy consensus mechanisms.
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning: This work identifies two categories of sparsity artifacts introduced by L1 loss in Crosscoders—Complete Shrinkage (which erroneously zeros out weakly shared concepts) and Latent Decoupling (which splits shared concepts into spurious model-specific latents)—and proposes Latent Scaling as a diagnostic tool and BatchTopK Crosscoder as an alternative training scheme, substantially improving the reliability of chat-tuning concept discovery.
PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders: This paper extends PAC-Bayes generalization bounds from single-output linear regression to multivariate linear regression, and further adapts them to linear autoencoders (LAEs) in recommender systems. Through theoretical development, the computational complexity is reduced from O(n⁴) to O(n³), and experiments demonstrate that the bounds are tight and highly correlated with practical metrics such as Recall@K and NDCG@K.
Position: Towards Bidirectional Human-AI Alignment: This paper proposes a Bidirectional Human-AI Alignment framework grounded in a systematic review of 400+ papers, arguing that AI alignment should not be limited to the unidirectional goal of "aligning AI with humans," but must also encompass the critically underexplored direction of "aligning humans with AI," while identifying key gaps in the current research landscape.
R²ec: Towards Large Recommender Models with Reasoning: This paper proposes R²ec, the first unified large recommender model that endogenously integrates reasoning capabilities, achieving joint reasoning chain generation and efficient item prediction via a dual-head architecture, and introduces the RecPO reinforcement learning framework to jointly optimize reasoning and recommendation objectives without any annotated reasoning data.
Radial Neighborhood Smoothing Recommender System: This paper proposes the Radial Neighborhood Estimator (RNE), which approximates latent space distances using the row/column L2 norms of the observed matrix, constructs radial neighborhoods encompassing both overlapping and partially overlapping user–item pairs, and applies local kernel regression for smoothed imputation. RNE outperforms conventional collaborative filtering and matrix factorization methods in both theoretical guarantees and empirical evaluations, while naturally alleviating the cold-start problem.
Semantic Retrieval Augmented Contrastive Learning for Sequential Recommendation: This paper proposes SRA-CL, a framework that leverages the semantic understanding capabilities of LLMs to construct high-quality contrastive sample pairs. By combining semantic retrieval with a learnable sample synthesizer, SRA-CL enhances contrastive learning for sequential recommendation and achieves state-of-the-art performance across four datasets in a plug-and-play manner.
The Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process: This position paper argues that AI alignment in multi-agent systems (MAS) should be treated as a dynamic, interaction-dependent social process rather than an isolated problem. Drawing on social science theories, the paper analyzes how social structures can undermine collective and individual values, and calls on the AI community to develop dedicated simulation environments, benchmarks, and evaluation frameworks to address this challenge.
The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems: This paper systematically identifies four methodological pitfalls in current AI scientist systems—inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias—through controlled experiments on Agent Laboratory and The AI Scientist v2 using a carefully designed synthetic task (SPR). Both systems exhibit these issues to varying degrees. The paper further demonstrates that auditing trace logs and code achieves 27 percentage points higher detection accuracy than reviewing final papers alone (82% vs. 55%).
Think before Recommendation: Autonomous Reasoning-enhanced Recommender: This paper proposes RecZero (a pure RL paradigm) and RecOne (a hybrid SFT+RL paradigm), abandoning conventional teacher-student distillation. Both approaches leverage GRPO-based reinforcement learning to train a single LLM to autonomously develop reasoning capabilities for rating prediction. A structured "Think-before-Recommendation" template guides step-by-step reasoning (user analysis → item analysis → matching → rating), achieving significant improvements over existing baselines across four datasets.
Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning: This paper proposes the Transformer Copilot framework, which systematically records a "Mistake Log" during LLM fine-tuning, trains an auxiliary Copilot model to learn the Pilot's error patterns, and rectifies logits at inference time to improve generation quality, achieving up to 34.5% improvement across 12 benchmarks.
TV-Rec: Time-Variant Convolutional Filter for Sequential Recommendation: This paper proposes TV-Rec, a time-variant convolutional filter grounded in graph signal processing that replaces conventional fixed convolutions and self-attention mechanisms, achieving higher expressiveness for sequential recommendation with an average improvement of 7.49% across 6 benchmark datasets.
Validating LLM-as-a-Judge Systems under Rating Indeterminacy: This paper proposes a framework for validating LLM-as-a-Judge systems under rating indeterminacy. It replaces forced-choice rating with a multi-label "response set" rating scheme, improving the performance of the selected judge system by up to 31%.
VisualLens: Personalization through Task-Agnostic Visual History: This paper proposes VisualLens, a framework that leverages users' task-agnostic visual history (everyday photos) to enable cross-domain personalized recommendation via spectrum user profiles and multimodal large language models. On the newly constructed Google Review-V and Yelp-V datasets, VisualLens surpasses GPT-4o by 2–5% in Hit@3.
Who You Are Matters: Bridging Topics and Social Roles via LLM-Enhanced Logical Recommendation: This paper proposes TagCF, a framework that employs an MLLM to extract user role tags and item topic tags, then uses LLM reasoning to construct U2I/I2U logic graphs (causal associations between user roles and item types). Three integration strategies (a tag encoder, contrastive learning augmentation, and logic-based scoring) are used to enhance recommendations. On an industrial platform with hundreds of millions of users, online A/B testing yields a 0.946% improvement in engagement metrics and a 0.102% gain in diversity; offline experiments show an 8.06% improvement in NDCG@10.
Wide-Horizon Thinking and Simulation-Based Evaluation for Real-World LLM Planning with Multifaceted Constraints: This paper proposes MAoP (Multiple Aspects of Planning), a framework that endows LLMs with "wide-horizon thinking" by having a strategist perform multi-aspect pre-planning and routing into a coherent blueprint, enabling the planner to conduct in-depth per-aspect analysis in parallel. Coupled with the Travel-Sim causal simulation benchmark, MAoP substantially outperforms CoT and decomposition-based methods on travel planning tasks; a distilled 3B model achieves a PER of 66.9%.
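
Several entries above report top-K ranking metrics (Recall@K and NDCG@K for the PAC-Bayes bounds, Hit@3 for VisualLens, NDCG@10 for TagCF). As a reading aid, here is a minimal sketch of the common binary-relevance definitions of these metrics. The function names and the toy data are illustrative assumptions, not code from any of the listed papers, and individual papers may use slightly different variants (for example, dividing recall by min(K, number of relevant items)).

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if at least one relevant item is in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the top k divided by the ideal DCG."""
    # A hit at 0-based position `pos` contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k) first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical toy example: one user's ranked list and held-out items.
    ranked_items = ["a", "b", "c", "d", "e"]
    held_out = {"b", "e", "z"}
    print("Hit@3:   ", hit_at_k(ranked_items, held_out, 3))
    print("Recall@3:", recall_at_k(ranked_items, held_out, 3))
    print("NDCG@3:  ", ndcg_at_k(ranked_items, held_out, 3))
```

In practice these per-user values are averaged over all test users. NDCG differs from recall in that it rewards placing relevant items nearer the top of the list.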