🎁 Recommender Systems

🧪 ICML2025 · 17 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (24) · 💬 ACL2026 (22) · 🧪 ICML2026 (11) · 🤖 AAAI2026 (27) · 🧠 NeurIPS2025 (24) · 💬 ACL2025 (7)

🔥 Top topics: Recommendation ×3 · Alignment/RLHF ×2 · LLM ×2

Adaptive Elicitation of Latent Information Using Natural Language: An LLM-based adaptive information elicitation framework is proposed. By performing autoregressive forward simulation of future observations using a meta-learned predictive model, it quantifies and distinguishes epistemic and aleatoric uncertainties, and adaptively selects the most informative natural language questions to efficiently reduce epistemic uncertainty about a latent entity.
Aligning LLMs by Predicting Preferences from User Writing Samples: A new paradigm is proposed to achieve personalized LLM alignment by predicting user preferences based on user writing samples. It infers preference signals directly from user textual styles without requiring explicit preference annotations, opening up a new data source for personalized alignment.
Deprecating Benchmarks: Criteria and Framework: Proposes a set of 7 criteria to determine when an AI benchmark should be deprecated, alongside a three-phase deprecation framework (Assessment-Reporting-Notification), and provides an institutional implementation plan using the EU AI Office as a case study.
ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces: The ELMO framework is proposed to reduce the training memory of XMC models with 3 million labels from 39.7 GiB to 6.6 GiB without losing classification accuracy, achieved via pure BFloat16/Float8 low-precision training combined with peak memory optimizations such as gradient fusion and chunking strategies.
How to Set AdamW's Weight Decay as You Scale Model and Dataset Size: By interpreting the weight updates of AdamW as an Exponential Moving Average (EMA), this work reveals that the EMA timescale \(\tau = 1/(\eta\lambda)\) is a core hyperparameter. Its optimal value in terms of epochs remains stable across varying model and dataset scales, thereby providing clear scaling rules for weight decay.
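As a rough illustration of the scaling rule summarized above, the sketch below converts a target EMA timescale in epochs into a weight decay value using \(\tau = 1/(\eta\lambda)\), with \(\tau\) counted in iterations. The function name, its parameters, and the example numbers are assumptions made for illustration; they are not taken from the paper.

```python
def weight_decay_for_timescale(lr, tau_epochs, dataset_size, batch_size):
    """Return the weight decay (lambda) that gives an EMA timescale of
    `tau_epochs` epochs, using tau_iters = 1 / (lr * lambda).

    All argument names are hypothetical; this is a sketch, not the paper's code.
    """
    iters_per_epoch = dataset_size / batch_size
    tau_iters = tau_epochs * iters_per_epoch  # timescale measured in optimizer steps
    return 1.0 / (lr * tau_iters)


# Illustrative numbers only: hold the timescale at one epoch and double the dataset.
small = weight_decay_for_timescale(lr=1e-3, tau_epochs=1.0,
                                   dataset_size=10_000, batch_size=100)
large = weight_decay_for_timescale(lr=1e-3, tau_epochs=1.0,
                                   dataset_size=20_000, batch_size=100)
print(small, large)
```

With the timescale in epochs held fixed, doubling the dataset doubles the number of steps per epoch, so the weight decay computed this way halves.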
LCRON: Learning Cascade Ranking as One Network: This work proposes LCRON, which trains a multi-stage cascade ranking system end-to-end as a single unified network. An end-to-end surrogate loss \(L_{e2e}\), constructed via differentiable ranking techniques, directly optimizes a lower bound on the probability that ground-truth items survive the entire cascade. Auxiliary per-stage losses \(L_{single}\), derived from the tightness of that lower bound, drive collaboration among stages. LCRON achieves significant improvements on both public benchmarks and online A/B tests of industrial advertising systems (Ad Revenue +4.10%, User Conversion +1.60%).
New Interaction Paradigm for Complex EDA Software Leveraging GPT: This work proposes the SmartonAI system, which integrates Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) into the EDA tool KiCad. Through natural language interaction, it performs task decomposition, document retrieval, and intelligent plugin recommendation and execution, significantly flattening the learning curve for complex engineering software.
Not All Explanations for Deep Learning Phenomena Are Equally Valuable: This is a position paper arguing that "counter-intuitive phenomena" in deep learning (such as double descent, grokking, and the lottery ticket hypothesis) rarely occur in practical settings. Instead of pursuing isolated explanations for these phenomena, researchers should treat them as empirical testbeds to evaluate and refine broader deep learning theories.
PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model: This work proposes PARM, a single unified preference-aware autoregressive reward model (ARM) that is conditioned on preference vectors via PBLoRA (Preference-Aware Bilinear Low-Rank Adaptation) for efficient multi-objective test-time alignment. Replacing \(k\) independent ARMs with a single reward model reduces inference cost and supports weak-to-strong guidance (e.g., a 7B model guiding a 65B model).
Position: Don't Use the CLT in LLM Evals with Fewer Than a Few Hundred Datapoints: This position paper argues that when an LLM evaluation has fewer than a few hundred samples, confidence intervals based on the Central Limit Theorem (CLT) severely underestimate uncertainty. It recommends Bayesian credible intervals or Wilson score intervals as alternatives.
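To make the contrast concrete, the sketch below compares a CLT-based (Wald) interval with a Wilson score interval for a benchmark accuracy. It uses the standard textbook formulas for both intervals; the function names and the 20-out-of-20 example are illustrative assumptions, not material from the paper.

```python
import math


def wald_interval(successes, n, z=1.96):
    """CLT-based (Wald) interval: p +/- z * sqrt(p * (1 - p) / n)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical small eval: a model answers all 20 questions correctly.
wald = wald_interval(20, 20)
wilson = wilson_interval(20, 20)
print("Wald:  ", wald)
print("Wilson:", wilson)
```

In this example the Wald interval collapses to a single point at 100% accuracy, while the Wilson interval still leaves room for a noticeably lower true accuracy, which is the kind of underestimated uncertainty the entry above refers to.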
QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval: Proposes QuRe, which improves user satisfaction in Composed Image Retrieval (CIR) by retrieving both the target image and other relevant images. It combines a hard negative sampling strategy based on steep drops in relevance scores with a reward model optimization objective.
Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations: This paper systematically reviews the methodology of "human baselines" in AI evaluation. It reveals critical deficiencies in rigor and transparency across 115 existing human baseline studies, and proposes methodological recommendations and a reporting checklist covering the entire baseline lifecycle.
Recommendations with Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization: Provides the first theoretical recovery guarantee for non-convex matrix factorization from pairwise comparison data in recommendation systems. It proves that, under a warm-start condition, projected gradient descent converges exponentially fast to the true low-rank feature matrix with a nearly optimal sample complexity of \(O(nr^2 \log n)\). The key technical contribution extends the matrix Bernstein inequality to the sampling matrix structure of pairwise comparisons.
RLTHF: Targeted Human Feedback for LLM Alignment: RLTHF proposes a hybrid human-AI framework for LLM alignment. It analyzes the reward distribution of the reward model to identify "hard samples" mislabeled by LLMs and sends only these samples for human annotation, matching or even surpassing the alignment quality of full-scale human annotation at only 6-7% of the cost.
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning: SIMPLEMIX finds that on-policy data excels at reasoning tasks while off-policy data excels at open-ended tasks. By simply mixing the two data sources, it achieves an average improvement of 6.03% on AlpacaEval 2.0, outperforming complex methods such as HyPO by 3.05%.
Position: The Right to AI: This position paper introduces the concept of the "Right to AI," advocating that individuals and communities affected by AI systems should have the right to participate in their development and governance. Drawing on the "right to the city" theory from urban planning, the paper constructs a four-tiered citizen participation model.
MATCHA: Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition: This paper proposes the MATCHA multi-agent framework, which decomposes game conversational recommendation into six specialized agents (intent parsing, tool-augmented candidate generation, multi-LLM ranking, reflection re-ranking, risk control, and explainable generation). On real-world Roblox user data, it improves Hit@5 by 20%, reduces popularity bias by 24%, and achieves an adversarial defense rate of 97.9%.