🎁 Recommender Systems
💬 ACL2026 · 13 paper notes
- Beyond Itinerary Planning: A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks
  - This paper proposes TravelBench, the first travel-planning benchmark that integrates real user queries, implicit user preferences, multi-turn interaction, unsolvable-task recognition, and 10 real-world tools. It enables reproducible evaluation through a sandbox environment and reveals that state-of-the-art models exhibit uneven performance across different capability dimensions.
- Content Fuzzing for Escaping Information Cocoons on Social Media
  - This paper proposes ContentFuzz, a confidence-guided fuzzing framework from the content creator's perspective. It leverages LLMs to rewrite posts so that the machine-inferred stance label changes while the human-interpreted meaning remains unchanged, thereby breaking information cocoons on social media.
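The confidence-guided rewrite loop can be sketched as a minimal search. Everything here is an illustrative assumption: `rewrite`, `stance_of`, and `confidence_of` are toy stand-ins for the LLM rewriter and the stance classifier (a post is reduced to an integer "stance score"), and the real system would additionally verify that the human-readable meaning is preserved.

```python
def content_fuzz(post, rewrite, stance_of, confidence_of, max_iters=10):
    """Toy confidence-guided fuzzing loop (hypothetical details):
    keep rewriting toward the lowest-confidence candidate until the
    classifier's stance label flips, then return that variant."""
    original = stance_of(post)
    current = post
    for _ in range(max_iters):
        candidates = rewrite(current)
        flipped = [c for c in candidates if stance_of(c) != original]
        if flipped:
            return flipped[0]
        # No flip yet: greedily move toward the least-confident candidate.
        current = min(candidates, key=confidence_of)
    return None

# Toy stand-ins: the classifier thresholds the score at zero, and the
# "rewriter" proposes the same post plus a slightly shifted variant.
stance_of = lambda p: "pro" if p >= 0 else "con"
confidence_of = lambda p: abs(p)
rewrite = lambda p: [p - 1, p]

adversarial = content_fuzz(3, rewrite, stance_of, confidence_of)
```

The greedy step encodes the confidence guidance: among rewrites that do not yet flip the label, the one the classifier is least sure about is the most promising starting point for the next round.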
- Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents
  - This paper proposes DECISIVE, an interactive decision-making framework that extracts an objective option-scoring matrix from unstructured documents and combines it with Bayesian preference inference to adaptively select pairwise comparison questions, efficiently learning users' latent preference vectors. The system minimizes user interaction burden while delivering transparent, personalized recommendations, achieving up to 20% higher decision accuracy than strong baselines.
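The elicitation loop can be sketched with a Bradley-Terry answer model over a discrete hypothesis grid. This is a sketch under stated assumptions, not the paper's exact algorithm: the scoring matrix `S`, the question-selection rule (ask the pairwise question whose predicted answer is most uncertain), and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical option-scoring matrix: 4 options x 3 attributes (in the paper
# this would be extracted from unstructured documents).
S = np.array([
    [0.9, 0.2, 0.5],
    [0.3, 0.8, 0.6],
    [0.6, 0.6, 0.2],
    [0.1, 0.4, 0.9],
])

# Discrete hypothesis grid over latent preference vectors, uniform prior.
W = rng.dirichlet(np.ones(3), size=500)
post = np.full(len(W), 1.0 / len(W))

def answer_prob(i, j):
    """P(user prefers option i over j) per hypothesis (Bradley-Terry)."""
    diff = (S[i] - S[j]) @ W.T
    return 1.0 / (1.0 + np.exp(-diff))

true_w = np.array([0.7, 0.1, 0.2])  # simulated user, unknown to the system

for _ in range(5):
    # Select the pairwise question whose predicted answer is closest to 50/50.
    pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
    pred = np.array([post @ answer_prob(i, j) for i, j in pairs])
    i, j = pairs[int(np.argmin(np.abs(pred - 0.5)))]

    # Simulate a noisy user answer, then do a Bayesian posterior update.
    p_true = 1.0 / (1.0 + np.exp(-(S[i] - S[j]) @ true_w))
    prefers_i = rng.random() < p_true
    lik = answer_prob(i, j) if prefers_i else 1.0 - answer_prob(i, j)
    post = post * lik
    post /= post.sum()

w_hat = post @ W                   # posterior-mean preference estimate
best = int(np.argmax(S @ w_hat))   # recommended option under the estimate
```

Selecting the question with the most uncertain predicted answer is one common proxy for information gain; the paper's "optimal" criterion may differ.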
- From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
  - This paper proposes the Memora benchmark and the FAMA metric, extending long-term memory evaluation beyond shallow fact retrieval to memory consolidation and mutation handling over spans of weeks to months, revealing systematic failures of existing LLMs and memory agents under frequent knowledge updates.
- HARPO: Hierarchical Agentic Reasoning for User-Aligned Conversational Recommendation
  - This paper proposes HARPO, a framework that reformulates conversational recommendation as a structured decision-making problem explicitly optimized for recommendation quality. HARPO integrates four components (hierarchical preference learning, value-network-guided tree-search reasoning, virtual tool-operation abstraction, and multi-agent refinement), achieving significant improvements over existing methods on three benchmarks: ReDial, INSPIRED, and MUSE.
- HORIZON: A Benchmark for in-the-wild User Behaviour Modeling
  - This paper presents HORIZON, the first fully open-source, large-scale, cross-domain, long-term recommendation benchmark. Built by merging all categories of Amazon Reviews into a unified interaction history covering 54M users and 35M items, HORIZON introduces a four-quadrant evaluation protocol that orthogonally decouples the temporal and user axes. The benchmark reveals that models such as BERT4Rec perform strongly in-distribution but degrade significantly under temporal extrapolation and unseen-user settings, and that LLMs do not consistently outperform dedicated architectures for user behaviour modeling.
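The orthogonal decoupling of the temporal and user axes can be sketched as a four-way bucketing of evaluation interactions. This is a minimal sketch under assumed conventions (a single time cutoff and a seen-user set), not HORIZON's exact protocol; the data and split parameters are toy values.

```python
def four_quadrant_split(interactions, t_cut, train_users):
    """Bucket (user, item, timestamp) interactions along two orthogonal axes:
    temporal (at/before vs. after the training cutoff) and user (seen during
    training vs. unseen)."""
    quads = {(tk, uk): [] for tk in ("in-time", "future")
             for uk in ("seen", "unseen")}
    for user, item, t in interactions:
        t_key = "in-time" if t <= t_cut else "future"
        u_key = "seen" if user in train_users else "unseen"
        quads[(t_key, u_key)].append((user, item, t))
    return quads

# Toy interaction log; cutoff and user split are arbitrary illustrations.
log = [("u1", "i1", 1), ("u2", "i1", 2), ("u1", "i2", 5), ("u3", "i3", 6)]
quads = four_quadrant_split(log, t_cut=3, train_users={"u1", "u2"})
```

Scoring a model separately per quadrant is what separates in-distribution strength from temporal-extrapolation and unseen-user weakness in the findings above.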
- IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters
  - This paper proposes IceBreaker, a two-step "handshake" framework (resonance-aware interest distillation to capture trigger interests, followed by interaction-oriented starter generation with personalized preference alignment) that addresses the "first-message barrier" in conversational agents. A/B testing on one of the world's largest conversational products yields +1.84‰ active days and +94.25‰ CTR.
- Learning to Retrieve User History and Generate User Profiles for Personalized Persuasiveness Prediction
  - This paper proposes ReCAP, a framework featuring a trainable query generator and a user profile generator that retrieves persuasion-relevant information from user history and constructs context-aware user profiles, significantly improving personalized persuasiveness prediction.
- Personalized Benchmarking: Evaluating LLMs by Individual Preferences
  - This paper analyzes personalized rankings for 115 active Chatbot Arena users and finds that the average Spearman correlation between Bradley-Terry personalized rankings and the global ranking is only \(\rho=0.04\) (with 57% of users exhibiting near-zero or negative correlation), demonstrating that aggregated benchmarks fail to reflect individual user preferences. Topic and style features are shown to successfully predict user-specific model rankings.
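The headline statistic is the Spearman rank correlation between a user's personal model ranking and the global leaderboard; with distinct ranks it reduces to \(\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\). A small sketch with hypothetical model names (not the paper's data):

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same distinct items."""
    n = len(rank_a)
    pos_b = {model: r for r, model in enumerate(rank_b)}
    # d_i is the rank difference of each model between the two orderings.
    d2 = sum((r - pos_b[model]) ** 2 for r, model in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical illustration: global leaderboard vs. one user's ranking.
global_rank = ["model_a", "model_b", "model_c", "model_d", "model_e"]
user_rank = ["model_c", "model_e", "model_a", "model_d", "model_b"]
rho = spearman(user_rank, global_rank)  # near-zero or negative: user diverges
```

Averaging this \(\rho\) over users is what produces the 0.04 figure reported above.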
- Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP
  - This paper presents a systematic survey of the evolving role of transliteration in cross-lingual NLP. It proposes a five-category motivation taxonomy (named entity/OOV handling, code-mixing, cross-script similarity exploitation, English-centric transfer, and unified preprocessing), compares six integration strategies, and discusses whether transliteration remains necessary in the era of modern LLMs.
- What Makes an Ideal Quote? Recommending "Unexpected yet Rational" Quotations via Novelty
  - This paper proposes NOVELQR, a novelty-driven quote recommendation framework that constructs a deep semantic knowledge base via a generative label agent to enable semantically rational retrieval, and employs a token-level novelty estimator to mitigate autoregressive continuation bias, achieving significant improvements on a bilingual benchmark.
- What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context
  - This paper identifies that binary preference modeling in existing LLM-based recommender systems discards two critical signals, preference intensity and temporal context, and proposes RecPO, a framework that incorporates both factors into preference optimization via adaptive reward margins, substantially outperforming S-DPO and other baselines across five datasets.
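The core idea, a DPO-style pairwise loss whose margin adapts to the two signals, can be sketched numerically. The loss form is the standard DPO logistic loss with a margin term; the specific margin schedule (`adaptive_margin`), its parameters, and all log-probabilities below are illustrative assumptions, not RecPO's exact formulation.

```python
import math

def dpo_margin_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1, margin=0.0):
    """DPO-style pairwise loss with a margin: -log sigmoid(beta * delta - margin),
    where delta compares the policy-vs-reference log-prob gaps of the chosen
    (winner) and rejected (loser) items. A larger margin demands a wider gap."""
    gap = beta * ((logp_w - ref_w) - (logp_l - ref_l)) - margin
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

def adaptive_margin(rating_gap, days_ago, alpha=0.5, half_life=30.0):
    """Illustrative schedule: scale the margin with preference intensity
    (e.g. a rating gap) and decay it for stale interactions."""
    return alpha * rating_gap * 0.5 ** (days_ago / half_life)

# A strong, recent preference gets a wide margin (larger loss until the model
# separates the pair further); a weak, stale one gets almost none.
strong = dpo_margin_loss(-1.0, -3.0, -2.0, -2.5, margin=adaptive_margin(4, 1))
weak = dpo_margin_loss(-1.0, -3.0, -2.0, -2.5, margin=adaptive_margin(1, 90))
```

With a zero margin this reduces to plain DPO (and, on recommendation pairs, to an S-DPO-like objective), which is exactly the binary modeling the paper argues throws away intensity and recency.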
- Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation
  - This paper proposes SiPeR, which addresses dynamically shifting and implicitly expressed user preferences in situated conversational recommendation (SCR) via two mechanisms, Scene Transition Estimation ("Where") and Bayesian Inverse Inference ("What"), achieving improvements of 10.9% and 10.6% on SIMMC 2.1 and SCREEN, respectively.