
Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents

Conference: AAAI 2026
arXiv: 2511.08378
Authors: Xiao Wang, Ke Qin, Dongyang Zhang, Xiurui Xie, Shuang Liang
Code: Not released
Area: Recommender Systems
Keywords: Session-based recommendation, long-tail distribution, hybrid intents, spectral clustering, contrastive learning

TL;DR

This paper proposes the HID framework, which constructs hybrid intents via attribute-aware spectral clustering to distinguish session-relevant from session-irrelevant tail items, and introduces a dual-constraint loss (ICLoss) targeting both long-tail coverage and recommendation accuracy. The framework achieves a "win-win" between long-tail promotion and accuracy, breaking the traditional seesaw dilemma where improving one metric inevitably harms the other.

Background & Motivation

State of the Field

Session-based Recommendation (SBR) aims to predict the next interacted item from an anonymous user's short-term interaction sequence. In practical recommendation scenarios, item exposure frequency follows a severe long-tail distribution — a small number of head items account for the vast majority of interactions, while a large number of tail items are rarely recommended. This imbalance causes recommendation systems to repeatedly surface head items, creating a vicious cycle that substantially reduces recommendation diversity.

Limitations of Prior Work

Existing long-tail SBR methods fall into two categories: (1) Augmentation-based methods (e.g., LOAM, MelT, LLM-ESR), which enhance tail item embeddings or emphasize tail items when generating session embeddings; and (2) Re-ranking-based methods (e.g., TailNet, CSBR, LAP-SR), which predict the head/tail item distribution from interaction sessions and directly adjust the final ranking results.

Both categories share two common deficiencies:

Indiscriminate emphasis on tail items introduces noise: Not all tail items are relevant to the current session. For example, in a session centered on "books," tail items from "clothing" categories are noise, despite being long-tail. Existing methods uniformly boost tail item exposure without distinguishing relevant from irrelevant tail items, leading to degraded recommendation accuracy.

Lack of explicit long-tail supervision signals: Most methods still rely on cross-entropy loss for indirect optimization, and augmentation or re-ranking strategies often conflict with the cross-entropy objective, resulting in a seesaw effect — improving long-tail performance inevitably harms accuracy, and vice versa.

On the Tmall dataset, applying long-tail methods such as TailNet to GRU4Rec improves tCov@20 but causes a notable drop in HR@20, clearly demonstrating this seesaw phenomenon.

Root Cause

Resolving this contradiction requires answering two questions: how to effectively identify noise, and how to provide explicit supervision signals for both long-tail coverage and accuracy simultaneously. HID's answer is: (1) capture high-level user preferences through hybrid intent modeling to partition items into session-target intents and noise intents; and (2) apply dual-constraint losses to enforce head-tail consistency within target intents (promoting long-tail) and to maximize the distance between the session and noise intents (preserving accuracy).

Method

Overall Architecture

HID is a model-agnostic plug-and-play framework that can be integrated with any SBR backbone (e.g., GRU4Rec, STAMP, SRGNN, GCE-GNN). It consists of two core components: a Hybrid Intent Learning module and an Intent Constraint Loss.

Module 1: Hybrid Intent Learning

Conventional intent mining methods consider only intra-session temporal relationships, making them susceptible to noise and neglecting cross-session intent consistency. HID proposes attribute-aware spectral clustering to construct global hybrid intents in three steps:

Step 1 — Preliminary Intent Units: Item attribute information (e.g., product category, music genre) is used as preliminary intent units. Items sharing the same attribute are grouped into a preliminary intent group \(c'_i\).

Step 2 — Preliminary Intent Graph Construction: Item IDs in each session are replaced with attribute IDs, and co-occurrence frequencies between attributes are aggregated across all sessions to construct a preliminary intent graph \(\mathcal{G}=(\mathcal{P}, \mathcal{E}, \mathcal{W})\), where nodes represent attributes and edge weights represent co-occurrence frequencies. For example, "food" and "cookware" frequently co-occur in shopping sessions, yielding a high edge weight between them.

Step 3 — Spectral Clustering to Generate Hybrid Intents: The normalized Laplacian matrix \(L = I - D^{-1/2}WD^{-1/2}\) is computed for the preliminary intent graph. The eigenvectors corresponding to the \(q\) smallest eigenvalues are extracted, and K-means clustering is applied to the rows of the eigenvector matrix to reassign attributes into \(n\) clusters. Attributes within the same cluster are merged to form a hybrid intent — for instance, "food" and "cookware" may be aggregated into a "cooking" intent. The embedding of each hybrid intent is obtained via mean pooling over all contained item embeddings: \(\mathbf{c}_i = \frac{1}{|c_i|}\sum_{v_j \in c_i} \mathbf{v}_j\).
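To make the three steps concrete, here is a minimal offline sketch in Python (NumPy/SciPy/scikit-learn). The function name `build_hybrid_intents`, the representation of sessions as lists of attribute IDs, and the parameters `q` and `n_clusters` are illustrative assumptions, not the authors' code.

```python
import numpy as np
from itertools import combinations
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def build_hybrid_intents(attr_sessions, num_attrs, q, n_clusters):
    """Offline construction of hybrid intents (illustrative sketch).

    attr_sessions: list of sessions, each a list of attribute IDs
                   (item IDs already replaced by their attribute IDs).
    """
    # Step 2: weighted co-occurrence graph over attributes.
    W = np.zeros((num_attrs, num_attrs))
    for sess in attr_sessions:
        for a, b in combinations(set(sess), 2):
            W[a, b] += 1.0
            W[b, a] += 1.0

    # Step 3: normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    L = np.eye(num_attrs) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Eigenvectors of the q smallest eigenvalues, then K-means on the rows.
    _, U = eigh(L, subset_by_index=[0, q - 1])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
    return labels  # labels[a] = hybrid-intent ID of attribute a
```

Intent embeddings then follow by mean-pooling item embeddings per cluster, and the resulting item-to-intent mapping can be stored as a lookup table, consistent with the offline precomputation noted next.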

Notably, the entire hybrid intent construction process can be precomputed offline; during training and inference, only a lookup table retrieval is needed, introducing no additional online overhead.

Definition of Target and Noise Intents

For session \(S^u\), given the next item \(v_{l+1}^u\) (the ground-truth label, known during training):

  • Target intent \(\mathcal{C}^u\): the set of hybrid intents containing \(v_{l+1}^u\)
  • Noise intents \(\hat{\mathcal{C}}^u\): target intents of other sessions in the same mini-batch that are not in \(\mathcal{C}^u\)

Target and noise intents are used solely as supervision signals in ICLoss during training, posing no risk of data leakage.
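A minimal sketch of how these supervision signals could be gathered per mini-batch, assuming a precomputed item-to-intent lookup table and (for simplicity) a single intent per item; the names `batch_intents` and `item2intent` are hypothetical.

```python
def batch_intents(next_items, item2intent):
    """Target and noise intents for each session in a mini-batch.

    next_items: list of ground-truth next-item IDs, one per session.
    item2intent: dict mapping item ID -> hybrid-intent ID.
    """
    targets = [item2intent[v] for v in next_items]
    noise = [
        [t for t in set(targets) if t != tgt]  # other sessions' targets
        for tgt in targets
    ]
    return targets, noise
```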

Module 2: Intent Constraint Loss (ICLoss)

After obtaining hybrid intent embeddings, HID applies \(L_2\) normalization to both session and intent embeddings, projecting them onto a unit hypersphere to ensure a consistent metric space. Two constraints are then imposed:

Constraint 1: Constraint for Long-tail

Core Idea: Minimize the variance of distances from the session to all items within the target intent. Head item embeddings are typically closer to the session representation, while tail items are farther; constraining the variance forces their distances to converge, granting tail items recommendation probabilities comparable to head items.

\[\min\ \mathcal{L}_l = \text{Var}_{v_i \in \mathcal{C}^u}[d(\mathbf{S}^u, \mathbf{v}_i)]\]

Direct variance computation has complexity \(O(Nd)\). The paper proves via Theorem 1 that minimizing this variance is optimization-equivalent to minimizing the distance from the session embedding to the target intent centroid \(d(\mathbf{S}^u, \mathbf{c}^u)\), reducing complexity to \(O(d)\).
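The practical upshot of Theorem 1, sketched below in PyTorch: the loss never has to touch every item in the target intent, only its precomputed centroid. The sketch assumes cosine distance on the \(L_2\)-normalized embeddings and only contrasts the two costs; the equivalence proof itself is in the paper.

```python
import torch

def longtail_constraint_naive(s, intent_items):
    # O(Nd): variance of distances from session s to all N items
    # in the target intent.
    d = 1.0 - torch.nn.functional.cosine_similarity(
        s.unsqueeze(0), intent_items, dim=-1)
    return d.var()

def longtail_constraint_fast(s, intent_centroid):
    # O(d): distance to the intent centroid; per Theorem 1, minimizing
    # this is optimization-equivalent to minimizing the variance above.
    return 1.0 - torch.nn.functional.cosine_similarity(
        s, intent_centroid, dim=-1)
```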

Constraint 2: Constraint for Accuracy

Core Idea: Maximize the average distance from the session to noise intents, while constraining the variance to avoid degenerate cases.

\[\max\ \mathcal{L}_a = \mathbb{E}_{c^v \in \hat{\mathcal{C}}^u} d(\mathbf{S}^u, \mathbf{c}^v), \quad \text{s.t.}\ \text{Var}_{c^v \in \hat{\mathcal{C}}^u}(d(\mathbf{S}^u, \mathbf{c}^v)) < \eta\]

Unified Loss Function

Both constraints are unified into an InfoNCE-style loss. The paper proves via Theorem 2 that this loss is approximately equivalent to an \((N-1)\)-Triplet Loss with a fixed margin of 2. To accommodate different scenarios, a flexible temperature coefficient \(\sigma\) replaces the fixed margin, and the hard variance constraint is converted into a penalty term \(p^u\). The final ICLoss is:

\[\mathcal{L}_c = -\sum_{S^u \in \mathcal{B}} \log \frac{\mathbf{X}}{(1+\lambda p^u)(\mathbf{X}+\mathbf{Y})}\]

where \(\mathbf{X} = \exp(\cos(\mathbf{S}^u, \mathbf{c}^u)/\sigma)\) and \(\mathbf{Y} = \sum_{c^v \in \hat{\mathcal{C}}^u}\exp(\cos(\mathbf{S}^u, \mathbf{c}^v)/\sigma)\).

The total training loss is: \(\mathcal{L} = \mathcal{L}_p + \epsilon \mathcal{L}_c\), where \(\mathcal{L}_p\) is the cross-entropy loss of the original SBR backbone.
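Putting the pieces together, a hedged PyTorch sketch of ICLoss as written above. The exact form of the penalty \(p^u\) is an assumption here (taken as the variance of session-to-noise-intent distances, i.e., a soft version of Constraint 2's hard bound), and all tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def ic_loss(sess_emb, target_cent, noise_cents, sigma=0.1, lam=1.0):
    """ICLoss for one mini-batch (illustrative sketch).

    sess_emb:    (B, d) session embeddings
    target_cent: (B, d) target-intent centroids c^u
    noise_cents: (B, M, d) noise-intent centroids, M per session
    """
    # Project all embeddings onto the unit hypersphere.
    s = F.normalize(sess_emb, dim=-1)
    c = F.normalize(target_cent, dim=-1)
    n = F.normalize(noise_cents, dim=-1)

    pos = torch.exp((s * c).sum(-1) / sigma)        # X
    neg_cos = torch.einsum('bd,bmd->bm', s, n)
    neg = torch.exp(neg_cos / sigma).sum(-1)        # Y

    # Assumed penalty p^u: variance of session-to-noise distances.
    p = (1.0 - neg_cos).var(dim=-1)

    return -torch.log(pos / ((1.0 + lam * p) * (pos + neg))).sum()

# Total objective, mirroring L = L_p + eps * L_c:
# loss = ce_loss + eps * ic_loss(sess_emb, target_cent, noise_cents)
```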

Key Experimental Results

Table 1: Main Results — STAMP and GRU4Rec Backbones (HR@20 / tCov@20)

| Method | Tmall HR↑ | Tmall tCov↑ | Diginetica HR↑ | Diginetica tCov↑ | Retailrocket HR↑ | Retailrocket tCov↑ |
|---|---|---|---|---|---|---|
| STAMP (base) | 26.10 | 69.46 | 50.15 | 90.71 | 50.54 | 53.70 |
| + TailNet | 20.61 ↓ | 71.33 ↑ | 45.39 ↓ | 91.23 ↑ | 47.00 ↓ | 51.56 ↓ |
| + LOAM | 24.31 ↓ | 71.68 ↑ | 46.19 ↓ | 89.96 ↓ | 50.27 ↓ | 55.67 ↑ |
| + LAP-SR | 25.21 ↓ | 72.11 ↑ | 49.87 ↓ | 91.32 ↑ | 49.59 ↓ | 55.32 ↑ |
| + HID | 28.26 ↑ | 73.65 ↑ | 50.39 ↑ | 93.05 ↑ | 52.38 ↑ | 56.02 ↑ |
| GRU4Rec (base) | 19.69 | 49.60 | 50.23 | 84.97 | 45.01 | 69.98 |
| + HID | 25.13 ↑ | 63.21 ↑ | 52.23 ↑ | 90.73 ↑ | 48.89 ↑ | 73.21 ↑ |

Key finding: Existing long-tail methods almost universally sacrifice accuracy for long-tail gains (seesaw effect), whereas HID simultaneously improves both accuracy and long-tail coverage across all 4 backbone × 3 dataset combinations. GRU4Rec+HID achieves a 27.6% improvement in Tmall HR@20 (19.69→25.13) and a 27.4% improvement in tCov@20 (49.60→63.21) over the base model.

Table 2: Ablation Study (STAMP and SRGNN Backbones, HR@20 / tCov@20)

| Variant | Tmall HR | Tmall tCov | Diginetica HR | Diginetica tCov | Retailrocket HR | Retailrocket tCov |
|---|---|---|---|---|---|---|
| STAMP+HID | 28.26 | 73.65 | 50.39 | 93.05 | 52.38 | 56.02 |
| w/o HI | 27.43 | 69.29 | 50.17 | 91.96 | 51.75 | 55.31 |
| w/o FC | 26.77 | 70.20 | 49.76 | 92.15 | 50.89 | 55.67 |
| SRGNN+HID | 28.38 | 66.40 | 52.09 | 96.02 | 53.45 | 55.75 |
| w/o HI | 27.48 | 61.00 | 51.96 | 92.94 | 53.10 | 54.01 |
| w/o FC | 27.36 | 62.92 | 51.16 | 93.56 | 52.80 | 55.11 |

Removing hybrid intents (HI) has a greater impact on diversity, while removing the flexible coefficient (FC) has a greater impact on accuracy. The effect of hybrid intents is particularly pronounced on Tmall (tCov: 73.65→69.29), as Tmall sessions are longer and exhibit more frequent intent drift, making precise target intent modeling more critical.

Highlights & Insights

  • Conceptual Innovation: The paper is the first to explicitly attribute the seesaw effect to the indiscriminate promotion of tail items, which injects session-irrelevant noise, and it offers a principled remedy through intent modeling.
  • Solid Theoretical Grounding: Theorem 1 reduces \(O(Nd)\) variance minimization to equivalent \(O(d)\) distance minimization; Theorem 2 approximates the unified loss as a Triplet Loss with a margin term, with complete theoretical derivations.
  • Plug-and-Play Design: HID integrates seamlessly with any SBR backbone (both sequential and graph-based), and hybrid intent construction can be performed offline with no additional inference overhead.
  • Well-Validated Win-Win: HID achieves simultaneous improvements in accuracy and long-tail coverage across all combinations of 4 backbones × 3 datasets × 6 metrics, with \(p\)-values generally below 0.001.

Limitations & Future Work

  • Dependency on Item Attribute Information: Hybrid intent construction requires item category attributes as preliminary intent units, limiting applicability in scenarios lacking attribute metadata (though the appendix includes experiments with semantic clustering as an alternative).
  • Cluster Count Requires Tuning: The optimal number of clusters \(n\) varies by dataset (Tmall=4, Diginetica=3), and no adaptive selection strategy is provided.
  • Validation Limited to Classical SBR Backbones: The framework is validated only on RNN-, attention-, and GNN-based backbones; it has not been tested on Transformer-based SBR models such as SASRec or BERT4Rec.
  • Target Intent Relies on Ground Truth: During training, target intents are defined via the next item \(v_{l+1}^u\), which limits extensibility to semi-supervised or weakly supervised settings.
Comparison with Related Work

  • TailNet / CSBR / LAP-SR (re-ranking-based): These methods directly adjust ranking results without filtering noise; HID addresses the problem at its root, using intent modeling to distinguish session-relevant from session-irrelevant items.
  • LOAM / MelT / LLM-ESR (augmentation-based): Augmenting tail embeddings likewise fails to filter noise; HID's dual-constraint loss explicitly pushes noise intents away from the session while improving tail coverage.
  • Intent Modeling Methods (ICL, MISAR, STP): Existing intent mining extracts intents only from sliding windows or local subgraphs within individual sessions, ignoring cross-session consistency; HID constructs hybrid intents from global attribute co-occurrence relationships, yielding greater robustness.
  • Contrastive Learning for Recommendation (CL4SRec et al.): These methods leverage positive-negative pairs for representation learning but are not tailored for the long-tail problem; HID's ICLoss is essentially a long-tail-oriented variant of contrastive learning.

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of hybrid intents and dual constraints is novel, and the attribution analysis of the seesaw effect is convincing.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive validation across 4 backbones × 3 datasets, with complete ablation and hyperparameter analyses and rigorous statistical testing.
  • Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clearly articulated, theoretical derivations are complete, and figures are intuitive.
  • Value: ⭐⭐⭐⭐ — The proposed plug-and-play framework is highly practical and makes a clear methodological contribution to the long-tail recommendation literature.