# Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents
- Conference: AAAI 2026
- arXiv: 2511.08378
- Authors: Xiao Wang, Ke Qin, Dongyang Zhang, Xiurui Xie, Shuang Liang
- Code: Not released
- Area: Recommender Systems
- Keywords: Session-based recommendation, long-tail distribution, hybrid intents, spectral clustering, contrastive learning
## TL;DR
This paper proposes the HID framework, which constructs hybrid intents via attribute-aware spectral clustering to distinguish session-relevant from session-irrelevant tail items, and introduces a dual-constraint loss (ICLoss) targeting both long-tail coverage and recommendation accuracy. The framework achieves a "win-win" between long-tail promotion and accuracy, breaking the traditional seesaw dilemma where improving one metric inevitably harms the other.
## Background & Motivation

### State of the Field
Session-based Recommendation (SBR) aims to predict the next interacted item from an anonymous user's short-term interaction sequence. In practical recommendation scenarios, item exposure frequency follows a severe long-tail distribution — a small number of head items account for the vast majority of interactions, while a large number of tail items are rarely recommended. This imbalance causes recommendation systems to repeatedly surface head items, creating a vicious cycle that substantially reduces recommendation diversity.
### Limitations of Prior Work
Existing long-tail SBR methods fall into two categories: (1) Augmentation-based methods (e.g., LOAM, MelT, LLM-ESR), which enhance tail item embeddings or emphasize tail items when generating session embeddings; and (2) Re-ranking-based methods (e.g., TailNet, CSBR, LAP-SR), which predict the head/tail item distribution from interaction sessions and directly adjust the final ranking results.
Both categories share two common deficiencies:
Indiscriminate emphasis on tail items introduces noise: Not all tail items are relevant to the current session. For example, in a session centered on "books," tail items from "clothing" categories are noise, despite being long-tail. Existing methods uniformly boost tail item exposure without distinguishing relevant from irrelevant tail items, leading to degraded recommendation accuracy.
Lack of explicit long-tail supervision signals: Most methods still rely on cross-entropy loss for indirect optimization, and augmentation or re-ranking strategies often conflict with the cross-entropy objective, resulting in a seesaw effect — improving long-tail performance inevitably harms accuracy, and vice versa.
On the Tmall dataset, applying long-tail methods such as TailNet to GRU4Rec improves tCov@20 but causes a notable drop in HR@20, clearly demonstrating this seesaw phenomenon.
### Root Cause
Resolving this contradiction requires answering two questions: how to effectively identify noise, and how to provide explicit supervision signals for both long-tail coverage and accuracy simultaneously. HID's answer is: (1) capture high-level user preferences through hybrid intent modeling to partition items into session-target intents and noise intents; and (2) apply dual-constraint losses to enforce head-tail consistency within target intents (promoting long-tail) and to maximize the distance between the session and noise intents (preserving accuracy).
## Method

### Overall Architecture
HID is a model-agnostic plug-and-play framework that can be integrated with any SBR backbone (e.g., GRU4Rec, STAMP, SRGNN, GCE-GNN). It consists of two core components: a Hybrid Intent Learning module and an Intent Constraint Loss.
### Module 1: Hybrid Intent Learning
Conventional intent mining methods consider only intra-session temporal relationships, making them susceptible to noise and neglecting cross-session intent consistency. HID proposes attribute-aware spectral clustering to construct global hybrid intents in three steps:
Step 1 — Preliminary Intent Units: Item attribute information (e.g., product category, music genre) is used as preliminary intent units. Items sharing the same attribute are grouped into a preliminary intent group \(c'_i\).
Step 2 — Preliminary Intent Graph Construction: Item IDs in each session are replaced with attribute IDs, and co-occurrence frequencies between attributes are aggregated across all sessions to construct a preliminary intent graph \(\mathcal{G}=(\mathcal{P}, \mathcal{E}, \mathcal{W})\), where nodes represent attributes and edge weights represent co-occurrence frequencies. For example, "food" and "cookware" frequently co-occur in shopping sessions, yielding a high edge weight between them.
Step 3 — Spectral Clustering to Generate Hybrid Intents: The normalized Laplacian matrix \(L = I - D^{-1/2}WD^{-1/2}\) is computed for the preliminary intent graph. The eigenvectors corresponding to the \(q\) smallest eigenvalues are extracted, and K-means clustering is applied to the rows of the eigenvector matrix to reassign attributes into \(n\) clusters. Attributes within the same cluster are merged to form a hybrid intent — for instance, "food" and "cookware" may be aggregated into a "cooking" intent. The embedding of each hybrid intent is obtained via mean pooling over all contained item embeddings: \(\mathbf{c}_i = \frac{1}{|c_i|}\sum_{v_j \in c_i} \mathbf{v}_j\).
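The three steps above can be sketched in a few NumPy functions. This is a minimal illustration, not the paper's implementation: the function names, the deterministic farthest-point k-means initialization, and the toy k-means itself are all assumptions made here for self-containment.

```python
import numpy as np

def build_cooccurrence_graph(sessions, num_attrs):
    """Step 2: aggregate attribute co-occurrence counts across all sessions."""
    W = np.zeros((num_attrs, num_attrs))
    for attrs in sessions:            # each session = list of attribute IDs
        for i in range(len(attrs)):
            for j in range(i + 1, len(attrs)):
                a, b = attrs[i], attrs[j]
                if a != b:
                    W[a, b] += 1.0
                    W[b, a] += 1.0
    return W

def spectral_cluster(W, q, n_clusters, n_iter=50):
    """Step 3: normalized-Laplacian spectral clustering with a tiny k-means."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)    # eigenvalues returned in ascending order
    U = eigvecs[:, :q]                # rows = spectral embeddings of attributes
    # deterministic farthest-point initialization, then Lloyd iterations
    centers = [U[0]]
    for _ in range(1, n_clusters):
        dists = np.min([((U - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(U[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = U[labels == k].mean(axis=0)
    return labels

def intent_embeddings(item_emb, item_attr, attr_cluster, n_clusters):
    """Mean-pool item embeddings within each hybrid intent."""
    out = np.zeros((n_clusters, item_emb.shape[1]))
    for k in range(n_clusters):
        mask = np.isin(item_attr, np.where(attr_cluster == k)[0])
        out[k] = item_emb[mask].mean(axis=0)
    return out
```

On a toy corpus where attributes 0/1 always co-occur and 2/3 always co-occur, the spectral step recovers the two blocks, mirroring the "food + cookware → cooking" example.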
Notably, the entire hybrid intent construction process can be precomputed offline; during training and inference, only a lookup table retrieval is needed, introducing no additional online overhead.
### Definition of Target and Noise Intents
For session \(S^u\), given the next item \(v_{l+1}^u\) (the ground-truth label known during training):

- Target intent \(\mathcal{C}^u\): the set of hybrid intents containing \(v_{l+1}^u\)
- Noise intent \(\hat{\mathcal{C}}^u\): target intents from other sessions in the same mini-batch that are not in \(\mathcal{C}^u\)
Target and noise intents are used solely as supervision signals in ICLoss during training, posing no risk of data leakage.
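Under these definitions, the per-batch bookkeeping reduces to simple set operations. A sketch, assuming a precomputed item-to-hybrid-intent lookup (the name `item_to_intents` is hypothetical):

```python
def batch_target_noise(next_items, item_to_intents):
    """Split each session's supervision signal into target and noise intents.

    next_items      : list of ground-truth next items, one per session in the batch
    item_to_intents : dict mapping item ID -> set of hybrid intent IDs (assumed lookup)
    """
    targets = [set(item_to_intents[v]) for v in next_items]
    batch_union = set().union(*targets)
    # noise intents: target intents of *other* sessions, minus this session's own
    noise = [batch_union - t for t in targets]
    return targets, noise
```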
### Module 2: Intent Constraint Loss (ICLoss)
After obtaining hybrid intent embeddings, HID applies \(L_2\) normalization to both session and intent embeddings, projecting them onto a unit hypersphere to ensure a consistent metric space. Two constraints are then imposed:
**Constraint 1: Long-tail Constraint**
Core Idea: Minimize the variance of distances from the session to all items within the target intent. Head item embeddings are typically closer to the session representation, while tail items are farther; constraining the variance forces their distances to converge, granting tail items recommendation probabilities comparable to head items.
Direct variance computation has complexity \(O(Nd)\). The paper proves via Theorem 1 that minimizing this variance is optimization-equivalent to minimizing the distance from the session embedding to the target intent centroid \(d(\mathbf{S}^u, \mathbf{c}^u)\), reducing complexity to \(O(d)\).
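An intuition for why the centroid stands in for the full computation (a sketch; the complete variance argument is the paper's Theorem 1): with all embeddings \(L_2\)-normalized, each squared distance is an affine function of a dot product, so the mean over the \(N\) items of the target intent collapses to a single dot product with their (unnormalized) centroid \(\mathbf{c}^u\):

\[
d^2(\mathbf{S}^u, \mathbf{v}_j) = \|\mathbf{S}^u\|^2 + \|\mathbf{v}_j\|^2 - 2\,\mathbf{S}^u \cdot \mathbf{v}_j = 2 - 2\,\mathbf{S}^u \cdot \mathbf{v}_j,
\qquad
\frac{1}{N} \sum_{j=1}^{N} d^2(\mathbf{S}^u, \mathbf{v}_j) = 2 - 2\,\mathbf{S}^u \cdot \mathbf{c}^u .
\]

The centroid thus summarizes the first moment of the squared distances in \(O(d)\); Theorem 1 carries the same style of argument through for the variance itself.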
**Constraint 2: Accuracy Constraint**
Core Idea: Maximize the average distance from the session to noise intents, while constraining the variance to avoid degenerate cases.
**Unified Loss Function**
Both constraints are unified into an InfoNCE-style loss. The paper proves via Theorem 2 that this loss is approximately equivalent to an \((N-1)\)-Triplet Loss with a fixed margin of 2. To accommodate different scenarios, a flexible temperature coefficient \(\sigma\) replaces the fixed margin, and the hard variance constraint is converted into a penalty term \(p^u\). The final ICLoss is:

\[
\mathcal{L}_c = -\log\frac{\mathbf{X}}{\mathbf{X} + \mathbf{Y}} + p^u,
\]

where \(\mathbf{X} = \exp(\cos(\mathbf{S}^u, \mathbf{c}^u)/\sigma)\) and \(\mathbf{Y} = \sum_{c^v \in \hat{\mathcal{C}}^u}\exp(\cos(\mathbf{S}^u, \mathbf{c}^v)/\sigma)\).
The total training loss is: \(\mathcal{L} = \mathcal{L}_p + \epsilon \mathcal{L}_c\), where \(\mathcal{L}_p\) is the cross-entropy loss of the original SBR backbone.
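The loss computation itself is small. The sketch below assumes the InfoNCE form \(-\log(\mathbf{X}/(\mathbf{X}+\mathbf{Y}))\) implied by the definitions of \(\mathbf{X}\) and \(\mathbf{Y}\); the exact form of the variance penalty \(p^u\) is an assumption made here, not taken from the paper.

```python
import numpy as np

def icloss(S, c_target, C_noise, sigma=0.1, penalty_weight=0.0):
    """Intent Constraint Loss for a single session (illustrative sketch).

    S        : (d,)   session embedding
    c_target : (d,)   target intent embedding
    C_noise  : (m, d) noise intent embeddings
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    # project onto the unit hypersphere for a consistent metric space
    S, c_target, C_noise = l2norm(S), l2norm(c_target), l2norm(C_noise)
    X = np.exp(S @ c_target / sigma)          # pull toward the target intent
    Y = np.exp(C_noise @ S / sigma).sum()     # push away from noise intents
    p = penalty_weight * np.var(C_noise @ S)  # assumed form of the penalty p^u
    return -np.log(X / (X + Y)) + p
```

A session aligned with its target intent and orthogonal to noise intents yields a near-zero loss; the misaligned case is heavily penalized. In training this term is added to the backbone's cross-entropy loss with weight \(\epsilon\), as in \(\mathcal{L} = \mathcal{L}_p + \epsilon \mathcal{L}_c\).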
## Key Experimental Results

### Table 1: Main Results — STAMP and GRU4Rec Backbones (HR@20 / tCov@20)
| Method | Tmall HR↑ | Tmall tCov↑ | Diginetica HR↑ | Diginetica tCov↑ | Retailrocket HR↑ | Retailrocket tCov↑ |
|---|---|---|---|---|---|---|
| STAMP (base) | 26.10 | 69.46 | 50.15 | 90.71 | 50.54 | 53.70 |
| + TailNet | 20.61 ↓ | 71.33 ↑ | 45.39 ↓ | 91.23 ↑ | 47.00 ↓ | 51.56 ↓ |
| + LOAM | 24.31 ↓ | 71.68 ↑ | 46.19 ↓ | 89.96 ↓ | 50.27 ↓ | 55.67 ↑ |
| + LAP-SR | 25.21 ↓ | 72.11 ↑ | 49.87 ↓ | 91.32 ↑ | 49.59 ↓ | 55.32 ↑ |
| + HID | 28.26 ↑ | 73.65 ↑ | 50.39 ↑ | 93.05 ↑ | 52.38 ↑ | 56.02 ↑ |
| GRU4Rec (base) | 19.69 | 49.60 | 50.23 | 84.97 | 45.01 | 69.98 |
| + HID | 25.13 ↑ | 63.21 ↑ | 52.23 ↑ | 90.73 ↑ | 48.89 ↑ | 73.21 ↑ |
Key finding: Existing long-tail methods almost universally sacrifice accuracy for long-tail gains (seesaw effect), whereas HID simultaneously improves both accuracy and long-tail coverage across all 12 backbone-dataset combinations (4 backbones, 3 datasets). GRU4Rec+HID achieves a 27.6% improvement in Tmall HR@20 (19.69→25.13) and a 27.4% improvement in tCov@20 (49.60→63.21) over the base model.
### Table 2: Ablation Study (STAMP + SRGNN, HR@20 / tCov@20)
| Variant | Tmall HR | Tmall tCov | Diginetica HR | Diginetica tCov | Retailrocket HR | Retailrocket tCov |
|---|---|---|---|---|---|---|
| STAMP+HID | 28.26 | 73.65 | 50.39 | 93.05 | 52.38 | 56.02 |
| w/o HI | 27.43 | 69.29 | 50.17 | 91.96 | 51.75 | 55.31 |
| w/o FC | 26.77 | 70.20 | 49.76 | 92.15 | 50.89 | 55.67 |
| SRGNN+HID | 28.38 | 66.40 | 52.09 | 96.02 | 53.45 | 55.75 |
| w/o HI | 27.48 | 61.00 | 51.96 | 92.94 | 53.10 | 54.01 |
| w/o FC | 27.36 | 62.92 | 51.16 | 93.56 | 52.80 | 55.11 |
Removing hybrid intents (HI) has a greater impact on diversity, while removing the flexible coefficient (FC) has a greater impact on accuracy. The effect of hybrid intents is particularly pronounced on Tmall (tCov: 73.65→69.29), as Tmall sessions are longer and exhibit more frequent intent drift, making precise target intent modeling more critical.
## Highlights & Insights
- Conceptual Innovation: The paper is the first to explicitly attribute the root cause of the seesaw effect to indiscriminate emphasis on tail items, which introduces session-irrelevant noise, and it offers a principled solution through intent modeling.
- Solid Theoretical Grounding: Theorem 1 reduces \(O(Nd)\) variance minimization to equivalent \(O(d)\) distance minimization; Theorem 2 approximates the unified loss as a Triplet Loss with a margin term, with complete theoretical derivations.
- Plug-and-Play Design: HID integrates seamlessly with any SBR backbone (both sequential and graph-based), and hybrid intent construction can be performed offline with no additional inference overhead.
- Well-Validated Win-Win: HID achieves simultaneous improvements in accuracy and long-tail coverage across all combinations of 4 backbones × 3 datasets × 6 metrics, with \(p\)-values generally below 0.001.
## Limitations & Future Work
- Dependency on Item Attribute Information: Hybrid intent construction requires item category attributes as preliminary intent units, limiting applicability in scenarios lacking attribute metadata (though the appendix includes experiments with semantic clustering as an alternative).
- Cluster Count Requires Tuning: The optimal number of clusters \(n\) varies by dataset (Tmall=4, Diginetica=3), and no adaptive selection strategy is provided.
- Validation Limited to Classical SBR Models: The framework has not been validated on modern Transformer-based SBR models such as SASRec or BERT4Rec.
- Target Intent Relies on Ground Truth: During training, target intents are defined via the next item \(v_{l+1}^u\), which limits extensibility to semi-supervised or weakly supervised settings.
## Related Work & Insights
- TailNet / CSBR / LAP-SR (re-ranking-based): These methods directly modify ranking results without filtering out noise; HID addresses the problem at its root, using intent modeling to distinguish session-relevant from session-irrelevant items.
- LOAM / MelT / LLM-ESR (augmentation-based): Augmenting tail embeddings likewise fails to filter noise; HID's dual-constraint loss explicitly pushes noise intents away from the session while improving tail coverage.
- Intent Modeling Works (ICL, MISAR, STP): Existing intent mining extracts intents only from sliding windows or local subgraphs within individual sessions, ignoring cross-session consistency; HID constructs hybrid intents from global attribute co-occurrence relationships, yielding greater robustness.
- Contrastive Learning for Recommendation (CL4SRec et al.): These methods leverage positive-negative pairs for representation learning but are not tailored for the long-tail problem; HID's ICLoss is essentially a long-tail-oriented variant of contrastive learning.
## Rating
- Novelty: ⭐⭐⭐⭐ — The combination of hybrid intents and dual constraints is novel, and the attribution analysis of the seesaw effect is convincing.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive validation across 4 backbones × 3 datasets, with complete ablation and hyperparameter analyses and rigorous statistical testing.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clearly articulated, theoretical derivations are complete, and figures are intuitive.
- Value: ⭐⭐⭐⭐ — The proposed plug-and-play framework is highly practical and makes a clear methodological contribution to the long-tail recommendation literature.