Hard vs. Noise: Resolving Hard-Noisy Sample Confusion in Recommender Systems via Large Language Models¶

Conference: AAAI2026 arXiv: 2511.07295
Code: GitHub
Area: Recommender Systems Keywords: recommender systems, denoising, hard sample, LLM, contrastive learning

TL;DR¶

This paper proposes the LLMHNI framework, which leverages two types of auxiliary signals generated by LLMs—semantic relevance and logical relevance—to resolve the confusion between hard samples and noisy samples in recommender systems, significantly improving denoising recommendation performance.

Background & Motivation¶

State of the Field¶

Recommender systems typically rely on implicit feedback (clicks, purchases, etc.) for training, treating interacted items as positive samples and non-interacted items as negative samples. However, this labeling scheme introduces two types of noise:

False positive noise: Spurious positive feedback caused by accidental clicks or positional bias.
False negative noise: Items of genuine interest to users that are incorrectly labeled as negative due to lack of exposure or similar reasons.

Existing denoising methods (sample dropping, sample reweighting) rely on numerical patterns such as loss values, prediction scores, and gradients to distinguish noisy samples from clean ones. However, the authors identify a critical issue: hard samples and noisy samples exhibit highly similar distributional patterns in terms of loss values and prediction scores, making it difficult for denoising methods to effectively differentiate between them. This confusion is particularly harmful because hard samples are essential for modeling user preferences, and mistakenly discarding them as noise severely degrades recommendation quality.

Root Cause¶

Numerical patterns derived solely from interaction data are insufficient to distinguish hard samples from noisy samples (hard-noisy confusion), so auxiliary information beyond collaborative filtering signals is needed. The paper uses LLMs to provide two complementary relevance signals:

Semantic Relevance: Similarity computed from user/item text embeddings encoded by LLMs.
Logical Relevance: Logical associations between users and items inferred via LLM reasoning.

Two additional challenges must be addressed: the objective mismatch of LLM embeddings (optimized for language tasks rather than recommendation tasks) and unreliable interaction inferences caused by LLM hallucinations.

Method¶

LLMHNI consists of two core modules:

Module 1: Semantic Relevance Guided Hard Negative Mining¶

(1) Goal-Aligned Embedding Generation: An LLM encoder (text-embedding-ada-002) is used to encode textual profiles of users and items, which are then projected into a low-dimensional recommendation representation space via an MLP. During MLP training, pseudo-labels are constructed: items that simultaneously exhibit high textual embedding similarity and actual interaction records serve as reliable positive samples. An InfoNCE contrastive loss \(\mathcal{L}_{al}\) is used to train the projector, aligning the embeddings to better model relevance in recommendation scenarios.
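The contrastive objective used to train the projector can be illustrated with a minimal, standard-library sketch of InfoNCE for a single user. This is not the authors' implementation: the function names, the cosine similarity, and the temperature of 0.2 are assumptions, and the vectors stand in for embeddings that have already passed through the MLP projector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(user_vec, pos_vec, neg_vecs, temperature=0.2):
    """InfoNCE for one user: negative log-softmax of the positive item's similarity.

    The temperature value is an assumed hyperparameter, not taken from the paper.
    """
    logits = [cosine(user_vec, pos_vec) / temperature]
    logits += [cosine(user_vec, neg) / temperature for neg in neg_vecs]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]

user = [1.0, 0.0]
# Pseudo-positive close to the user, negatives far away: small loss.
aligned = info_nce(user, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.2]])
# Pseudo-positive far from the user, one negative close: large loss.
misaligned = info_nce(user, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.2]])
```

Minimizing this loss pulls a user toward its pseudo-positive items and pushes it away from the sampled negatives, which is the sense in which the projected embeddings become aligned with the recommendation objective.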

(2) Semantically Guided Negative Sampling: A dynamic hard negative pool \(\mathbf{HN}_u^-\) is maintained for each user and refreshed each round with the negatives that the recommendation model scores highest. From this pool, the negatives with the lowest semantic similarity to the user are selected as hard negatives for training. The core idea is that high-scoring negative samples may be either hard negatives or false negatives; false negatives tend to have high semantic similarity, so keeping only the low-similarity candidates filters them out and retains genuine hard negatives.
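The two-step selection can be sketched as follows. This is a hedged illustration rather than the paper's code: `select_hard_negatives`, `pool_size`, and `num_selected` are hypothetical names, and the scores are made-up values for one user.

```python
def select_hard_negatives(pred_scores, semantic_sim, pool_size, num_selected):
    """Pick hard negatives for one user.

    pred_scores / semantic_sim: dicts mapping a non-interacted item id to a float.
    """
    # Step 1: refresh the pool with the negatives the recommender scores highest.
    pool = sorted(pred_scores, key=pred_scores.get, reverse=True)[:pool_size]
    # Step 2: within the pool, keep the items that are semantically least similar,
    # since high-similarity items are more likely to be false negatives.
    return sorted(pool, key=semantic_sim.get)[:num_selected]

pred_scores = {"a": 0.91, "b": 0.88, "c": 0.85, "d": 0.20}
semantic_sim = {"a": 0.95, "b": 0.10, "c": 0.30, "d": 0.05}
hard_negatives = select_hard_negatives(pred_scores, semantic_sim, pool_size=3, num_selected=2)
```

In this toy input, item `a` is scored highly but is also semantically close to the user, so it is treated as a likely false negative and skipped; item `d` has the lowest similarity but never enters the pool because the model already scores it low.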

Module 2: Logical Relevance Guided Interaction Denoising¶

(1) Logical Relevance Inference: Candidate interaction pairs (high-scoring negatives and low-scoring positives) are first selected using a pre-trained recommendation model. An LLM is then queried from two perspectives to score these pairs:

User perspective: Logical relevance between the user and the target item is assessed based on the user profile.
Item perspective: Relevance is assessed based on the characteristics of items with which the user has had high-scoring interactions.

Scores are assigned at three levels: High, Mid, and Low. Only pairs rated High from both perspectives are identified as hard samples \(\mathcal{C}_H\); the remainder are classified as noisy samples \(\mathcal{C}_N\).
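The decision rule is simple enough to state as code. The sketch below assumes the LLM ratings have already been collected into a dictionary; `split_candidates` and the example pairs are hypothetical.

```python
def split_candidates(ratings):
    """Split candidate pairs into hard and noisy sets.

    ratings: dict mapping (user, item) -> (user_view_rating, item_view_rating),
    where each rating is "High", "Mid", or "Low".
    """
    hard, noisy = set(), set()
    for pair, (user_view, item_view) in ratings.items():
        if user_view == "High" and item_view == "High":
            hard.add(pair)   # rated High from both perspectives
        else:
            noisy.add(pair)  # every other combination counts as noise
    return hard, noisy

ratings = {
    ("u1", "i1"): ("High", "High"),
    ("u1", "i2"): ("High", "Mid"),
    ("u2", "i3"): ("Low", "High"),
}
hard_pairs, noisy_pairs = split_candidates(ratings)
```

Note that a High/Mid pair lands in the noisy set, which is the conservative behavior the rule implies.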

(2) Cross-Graph Contrastive Alignment: An augmented interaction graph \(G'\) is constructed by removing noisy edges from the original graph \(G\) and adding hard sample edges. Both graphs are used during training. A cross-graph contrastive loss \(\mathcal{L}_{de}\) aligns user/item representations across the two graphs: interactions that are consistent across both graphs are reinforced, while inconsistent ones are suppressed.
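Read literally, the graph construction amounts to \(G' = (G \setminus \mathcal{C}_N) \cup \mathcal{C}_H\). The sketch below shows that set arithmetic on toy edges; the function name and the example pairs are assumptions, and the contrastive loss itself is not reproduced here.

```python
def build_augmented_graph(edges, noisy_pairs, hard_pairs):
    """Return G' = (G minus noisy pairs) union hard pairs as a set of (user, item) edges."""
    return (set(edges) - set(noisy_pairs)) | set(hard_pairs)

original = {("u1", "i1"), ("u1", "i2"), ("u2", "i3")}
noisy = {("u1", "i2")}  # assumed LLM verdict: noisy, so the edge is removed
hard = {("u2", "i4")}   # assumed LLM verdict: hard, so the edge is added

augmented = build_augmented_graph(original, noisy, hard)
# Edges present in both graphs are the "consistent" interactions
# that the cross-graph alignment would reinforce.
shared = original & augmented
```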

(3) Hallucination-Robust Contrastive Learning: Random edge dropout is applied to both \(G'\) and \(G\) to generate augmented views. A graph contrastive learning loss \(\mathcal{L}_{hal}\) aligns the representations of the two views. Random edge dropout probabilistically masks unreliable edges introduced by LLM hallucinations, rendering the model robust to hallucination noise.
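The view-generation step can be sketched with independent Bernoulli edge dropout. The drop probability of 0.3, the function name, and the toy graphs are assumptions for illustration; the paper's actual settings are not stated in this summary.

```python
import random

def drop_edges(edges, drop_prob, rng):
    """Keep each edge independently with probability 1 - drop_prob."""
    # Sorting the edges makes the result reproducible for a fixed seed.
    return {edge for edge in sorted(edges) if rng.random() >= drop_prob}

rng = random.Random(0)
original = {("u1", "i1"), ("u1", "i2"), ("u2", "i3")}
augmented = {("u1", "i1"), ("u2", "i3"), ("u2", "i4")}  # toy LLM-edited graph

# One perturbed view per graph; a contrastive loss would then align
# the node representations computed on the two views.
view_original = drop_edges(original, drop_prob=0.3, rng=rng)
view_augmented = drop_edges(augmented, drop_prob=0.3, rng=rng)
```

Because every edge, including any edge the LLM added or kept by mistake, is absent from some views, no single hallucinated edge can dominate the learned representation.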

Joint Optimization Objective: \(\mathcal{L}_{total} = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{de} + \lambda_2 \mathcal{L}_{hal}\)

Key Experimental Results¶

Datasets: Amazon-Books, Yelp, Steam
Backbone Models: NGCF, LightGCN
Baselines: Instance-level (WBPR, T-CE, BOD), representation-level (SGL, SimGCL, XSimGCL), LLM-enhanced (RLMRec, LLaRD)

Main Results¶

Results with the LightGCN backbone:

Dataset	Metric	LLMHNI	LLaRD (2nd best)	Gain
Amazon-Books	R@20	0.2040	0.2028	+0.6%
Amazon-Books	N@10	0.1168	0.1126	+3.7%
Yelp	N@10	0.0837	0.0809	+3.5%
Steam	N@10	0.0893	0.0868	+2.9%
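The Gain column is the relative improvement over the second-best method, \((\text{LLMHNI} - \text{LLaRD}) / \text{LLaRD}\). A quick check of that arithmetic against the table values (the helper name is hypothetical):

```python
rows = [
    ("Amazon-Books", "R@20", 0.2040, 0.2028),
    ("Amazon-Books", "N@10", 0.1168, 0.1126),
    ("Yelp", "N@10", 0.0837, 0.0809),
    ("Steam", "N@10", 0.0893, 0.0868),
]

def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100

gains = [round(relative_gain(ours, base), 1) for _, _, ours, base in rows]
# gains == [0.6, 3.7, 3.5, 2.9], matching the Gain column
```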

Average improvement of 45.31% over the vanilla LightGCN backbone and 46.55% over vanilla NGCF.
Improvement of 11.78%–37.73% over traditional denoising methods (T-CE, BOD, etc.).
Improvement of 2.47%–33.86% over LLM-enhanced baselines (RLMRec, LLaRD).

Noise Robustness: With 5% to 20% injected noise, LLMHNI's performance degrades most gradually, and it consistently outperforms all baselines.

Ablation Study¶

Removing any single module from the full model (LightGCN / Amazon-Books) lowers R@20:

Variant	R@20
Full model	0.2040
w/o semantic negative sampling	0.1799
w/o goal alignment	0.1848
w/o interaction denoising	0.1772
w/o hallucination robustness	0.1854

Highlights & Insights¶

Precise Problem Identification: This work is the first to explicitly identify and systematically analyze the hard-noisy confusion problem in recommender system denoising, filling a recognized gap in the field.
Complementary Dual-Signal Design: Semantic relevance operates during the negative sampling stage (continuous values), while logical relevance operates during interaction graph denoising (discrete judgments); the two work synergistically at different granularities.
Engineering Practicality: LLM inference is performed offline prior to training, introducing no additional overhead to the online training of the recommendation model; the framework is compatible with different GNN backbones.
Principled Hallucination Mitigation: Handling LLM hallucinations via random edge dropout combined with contrastive learning is more elegant than simple filtering strategies.

Limitations & Future Work¶

High LLM Cost: Calling GPT-4o to score candidate interaction pairs one by one incurs non-negligible API costs and latency at large scale.
Oversimplified Classification Criterion: The binary hard/noisy rule (a pair counts as a hard sample only if both perspectives rate it High) may be overly conservative, since intermediate cases such as High/Mid pairs are all treated as noise.
Validation Limited to GNN Backbones: Experiments are conducted only on NGCF and LightGCN; generalizability to sequential recommendation models (e.g., SASRec) or non-graph models is not verified.
Dependence on a Specific LLM Embedding Model: The framework relies on OpenAI's text-embedding-ada-002; whether performance remains stable when substituting open-source LLMs is unknown.
Candidate Pair Selection Depends on Pre-trained Model Quality: The candidate set for logical relevance inference depends on the quality of the pre-trained recommendation model; a poor pre-trained model may propagate errors downstream.

Comparison with related methods:

Method Category	Representative Methods	Core Difference
Instance-level denoising	T-CE, BOD	Relies on loss/prediction score patterns to identify noise; cannot handle hard-noisy confusion
Representation-level denoising	SGL, SimGCL	Improves robustness via data augmentation but does not explicitly distinguish hard from noisy samples
LLM-enhanced recommendation	RLMRec	Leverages LLM embeddings to enhance representations but lacks hard sample identification capability
LLM-enhanced denoising	LLaRD	Uses LLMs to assist denoising but does not fully exploit dual semantic-logical signals
LLMHNI (this paper)	—	Jointly leverages semantic and logical relevance to resolve hard-noisy confusion, with a built-in hallucination mitigation mechanism

LLMs in recommender systems can serve not only as representation enhancers but also as "judges" to assess the reliability of interactions; this paradigm is generalizable to other scenarios requiring noise discrimination.
The goal alignment strategy (aligning LLM embeddings to a task-specific space) has broad applicability and can be adopted in any downstream task that employs LLM embeddings.
The cross-graph contrastive alignment paradigm—using graphs constructed from different signal sources to mutually supervise each other—is transferable to knowledge graph completion, social network analysis, and related fields.

Rating¶

Novelty: 8/10 — First systematic treatment of the hard-noisy confusion problem; the dual-signal design is innovative.
Experimental Thoroughness: 8/10 — Three datasets, two backbones, complete ablations, and noise robustness validation are provided, though non-GNN backbone verification is absent.
Writing Quality: 7/10 — Motivation is clearly articulated, but the notation-heavy formalism and some verbose passages reduce readability.
Value: 7/10 — Addresses a practical problem, though LLM costs limit the practicality of large-scale deployment.