
Task-Aware Retrieval Augmentation for Dynamic Recommendation

Conference: AAAI 2026 · arXiv: 2511.12495 · Code: None
Area: Time Series / Dynamic Recommendation Systems
Keywords: Retrieval Augmentation, Dynamic Graph Recommendation, Task-Awareness, Graph Transformer, Temporal Generalization

TL;DR

This paper proposes TarDGR, a framework that automatically constructs training data via a task-aware evaluation mechanism, trains a Graph Transformer to assess the task relevance of historical subgraphs, and retrieves and integrates task-relevant subgraphs at inference time to enhance temporal generalization in recommendation.

Background & Motivation

Problem Setting

Dynamic recommendation systems provide personalized suggestions by modeling the temporal evolution of user–item interactions. Mainstream approaches adopt a pretrain-finetune paradigm: structural patterns are first learned from historical graph snapshots and then fine-tuned on new temporal graphs.

Limitations of Prior Work

Temporal distribution shift: Interaction graphs differ between the pretraining and fine-tuning phases; as user interests continuously evolve, previously learned patterns may become inapplicable, degrading generalization performance.

Existing RAG methods ignore task-semantic relevance: Prior graph retrieval-augmented methods (e.g., RAGRAPH) retrieve subgraphs based solely on structural and feature similarity, disregarding the semantic task relevance between retrieved subgraphs and the query graph. Structurally similar subgraphs may be harmful when semantically inconsistent with the target task.

Lack of automated task-aware data construction: The complexity of graph data makes manual annotation of task relevance practically infeasible, and existing frameworks lack mechanisms to evaluate whether a retrieved subgraph genuinely benefits a specific recommendation task.

Core Challenges

  • C1: How to effectively identify historical subgraphs that are truly beneficial for the recommendation task?
  • C2: How to enable the model to understand task-specific requirements without manual annotation?

Method

Overall Architecture

TarDGR consists of three core components:

  1. Task-Aware Evaluation Mechanism: automatically constructs task-aware training data.
  2. Graph Transformer-based Task-Aware Model: assesses the relevance of subgraphs to the current task.
  3. Task-Aware Retrieval Inference: retrieves and integrates task-relevant subgraphs.

Key Designs

1. Task-Aware Evaluation Mechanism

This mechanism automatically quantifies the contribution of a candidate historical subgraph \(G(v_r)\) to the current recommendation query \(G(v_q)\), without requiring manual annotation.

Core Idea: Measure the change in similarity between the query graph and the positive sample set before and after integrating the candidate subgraph.

Pre-fusion similarity: The average cosine similarity between the query subgraph embedding and the positive subgraph set is computed as:

\[\overline{\mathrm{Sim}}_{\text{before}} = \frac{1}{N^+} \sum_{i=1}^{N^+} \mathrm{Cos}(\mathrm{Enc}(G(v_q)), \mathrm{Enc}(G(v_q)_i^+))\]

Candidate subgraph fusion: A fused representation is obtained by constructing edges between central nodes and applying graph convolution:

\[\mathrm{Enc}(\widetilde{G(v_q)}) = f_{\text{fuse}}(\mathrm{Enc}(G(v_q) \oplus G(v_r)))\]

Post-fusion similarity: \(\overline{\mathrm{Sim}}_{\text{after}}\) is computed using the same procedure.

Task relevance score: \(\Delta \mathrm{Rel} = \overline{\mathrm{Sim}}_{\text{after}} - \overline{\mathrm{Sim}}_{\text{before}}\)

  • \(\Delta \mathrm{Rel} > 0\): the candidate subgraph benefits the task (positive sample)
  • \(\Delta \mathrm{Rel} \approx 0\): irrelevant
  • \(\Delta \mathrm{Rel} < 0\): harmful to the task (negative sample)

This yields the task-aware dataset: \(\mathcal{D}_{\text{aware}} = \{(G(v_q), G(v_r), C_r)\}\)

Design Motivation: Conventional methods retrieve nearest neighbors based solely on embedding distance, yet structurally similar subgraphs may be semantically irrelevant or even conflicting with the target recommendation task. By directly measuring whether fusion brings the query closer to positive samples, the mechanism precisely captures task-level utility.
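The evaluation mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the helper names (`cosine`, `task_relevance`) and the toy embeddings are assumptions, and `fused` stands in for the output of \(f_{\text{fuse}}(\mathrm{Enc}(G(v_q) \oplus G(v_r)))\).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def task_relevance(enc_query, enc_positives, enc_fused):
    """Delta-Rel: change in mean similarity to the positive set
    before vs. after fusing the candidate subgraph."""
    sim_before = np.mean([cosine(enc_query, p) for p in enc_positives])
    sim_after = np.mean([cosine(enc_fused, p) for p in enc_positives])
    return sim_after - sim_before

# Toy example: fusion moves the query embedding toward the positive set,
# so Delta-Rel > 0 and the candidate is labeled a positive sample.
q = np.array([1.0, 0.0])
positives = [np.array([0.0, 1.0]), np.array([0.5, 1.0])]
fused = np.array([0.3, 1.0])        # stand-in for the fused representation
delta = task_relevance(q, positives, fused)
label = 1 if delta > 0 else 0       # C_r in the task-aware dataset
```

A candidate whose fusion pulls the query away from the positives would yield a negative `delta` and hence a negative label, matching the three cases above.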

2. Graph Transformer-based Task-Aware Model

This model jointly encodes the query subgraph and the candidate subgraph to assess their task relevance.

Subgraph Semantic Encoder:

  • Node embeddings are initialized using a pretrained dynamic GNN to encode historical temporal dependencies.
  • Query–candidate subgraph pairs are jointly encoded and concatenated into a semantic representation \(h\).
  • After adding positional encodings, multi-head self-attention captures fine-grained relational dependencies between the query and the candidate.
  • Output: semantic representation \(h_{\text{sem}}\).

Subgraph Structure Encoder:

  • Position-augmented embeddings are linearly projected and processed through multi-layer attention.
  • Subgraph structural patterns are aggregated via normalized adjacency propagation: \(h_{\text{str}} = \mathcal{D}^{-1}(\mathcal{A}_s + \mathbf{I}) h_{\text{ffn}} W\)
  • This encodes the topological structure of the graph.
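A rough sketch of the propagation rule \(\mathcal{D}^{-1}(\mathcal{A}_s + \mathbf{I})\,h\): self-loops are added, then features are averaged over each node's neighborhood. The learned projection \(W\) is omitted for brevity, and the 3-node path graph is purely illustrative.

```python
import numpy as np

def propagate(A, H):
    """Row-normalized propagation with self-loops:
    h_str ~ D^{-1} (A + I) H  (learned projection W omitted)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # inverse degree matrix
    return D_inv @ A_hat @ H

# 3-node path graph with 2-dim node features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
H_str = propagate(A, H)   # each row is the mean over the node's closed neighborhood
```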

Fusion and Scoring: Semantic and structural encodings are concatenated, and a parameterized scoring function outputs a scalar relevance score:

\[s_i = \mathcal{S}_\psi(h_{\text{task}}) = w^\top \mathrm{ReLU}(W h_{\text{task}} + b)\]
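The scoring function is a small two-layer network. The sketch below assumes arbitrary dimensions and random parameters purely for illustration; in the paper \(h_{\text{task}}\) is the concatenation of \(h_{\text{sem}}\) and \(h_{\text{str}}\).

```python
import numpy as np

rng = np.random.default_rng(0)

def relevance_score(h_task, W, b, w):
    """s = w^T ReLU(W h + b): maps the fused task representation
    to a scalar relevance score."""
    hidden = np.maximum(0.0, W @ h_task + b)  # ReLU activation
    return float(w @ hidden)

d, d_hidden = 8, 4
h_task = rng.normal(size=d)        # concat of h_sem and h_str (illustrative)
W = rng.normal(size=(d_hidden, d))
b = np.zeros(d_hidden)
w = rng.normal(size=d_hidden)
s = relevance_score(h_task, W, b, w)
```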

3. BiSCL Pretraining

A Bi-Level Supervised Correlation Loss is employed to jointly supervise numerical fidelity and ordinal consistency.

Numerical fitting loss:

\[\mathcal{L}_{\text{mtl}} = \frac{1}{N} \sum_{k=1}^{N} \left(\mathcal{R}_\theta(h_{q,r}^{(k)}, \mathcal{A}_s) - C_k\right)^2\]

Ordinal consistency loss (preserving the ranking order across samples):

\[\mathcal{L}_{\text{ocl}} = \log\left[1 + \sum_{k,l} \exp\left(\frac{\mathcal{R}_\theta(h_{q,r}^{(l)}) - \mathcal{R}_\theta(h_{q,r}^{(k)})}{\tau}\right)\right]\]

Total loss: \(\mathcal{L}_{\text{BiSCL}} = \rho \cdot \mathcal{L}_{\text{ocl}} + (1-\rho) \cdot \mathcal{L}_{\text{mtl}}\)
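A hedged sketch of the combined loss: the MSE term fits the numerical relevance targets, while the listwise term penalizes pairs whose predicted order contradicts the targets. One assumption beyond the formulas above: the pairwise sum here runs only over pairs \((k, l)\) with \(C_k > C_l\), so that a violated ordering (score of \(l\) above score of \(k\)) is what gets penalized.

```python
import numpy as np

def biscl_loss(scores, targets, rho=0.5, tau=1.0):
    """BiSCL sketch: numerical-fidelity MSE plus an ordinal-consistency
    term over pairs whose target order is violated or nearly so."""
    scores, targets = np.asarray(scores), np.asarray(targets)
    l_mtl = np.mean((scores - targets) ** 2)
    terms = []
    for k in range(len(scores)):
        for l in range(len(scores)):
            if targets[k] > targets[l]:          # target says k ranks above l
                terms.append(np.exp((scores[l] - scores[k]) / tau))
    l_ocl = np.log1p(np.sum(terms)) if terms else 0.0
    return rho * l_ocl + (1 - rho) * l_mtl

# Correctly ordered predictions incur a smaller loss than reversed ones
good = biscl_loss([0.9, 0.5, 0.1], [1.0, 0.5, 0.0])
bad = biscl_loss([0.1, 0.5, 0.9], [1.0, 0.5, 0.0])
```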

Inference and Training Strategy

Inference:

  1. Retrieve Top-K candidate subgraphs via FAISS using the semantic encoder.
  2. Evaluate the relevance score of each candidate using the task-aware model.
  3. Select the Top-M most relevant subgraphs and aggregate them via soft evidence fusion: \(H_{\text{rag}} = \sum_{i=1}^{M} \alpha_i \cdot h_m^i\)
  4. Residual fusion: \(\tilde{h}_q = \beta h_q + (1-\beta) H_{\text{rag}}\)
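Steps 3–4 of the inference pipeline can be sketched as follows. The softmax weighting for \(\alpha_i\) and the specific tensor shapes are assumptions for illustration; the FAISS retrieval and task-aware scoring of steps 1–2 are replaced here by a given score vector.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def rag_fusion(h_q, candidates, scores, M=2, beta=0.7):
    """Top-M selection, soft evidence fusion, then residual fusion.
    candidates: (K, d) retrieved subgraph embeddings; scores: (K,)."""
    top = np.argsort(scores)[::-1][:M]          # Top-M by relevance score
    alpha = softmax(np.asarray(scores)[top])    # soft evidence weights
    H_rag = (alpha[:, None] * candidates[top]).sum(axis=0)
    return beta * h_q + (1 - beta) * H_rag      # residual fusion

rng = np.random.default_rng(1)
h_q = rng.normal(size=4)                # query embedding
cands = rng.normal(size=(5, 4))         # K = 5 retrieved candidates
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.2])
h_tilde = rag_fusion(h_q, cands, scores)
```

With \(\beta = 1\) the residual path reduces to the original query embedding, so the retrieval contribution can be smoothly dialed in or out.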

Training losses:

  • BPR ranking loss (\(\mathcal{L}_{\text{bpr}}\))
  • Margin ranking loss (\(\mathcal{L}_{\text{mrl}}\))
  • Regularization loss (\(\mathcal{L}_{\text{reg}}\))
  • Total: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{bpr}} + \lambda \cdot \mathcal{L}_{\text{mrl}} + \mu \cdot \mathcal{L}_{\text{reg}}\)
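A minimal sketch of how the three terms combine, assuming standard textbook forms of BPR and margin ranking losses and an L2 regularizer (the paper does not spell these out, so the exact forms here are assumptions):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """BPR: -log sigmoid(s_pos - s_neg), averaged over pairs."""
    diff = np.asarray(pos_scores) - np.asarray(neg_scores)
    return float(np.mean(np.log1p(np.exp(-diff))))

def margin_loss(pos_scores, neg_scores, margin=1.0):
    """Margin ranking loss: max(0, margin - (s_pos - s_neg))."""
    diff = np.asarray(pos_scores) - np.asarray(neg_scores)
    return float(np.mean(np.maximum(0.0, margin - diff)))

def total_loss(pos, neg, params, lam=0.1, mu=1e-4):
    """L_total = L_bpr + lambda * L_mrl + mu * L_reg."""
    l_reg = sum(float(np.sum(p ** 2)) for p in params)  # L2 regularization
    return bpr_loss(pos, neg) + lam * margin_loss(pos, neg) + mu * l_reg
```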

Key Experimental Results

Main Results

Results on three dynamic graph recommendation datasets (Recall@20 / nDCG@20):

Method        TAOBAO Recall  TAOBAO nDCG  KOUBEI Recall  KOUBEI nDCG  AMAZON Recall  AMAZON nDCG
LightGCN      22.47          21.89        30.21          22.24        15.07          6.53
SimGCL        22.18          23.15        33.07          23.08        16.10          7.58
RAGRAPH/FT    24.78          24.35        34.27          24.82        18.69          9.09
TarDGR/FT     25.20          24.59        36.52          26.63        19.56          9.70

TarDGR achieves the best performance across all datasets. On Amazon, it improves nDCG by 16.6% and Recall by 14.5% over PRODIGY, and improves nDCG by 6.3% and Recall by 4.4% over RAGRAPH.

Ablation Study

Configuration  TAOBAO Recall  TAOBAO nDCG  AMAZON Recall  AMAZON nDCG  Description
w/o all        24.63          24.02        18.42          8.91         No task-aware retrieval
w/o SEM        24.95          24.48        19.10          9.45         Semantic encoder removed
w/o STR        —              —            —              —            Structure encoder removed
TarDGR         25.20          24.59        19.56          9.70         Full model

Removing any component leads to performance degradation, validating the complementarity of semantic and structural encoding.

Key Findings

  1. Task-aware retrieval substantially outperforms conventional structure-similarity-based retrieval: TarDGR consistently surpasses RAGRAPH, demonstrating that semantic task relevance is more important than pure structural similarity.
  2. Benefits persist even without fine-tuning: TarDGR/NF (non-fine-tuned variant) outperforms most fine-tuned baselines on KOUBEI and AMAZON.
  3. BiSCL's bi-level supervision is effective: Jointly optimizing numerical fidelity and ordinal consistency outperforms single-objective training.
  4. Expressive power of the Graph Transformer is critical: Joint semantic and structural encoding outperforms either branch used in isolation.

Highlights & Insights

  1. Automated task-aware evaluation mechanism: By measuring the change in similarity to positive samples before and after fusion, the mechanism elegantly avoids costly manual annotation and automatically constructs high-quality task-aware training data.
  2. Paradigm shift from "structural similarity" to "task utility": The paper explicitly proposes and empirically validates the important observation that structural similarity does not imply task utility.
  3. Modular design: TarDGR can be integrated as a plug-and-play module into various dynamic graph recommendation frameworks.

Limitations & Future Work

  1. Computational overhead: The task-aware evaluation mechanism requires fusion and evaluation for each candidate subgraph, resulting in high training data construction costs.
  2. Validation limited to link prediction: Recommendation is inherently a link prediction task; applicability to other graph learning tasks remains unverified.
  3. Dependence on positive sample definition: The task relevance score relies on the embedding quality of the pretrained GNN; if the pretrained model is biased, the evaluation mechanism may generate noisy labels.
  4. Retrieval efficiency not discussed: In real-world deployment, real-time retrieval and evaluation over large-scale subgraph repositories may become a bottleneck.

Related Work & Context
  • RAG in NLP: Extending retrieval augmentation from language models to graph-based recommendation introduces the key challenge of defining "task relevance."
  • RAGRAPH: The preceding work retrieves subgraphs based on structural similarity; TarDGR demonstrates the necessity of incorporating task-awareness.
  • GraphPro: A pretrain-finetune baseline for dynamic recommendation; TarDGR achieves significant gains over it via retrieval augmentation.
  • Inspiration: The task-aware evaluation paradigm can be generalized to RAG systems in other domains.

Rating

  • Novelty: ⭐⭐⭐⭐ — Introducing task-awareness into graph RAG is pioneering; the automated evaluation mechanism is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets, diverse baselines, and comprehensive ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear, though the notation involves numerous subscripts and symbols.
  • Value: ⭐⭐⭐⭐ — Provides important reference value for both dynamic recommendation and graph RAG research.