
Kernelized Edge Attention: Addressing Semantic Attention Blurring in Temporal Graph Neural Networks

Conference: AAAI 2026 arXiv: 2602.00596 Code: N/A Area: Graph Learning Keywords: Temporal Graph Neural Networks, Edge Attention, Continuous-Time Kernel, Semantic Blurring, Link Prediction

TL;DR

This paper proposes KEAT (Kernelized Edge Attention for Temporal Graphs), which addresses the semantic attention blurring problem caused by the entanglement of node and edge representations in temporal graph neural networks. By modulating edge features with continuous-time kernels (Laplacian, RBF, and learnable MLP), KEAT achieves up to an 18% MRR improvement over the DyGFormer baseline and up to 7% over the TGN baseline on link prediction tasks.

Background & Motivation

Background: Temporal Graph Neural Networks (TGNNs) aim to capture the temporally evolving structure of interactions in dynamic graphs. Representative methods such as TGN (Temporal Graph Network) and DyGFormer (Dynamic Graph Transformer) incorporate temporal information via time encodings or dedicated architectural designs.

Limitations of Prior Work: Existing TGNNs entangle node representations and edge features when computing attention, failing to distinguish their fundamentally different temporal behaviors: (1) node embeddings change slowly—they aggregate long-term structural context and reflect persistent node properties; (2) edge features change rapidly—each edge corresponds to a timestamped interaction (e.g., a message, transaction, or transfer) and is transient yet information-rich. This mismatch gives rise to semantic attention blurring, where attention weights cannot differentiate slowly drifting node states from rapidly changing edge interactions.

Key Challenge: Nodes and edges operate on different temporal scales, yet the attention computation conflates them—slowly evolving node embeddings dilute the fine-grained temporal dependency information carried by rapidly changing edge features.

Goal: To design an attention mechanism that preserves the distinct roles of nodes and edges, enabling more accurate, interpretable, and temporally aware correlation computation.

Key Insight: Rather than jointly processing nodes and edges within attention, the paper proposes independently modulating the temporal weights of edge features using continuous-time kernel functions, so that more recent edge interactions receive higher weights.

Core Idea: Replace standard mixed attention with kernelized edge attention that uses temporal convolution kernels to decouple the roles of nodes and edges.

Method

Overall Architecture

KEAT is an attention module that can be seamlessly integrated into both Transformer-style (DyGFormer) and message-passing-style (TGN) TGNNs. Given node features and timestamped edge features from a temporal graph, KEAT modulates edge features with a temporal kernel prior to attention computation, producing temporally aware node updates.

Key Designs

  1. Continuous-Time Kernel Function Family

     • Function: Provides time-difference-based weight modulation for edge features.
     • Mechanism: Three temporal kernels are designed: (a) the Laplacian kernel \(K(\Delta t) = \exp(-|\Delta t|/\sigma)\), which applies exponential decay over time differences; (b) the RBF kernel \(K(\Delta t) = \exp(-\Delta t^2/(2\sigma^2))\), which applies Gaussian decay; (c) a learnable MLP kernel, which directly learns the mapping from time differences to weights via a small MLP. Each kernel acts on the difference between an edge's timestamp and the current time, assigning high weights to recent interactions and attenuating distant ones (see the kernel sketch after this list).
     • Design Motivation: Fixed time encodings (e.g., sinusoidal encodings) cannot distinguish temporal spans of varying importance. Continuous-time kernels provide a natural "recency-first" decay pattern, and different kernel forms accommodate different temporal dependency structures.

  2. Kernelized Edge Attention Mechanism

     • Function: Maintains a clear separation between the roles of nodes and edges during attention computation.
     • Mechanism: Standard attention concatenates node and edge features before computing attention scores. KEAT instead first modulates edge features with the temporal kernel as \(e'_{ij} = K(\Delta t_{ij}) \cdot e_{ij}\), then explicitly distinguishes node queries/keys from kernelized edge values during attention. Attention scores are primarily determined by inter-node similarity, while the content of information aggregation is determined by the kernelized edge features (see the attention sketch after this list).
     • Design Motivation: The "who to attend to" aspect of attention should be governed by node relationships (structurally similar nodes attend to each other), while "what to aggregate" should be governed by edge features (recent interactions carry the most current information). Separating the two avoids semantic confusion.

  3. Architecture-Agnostic Plug-and-Play Design

     • Function: Enables KEAT to be combined with diverse TGNN backbone architectures.
     • Mechanism: KEAT adheres to the standard attention interface (Q, K, V) and can directly replace the self-attention layers in DyGFormer or the message aggregation layers in TGN. The only modification is that edge features pass through the temporal kernel modulation before entering the attention computation.
     • Design Motivation: Given the architectural diversity in the TGNN field, a strong attention improvement should be general rather than tailored to a specific architecture.
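
Since no official code is released, here is a minimal PyTorch sketch of the three kernel variants described above. The parameterization details (a learnable log-\(\sigma\) to keep \(\sigma\) positive, and a sigmoid output for the MLP kernel) are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LaplacianKernel(nn.Module):
    """Exponential decay over time differences: K(dt) = exp(-|dt| / sigma)."""

    def __init__(self, sigma: float = 1.0):
        super().__init__()
        # sigma is stored in log-space so it stays positive; it is trained
        # jointly with the rest of the network (assumption).
        self.log_sigma = nn.Parameter(torch.log(torch.tensor(sigma)))

    def forward(self, dt: torch.Tensor) -> torch.Tensor:
        return torch.exp(-dt.abs() / self.log_sigma.exp())


class RBFKernel(nn.Module):
    """Gaussian decay: K(dt) = exp(-dt^2 / (2 * sigma^2))."""

    def __init__(self, sigma: float = 1.0):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.log(torch.tensor(sigma)))

    def forward(self, dt: torch.Tensor) -> torch.Tensor:
        sigma = self.log_sigma.exp()
        return torch.exp(-dt.pow(2) / (2 * sigma.pow(2)))


class MLPKernel(nn.Module):
    """Learnable mapping from dt to a weight; can be non-monotonic.
    The sigmoid bounding the output to (0, 1) is an assumption."""

    def __init__(self, hidden_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, dt: torch.Tensor) -> torch.Tensor:
        # dt: [..., ] -> weight of the same shape
        return self.net(dt.unsqueeze(-1)).squeeze(-1)
```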
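And a hedged sketch of the kernelized edge attention itself, following the separation in design 2: node features alone produce the attention scores, while kernel-modulated edge features are what gets aggregated. The single-head formulation, tensor shapes, and names (`h_dst`, `h_nbr`) are illustrative assumptions. Because it keeps the standard Q/K/V interface, this is also the shape of the plug-and-play replacement described in design 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KernelizedEdgeAttention(nn.Module):
    """Sketch: node features drive the scores (Q, K); kernel-modulated
    edge features drive what is aggregated (V). For simplicity the scores
    here come only from node-node similarity."""

    def __init__(self, node_dim: int, edge_dim: int, out_dim: int,
                 kernel: nn.Module):
        super().__init__()
        self.q = nn.Linear(node_dim, out_dim)
        self.k = nn.Linear(node_dim, out_dim)
        self.v = nn.Linear(edge_dim, out_dim)
        self.kernel = kernel  # e.g., LaplacianKernel / RBFKernel / MLPKernel

    def forward(self, h_dst, h_nbr, e, dt):
        # h_dst: [B, D]      target-node features
        # h_nbr: [B, N, D]   features of N temporal neighbors
        # e:     [B, N, De]  edge features of the N interactions
        # dt:    [B, N]      time since each interaction (t_now - t_edge)
        e_mod = self.kernel(dt).unsqueeze(-1) * e          # e'_ij = K(dt_ij) * e_ij
        q = self.q(h_dst).unsqueeze(1)                     # [B, 1, H]
        k = self.k(h_nbr)                                  # [B, N, H]
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5       # node-node similarity
        attn = F.softmax(scores, dim=-1)                   # [B, N]
        return (attn.unsqueeze(-1) * self.v(e_mod)).sum(1)  # aggregate kernelized edges
```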

Loss & Training

Standard link prediction training objective (binary cross-entropy loss) is used to predict whether future edges exist. The parameters of the temporal kernels (e.g., \(\sigma\), MLP parameters) are trained jointly with the main network in an end-to-end fashion.
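
A hedged sketch of what one such training step could look like; `model`, `link_scorer`, and the batch layout are hypothetical placeholders, since the paper's exact pipeline is not public. The point is that the kernel parameters receive gradients through the modulated edge features like any other weight.

```python
import torch
import torch.nn.functional as F


def train_step(model, link_scorer, batch, optimizer):
    """One step of BCE link prediction: score observed (positive) future
    edges against randomly sampled negative destinations."""
    src, dst, neg_dst, ts = batch        # node ids and interaction times
    h_src = model(src, ts)               # temporal node embeddings
    h_dst = model(dst, ts)
    h_neg = model(neg_dst, ts)
    pos_logits = link_scorer(h_src, h_dst)
    neg_logits = link_scorer(h_src, h_neg)
    loss = F.binary_cross_entropy_with_logits(
        pos_logits, torch.ones_like(pos_logits)
    ) + F.binary_cross_entropy_with_logits(
        neg_logits, torch.zeros_like(neg_logits)
    )
    optimizer.zero_grad()
    loss.backward()   # gradients also flow into sigma / the MLP kernel
    optimizer.step()
    return loss.item()
```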

Key Experimental Results

Main Results

| Backbone  | Dataset           | Metric | KEAT | Original | Gain       |
|-----------|-------------------|--------|------|----------|------------|
| DyGFormer | Multiple datasets | MRR    | Best | Baseline | Up to +18% |
| TGN       | Multiple datasets | MRR    | Best | Baseline | Up to +7%  |

Ablation Study

| Kernel Type        | MRR           | Description                                           |
|--------------------|---------------|-------------------------------------------------------|
| Laplacian Kernel   | Good          | Exponential decay; simple and effective               |
| RBF Kernel         | Good          | Gaussian decay; more sensitive to recent interactions |
| MLP Kernel         | Most flexible | Learns non-monotonic temporal weight patterns         |
| No Temporal Kernel | Baseline      | Original mixed attention                              |

Key Findings

  • KEAT yields significant improvements on two architecturally distinct TGNNs, demonstrating that semantic attention blurring is a pervasive problem.
  • An 18% MRR gain represents a substantial improvement in the link prediction domain.
  • The learnable MLP kernel is the most flexible but not always the best—simple Laplacian/RBF kernels achieve comparable or superior performance on certain datasets.
  • The \(\sigma\) parameter of the temporal kernel reflects the interaction timescale of each dataset—high-frequency interaction datasets favor smaller \(\sigma\) (faster decay).

Highlights & Insights

  • The identification and analysis of "semantic attention blurring" clearly defines a fundamental problem that had previously been overlooked, offering a new improvement direction for future TGNN research.
  • The elegant design of temporal kernel modulation: a single-line modification (multiplying edge features by the temporal kernel) yields significant performance gains.
  • Architecture-agnosticism gives the method broad plug-and-play applicability.

Limitations & Future Work

  • Continuous-time kernels assume monotonic temporal decay (i.e., more recent interactions are always more important), whereas some scenarios may exhibit periodic temporal patterns.
  • Node-level temporal decay is not explored—KEAT performs temporal modulation only at the edge level.
  • The method could be combined with attention sparsification techniques to improve efficiency on large-scale graphs.

Comparison with Related Work

  • vs. DyGFormer: DyGFormer employs patch-based attention operating over temporal sequences but still mixes node and edge information. KEAT explicitly separates the two.
  • vs. TGN: TGN uses a temporal graph attention layer; KEAT's kernel modulation can directly enhance its message passing.
  • vs. Time Encoding Methods (e.g., Time2Vec): Time2Vec encodes time as fixed features, whereas KEAT's kernel modulation offers a more flexible form of temporal integration.

Rating

  • Novelty: ⭐⭐⭐⭐ The identification of semantic attention blurring and the kernelized edge attention solution are novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across two architectures, multiple datasets, and multiple kernel variants.
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation is analyzed in depth; method description is clear.
  • Value: ⭐⭐⭐⭐ Offers plug-and-play improvement value for the TGNN field.