Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees
Conference: AAAI 2026 | arXiv: 2511.06598 | Code: Adaptive-IRC | Area: Graph Neural Networks
Keywords: adaptive residual connections, over-smoothing, Dirichlet energy, PageRank, heterophilic graphs, GNN depth
TL;DR
This paper proposes Adaptive Initial Residual Connections (Adaptive IRC), which allows each node to have a personalized residual strength learned from its initial features. It provides the first theoretical proof of a positive lower bound on the Dirichlet energy of initial residual connections with activation functions (guaranteeing the absence of over-smoothing), and introduces a PageRank-based heuristic variant that achieves comparable or superior performance without learning additional parameters.
Background & Motivation
Background: The core of GNNs is message passing—nodes update their embeddings by aggregating neighborhood information. However, deep GNNs suffer from over-smoothing: repeated neighborhood averaging causes all node embeddings to converge to an indistinguishable state.
Limitations of Prior Work: (1) Static IRC methods (e.g., GCNII) use a shared, fixed residual strength and cannot differentiate across nodes; (2) existing theoretical guarantees are restricted to the linear (activation-free) setting; (3) existing adaptive residual methods are complex and lack theoretical guarantees.
Key Challenge: A mechanism is needed that is both theoretically grounded and capable of adaptively modulating residual strength.
Key Insight: The design draws on the Friedkin–Johnsen opinion-dynamics model, in which individuals differ in their susceptibility to external information; analogously, each node should differ in how strongly it retains its initial features.
Core Idea: Node-level personalized residual strength + theoretical guarantee via a positive lower bound on Dirichlet energy + a PageRank-based heuristic requiring zero additional parameters.
Method
Overall Architecture
The Adaptive IRC message-passing update is
\[
H^{(\ell+1)} = \sigma\!\left(\Lambda \mathcal{A} H^{(\ell)} W^{(\ell)} + (I - \Lambda) H^{(0)} \Theta^{(\ell)}\right),
\]
where \(\Lambda = \text{diag}(\lambda_1, \dots, \lambda_n)\) collects the node-level residual strengths.
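The update above can be sketched in NumPy as follows. The row normalization of \(\mathcal{A}\), the Leaky ReLU slope, and the name `w_att` for the attention vector are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def adaptive_irc_layer(A_hat, H, H0, W, Theta, lam):
    """One Adaptive IRC update:
    H^{l+1} = sigma(Lambda A_hat H W + (I - Lambda) H0 Theta),
    where lam holds per-node residual strengths in (0, 1)."""
    Lam = lam[:, None]                           # broadcasts like diag(lambda)
    return leaky_relu(Lam * (A_hat @ H @ W) + (1.0 - Lam) * (H0 @ Theta))

# toy example: 4-node cycle graph, d = 3 features
n, d = 4, 3
rng = np.random.default_rng(0)
A = np.roll(np.eye(n), 1, axis=0) + np.roll(np.eye(n), -1, axis=0)
A_hat = A + np.eye(n)                            # add self-loops
A_hat = A_hat / A_hat.sum(1, keepdims=True)      # row-normalize (one common choice)
H0 = rng.standard_normal((n, d))
W = rng.standard_normal((d, d)) * 0.1
Theta = rng.standard_normal((d, d)) * 0.1

# learned strengths: lam = sigmoid(H0 @ w_att), shared across layers
w_att = rng.standard_normal(d)
lam = 1.0 / (1.0 + np.exp(-(H0 @ w_att)))

H1 = adaptive_irc_layer(A_hat, H0, H0, W, Theta, lam)
print(H1.shape)  # (4, 3)
```

Because `lam` is computed once from \(H^{(0)}\), the same strengths are reused at every layer, which is what keeps the added parameter cost to a single projection.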
Key Designs
- Parameterization of Adaptive Residual Strength
  - Function: generates a personalized residual weight for each node.
  - Mechanism: \(\Lambda = \text{diag}(\sigma(H^{(0)} W_{\text{att}}))\), where the sigmoid ensures the strengths lie in \((0,1)\).
  - Design Motivation: weights derived from the initial features generalize to unseen nodes, and sharing \(W_{\text{att}}\) across layers keeps the parameter count low.
- Proof of a Positive Lower Bound on Dirichlet Energy (Theorem 2)
  - Core Result: \(\mathcal{E}(H^{(\ell+1)}) \geq \dfrac{\zeta \bar{\sigma}_r(\Theta)}{1 - \eta \bar{\sigma}_r} \mathcal{E}(H^{(0)}) > 0\)
  - Key Quantities: \(\eta = \alpha^2 \lambda_{\min}^2 \sigma_r^2(\mathcal{A})\), \(\zeta = \alpha^2 (1 - \lambda_{\max})^2\)
  - Proof Outline: Lemma 1 (energy lower bound for weight matrices) + Lemma 2 (energy lower bound for the adjacency operation) → Corollary 1 → energy preservation under Leaky ReLU → recursive unrolling yields the convergence lower bound.
  - Significance: the first theoretical guarantee of over-smoothing mitigation for IRC with nonlinear activations.
- PageRank-Based Heuristic Variant
  - Function: replaces learned residual strengths with PageRank scores.
  - Mechanism: the top-\(k\)% of nodes by PageRank receive \(\lambda_{\max}\); the remainder receive \(\lambda_{\min}\).
  - Design Motivation: node centrality is positively correlated with the optimal residual strength.
  - Advantage: requires no additional learned parameters while achieving comparable or superior performance.
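The PageRank heuristic can be sketched as below. The damping factor, `top_frac`, and the \(\lambda_{\max}/\lambda_{\min}\) values are illustrative hyperparameters, not the paper's reported settings:

```python
import numpy as np

def pagerank(A, damping=0.85, iters=100):
    """Power-iteration PageRank; dangling nodes teleport uniformly."""
    n = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix (uniform rows for dangling nodes)
    P = np.divide(A, out_deg,
                  out=np.full_like(A, 1.0 / n, dtype=float),
                  where=out_deg > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (P.T @ r)
    return r

def pagerank_lambdas(A, top_frac=0.2, lam_max=0.9, lam_min=0.1):
    """Assign lam_max to the top-k% nodes by PageRank, lam_min to the rest."""
    r = pagerank(A.astype(float))
    k = max(1, int(np.ceil(top_frac * len(r))))
    lam = np.full(len(r), lam_min)
    lam[np.argsort(r)[-k:]] = lam_max
    return lam

# demo: 5-node star graph — the hub is the single top-PageRank node
A_star = np.zeros((5, 5))
A_star[0, 1:] = 1.0
A_star[1:, 0] = 1.0
lam = pagerank_lambdas(A_star, top_frac=0.2)
print(lam)  # hub (node 0) gets lam_max, leaves get lam_min
```

Since PageRank depends only on the graph topology, `lam` can be precomputed once before training, which is why this variant adds zero learned parameters.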
Complexity
\(O(|E|d + nd^2)\) per layer, identical to vanilla GCN.
Key Experimental Results
Node Classification — Comparison with SOTA
Accuracy (%) ± std; H denotes each dataset's homophily ratio.
| Method | Cora (H = 0.83) | Texas (H = 0.11) | Wisconsin (H = 0.21) | Chameleon (H = 0.23) | Squirrel (H = 0.22) |
|---|---|---|---|---|---|
| GCN | 79.2±0.4 | 55.9±6.4 | 47.1±8.5 | 33.4±2.2 | 27.2±0.7 |
| GCNII | 79.9±0.5 | 59.5±5.3 | 60.4±7.4 | 36.2±2.7 | 28.8±1.0 |
| DirGNN | 77.5±1.2 | 84.6±6.1 | 82.2±2.3 | 60.6±2.2 | 45.3±1.5 |
| IRC (Learned) | 80.1±1.0 | 73.0±5.8 | 82.4±4.7 | 64.1±1.1 | 47.7±2.2 |
| IRC (PageRank) | 80.7±0.4 | 77.0±6.8 | 79.0±3.3 | 65.0±2.0 | 49.0±2.2 |
Improvements on Heterophilic Graphs (vs. GCNII)
| Dataset | Texas | Wisconsin | Cornell | Chameleon | Squirrel |
|---|---|---|---|---|---|
| Gain (accuracy percentage points) | +17.5 | +18.6 | +25.4 | +28.8 | +20.2 |
Over-Smoothing Mitigation
- The Dirichlet energy of GCN/GAT/GraphSAGE decays exponentially with depth, approaching zero.
- Adaptive IRC maintains positive energy, remaining stable at 16 layers.
- Both variants sustain consistently high performance as depth increases beyond 6 layers.
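A minimal simulation reproduces this decay qualitatively. The Dirichlet-energy convention \(\mathrm{tr}(H^{\top} L H)\), the random graph, and the fixed strength \(\lambda = 0.7\) are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def dirichlet_energy(H, L):
    """E(H) = tr(H^T L H), with L the symmetric normalized Laplacian."""
    return float(np.trace(H.T @ L @ H))

# small random undirected graph with symmetric normalization
rng = np.random.default_rng(1)
n, d, depth = 20, 4, 16
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # undirected, no self-loops
deg = np.maximum(A.sum(axis=1), 1.0)         # guard isolated nodes
D_inv_sqrt = np.diag(deg ** -0.5)
A_hat = D_inv_sqrt @ A @ D_inv_sqrt          # normalized adjacency
L = np.eye(n) - A_hat                        # normalized Laplacian

H0 = rng.standard_normal((n, d))
H_plain, H_res = H0.copy(), H0.copy()
lam = 0.7                                    # fixed residual strength (illustrative)
for _ in range(depth):                       # 16 propagation steps
    H_plain = A_hat @ H_plain                        # vanilla propagation
    H_res = lam * (A_hat @ H_res) + (1 - lam) * H0   # initial residual

e0 = dirichlet_energy(H0, L)
e_plain = dirichlet_energy(H_plain, L)
e_res = dirichlet_energy(H_res, L)
print(e0, e_plain, e_res)  # plain energy collapses toward 0; residual stays positive
```

The constant re-injection of \((1-\lambda)H^{(0)}\) is precisely what keeps the energy bounded away from zero, mirroring the mechanism behind Theorem 2.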
Key Findings
- The PageRank variant performs comparably to or better than the learned variant, indicating that the heuristic is sufficient.
- Improvements are largest on heterophilic graphs (+17–29%), as adaptive residuals can distinguish between similar and dissimilar neighbors.
- The proposed method outperforms all baselines on all datasets except Actor.
Highlights & Insights
- Theorem 2 is rigorous: the first proof of a positive lower bound on Dirichlet energy for nonlinear IRC.
- The PageRank variant is surprisingly effective: centrality-based heuristics provide adaptive capability almost for free.
- Opinion dynamics analogy: GNN message passing ↔ opinion propagation in social networks.
- Rank preservation (Theorem 1): in the simplified setting, the rank of the embedding matrix is fully preserved.
Limitations & Future Work
- The PageRank threshold and \(\lambda\) values still require tuning.
- Theorem 2 relies on a cross-layer positive alignment assumption (Property 2).
- Only node classification is evaluated; graph classification remains untested.
- Comparison with more recent GNN methods (e.g., GREAD) is absent.
- The energy lower bound requires a Leaky-ReLU-style activation, restricting the choice of activation function.
Related Work & Insights
- The Friedkin–Johnsen model demonstrates that sociological models can provide useful inductive biases for GNN architecture design.
- The success of the PageRank variant suggests that graph-topological priors are underutilized.
- Dirichlet energy analysis is broadly applicable for evaluating any novel message-passing mechanism.
Rating
⭐⭐⭐⭐
- Novelty ⭐⭐⭐⭐: The theoretical proof and the PageRank variant represent significant contributions.
- Experimental Thoroughness ⭐⭐⭐⭐: 9 datasets, in-depth analysis, and energy visualization.
- Writing Quality ⭐⭐⭐⭐: Theoretical derivations are clearly presented.
- Value ⭐⭐⭐⭐: Provides a theoretically grounded, lightweight solution for deep GNNs.