Forest-Based Graph Learning for Semi-Supervised Node Classification¶
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=5asbtzIVpS
Code: https://anonymous.4open.science/r/FGL/
Area: Graph Learning / Semi-Supervised Node Classification
Keywords: Spanning trees, Forest, Long-range info propagation, Homophily, Linear complexity, Graph Transformer
TL;DR¶
This work reinterprets message passing on graphs as "transmission across multiple spanning trees (forests)." By using homophily-guided sampling to select high-quality trees and a linear-time tree aggregator, the method achieves a global receptive field with \(O(n+m)\) complexity, outperforming both deep GNNs and Graph Transformers in semi-supervised node classification.
Background & Motivation¶
Background: To capture long-range knowledge, current graph neural networks follow two main paths: deep local models (e.g., GCNII) expand receptive fields by stacking many layers, while shallow global models (Graph Transformers like Graphormer/DIFFormer) cover all node pairs via 1-2 layers of global attention.
Limitations of Prior Work: Both paths are bottlenecked by complexity. Deep GNNs suffer from many layers, serial execution, and over-smoothing; Graph Transformers face quadratic complexity \(O(n^2)\), leading to OOM on large graphs. Attempted sparsification techniques (adaptive selection, graph rewiring) often sacrifice global coverage or rely on complex strategies that may lose vital node interactions.
Key Challenge: The authors abstract this into a formula: Total Cost = Single Structure Cost × Number of Structures. Local primitives (1-hop neighborhoods, short random walks) have cheap single-structure costs but require a massive number of structures to cover long distances; global operators use few structures but have explosive single-structure costs. Balancing these two factors is an inherent flaw in existing paradigms.
Goal: Find an intermediate hierarchical structure that suppresses both factors simultaneously to break the "cost-efficiency vs. global receptive field" deadlock.
Core Idea: [Spanning trees as the sparsest global structures] A spanning tree is the smallest subgraph connecting all nodes. When the number of structures is limited, it is the "simplest yet globally covering" structure. Since a single tree contains incomplete knowledge, a forest (multiple complementary spanning trees) is used. Graph learning is reformulated as "information transmission over forests," utilizing homophily-guided sampling to ensure tree quality and a linear-time tree aggregator for efficiency.
Method¶
Overall Architecture¶
The FGL (Forest-based Graph Learning) pipeline consists of four steps: Preprocessing to augment the graph for connectivity and homophily → Tree Sampler to draw \(N_T\) spanning trees from a distribution biased toward high homophily → Tree Aggregator to propagate global messages in linear time on each tree → Tree Fuser to merge multi-tree results with local information for final classification.
flowchart LR
G[Original Graph G] --> P[Preprocessing: <br/>Pseudo-labels + kNN Augmentation]
P --> Ghat[Augmented Graph Ĝ: <br/>Connected & More Homophilous]
Ghat --> HE[Homophily Estimator: <br/>Local Attention Edge Scoring]
HE --> S[Tree Sampler: <br/>Wilson's Algo for N_T Trees]
S --> A[Tree Aggregator: <br/>Two Recursions O(n)]
A --> F[Tree Fuser: <br/>RowNorm + Mean + Local Residual]
F --> H[Final Embedding H'': <br/>Linear Prediction]
Key Designs¶
1. Homophily-guided Tree Sampler: Biasing the forest toward "homophilous connections." For node classification, a "good" tree has a high homophily rate (edges connecting similar nodes) and diversity (avoiding excessive overlap). The tree score is defined as the product of edge scores, inducing a distribution \(P_{\hat G}(T)=\prod_{e\in T}s(e)\big/\sum_{T\subseteq\hat G}\prod_{e\in T}s(e)\). By assigning high scores to homophilous edges and low scores to heterophilous ones, the distribution biases toward high-quality trees. Edge scores are provided by a local attention homophily estimator: \(\alpha_{i\to j}=\mathrm{softmax}_j(Q_iK_j^\top/\sqrt c)\), trained with pseudo-labels \(Y'\). Edge scores are \(s(e)=(\alpha_{i\to j}+\alpha_{j\to i})/2\). Finally, \(N_T\) trees are independently sampled using a weighted Wilson’s algorithm at nearly \(O(n)\) cost per tree. Theoretically (Theorem 2), the authors prove that as the score ratio \(\Delta=p/q\) (homophilous/heterophilous score) increases, the expected homophily rate of the distribution rises monotonically toward an upper bound—a more accurate estimator induces a higher quality tree distribution.
2. Linear-time Tree Aggregator: Compressing quadratic interactions via dual recursions. The core observation is that for adjacent nodes \(u,v\) on a tree, global aggregated messages differ only in the direction of one edge. If the aggregator \(f_{\mathrm{Agg}}\) satisfies "composability/decomposability" properties—existing operators \(\mathcal M_+/\mathcal M_-\) such that merged results can be reversed (Property I/II)—then global propagation can be completed via two recursions (Theorem 1). Specifically, a bottom-up recursion computes subtree aggregations \(S_u=f_{\mathrm{Agg}}(\{S_v\}_{v\in \mathrm{Child}(u)}\cup\{g(H_u)\})\) (Recursion I), followed by a top-down recursion that removes self-contribution via \(\mathcal M_-\) and adds parent information via \(\mathcal M_+\) to obtain \(H'_v=\mathcal M_+(S_v,\mathcal M_-(H'_{\mathrm{Fa}(v)},S_v))\) (Recursion II). The implementation uses weighted sum/difference as linear variants:
This design achieves global propagation equivalent to quadratic node-pair interactions with linear runtime. It supports parallelism across trees and within single trees (by restructuring trees to be "short and wide").
3. Tree Fuser: Recovering local knowledge + stabilizing representations. Since single trees are sparse, local information is first calculated as \(H=(\beta_1\hat A_{\hat G}+\beta_2\alpha+(1-\beta_1-\beta_2)I)^{K_L}XW_H\) (low-order propagation with \(K_L\le2\)). After running the aggregator on \(N_T\) trees to get \(H'^{(k)}\), RowNorm is applied for numerical stability, and the results are averaged for global information \(H'=\mathrm{Mean}\{\mathrm{RowNorm}(H'^{(k)})\}\). Finally, a residual coefficient \(\gamma\) balances global and local components: \(H''=(1-\gamma)H'+\gamma H\). Training consumes only \(O((n+m)Kd)\) time and space per epoch.
Key Experimental Results¶
Main Results¶
On 9 datasets (Homophilous: Cora/Citeseer/Pubmed/ArXiv; Heterophilous: Actor/Cornell/Texas/Wisconsin/Flickr), FGL achieved an average rank of 1.22 against 26 baselines:
| Method | Category | Cora | Citeseer | Pubmed | Actor | Cornell | Texas | Wisconsin | ArXiv | Flickr | Avg.Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GCNII | DeepGNN | 85.34 | 73.24 | 79.88 | 34.64 | 74.61 | 69.19 | 70.31 | 51.91 | 41.79 | 8.78 |
| SGFormer | GT | 82.38 | 71.82 | 80.64 | 37.80 | 68.65 | 78.92 | 80.00 | 45.73 | 40.13 | 7.22 |
| DIFFormer | GT | 83.32 | 74.46 | 78.16 | 34.51 | 60.00 | 68.11 | 63.92 | 53.60 | 44.25 | 10.56 |
| GraphMamba | Mamba | 54.36 | 58.98 | 70.90 | 36.05 | 74.05 | 77.29 | 80.39 | 33.59 | 42.30 | 13.89 |
| Ours | Forest | 85.46 | 74.42 | 81.00 | 39.88 | 83.24 | 91.89 | 86.27 | 56.47 | 47.22 | 1.22 |
Gain: Average accuracy improved by 16.2%, 16.1%, and 11.9% compared to GT, DIFFormer, and GCNII, respectively. On Wisconsin, the lead over these baselines reached 20-35%.
Ablation Study¶
| No. | Variant | Cora | Pubmed | Actor | Texas | Wisconsin | Flickr |
|---|---|---|---|---|---|---|---|
| (1) | w.o. Global Sub-module | 80.00 | 76.13 | 34.73 | 82.88 | 83.92 | 39.63 |
| (2) | w.o. Local Sub-module | 82.18 | 77.48 | 35.08 | 69.93 | 75.49 | 32.17 |
| (3) | Uniform Tree Sampling | 83.63 | 78.45 | 36.13 | 82.58 | 84.80 | 42.77 |
| (4) | Single Homophily-guided Tree | 83.73 | 78.55 | 36.32 | 84.83 | 85.29 | 42.96 |
| (5) | FGL-Full | 85.46 | 81.00 | 39.88 | 91.89 | 86.27 | 47.22 |
Findings: (4) vs. (3) shows homophily guidance is critical; (5) vs. (4) shows forests capture complementary topology better than single trees.
Key Findings¶
- Efficiency (sec/epoch): On ArXiv, Ours takes 0.246s vs. ANS-GT's 24.5s; it is 2-5x faster than DIFFormer/GCNII. Several GTs suffered OOM on large graphs.
- Tree Count: 6–10 trees is the optimal range across datasets. A small number of trees captures the intrinsic graph structure, while more trees bring diminishing returns.
- Estimator Accuracy: Accuracy increases monotonically with the homophily edge score \(p\), validating Theorem 2.
Highlights & Insights¶
- Paradigm Abstraction: The "Total Cost" formula simplifies the GNN vs. Transformer trade-off, identifying spanning trees as the optimal "low structure count, low structure cost" solution.
- Theoretical-Empirical Loop: Theorem 2 (accuracy vs. distribution quality) and Theorem 1 (linear propagation) create a rare "provable, runnable, and verifiable" trinity.
- Aggregator Versatility: The tree aggregator can swap in linear attention, linear RNNs, or SSMs, providing high extensibility.
- Heterophily Strength: Exceptional performance on heterophilous graphs suggests that homophily-guided sampling effectively identifies useful structural paths in "difficult" graphs.
Limitations & Future Work¶
- Dependence on Pseudo-labels: If pseudo-labels are extremely noisy or sparse, the homophily estimator may degrade, draggin down the tree quality.
- Hyperparameter Sensitivity: \(N_T\), \(\beta\), \(K_L\), \(\gamma\), and \(k\) require tuning; cross-dataset robustness requires more validation.
- Task Scope: Currently limited to semi-supervised node classification; extensions to graph classification and link prediction are needed.
Related Work & Insights¶
- Deep Local Models: Receptive fields expand by stacking layers, but serial bottlenecks remain.
- Shallow Global Models (GTs): Direct global modeling is \(O(n^2)\); sparsification often hurts inductive bias.
- Generative/Random Walk Methods: Using spanning trees as carriers for long-range propagation leverages classical graph theory (Wilson's algorithm) to innovate GNN architectures.
- Insight: When a field faces a "Performance vs. Efficiency" trade-off, rather than approximating at the extremes, it is effective to find an intermediate structure—like the "forest"—that minimizes both cost factors by design.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ A new paradigm; homophily-guided forest transmission is a distinct departure from incremental improvements.
- Experimental Thoroughness: ⭐⭐⭐⭐ Deep baseline comparison across 9 datasets; lacks only extremely large-scale (billion-edge) graph testing.
- Writing Quality: ⭐⭐⭐⭐ Motivation via the "Cost Formula" is highly compelling; clear theoretical-to-empirical flow.
- Value: ⭐⭐⭐⭐⭐ Significant improvements in performance and efficiency suggests this could become a primary methodology for long-range graph learning.