PKD: Preference-driven Knowledge Distillation for Few-shot Node Classification

Conference: NeurIPS 2025
arXiv: 2510.10116
Code: https://github.com/GEEX-Weixing/PKD
Area: Graph Learning / Few-shot Learning
Keywords: Few-shot node classification, LLM-GNN collaboration, knowledge distillation, RL-based teacher selection, text-attributed graphs

TL;DR

PKD is a framework that jointly leverages LLMs and multiple GNN teachers for few-shot node classification on text-attributed graphs. A GNN-preference node selector (GNS) uses KL divergence-based uncertainty to identify nodes requiring LLM annotation, while a node-preference GNN selector (NGS) employs RL to match each node with its optimal GNN teacher. PKD achieves consistent state-of-the-art performance across 9 datasets (e.g., Cornell 87% vs. baselines 59–82%).

Background & Motivation

Background: Few-shot node classification on text-attributed graphs (TAGs) requires combining LLMs' language understanding with GNNs' graph-structural modeling. Existing methods either use LLMs to generate labels for GNN training or align the two via adapter modules.

Limitations of Prior Work: (a) The embedding spaces of decoder-only LLMs and GNN encoders are substantially misaligned, making direct alignment difficult; (b) local topology varies across nodes, so no single GNN architecture is optimal for every node; (c) LLM inference is expensive and should not be applied uniformly to all nodes.

Key Challenge: World knowledge from LLMs and structural awareness from GNNs are complementary yet computationally asymmetric — the key challenge is how to allocate them intelligently.

Goal: (a) Determine which nodes warrant LLM annotation; (b) identify the optimal GNN teacher for each node.

Key Insight: Bidirectional preference-driven collaboration — GNNs inform the LLM of nodes they are uncertain about, while the LLM guides each node's choice of message-passing mechanism.

Core Idea: GNN uncertainty → select nodes for LLM annotation + RL-tuned LLM → select the optimal GNN teacher per node = bidirectional preference-driven LLM-GNN collaboration.

Method

Overall Architecture

  • GNS module: \(B\) GNN teachers independently produce predictions → KL divergence-based consensus quantifies uncertainty \(\delta_K(v)\) → high-uncertainty nodes and their KNN neighbors are sent to the LLM for annotation.
  • NGS module: a fine-tuned LLM serves as a PPO-based RL agent → state = node semantic/structural/prediction features → action = GNN teacher selection → reward = distillation-loss improvement + classification accuracy.
  • KD: the selected teacher supervises the student GNN.

Each module is sketched in code after the Key Designs list below.

Key Designs

  1. GNN-Preference Node Selector (GNS):

    • Function: Select nodes requiring LLM annotation.
    • Mechanism: \(\delta_K(v) = \sum_{i<j} [D_{KL}(f_{T_i}(v) \| f_{T_j}(v)) + D_{KL}(f_{T_j}(v) \| f_{T_i}(v))]\) — the sum of symmetric pairwise KL divergences among the \(B\) GNN teachers' predictions quantifies uncertainty for node \(v\). High uncertainty implies disagreement among the teachers, signaling the need for LLM assistance. A DNS step then expands the context of each selected node via KNN retrieval (see the first code sketch after this list).
    • Design Motivation: Costly LLM calls should be reserved for nodes that GNNs cannot handle reliably.
  2. Node-Preference GNN Selector (NGS):

    • Function: Select the optimal GNN teacher for each node.
    • Mechanism: The LLM is fine-tuned as a PPO agent. State = node prompt (semantic + structural + prediction attributes). Action = teacher-selection probability distribution \(\pi_T\). Reward \(R = \eta(\mathcal{L}_{DL}' - \mathcal{L}_{CE}) + (1-\eta) A_{cc}\) — jointly rewarding distillation-loss improvement and classification accuracy (see the second code sketch after this list).
    • Design Motivation: Nodes with different topologies suit different message-passing patterns — dense regions benefit from multi-hop GCN, while sparse regions are better handled by attention-based GAT.
  3. GTA Prompt Fine-tuning:

    • Function: Enable the LLM to understand graph structure.
    • Mechanism: The LLM is fine-tuned on four graph-topology tasks (connectivity, degree, cycle detection, text generation) to acquire graph-structural awareness (an illustrative prompt appears as the third code sketch after this list).
    • Design Motivation: Vanilla LLMs lack graph comprehension; GTA prompts inject the necessary structural knowledge.
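The GNS scoring step can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the paper's implementation: it computes \(\delta_K(v)\) exactly as the symmetric pairwise KL sum above, and stands in for the DNS/KNN retrieval with a plain feature-space nearest-neighbour lookup (function names and the `top_m`, `k` parameters are illustrative assumptions).

```python
import torch

def gns_uncertainty(teacher_probs: torch.Tensor) -> torch.Tensor:
    """delta_K(v): sum of symmetric pairwise KL divergences among B teachers.

    teacher_probs: [B, N, C] softmax outputs of the B GNN teachers.
    Returns a [N] uncertainty score, one per node.
    """
    B = teacher_probs.size(0)
    log_p = teacher_probs.clamp_min(1e-12).log()
    delta = teacher_probs.new_zeros(teacher_probs.size(1))
    for i in range(B):
        for j in range(i + 1, B):
            # D_KL(T_i || T_j) + D_KL(T_j || T_i), summed over classes
            delta += (teacher_probs[i] * (log_p[i] - log_p[j])).sum(-1)
            delta += (teacher_probs[j] * (log_p[j] - log_p[i])).sum(-1)
    return delta

def select_for_llm(delta: torch.Tensor, feats: torch.Tensor,
                   top_m: int = 8, k: int = 3) -> torch.Tensor:
    """Pick the top-m most uncertain nodes, then expand each with its k
    feature-space nearest neighbours (a stand-in for the DNS/KNN retrieval)."""
    seeds = delta.topk(top_m).indices
    dists = torch.cdist(feats[seeds], feats)          # [m, N]
    knn = dists.topk(k + 1, largest=False).indices    # +1: includes the seed
    return torch.unique(torch.cat([seeds, knn.flatten()]))

# Toy usage: 3 teachers, 100 nodes, 4 classes, 16-dim node features.
probs = torch.softmax(torch.randn(3, 100, 4), dim=-1)
nodes_for_llm = select_for_llm(gns_uncertainty(probs), torch.randn(100, 16))
```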
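The NGS reward can likewise be written down directly from the formula above. The sketch below takes \(R = \eta(\mathcal{L}_{DL}' - \mathcal{L}_{CE}) + (1-\eta) A_{cc}\) literally and samples a teacher from \(\pi_T\); how the fine-tuned LLM produces \(\pi_T\), and the precise definitions of the two loss terms, are simplified assumptions here.

```python
import torch

def ngs_reward(dl_loss: float, ce_loss: float, accuracy: float,
               eta: float = 0.5) -> float:
    """R = eta * (L'_DL - L_CE) + (1 - eta) * Acc, read literally from the
    formula above; the paper may scale or define the loss terms differently."""
    return eta * (dl_loss - ce_loss) + (1.0 - eta) * accuracy

# The fine-tuned LLM emits pi_T over the B teachers; a random policy head
# stands in for it here (B = 4 is an arbitrary choice).
pi_T = torch.softmax(torch.randn(4), dim=-1)
teacher_idx = torch.distributions.Categorical(probs=pi_T).sample()
reward = ngs_reward(dl_loss=0.42, ce_loss=0.35, accuracy=0.80)
```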
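For GTA fine-tuning, the summary names the four topology tasks but not the prompt wording, so the template below is purely illustrative (the "degree" task, in a hypothetical format).

```python
def degree_prompt(node_id: int, edges: list[tuple[int, int]]) -> str:
    """Hypothetical GTA-style prompt for the 'degree' topology task; the
    paper's actual template wording is not reproduced in this summary."""
    edge_str = "; ".join(f"({u}, {v})" for u, v in edges)
    return (f"You are given an undirected graph with edges: {edge_str}. "
            f"Question: what is the degree of node {node_id}? "
            f"Answer with a single integer.")

print(degree_prompt(2, [(0, 1), (1, 2), (2, 3)]))  # node 2 has degree 2
```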

Loss & Training

  • \(\mathcal{L}_{KD} = \alpha \cdot \text{soft KD} + \beta \cdot \text{hard-label CE} + \gamma \cdot \text{entropy reg}\) (sketched in code after this list).
  • PPO optimizes the RL agent.
  • Compatible with multiple LLM backbones including Llama-3.1, Qwen2.5, and Mixtral.
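A minimal sketch of the composite distillation objective follows; the temperature \(\tau\), the direction of the entropy term, and the default weights are assumptions, since the summary only names the three components.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            labels: torch.Tensor, alpha: float = 1.0, beta: float = 1.0,
            gamma: float = 0.1, tau: float = 2.0) -> torch.Tensor:
    """L_KD = alpha * soft KD + beta * hard-label CE + gamma * entropy reg."""
    # Soft distillation: KL between temperature-scaled distributions.
    soft = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau ** 2
    # Hard supervision from the few labeled nodes.
    hard = F.cross_entropy(student_logits, labels)
    # Entropy regularizer; minimizing it pushes the student toward confident
    # predictions (whether PKD minimizes or maximizes it is an assumption).
    p = F.softmax(student_logits, dim=-1)
    ent = -(p * p.clamp_min(1e-12).log()).sum(-1).mean()
    return alpha * soft + beta * hard + gamma * ent

# Toy usage: 10 nodes, 4 classes.
loss = kd_loss(torch.randn(10, 4), torch.randn(10, 4),
               torch.randint(0, 4, (10,)))
```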

Key Experimental Results

Main Results (5 labeled nodes/class)

Dataset      PKD       Runner-up            Gain
Cornell      87.0%     82.0% (IceBerg)      +5.0%
Cora         91.14%    79.5% (PopT)         +11.6%
Washington   83.74%    79.76% (IceBerg)     +3.98%
Texas        86.31%    84.85% (FairGKD)     +1.46%

PKD achieves consistently best or second-best performance across all 9 datasets.

Ablation Study

Component       Effect of Removal
GTA prompts     Degraded graph comprehension
DNS module      Lower neighbor selection quality
K-uncertainty   Random node selection performs poorly
Full model      Best

Key Findings

  • The framework generalizes across different LLM backbones (Llama/Qwen/Mixtral), demonstrating backbone-agnostic applicability.
  • K-uncertainty-based selection outperforms random selection, validating the value of intelligent node allocation.
  • RL-based teacher selection outperforms fixed single-teacher assignment, confirming that topological diversity necessitates varied processing strategies.

Highlights & Insights

  • The bidirectional preference-driven design elegantly exploits the complementary strengths of LLMs and GNNs — GNNs identify their own uncertainty, while LLMs determine how to resolve it.
  • RL-based teacher selection is more adaptive than static assignment, accommodating diverse node topologies.

Limitations & Future Work

  • Training \(B\) GNN teachers alongside LLM fine-tuning incurs substantial computational cost.
  • The framework is limited to node-level classification and has not been extended to edge- or graph-level tasks.
  • RL training may exhibit instability.
  • vs. GraphLLM: GraphLLM replaces GNNs with LLMs entirely; PKD promotes bidirectional collaboration between the two.
  • vs. TAPE: TAPE uses LLMs to generate textual features for GNNs; PKD engages in deeper, bidirectional interaction.

Rating

  • Novelty: ⭐⭐⭐⭐ Bidirectional preference-driven design + RL-based teacher selection
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 datasets + 3 LLM backbones + ablation study
  • Writing Quality: ⭐⭐⭐⭐ Framework description is clear and well-structured
  • Value: ⭐⭐⭐⭐ A practical solution for few-shot graph learning
  • The two-level preference-driven design — GNN uncertainty selects nodes for LLM annotation, and the RL-tuned LLM selects each node's GNN teacher — reflects the key insight that different nodes suit different GNN architectures.
  • PKD surpasses prior state-of-the-art on few-shot node classification and maintains its advantage as the number of labeled samples increases.
  • The core contribution lies in the conceptual simplicity and empirical effectiveness of its design.
  • Experimental results comprehensively validate the central hypotheses.