PKD: Preference-driven Knowledge Distillation for Few-shot Node Classification

Conference: NeurIPS 2025
arXiv: 2510.10116
Code: https://github.com/GEEX-Weixing/PKD
Area: Graph Learning / Few-shot Learning
Keywords: Few-shot node classification, LLM-GNN collaboration, knowledge distillation, RL-based teacher selection, text-attributed graphs

TL;DR

PKD is a framework that jointly leverages LLMs and multiple GNN teachers for few-shot node classification on text-attributed graphs. A GNN-preference node selector (GNS) uses KL divergence-based uncertainty to identify nodes requiring LLM annotation, while a node-preference GNN selector (NGS) employs RL to match each node with its optimal GNN teacher. PKD achieves consistent state-of-the-art performance across 9 datasets (e.g., Cornell 87% vs. baselines 59–82%).

Background & Motivation

Background: Few-shot node classification on text-attributed graphs (TAGs) requires combining LLMs' language understanding with GNNs' graph-structural modeling. Existing methods either use LLMs to generate labels for GNN training or align the two via adapter modules.

Limitations of Prior Work: (a) The embedding spaces of decoder-only LLMs and GNN encoders are substantially misaligned, making direct alignment difficult; (b) local topology varies across nodes, so no single GNN architecture is optimal for every node; (c) LLM inference is expensive and should not be applied uniformly to all nodes.

Key Challenge: World knowledge from LLMs and structural awareness from GNNs are complementary yet computationally asymmetric — the key challenge is how to allocate them intelligently.

Goal: (a) Determine which nodes warrant LLM annotation; (b) identify the optimal GNN teacher for each node.

Key Insight: Bidirectional preference-driven collaboration — GNNs inform the LLM of nodes they are uncertain about, while the LLM guides each node's choice of message-passing mechanism.

Core Idea: GNN uncertainty → select nodes for LLM annotation + RL-tuned LLM → select the optimal GNN teacher per node = bidirectional preference-driven LLM-GNN collaboration.

Method

Overall Architecture

  • GNS module: \(B\) GNN teachers independently produce predictions → KL divergence-based consensus quantifies uncertainty \(\delta_K(v)\) → high-uncertainty nodes and their KNN neighbors are sent to the LLM for annotation.
  • NGS module: a fine-tuned LLM serves as a PPO-based RL agent → state = node semantic/structural/prediction features → action = GNN teacher selection → reward = distillation-loss improvement + classification accuracy.
  • KD: the selected teacher supervises the student GNN.

Each module is sketched in code after the Key Designs list below.

Key Designs

  1. GNN-Preference Node Selector (GNS):

    • Function: Select nodes requiring LLM annotation.
    • Mechanism: \(\delta_K(v) = \sum_{i<j} [D_{KL}(f_{T_i}(v) \| f_{T_j}(v)) + D_{KL}(f_{T_j}(v) \| f_{T_i}(v))]\) — the sum of symmetric pairwise KL divergences among the \(B\) GNN teachers' predictions quantifies uncertainty for node \(v\). High uncertainty implies disagreement among the teachers, signaling the need for LLM assistance. A DNS step then expands the context of each selected node via KNN retrieval (see the first code sketch after this list).
    • Design Motivation: Costly LLM calls should be reserved for nodes that GNNs cannot handle reliably.
  2. Node-Preference GNN Selector (NGS):

    • Function: Select the optimal GNN teacher for each node.
    • Mechanism: The LLM is fine-tuned as a PPO agent. State = node prompt (semantic + structural + prediction attributes). Action = teacher-selection probability distribution \(\pi_T\). Reward \(R = \eta(\mathcal{L}_{DL}' - \mathcal{L}_{CE}) + (1-\eta) A_{cc}\) — jointly rewarding distillation-loss improvement and classification accuracy (see the second code sketch after this list).
    • Design Motivation: Nodes with different topologies suit different message-passing patterns — dense regions benefit from multi-hop GCN, while sparse regions are better handled by attention-based GAT.
  3. GTA Prompt Fine-tuning:

    • Function: Enable the LLM to understand graph structure.
    • Mechanism: The LLM is fine-tuned on four graph-topology tasks (connectivity, degree, cycle detection, text generation) to acquire graph-structural awareness (an illustrative prompt appears as the third code sketch after this list).
    • Design Motivation: Vanilla LLMs lack graph comprehension; GTA prompts inject the necessary structural knowledge.
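The GNS scoring step can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the paper's implementation: it computes \(\delta_K(v)\) exactly as the symmetric pairwise KL sum above, and stands in for the DNS/KNN retrieval with a plain feature-space nearest-neighbour lookup (function names and the `top_m`, `k` parameters are illustrative assumptions).

```python
import torch

def gns_uncertainty(teacher_probs: torch.Tensor) -> torch.Tensor:
    """delta_K(v): sum of symmetric pairwise KL divergences among B teachers.

    teacher_probs: [B, N, C] softmax outputs of the B GNN teachers.
    Returns a [N] uncertainty score, one per node.
    """
    B = teacher_probs.size(0)
    log_p = teacher_probs.clamp_min(1e-12).log()
    delta = teacher_probs.new_zeros(teacher_probs.size(1))
    for i in range(B):
        for j in range(i + 1, B):
            # D_KL(T_i || T_j) + D_KL(T_j || T_i), summed over classes
            delta += (teacher_probs[i] * (log_p[i] - log_p[j])).sum(-1)
            delta += (teacher_probs[j] * (log_p[j] - log_p[i])).sum(-1)
    return delta

def select_for_llm(delta: torch.Tensor, feats: torch.Tensor,
                   top_m: int = 8, k: int = 3) -> torch.Tensor:
    """Pick the top-m most uncertain nodes, then expand each with its k
    feature-space nearest neighbours (a stand-in for the DNS/KNN retrieval)."""
    seeds = delta.topk(top_m).indices
    dists = torch.cdist(feats[seeds], feats)          # [m, N]
    knn = dists.topk(k + 1, largest=False).indices    # +1: includes the seed
    return torch.unique(torch.cat([seeds, knn.flatten()]))

# Toy usage: 3 teachers, 100 nodes, 4 classes, 16-dim node features.
probs = torch.softmax(torch.randn(3, 100, 4), dim=-1)
nodes_for_llm = select_for_llm(gns_uncertainty(probs), torch.randn(100, 16))
```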
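The NGS reward can likewise be written down directly from the formula above. The sketch below takes \(R = \eta(\mathcal{L}_{DL}' - \mathcal{L}_{CE}) + (1-\eta) A_{cc}\) literally and samples a teacher from \(\pi_T\); how the fine-tuned LLM produces \(\pi_T\), and the precise definitions of the two loss terms, are simplified assumptions here.

```python
import torch

def ngs_reward(dl_loss: float, ce_loss: float, accuracy: float,
               eta: float = 0.5) -> float:
    """R = eta * (L'_DL - L_CE) + (1 - eta) * Acc, read literally from the
    formula above; the paper may scale or define the loss terms differently."""
    return eta * (dl_loss - ce_loss) + (1.0 - eta) * accuracy

# The fine-tuned LLM emits pi_T over the B teachers; a random policy head
# stands in for it here (B = 4 is an arbitrary choice).
pi_T = torch.softmax(torch.randn(4), dim=-1)
teacher_idx = torch.distributions.Categorical(probs=pi_T).sample()
reward = ngs_reward(dl_loss=0.42, ce_loss=0.35, accuracy=0.80)
```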
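For GTA fine-tuning, the summary names the four topology tasks but not the prompt wording, so the template below is purely illustrative (the "degree" task, in a hypothetical format).

```python
def degree_prompt(node_id: int, edges: list[tuple[int, int]]) -> str:
    """Hypothetical GTA-style prompt for the 'degree' topology task; the
    paper's actual template wording is not reproduced in this summary."""
    edge_str = "; ".join(f"({u}, {v})" for u, v in edges)
    return (f"You are given an undirected graph with edges: {edge_str}. "
            f"Question: what is the degree of node {node_id}? "
            f"Answer with a single integer.")

print(degree_prompt(2, [(0, 1), (1, 2), (2, 3)]))  # node 2 has degree 2
```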

Loss & Training

  • \(\mathcal{L}_{KD} = \alpha \cdot \text{soft KD} + \beta \cdot \text{hard-label CE} + \gamma \cdot \text{entropy reg}\) (sketched in code after this list).
  • PPO optimizes the RL agent.
  • Compatible with multiple LLM backbones including Llama-3.1, Qwen2.5, and Mixtral.
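A minimal sketch of the composite distillation objective follows; the temperature \(\tau\), the direction of the entropy term, and the default weights are assumptions, since the summary only names the three components.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            labels: torch.Tensor, alpha: float = 1.0, beta: float = 1.0,
            gamma: float = 0.1, tau: float = 2.0) -> torch.Tensor:
    """L_KD = alpha * soft KD + beta * hard-label CE + gamma * entropy reg."""
    # Soft distillation: KL between temperature-scaled distributions.
    soft = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau ** 2
    # Hard supervision from the few labeled nodes.
    hard = F.cross_entropy(student_logits, labels)
    # Entropy regularizer; minimizing it pushes the student toward confident
    # predictions (whether PKD minimizes or maximizes it is an assumption).
    p = F.softmax(student_logits, dim=-1)
    ent = -(p * p.clamp_min(1e-12).log()).sum(-1).mean()
    return alpha * soft + beta * hard + gamma * ent

# Toy usage: 10 nodes, 4 classes.
loss = kd_loss(torch.randn(10, 4), torch.randn(10, 4),
               torch.randint(0, 4, (10,)))
```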

Key Experimental Results

Main Results (5 labeled nodes/class)

Dataset      PKD       Runner-up            Gain
Cornell      87.0%     82.0% (IceBerg)      +5.0%
Cora         91.14%    79.5% (PopT)         +11.6%
Washington   83.74%    79.76% (IceBerg)     +3.98%
Texas        86.31%    84.85% (FairGKD)     +1.46%

PKD achieves consistently best or second-best performance across all 9 datasets.

Ablation Study

Component       Effect of Removal
GTA prompts     Degraded graph comprehension
DNS module      Lower neighbor selection quality
K-uncertainty   Random node selection performs poorly
Full model      Best

Key Findings

  • The framework generalizes across different LLM backbones (Llama/Qwen/Mixtral), demonstrating backbone-agnostic applicability.
  • K-uncertainty-based selection outperforms random selection, validating the value of intelligent node allocation.
  • RL-based teacher selection outperforms fixed single-teacher assignment, confirming that topological diversity necessitates varied processing strategies.

Highlights & Insights

  • The bidirectional preference-driven design elegantly exploits the complementary strengths of LLMs and GNNs — GNNs identify their own uncertainty, while LLMs determine how to resolve it.
  • RL-based teacher selection is more adaptive than static assignment, accommodating diverse node topologies.

Limitations & Future Work

  • Training \(B\) GNN teachers alongside LLM fine-tuning incurs substantial computational cost.
  • The framework is limited to node-level classification and has not been extended to edge- or graph-level tasks.
  • RL training may exhibit instability.
  • vs. GraphLLM: GraphLLM replaces GNNs with LLMs entirely; PKD promotes bidirectional collaboration between the two.
  • vs. TAPE: TAPE uses LLMs to generate textual features for GNNs; PKD engages in deeper, bidirectional interaction.

Rating

  • Novelty: ⭐⭐⭐⭐ Bidirectional preference-driven design + RL-based teacher selection
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 datasets + 3 LLM backbones + ablation study
  • Writing Quality: ⭐⭐⭐⭐ Framework description is clear and well-structured
  • Value: ⭐⭐⭐⭐ A practical solution for few-shot graph learning
  • The two-level preference-driven design — GNN uncertainty selects nodes for LLM annotation, and the RL-tuned LLM selects each node's GNN teacher — reflects the key insight that different nodes suit different GNN architectures.
  • PKD surpasses prior state-of-the-art on few-shot node classification and maintains its advantage as the number of labeled samples increases.
  • The core contribution lies in the conceptual simplicity and empirical effectiveness of its design.
  • Experimental results comprehensively validate the central hypotheses.