Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?¶

Conference: ACL 2025
arXiv: 2412.08174
Code: Violet24K/Morpher
Area: Graph Learning
Keywords: Graph Neural Networks, Multimodal Prompt Learning, Graph-Text Alignment, Few-shot Learning, Zero-shot Classification

TL;DR¶

This paper proposes Morpher, a multimodal prompt learning paradigm. Under extremely weak text supervision (label names of only a few tokens), Morpher aligns a pre-trained GNN with the semantic space of an LLM by learning graph prompts and text prompts simultaneously. This enables cross-task and cross-domain transfer for graph classification, as well as the first prototype of CLIP-style zero-shot classification with GNNs.

Background & Motivation¶

Background: CLIP constructs high-quality vision-language alignment models through joint vision-language pre-training, but extending this paradigm to graph data faces significant challenges. Graph data is naturally scarce, text supervision is extremely weak (with label names containing only a few tokens), tasks span three levels (nodes, edges, and graphs), and the same graph structure can express entirely different semantics across different domains.

Limitations of Prior Work: (1) Joint graph-text pre-training is feasible only for molecular domains and text-attributed graphs; data scarcity makes it impractical for general graph data. (2) Existing GNN prompting methods (such as the design of Sun et al., 2023) suffer in practice from cross-connections that overwhelm the original graph structure, leading to training instability. (3) Prompting only a single modality limits how flexibly the other modality can be adjusted.

Design Motivation: Leveraging the high-quality semantic space already established by the LLM encoder, dual-modal prompt learning is employed to align graph embeddings to this semantic space while freezing both GNN and LLM parameters.

Method¶

Overall Architecture¶

Morpher consists of three learnable components: graph prompts $\mathbf{P}_\theta^g$, text prompts $\mathbf{P}_\theta^t$, and a cross-modal projector $\text{Proj}_\theta$. The parameters of the GNN and the LLM are frozen entirely, and alignment is achieved solely through the prompts and the projector.

Key Designs¶

1. Improved Graph Prompt Design: The paper analyzes the fundamental issue of existing graph prompting (Sun et al., 2023): because prompt tokens are initialized close to zero vectors, the sigmoid values that decide cross-connections are all close to 0.5, which produces excessively dense cross-connections and lets the prompt graph's features overwhelm the original graph. Solution: the number of cross-connections is capped at the number of original graph edges $n_e$ by letting each of the $a$ prompt tokens connect to at most $\lfloor n_e/a \rfloor$ nodes, and cosine similarity replaces sigmoid for computing connection weights.

2. Cross-Modal Projector: A linear layer with tanh activation maps the graph embedding space to the text embedding space:

\[\widetilde{\mathbf{v}} = \text{Proj}_\theta(\mathbf{v}) := \tanh(\mathbf{W}\mathbf{v} + \mathbf{b}) \in \mathbb{R}^{1 \times d_t}\]

3. Text Embedding Normalization: Because the handful of label texts may be semantically close, their embeddings can be nearly parallel. The mean $\mu$ of the text embeddings is therefore subtracted before L2 normalization, which pushes the embeddings of semantically similar categories apart.
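As a rough illustration of key design 1, the sketch below caps the cross-connections between prompt tokens and graph nodes. It assumes the budget is applied per prompt token (each of the $a$ tokens keeps its $\lfloor n_e/a \rfloor$ most similar nodes, so the total stays within $n_e$) and that nodes are chosen by top-k cosine similarity. The function names and the top-k rule are hypothetical choices made for this example, not Morpher's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def cross_connections(node_feats, prompt_tokens, n_edges):
    """Return (token_idx, node_idx, weight) triples (hypothetical helper).

    Each prompt token keeps at most floor(n_edges / a) nodes, picked by
    highest cosine similarity, so the total never exceeds n_edges.
    """
    a = len(prompt_tokens)
    if a == 0:
        return []
    k = min(n_edges // a, len(node_feats))  # per-token connection budget
    links = []
    for t, token in enumerate(prompt_tokens):
        sims = [(cosine(token, x), i) for i, x in enumerate(node_feats)]
        sims.sort(key=lambda s: s[0], reverse=True)
        links.extend((t, i, w) for w, i in sims[:k])
    return links

# Toy graph: 4 nodes, 4 original edges, 2 prompt tokens.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
tokens = [[1.0, 0.1], [0.1, 1.0]]
links = cross_connections(nodes, tokens, n_edges=4)
```

With these toy inputs each token receives two connections, so the prompt adds exactly as many cross-connections as the graph has edges, and every weight is a cosine similarity in $[-1, 1]$ rather than a sigmoid value clustered near 0.5.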
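The projector in key design 2 is small enough to write out directly. The following is a minimal pure-Python sketch of $\tanh(\mathbf{W}\mathbf{v} + \mathbf{b})$; the toy dimensions ($d_g = 2$, $d_t = 3$) and the weight values are made up for illustration, and in practice $\mathbf{W}$ and $\mathbf{b}$ would be learned.

```python
import math

def project(v, W, b):
    """Compute tanh(W v + b), mapping a d_g-dim graph embedding to d_t dims.

    W is a list of d_t rows, each of length d_g; b has length d_t.
    """
    return [
        math.tanh(sum(w * x for w, x in zip(row, v)) + bias)
        for row, bias in zip(W, b)
    ]

# Illustrative values only: d_g = 2, d_t = 3.
W = [[0.5, -0.5], [1.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
v_tilde = project([2.0, 2.0], W, b)
```

Because of the tanh, every coordinate of the projected embedding lies strictly inside $(-1, 1)$, which keeps the projected graph embeddings on a bounded scale before they are compared with text embeddings.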
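To see why the mean subtraction in key design 3 helps, consider two label embeddings that are almost parallel. The sketch below assumes $\mu$ is the per-dimension mean over the class text embeddings (one reading of the description above); the helper name is hypothetical.

```python
import math

def center_and_normalize(text_embs):
    """Subtract the per-dimension mean over classes, then L2-normalize each row."""
    n, d = len(text_embs), len(text_embs[0])
    mu = [sum(e[j] for e in text_embs) / n for j in range(d)]
    out = []
    for e in text_embs:
        centered = [x - m for x, m in zip(e, mu)]
        norm = math.sqrt(sum(x * x for x in centered)) or 1.0  # guard zero vector
        out.append([x / norm for x in centered])
    return out

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Two nearly parallel label embeddings (toy values).
embs = [[1.0, 0.02], [1.0, 0.00]]
z = center_and_normalize(embs)
similarity_after = dot(z[0], z[1])
```

Before centering, the two toy embeddings have cosine similarity very close to 1; after centering and normalization they point in opposite directions, so a classifier based on cosine similarity can tell the two classes apart much more easily.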

Loss & Training¶

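An in-batch contrastive loss is used for graph-text alignment training, where $\mathbf{z}_i^{\mathcal{G}}$ and $\mathbf{z}_i^t$ are the graph and text embeddings of the $i$-th sample in the shared space, $B$ is the batch size, and $\tau$ is the temperature: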

\[\mathcal{L}_{G \rightarrow T} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\mathbf{z}_i^{\mathcal{G}} \cdot \mathbf{z}_i^t / \tau)}{\sum_{j=1}^{B} \exp(\mathbf{z}_i^{\mathcal{G}} \cdot \mathbf{z}_j^t / \tau)}\]
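The graph-to-text loss can be evaluated directly from its definition. The sketch below is a plain-Python transcription of the formula, with a log-sum-exp shift added for numerical stability; the toy embeddings and the temperature value are illustrative assumptions.

```python
import math

def graph_to_text_loss(z_g, z_t, tau):
    """Mean over i of -log( exp(z_g[i].z_t[i]/tau) / sum_j exp(z_g[i].z_t[j]/tau) )."""
    B = len(z_g)
    total = 0.0
    for i in range(B):
        logits = [
            sum(a * b for a, b in zip(z_g[i], z_t[j])) / tau for j in range(B)
        ]
        m = max(logits)  # shift logits so exp() cannot overflow
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / B

# Toy batch of two unit-norm embeddings (illustrative values).
z_g = [[1.0, 0.0], [0.0, 1.0]]
aligned = graph_to_text_loss(z_g, [[1.0, 0.0], [0.0, 1.0]], tau=0.5)
mismatched = graph_to_text_loss(z_g, [[0.0, 1.0], [1.0, 0.0]], tau=0.5)
```

When each graph embedding matches its own text embedding, the loss is small; swapping the text embeddings makes the positive pair the least similar one in the batch and the loss rises accordingly.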

During inference, classification is performed by computing the cosine similarity between graph embeddings and text embeddings of each category.
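The inference rule amounts to a nearest-neighbor lookup in the shared embedding space. A minimal sketch follows, assuming the graph embedding has already been projected into the text space; the class names and vectors are placeholders, not values from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def classify(graph_emb, class_text_embs):
    """Return the class name whose text embedding is most similar to graph_emb."""
    return max(class_text_embs, key=lambda name: cosine(graph_emb, class_text_embs[name]))

# Placeholder class names and embeddings for illustration.
class_text_embs = {"class_a": [1.0, 0.0], "class_b": [0.0, 1.0]}
prediction = classify([0.9, 0.1], class_text_embs)
```

Because the classifier only needs a text embedding per class name, new class names can be added to the dictionary without retraining, which is the mechanism behind the zero-shot setting discussed in this summary.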

Key Experimental Results¶

Main Results (Few-shot Graph Classification)¶

Training Method	GNN Pre-training	MUTAG	ENZYMES	PROTEINS	MSRC_21C
Supervised	N/A+GCN	66.00	16.67	65.89	38.85
Pre-train+FT	GraphCL+GCN	70.00	17.91	65.89	40.00
Graph Prompt	GPF+GCN	64.67	17.02	63.50	43.46
Morpher	GCN+LLaMA	75.33	22.39	68.32	50.86

Ablation Study¶

Component	Effect
Without modified graph prompts	Unstable training; fails to converge on some datasets
Without text prompts	Performance drops by 2-5%, insufficient flexibility in single-modality adjustments
Without cross-modal projector	Dimension mismatch makes training impossible
Original graph prompts (Sun et al., 2023)	Overly dense cross-connections degrade performance

Key Findings¶

Efficacy of Extremely Weak Supervision: With only class names (a few tokens) as text supervision, Morpher improves few-shot GNN classification over all listed baselines, e.g., from 70.00 (Pre-train+FT) to 75.33 on MUTAG.
Cross-Domain Transfer: Under cross-domain settings (e.g., molecule $\rightarrow$ social network), Morpher remains highly competitive.
Zero-Shot Prototype: The first CLIP-style zero-shot classification is realized on GNNs: once graph embeddings are projected into the text space, graphs can be classified directly against the names of unseen classes.
Diagnosis of Graph Prompting Issues: The root cause of excessively dense cross-connections in existing graph prompt designs is revealed, and an effective solution is proposed.

Highlights & Insights¶

The first graph-text multimodal prompt learning framework under extremely weak text supervision, where both GNN and LLM parameters are completely frozen.
Deep analysis and correction of the core flaw of overly dense cross-connections in existing graph prompting designs.
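The first CLIP-style zero-shot classification prototype for GNNs, demonstrating that it is viable for graph models to learn language.
Strong results across few-shot, multi-task, and cross-domain settings; in the few-shot table above, Morpher is best on all four datasets (e.g., 50.86 vs. 43.46 for the strongest baseline on MSRC_21C).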

Limitations & Future Work¶

Reliance on the quality of pre-trained GNNs and LLMs; pre-training domain bias of both models may affect performance.
The cross-modal projector only uses a simple linear layer + tanh, which has limited expressiveness.
Zero-shot classification was validated only on limited scenarios, and its generalization ability requires further evaluation.
Experiments were primarily conducted on small-to-medium graph datasets, lacking validation on large-scale industrial graphs.

Related Work¶

CLIP and Vision-Language Alignment: The CLIP framework of Radford et al. (2021) is the core inspiration for this work.
Graph Prompt Learning: Sun et al. (2023) proposed graph prompts built from learnable prompt tokens with cross-connections to the input graph; this work identifies and resolves the design flaw in that scheme.
Graph Self-Supervised Pre-training: GraphCL (You et al., 2020), GCC (Qiu et al., 2020), etc., provide pre-trained GNNs.
Multimodal Prompt Learning: CoCoOp (Zhou et al., 2022) and MaPLe (Khattak et al., 2023) utilize dual-modality prompts in vision-language tasks.

Rating¶

Metric	Score
Novelty	⭐⭐⭐⭐
Technical Depth	⭐⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐
Value	⭐⭐⭐⭐
Overall	7.5/10