Towards Improved Sentence Representations using Token Graphs

Conference: ICLR 2026 arXiv: 2603.03389 Code: https://github.com/ipsitmantri/GLOT Area: NLP / Graph Learning Keywords: Sentence Representation, Graph Neural Networks, Token Graph, Pooling, Frozen LLM

TL;DR

This paper proposes Glot, a lightweight, structure-aware pooling module. It constructs a latent similarity graph from the token-level hidden states of a frozen LLM, refines those states with a GNN, and aggregates them into a sentence representation. On GLUE and MTEB, Glot is competitive with fine-tuning-based methods while using 20× fewer trainable parameters and training 100× faster.

Background & Motivation

Background: LLMs produce token-level hidden states, yet many downstream tasks require a single-vector sentence representation. Standard practice relies on mean/max/[CLS] pooling, which treats tokens as an unordered set.

Limitations of Prior Work: (a) Standard pooling discards the rich relational structure captured by self-attention layers; (b) mean pooling is overwhelmed by noise when only a few tokens carry task-relevant signals; (c) the causal attention of decoder-only LLMs is optimized for next-token prediction rather than sentence understanding. Full model fine-tuning is prohibitively expensive.

Key Challenge: How can high-quality sentence representations be obtained from a frozen model's outputs without fine-tuning the LLM?

Goal: Reformulate pooling as "relational learning first, then aggregation" — treating tokens as a graph rather than an independent set.

Key Insight: Token hidden states from LLMs naturally carry similarity structure (cosine similarity), enabling the construction of a latent graph. GNNs that propagate information over such graphs before aggregation are strictly more expressive than the DeepSets framework.

Core Idea: Glot = Token similarity graph construction + Token-GNN refinement + Learnable readout. The LLM backbone is frozen; only the lightweight GNN head is trained.

Method

Overall Architecture

A frozen LLM produces token hidden states \(\mathbf{X} \in \mathbb{R}^{L \times d}\) (\(L\) tokens, hidden size \(d\)) → construct token similarity graph \(\mathcal{G}\) → Token-GNN refinement → weighted aggregation readout → sentence vector \(\mathbf{z}\).

Key Designs

Token Graph Construction:
- Function: Construct a sparse graph based on cosine similarity.
- Mechanism: \(\mathbf{S}_{ij} = \cos(\mathbf{x}_i, \mathbf{x}_j)\); an edge is created only when \(\mathbf{S}_{ij} > \tau\), where \(\tau\) is a hyperparameter.
- Design Motivation: Preserve connections between semantically related tokens while discarding irrelevant ones. The threshold controls graph sparsity.
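
The thresholding step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `build_token_graph`, the default `tau=0.5`, and the choice to drop self-loops are all assumptions made for the example.

```python
import numpy as np

def build_token_graph(X, tau=0.5):
    """Boolean adjacency matrix with an edge (i, j) where cos(x_i, x_j) > tau.

    X has shape (L, d): one row per token hidden state.
    """
    # Scale each token vector to unit length; the floor avoids division by zero.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.maximum(norms, 1e-12)
    S = X_unit @ X_unit.T          # S[i, j] = cos(x_i, x_j)
    A = S > tau                    # keep only sufficiently similar pairs
    np.fill_diagonal(A, False)     # assumption: self-loops are dropped
    return A

# Toy input: tokens 0 and 1 point in nearly the same direction, token 2 does not.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
A = build_token_graph(X, tau=0.5)
print(A.astype(int))  # tokens 0 and 1 are connected; token 2 is isolated
```

Raising `tau` removes edges and makes the graph sparser, which is the sparsity control described above.
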
Token-GNN Refinement:
- Function: Propagate information over the token graph.
- Mechanism: A \(K\)-layer GNN with \(\mathbf{a}_i^{(\ell)} = \text{AGGREGATE}_{j \in \mathcal{N}_i}(\mathbf{h}_j^{(\ell)})\) and \(\mathbf{h}_i^{(\ell+1)} = \sigma(\mathbf{W}^{(\ell)} \text{CONCAT}(\mathbf{h}_i^{(\ell)}, \mathbf{a}_i^{(\ell)}))\).
- Design Motivation: GNNs capture inter-token dependencies, such as the negation of "good" by "not" in "not good." DeepSets (\(K=0\)) cannot model such interactions.
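
A minimal NumPy sketch of the update rule above follows. The summary does not fix AGGREGATE or \(\sigma\), so mean aggregation and ReLU are assumptions here, as are the helper names `gnn_layer` and `token_gnn` and the convention that the first layer takes the raw token states as input.

```python
import numpy as np

def gnn_layer(H, A, W):
    """One layer: h_i' = relu(W @ concat(h_i, a_i)), a_i = mean of neighbor states.

    H: (L, d) token states, A: (L, L) adjacency, W: (d_out, 2 * d) weights.
    """
    A = A.astype(float)
    deg = A.sum(axis=1, keepdims=True)
    # Mean over neighbors (assumed AGGREGATE); isolated tokens receive zeros.
    agg = (A @ H) / np.maximum(deg, 1.0)
    Z = np.concatenate([H, agg], axis=1) @ W.T
    return np.maximum(Z, 0.0)      # ReLU (assumed sigma)

def token_gnn(X, A, weights):
    """Apply K = len(weights) layers; K = 0 returns the input unchanged."""
    H = X
    for W in weights:
        H = gnn_layer(H, A, W)
    return H

# Toy example: tokens 0 and 1 are neighbors, token 2 is isolated.
H0 = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)
W = np.hstack([np.eye(2), np.eye(2)])  # maps concat(h, a) to h + a
U = token_gnn(H0, A, [W])
print(U)
```

With an empty `weights` list no information is exchanged between tokens, which corresponds to the \(K=0\) case mentioned above.
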
Learnable Readout:
- Function: Adaptively aggregate refined token representations.
- Mechanism: \(m_i = \mathbf{v}^\top \tanh(\mathbf{W}_m \mathbf{u}_i + \mathbf{b}_m)\), \(\pi = \text{softmax}(\mathbf{m})\), \(\mathbf{z} = \sum_i \pi_i \mathbf{u}_i\), where \(\mathbf{u}_i\) is the refined representation of token \(i\) produced by the Token-GNN.
- Design Motivation: Adaptive weights outperform fixed mean/max aggregation. Theoretically, Glot is shown to generalize mean/max/CLS pooling and AdaPool.
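
The readout formula can be written directly in NumPy. The sketch below is illustrative only: the function name `attention_readout`, the scoring width of 4, and the random parameters are assumptions, and in practice \(\mathbf{W}_m\), \(\mathbf{b}_m\), and \(\mathbf{v}\) would be learned.

```python
import numpy as np

def attention_readout(U, W_m, b_m, v):
    """z = sum_i pi_i u_i with pi = softmax(m), m_i = v^T tanh(W_m u_i + b_m).

    U: (L, d) refined token states, W_m: (p, d), b_m: (p,), v: (p,).
    """
    m = np.tanh(U @ W_m.T + b_m) @ v   # one scalar score per token
    m = m - m.max()                    # shift scores for numerical stability
    pi = np.exp(m) / np.exp(m).sum()   # softmax over the L tokens
    z = pi @ U                         # weighted sum of token states
    return z, pi

# Toy example with random (untrained) parameters.
rng = np.random.default_rng(0)
U = rng.standard_normal((5, 8))
W_m = rng.standard_normal((4, 8))
b_m = np.zeros(4)
v = rng.standard_normal(4)
z, pi = attention_readout(U, W_m, b_m, v)
print(z.shape, pi.sum())
```

The returned weights `pi` are non-negative and sum to one, so `z` is always a convex combination of the token states.
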

Loss & Training

Task-specific losses are used (cross-entropy for classification, cosine loss for similarity). Only the GNN head and the task classifier are trained; the backbone is fully frozen. The number of trainable parameters is 20× smaller than that of PEFT methods such as LoRA.

Key Experimental Results

Main Results (GLUE + Frozen BERT)

Method	CoLA (MCC)	SST-2 (Acc)	STS-B (Spearman)	MRPC (F1)	QQP (F1)
[CLS]	22.66	83.83	61.08	79.58	19.70
Mean	19.55	82.91	74.96	80.28	29.01
AdaPool	29.20	87.72	80.01	77.99	40.15
Glot	47.49	90.25	83.86	82.58	62.19

Ablation Study (Signal Dilution Stress Test)

Method	0% Noise	50% Noise	90% Noise
Mean	~92%	~70%	~52%
AdaPool	~93%	~78%	~60%
Glot	~95%	~94%	97%+
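
To see why mean pooling degrades as noise tokens are added, the toy calculation below measures how much of a fixed signal direction survives averaging. It is not the paper's stress-test protocol: the function name `diluted_mean_projection`, the Gaussian noise, and the choice to keep noise orthogonal to the signal are assumptions made so that the effect is exact.

```python
import numpy as np

def diluted_mean_projection(n_signal, n_noise, d=16, seed=0):
    """Projection of the mean-pooled vector onto the signal direction."""
    rng = np.random.default_rng(seed)
    s = np.zeros(d)
    s[0] = 1.0                               # unit signal direction
    signal = np.tile(s, (n_signal, 1))       # every signal token equals s
    noise = rng.standard_normal((n_noise, d))
    noise[:, 0] = 0.0                        # assumption: noise is orthogonal to s
    X = np.vstack([signal, noise])
    return float(X.mean(axis=0) @ s)         # equals n_signal / (n_signal + n_noise)

# 10 signal tokens with 0%, 50%, and 90% of the sequence being noise.
for n_noise in (0, 10, 90):
    print(n_noise, diluted_mean_projection(10, n_noise))
```

Under these assumptions the surviving signal shrinks in proportion to the share of signal tokens, so at 90% noise only a tenth of it remains in the mean-pooled vector.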

Key Findings

- CoLA: Glot gains about 25 MCC points over [CLS] (47.49 vs. 22.66), which suggests that relational modeling is critical for linguistic understanding.
- Signal dilution stress test: With 90% randomly injected noise tokens, Mean and AdaPool collapse (~50–60%), while Glot maintains 97%+.
- Decoder-only LLMs benefit most: Glot yields substantially larger gains over Mean pooling on SmolLM2 and TinyLlama.
- Extreme parameter efficiency: 20× fewer parameters than PEFT methods such as LoRA, with 100×+ faster training.
- Theoretical guarantee: Glot strictly generalizes DeepSets; GNN message passing is provably more expressive than pure set functions.
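
One direction of the generalization claim can be checked by hand from the readout definition in Key Designs. The following is a sketch rather than the paper's proof, and it assumes that \(K = 0\) means the readout is applied directly to the frozen hidden states, i.e. \(\mathbf{u}_i = \mathbf{x}_i\):

\[
K = 0 \;\Rightarrow\; \mathbf{z} = \sum_{i=1}^{L} \pi_i \mathbf{x}_i, \qquad \pi = \text{softmax}(\mathbf{m}), \qquad m_i = \mathbf{v}^\top \tanh(\mathbf{W}_m \mathbf{x}_i + \mathbf{b}_m).
\]

Each score \(m_i\) depends only on \(\mathbf{x}_i\), so no token can influence how another token is represented. If in addition \(\mathbf{v} = \mathbf{0}\), every score is zero and the softmax is uniform:

\[
m_i = 0 \;\Rightarrow\; \pi_i = \frac{1}{L} \;\Rightarrow\; \mathbf{z} = \frac{1}{L} \sum_{i=1}^{L} \mathbf{x}_i,
\]

which is mean pooling. For \(K \geq 1\), each \(\mathbf{u}_i\) also depends on the neighbors \(\mathcal{N}_i\) in the token graph, which is where the additional expressiveness comes from.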

Highlights & Insights

- A new paradigm of "pooling as relational learning": Tokens are treated not as an independent set but as a graph, enabling relation-based compression.
- Practical value of frozen LLM + lightweight GNN: Expensive fine-tuning is avoided, with only a small number of trainable parameters required.
- Diagnostic value of stress testing: The robustness gap at 90% noise clearly exposes the fundamental difference between relational learning and independent aggregation.

Limitations & Future Work

- Graph construction relies on a cosine similarity threshold \(\tau\), requiring hyperparameter tuning.
- The GNN introduces additional memory and computational overhead (though far less than fine-tuning).
- The possibility of pre-training the GNN head for cross-task transfer has not been explored.
- For long texts such as documents, the token graph may become prohibitively large.

Comparison with Related Work

- vs. AdaPool (Brothers, 2025): AdaPool learns token weights but operates within the DeepSets framework, which precludes modeling token interactions. Glot adds this capability through its GNN.
- vs. TextGCN: TextGCN constructs word co-occurrence graphs at the corpus level for classification, whereas Glot constructs token graphs at the sentence level for representation learning.
- vs. ColBERT: ColBERT retains multi-vector representations, while Glot compresses them into a single vector.

Rating

Novelty: ⭐⭐⭐⭐⭐ Redefining pooling as graph-based relational learning represents a novel and theoretically grounded paradigm shift.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation spanning GLUE, MTEB, IMDB, stress tests, and 6 backbone architectures.
Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly articulated; the logical flow from theory to practice is complete.
Value: ⭐⭐⭐⭐⭐ Highly efficient and practical, with immediate value for downstream applications of frozen LLMs.