Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs

Conference: ICLR 2026 arXiv: 2601.20704 Code: None Area: AI Safety / Graph Learning Keywords: LLM reference detection, citation graph, graph neural networks, semantic embeddings, academic integrity

TL;DR

By constructing paired citation graphs (human vs. GPT-4o-generated vs. random baseline) for 10,000 papers, this work finds that LLM-generated reference lists are nearly indistinguishable from human ones in terms of graph topology (RF accuracy only 60%), yet are effectively detectable via semantic embeddings (RF 83%, GNN 93%). This indicates that LLMs accurately mimic citation topology while leaving detectable semantic fingerprints.

Background & Motivation

Background: LLMs are increasingly used to synthesize scientific knowledge, draft literature reviews, and suggest references. Prior studies have found that LLM-generated references resemble human ones on coarse-grained metrics (title length, team size, citation count), but exhibit systematic biases at finer granularity (amplified Matthew effect, preference for recent papers, reduced self-citations).

Limitations of Prior Work: It remains unclear whether LLM-generated and human-generated reference lists can be reliably distinguished. Single-reference auditing approaches (e.g., LLM-Check) are insufficient to capture list-level patterns.

Key Challenge: Do LLMs genuinely understand citation structure, or merely imitate it superficially? If topological structure is similar, where do the differences lie?

Goal: To systematically evaluate the structural and semantic differences between LLM-generated and human citation graphs, and to develop corresponding detection methods.

Key Insight: A progressive modeling strategy—from interpretable graph structural features to semantic embeddings to GNNs—that incrementally isolates the contributions of topology vs. semantics.

Core Idea: LLM references are "structurally human, semantically biased"—detection should target content signals rather than graph structure.

Method

Overall Architecture

10,000 papers are sampled from SciSciNet → paired citation graphs are constructed for each paper (ground truth, GPT-4o-generated, and a domain-matched random baseline) → structural features are extracted (degree/closeness/eigenvector centrality, clustering coefficient, edge count) → semantic embeddings are extracted (OpenAI text-embedding-3-large, 3072-D) → RF and GNN classifiers are evaluated on the three pairwise binary tasks (GT vs. GPT, GT vs. random, GPT vs. random).
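The degree-preserving random baseline in this pipeline can be sketched in a few lines. The function and toy data below are illustrative, not the authors' code: references are pooled per field, shuffled, and redealt so that every paper keeps its reference count and every cited paper keeps its citation count within the field.

```python
import random

def random_baseline(citations, field, seed=0):
    """Reshuffle citations within each field, preserving degrees.

    citations: dict mapping focal paper -> list of cited paper IDs
    field:     dict mapping focal paper -> field label
    (Toy sketch; the paper pools by field to control topic distribution.)
    """
    rng = random.Random(seed)
    # Pool all cited endpoints per field, then shuffle each pool.
    pools = {}
    for paper, refs in citations.items():
        pools.setdefault(field[paper], []).extend(refs)
    for f in pools:
        rng.shuffle(pools[f])
    # Redeal: each paper draws the same number of references it had,
    # from its own field's pool.
    shuffled = {}
    for paper, refs in citations.items():
        pool = pools[field[paper]]
        shuffled[paper] = [pool.pop() for _ in refs]
    return shuffled
```

Because the pools are permutations of the original reference multisets, both the out-degree of each focal paper and the per-field in-degree distribution of cited papers are preserved by construction.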

Key Designs

  1. Citation Graph Construction:

    • Function: Construct paired ground-truth/generated citation graphs for each paper.
    • Mechanism: The focal paper serves as the main node; cited papers are child nodes; citation relations are retrieved from SciSciNet. GPT-4o generates references purely parametrically given title, abstract, author information, etc. The random baseline uniformly reshuffles citations within the same field, preserving the degree distribution.
    • Design Motivation: Controlled experiment—three citation graphs for the same focal paper are directly comparable.
  2. Structural Features vs. Semantic Embeddings:

    • Structural features: Degree/closeness/eigenvector centrality, clustering coefficient, edge count → RF classification.
    • Semantic embeddings: OpenAI text-embedding-3-large (3072-D) → graph-level aggregation → RF / used as GNN node features.
    • Design Motivation: Disentangle the contributions of topological signals and content signals.
  3. GNN Graph Classification:

    • Function: Graph-level binary classification using GCN/GAT/GIN/GraphSAGE.
    • Mechanism: Node features are either structural attributes (5-D) or semantic embeddings (3072-D); graph-level readout is followed by binary classification.
    • Design Motivation: GNNs can jointly exploit both structural and semantic signals.
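As a concrete illustration of the structural side of this design, the snippet below computes toy versions of three of the structural features (edge count, degree centrality, local clustering coefficient) on an undirected adjacency dict. The function name and graph representation are my own; the paper additionally uses closeness and eigenvector centrality.

```python
def graph_features(adj):
    """Per-graph structural summary on an undirected adjacency dict
    {node: set(neighbors)}. Assumes at least two nodes."""
    n = len(adj)
    edges = sum(len(nb) for nb in adj.values()) // 2  # each edge counted twice

    # Degree centrality: degree normalized by the maximum possible (n - 1).
    deg_cent = {v: len(nb) / (n - 1) for v, nb in adj.items()}

    # Local clustering coefficient: fraction of a node's neighbor pairs
    # that are themselves connected.
    def clustering(v):
        nb = list(adj[v])
        k = len(nb)
        if k < 2:
            return 0.0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nb[j] in adj[nb[i]])
        return 2 * links / (k * (k - 1))

    clust = {v: clustering(v) for v in adj}
    return {
        "edges": edges,
        "mean_degree_centrality": sum(deg_cent.values()) / n,
        "mean_clustering": sum(clust.values()) / n,
    }
```

Graph-level averages like these form the low-dimensional feature vector fed to the RF classifier; the semantic pipeline replaces them with pooled 3072-D embeddings.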

Loss & Training

Adam optimizer; 70/15/15 train/validation/test split; class-balanced dataset. Robustness is validated with two generator LLMs (GPT-4o and Claude Sonnet 4.5) and two embedding models (SPECTER and OpenAI text-embedding-3-large).
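A minimal sketch of the graph-classification forward pass, assuming one mean-aggregation message-passing layer, a mean readout, and a logistic head. This is a simplification of the GCN/GAT/GIN/GraphSAGE variants the paper actually trains, and all weights below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_graph_logit(X, A, W1, w2):
    """X: (n, d) node features (structural 5-D or embedding 3072-D);
    A: (n, n) symmetric adjacency. Returns P(graph is LLM-generated)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    A_norm = A_hat / A_hat.sum(1, keepdims=True)   # row-normalized mean aggregation
    H = np.maximum(A_norm @ X @ W1, 0.0)           # ReLU(message passing)
    g = H.mean(axis=0)                             # graph-level mean readout
    return 1 / (1 + np.exp(-g @ w2))               # logistic binary head

# Toy example: a 4-node citation graph with 8-D stand-in node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)
p = gnn_graph_logit(X, A, W1, w2)
```

In training, the weights would be fit with Adam against a binary cross-entropy loss on the balanced 70/15/15 split; swapping the 8-D toy features for 3072-D semantic embeddings corresponds to the paper's best-performing configuration.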

Key Experimental Results

Main Results

Method                       GT vs GPT   GT vs Random   GPT vs Random
RF (structural features)     0.608       0.896          0.928
RF (semantic embeddings)     0.835       0.908          0.953
GNN (structural features)    ~0.55       ~0.90          ~0.93
GNN (semantic embeddings)    0.93        ~0.95          ~0.97

Ablation Study

Configuration                                GT vs GPT Accuracy   Notes
GNN + embeddings                             93%                  Best configuration
RF + embeddings                              83.5%                Semantic embeddings contribute most of the gain
RF + structure                               60.8%                Only slightly above chance
GNN + structure                              ~55%                 Structure alone is insufficient
Random embedding substitution                ~50%                 Effect is not due to dimensionality
Cross-generator (train on GPT, test Claude)  ~72%                 Generalizes to other LLMs

Key Findings

  • Topology is nearly indistinguishable: Centrality and clustering coefficients of GPT citation graphs heavily overlap with ground-truth graphs; RF achieves only 60%.
  • Semantic fingerprints are detectable: Embedding features improve accuracy from 60% to 83% (RF) / 93% (GNN).
  • Random baseline is easily separated: Ground-truth vs. random achieves 89%+, GPT vs. random achieves 93%+—demonstrating that GPT does generate structurally plausible citations.
  • Cross-LLM generalization: A classifier trained on GPT-4o achieves 72% accuracy on Claude.
  • Replacing embeddings with random vectors drops accuracy to 50%, confirming that discriminative power derives from semantic structure rather than dimensionality.

Highlights & Insights

  • The finding of "structurally human, semantically biased" has direct implications for auditing and debiasing strategies—detection efforts should focus on content signals rather than graph structure.
  • The domain-matched random baseline design is methodologically rigorous—reshuffling citations within the same field controls for topic distribution.
  • The progressive analysis (structure → embeddings → GNN) clearly delineates the contribution of each modeling layer.

Limitations & Future Work

  • Only parametric generation (without RAG) is tested; in practice, LLMs may employ retrieval-augmented generation.
  • The specific semantic dimensions underlying the observed differences (e.g., recency bias, prestige bias) are not deeply analyzed.
  • It remains unclear which dimensions of the 3072-D embeddings drive discriminative power.
  • Only binary classification is explored; multi-class settings (e.g., partially LLM-generated references) are not investigated.

Comparison with Prior Work

  • vs. LLM-Check: LLM-Check audits the existence of individual citations, whereas this paper evaluates graph-level patterns of entire reference lists.
  • vs. Algaba et al.: Prior work identified coarse-grained consistency; this paper achieves high-accuracy automatic detection via GNNs and embeddings.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of citation graphs and GNNs is novel, though the analytical framework itself is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 10,000 graphs, dual LLMs, dual embedding models, multiple baselines, and random embedding controls—highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Excellent visualizations and clear layer-by-layer analysis.
  • Value: ⭐⭐⭐⭐ Practically meaningful for academic integrity and AI-assisted writing.