Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs

Conference: ICLR 2026 arXiv: 2601.20704 Code: None Area: AI Safety / Graph Learning Keywords: LLM reference detection, citation graph, graph neural networks, semantic embeddings, academic integrity

TL;DR

By constructing paired citation graphs (human vs. GPT-4o-generated vs. random baseline) for 10,000 papers, this work finds that LLM-generated reference lists are nearly indistinguishable from human ones in terms of graph topology (RF accuracy only 60%), yet are effectively detectable via semantic embeddings (RF 83%, GNN 93%). This indicates that LLMs accurately mimic citation topology while leaving detectable semantic fingerprints.

Background & Motivation

Background: LLMs are increasingly used to synthesize scientific knowledge, draft literature reviews, and suggest references. Prior studies have found that LLM-generated references resemble human ones on coarse-grained metrics (title length, team size, citation count), but exhibit systematic biases at finer granularity (amplified Matthew effect, preference for recent papers, reduced self-citations).

Limitations of Prior Work: It remains unclear whether LLM-generated and human-generated reference lists can be reliably distinguished. Single-reference auditing approaches (e.g., LLM-Check) are insufficient to capture list-level patterns.

Key Challenge: Do LLMs genuinely understand citation structure, or merely imitate it superficially? If topological structure is similar, where do the differences lie?

Goal: To systematically evaluate the structural and semantic differences between LLM-generated and human citation graphs, and to develop corresponding detection methods.

Key Insight: A progressive modeling strategy—from interpretable graph structural features to semantic embeddings to GNNs—that incrementally isolates the contributions of topology vs. semantics.

Core Idea: LLM references are "structurally human, semantically biased"—detection should target content signals rather than graph structure.

Method

Overall Architecture

10,000 papers are sampled from SciSciNet → paired citation graphs are constructed for each paper (ground truth, GPT-4o-generated, and a domain-matched random baseline) → structural features are extracted (degree/closeness/eigenvector centrality, clustering coefficient, edge count) → semantic embeddings are extracted (OpenAI text-embedding-3-large, 3072-D) → RF and GNN classifiers are evaluated on the three pairwise binary tasks (GT vs. GPT, GT vs. random, GPT vs. random).
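The degree-preserving random baseline in this pipeline can be sketched in a few lines. The function and toy data below are illustrative, not the authors' code: references are pooled per field, shuffled, and redealt so that every paper keeps its reference count and every cited paper keeps its citation count within the field.

```python
import random

def random_baseline(citations, field, seed=0):
    """Reshuffle citations within each field, preserving degrees.

    citations: dict mapping focal paper -> list of cited paper IDs
    field:     dict mapping focal paper -> field label
    (Toy sketch; the paper pools by field to control topic distribution.)
    """
    rng = random.Random(seed)
    # Pool all cited endpoints per field, then shuffle each pool.
    pools = {}
    for paper, refs in citations.items():
        pools.setdefault(field[paper], []).extend(refs)
    for f in pools:
        rng.shuffle(pools[f])
    # Redeal: each paper draws the same number of references it had,
    # from its own field's pool.
    shuffled = {}
    for paper, refs in citations.items():
        pool = pools[field[paper]]
        shuffled[paper] = [pool.pop() for _ in refs]
    return shuffled
```

Because the pools are permutations of the original reference multisets, both the out-degree of each focal paper and the per-field in-degree distribution of cited papers are preserved by construction.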

Key Designs

  1. Citation Graph Construction:

    • Function: Construct paired ground-truth/generated citation graphs for each paper.
    • Mechanism: The focal paper serves as the main node; cited papers are child nodes; citation relations are retrieved from SciSciNet. GPT-4o generates references purely parametrically given title, abstract, author information, etc. The random baseline uniformly reshuffles citations within the same field, preserving the degree distribution.
    • Design Motivation: Controlled experiment—three citation graphs for the same focal paper are directly comparable.
  2. Structural Features vs. Semantic Embeddings:

    • Structural features: Degree/closeness/eigenvector centrality, clustering coefficient, edge count → RF classification.
    • Semantic embeddings: OpenAI text-embedding-3-large (3072-D) → graph-level aggregation → RF / used as GNN node features.
    • Design Motivation: Disentangle the contributions of topological signals and content signals.
  3. GNN Graph Classification:

    • Function: Graph-level binary classification using GCN/GAT/GIN/GraphSAGE.
    • Mechanism: Node features are either structural attributes (5-D) or semantic embeddings (3072-D); graph-level readout is followed by binary classification.
    • Design Motivation: GNNs can jointly exploit both structural and semantic signals.
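As a concrete illustration of the structural side of this design, the snippet below computes toy versions of three of the structural features (edge count, degree centrality, local clustering coefficient) on an undirected adjacency dict. The function name and graph representation are my own; the paper additionally uses closeness and eigenvector centrality.

```python
def graph_features(adj):
    """Per-graph structural summary on an undirected adjacency dict
    {node: set(neighbors)}. Assumes at least two nodes."""
    n = len(adj)
    edges = sum(len(nb) for nb in adj.values()) // 2  # each edge counted twice

    # Degree centrality: degree normalized by the maximum possible (n - 1).
    deg_cent = {v: len(nb) / (n - 1) for v, nb in adj.items()}

    # Local clustering coefficient: fraction of a node's neighbor pairs
    # that are themselves connected.
    def clustering(v):
        nb = list(adj[v])
        k = len(nb)
        if k < 2:
            return 0.0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nb[j] in adj[nb[i]])
        return 2 * links / (k * (k - 1))

    clust = {v: clustering(v) for v in adj}
    return {
        "edges": edges,
        "mean_degree_centrality": sum(deg_cent.values()) / n,
        "mean_clustering": sum(clust.values()) / n,
    }
```

Graph-level averages like these form the low-dimensional feature vector fed to the RF classifier; the semantic pipeline replaces them with pooled 3072-D embeddings.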

Loss & Training

Adam optimizer; 70/15/15 train/validation/test split; class-balanced dataset. Robustness is validated with two generator LLMs (GPT-4o and Claude Sonnet 4.5) and two embedding models (SPECTER and OpenAI text-embedding-3-large).
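A minimal sketch of the graph-classification forward pass, assuming one mean-aggregation message-passing layer, a mean readout, and a logistic head. This is a simplification of the GCN/GAT/GIN/GraphSAGE variants the paper actually trains, and all weights below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_graph_logit(X, A, W1, w2):
    """X: (n, d) node features (structural 5-D or embedding 3072-D);
    A: (n, n) symmetric adjacency. Returns P(graph is LLM-generated)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    A_norm = A_hat / A_hat.sum(1, keepdims=True)   # row-normalized mean aggregation
    H = np.maximum(A_norm @ X @ W1, 0.0)           # ReLU(message passing)
    g = H.mean(axis=0)                             # graph-level mean readout
    return 1 / (1 + np.exp(-g @ w2))               # logistic binary head

# Toy example: a 4-node citation graph with 8-D stand-in node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)
p = gnn_graph_logit(X, A, W1, w2)
```

In training, the weights would be fit with Adam against a binary cross-entropy loss on the balanced 70/15/15 split; swapping the 8-D toy features for 3072-D semantic embeddings corresponds to the paper's best-performing configuration.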

Key Experimental Results

Main Results

Method                       GT vs GPT   GT vs Random   GPT vs Random
RF (structural features)     0.608       0.896          0.928
RF (semantic embeddings)     0.835       0.908          0.953
GNN (structural features)    ~0.55       ~0.90          ~0.93
GNN (semantic embeddings)    0.93        ~0.95          ~0.97

Ablation Study

Configuration                                GT vs GPT Accuracy   Notes
GNN + embeddings                             93%                  Best configuration
RF + embeddings                              83.5%                Semantic embeddings contribute most of the gain
RF + structure                               60.8%                Only slightly above chance
GNN + structure                              ~55%                 Structure alone is insufficient
Random embedding substitution                ~50%                 Effect is not due to dimensionality
Cross-generator (train on GPT, test Claude)  ~72%                 Generalizes to other LLMs

Key Findings

  • Topology is nearly indistinguishable: Centrality and clustering coefficients of GPT citation graphs heavily overlap with ground-truth graphs; RF achieves only 60%.
  • Semantic fingerprints are detectable: Embedding features improve accuracy from 60% to 83% (RF) / 93% (GNN).
  • Random baseline is easily separated: Ground-truth vs. random achieves 89%+, GPT vs. random achieves 93%+—demonstrating that GPT does generate structurally plausible citations.
  • Cross-LLM generalization: A classifier trained on GPT-4o achieves 72% accuracy on Claude.
  • Replacing embeddings with random vectors drops accuracy to 50%, confirming that discriminative power derives from semantic structure rather than dimensionality.

Highlights & Insights

  • The finding of "structurally human, semantically biased" has direct implications for auditing and debiasing strategies—detection efforts should focus on content signals rather than graph structure.
  • The domain-matched random baseline design is methodologically rigorous—reshuffling citations within the same field controls for topic distribution.
  • The progressive analysis (structure → embeddings → GNN) clearly delineates the contribution of each modeling layer.

Limitations & Future Work

  • Only parametric generation (without RAG) is tested; in practice, LLMs may employ retrieval-augmented generation.
  • The specific semantic dimensions underlying the observed differences (e.g., recency bias, prestige bias) are not deeply analyzed.
  • It remains unclear which dimensions of the 3072-D embeddings drive discriminative power.
  • Only binary classification is explored; multi-class settings (e.g., partially LLM-generated references) are not investigated.

Comparison with Prior Work

  • vs. LLM-Check: LLM-Check audits the existence of individual citations, whereas this paper evaluates graph-level patterns of entire reference lists.
  • vs. Algaba et al.: Prior work identified coarse-grained consistency; this paper achieves high-accuracy automatic detection via GNNs and embeddings.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of citation graphs and GNNs is novel, though the analytical framework itself is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 10,000 graphs, dual LLMs, dual embedding models, multiple baselines, and random embedding controls—highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Excellent visualizations and clear layer-by-layer analysis.
  • Value: ⭐⭐⭐⭐ Practically meaningful for academic integrity and AI-assisted writing.