Comparing LLM-generated and human-authored news text using formal syntactic theory¶

Conference: ACL 2025
arXiv: 2506.01407
Code: https://github.com/olzama/llm-syntax
Authors: Olga Zamaraeva, Dan Flickinger, Francis Bond, Carlos Gómez-Rodríguez
Affiliations: Universidade da Coruña, Independent Researcher, Palacký University at Olomouc
Area: AIGC Detection
Keywords: HPSG, English Resource Grammar, LLM text analysis, syntactic diversity, authorship analysis

TL;DR¶

This study is the first to systematically investigate the grammatical differences between 6 LLMs and human NYT news writing across three levels—syntactic constructions (298 types), lexical types (1,398 types), and morphological rules (100 types)—using HPSG formal syntactic theory (via the English Resource Grammar, ERG). The findings reveal that LLMs represent an "averaged" projection of human authors in terms of grammatical features: grammatical differences among individual human authors are actually greater than any difference between humans and LLMs, while LLMs exhibit almost no differences among themselves.

Background & Motivation¶

Background: There is a growing body of research comparing LLM-generated text with human text. However, existing work primarily focuses on training classifiers (to detect AI generation) or analyzing shallow features like lexical distribution and Universal Dependencies (UD), lacking deep grammatical analysis grounded in independent linguistic theories.
Limitations of Prior Work: Annotation schemes such as Universal Dependencies (UD) and Penn Treebank (PTB) are designed for NLP tasks. They have limited granularity and are not task-independent. For example, the obj relation in UD only refers to the dependency of a verb on its direct object, failing to distinguish more general head-complement constructions (nouns and adjectives can also take complements). Analyzing the output of NLP systems using NLP-oriented tools introduces inherent bias.
Key Challenge: Prior research has found that LLMs tend to repeat POS sequence templates (Shaib et al. 2024) and prefer specific Biber rhetorical features (Reinhart et al. 2024). However, these are top-down feature sets that fail to provide a comprehensive, consistent, and reproducible grammatical analysis framework to cover the long-tail distribution of English syntax.
Goal: How can we utilize formal linguistic theories independent of NLP to conduct a systematic and fine-grained comparison between LLM-generated and human-authored texts, down to the level of lexical-syntactic behavior?
Key Insight: By utilizing the computational implementation of HPSG—the English Resource Grammar (ERG, covering 94% of well-edited English text)—each sentence can be parsed into a complete, typed syntactic structure. The distributional differences can then be analyzed statistically across three independent levels (syntactic constructions, lexical types, and morphological rules).

Method¶

Overall Architecture¶

The research consists of three phases: Data Preparation \(\rightarrow\) ERG Formal Syntactic Parsing \(\rightarrow\) Multi-dimensional Statistical Analysis. The core idea is to map text to an explicit typed hierarchical structure using HPSG theory, and then compare humans and LLMs in the type distribution space.

Key Designs¶

1. HPSG/ERG Three-level Parsing System¶

HPSG (Head-driven Phrase Structure Grammar) is a fully explicit formal syntactic theory that represents syntactic structures and their semantic interface as complex feature-value graphs. The ERG (English Resource Grammar) is its largest implementation for English. The table below gives its size and how much of it the NYT data exercises:

Component	Total in ERG	Used in NYT data	Coverage
Syntactic types	298	289	97%
Lexical types	1,398	1,105	79%
Lexical entries	44,366	27,311	61%
Morphological rules	100	99	99%

The core advantage of ERG lies in its lexical type hierarchy: the same word can belong to different lexical types, encoding distinct syntactic behaviors. For example, "law" has two lexical entries—law_n1 (general countable/uncountable noun, "the law") and law_n2 (noun taking a clausal complement, "There is a law that..."). Human texts utilize both, while LLMs only use law_n1. This granularity is unavailable in UD or POS tagging.

2. Multi-dataset Cross-validation Design¶

Data sources cover three dimensions:

NYT Human Text: Lead paragraphs of New York Times articles (26,102 sentences) from 2023.10.01 to 2024.01.24, retrieved via the NYT Archive API.
LLM-generated Text: 6 models (LLaMA-7B/13B/30B/65B, Falcon-7B, Mistral-7B), generated using NYT headlines + first 3 words as prompts (approx. 214K sentences in total). All LLMs were released prior to 2023.10.01 to ensure they had not seen the corresponding human articles.
Redwoods Treebank: Portions of WSJ (43,043 sentences) and Wikipedia (10,726 sentences)—used to verify whether the findings hold across styles/genres.

The experimental design deliberately isolates two factors: ① Model scaling (the LLaMA series with the same architecture but different sizes); ② Model architecture (LLaMA vs. Falcon vs. Mistral).

3. Statistical Analysis Methods¶

Cosine Similarity + PCA Projection: Normalize the HPSG type frequencies of each dataset into vectors, calculate pairwise cosine similarity, and use PCA projection to visualize differences within the 98%-100% similarity range.
Shannon Entropy \(H\) and Gini-Simpson Index \(1-\lambda\): Quantify the diversity (evenness) of construction usage, with statistical significance assessed by a permutation test using 10,000 resamples.
Individual Author Analysis: Select 12 NYT reporters who published \(>100\) articles, calculate pairwise cosine similarities of their HPSG type distributions, and cross-compare with LLMs.
Mann-Whitney U Test: Perform statistical significance tests (with FDR correction) on the relative frequency differences of individual HPSG types.
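
The FDR correction mentioned in the last step can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg procedure, which this summary does not name explicitly, and the function name `benjamini_hochberg` and the p-values are hypothetical rather than taken from the paper or its repository.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha.

    Assumption: Benjamini-Hochberg step-up procedure (the summary only says "FDR").
    """
    m = len(p_values)
    # Indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical per-type p-values, e.g. one Mann-Whitney U test per HPSG type.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.20]
print(benjamini_hochberg(p_vals))
```

In this toy input, four p-values fall below 0.05 before correction, but only the two smallest survive it. This is the same effect the limitations section describes for some constructions.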

Key Experimental Results¶

Table 1: Syntactic Construction Frequency Differences (25K sentence sample)¶

Construction	Example	Human Freq.	LLM Mean	Direction
Head-complement	"It's not acceptable for democracy"	164,806	224,529	LLM >> Human
Subject-head	"The house passed the measure…"	17,850	27,753	LLM >> Human
Quantity NP	"many in Europe"	23,611	40,881	LLM >> Human
Relative clauses	"a vote that many have seen…"	4,929	6,721	LLM >> Human
Clause with extracted subject	"Chris Snow became an advocate…"	5,072	7,327	LLM >> Human
Marker clause	"and that's a good thing"	2,891	5,660	LLM >> Human
Clause conjunction fragment	"But the observation suits him."	939	2,076	LLM >> Human
Questions	"How do you stay safe?"	268	428	LLM >> Human
Participial clause	"having tried that,…"	1,736	1,116	Human >> LLM
Modifier clause apposition	"his critics, mostly unnamed"	826	434	Human >> LLM
Bare NP coordination	"author and commentator"	311	117	Human >> LLM
Paired marker	"Both this and other discussions"	326	185	Human >> LLM
Adjective-participle modifier	"right-handed", "red-colored"	125	64.6	Human >> LLM
Double NP apposition	"an eye for detail, decades of…"	11	5.2	Human >> LLM
Absolute VP	"As told, …"	10	3.8	Human >> LLM

Core Pattern: LLMs heavily utilize the most general, basic constructions (head-complement, subject-head), whereas humans employ a richer variety of low-frequency stylistic constructions (participial modifiers, double NP apposition, absolute VP).
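
The direction column of Table 1 can be made more concrete with a short worked example. The sketch below copies four rows from the table and computes the ratio of the mean LLM count to the human count. The helper name `llm_to_human_ratio` is hypothetical, and the ratio is only a rough indicator: it uses raw counts and does not correct for any difference in sentence length between the samples.

```python
# (construction, human count, mean LLM count), copied from Table 1.
rows = [
    ("Head-complement", 164806, 224529),
    ("Marker clause", 2891, 5660),
    ("Participial clause", 1736, 1116),
    ("Bare NP coordination", 311, 117),
]


def llm_to_human_ratio(human_count, llm_mean_count):
    """Ratio above 1 means the construction is more frequent in LLM output."""
    return llm_mean_count / human_count


for name, human_count, llm_mean_count in rows:
    ratio = llm_to_human_ratio(human_count, llm_mean_count)
    direction = "LLM > Human" if ratio > 1 else "Human > LLM"
    print(f"{name}: {ratio:.2f} ({direction})")
```

Marker clauses, for instance, appear roughly twice as often in the LLM samples (5,660 vs. 2,891), while bare NP coordination appears well under half as often (117 vs. 311).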

Table 2: Diversity Metric Comparison (Shannon Entropy)¶

Dimension	Human NYT	LLaMA-7B	LLaMA-13B	LLaMA-30B	LLaMA-65B	Falcon-7B	Mistral-7B	All LLMs Combined
Syntactic Construction \(H\)	3.342	3.259	3.249	3.270	3.284	3.221	3.267	3.265
Lexical Type \(H\)	4.727	4.844	4.877	4.858	4.860	4.700	4.847	—

All differences were validated as significant via permutation tests (10,000 resamples, \(p < 0.01\)).
Syntactic construction diversity: Human is the highest (\(H=3.342\)), LLaMA-65B is the closest (\(H=3.284\)), and Falcon is the lowest (\(H=3.221\)).
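Lexical type diversity shows a reversal: five of the six LLMs score higher than the human text (LLaMA-13B is the highest with \(H=4.877\) vs. Human \(H=4.727\)); Falcon-7B (\(H=4.700\)) is the only model below it.
Combining all LLM outputs does not raise syntactic diversity: the pooled value (\(H=3.265\)) stays inside the range of the individual models (3.221 to 3.284) and below the human value, which suggests that the models share the same high-frequency general constructions.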
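
The two diversity metrics and the permutation test behind Table 2 can be sketched as follows, with \(H = -\sum_i p_i \log p_i\) and \(1-\lambda = 1 - \sum_i p_i^2\), where \(p_i\) is the relative frequency of type \(i\). Several things here are assumptions: the logarithm base (natural log), the resampling scheme (shuffling pooled labels between the two corpora), the function names, and the toy labels, which are shorthand and not actual ERG rule identifiers.

```python
import math
import random
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy of the label distribution (natural log, an assumption)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def gini_simpson(labels):
    """Gini-Simpson index: 1 minus the sum of squared relative frequencies."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def permutation_p_value(a, b, stat, n_resamples=10_000, seed=0):
    """Two-sided p-value for |stat(a) - stat(b)| under random relabelling."""
    rng = random.Random(seed)
    observed = abs(stat(a) - stat(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(stat(pooled[:len(a)]) - stat(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Toy construction labels (hypothetical, not data from the paper).
human = ["hd-cmp"] * 50 + ["sb-hd"] * 25 + ["participial"] * 15 + ["abs-vp"] * 10
llm = ["hd-cmp"] * 80 + ["sb-hd"] * 18 + ["participial"] * 2

print(shannon_entropy(human), shannon_entropy(llm))
print(gini_simpson(human), gini_simpson(llm))
print(permutation_p_value(human, llm, shannon_entropy, n_resamples=2000))
```

The toy "human" sample spreads its mass over more constructions, so both metrics come out higher for it. The permutation test then asks how often a random split of the pooled labels produces a gap at least that large.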

Key Findings on Cosine Similarity¶

Syntactic Construction Cosine Similarity (Selected Raw Data):

Comparison Pair	Cosine Similarity
LLaMA-30B vs LLaMA-65B	0.9999
LLaMA-7B vs Mistral-7B	0.9999
Falcon-7B vs LLaMA-7B	0.9966
LLaMA-65B vs Human NYT	0.9964
LLaMA-7B vs Human NYT	0.9955
WSJ vs Human NYT	0.9949
Wikipedia vs Human NYT	0.9833

The syntactic similarity between LLMs (0.9966–0.9999) is consistently higher than that of any LLM compared to humans (0.9950–0.9965), which in turn is higher than the cross-genre similarity of human texts (Wikipedia vs. NYT = 0.9833).
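
As a minimal sketch of how such similarity values are obtained, the snippet below normalizes type counts into relative frequencies and computes their cosine similarity. The function names are hypothetical. The counts are three rows copied from Table 1, so the result illustrates the computation only and will not reproduce the values above, which are computed over all construction types.

```python
import math


def normalize(counts):
    """Turn raw type counts into relative frequencies."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity of two sparse vectors stored as dicts."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


# Three rows from Table 1: human count and mean LLM count.
human = normalize({"head-complement": 164806, "subject-head": 17850, "participial": 1736})
llm = normalize({"head-complement": 224529, "subject-head": 27753, "participial": 1116})

print(cosine_similarity(human, llm))
```

Cosine similarity is invariant to scaling, so the normalization step does not change the result; it is kept here only to mirror the description in the methods section. Because the measure is dominated by the highest-frequency types, values close to 1 are expected, which is presumably why the differences are inspected through a PCA projection.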

Vocabulary Footprint Differences (25K sentence sample)¶

Model	Human-Exclusive Lexical Types	LLM-Exclusive Lexical Types	Human-Exclusive Lexical Entries	LLM-Exclusive Lexical Entries
LLaMA-7B	62	70	5,704	2,519
LLaMA-13B	71	80	5,557	2,617
LLaMA-30B	65	62	5,531	2,608
LLaMA-65B	66	74	5,302	2,745
Mistral-7B	73	76	5,809	2,353
Falcon-7B	91	55	6,212	2,015
All LLMs	66	70	1,721	2,398

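In each pairwise comparison, the human text contains roughly two to three times as many exclusive lexical entries as the LLM (5,302 to 6,212 vs. 2,015 to 2,745). Against all LLMs combined the balance flips (1,721 human-exclusive vs. 2,398 LLM-exclusive entries), so the collective lexical coverage of the LLMs exceeds that of the human sample.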
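
The "exclusive" columns can be read as set differences between the inventories observed in two samples. The sketch below is one way to compute them; the function name and most identifiers are hypothetical (only `law_n1` and `law_n2` come from the example earlier in this summary), and the paper's code may do this differently.

```python
def exclusive_footprint(human_types, llm_types):
    """Split two type inventories into human-only, LLM-only, and shared sets."""
    human_types, llm_types = set(human_types), set(llm_types)
    return {
        "human_only": human_types - llm_types,
        "llm_only": llm_types - human_types,
        "shared": human_types & llm_types,
    }


# Toy inventories; identifiers other than law_n1 / law_n2 are made up.
human = {"law_n1", "law_n2", "couple_n1"}
llm_a = {"law_n1", "figure_n1"}
llm_b = {"law_n1", "couple_n1", "digit_n1"}

single = exclusive_footprint(human, llm_a)
combined = exclusive_footprint(human, llm_a | llm_b)

print(sorted(single["human_only"]))    # ['couple_n1', 'law_n2']
print(sorted(combined["human_only"]))  # ['law_n2']
print(sorted(combined["llm_only"]))    # ['digit_n1', 'figure_n1']
```

Taking the union of several LLM inventories can only shrink the human-only set and grow the candidate pool for the LLM-only set, which is consistent with the "All LLMs" row differing so sharply from the single-model rows.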

Core Findings: Individual Authors vs. LLMs¶

Inter-human differences > Human-LLM differences: Pairwise cosine similarities of syntactic distributions among the 12 NYT reporters vary considerably more than the similarities between any LLM and the human text.
Minimal differences among LLMs: The 6 LLMs are tightly clustered across all type dimensions.
Human variance is largest in the lexical type dimension: Individual humans differ significantly in lexical type usage, whereas LLMs exhibit minimal variance.
Minimal difference in the morphological rule dimension: Under the NYT genre constraints, humans and LLMs are almost indistinguishable in inflectional/derivational morphology (cosine similarity 0.9962–0.9990), with Falcon being the sole exception.

Highlights & Insights¶

The "LLM as a Grammatical Average" Hypothesis: This is the paper's central finding. In the grammatical dimension, LLM-generated text behaves like an "averaged" projection of human authors. One plausible explanation for inter-human variance exceeding human-LLM variance is that each LLM learns a "compromise" grammatical style that smooths out individual idiosyncrasies. The same averaging would also account for the LLMs' preference for the most general constructions, such as head-complement.
Importance of Three-Level Decoupling: Humans are more diverse in syntactic constructions (\(H=3.342 > 3.284\)), but LLMs are more diverse in lexical types (\(H=4.877 > 4.727\)), and there is almost no difference in morphological rules. Without decoupled analysis, these patterns would be obscured. This indicates that linguistic analysis must distinguish between morphological, syntactic, and lexical levels.
Diagnostic Value of Long-Tail Constructions: ERG covers the full long-tail distribution of English syntax, revealing that LLMs overuse several constructions rarely used by humans (e.g., number sequences, parenthetical modifiers, fragmented lexical conjunctions like "But!"), while lacking stylistic constructions that humans occasionally employ (absolute VPs, measure-noun modifier phrases).
"Informality" of Human Text: Despite strict style guidelines at the NYT, human authors still use informal vocabulary ("haven't", "a couple dozen"), imperative sentences ("See the results..."), and direct, emphatic expressions ("at your own risk") more frequently than LLMs. LLMs adhere more consistently to the formal prompted style, with their unique lexical entries heavily featuring numbers and punctuation types.
Superiority of Formal Grammar as an Analytical Tool: ERG distinguishes grammatical phenomena that UD cannot handle—head-complement is not equivalent to UD's obj, and lexical types distinguish different syntactic usages of the same word. This granularity enables this paper to discover novel patterns of discrepancy.

Limitations & Future Work¶

Limited to English NYT Genre: ERG is currently the only large-scale HPSG grammar reaching 94% coverage. HPSG grammars for other languages are not mature enough to support a similar scale of analysis, restricting cross-lingual generalization.
Outdated LLM Versions: The evaluated LLMs are LLaMA-1, Falcon-7B, and Mistral-7B (all released before October 2023), without covering newer generation models like GPT-4, Claude, or LLaMA-3.
Limited Statistical Power: Only 9 datasets were compared. After FDR correction for multiple comparisons, differences in high-frequency constructions became statistically non-significant—more diverse datasets are required for robust statistical conclusions.
Single Generation Control: LLMs only used one prompting strategy (headline + first 3 words). Different prompts and temperature settings could affect syntactic preferences.
Counter-intuitive Consistency in Morphological Rules: Why humans and LLMs align so tightly on morphological rules was not deeply explored—is it due to genre constraints or the low inherent variability of English morphology?

Work	Analysis Framework	Granularity	Core Findings
Muñoz-Ortiz et al. 2024	UD Dependency Syntax	Dependency relations + Vocabulary	Human text is shorter, dependency distance is more optimized, vocabulary is more diverse
Shaib et al. 2024	POS Sequence Templates	POS n-gram	LLMs tend to repeat POS templates
Reinhart et al. 2024	Biber Rhetorical Features	Predefined feature sets	LLMs prefer participial clauses, "that"-clauses, nominalization
Sardinha 2024	Biber Features	Predefined feature sets	LLMs and humans differ systematically across rhetorical dimensions
Ours	HPSG/ERG	298 constructions + 1398 lexical types + 100 morphological rules	LLMs represent the "grammatical average"; three levels show distinct difference patterns

Insights for Future Work:
- HPSG analysis can serve as a complementary feature for AIGC detection, especially since long-tail construction distribution patterns may be more robust than surface-level statistical features.
- The "LLM as a grammatical average" hypothesis can guide text watermarking design, for example by injecting low-frequency stylistic constructions to enhance watermark imperceptibility.
- The three-level decoupling can be transferred to LLM evaluation in other languages: even without a large-scale HPSG grammar, a more fine-grained syntactic analysis framework can replace coarse-grained UD.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First study using formal syntactic theory independent of NLP to analyze LLM text, demonstrating high methodological originality.
Experimental Thoroughness: ⭐⭐⭐⭐ Cross-validated with 6 LLMs and 3 human datasets (NYT/WSJ/Wikipedia), though statistical power is somewhat limited by the number of datasets.
Writing Quality: ⭐⭐⭐⭐ Accessible to both linguistics and NLP readers with rich examples, though discussion on some conclusions is slightly scattered.
Value: ⭐⭐⭐⭐ The finding of "LLMs as a grammatical average" is profound and inspiring; the methodology is reusable but restricted by the availability of English ERG.
Technical Depth: ⭐⭐⭐⭐ Solid application of HPSG theory with sound statistical analysis, though no new models or algorithms are proposed.