
Low-Perplexity LLM-Generated Sequences and Where To Find Them¶

Conference: ACL 2025
arXiv: 2507.01844
Code: GitHub
Area: AIGC Detection
Keywords: Low-perplexity sequences, training data attribution, memorization, verbatim reproduction, Infinigram, Pythia

TL;DR¶

This paper proposes a systematic pipeline that extracts low-perplexity sequences (token prediction probability \(\ge 0.9\)) from LLM generations and traces them back to training data sources. Overall, 59% of low-perplexity windows have no match in the training data (33-75% depending on the domain), and all windows are categorized into four behavior types based on their match count.

Background & Motivation¶

Background: Training Data Attribution (TDA) is a crucial direction for understanding how LLMs utilize training data. It is primarily divided into causal methods (retraining/gradient analysis, where computational costs explode as model size scales) and similarity-based methods (embedding/exact matching, which are scalable but only provide approximate attribution).

Limitations of Prior Work: Existing research on verbatim memorization primarily focuses on "whether models can be induced to output training data," lacking a systematic analysis of the relationship between low-perplexity generated text and training data.

Key Challenge: Intuitively, when an LLM generates high-confidence (low-perplexity) text, it should be copying the training data. However, does this assumption hold? Does low perplexity necessarily imply verbatim reproduction?

Key Insight: This paper focuses on specialized domains (genetics, nuclear physics, cryptography, and pharmacology), whose dense technical terminology makes it easier to elicit long low-perplexity segments, and builds a complete extraction \(\rightarrow\) matching \(\rightarrow\) classification pipeline.

Method¶

Overall Architecture¶

Random 20-40 token segments are extracted from Wikipedia articles to serve as prompts, and Pythia-6.9B generates continuations. For each generated text, the per-token perplexity is computed to identify all contiguous low-perplexity sequences (\(-\log_2(P) \le 0.152\), i.e., probability \(\ge 0.9\)). A 6-token window is then slid over these sequences, and Infinigram performs exact matching against the training data (The Pile, 300B tokens). Based on the match count \(c\), windows are categorized into four behavioral types.
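
The equivalence between the surprisal threshold and the probability threshold can be checked directly. This is a worked step, assuming the threshold is applied to per-token surprisal measured in bits:

\[
-\log_2 P \le 0.152 \iff P \ge 2^{-0.152} \approx 0.900
\]

Conversely, \(-\log_2 0.9 \approx 0.15200\), so the value 0.152 is the rounded surprisal of a token predicted with probability 0.9.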

Key Designs¶

Low-Perplexity Sequence Extraction:
- The per-token threshold is set to \(-\log_2(P) \le 0.152\) (probability \(\ge 0.9\)), and the longest contiguous subsequences that satisfy it are extracted.
- 40 Wikipedia articles are selected for each of the four domains, and random excerpts from them serve as prompts. 5 continuations are generated per prompt, for a total of 800 generations.
- Generation parameters: top_k=20, top_p=0.8, T=0.7.
- The average length of low-perplexity sequences is approximately 12-14 tokens, with a standard deviation of 11-15 tokens.
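
The extraction step can be sketched as a scan for maximal runs of high-probability tokens. This is a minimal illustration, not the paper's released code: the function name `low_perplexity_runs`, the `min_len` parameter, and the toy probabilities are assumptions, and the threshold is derived from the stated probability of 0.9 rather than from the rounded value 0.152.

```python
import math

# Assumption: the threshold is defined by p >= 0.9; its surprisal in bits is
# -log2(0.9), which the text rounds to 0.152.
MIN_PROB = 0.9
SURPRISAL_THRESHOLD_BITS = -math.log2(MIN_PROB)


def low_perplexity_runs(token_probs, min_len=1):
    """Return (start, end) pairs, end exclusive, of maximal contiguous runs in
    which every token's surprisal is at or below the threshold."""
    runs = []
    start = None
    for i, p in enumerate(token_probs):
        is_low = p > 0 and -math.log2(p) <= SURPRISAL_THRESHOLD_BITS
        if is_low and start is None:
            start = i  # a new run begins here
        elif not is_low and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None  # the run was broken by a low-confidence token
    # Close a run that extends to the end of the sequence.
    if start is not None and len(token_probs) - start >= min_len:
        runs.append((start, len(token_probs)))
    return runs


# Hypothetical per-token probabilities for one generation.
probs = [0.35, 0.95, 0.99, 0.97, 0.40, 0.92, 0.91, 0.93, 0.98, 0.96, 0.99, 0.20]
print(low_perplexity_runs(probs))             # every run
print(low_perplexity_runs(probs, min_len=6))  # only runs long enough for one 6-token window
```

Setting `min_len=6` keeps only runs that can yield at least one 6-token window in the matching step.
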
Fixed-Window Matching (6-token windows):
- A 6-token window (stride = 1) is slid across each low-perplexity sequence, and every window is matched exactly against the training data.
- A 6-token window is long enough to avoid chance matches yet short enough to capture meaningful segments.
- Infinigram provides large-scale, efficient index matching (it outperforms Elasticsearch in scalability and efficiency).
- For a low-perplexity sequence of length \(L \ge 6\), this produces \(L - 6 + 1 = L - 5\) windows.
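
The windowing and counting logic can be sketched as follows. This is a toy under stated assumptions: `count_occurrences` is a hypothetical brute-force stand-in for the Infinigram index (whose real API is not shown here), and the corpus and sequence are invented word-level examples rather than model tokens.

```python
def sliding_windows(tokens, size=6, stride=1):
    """Return every `size`-token window over `tokens`.
    With stride 1, a sequence of length L yields L - size + 1 windows
    (and none when L < size)."""
    return [tuple(tokens[i:i + size])
            for i in range(0, len(tokens) - size + 1, stride)]


def count_occurrences(window, corpus_tokens):
    """Brute-force exact-match count; a hypothetical stand-in for an index lookup."""
    n = len(window)
    return sum(1 for i in range(len(corpus_tokens) - n + 1)
               if tuple(corpus_tokens[i:i + n]) == tuple(window))


# Toy "training corpus" and one low-perplexity sequence of length 8.
corpus = ("the quick brown fox jumps over the lazy dog "
          "the quick brown fox jumps over the fence").split()
sequence = "quick brown fox jumps over the lazy cat".split()

windows = sliding_windows(sequence)
counts = [count_occurrences(w, corpus) for w in windows]
print(len(windows), counts)
```

The brute-force scan is linear in corpus size per query, which is why a prebuilt index is needed at the scale of The Pile.
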
Four-Category Behavior Classification (based on match count \(c\)):
- Synthetic Coherence (\(c = 0\)): No matches; coherent text generated entirely by the model itself. Standalone perplexity varies significantly, but even high-perplexity generations maintain coherence.
- Memorization (\(0 < c < 5\)): A small number of matches, which can be traced back to specific training documents with high precision. This is highly valuable for privacy and PII leakage detection.
- Segmental Replication (\(5 \le c < 50\)): Medium frequency, reflecting standardized expressions and terminology in the domain.
- Frequently Encountered Text (\(c \ge 50\)): A large number of matches, typically consisting of highly repetitive boilerplate text such as legal disclaimers, licensing terms, and HTML tags.
- The thresholds of 5 and 50 are chosen manually; gradient colors are used to indicate that the transitions between categories are smooth rather than sharp.
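
The classification rule is a simple bucketing of the match count. The sketch below is illustrative only: the function name and the example counts are hypothetical, and assigning \(c = 50\) to the last category is an assumption made here so that every count falls into exactly one bucket.

```python
from collections import Counter


def classify_window(match_count):
    """Map a window's exact-match count c to one of the four behavior types."""
    if match_count < 0:
        raise ValueError("match count cannot be negative")
    if match_count == 0:
        return "synthetic coherence"
    if match_count < 5:
        return "memorization"
    if match_count < 50:
        return "segmental replication"
    # Assumption: c == 50 is placed in this bucket.
    return "frequently encountered text"


# Hypothetical match counts for a handful of windows.
example_counts = [0, 0, 3, 12, 0, 730, 49, 50]
distribution = Counter(classify_window(c) for c in example_counts)
for label, n in distribution.items():
    print(f"{label}: {n / len(example_counts):.0%}")
```
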
Standalone Perplexity Evaluation:
- Each window's perplexity is recomputed without its preceding context to evaluate the fluency and coherence of the text on its own.
- Low standalone perplexity indicates that the text itself is fluent, coherent, and human-like.
- This distinguishes "low perplexity due to context" from "text that is naturally fluent on its own."
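
The contrast between in-context and standalone perplexity can be illustrated with a small calculation. This sketch assumes the usual definition of perplexity as two raised to the mean per-token surprisal in bits; the exact normalization used in the paper is not stated here, and both probability lists are invented. Obtaining real values would require rescoring each window with the model, which is not shown.

```python
import math


def perplexity(token_probs):
    """Perplexity as 2 ** (mean surprisal in bits) over the given tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_surprisal = sum(-math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** mean_surprisal


# Hypothetical probabilities for the same 6-token window, scored two ways.
in_context = [0.95, 0.97, 0.93, 0.99, 0.96, 0.94]   # with the full preceding text
standalone = [0.01, 0.20, 0.50, 0.90, 0.90, 0.90]   # window scored on its own

print(f"in-context perplexity: {perplexity(in_context):.2f}")
print(f"standalone perplexity: {perplexity(standalone):.2f}")
```

In this toy case the first tokens of the window are hard to predict without context, so the standalone value is much higher even though the in-context value is close to 1.
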

Key Experimental Results¶

Main Results: Match Statistics of Low-Perplexity Windows against Training Data¶

Domain	Total Windows \(N\)	Matched Windows \(N_{c>0}\)	Match Ratio	Prompt Overlap Ratio
Cryptography	1336	505	38%	32%
Pharmacology	988	659	67%	7.9%
Genetics	1337	481	36%	29%
Nuclear Physics	1040	264	25%	15%
Total	4701	1909	41%	21%
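
The match ratios in the table follow from the two count columns, which the short check below recomputes. The counts are copied from the table; the variable and function names are illustrative.

```python
# (total windows N, matched windows N_{c>0}) per domain, copied from the table.
TABLE = {
    "Cryptography": (1336, 505),
    "Pharmacology": (988, 659),
    "Genetics": (1337, 481),
    "Nuclear Physics": (1040, 264),
}


def match_ratio(total, matched):
    """Fraction of windows with at least one exact match in the training data."""
    return matched / total


total_windows = sum(n for n, _ in TABLE.values())
total_matched = sum(m for _, m in TABLE.values())

for domain, (n, m) in TABLE.items():
    print(f"{domain}: {match_ratio(n, m):.0%} matched")

overall = match_ratio(total_windows, total_matched)
print(f"Total: {overall:.0%} matched, {1 - overall:.0%} unmatched")
```
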

Behavioral Classification Distribution¶

Domain	Synthetic Coherence (STH)	Memorization (MEM)	Segmental Replication (SEG)	Frequently Encountered Text (FET)
Cryptography	62%	11%	13%	14%
Pharmacology	33%	7.5%	9.3%	50%
Genetics	64%	7.7%	11%	17%
Nuclear Physics	75%	8.1%	9.3%	8%

Model Scale Ablation (Genetics Domain)¶

Model Size	Low-Perplexity Windows \(N\)	Matched Windows \(N_{c>0}\)	Match Ratio	Standalone Perplexity
70M	8528	2874	34%	9.2
410M	2274	716	31%	8.4
1B	2766	878	32%	8.6
2.8B	1714	488	28%	8.6
6.9B	1337	481	36%	8.5

Temperature Ablation (Genetics Domain, Pythia-6.9B)¶

Temperature \(T\)	Low-Perplexity Windows \(N\)	Matched Windows \(N_{c>0}\)	Match Ratio	Standalone Perplexity
0.2	8787	2908	33%	8.7
0.4	4523	1461	32%	8.9
0.5	3297	1091	33%	8.8
0.7	1337	481	36%	8.5
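
One plausible mechanism behind the growth in window counts at low temperature is that temperature scaling sharpens the sampling distribution, so the sampler picks its top candidate more often. The sketch below only illustrates that general effect on invented logits; whether the paper measures token probabilities before or after temperature scaling is not stated here, so this should not be read as a reproduction of the table.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (numerically stabilized)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits for four candidate tokens.
toy_logits = [4.0, 3.0, 2.0, 1.0]
temperatures = (0.2, 0.4, 0.5, 0.7, 1.0)
top_probs = [max(softmax_with_temperature(toy_logits, t)) for t in temperatures]

for t, top in zip(temperatures, top_probs):
    print(f"T={t}: top-token probability = {top:.3f}")
```
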

Statistics of Low-Perplexity Sequence Lengths¶

Domain	Average Length	Standard Deviation
Cryptography	12	11
Pharmacology	14	15
Genetics	14	14
Nuclear Physics	13	12

Key Findings¶

59% of low-perplexity windows have no matches in the training data: This challenges the intuitive assumption that "low perplexity = verbatim copy," demonstrating that a large amount of high-confidence generation stems from model generalization ability.
Significant domain differences: Pharmacology exhibits the highest match rate (67%), because The Pile contains a vast amount of PubMed biomedical literature. Nuclear physics has the lowest (25%), reflecting sparser coverage of this field in the training data.
Approximately 20% fall into the 'manually reviewable' range: The number of matched documents in the Memorization and Segmental Replication categories is small enough to allow manual source auditing.
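Larger models generate fewer low-perplexity windows: From 70M to 6.9B, the number of windows drops from 8,528 to 1,337, suggesting higher generation diversity in larger models. The trend is not strictly monotonic (the 1B model yields 2,766 windows versus 2,274 for 410M), and the match ratio stays within 28-36% across sizes.
Temperature has limited impact on the match ratio: It remains stable within the 32-36% range, but lower temperatures sharply increase the total number of low-perplexity windows (8,787 at \(T = 0.2\) versus 1,337 at \(T = 0.7\)) and the amount of degenerate text.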
Frequently Encountered Text accounts for 50% in pharmacology: This is due to the high repetition of drug names and standardized biomedical terminology in PubMed.
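
The "approximately 20%" figure for the manually reviewable range can be recovered by adding the Memorization and Segmental Replication shares per domain. The percentages below are copied from the behavioral classification table; the names are illustrative.

```python
# Category shares in percent, copied from the behavioral classification table.
SHARES = {
    "Cryptography": {"STH": 62, "MEM": 11, "SEG": 13, "FET": 14},
    "Pharmacology": {"STH": 33, "MEM": 7.5, "SEG": 9.3, "FET": 50},
    "Genetics": {"STH": 64, "MEM": 7.7, "SEG": 11, "FET": 17},
    "Nuclear Physics": {"STH": 75, "MEM": 8.1, "SEG": 9.3, "FET": 8},
}


def reviewable_share(shares):
    """Share of windows whose match count is small enough for manual auditing."""
    return shares["MEM"] + shares["SEG"]


for domain, shares in SHARES.items():
    print(f"{domain}: {reviewable_share(shares):.1f}% reviewable")
```
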

Highlights & Insights¶

Empirical refutation of the 'Low Perplexity = Verbatim Copying' assumption: Nearly 60% of high-confidence generations cannot be traced back to the training data, indicating that models possess "synthetic coherence" capability. This challenges the theoretical basis of perplexity-based AIGC detection methods.
Practical utility of the four-class behavior classification framework: Although the thresholds are subjective, they provide an actionable analytical tool for LLM memorization behaviors. The "approximately 20% traceable" portion has direct practical value for privacy auditing and copyright compliance.
Domain discrepancies in matching serve as probes for training data coverage: Differences in match rates across domains reflect the distribution of training data over various fields, which can be utilized to evaluate model data exposure in specific domains.
Open source and reproducible: A complete open-source pipeline is provided, facilitating experimental replication across different models and datasets.

Limitations & Future Work¶

The choice of thresholds (\(c = 5, c = 50\)) is arbitrary, lacks clustering validation, and suffers from fuzzy classification boundaries.
Evaluated only on the Pythia model family; has not been validated on mainstream closed-source or open-source models like GPT or LLaMA.
Prompts are drawn from Wikipedia, which is itself part of The Pile, so some matches may simply reflect overlap with the prompt source and artificially inflate the match rate.
Tokenizer mismatch between Pythia and Infinigram (which uses the LLaMA-2 tokenizer) may cause true matches to be missed.
High standalone perplexity does not consistently indicate text degradation; the reliability of this metric requires further validation.
Only covers 4 scientific domains, lacking diverse scenarios such as daily conversations, news, and code.

Related Work¶

Carlini et al. (2021): Explores how to "extract" training data from LLMs; in contrast, this paper focuses on the alignment between low-perplexity segments in natural generation and the training data.
Liu et al. (2025a) Infinigram: Provides high-efficiency TDA tools; this work employs this tool but focuses on the analytical framework for low-perplexity segments.
McCoy et al. (2023), Merrill et al. (2024): Studies on LLM novelty/memorization; this work approaches this topic from a perplexity perspective, offering a fresh angle.
Prashanth et al. (2025): Hypothesizes that low-perplexity sequences imply degradation or verbatim copying; this study partially refutes this hypothesis through empirical experiments.
Gonen et al. (2024): Uses standalone perplexity to evaluate text quality; this paper introduces it into training data attribution analysis.

Rating¶

Novelty: ⭐⭐⭐⭐ The perspective of combining low-perplexity sequences with training data attribution is novel, and the discovery of "synthetic coherence" is highly valuable.
Experimental Thoroughness: ⭐⭐⭐ Only used Pythia on 4 domains; scale is limited but ablation studies are relatively complete.
Writing Quality: ⭐⭐⭐⭐ Clearly structured with good visualizations, and the classification framework is intuitive and easy to understand.
Value: ⭐⭐⭐⭐ Provides useful empirical insights for the fields of training data attribution and AIGC detection.