Low-Perplexity LLM-Generated Sequences and Where To Find Them

Conference: ACL 2025
arXiv: 2507.01844
Code: GitHub
Area: AIGC Detection
Keywords: Low-perplexity sequences, training data attribution, memorization, verbatim reproduction, Infinigram, Pythia

TL;DR

This paper proposes a systematic pipeline for analyzing low-perplexity sequences (token prediction probability \(\ge 0.9\)) generated by LLMs and tracing them back to their training data sources. Depending on the domain, 33-75% of low-perplexity windows (59% overall) cannot be matched to the training data, and the matchable windows are grouped into memorization behaviors by how often they occur in the corpus.

Background & Motivation

Background: Training Data Attribution (TDA) is a crucial direction for understanding how LLMs utilize training data. It is primarily divided into causal methods (retraining/gradient analysis, where computational costs explode as model size scales) and similarity-based methods (embedding/exact matching, which are scalable but only provide approximate attribution).

Limitations of Prior Work: Existing research on verbatim memorization primarily focuses on "whether models can be induced to output training data," lacking a systematic analysis of the relationship between low-perplexity generated text and training data.

Key Challenge: Intuitively, when an LLM generates high-confidence (low-perplexity) text, it should be copying the training data. However, does this assumption hold? Does low perplexity necessarily imply verbatim reproduction?

Key Insight: This paper focuses on specialized domains (genetics, nuclear physics, cryptography, and pharmacology), where dense technical terminology makes long low-perplexity segments easier to elicit, and builds a complete extraction \(\rightarrow\) matching \(\rightarrow\) classification pipeline.

Method

Overall Architecture

Random 20-40 token segments are extracted from Wikipedia articles to serve as prompts, and Pythia-6.9B generates the continuations. The perplexity of each generated token is recorded, and all contiguous low-perplexity sequences are identified (\(-\log_2(P) \le 0.152\), i.e., probability \(P \ge 0.9\)). A 6-token window is slid over these sequences, and Infinigram performs exact matching of each window against the training data (The Pile, 300B tokens). Based on the match count \(c\), windows are assigned to one of four behavioral types.
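The extraction and windowing steps can be sketched as follows. This is a minimal illustration and not the authors' released code: it assumes the per-token probabilities of a generation are already available as a list of floats, and the function names `low_perplexity_runs` and `sliding_windows` are hypothetical. The threshold is applied directly on the probability (\(P \ge 0.9\)), which is equivalent to the rounded bound \(-\log_2(P) \le 0.152\).

```python
import math

MIN_PROB = 0.9   # probability threshold from the text; -log2(0.9) is about 0.152
WINDOW = 6       # window size used for exact matching


def low_perplexity_runs(token_probs, min_prob=MIN_PROB):
    """Return (start, end) pairs, end exclusive, of maximal runs of
    consecutive tokens whose probability is at least min_prob."""
    runs, start = [], None
    for i, p in enumerate(token_probs):
        if p >= min_prob:
            if start is None:
                start = i            # a new run begins here
        elif start is not None:
            runs.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:            # run reaches the end of the generation
        runs.append((start, len(token_probs)))
    return runs


def sliding_windows(tokens, start, end, size=WINDOW):
    """Return every size-token window (stride 1) inside tokens[start:end].
    A run of length L yields L - size + 1 windows, or none if L < size."""
    return [tuple(tokens[i:i + size]) for i in range(start, end - size + 1)]


# Toy example with made-up probabilities (hypothetical data).
probs = [0.5, 0.95, 0.99, 0.92, 0.97, 0.91, 0.96, 0.93, 0.3, 0.99]
tokens = list("abcdefghij")
runs = low_perplexity_runs(probs)
windows = [w for s, e in runs for w in sliding_windows(tokens, s, e)]
```

In this toy input the first run covers 7 tokens and therefore yields 2 windows, while the final single-token run is too short to yield any. Each window would then be sent to an exact-match index to obtain its match count \(c\).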

Key Designs

  1. Low-Perplexity Sequence Extraction:

    • The token perplexity threshold is set to \(-\log_2(P) \le 0.152\) (probability \(P \ge 0.9\)), and the longest contiguous subsequences that satisfy it are extracted.
    • 40 Wikipedia articles are selected for each of the four domains, and random quotes from them serve as prompts. Five continuations are generated per prompt, for a total of 800 generations (4 domains \(\times\) 40 articles \(\times\) 5 continuations).
    • Generation parameters: top_k=20, top_p=0.8, T=0.7.
    • The average length of low-perplexity sequences is approximately 12-14 tokens, with a standard deviation of 11-15 tokens.
  2. Fixed-Window Matching (6-token windows):

    • Sliding a 6-token window (stride=1) across the low-perplexity sequences, executing exact matching for each window against the training data.
    • A 6-token window is long enough to avoid random matching yet short enough to capture meaningful segments.
    • Utilizing Infinigram for large-scale, high-efficiency index matching (which outperforms Elasticsearch in scalability and efficiency).
    • For a low-perplexity sequence of length \(L \ge 6\), this produces \(L - 6 + 1 = L - 5\) windows; shorter sequences produce none.
  3. Four Categories of Memorization Behavior Classification (based on match count \(c\)):

    • Synthetic Coherence (\(c = 0\)): No matches; coherent text generated entirely by the model itself. Standalone perplexity varies significantly, but even high-perplexity generations maintain coherence.
    • Memorization (\(0 < c < 5\)): A small number of matches, which can be traced back to specific training documents with high precision. This is highly valuable for privacy and PII leakage detection.
    • Segmental Replication (\(5 \le c < 50\)): Medium frequency, reflecting standardized expressions and terminology in the domain.
    • Frequently Encountered Text (\(c \ge 50\)): A large number of matches, typically highly repetitive boilerplate such as legal disclaimers, licensing terms, and HTML tags.
    • The thresholds of 5 and 50 are manually selected, with gradient colors representing smooth transitions between categories.
  4. Standalone Perplexity Evaluation:

    • Recalculating window perplexity without context to evaluate the fluency and coherence of the text itself.
    • Low standalone perplexity indicates that the text itself is fluent, coherent, and human-like.
    • Used to distinguish between "low perplexity due to context" and "naturally fluent text itself."
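
The four-way classification by match count can be written as a small function. This is a sketch under stated assumptions: the category labels and the names `classify_window` and `behavior_distribution` are hypothetical, and a window with exactly \(c = 50\) matches is assigned to Frequently Encountered Text, which is an assumption about where the upper boundary falls.

```python
from collections import Counter


def classify_window(match_count: int) -> str:
    """Map a window's training-data match count c to a behavior category,
    using the manually chosen thresholds 5 and 50."""
    if match_count < 0:
        raise ValueError("match count cannot be negative")
    if match_count == 0:
        return "synthetic_coherence"
    if match_count < 5:
        return "memorization"
    if match_count < 50:
        return "segmental_replication"
    return "frequently_encountered_text"  # assumption: c = 50 lands here


def behavior_distribution(match_counts):
    """Return the percentage of windows falling into each category."""
    tally = Counter(classify_window(c) for c in match_counts)
    total = sum(tally.values())
    return {label: 100 * n / total for label, n in tally.items()}


# Hypothetical match counts for four windows.
distribution = behavior_distribution([0, 0, 3, 7])
```

Because the thresholds are plain constants, trying alternative cut-offs only requires changing two comparisons, which is useful given that the boundaries were chosen by hand.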

Key Experimental Results

Main Results: Match Statistics of Low-Perplexity Windows against Training Data

| Domain | Total Windows \(N\) | Matched Windows \(N_{c>0}\) | Match Ratio | Prompt Overlap Ratio |
| --- | --- | --- | --- | --- |
| Cryptography | 1336 | 505 | 38% | 32% |
| Pharmacology | 988 | 659 | 67% | 7.9% |
| Genetics | 1337 | 481 | 36% | 29% |
| Nuclear Physics | 1040 | 264 | 25% | 15% |
| Total | 4701 | 1909 | 41% | 21% |
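
The match ratios in this table follow directly from the window counts, \(N_{c>0} / N\). The short check below recomputes them from the reported counts; the variable names are illustrative only, and the counts are copied from the table above.

```python
# (total windows N, matched windows N_{c>0}) per domain, as reported in the table.
window_counts = {
    "Cryptography": (1336, 505),
    "Pharmacology": (988, 659),
    "Genetics": (1337, 481),
    "Nuclear Physics": (1040, 264),
}


def match_ratio(total: int, matched: int) -> float:
    """Percentage of low-perplexity windows with at least one exact match."""
    return 100 * matched / total


per_domain = {d: match_ratio(n, m) for d, (n, m) in window_counts.items()}

# The overall ratio is computed over pooled windows, not as a mean of domain ratios.
total_n = sum(n for n, _ in window_counts.values())
total_m = sum(m for _, m in window_counts.values())
overall = match_ratio(total_n, total_m)
unmatched_overall = 100 - overall
```

Pooling the counts gives roughly 41% matched and 59% unmatched windows overall, which is the figure quoted in the key findings.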

Behavioral Classification Distribution

| Domain | Synthetic Coherence (STH) | Memorization (MEM) | Segmental Replication (SEG) | Frequently Encountered Text (FET) |
| --- | --- | --- | --- | --- |
| Cryptography | 62% | 11% | 13% | 14% |
| Pharmacology | 33% | 7.5% | 9.3% | 50% |
| Genetics | 64% | 7.7% | 11% | 17% |
| Nuclear Physics | 75% | 8.1% | 9.3% | 8% |
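
Summing the Memorization and Segmental Replication columns gives the share of windows whose match counts are low enough for manual source auditing. The snippet below is a rough check on the percentages reported above; the unweighted mean across domains is an illustrative aggregate chosen here, not a figure from the paper.

```python
# (MEM %, SEG %) per domain, copied from the classification table.
mem_seg = {
    "Cryptography": (11.0, 13.0),
    "Pharmacology": (7.5, 9.3),
    "Genetics": (7.7, 11.0),
    "Nuclear Physics": (8.1, 9.3),
}

# Share of windows with 0 < c < 50, i.e. traceable to a small set of documents.
reviewable = {d: round(mem + seg, 1) for d, (mem, seg) in mem_seg.items()}

# Unweighted mean over the four domains (assumption: domains weighted equally).
mean_reviewable = sum(reviewable.values()) / len(reviewable)
```

The per-domain shares range from about 17% to 24%, in line with the "approximately 20%" figure cited for the manually reviewable range.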

Model Scale Ablation (Genetics Domain)

| Model Size | Low-Perplexity Windows \(N\) | Matched Windows \(N_{c>0}\) | Match Ratio | Standalone Perplexity |
| --- | --- | --- | --- | --- |
| 70M | 8528 | 2874 | 34% | 9.2 |
| 410M | 2274 | 716 | 31% | 8.4 |
| 1B | 2766 | 878 | 32% | 8.6 |
| 2.8B | 1714 | 488 | 28% | 8.6 |
| 6.9B | 1337 | 481 | 36% | 8.5 |

Temperature Ablation (Genetics Domain, Pythia-6.9B)

| Temperature \(T\) | Low-Perplexity Windows \(N\) | Matched Windows \(N_{c>0}\) | Match Ratio | Standalone Perplexity |
| --- | --- | --- | --- | --- |
| 0.2 | 8787 | 2908 | 33% | 8.7 |
| 0.4 | 4523 | 1461 | 32% | 8.9 |
| 0.5 | 3297 | 1091 | 33% | 8.8 |
| 0.7 | 1337 | 481 | 36% | 8.5 |

Statistics of Low-Perplexity Sequence Lengths

| Domain | Average Length (tokens) | Standard Deviation (tokens) |
| --- | --- | --- |
| Cryptography | 12 | 11 |
| Pharmacology | 14 | 15 |
| Genetics | 14 | 14 |
| Nuclear Physics | 13 | 12 |

Key Findings

  • 59% of low-perplexity windows have no matches in the training data: This challenges the intuitive assumption that "low perplexity = verbatim copy," demonstrating that a large amount of high-confidence generation stems from model generalization ability.
  • Significant domain differences: Pharmacology exhibits the highest match rate (67%), because The Pile contains a vast amount of PubMed biomedical literature. Nuclear physics has the lowest (25%), reflecting less coverage of training data in this field.
  • Approximately 20% fall into the 'manually reviewable' range: The number of matched documents in the Memorization and Segmental Replication categories is small enough to allow manual source auditing.
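  • Larger models generate fewer low-perplexity windows: From 70M to 6.9B, the number of windows drops from 8,528 to 1,337, indicating higher generation diversity in larger models, although the decline is not strictly monotonic (2,274 windows at 410M versus 2,766 at 1B).
  • Temperature has limited impact on the match ratio: It stays within the 32-36% range across \(T = 0.2\) to \(0.7\), but lower temperatures sharply increase the number of low-perplexity windows (8,787 at \(T = 0.2\) versus 1,337 at \(T = 0.7\)) and the amount of text degradation.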
  • Frequently Encountered Text accounts for 50% in pharmacology: This is due to the high repetition of drug names and standardized biomedical terminology in PubMed.

Highlights & Insights

  • Empirical refutation of the 'Low Perplexity = Verbatim Copying' assumption: Nearly 60% of high-confidence generations cannot be traced back to the training data, indicating that models possess "synthetic coherence" capability. This challenges the theoretical basis of perplexity-based AIGC detection methods.
  • Practical utility of the four-class behavior classification framework: Although the thresholds are subjective, they provide an actionable analytical tool for LLM memorization behaviors. The "approximately 20% traceable" portion has direct practical value for privacy auditing and copyright compliance.
  • Domain discrepancies in matching serve as probes for training data coverage: Differences in match rates across domains reflect the distribution of training data over various fields, which can be utilized to evaluate model data exposure in specific domains.
  • Open source and reproducible: A complete open-source pipeline is provided, facilitating experimental replication across different models and datasets.

Limitations & Future Work

  • The choice of thresholds (\(c = 5, c = 50\)) is arbitrary, lacks clustering validation, and suffers from fuzzy classification boundaries.
  • Evaluated only on the Pythia model family; has not been validated on mainstream closed-source or open-source models like GPT or LLaMA.
  • Prompts are drawn from Wikipedia, which is itself part of The Pile, so continuations may overlap with the prompts' source articles and artificially inflate the match rate.
  • Tokenizer mismatch between Pythia and Infinigram (which uses the LLaMA-2 tokenizer), potentially leading to missed true matches.
  • High standalone perplexity does not consistently indicate text degradation; the reliability of this metric requires further validation.
  • Only covers 4 scientific domains, lacking diverse scenarios such as daily conversations, news, and code.

Related Work

  • Carlini et al. (2021): Explores how to "extract" training data from LLMs; in contrast, this paper focuses on the alignment between low-perplexity segments in natural generation and the training data.
  • Liu et al. (2025a) Infinigram: Provides high-efficiency TDA tools; this work employs this tool but focuses on the analytical framework for low-perplexity segments.
  • McCoy et al. (2023), Merrill et al. (2024): Studies on LLM novelty/memorization; this work approaches this topic from a perplexity perspective, offering a fresh angle.
  • Prashanth et al. (2025): Hypothesizes that low-perplexity sequences imply degradation or verbatim copying; this study partially refutes this hypothesis through empirical experiments.
  • Gonen et al. (2024): Uses standalone perplexity to evaluate text quality; this paper introduces it into training data attribution analysis.

Rating

  • Novelty: ⭐⭐⭐⭐ The perspective of combining low-perplexity sequences with training data attribution is novel, and the discovery of "synthetic coherence" is highly valuable.
  • Experimental Thoroughness: ⭐⭐⭐ Only used Pythia on 4 domains; scale is limited but ablation studies are relatively complete.
  • Writing Quality: ⭐⭐⭐⭐ Clearly structured with good visualizations, and the classification framework is intuitive and easy to understand.
  • Value: ⭐⭐⭐⭐ Provides useful empirical insights for the fields of training data attribution and AIGC detection.