Frankentext: Stitching Random Text Fragments into Long-Form Narratives

Conference: ACL 2026 arXiv: 2505.18128 Code: GitHub Area: AIGC Detection Keywords: AIGC detection, mixed authorship attribution, controllable text generation, AI text detector, human-AI collaborative writing

TL;DR

This paper proposes Frankentext, a paradigm where LLMs stitch random human text fragments into coherent long-form narratives under extreme constraints (90% content verbatim-copied from human writing), revealing severe failures of existing AI text detectors in mixed-authorship scenarios (72% of Frankentext is misclassified as human-written).

Background & Motivation

Background: As LLM-generated text quality continues to improve, AI text detection has become critical for academic integrity and content provenance. Existing detectors primarily assume a binary (AI vs. human) classification paradigm.

Limitations of Prior Work: In reality, there exists a substantial "gray area" of human-AI collaborative writing—text is neither purely AI nor purely human-authored, but a mixture. Existing binary detectors (e.g., Binoculars, FastDetectGPT) cannot effectively identify such mixed text.

Key Challenge: Current detection methods rely on surface features (e.g., perplexity, statistical signatures), but when AI-generated content heavily embeds genuine human text, these statistical features are diluted, causing detection failure.

Goal: Systematically study an extreme controllable generation paradigm—Frankentext, where LLMs must generate coherent narratives with most tokens verbatim-copied from human writing, to expose detector vulnerabilities and advance fine-grained detection methods.

Key Insight: Inspired by Frankenstein—assembling a complete "creature" from fragments of different sources. The LLM acts as a composer rather than a writer, selecting, arranging, and stitching coherent stories from thousands of random human text fragments.

Core Idea: Through a prompt-based pipeline, LLMs select and stitch randomly sampled human text paragraphs while maintaining a specified copy rate (e.g., 90%) and generating coherent, relevant narratives, thereby posing a fundamental challenge to existing AI detectors.

Method

Overall Architecture

The Frankentext pipeline has two main stages: first, randomly sample 1500 human text fragments (~103K BPE tokens) from a large book corpus (Books3, containing 197K books and 156M paragraphs) and feed them, alongside a writing prompt, to the LLM for constrained draft generation; second, iteratively edit the draft to fix contradictions and incoherent passages.
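The sampling stage can be sketched as below. This is a minimal illustration, not the paper's code: `corpus` stands in for the 156M-paragraph pool, and word count stands in for the BPE tokenizer.

```python
import random

def sample_fragments(corpus, n=1500, token_budget=103_000):
    """Randomly draw paragraph-level fragments until we reach `n` paragraphs
    or exhaust the token budget. Word count is a stand-in for BPE tokens."""
    pool = random.sample(corpus, min(n, len(corpus)))
    chosen, used = [], 0
    for para in pool:
        cost = len(para.split())
        if used + cost > token_budget:
            break
        chosen.append(para)
        used += cost
    return chosen
```

The resulting fragment list is what gets packed into the LLM's context together with the writing prompt.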

Key Designs

  1. Draft Generation:

    • Function: Have LLMs select and combine random human text fragments into a preliminary narrative
    • Mechanism: Provide the LLM with a writing prompt and 1500 randomly sampled paragraph-level human text fragments, requiring it to generate a ~500-word story with 90% content verbatim-copied from given fragments, allowing only minimal connective words and transitional phrases
    • Design Motivation: Leverage LLMs' ability to implicitly search combinatorial spaces, finding composable fragments from unrelated human text—a task nearly infeasible for humans
  2. Iterative Polishing:

    • Function: Fix contradictions, continuity errors, and grammatical issues in the draft
    • Mechanism: Use the same LLM to identify and apply minimal edits (similar to self-correction), maintaining the verbatim copy constraint and writing prompt while improving coherence, with up to 3 iterations
    • Design Motivation: Drafts may contain character contradictions, factual conflicts, and irrelevant content; the editing stage can improve coherence from 68% to 81%
  3. Copy Rate Control and Detection Feedback:

    • Function: Ensure final output meets the preset human text copy ratio
    • Mechanism: Optionally use ROUGE-L or AI text detectors themselves as feedback signals; if the copy rate is insufficient or the text is detected as AI-generated, prompt the LLM for revision
    • Design Motivation: Balance between copy rate and coherence while maximizing detection evasion probability
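The three stages above can be sketched as a single loop. Everything here is an assumption for illustration: `llm` is a hypothetical callable returning text (the real pipeline is prompt-based and model-specific), and the n-gram overlap is a crude proxy for the ROUGE-L copy-rate feedback.

```python
def copy_rate(story, fragments, n=5):
    """Fraction of story tokens covered by n-grams that appear verbatim in
    some fragment -- a rough stand-in for the ROUGE-L feedback signal."""
    toks = story.split()
    frag_ngrams = set()
    for f in fragments:
        ft = f.split()
        frag_ngrams.update(tuple(ft[i:i + n]) for i in range(len(ft) - n + 1))
    covered = [False] * len(toks)
    for i in range(len(toks) - n + 1):
        if tuple(toks[i:i + n]) in frag_ngrams:
            covered[i:i + n] = [True] * n
    return sum(covered) / max(len(toks), 1)

def generate_frankentext(prompt, fragments, llm, target=0.9, max_edits=3):
    """Draft under the verbatim-copy constraint, then iteratively polish."""
    draft = llm(
        f"Write a ~500-word story for the prompt: {prompt}\n"
        f"Copy at least {target:.0%} of the text verbatim from the fragments "
        "below, adding only minimal connectives:\n" + "\n\n".join(fragments)
    )
    for _ in range(max_edits):
        rate = copy_rate(draft, fragments)
        feedback = "" if rate >= target else (
            f" Raise the verbatim-copy rate (currently {rate:.0%}).")
        draft = llm(
            "Fix contradictions and continuity errors with minimal edits, "
            f"keeping the copy constraint.{feedback}\n" + draft
        )
    return draft
```

Note that the polishing call runs every iteration (coherence editing is unconditional in the paper), while the copy-rate feedback is only injected when the constraint is violated.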

Key Experimental Results

Main Results

| Model | Copy Rate | Relevance % | Coherence % | Pangram AI % | Binoculars % | FastDetectGPT % |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Pro (vanilla) | - | 100 | 100 | 100 | 52 | 99 |
| Gemini-2.5-Pro (FT) | 75% | 100 | 81 | 4 | 0 | 1 |
| o4-mini (FT) | 68% | 56 | 29 | 24 | 0 | 1 |
| Claude-3.7-Sonnet (FT) | 64% | 95 | 73 | 17 | 2 | 16 |
| DeepSeek-R1 (FT) | 42% | 91 | 72 | 74 | 0 | 0 |
| Qwen-3-32B thinking (FT) | 36% | 91 | 54 | 85 | 52 | 92 |

Ablation Study

| Config | Key Metric | Note |
| --- | --- | --- |
| Without editing stage | Coherence 68% (vs 81% with editing) | Editing significantly improves coherence |
| Related paragraphs (FAISS retrieval) | Copy rate only 38%; detection rate drops | Topical relevance paradoxically reduces copy rate |
| Sentences instead of paragraphs | Coherence 89%; detection rate +14% | Shorter units improve coherence but are easier to detect |
| Maximum context (100K paragraphs) | Copy rate 76%; coherence 89% | Best performance but 10× cost |
| Single book (Monte Cristo) | Coherence 89%; relevance 97% | A single source also works effectively |
| Require 25% human text | Pangram AI detection 54% | More AI text → easier to detect |
| Require 90% human text | Pangram AI detection 4% | More human text → harder to detect |

Key Findings

  • Gemini-2.5-Pro is the best Frankentext generator, balancing coherence (81%), relevance (100%), and low detection rate (4%)
  • Binoculars and FastDetectGPT almost completely fail against Frankentext (detection rate <2%); Pangram performs better but still misses 59%
  • Human evaluators can identify 56% of Frankentext as AI-involved on average, outperforming automatic detectors
  • Copy rate and detection rate are inversely related: higher human text ratio makes detection harder but decreases coherence
  • AI keyword frequency (e.g., "Elara") drops sharply from 686 in vanilla to 10 in Frankentext

Highlights & Insights

  • Gray area discovery: Frankentext breaks the "AI vs. human" binary assumption, revealing a mixed-authorship space that detectors struggle with
  • Cost-effectiveness: Each Frankentext costs only $1.32 (Gemini), much less than the CoAuthor human-AI dataset at $2.50/piece, with no complex setup required
  • Token-level annotation: Each Frankentext comes with copy-vs-generated token-level labels, directly usable for training mixed-authorship detection models
  • Unique human feel: Evaluators praised Frankentext for its distinctive "human quality"—imaginative premises, vivid descriptions, and dry humor—precisely because most content genuinely comes from human writing
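The token-level copy-vs-generated labels mentioned above can be approximated by aligning a story against its source fragments. A minimal sketch follows; the released dataset ships exact labels, and this greedy maximal-run matcher is only an illustration with an assumed `min_run` threshold.

```python
def label_tokens(story, fragments, min_run=5):
    """Tag each story token 'copy' if it lies in a verbatim run of at least
    `min_run` tokens shared with some fragment, else 'gen'."""
    toks = story.split()
    frag_toks = [f.split() for f in fragments]
    labels = ["gen"] * len(toks)
    i = 0
    while i < len(toks):
        best = 0
        for ft in frag_toks:
            for j in range(len(ft)):
                k = 0
                while (i + k < len(toks) and j + k < len(ft)
                       and toks[i + k] == ft[j + k]):
                    k += 1
                best = max(best, k)
        if best >= min_run:
            labels[i:i + best] = ["copy"] * best
            i += best
        else:
            i += 1
    return labels
```

Labels in this form can be fed directly to a token-classification model for mixed-authorship detection.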

Limitations & Future Work

  • Depends on large-scale, high-quality, same-domain human text corpora; the method is hard to apply directly to low-resource languages and specialized domains (e.g., technical manuals)
  • The copy rate metric may underestimate actual human text proportion
  • This paper only exposes the attack surface without proposing specific defense mechanisms
  • Frankentext quality in non-fiction domains (e.g., news) still has room for improvement, as generated text tends toward a narrative style
  • Books3 contains copyrighted works, raising creative attribution and copyright concerns

Comparison with Related Work

  • vs Binoculars/FastDetectGPT: These perplexity-based detectors almost completely fail against Frankentext, demonstrating that surface statistical features are insufficient for mixed-authorship text
  • vs Pangram: As a trained classifier, Pangram can partially detect mixed text (37% labeled as mixed), but still misses 59% of Gemini Frankentext
  • vs CoAuthor: Frankentext provides a cheaper, scalable way to generate mixed-authorship data, covering word-level and sentence-level granularities
  • vs Paraphrasing attacks: Unlike rewriting original text to evade detection, Frankentext directly uses raw human text—an entirely new attack vector

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Proposes an entirely new text generation paradigm, positioning LLMs as "composers" rather than "authors"—a highly novel perspective
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 model families, 3 detectors, human evaluation, multiple ablation experiments with extremely broad coverage
  • Writing Quality: ⭐⭐⭐⭐ Clear paper structure with vivid Frankenstein analogy, though some sections could be more refined
  • Value: ⭐⭐⭐⭐⭐ Critically important warning for the AI text detection field, pushing the shift from binary classification to fine-grained detection