Learning to Rewrite: Generalized LLM-Generated Text Detection¶
Conference: ACL 2025
arXiv: 2408.04237
Code: None
Area: AIGC Detection
Keywords: AI Text Detection, Rewrite Distance, Out-of-Distribution Generalization, Adversarial Robustness, LoRA Fine-tuning
TL;DR¶
Learning2Rewrite (L2R) fine-tunes an LLM-based rewriting model to amplify the difference in rewrite edit distance between human-written and AI-generated text, yielding an AI text detector that generalizes across domains. L2R achieves an average AUROC of 0.9009 across 21 independent domains, and in out-of-distribution tests it outperforms RAIDAR by 4.67% and direct classification fine-tuning by 51.35%.
Background & Motivation¶
Existing LLM-generated text detection methods face the core challenge of generalization. Classifier-based training methods perform well within the training domain but overfit severely on out-of-distribution (OOD) data. Zero-shot methods based on statistical metrics (e.g., DetectGPT, Fast-DetectGPT) rely on specific statistical characteristics, which are easily corrupted by simple adversarial attacks and remain unstable across domains.
RAIDAR observes that LLMs make fewer modifications when rewriting AI-generated text than when rewriting human-written text, and it uses the rewrite edit distance as a detection signal. Its limitation is that the decision threshold varies significantly across domains: the optimal thresholds for legal documents and product reviews are completely different, making it difficult for a single classifier to generalize.
The core insight of L2R is: since the default rewriting LLM does not produce a sufficiently large and stable difference when processing AI and human-written texts, training the rewriting model itself to maximize this difference can yield a domain-stable detection signal.
Method¶
Overall Architecture¶
L2R consists of three stages:
1. Rewrite the input text using an LLM.
2. Compute the normalized Levenshtein similarity between the original text and the rewritten result.
3. Classify the text as AI-generated or human-written by comparing the similarity against a threshold.

The key innovation is fine-tuning the rewriting model itself, rather than rewriting with an off-the-shelf pre-trained model.
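The three stages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes character-level edit distance, and `rewriter` is a hypothetical stand-in for the call to the rewriting LLM, and `threshold` is a value the caller has to calibrate.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rewrite_similarity(original: str, rewritten: str) -> float:
    """1 - Levenshtein / max(len); 1.0 means the rewrite changed nothing."""
    longest = max(len(original), len(rewritten))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(original, rewritten) / longest


def detect(text: str, rewriter: Callable[[str], str], threshold: float) -> bool:
    """Flag the text as AI-generated when the rewrite stays close to the input."""
    return rewrite_similarity(text, rewriter(text)) >= threshold
```

For example, `detect(text, my_llm_rewrite, threshold=0.8)` would return `True` when the (hypothetical) `my_llm_rewrite` function leaves at least 80% of the characters untouched. The threshold value here is arbitrary and only for illustration.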
Key Designs¶
- Rewrite Edit Distance as a Detection Signal: For input text \(x\), an LLM \(F\) rewrites it under prompt \(p\), and the Levenshtein similarity is computed: \(D(x, F(p,x)) = 1 - \frac{\mathrm{Levenshtein}(F(p,x), x)}{\max(\mathrm{len}(F(p,x)), \mathrm{len}(x))}\). AI-generated text tends to change less after rewriting (high similarity), while human-written text changes more (low similarity).
- Fine-tuning Objective Function Design: The rewriting model \(F'\) is trained to rewrite human-written text \(x_h\) as much as possible and AI text \(x_{ai}\) as little as possible. Because \(D\) is a similarity (higher means fewer edits), this corresponds to maximizing the similarity gap: \(\max\{D(x_{ai}, F'(p, x_{ai})) - D(x_h, F'(p, x_h))\}\). Since edit distance is non-differentiable, cross-entropy loss is used as a proxy. For AI text, the loss is minimized as usual to "encourage keeping it as is", while for human-written text the loss is negated so that the gradient direction "encourages more rewriting".
- Calibration Loss: Unconstrained maximization of rewriting on human text tends to degrade the model (verbose output, overfitting). A threshold \(t\) is therefore introduced: for human-written text, gradients are backpropagated only when the loss \(L < t\); for AI text, only when \(L > t\). The model thus optimizes only "hard samples", those not yet correctly classified by the current threshold, similar to the preference learning concept in DPO. The threshold \(t\) is determined before fine-tuning by running a forward pass over the training set and fitting a logistic regression model. Calibration loss improves the average AUROC by 4.54 percentage points (0.8555 \(\rightarrow\) 0.9009).
- Diverse Prompt Dataset: 200 different rewriting prompts (ranging from formal to casual, simple to complex) were constructed and are randomly sampled during training, which allows the model to capture more realistic AI text distributions. Compared to a single prompt, the AUROC on Gemini rewriting improved from 0.7302 to 0.7566.
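The gating rule of the calibration loss can be illustrated with plain floats. This is a hedged sketch rather than the paper's training code: `calibrated_loss_terms` is a hypothetical helper, the per-sample cross-entropy values are invented, and in an actual training loop the same masking would be applied to tensors before backpropagation.

```python
from typing import List, Tuple


def calibrated_loss_terms(
    samples: List[Tuple[float, bool]], threshold: float
) -> List[float]:
    """Map (cross_entropy, is_human) pairs to the signed term each sample adds.

    A term of 0.0 means the sample already sits on the correct side of the
    threshold and is skipped, so no gradient would flow from it.
    """
    terms = []
    for loss, is_human in samples:
        if is_human and loss < threshold:
            terms.append(-loss)  # minimizing -loss raises the loss: rewrite more
        elif not is_human and loss > threshold:
            terms.append(loss)   # minimizing loss lowers it: keep the text as is
        else:
            terms.append(0.0)    # easy sample, masked out
    return terms


# Invented losses: two human samples, then two AI samples.
batch = [(0.4, True), (1.6, True), (1.3, False), (0.2, False)]
terms = calibrated_loss_terms(batch, threshold=1.0)
print(terms)  # [-0.4, 0.0, 1.3, 0.0]
```

Only the first human sample (loss below the threshold) and the first AI sample (loss above it) contribute. The other two are already separated by the threshold and are left alone, which is what keeps the rewriter from drifting toward degenerate outputs.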
Dataset Construction¶
- Collected human-written text from 21 independent domains (academia, law, sports, food, religion, etc.), with 200 paragraphs per domain.
- Generated corresponding AI texts using four models: GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, and Llama-3-70B.
- Totaling 21,000 samples.
- Strictly ensured that all human-written texts were collected before the release of ChatGPT (November 30, 2022).
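As a consistency check on the totals above, assuming (this is not stated explicitly here) that every human paragraph has one counterpart from each of the four generators:

\[
21 \times 200 = 4{,}200 \ \text{(human)}, \qquad 4 \times 4{,}200 = 16{,}800 \ \text{(AI)}, \qquad 4{,}200 + 16{,}800 = 21{,}000 .
\]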
Key Experimental Results¶
Main Results (21-Domain Average AUROC)¶
| Detector | Average AUROC | Standard Deviation |
|---|---|---|
| Fast-DetectGPT | 0.6705 | 0.1015 |
| Ghostbusters | 0.7053 | 0.1259 |
| RAIDAR (Gemini Rewrite) | 0.7566 | 0.0928 |
| RAIDAR (Llama Rewrite) | 0.7970 | 0.1212 |
| Llama L2R | 0.9009 | 0.0634 |
L2R outperforms Fast-DetectGPT in 20 of the 21 domains, with an average AUROC gain of 23.04 percentage points, and outperforms Ghostbusters in 20 domains, with an average gain of 19.56 percentage points. It also has the lowest standard deviation (0.0634), indicating the best cross-domain stability.
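For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen AI text receives a higher detection score than a randomly chosen human text. The sketch below computes it by direct pairwise comparison. The scores are toy values invented for illustration, not results from the paper.

```python
from typing import Sequence


def auroc(ai_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Fraction of (AI, human) pairs where the AI text scores higher; ties count half."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


# Toy rewrite-similarity scores (higher = less rewriting = more AI-like).
ai = [0.91, 0.85, 0.78, 0.88]
human = [0.70, 0.80, 0.62, 0.74]
print(auroc(ai, human))  # 0.9375
```

Here 15 of the 16 pairs are ordered correctly, giving 0.9375. A value of 0.5 corresponds to chance, and a value well below 0.5 means the detector ranks the two classes in the wrong order.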
Out-of-Distribution Generalization (M4 Dataset)¶
| Model | In-Distribution | Out-of-Distribution |
|---|---|---|
| Ghostbusters | 0.7053 | 0.3888 |
| Llama Logits (Direct Classification) | 0.9774 | 0.1426 |
| RAIDAR (Llama) | 0.7970 | 0.6931 |
| Llama L2R (Low-Parameter) | 0.8315 | 0.7398 |
Directly fine-tuning Llama for classification yields the highest ID performance (0.9774), but its OOD performance plummets to 0.1426, indicating severe overfitting. The low-parameter version of L2R achieves an OOD AUROC of 0.7398, which is 4.67% higher than RAIDAR and 51.35% higher than direct classification.
Adversarial Attacks¶
| Model | No Attack | Decoherence Attack | Rewrite Attack |
|---|---|---|---|
| Fast-DetectGPT | 0.6705 | 0.4984 | 0.5100 |
| Llama Logits | 0.9774 | 0.7281 | 0.6563 |
| RAIDAR (Llama) | 0.7970 | 0.7681 | 0.7944 |
| Llama L2R | 0.9009 | 0.8746 | 0.8927 |
L2R maintains the highest AUROC under both attacks, dropping by only 2.63 percentage points under the Decoherence attack and by only 0.82 percentage points under the Rewrite attack.
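The drops quoted above are absolute differences in AUROC. A short sketch that recomputes them from the table values (the dictionary layout and the helper name `drop_in_points` are illustrative choices, not from the paper):

```python
# AUROC values copied from the adversarial-attack table above.
AUROC = {
    "Fast-DetectGPT": {"none": 0.6705, "decoherence": 0.4984, "rewrite": 0.5100},
    "Llama Logits": {"none": 0.9774, "decoherence": 0.7281, "rewrite": 0.6563},
    "RAIDAR (Llama)": {"none": 0.7970, "decoherence": 0.7681, "rewrite": 0.7944},
    "Llama L2R": {"none": 0.9009, "decoherence": 0.8746, "rewrite": 0.8927},
}


def drop_in_points(before: float, after: float) -> float:
    """Absolute AUROC drop, expressed in percentage points."""
    return round((before - after) * 100, 2)


for name, scores in AUROC.items():
    dec = drop_in_points(scores["none"], scores["decoherence"])
    rew = drop_in_points(scores["none"], scores["rewrite"])
    print(f"{name}: decoherence -{dec}, rewrite -{rew}")
```

For the `Llama L2R` row this gives 2.63 and 0.82, matching the figures in the text.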
Ablation Study¶
| Configuration | Key Metrics | Description |
|---|---|---|
| With Calibration Loss | AUROC 0.9009 | Full L2R |
| Without Calibration Loss | AUROC 0.8555 | Drops by 4.54 percentage points; risk of model degradation |
| Trained on MAGE Dataset vs. Custom Dataset | OOD AUROC gap of 15.98% | Validates the value of diverse datasets |
| Parameter count increased from 1M to 7M | ID \(\uparrow\), OOD \(\downarrow\) | More parameters lead to more severe overfitting |
Key Findings¶
- Training objective determines generalization: With the same parameter size for Llama fine-tuning, L2R's rewriting objective outperforms direct classification by 51.35% on OOD.
- Overfitting is positively correlated with parameter size: As LoRA parameters increase, ID performance improves but OOD performance declines.
- Edit distance is a domain-agnostic feature: After fine-tuning, the average rewrite similarity is 0.6981 for human-written text and 0.8606 for AI text. This gap is sufficiently large and stable across domains.
- Calibration Loss acts like DPO: It optimizes only hard samples to prevent model degradation.
Highlights & Insights¶
- Elegant design of training objective: Instead of training a classifier, it trains a rewriter—the indirectly obtained detection signal is naturally more generalizable.
- Consistent with theory and intuition: LLMs have "little to rewrite" for their own generated texts but "much to rewrite" for human-written texts. Fine-tuning amplifies this natural tendency.
- Cleverly designed Calibration Loss: A hard-sample mining strategy similar to DPO, with the threshold automatically determined.
- Dataset design worth learning from: 21 domains ensure the training distribution is sufficiently diverse, and 200 prompts avoid prompt bias.
- Interpretability: The rewritten differences naturally highlight "what AI likely wrote".
Limitations & Future Work¶
- Large inference overhead: Each detection requires a full rewriting step by the LLM (~13.5 seconds/120 words), making it unsuitable for large-scale deployment.
- The rewriting model itself could be exploited by attackers to generate text that is harder to detect.
- Llama-3-8B as a rewriting model might not be strong enough for certain domains; using larger models could improve quality but would cost more.
- Performance on Chinese or other non-English languages has not been tested.
- The choice of domain for fine-tuning data impacts performance; how to optimally select training domains is worth studying.
- The specific phrasing of rewriting prompts might affect detection performance.
Related Work & Insights¶
- Directly improves on RAIDAR, addressing its core issue of cross-domain threshold instability.
- Complementary to statistical-feature-based methods like Fast-DetectGPT—L2R is more robust but slower.
- The design of Calibration Loss can be transferred to other tasks requiring "distribution separation".
- Insight: In AI detection, "forcing the model to expose itself" might be more promising than "training a classifier".
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Training a rewriting model instead of a classifier for detection is a unique idea with clear intuition.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 21 domains, 4 LLMs, ID/OOD/adversarial attacks, ablation studies, and parameter sensitivity analysis—extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation, precise methodology, intuitive figures, and well-structured.
- Value: ⭐⭐⭐⭐ Breakthrough in generalization is significant, but inference efficiency remains the main bottleneck for practical deployment.