Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training

Conference: ACL 2025
arXiv: 2502.12734
Code: GitHub
Area: AIGC Detection
Keywords: machine-generated text detection, adversarial training, robustness, text perturbation, adversarial attacks

TL;DR

This paper proposes GREATER, an adversarial training framework for machine-generated text (MGT) detection that trains an adversarial attacker (Greater-A) and an MGT detector (Greater-D) simultaneously. The attacker identifies critical tokens through surrogate model gradients and perturbs them in the embedding space to generate adversarial samples. The detector learns a generalized defense from these curriculum-style adversarial samples. Across 16 attacks, Greater-D lowers the attack success rate (ASR) to 5.53% (versus 6.20% for the prior SOTA defense), while Greater-A needs roughly 4× fewer queries than SOTA attacks.

Background & Motivation

Background: MGT detectors (e.g., DetectGPT and watermarking-based methods) perform well under normal conditions but suffer a 30-50% decline in accuracy under simple perturbations (editing, paraphrasing, prompt modifications).

Limitations of Prior Work: (a) Existing defense methods (such as Text-RS, CERT-ED) fail to generalize to unseen attacks; (b) Adversarial attack methods either require white-box access or demand excessively high query volumes; (c) Adversarial training in the text domain lacks generalization.

Key Challenge: Constructing adversarial samples efficiently under a black-box setting for use in adversarial training, and ensuring that the trained defense generalizes across different attacks.

Goal: Build a general adversarial training framework for MGT detection that simultaneously improves attack efficiency and defense generalization.

Key Insight: "Attack-to-defend": update the attacker and detector synchronously so that the detector learns its defense from progressively stronger attacks.

Core Idea: Train the attacker and detector adversarially and synchronously, and use a greedy search guided by embedding-space perturbations to generate adversarial samples efficiently, achieving a defense that generalizes across attacks.

Method

Overall Architecture

Greater-A (Attacker): surrogate model extracts token embeddings \(\rightarrow\) scoring network identifies key tokens \(\rightarrow\) embedding space gradient ascent perturbation \(\rightarrow\) greedy search + pruning generates substitution words \(\rightarrow\) outputs adversarial samples. Greater-D (Detector): trained on a mixture of benign samples and adversarial samples, updated synchronously with Greater-A.

Key Designs

Important Token Identification + Embedding Perturbation:
- Function: Train a scoring network \(\mathcal{F}_\theta\) using the hidden states of the surrogate model to identify the top-\(k\) most important tokens, and perform gradient ascent perturbation in the embedding space.
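- Mechanism: The perturbed embedding is \(\tilde{e}_t = e_t + \mathbf{1}_{[t \in \mathbf{I}]} \delta_t\), where \(\mathbf{I}\) is the set of top-\(k\) important token positions and \(\delta_t\) is optimized via gradient ascent to maximize the detector loss, so only the selected tokens are perturbed.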
- Design Motivation: Perturbation in the embedding space provides more precise guidance for candidate word generation compared to discrete token replacement.
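The masked update above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the surrogate detector is a toy linear model on mean-pooled embeddings, the importance scores are given rather than produced by \(\mathcal{F}_\theta\), and the clipping bound `eps` is a hypothetical budget on \(\delta_t\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def top_k_indices(scores, k):
    """Indices of the k highest importance scores (the set I)."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(order[:k])

def detector_loss(embeddings, w, label):
    """Cross-entropy of a toy linear detector on mean-pooled embeddings."""
    n = len(embeddings)
    pooled = [sum(e[d] for e in embeddings) / n for d in range(len(w))]
    p = sigmoid(sum(wd * xd for wd, xd in zip(w, pooled)))
    loss = -math.log(p) if label == 1 else -math.log(1.0 - p)
    return loss, p

def perturb(embeddings, w, label, important, steps=10, lr=0.5, eps=1.0):
    """Gradient ascent on the detector loss, applied only to tokens in `important`."""
    n = len(embeddings)
    deltas = [[0.0] * len(w) for _ in embeddings]
    for _ in range(steps):
        current = [[a + b for a, b in zip(e, d)] for e, d in zip(embeddings, deltas)]
        _, p = detector_loss(current, w, label)
        # Analytic gradient of the loss w.r.t. each token embedding: (p - y) * w / n
        grad = [(p - label) * wd / n for wd in w]
        for t in important:  # indicator 1[t in I]: other tokens keep delta_t = 0
            deltas[t] = [max(-eps, min(eps, d + lr * g)) for d, g in zip(deltas[t], grad)]
    return [[a + b for a, b in zip(e, d)] for e, d in zip(embeddings, deltas)]

# Toy example: three tokens with 2-dimensional embeddings, label 1 = machine-generated.
embeddings = [[1.0, 0.5], [0.2, 0.8], [0.9, 0.1]]
scores = [0.9, 0.1, 0.7]          # hypothetical importance scores
w = [2.0, 1.0]                    # hypothetical surrogate detector weights
important = top_k_indices(scores, k=2)
perturbed = perturb(embeddings, w, label=1, important=important)
loss_before, _ = detector_loss(embeddings, w, 1)
loss_after, _ = detector_loss(perturbed, w, 1)
```

In this toy setting the loss on the perturbed embeddings is higher than on the originals, while the token outside \(\mathbf{I}\) is left unchanged.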
Greedy Search + Pruning:
- Function: Map the perturbed embeddings back to the vocabulary to find candidate words, and greedily select substitutions that maximize the change in detector predictions.
- Mechanism: Attempt substitutions sequentially based on importance ranking, and prune candidate words that do not alter the prediction to minimize query count.
- Design Motivation: Black-box settings demand high query efficiency; greedy search + pruning requires 4× fewer queries than brute-force search.
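A small sketch of the search step follows, under stated assumptions: candidates come from a nearest-neighbor lookup around each perturbed embedding, the black-box detector is a hypothetical word-list scorer, and "pruning" is read as never re-querying a word that failed to lower the detector score. The paper's exact candidate mapping and pruning rule may differ.

```python
import math

def nearest_words(perturbed, vocab, exclude, k=2):
    """The k vocabulary words whose embeddings lie closest to `perturbed`."""
    ranked = sorted((w for w in vocab if w != exclude),
                    key=lambda w: math.dist(vocab[w], perturbed))
    return ranked[:k]

def toy_detector(tokens):
    """Hypothetical black-box detector: fraction of 'machine-like' tokens."""
    machine_like = {"moreover", "utilize", "furthermore"}
    return sum(t in machine_like for t in tokens) / len(tokens)

def greedy_attack(tokens, ranked_positions, candidates, detector, threshold):
    """Substitute tokens in importance order until the score falls below threshold."""
    tokens = list(tokens)
    best_score = detector(tokens)
    queries = 1
    pruned = set()  # words that did not help; skipped at later positions
    for pos in ranked_positions:
        if best_score < threshold:
            break  # prediction already flipped
        local_score, local_word = best_score, None
        for word in candidates.get(pos, []):
            if word in pruned:
                continue
            trial = tokens[:pos] + [word] + tokens[pos + 1:]
            score = detector(trial)
            queries += 1
            if score < local_score:
                local_score, local_word = score, word
            elif score >= best_score:
                pruned.add(word)
        if local_word is not None:
            tokens[pos] = local_word
            best_score = local_score
    return tokens, best_score, queries

# Hypothetical 2-D vocabulary embeddings and perturbed embeddings for positions 2 and 0.
vocab = {"moreover": [2.0, 0.0], "utilize": [0.0, 2.0], "furthermore": [0.5, 0.5],
         "use": [0.0, 0.9], "also": [0.9, 0.0], "we": [5.0, 5.0], "tools": [-5.0, -5.0]}
text = ["moreover", "we", "utilize", "tools"]
perturbed = {2: [0.3, 0.6], 0: [0.6, 0.3]}
candidates = {pos: nearest_words(vec, vocab, exclude=text[pos]) for pos, vec in perturbed.items()}
adv_tokens, adv_score, n_queries = greedy_attack(text, [2, 0], candidates, toy_detector, threshold=0.25)
```

Here "furthermore" is tried once at position 2, leaves the score unchanged, and is pruned, so it costs no query at position 0. That is the intended saving in this reading of the design.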
Synchronous Adversarial Training:
- Function: Simultaneously update Greater-A and Greater-D within the same training step.
- Mechanism: As training progresses, Greater-A grows progressively stronger, so Greater-D learns from curriculum-style adversarial samples and generalizes to unseen attacks.
- Design Motivation: Unlike two-stage methods that separate attack and defense phases, synchronous updates prevent the defense from overfitting to specific attacks.
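The synchronous update can be sketched at toy scale. The following is only an illustration of the "same training step" idea, with assumptions throughout: samples are scalars, the detector is a 1-D logistic regression, and the attacker is a single hypothetical `shift` parameter bounded by `budget` that moves machine samples toward the human side. It is a stand-in for Greater-A and Greater-D, not their actual architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_synchronously(human, machine, steps=200, lr_d=0.5, lr_a=0.1, budget=0.8):
    w, b = 0.0, 0.0   # detector parameters (label 1 = machine)
    shift = 0.0       # attacker parameter
    for _ in range(steps):
        adversarial = [x - shift for x in machine]

        # Attacker gradient: ascend the detector loss on adversarial samples.
        # For label 1, loss = -log sigmoid(w * (x - shift) + b),
        # so d loss / d shift = (1 - p) * w.
        grad_shift = sum((1.0 - sigmoid(w * x + b)) * w for x in adversarial) / len(adversarial)

        # Detector gradient: descend the loss on benign + adversarial samples.
        batch = ([(x, 0) for x in human] + [(x, 1) for x in machine]
                 + [(x, 1) for x in adversarial])
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in batch) / len(batch)
        grad_b = sum(sigmoid(w * x + b) - y for x, y in batch) / len(batch)

        # Both players are updated within the same step.
        shift = max(-budget, min(budget, shift + lr_a * grad_shift))
        w -= lr_d * grad_w
        b -= lr_d * grad_b
    return w, b, shift

human = [-1.2, -1.0, -0.8]
machine = [0.8, 1.0, 1.2]
w, b, shift = train_synchronously(human, machine)
```

Because the attacker starts weak and strengthens as the detector's weight grows, the detector sees progressively harder samples, which is the curriculum effect described above in miniature.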

Key Experimental Results

Defense Effectiveness (ASR%, lower is better)

Method	Average of 10 Perturbations	Average of 6 Adversarial Attacks	Total Average
Undefended	~35%	~85%	~55%
Text-RS	-	-	6.20%
TAVAT	-	-	~8%
Greater-D	-	-	5.53%

Attack Effectiveness (Greater-A vs SOTA Attacks)

Method	ASR%	Average Query Count
SOTA Attacks (TextFooler, etc.)	88.13%	~400+
Greater-A	96.58%	~100 (4× fewer)
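For reference, the two columns above can be computed from per-sample attack logs roughly as follows. This sketch assumes a common convention that this summary does not spell out: ASR is measured only over samples the detector classified correctly before the attack. The record field names are hypothetical.

```python
def attack_metrics(records):
    """Return (ASR in %, average query count) over initially detected samples.

    Each record is a dict with hypothetical keys:
      'detected_before': detector was correct on the original text
      'detected_after':  detector is still correct on the attacked text
      'queries':         number of detector queries the attack used
    """
    attackable = [r for r in records if r["detected_before"]]
    if not attackable:
        return 0.0, 0.0
    successes = sum(not r["detected_after"] for r in attackable)
    asr = 100.0 * successes / len(attackable)
    avg_queries = sum(r["queries"] for r in attackable) / len(attackable)
    return asr, avg_queries

logs = [
    {"detected_before": True, "detected_after": False, "queries": 100},
    {"detected_before": True, "detected_after": False, "queries": 120},
    {"detected_before": True, "detected_after": True, "queries": 80},
    {"detected_before": True, "detected_after": False, "queries": 100},
    {"detected_before": False, "detected_after": False, "queries": 0},  # skipped
]
asr, avg_queries = attack_metrics(logs)
```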

Key Findings

Greater-D comprehensively outperforms 10 existing defense methods under 16 attacks (10 perturbations + 6 adversarial attacks).
Greater-A is also the strongest attack method: it achieves an ASR of 96.58% (+8.45 percentage points over the 88.13% of SOTA attacks) with 4× fewer queries.
Synchronous updating is the key to generalized defense: defenses trained asynchronously only perform well against attacks encountered during training.
Embedding space perturbation generates more natural adversarial samples than discrete token replacement.

Highlights & Insights

The concept of "attack-to-defend" is elegant and natural — a stronger attacker breeds a stronger defender.
Through a surrogate model, the attacker identifies key tokens under black-box settings with effectiveness matching white-box methods.
Both the attack and defense can be used as independent modules, offering high practicality.

Limitations & Future Work

Discrepancies between the surrogate model and the target detector might limit the accuracy of key token identification.
Evaluated only on English text.
Robustness against cross-lingual attacks (e.g., translation attacks) has not been tested.
Adversarial training increases computational costs.

vs RADAR: RADAR employs a paraphraser as an attacker and can only defend against known attacks; GREATER generalizes to unseen attacks through synchronous training.
vs OUTFOX: OUTFOX relies on in-context learning (ICL) demonstrations of adversarial samples, whereas GREATER does not require external examples.
vs TextFooler/BERT-Attack: Greater-A is more efficient (4× fewer queries) and more effective (+8.45 percentage points ASR) as an attack method.

Rating

Novelty: ⭐⭐⭐⭐ The combination of synchronous adversarial training and embedding perturbation is novel in the field of MGT detection.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 16 attacks, 10 defense baselines, and dual assessment of both attack and defense capabilities.
Writing Quality: ⭐⭐⭐⭐ Rigorous threat modeling and clear methodological descriptions.
Value: ⭐⭐⭐⭐⭐ Both the attack and defense components achieve SOTA performance, directly contributing to research on the robustness of MGT detection.