MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization¶
Conference: ACL 2026 arXiv: 2601.08564 Code: https://github.com/githigher/MASH Area: AI-Generated Text Detection / Adversarial Attack Keywords: AI-generated text detection, black-box adversarial attack, style transfer, text humanization, DPO alignment
TL;DR¶
This paper proposes MASH (Multi-stage Style Humanization), a three-stage pipeline of style-injection SFT → DPO alignment → inference-time refinement. It trains a rewriter with only ~0.14B parameters (BART-base) to evade AI-generated text detectors in a fully black-box setting, reaching an average attack success rate of 92% while maintaining high linguistic quality.
Background & Motivation¶
Background: The misuse of AI-generated text (AIGT) has driven the development of numerous detection methods, including training-based detectors (e.g., fine-tuned RoBERTa classifiers) and zero-shot detectors (e.g., Binoculars, Fast-DetectGPT). These detectors have achieved high accuracy on standard benchmarks, and several commercial APIs (Writer, Scribbr) have been widely deployed.
Limitations of Prior Work: Existing adversarial evasion strategies face significant practical barriers: (1) perturbation-based methods (e.g., TextFooler, Charmer) achieve limited attack success rates in black-box settings; (2) prompt-based methods (e.g., PromptAttack) rely on the consistency of model instruction-following and produce unstable results; (3) rewriting-based methods (e.g., DIPPER, DPO-Evader) typically require white-box access to the source generator or internal information of the target detector.
Key Challenge: Detectors identify AI-generated text by capturing distributional differences in semantic and statistical features between AI and human text. Effectively evading detection requires fundamentally shifting the style distribution of AI text toward the human text distribution—while preserving semantics—which constitutes a "machine style → human style" transfer task. However, obtaining parallel training data in the direction of "AI text → corresponding human text" is extremely difficult.
Goal: To design a purely black-box style transfer framework that humanizes text generated by arbitrary LLMs to evade detection, without accessing the internal information of the source generator or the target detector.
Key Insight: Inverse data construction—although parallel data in the "AI → human" direction is difficult to obtain, the reverse direction "human → AI" is straightforward (i.e., rewriting human text with an LLM). This asymmetry is exploited to construct parallel corpora, which are then used in style-injection SFT to learn human style patterns.
Core Idea: Detection evasion is reformulated as a specialized "machine style → human style" text style transfer task, implemented via a three-stage pipeline: SFT to learn human style, DPO to learn to cross the detection boundary, and inference-time refinement to ensure output quality.
Method¶
Overall Architecture¶
MASH comprises a data-preparation step followed by the three pipeline stages: (1) Inverse Data Construction—starting from open-source human text, an LLM is used to generate semantically equivalent AI-style text, forming parallel corpora; (2) Style-Injection SFT—a BART-based encoder-decoder architecture with learnable style embeddings for controllable style transfer initialization; (3) DPO Alignment—detector confidence scores serve as implicit reward signals to directly optimize the rewriter toward crossing the detection boundary; (4) Inference-Time Adversarial Refinement—sentences are ranked by perplexity and refined one by one to improve linguistic quality while preserving detection evasion.
Key Designs¶
- Inverse Data Construction:
- Function: Addresses the scarcity of "AI → human" parallel data.
- Mechanism: Original texts are collected from open-source datasets, and those the detector scores as confidently human (\(D(\mathbf{x}_{human}) < \tau\), where \(D\) outputs AI-generated confidence) are selected as human samples \(\mathbf{x}_{human}\). Each human text is then rewritten by an LLM into a semantically equivalent AI-style text \(\mathbf{x}_{ai}\), and only high-confidence AI samples (\(D(\mathbf{x}_{ai}) > \tau\)) are retained. This yields \(N\) parallel pairs \(\mathcal{D}_{pair} = \{(\mathbf{x}_{ai}^{(i)}, \mathbf{x}_{human}^{(i)})\}\).
- Design Motivation: The asymmetry that "human → AI conversion is easy" is exploited to construct high-quality training data at zero annotation cost.
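The filtering loop described above can be sketched as follows. Here `detector` and `rewrite_as_ai` are hypothetical stand-ins for the black-box AIGT detector (returning an AI-confidence score in [0, 1]) and the LLM rewriting call; the paper does not specify either at the API level.

```python
# Sketch of inverse data construction: keep confidently-human originals,
# rewrite them into AI style, keep only rewrites the detector flags as AI.
def build_parallel_pairs(human_texts, detector, rewrite_as_ai, tau=0.5):
    """Return (x_ai, x_human) pairs for training the AI -> human rewriter."""
    pairs = []
    for x_human in human_texts:
        if detector(x_human) >= tau:       # not confidently human: discard
            continue
        x_ai = rewrite_as_ai(x_human)      # easy direction: human -> AI
        if detector(x_ai) > tau:           # confidently AI: keep the pair
            pairs.append((x_ai, x_human))
    return pairs
```

Training then runs in the reverse direction: the rewriter conditions on `x_ai` and targets `x_human`, which is the direction parallel data cannot be collected for directly.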
- Style-Injection Supervised Fine-Tuning:
- Function: Initializes the rewriter with controllable human-style transfer capability.
- Mechanism: Built upon pretrained BART, two learnable style embeddings \(\mathbf{s}_{ai}, \mathbf{s}_{human} \in \mathbb{R}^d\) are introduced. The encoder produces content representations \(\mathbf{H}_{content}\), and a fusion layer injects the selected style via linear projection: \(\mathbf{H}_{fused}^{(t)} = \mathbf{W}_p \cdot [\mathbf{h}_{content}^{(t)}; \mathbf{s}_{style}] + \mathbf{b}_p\). Training uses a multi-task objective: a reconstruction loss \(\mathcal{L}_{recon}\) (injecting AI style to reconstruct the original AI text, preserving semantics) and a transfer loss \(\mathcal{L}_{trans}\) (injecting human style to generate human text, learning style patterns), with total loss \(\mathcal{L}_{SFT} = \lambda\mathcal{L}_{recon} + (1-\lambda)\mathcal{L}_{trans}\).
- Design Motivation: Direct fine-tuning risks overfitting and semantic loss. Style embeddings decouple content from style; the reconstruction task constrains semantic preservation while the transfer task drives style conversion, with joint training achieving a balanced trade-off.
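A pure-Python sketch of the fusion step \(\mathbf{H}_{fused}^{(t)} = \mathbf{W}_p \cdot [\mathbf{h}_{content}^{(t)}; \mathbf{s}_{style}] + \mathbf{b}_p\): concatenate each encoder state with the selected style embedding, then apply a shared linear projection per timestep. The list-based matrix multiply and shapes are illustrative, not the paper's BART implementation.

```python
# Style injection: H_fused[t] = W_p @ concat(h_content[t], s_style) + b_p
def fuse_style(h_content, s_style, W_p, b_p):
    """Inject one style embedding into every encoder timestep."""
    fused = []
    for h_t in h_content:
        z = list(h_t) + list(s_style)              # [h_t; s_style], len d + d_s
        fused.append([sum(w * x for w, x in zip(row, z)) + b
                      for row, b in zip(W_p, b_p)])  # one output row per W_p row
    return fused
```

At training time the same encoder output is fused twice, once with \(\mathbf{s}_{ai}\) (for \(\mathcal{L}_{recon}\)) and once with \(\mathbf{s}_{human}\) (for \(\mathcal{L}_{trans}\)), so only the injected embedding differs between the two tasks.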
- DPO Alignment + Inference-Time Refinement:
- Function: Advances the rewriter from "knowing what human style looks like" to "knowing how to cross the detection boundary," and ensures output quality at inference time.
- Mechanism: The reward is defined as the negatively scaled detector confidence \(r(x,y) = -C \cdot D(y)\), such that maximizing this reward drives the optimal policy toward the region where \(D(y) \to 0\). Hard negative mining is applied—SFT model outputs that fail to evade detection serve as rejected responses, and real human texts serve as chosen responses, forming DPO training pairs. At inference time, sentences are ranked by perplexity in descending order; each low-fluency sentence is refined by an LLM, and the replacement is accepted only if the detector still classifies the output as human-written.
- Design Motivation: SFT learns "what the target style looks like," while DPO learns "how far from the detection boundary and how to cross it"—the two stages are functionally complementary. Inference-time refinement repairs potential fluency issues without sacrificing attack effectiveness.
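The inference-time refinement described above amounts to a greedy accept/reject loop; `perplexity`, `polish`, and `detector` below are hypothetical callables standing in for the LM perplexity scorer, the external LLM polisher, and the black-box detector, none of which the summary pins down concretely.

```python
# Sketch of inference-time adversarial refinement: polish the least fluent
# sentences first, but keep a replacement only if the detector still
# classifies the whole text as human-written.
def refine(sentences, perplexity, polish, detector, tau=0.5):
    order = sorted(range(len(sentences)),
                   key=lambda i: -perplexity(sentences[i]))  # worst first
    current = list(sentences)
    for i in order:
        candidate = list(current)
        candidate[i] = polish(current[i])        # LLM-polished replacement
        if detector(" ".join(candidate)) < tau:  # still evades detection
            current = candidate                  # accept; else keep original
    return " ".join(current)
```

The accept condition is what preserves the attack: fluency repairs are only committed when they do not push the text back across the detection boundary.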
Loss & Training¶
- SFT stage: Multi-task loss \(\mathcal{L}_{SFT} = \lambda\mathcal{L}_{recon} + (1-\lambda)\mathcal{L}_{trans}\), initialized from BART-base.
- DPO stage: \(\mathcal{L}_{DPO} = -\mathbb{E}[\log\sigma(h_\theta(\mathbf{y}_w|\mathbf{x}) - h_\theta(\mathbf{y}_l|\mathbf{x}))]\), where \(h_\theta(\mathbf{y}|\mathbf{x}) = \beta\log\frac{\pi_\theta(\mathbf{y}|\mathbf{x})}{\pi_{ref}(\mathbf{y}|\mathbf{x})}\).
- Hard negative selection criterion: \(D(\mathbf{y}_l) > \tau\), ensuring maximization of the probability margin between preference pairs.
- Optimizer: AdamW; trained on a single NVIDIA RTX 3090 GPU.
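For concreteness, the per-pair DPO objective above reduces to a few lines. The log-probabilities are assumed to come from the current policy \(\pi_\theta\) and the frozen SFT reference \(\pi_{ref}\); this is a generic DPO sketch, not MASH's training code.

```python
import math

# DPO loss for one preference pair: -log sigma(h(y_w) - h(y_l)),
# where h(y) = beta * log(pi_theta(y|x) / pi_ref(y|x)).
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    h_w = beta * (logp_w - ref_logp_w)  # implicit reward: chosen (human) text
    h_l = beta * (logp_l - ref_logp_l)  # implicit reward: rejected rewrite
    return -math.log(1.0 / (1.0 + math.exp(-(h_w - h_l))))
```

At equal implicit rewards the loss equals \(\log 2\); raising the policy's relative likelihood of the chosen human text over the rejected, still-detected rewrite drives it below that, which is exactly the margin the hard-negative criterion \(D(\mathbf{y}_l) > \tau\) is chosen to maximize.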
Key Experimental Results¶
Main Results (Against RoBERTa Detector, ASR↑)¶
| Method | Essay | Reuters | WP | Humanity | Social | STEM | Avg. |
|---|---|---|---|---|---|---|---|
| DeepWordBug | 0.13 | 0.02 | 0.51 | 0.07 | 0.11 | 0.07 | 0.15 |
| TextFooler | 0.38 | 0.29 | 0.59 | 0.11 | 0.10 | 0.07 | 0.26 |
| Charmer | 0.45 | 0.05 | 0.62 | 0.73 | 0.84 | 0.29 | 0.50 |
| GradEscape | 0.22 | 0.02 | 0.00 | 0.37 | 0.52 | 0.38 | 0.25 |
| CoPA | 0.01 | 0.00 | 0.17 | 0.20 | 0.19 | 0.16 | 0.12 |
| MASH (Ours) | 0.95 | 0.73 | 0.90 | 0.87 | 0.98 | 1.00 | 0.92 |
Ablation Study (Against Binoculars Detector)¶
| Configuration | Essay ASR | Reuters ASR | WP ASR | Note |
|---|---|---|---|---|
| MASH Full | 0.94 | 0.95 | 0.85 | Complete method |
| w/o DPO | ~0.23* | ~0.16* | ~0.33* | ASR drops substantially without DPO |
| w/o Style-SFT | ~0.03* | ~0.03* | ~0.02* | Nearly ineffective without SFT |
*Ablation values estimated from the DPO-Evader baseline (which uses DPO only, without Style-SFT)
Key Findings¶
- A 0.1B model outperforms larger models: MASH, based on BART-base (0.14B), comprehensively surpasses PromptAttack, DIPPER, and other methods that employ large LLMs in attack success rate, demonstrating that precise style alignment is more important than model scale.
- Text quality is well preserved: MASH matches or exceeds the best baselines on the GRUEN (fluency) metric; BERTScore (semantic preservation) remains in the 0.89–0.90 range, and perplexity is maintained at a reasonable level.
- Strong cross-detector generalization: MASH performs well against RoBERTa (training-based), Binoculars (zero-shot), SCRN (denoising reconstruction-based), and commercial APIs (Writer, Scribbr).
- Incremental contribution of each stage: Style-SFT provides foundational style transfer capability; DPO substantially improves the ability to cross the detection boundary; inference-time refinement further ensures output quality. All three stages are indispensable.
Highlights & Insights¶
- Elegance of inverse data construction: The asymmetry that "rewriting human text into AI style is easy, while rewriting AI text into human style is hard" is leveraged to obtain high-quality parallel corpora at zero cost. This idea generalizes to any style transfer task exhibiting directional asymmetry.
- Reframing detection evasion as style transfer: Reformulating the adversarial objective against detectors as a style transfer problem—shifting focus from "how to fool the detector" to "how to make text read like human writing"—represents a conceptual breakthrough that enables a new methodological approach.
- DPO as implicit adversarial training: Embedding detector confidence as an implicit reward within the DPO framework elegantly unifies preference learning with adversarial optimization, without requiring an explicitly defined reward model.
- Extremely low computational overhead: Training requires only a single RTX 3090 GPU; BART-base has just 0.14B parameters. Inference requires no detector queries (only limited interactions during training), making the approach suitable for practical deployment.
Limitations & Future Work¶
- Only ChatGPT-generated text is evaluated as the source; the transferability to text generated by other LLMs (e.g., Claude, Llama) remains unverified.
- The DPO stage requires limited query-based interaction with the detector; alternative strategies are needed when the detector is completely inaccessible.
- Evaluation is limited to English text; style differences and detection evasion in multilingual settings remain unexplored.
- The paper adopts an attacker-only perspective and does not thoroughly discuss how to build detectors that are robust against MASH.
- The inference-time refinement stage relies on an external LLM polisher, introducing additional inference cost.
Related Work & Insights¶
- vs. DIPPER (Krishna et al., 2023): DIPPER uses a T5-XXL (11B) rewriter with controllable lexical and syntactic parameters, but requires white-box access to the source generator and achieves very low ASR in black-box settings (0.07 on Essay). MASH achieves 0.95 with only a 0.14B model in a purely black-box setting.
- vs. DPO-Evader (Nicks et al., 2023): DPO-Evader directly optimizes with DPO but lacks Style-SFT initialization, resulting in extremely low ASR (0.00 on Essay). This demonstrates that the style prior provided by SFT is critical to the success of DPO.
- vs. CoPA (Fang et al., 2025): CoPA relies on computationally expensive contrastive learning-based rewriting and achieves only 0.01–0.20 ASR in black-box settings. MASH comprehensively outperforms it through a more efficient SFT+DPO pipeline.
- vs. GradEscape (Meng et al., 2025): GradEscape requires gradient information from the detector and degrades significantly under strict black-box conditions. MASH maintains high ASR in a purely black-box setting.
Rating¶
- Novelty: ⭐⭐⭐⭐ Reframing detection evasion as style transfer is a novel perspective; the inverse data construction is particularly elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 6 domains, 5 detectors, and 11 baselines, with full coverage of quality metrics.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear, motivation for each stage is well articulated, and theoretical derivations are rigorous.
- Value: ⭐⭐⭐⭐ Reveals the fragility of current AIGT detectors and provides important insights for research on detector robustness.