DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

Conference: ACL2026
arXiv: 2605.15518
Code: https://github.com/AIDC-AI/Marco-LLM/tree/main/DetectRL-X
Area: AIGC Detection / LLM-Generated Text Detection
Keywords: LLM-Generated Text Detection, Multilingual Robustness, Ternary Classification, Attack Evaluation, DetectRL-X

TL;DR

DetectRL-X constructs a multilingual, multi-domain, multi-attack, and multi-length benchmark with 3.456 million samples that supports both binary and ternary LLM-generated text detection. It shows that existing detectors still have significant robustness gaps in real-world multilingual and human-AI collaborative writing scenarios.

Background & Motivation

Background: LLM-generated text detection is typically defined as a binary classification task that distinguishes Human-Written Text (HWT) from LLM-Generated Text (LGT). Existing detectors fall roughly into two categories: statistical methods, such as Log-Likelihood, Log-Rank, DetectLLM-LRR, GECScore, and Binoculars; and supervised neural detectors, such as XLM-RoBERTa-Classifier (referred to as X-Rob-Classifier in the results) and mDeBERTa-Classifier.
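To make the statistical family concrete, the following is a minimal sketch of the two simplest scores named above, Log-Likelihood and Log-Rank. It assumes the per-position next-token distributions of some scoring language model are already available as plain dictionaries; the toy distributions, function names, and tokens are illustrative assumptions, not the benchmark's code.

```python
import math

def avg_log_likelihood(step_probs, tokens):
    """Mean log-probability of the observed tokens under the scoring model."""
    return sum(math.log(p[t]) for p, t in zip(step_probs, tokens)) / len(tokens)

def avg_log_rank(step_probs, tokens):
    """Mean log of the 1-based rank of each observed token."""
    total = 0.0
    for probs, tok in zip(step_probs, tokens):
        # Rank 1 means the observed token was the model's top prediction.
        rank = 1 + sum(1 for q in probs.values() if q > probs[tok])
        total += math.log(rank)
    return total / len(tokens)

# Toy next-token distributions for a two-token text (hypothetical values).
step_probs = [
    {"the": 0.7, "a": 0.2, "cat": 0.1},
    {"cat": 0.6, "dog": 0.3, "the": 0.1},
]
likely = ["the", "cat"]    # the model's top choice at every step
unlikely = ["cat", "the"]  # low-probability choice at every step
```

In this family of detectors, text that the scoring model finds highly predictable (high log-likelihood, low log-rank) is treated as more likely to be LLM-generated, and a threshold on the score yields the decision.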

Limitations of Prior Work: Many benchmarks cover only a few languages, a limited set of generators, or clean distributions, so they say little about real-world deployment. In commercial scenarios, text may come from different domains, generators, and languages, and may have undergone polishing, expansion, compression, rewriting, back-translation, or character perturbation. More critically, real text is often neither purely human-written nor purely machine-generated, but human-written text revised by an LLM. Such Hybrid LLM-Text (HLT) fits neither class of a binary scheme, which makes the traditional binary formulation unrealistic.

Key Challenge: High scores achieved by detectors in single-domain, single-language, and single-generator settings do not imply their capability to handle real Internet text. Detection evaluation needs to simultaneously cover language differences, domain shifts, generator variations, text length, attacks, and human-AI collaborative writing; otherwise, the reliability of detectors will be systematically overestimated.

Goal: The authors aim to build a detection benchmark closer to real-world usage, covering 8 commercially common languages, 6 high-risk application domains, 4 mainstream generators, 8 attack types (4 paraphrase attacks and 4 perturbation attacks), 4 text length granularities, and 3 types of revision operations, while evaluating both Binary and Ternary tasks.

Key Insight: Instead of proposing a single new detector, the paper first completes the evaluation space. It incorporates HWT, LGT, and HLT into a unified data framework and establishes a leaderboard to compare the performance of 12 representative detection methods under various distribution shifts.

Core Idea: Use more complex and realistic multilingual evaluations to expose the vulnerabilities of detectors, rather than continuing to pursue near-saturated scores on clean binary benchmarks.

Method

DetectRL-X is a benchmark and evaluation framework. Its "methodology" is primarily reflected in data construction, task definition, attack generation, and evaluation module design. The overall design revolves around three text categories: HWT (human-written text), LGT (LLM-generated text), and HLT (human-written text revised with LLM assistance). The Binary task evaluates \(\{HWT, LGT\}\), and the Ternary task evaluates \(\{HWT, HLT, LGT\}\).
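The two label spaces can be written down directly. The sketch below assumes samples are `(text, label)` pairs and that the Binary task is obtained by keeping only HWT and LGT samples; the `Label` enum and `select_task` helper are hypothetical names used for illustration.

```python
from enum import Enum

class Label(str, Enum):
    HWT = "HWT"  # human-written text
    HLT = "HLT"  # human-written text revised with LLM assistance
    LGT = "LGT"  # LLM-generated text

BINARY_LABELS = (Label.HWT, Label.LGT)
TERNARY_LABELS = (Label.HWT, Label.HLT, Label.LGT)

def select_task(samples, task):
    """Keep only the samples whose label belongs to the chosen task."""
    if task == "binary":
        allowed = BINARY_LABELS
    elif task == "ternary":
        allowed = TERNARY_LABELS
    else:
        raise ValueError(f"unknown task: {task!r}")
    return [(text, label) for text, label in samples if label in allowed]

samples = [
    ("a human draft", Label.HWT),
    ("a human draft polished by a model", Label.HLT),
    ("a model answer", Label.LGT),
]
```

Calling `select_task(samples, "binary")` drops the hybrid sample, while `select_task(samples, "ternary")` keeps all three.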

Overall Architecture

Data construction begins by collecting authentic human-written text in 8 languages: English, German, Spanish, French, Portuguese, Russian, Arabic, and Chinese. The languages are grouped into high, medium, and low complexity based on their linguistic characteristics. Text sources cover six domains: Academic, News, Novel, SEO, Wiki, and WebText. Only text publicly released before 2022 is used, to mitigate the risk of LGT contamination.

Subsequently, the authors generate LGT using DeepSeek-V3, Gemini-2.5-flash, GPT-4o, and Qwen-Max. The three revision operations (polishing, expanding, and condensing) are performed by Qwen-Max on HWT and LGT to construct HLT and post-processed LGT, respectively. Finally, the framework applies multilingual paraphrase attacks and perturbation attacks, and generates sub-samples at four length granularities (64, 128, 256, and 512 tokens). The final dataset comprises 3,456,000 samples, split into train and test sets at a 2:1 ratio.
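As a quick check on the stated 2:1 ratio, the sketch below computes the implied split sizes and shows one simple way to produce such a split. The shuffling strategy and helper names are assumptions for illustration; how the authors actually partition the data (for example, whether they stratify by language or domain) is not specified here.

```python
import random

TOTAL_SAMPLES = 3_456_000  # dataset size stated in the text

def split_sizes(total, train_parts=2, test_parts=1):
    """Return (train, test) sizes for a train_parts:test_parts split."""
    train = total * train_parts // (train_parts + test_parts)
    return train, total - train

def split_indices(n, seed=0):
    """Shuffle indices 0..n-1 deterministically and cut them 2:1."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(n)
    return indices[:n_train], indices[n_train:]

train_size, test_size = split_sizes(TOTAL_SAMPLES)
```

Under these assumptions the split corresponds to 2,304,000 training samples and 1,152,000 test samples.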

Key Designs

  1. Expansion from Binary to Ternary Classification:

    • Function: Enables evaluation to cover real-world human-AI collaborative writing, rather than just distinguishing pure human versus pure machine origins.
    • Mechanism: The traditional detection function is \(f_{Binary}: T \to \{HWT, LGT\}\); this paper adds \(f_{Ternary}: T \to \{HWT, HLT, LGT\}\). HLT originates from human-written text that has been polished, expanded, or condensed by an LLM, reflecting common collaborative writing in professional and content production settings.
    • Design Motivation: If a detector can only judge the presence of "machine flavor," it struggles with the grey area where original human manuscripts are partially rewritten by LLMs. Ternary classification more accurately exposes the boundary issues of detectors regarding hybrid authorship.
  2. Multilingual, Multi-domain, Multi-generator Data Construction:

    • Function: Prevents detectors from performing well only on English or single writing styles.
    • Mechanism: Data covers 8 languages and 6 domains; generators include DeepSeek-V3, Gemini-2.5-flash, GPT-4o, and Qwen-Max. Language complexity is partitioned by morphological richness and typological distance from English: high complexity (Arabic, Russian, Chinese), medium complexity (German, French, Spanish, Portuguese), and low complexity (English).
    • Design Motivation: The training distributions of most LLMs and detectors are biased toward English. Non-English languages, especially those with significantly different writing systems and morphological structures, pose challenges for tokenization and representation.
  3. Robustness Evaluation across Attacks, Lengths, and Revisions:

    • Function: Simulates rewriting and noise operations that real users might apply to text.
    • Mechanism: Paraphrase attacks include Encoder Paraphrasing, Seq2seq Paraphrasing, Decoder Paraphrasing, and Back-Translation. Perturbation attacks include Character Insertion, Substitution, Deletion, and Zero-width Insertion. Sub-samples of 64/128/256/512 tokens are constructed to evaluate length sensitivity.
    • Design Motivation: Real-world text detection faces "modified LLM text," not raw model output. The attack dimensions test whether detectors rely on fragile surface statistical features.
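The four perturbation attacks and the length sub-sampling can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's implementation: the perturbation rate, the Latin replacement alphabet, and whitespace tokenization are all hypothetical choices (whitespace splitting in particular does not suit Chinese, and the tokenizer behind the 64/128/256/512 lengths is not given here).

```python
import random

ZERO_WIDTH_SPACE = "\u200b"
KINDS = ("insert", "substitute", "delete", "zero_width")
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed replacement alphabet

def perturb(text, kind, rate=0.05, seed=0):
    """Apply one character-level perturbation to roughly `rate` of the characters."""
    if kind not in KINDS:
        raise ValueError(f"unknown perturbation: {kind!r}")
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)                       # leave this character alone
        elif kind == "insert":
            out.append(ch + rng.choice(ALPHABET))
        elif kind == "substitute":
            out.append(rng.choice(ALPHABET))
        elif kind == "zero_width":
            out.append(ch + ZERO_WIDTH_SPACE)    # invisible when rendered
        # kind == "delete": append nothing, so the character is dropped
    return "".join(out)

def truncate(text, max_tokens):
    """Keep the first `max_tokens` whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])

LENGTHS = (64, 128, 256, 512)
```

Zero-width insertion is the interesting case: the perturbed text looks identical to a human reader but changes the character sequence a detector's tokenizer receives.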

Loss & Training

The paper does not propose a new training loss; it evaluates 12 existing detectors. Statistical methods include Log-Likelihood, Log-Rank, DetectLLM-LRR, GECScore, ReviseDetect, Fast-DetectGPT, Binoculars, Lastde++, RepreGuard, and Biscope; neural methods include X-Rob-Classifier and mDeBERTa-Classifier. Watermarking methods are excluded because LLMs in real-world scenarios are mostly black boxes whose generation process cannot be accessed. Results are reported on separate Binary and Ternary leaderboards across eight evaluation settings: In-Distribution, Cross-Domain, Cross-Generator, Cross-Language, Cross-Paraphrase, Cross-Perturbation, Cross-Length, and Cross-Operation.
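Since the leaderboards report F1-style scores, a generic per-class and macro-averaged F1 computation is sketched below for reference. This is the textbook definition and is only an assumption about the flavor of metric involved; the exact definitions of the paper's \(F^B_1\) and \(F^F_1\) are not reproduced here.

```python
def per_class_f1(y_true, y_pred, labels):
    """F1 for each label, computed from true/false positives and false negatives."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

# Toy ternary example: the hybrid sample is mistaken for pure LLM output.
labels = ["HWT", "HLT", "LGT"]
y_true = ["HWT", "HLT", "LGT", "LGT"]
y_pred = ["HWT", "LGT", "LGT", "LGT"]
```

In the toy example a single missed HLT sample sets that class's F1 to zero and pulls the macro average down sharply, which illustrates how a third class can lower averaged scores relative to the binary case.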

Key Experimental Results

Main Results

| Task | Best / Repr. Method | Avg. \(F^B_1\) | Avg. \(F^F_1\) | Interpretation |
|---|---|---|---|---|
| Binary | X-Rob-Classifier | 95.58% | 91.31% | Ranked 1st on Binary leaderboard; neural detectors overall strongest |
| Binary | mDeBERTa-Classifier | 95.48% | 93.20% | 2nd in Binary, but higher \(F^F_1\) than X-Rob-Classifier |
| Ternary | mDeBERTa-Classifier | 87.68% | 81.10% | Strongest in Ternary, but significant drop compared to Binary |
| Binary / Ternary | Biscope | 80.06% / 59.69% | 63.62% / 37.91% | Even weak neural detectors outperform best statistical detectors on several averages |
| In-Distribution Stat. | GECScore | 83.22% | N/A | Statistical methods are unstable even under ID in complex multilingual mixtures |

Ablation Study

| Robustness Dimension | Observed Performance Change | Explanation |
|---|---|---|
| Cross-Language | Neural Binary avg \(F^B_1\) drops 95.3% \(\to\) 91.4%; Ternary 87.10% \(\to\) 66.28% | Cross-lingual transfer is especially difficult in ternary; mDeBERTa drops 20.55% |
| Cross-Domain vs Cross-Gen | Neural Binary: Cross-Domain drops 2.95%, Cross-Gen only 0.78% | Domain shift is a more significant real-world bottleneck than generator shift |
| Paraphrase / Perturbation | Neural Binary drops 28.1% / 13.1%; Ternary drops 16.8% / 4.3% | Rewriting damages detection signals more than fine-grained character noise |
| Length / Operation | Neural Binary drops \(\sim\)4.5% / 1%; Ternary drops 11.9% / 13.4% | Length and revision operations have a larger impact on ternary classification |
| Binary vs Ternary ID | Stat. drops 67.9% \(\to\) 39.3%; Neural drops 97.6% \(\to\) 87.1% | HLT category significantly increases task difficulty and reflects real hybrid authorship |
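The before/after pairs in the table above can be turned into absolute drops (in percentage points) and relative drops with a small helper. The figures are copied from the table; the dictionary keys and the `drop` function are hypothetical names, and whether the table's other "drops" are absolute or relative is not stated, so the helper reports both.

```python
# (before, after) pairs in percent, copied from the robustness table.
reported = {
    "cross_language_binary_neural": (95.3, 91.4),
    "cross_language_ternary_neural": (87.10, 66.28),
    "id_binary_to_ternary_stat": (67.9, 39.3),
    "id_binary_to_ternary_neural": (97.6, 87.1),
}

def drop(before, after):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = round(before - after, 2)
    relative = round(100 * (before - after) / before, 1)
    return absolute, relative

drops = {name: drop(*pair) for name, pair in reported.items()}
```

For example, the cross-language ternary pair corresponds to a 20.82-point absolute drop, while the in-distribution statistical pair loses 28.6 points when moving from Binary to Ternary, close to three times the 10.5-point loss of the neural pair.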

Key Findings

  • Neural detectors are generally stronger than statistical ones, but the problem is not "solved." Significant performance drops still occur in Cross-Language, Cross-Domain, and paraphrase scenarios.
  • Statistical detectors are fragile against real hybrid distributions. Even in In-Distribution settings, their average \(F^B_1\) is only 67.89%, indicating that single-domain/single-generator experiments overestimate their effectiveness.
  • The Ternary task is closer to real deployment. On overall averages, statistical methods drop from 58.3% to 35.3%, and neural methods from 90.4% to 76.7%, showing that HLT blurs the boundary between HWT and LGT.
  • Paraphrasing is more dangerous than character perturbation. Statistical detectors lose 25-40% under binary paraphrase, and neural detectors drop up to 35.5%, indicating reliance on surface features that can be eliminated by rewriting.
  • Language complexity provides an analytical dimension. High-complexity languages (Arabic, Russian, Chinese) pose greater challenges in tokenization, representation, and cross-lingual transfer.

Highlights & Insights

  • The primary highlight is shifting the evaluation task from "clean but simple" binary classification back to the real world. The HLT category is crucial because much actual text is not fully automated but human-written and model-polished.
  • The 8 evaluation dimensions make the benchmark feel more like a stress test than a simple leaderboard. It tells developers whether cross-language, cross-domain, length, or rewriting is the primary factor degrading the detector.
  • The paper provides a practical conclusion for statistical methods: they have low deployment costs and better interpretability but are unstable in multi-domain, multilingual, and attack scenarios. One should not rely solely on clean test sets.
  • Insights for detector training: Future models require language-invariant features, domain-robust features, and explicit training data for identifying HLT, rather than just learning the generator fingerprints of LGT.

Limitations & Future Work

  • The authors acknowledge the temporal relevance of the benchmark. As LLM generation quality rapidly improves, future model outputs will move closer to human text, potentially rendering current generators and styles obsolete.
  • Language coverage remains limited. Although 8 languages are broader than most benchmarks, many regional, low-resource languages and dialect variations are still missing.
  • Watermarking methods were excluded. The reasoning is that commercial LLMs are black-boxes in real applications, but this means the benchmark does not cover the "proactive watermark embedding" detection route.
  • The definition of ternary classification could be further refined. HLT currently only covers polishing, expanding, and condensing; future work could include multi-turn human-AI collaborative writing, local paragraph rewriting, and post-editing after cross-lingual translation.

Comparison with Related Work

  • vs. Traditional LGT Binary Benchmarks: Traditional datasets usually only distinguish HWT/LGT. DetectRL-X adds HLT, making evaluation more aligned with real-world human-AI collaboration.
  • vs. Multi-generator Evaluations (e.g., M4, RAID): These benchmarks have expanded generators and attacks, but DetectRL-X emphasizes multilingualism, ternary classification, and a unified 8-dimensional robustness comparison.
  • vs. Statistical Detectors: Methods like DetectLLM-LRR, GECScore, and Binoculars rely on probability or logit features. Their advantage is being unsupervised or weakly supervised, while their weakness lies in cross-domain, cross-lingual, and paraphrase robustness.
  • vs. Neural Detectors: X-Rob-Classifier and mDeBERTa-Classifier rank higher, but still show significant declines in Cross-Language and Ternary settings, suggesting that supervised detection requires a broader training distribution.

Rating

  • Novelty: ⭐⭐⭐⭐☆ No new detection algorithm is proposed, but the benchmark design integrates HLT, multilingualism, and realistic attacks comprehensively.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 3.456M samples, covering 8 languages, 6 domains, 4 generators, 12 detectors, and 8 evaluation dimensions.
  • Writing Quality: ⭐⭐⭐⭐☆ Clear argumentation, although tables extracted from PDF are quite long and some metric naming may be challenging for readers.
  • Value: ⭐⭐⭐⭐⭐ Highly valuable for actual AIGC detection deployment, especially as a reminder not to focus solely on clean binary accuracy in English.