HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring

Conference: ACL 2025
arXiv: 2506.02959
Code: -
Area: AIGC Detection / Human-AI Coauthored Text Detection
Keywords: machine-generated text detection, fine-grained detection, human-AI coauthoring, word-level attribution, sequence labeling

TL;DR

This study proposes HACo-Det, a fine-grained machine-generated text (MGT) detection benchmark tailored for human-AI collaborative writing. By employing a multi-round local rewriting pipeline, it automatically constructs 11,200 human-AI coauthored texts with word-level attribution labels. It adapts seven mainstream detectors into a word-level sequence labeling formulation for systematic evaluation, revealing significant room for improvement in current fine-grained detection methods.

Background & Motivation

Real-world Demand: With the fast-growing popularity of human-AI collaborative writing systems like GPT-4o Canvas, Notion AI, and Wordcraft, human and AI contributions are increasingly intertwined. Traditional document-level binary detection (classifying the entire text as either "human-written" or "machine-written") fails to meet the practical requirements of fine-grained authorship attribution.
Limitations of Prior Datasets: (1) Mainstream datasets label the entire segment of "human-written prompt + LLM continuation" as MGT, neglecting the authorship of the human-written prefix. (2) They label rewritten texts entirely as machine-generated, though words that remain morphologically unchanged should retain their original human authorship. (3) They only simulate single-round collaboration, leaving a substantial gap to real-world multi-round interaction scenarios.
Bottlenecks of Detection Methods: Metric-based methods (e.g., DetectGPT, Fast-DetectGPT) rely on white-box token-level statistics. While effective at the document level, they lack the capability to generalize to fine-grained word-level or sentence-level detection. Fine-tuning-based methods are more powerful, but their cross-domain and cross-model generalization remains an open challenge.
Key Insight: The work builds HACo-Det, a multi-round rewriting dataset with word-level annotations, and frames MGT detection as a unified word-level sequence labeling task \(D_w(T_w) = L_w\), in which a detector \(D_w\) maps a word sequence \(T_w\) to a label sequence \(L_w\) of the same length. Sentence-level labels are obtained by majority voting over word labels, and seven detectors are compared under three settings: In-Distribution (IND), Out-of-Distribution across domains (OOD-Domain), and Out-of-Distribution across generators (OOD-Model).

Method

Overall Architecture

A three-stage pipeline is proposed: Original Text Sampling \(\rightarrow\) Multi-round LLM Local Rewriting \(\rightarrow\) Word-level/Sentence-level Grounded Annotation.

Data Sources: Human-written texts from four domains (News: XSum, Stories: WritingPrompts, Academic Papers: Dagpap24, Wikipedia: Wikipedia_en), with 2,800 documents per domain, totaling 11,200 documents.
Generators: Four instruction-tuned LLMs (Llama-3, Mixtral, GPT-4o mini, GPT-4o), paired with distinct rewriting instruction templates.
Annotation Method: Grounded attribution labeling based on word-level alignment—determining whether each word is Human-Written Text (HWT) or Machine-Generated Text (MGT) by comparing word span alignment before and after rewriting.

Key Designs

1. Multi-round Local Rewriting Strategy: In each round, NLTK splits the text into sentences, and a pre-defined procedure (Algorithm 1) selects a contiguous span of sentences to send to the LLM for rewriting. Each round rewrites only spans that have not yet been labeled as MGT, so rewritten regions never overlap and authorship stays unambiguous. The number of rewriting rounds grows with text length, from 2–3 rounds for short texts (News) to more than 6 for long texts (Wikipedia), which yields a natural spread of AI intervention ratios.
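
The round structure described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the regex sentence splitter stands in for NLTK, and `pick_span`, `multi_round_rewrite`, `max_len`, and the `rewrite_fn` callback (a placeholder for the LLM call) are hypothetical names. The only property taken from the description is that each round rewrites a contiguous span that no earlier round has touched.

```python
import random
import re


def split_sentences(text):
    # Assumption: a naive regex splitter standing in for NLTK's sentence tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pick_span(rewritten, max_len, rng):
    """Return a half-open (start, end) run of untouched segments, or None."""
    free = [i for i, done in enumerate(rewritten) if not done]
    if not free:
        return None
    start = end = rng.choice(free)
    # Grow the span to the right while the next segment is still untouched.
    while (end + 1 < len(rewritten) and not rewritten[end + 1]
           and end + 1 - start < max_len):
        end += 1
    return start, end + 1


def multi_round_rewrite(text, rounds, rewrite_fn, max_len=2, seed=0):
    """Rewrite up to `rounds` non-overlapping spans; return (segments, flags)."""
    rng = random.Random(seed)
    segments = split_sentences(text)
    rewritten = [False] * len(segments)
    for _ in range(rounds):
        span = pick_span(rewritten, max_len, rng)
        if span is None:  # nothing human-written is left to rewrite
            break
        start, end = span
        new_text = rewrite_fn(" ".join(segments[start:end]))
        # Collapse the span into one segment flagged as machine-rewritten.
        segments[start:end] = [new_text]
        rewritten[start:end] = [True]
    return segments, rewritten


# `str.upper` plays the role of the LLM so rewritten spans are easy to spot.
segments, flags = multi_round_rewrite(
    "A one. B two. C three. D four. E five.", rounds=2, rewrite_fn=str.upper
)
print(list(zip(segments, flags)))
```

Because every rewritten span is collapsed into a single flagged segment, later rounds can only select text that is still human-written, so the number of MGT segments equals the number of rounds that found something to rewrite.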

2. Word-level Grounded Attribution Labeling: Words in the rewritten span \(S'_{\text{span}}\) are labeled as MGT by default. However, if a word appears in the corresponding original span \(S_{\text{span}}\) with the same lemma (including grammatical inflections), it retains its original HWT label. This rule avoids the oversimplification that "rewriting equals complete machine generation," ensuring that words retaining original expressions within an LLM-rewritten paragraph are still attributed to the human author.
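
A rough sketch of this labeling rule is given below. Two simplifications are assumptions rather than details from the paper: `toy_lemma` is a hypothetical suffix-stripping stand-in for a real lemmatizer, and matching is done against the set of lemmas in the original span rather than through positional word alignment.

```python
def toy_lemma(word):
    """Hypothetical lemmatizer: lowercase, strip punctuation and common suffixes."""
    w = word.lower().strip(".,;:!?\"'()")
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def label_rewritten_span(original_span, rewritten_span, lemma=toy_lemma):
    """Label each rewritten word HWT if its lemma occurs in the original span."""
    original_lemmas = {lemma(w) for w in original_span.split()}
    return [
        (w, "HWT" if lemma(w) in original_lemmas else "MGT")
        for w in rewritten_span.split()
    ]


labels = label_rewritten_span(
    "The cats were sleeping on the sofa.",
    "The cat slept soundly on a couch.",
)
for word, label in labels:
    print(f"{word}\t{label}")
```

In this toy example "cat" keeps its HWT label because it shares a lemma with "cats", while "couch" is labeled MGT even though it is a synonym of "sofa"; that is the semantic-substitution case the surface-level rule does not capture.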

3. Unified Sequence Labeling and Sentence-level Aggregation: All detection tasks are cast as word-level binary sequence labeling. Sentence-level labels are obtained by majority voting: a sentence \(s_i\) with \(m\) words is labeled MGT if its MGT words outnumber its HWT words, i.e. \(L_{s_i} = \arg\max_{c \in \{\text{HWT}, \text{MGT}\}} \sum_{k=1}^{m} \mathbb{I}(L_{w_{i,k}} = c)\). Seven mainstream detectors (DeBERTa, SeqXGPT, DetectGPT, Fast-DetectGPT, NPR, LRR, GLTR) are all adapted to fit this framework.
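
The majority-vote aggregation can be written in a few lines. The function name `sentence_label` is illustrative; the tie-breaking toward HWT follows from the rule that a sentence is MGT only when MGT words strictly outnumber HWT words.

```python
from collections import Counter


def sentence_label(word_labels):
    """Aggregate word-level labels into one sentence-level label by majority vote."""
    counts = Counter(word_labels)
    # MGT wins only with a strict majority; ties and empty input fall back to HWT.
    return "MGT" if counts["MGT"] > counts["HWT"] else "HWT"


print(sentence_label(["MGT", "MGT", "HWT"]))
print(sentence_label(["MGT", "HWT"]))
```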

Dataset Statistics

Domain	Samples	Avg. MGT Segments	MGT Word Ratio	Avg. Text Length (Words)
News (XSum)	2,800	2.44	45%	1,489
Story (WritingPrompts)	2,800	3.97	40%	1,964
Paper (Dagpap24)	2,800	4.56	32%	4,558
Wikipedia	2,800	6.01	26%	7,724
Total	11,200	4.25	36%	3,934
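
As a quick sanity check on the table, the Total row appears to match the unweighted mean of the four domain rows. The snippet below recomputes it from the values above. It also computes a length-weighted MGT ratio for contrast; that figure is only an approximation built from per-domain averages, and which convention the paper uses is not stated here.

```python
# (avg MGT segments, MGT word ratio in %, avg text length) per domain, from the table.
rows = {
    "News": (2.44, 45, 1489),
    "Story": (3.97, 40, 1964),
    "Paper": (4.56, 32, 4558),
    "Wikipedia": (6.01, 26, 7724),
}

# Unweighted mean of each column across the four equally sized domains.
segments_mean, ratio_mean, length_mean = (
    sum(column) / len(column) for column in zip(*rows.values())
)
print(segments_mean, ratio_mean, length_mean)

# Approximate length-weighted MGT ratio: longer documents contribute more words.
total_length = sum(length for _, _, length in rows.values())
weighted_ratio = sum(ratio * length for _, ratio, length in rows.values()) / total_length
print(weighted_ratio)
```

The unweighted means come out at about 4.245, 35.75, and 3933.75, which agree with the Total row after rounding. The length-weighted ratio is lower (about 31.3%) because the longest domains have the smallest share of MGT words.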

Data analysis reveals: GPT-4o rewrites most aggressively (lowest similarity to the original text), while Llama-3 is the most conservative. Texts in the Paper and Wikipedia domains are much longer than the others (averaging roughly 4.6k and 7.7k, respectively, against under 2k for News and Story), a property that appears to help cross-domain generalization.

Experiments

Main Results

Detector	Category	F1-W (Word)	F1-S (Sentence)	AUC-W	AUC-S
Random	-	0.433	0.497	-	-
DeBERTa	Finetune	0.831	0.966	-	-
SeqXGPT	Finetune	0.513	0.674	-	-
DetectGPT	Metric (w/ perturb)	0.375	0.459	0.482	0.501
Fast-DetectGPT	Metric (w/ perturb)	0.501	0.533	0.507	0.510
NPR	Metric (w/ perturb)	0.414	0.473	0.485	0.509
log prob	Metric (w/o perturb)	0.479	0.444	0.482	0.511
LRR	Metric (w/o perturb)	0.475	0.516	0.483	0.510
entropy	Metric (w/o perturb)	0.479	0.392	0.488	0.511

Core Conclusion: At the word level, metric-based methods perform close to the random baseline (mean F1-W \(\approx\) 0.45 versus 0.433 for random), whereas the fine-tuned DeBERTa stands out with an F1-W of 0.831 and an F1-S of 0.966.
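
The comparison against the random baseline can be reproduced directly from the F1-W column of the table. The snippet only restates the reported numbers; the variable names are illustrative.

```python
from statistics import mean

# Word-level F1 of the metric-based detectors, copied from the table above.
metric_f1_w = {
    "DetectGPT": 0.375,
    "Fast-DetectGPT": 0.501,
    "NPR": 0.414,
    "log prob": 0.479,
    "LRR": 0.475,
    "entropy": 0.479,
}
random_f1_w = 0.433

average = mean(list(metric_f1_w.values()))
below_random = sorted(n for n, f1 in metric_f1_w.items() if f1 < random_f1_w)

print(f"mean metric-based F1-W: {average:.3f}")
print("below the random baseline:", below_random)
```

The mean works out to about 0.454, only marginally above the random baseline of 0.433, and two of the six metric-based detectors (DetectGPT and NPR) fall below it.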

OOD Generalization

Generalization Setting	DeBERTa F1-W Range	SeqXGPT F1-W Range	Metric-based F1-W
IND (Baseline)	0.831	0.513	0.375–0.501
OOD-Domain (Cross-domain)	0.776–0.830	0.450–0.490	Mostly minor fluctuations
OOD-Model (Cross-model)	0.768–0.854	0.462–0.498	Mostly minor fluctuations

DeBERTa keeps its F1-W at roughly 0.77 or higher (minimum 0.768) across both cross-domain and cross-model settings, well above every other method, none of which exceeds 0.52 in any setting.
When trained on long-text domains (Paper/Wikipedia), the OOD performance is even better than IND, indicating that the diverse linguistic patterns in long texts aid generalization.
Training on the GPT-4o corpus yields better cross-model generalization, as its high degree of rewriting provides a stronger learning signal.

Document-Level AI Ratio Prediction

Method	DeBERTa	SeqXGPT	DetectGPT	Fast-DetectGPT	NPR	LRR
Sentence-level Error	1.78%	12.00%	10.70%	14.75%	21.01%	17.44%
Word-level Error	1.84%	10.00%	43.48%	12.65%	32.37%	12.00%

DeBERTa predicts the document-level AI ratio with an error of less than 2%, providing a reliable estimate of AI intervention in practical auditing scenarios.
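
One way to read this task: the document-level AI ratio is the fraction of units (words or sentences) labeled MGT, and the error is the gap between the predicted and the gold ratio. The sketch below follows that reading. The absolute-difference error is an assumption, since the exact error definition is not given here, and the function names are hypothetical.

```python
def ai_ratio(labels):
    """Fraction of labels that are MGT (0.0 for an empty document)."""
    if not labels:
        return 0.0
    return sum(label == "MGT" for label in labels) / len(labels)


def ratio_error(predicted, gold):
    """Absolute difference between predicted and gold AI ratios (assumed metric)."""
    return abs(ai_ratio(predicted) - ai_ratio(gold))


gold = ["HWT"] * 6 + ["MGT"] * 4        # gold AI ratio: 40%
predicted = ["HWT"] * 5 + ["MGT"] * 5   # predicted AI ratio: 50%
print(f"{ratio_error(predicted, gold):.0%}")
```

Under this reading a detector can get individual words wrong and still estimate the ratio well, as long as its false positives and false negatives roughly cancel out, so a low ratio error does not by itself imply accurate word-level labels.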

Key Findings

Metric-based Methods Fail Entirely: All metric-based detectors perform near the random baseline on word-level detection. Simple token-level statistics do not carry enough signal for fine-grained detection, and the perturbation-based variants (DetectGPT, Fast-DetectGPT, NPR) bring no noticeable improvement.
Semantic Representation is Key: DeBERTa acquires embedding-level semantic features through supervised fine-tuning and successfully separates intertwined human-written and machine-generated word sequences. SeqXGPT is also fine-tuned, but it reaches only 0.513 F1-W, which suggests that its architecture adapts poorly to this setting.
Context Window acts as a Bottleneck: Chunking long documents leads to a loss of context. The F1 scores of both DeBERTa and SeqXGPT increase monotonically with larger chunk sizes.
Zero-shot Detection is Far from Viable: When the DeBERTa encoder is frozen and only the classification head is trained, the IND F1-W drops from 0.831 to 0.571. Similarly, the sentence-level zero-shot performance of metric-based methods remains near random.
Difficulty in Generalization Across Rewriting Paradigms: DeBERTa trained on HACo-Det suffers a significant performance drop when transferred to SeqXGPT-Bench (which uses a different rewriting pipeline).
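
The context-window finding is easier to picture with a small sketch of chunked inference. It assumes fixed-size, non-overlapping chunks; `chunk_words`, `detect_document`, and the `detect_chunk` callback are hypothetical names, and the dummy detector only records how much context it was given.

```python
def chunk_words(words, chunk_size):
    """Split a word sequence into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def detect_document(words, chunk_size, detect_chunk):
    """Run a word-level detector chunk by chunk and concatenate its labels."""
    labels = []
    for chunk in chunk_words(words, chunk_size):
        # The detector never sees words outside the current chunk.
        labels.extend(detect_chunk(chunk))
    return labels


seen_lengths = []


def dummy_detector(chunk):
    # Placeholder for a real model: records the context size, labels everything HWT.
    seen_lengths.append(len(chunk))
    return ["HWT"] * len(chunk)


words = "a b c d e f g".split()
labels = detect_document(words, chunk_size=3, detect_chunk=dummy_detector)
print(seen_lengths, len(labels))
```

Each word still receives exactly one label, but the detector's view is capped at `chunk_size` words, so a rewritten span that straddles a chunk boundary is judged in two separate pieces.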

Highlights & Insights

Realistic Task Design: Multi-round local rewriting closely simulates real human-AI collaborative writing, which is far more realistic than simple "human prompt + AI continuation" dynamics.
Reasonable Word-level Grounded Annotation: The alignment-based method preserves original attribution, avoiding the oversimplification of treating any paraphrased segment as entirely machine-generated.
Comprehensive Experimental Design: A complete matrix covering 3 evaluation settings (IND / OOD-Domain / OOD-Model) \(\times\) 4 domains \(\times\) 4 generators.
Innovative AI Ratio Estimation: Connecting fine-grained detection with document-level AI intervention estimation, which holds substantial practical utility for content auditing.

Limitations & Future Work

The definition of word-level "authorship transfer" is somewhat subjective. Labeling rules relying purely on morphological alignment are relatively simple and do not account for semantic-equivalence substitutions.
It only covers paraphrase scenarios, omitting more complex collaborative operations such as LLM-mediated insertions, deletions, or structural reorganization.
The dataset is restricted to English, leaving multilingual or cross-lingual detection capabilities unexplored.
Each document is edited by a single LLM, failing to simulate multi-LLM workflows or active multi-round manual human editing.
The study does not introduce a novel detection algorithm; the primary contribution lies in the dataset and the systematic benchmark evaluation.

Related Work

Document-level MGT Detection: DetectGPT (Mitchell et al., 2023) based on log-probability curvature; Fast-DetectGPT (Bao et al., 2023) introducing conditional expectation acceleration; and RAID (Dugan et al., 2024) establishing a large-scale robustness benchmark.
Fine-grained Detection: Mixtext three-way classification (Gao et al., 2024); boundary detection (Kushnareva et al., 2024); MGT localization (Zhang et al., 2024c); and sentence-level multi-feature fusion (Tao et al., 2024).
Detector Robustness: Shi et al. (2024) on adversarial attacks (word substitution + prompt attacks); Wang et al. (2024) testing multi-level perturbations showing significant performance degradation in most detectors.
Human-AI Co-writing: Lee et al. (2022) and Chakrabarty et al. (2022) on collaborative poetry; Reza et al. (2024, 2025) on collaborative content generation.

Rating

Dimension	Score	Brief Comment
Novelty	⭐⭐⭐⭐	The first human-AI coauthored detection benchmark featuring multi-round rewriting and word-level attribution annotation.
Technical Depth	⭐⭐⭐	Focuses primarily on the dataset and systematic evaluation; the detection models themselves are adaptations of existing work.
Experimental Thoroughness	⭐⭐⭐⭐⭐	Complete coverage of IND/OOD-Domain/OOD-Model, evaluated with 7 methods across 3 levels of granularity.
Value	⭐⭐⭐⭐	The AI ratio prediction perspective holds practical content-auditing value, revealing notable gaps in current methods.