
When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Conference: ACL 2026
arXiv: 2510.12476
Code: GitHub
Area: AIGC Detection
Keywords: machine-generated text detection, personalized text, feature inversion, style transfer, robustness

TL;DR

This paper reveals the "feature-inversion trap" in MGT detectors under personalization—features that distinguish human-written and machine-generated text in the general domain get inverted in the personalized domain, causing detector performance to plummet or even flip. The proposed StyloCheck framework predicts cross-domain performance changes by quantifying detector reliance on inverted features, achieving prediction correlation above 0.85.

Background & Motivation

Background: Large language models are increasingly capable of imitating personal writing styles. Personalized text generation (e.g., style imitation, ghostwriting) has become a real threat. Existing MGT detection methods perform well in general scenarios (e.g., news, Wikipedia), with AUROC reaching 85%+.

Limitations of Prior Work: No one has systematically studied MGT detector performance under personalization. The authors construct the first personalized MGT detection benchmark StyloBench and find that existing detectors suffer dramatic performance degradation on personalized text, even experiencing inversion—e.g., Fast-DetectGPT achieves 98.78% AUROC in the general domain but drops to 8.71% on personalized literary style imitation text, nearly completely inverting.

Key Challenge: Detectors rely on discriminative features (e.g., text diversity—assuming human-written text is more diverse than machine text) that fail under personalization. Personalized MGT may actually be more diverse and less coherent than original human-written text, causing the feature direction to flip.

Goal: (1) Construct a personalized MGT detection benchmark; (2) Explain the mechanism behind detector performance degradation; (3) Propose a diagnostic tool for predicting detector cross-domain transfer performance.

Key Insight: The authors train domain classifiers and find that in the general domain, MGT domain feature values are slightly lower than HWT, but in the personalized domain this inverts to MGT being higher than HWT—suggesting a systematic feature direction inversion across domains.

Core Idea: Formalize the feature inversion problem as a Rayleigh quotient optimization problem, extract the maximum inversion direction, and build the diagnostic framework StyloCheck based on this.

Method

Overall Architecture

The method has three parts: (1) StyloBench benchmark construction—including literary work imitation (via CPT fine-tuning LLMs) and blog style imitation (via few-shot prompting) sub-scenarios; (2) Theoretical analysis of the feature-inversion trap—finding the inverted feature direction via Rayleigh quotient and verifying its correlation with detector performance; (3) StyloCheck diagnostic framework—generating probe datasets that only retain inverted features through token shuffling, assessing detector reliance on inverted features.

Key Designs

  1. Inverted Feature Direction Extraction (Rayleigh Quotient Method):

    • Function: Find the feature direction where HWT/MGT differences maximally invert between general and personalized domains
    • Mechanism: Use the deep residual stream of GPT-2 as the text representation space. Compute per-pair MGT-HWT difference vectors \(v_{G,i}\) and \(v_{S,i}\) for the general and personalized domains respectively, construct the cross-domain matrix \(A = \frac{1}{2}\sum_i (v_{G,i} v_{S,i}^\top + v_{S,i} v_{G,i}^\top)\), and solve \(\min_{\|\mathbf{w}\|=1} \mathbf{w}^\top A \mathbf{w}\). The eigenvector \(\mathbf{w}^*\) of the minimum eigenvalue is the direction of strongest inversion: projected onto it, general-domain MGT feature values are significantly higher than HWT, while in the personalized domain the ordering completely inverts
    • Design Motivation: Elevate the inversion phenomenon from "intuitive observation" to "quantifiable mathematical object"; the Rayleigh quotient guarantees finding the globally optimal inversion direction
  2. StyloCheck Probe Dataset Construction:

    • Function: Construct probe datasets that differ only in the inverted feature dimension, removing semantic, style, and category confounders
    • Mechanism: Apply varying degrees of token shuffling to text (controlling shuffling intensity with Kendall τ), eliminating semantic and style information while preserving inverted feature values. From shuffled variants, select the 50 with highest feature values as positive samples and 50 with lowest as negative samples. Validation shows domain classifiers and MGT classifiers perform near-random on probe sets, confirming effective removal of confounders
    • Design Motivation: Only by isolating the influence of inverted features can we accurately measure detector reliance on that feature
  3. Cross-Domain Transfer Performance Prediction:

    • Function: Predict detector performance change from general to personalized domain based on probe dataset AUROC
    • Mechanism: If a detector's AUROC > 0.5 on the probe set, it relies on inverted features and will degrade after transfer; AUROC < 0.5 indicates reverse reliance, potentially improving after transfer; AUROC ≈ 0.5 indicates no reliance, with stable performance. In experiments, StyloCheck prediction correlates with actual cross-domain performance gap with Pearson correlation exceeding 0.7 in 78% of experiments
    • Design Motivation: Provide a "health check report" for detectors before deployment, predicting risks without large-scale cross-domain testing
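The probe-set construction and the AUROC-based reliance rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `bigram_diversity` is a toy order-sensitive stand-in for projecting representations onto the inverted direction, and the adjacent-swap shuffler with a pair-counting Kendall τ is an assumed simplification.

```python
import random

def kendall_tau(order):
    """Kendall tau of a permutation against the identity ordering:
    +1 = untouched, values near 0 = heavily shuffled."""
    n = len(order)
    concordant = sum(1 for i in range(n) for j in range(i + 1, n)
                     if order[i] < order[j])
    pairs = n * (n - 1) // 2
    return (2 * concordant - pairs) / pairs  # no ties in a permutation

def partial_shuffle(tokens, n_swaps, seed=0):
    """Shuffle by random adjacent swaps; more swaps -> lower tau."""
    rng = random.Random(seed)
    idx = list(range(len(tokens)))
    for _ in range(n_swaps):
        i = rng.randrange(len(idx) - 1)
        idx[i], idx[i + 1] = idx[i + 1], idx[i]
    return [tokens[i] for i in idx], kendall_tau(idx)

def bigram_diversity(tokens):
    """Toy order-sensitive feature score (stand-in for the
    projection onto the inversion direction w*)."""
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / max(len(bigrams), 1)

def build_probe(tokens, n_variants=200, k=50):
    """Generate shuffled variants, score them, and keep the k
    highest-scoring as positives and the k lowest as negatives."""
    scored = []
    for s in range(n_variants):
        n_swaps = random.Random(s).randrange(1, 5 * len(tokens))
        shuffled, tau = partial_shuffle(tokens, n_swaps, seed=s)
        scored.append((bigram_diversity(shuffled), tau, shuffled))
    scored.sort(key=lambda v: v[0], reverse=True)
    return scored[:k], scored[-k:]
```

A detector would then be scored on the pooled positive/negative probes: AUROC well above 0.5 signals reliance on the inverted feature (expect degradation after transfer), well below 0.5 reverse reliance, and near 0.5 no reliance.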

Loss & Training

StyloCheck is a diagnostic framework, not a training method. The inverted feature direction is solved via eigendecomposition without training.
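The eigendecomposition step can be sketched with NumPy. This is a minimal illustration on synthetic difference vectors; in the paper the per-pair vectors would come from GPT-2 residual-stream features rather than random data.

```python
import numpy as np

def max_inversion_direction(v_G, v_S):
    """Given (n, d) arrays of per-pair MGT-HWT difference vectors in
    the general (v_G) and personalized (v_S) domains, return the unit
    direction w* minimizing w^T A w, where
    A = 1/2 * sum_i (v_Gi v_Si^T + v_Si v_Gi^T).
    A negative minimum eigenvalue means the MGT-HWT gap projects with
    opposite signs in the two domains, i.e. a genuine inversion."""
    A = 0.5 * (v_G.T @ v_S + v_S.T @ v_G)   # symmetric (d, d)
    eigvals, eigvecs = np.linalg.eigh(A)    # eigenvalues ascending
    return eigvecs[:, 0], eigvals[0]        # minimizer and its value

# Synthetic check: dimension 0 flips sign between domains.
rng = np.random.default_rng(0)
n, d = 500, 16
e0 = np.eye(d)[0]
v_G = rng.normal(scale=0.1, size=(n, d)) + 2.0 * e0   # MGT > HWT
v_S = rng.normal(scale=0.1, size=(n, d)) - 2.0 * e0   # sign flipped
w_star, lam = max_inversion_direction(v_G, v_S)
print(abs(w_star[0]))   # near 1: recovers the inverted dimension
print(lam < 0)          # True: inversion present
```

Because the Rayleigh quotient is minimized exactly by the eigenvector of the smallest eigenvalue, a single `eigh` call yields the globally optimal direction with no iterative training.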

Key Experimental Results

Main Results (Detector Cross-Domain Performance)

| Detector | M4 (General) Avg AUROC | Stylo-Blog Avg | Stylo-Literary Avg |
| --- | --- | --- | --- |
| Fast-DetectGPT | 84.52 | 77.20 | 20.13 |
| Lastde | 91.72 | 70.78 | 66.04 |
| Lastde++ | 90.55 | 76.68 | 49.24 |
| Entropy | 34.90 | 44.56 | 76.18 |
| Log-Likelihood | 79.86 | 71.63 | 25.59 |

Ablation Study (StyloCheck Prediction Reliability)

| Number of Probe Datasets | Pearson r > 0.5 Ratio | Pearson r > 0.7 Ratio |
| --- | --- | --- |
| 5 | 90% | 78% |
| More | Higher | Higher |

Key Findings

  • The deeper the personalization (CPT vs few-shot), the more severe the detector degradation—Stylo-Literary (CPT-trained) shows far more dramatic performance decline than Stylo-Blog (few-shot prompted)
  • Entropy detector is the only method with improved personalized domain performance (AUROC from ~35% to ~76%), because its reliance on inverted features is directionally opposite to other detectors
  • The inverted feature direction is highly consistent across different datasets (mean pairwise cosine similarity 0.547, far above the near-zero value expected for random directions in a high-dimensional space), indicating a structural cross-domain phenomenon rather than a dataset-specific coincidence
  • Inverted features correlate with "text diversity"—personalized MGT breaks the traditional assumption that "HWT is more diverse than MGT"

Highlights & Insights

  • Transforming a practical problem into an elegant mathematical formulation: The feature-inversion phenomenon is precisely formulated as a Rayleigh quotient problem with a closed-form, interpretable solution. This route from observed phenomenon to formal theory is a pattern worth emulating
  • StyloCheck's "health check" approach has high practical value: without collecting large amounts of target-domain data, a handful of token-shuffled probe sets suffices to predict detector cross-domain performance at minimal deployment cost
  • Counter-intuitive finding: Personalized MGT is more "diverse" than original HWT, overturning the basic assumption in the MGT detection field that "machine-generated text is more monotonous"

Limitations & Future Work

  • Only English scenarios are studied; stylistic feature distributions may differ across languages
  • StyloBench contains only 7 authors and 4 blog generators, with limited scale
  • StyloCheck can only predict performance changes based on inverted features; if detectors degrade due to other factors, it cannot capture them
  • No fundamental fix is proposed—how to train detectors that do not rely on inverted features remains an open problem
Comparisons

  • vs RAID / M4 and other general benchmarks: These benchmarks focus on general-domain MGT detection without considering personalization; StyloBench fills this gap and reveals structural weaknesses of general detectors
  • vs Fast-DetectGPT: One of the strongest detectors in the general domain, but AUROC drops to 8.71% on personalized literary imitation, nearly completely inverting—showing high general-domain performance cannot guarantee robustness
  • vs Training-based detectors: In-domain fine-tuning can recover performance, but cross-domain generalization remains limited

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to reveal the feature-inversion trap with mathematical characterization; StyloCheck diagnostic framework is innovative
  • Experimental Thoroughness: ⭐⭐⭐⭐ 7 detectors, 11 generators, multi-domain testing, though dataset scale could be larger
  • Writing Quality: ⭐⭐⭐⭐ Clear logic from phenomenon to theory to application, though notation-heavy requiring cross-referencing
  • Value: ⭐⭐⭐⭐⭐ Serves as a warning for the MGT detection field; StyloCheck has direct practical value