
When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Conference: ACL 2026
arXiv: 2510.12476
Code: GitHub
Area: AIGC Detection
Keywords: machine-generated text detection, personalized text, feature inversion, style transfer, robustness

TL;DR

This paper reveals the "feature-inversion trap" in MGT detectors under personalization—features that distinguish human-written and machine-generated text in the general domain get inverted in the personalized domain, causing detector performance to plummet or even flip. The proposed StyloCheck framework predicts cross-domain performance changes by quantifying detector reliance on inverted features, achieving prediction correlation above 0.85.

Background & Motivation

Background: Large language models are increasingly capable of imitating personal writing styles. Personalized text generation (e.g., style imitation, ghostwriting) has become a real threat. Existing MGT detection methods perform well in general scenarios (e.g., news, Wikipedia), with AUROC reaching 85%+.

Limitations of Prior Work: No one has systematically studied MGT detector performance under personalization. The authors construct the first personalized MGT detection benchmark StyloBench and find that existing detectors suffer dramatic performance degradation on personalized text, even experiencing inversion—e.g., Fast-DetectGPT achieves 98.78% AUROC in the general domain but drops to 8.71% on personalized literary style imitation text, nearly completely inverting.

Key Challenge: Detectors rely on discriminative features (e.g., text diversity—assuming human-written text is more diverse than machine text) that fail under personalization. Personalized MGT may actually be more diverse and less coherent than original human-written text, causing the feature direction to flip.

Goal: (1) Construct a personalized MGT detection benchmark; (2) Explain the mechanism behind detector performance degradation; (3) Propose a diagnostic tool for predicting detector cross-domain transfer performance.

Key Insight: The authors train domain classifiers and find that in the general domain, MGT domain feature values are slightly lower than HWT, but in the personalized domain this inverts to MGT being higher than HWT—suggesting a systematic feature direction inversion across domains.

Core Idea: Formalize the feature inversion problem as a Rayleigh quotient optimization problem, extract the maximum inversion direction, and build the diagnostic framework StyloCheck based on this.

Method

Overall Architecture

The method has three parts: (1) StyloBench benchmark construction—including literary work imitation (via CPT fine-tuning LLMs) and blog style imitation (via few-shot prompting) sub-scenarios; (2) Theoretical analysis of the feature-inversion trap—finding the inverted feature direction via Rayleigh quotient and verifying its correlation with detector performance; (3) StyloCheck diagnostic framework—generating probe datasets that only retain inverted features through token shuffling, assessing detector reliance on inverted features.

Key Designs

  1. Inverted Feature Direction Extraction (Rayleigh Quotient Method):

    • Function: Find the feature direction where HWT/MGT differences maximally invert between general and personalized domains
    • Mechanism: Use the deep residual stream of GPT-2 as the text representation space. Compute per-pair MGT-HWT difference vectors \(v_{G,i}\) and \(v_{S,i}\) for the general and personalized domains respectively, construct the cross-domain matrix \(A = \frac{1}{2}\sum_i (v_{G,i} v_{S,i}^\top + v_{S,i} v_{G,i}^\top)\), and solve \(\min_{\|\mathbf{w}\|=1} \mathbf{w}^\top A \mathbf{w}\). The eigenvector \(\mathbf{w}^*\) of the minimum eigenvalue is the direction of strongest inversion: projected onto it, general-domain MGT feature values are significantly higher than HWT, while in the personalized domain the ordering completely inverts
    • Design Motivation: Elevate the inversion phenomenon from "intuitive observation" to "quantifiable mathematical object"; the Rayleigh quotient guarantees finding the globally optimal inversion direction
  2. StyloCheck Probe Dataset Construction:

    • Function: Construct probe datasets that differ only in the inverted feature dimension, removing semantic, style, and category confounders
    • Mechanism: Apply varying degrees of token shuffling to text (controlling shuffling intensity with Kendall τ), eliminating semantic and style information while preserving inverted feature values. From shuffled variants, select the 50 with highest feature values as positive samples and 50 with lowest as negative samples. Validation shows domain classifiers and MGT classifiers perform near-random on probe sets, confirming effective removal of confounders
    • Design Motivation: Only by isolating the influence of inverted features can we accurately measure detector reliance on that feature
  3. Cross-Domain Transfer Performance Prediction:

    • Function: Predict detector performance change from general to personalized domain based on probe dataset AUROC
    • Mechanism: If a detector's AUROC > 0.5 on the probe set, it relies on inverted features and will degrade after transfer; AUROC < 0.5 indicates reverse reliance, potentially improving after transfer; AUROC ≈ 0.5 indicates no reliance, with stable performance. In experiments, StyloCheck prediction correlates with actual cross-domain performance gap with Pearson correlation exceeding 0.7 in 78% of experiments
    • Design Motivation: Provide a "health check report" for detectors before deployment, predicting risks without large-scale cross-domain testing
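The probe-set construction and the AUROC-based reliance rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `bigram_diversity` is a toy order-sensitive stand-in for projecting representations onto the inverted direction, and the adjacent-swap shuffler with a pair-counting Kendall τ is an assumed simplification.

```python
import random

def kendall_tau(order):
    """Kendall tau of a permutation against the identity ordering:
    +1 = untouched, values near 0 = heavily shuffled."""
    n = len(order)
    concordant = sum(1 for i in range(n) for j in range(i + 1, n)
                     if order[i] < order[j])
    pairs = n * (n - 1) // 2
    return (2 * concordant - pairs) / pairs  # no ties in a permutation

def partial_shuffle(tokens, n_swaps, seed=0):
    """Shuffle by random adjacent swaps; more swaps -> lower tau."""
    rng = random.Random(seed)
    idx = list(range(len(tokens)))
    for _ in range(n_swaps):
        i = rng.randrange(len(idx) - 1)
        idx[i], idx[i + 1] = idx[i + 1], idx[i]
    return [tokens[i] for i in idx], kendall_tau(idx)

def bigram_diversity(tokens):
    """Toy order-sensitive feature score (stand-in for the
    projection onto the inversion direction w*)."""
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / max(len(bigrams), 1)

def build_probe(tokens, n_variants=200, k=50):
    """Generate shuffled variants, score them, and keep the k
    highest-scoring as positives and the k lowest as negatives."""
    scored = []
    for s in range(n_variants):
        n_swaps = random.Random(s).randrange(1, 5 * len(tokens))
        shuffled, tau = partial_shuffle(tokens, n_swaps, seed=s)
        scored.append((bigram_diversity(shuffled), tau, shuffled))
    scored.sort(key=lambda v: v[0], reverse=True)
    return scored[:k], scored[-k:]
```

A detector would then be scored on the pooled positive/negative probes: AUROC well above 0.5 signals reliance on the inverted feature (expect degradation after transfer), well below 0.5 reverse reliance, and near 0.5 no reliance.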

Loss & Training

StyloCheck is a diagnostic framework, not a training method. The inverted feature direction is solved via eigendecomposition without training.
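The eigendecomposition step can be sketched with NumPy. This is a minimal illustration on synthetic difference vectors; in the paper the per-pair vectors would come from GPT-2 residual-stream features rather than random data.

```python
import numpy as np

def max_inversion_direction(v_G, v_S):
    """Given (n, d) arrays of per-pair MGT-HWT difference vectors in
    the general (v_G) and personalized (v_S) domains, return the unit
    direction w* minimizing w^T A w, where
    A = 1/2 * sum_i (v_Gi v_Si^T + v_Si v_Gi^T).
    A negative minimum eigenvalue means the MGT-HWT gap projects with
    opposite signs in the two domains, i.e. a genuine inversion."""
    A = 0.5 * (v_G.T @ v_S + v_S.T @ v_G)   # symmetric (d, d)
    eigvals, eigvecs = np.linalg.eigh(A)    # eigenvalues ascending
    return eigvecs[:, 0], eigvals[0]        # minimizer and its value

# Synthetic check: dimension 0 flips sign between domains.
rng = np.random.default_rng(0)
n, d = 500, 16
e0 = np.eye(d)[0]
v_G = rng.normal(scale=0.1, size=(n, d)) + 2.0 * e0   # MGT > HWT
v_S = rng.normal(scale=0.1, size=(n, d)) - 2.0 * e0   # sign flipped
w_star, lam = max_inversion_direction(v_G, v_S)
print(abs(w_star[0]))   # near 1: recovers the inverted dimension
print(lam < 0)          # True: inversion present
```

Because the Rayleigh quotient is minimized exactly by the eigenvector of the smallest eigenvalue, a single `eigh` call yields the globally optimal direction with no iterative training.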

Key Experimental Results

Main Results (Detector Cross-Domain Performance)

| Detector | M4 (General) Avg AUROC | Stylo-Blog Avg | Stylo-Literary Avg |
| --- | --- | --- | --- |
| Fast-DetectGPT | 84.52 | 77.20 | 20.13 |
| Lastde | 91.72 | 70.78 | 66.04 |
| Lastde++ | 90.55 | 76.68 | 49.24 |
| Entropy | 34.90 | 44.56 | 76.18 |
| Log-Likelihood | 79.86 | 71.63 | 25.59 |

Ablation Study (StyloCheck Prediction Reliability)

| Number of Probe Datasets | Pearson r > 0.5 Ratio | Pearson r > 0.7 Ratio |
| --- | --- | --- |
| 5 | 90% | 78% |
| More | Higher | Higher |

Key Findings

  • The deeper the personalization (CPT vs few-shot), the more severe the detector degradation—Stylo-Literary (CPT-trained) shows far more dramatic performance decline than Stylo-Blog (few-shot prompted)
  • Entropy detector is the only method with improved personalized domain performance (AUROC from ~35% to ~76%), because its reliance on inverted features is directionally opposite to other detectors
  • The inverted feature direction is highly consistent across different datasets (mean pairwise cosine similarity 0.547, far above the near-zero value expected for random directions in a high-dimensional space), indicating a structural cross-domain phenomenon rather than a dataset-specific coincidence
  • Inverted features correlate with "text diversity"—personalized MGT breaks the traditional assumption that "HWT is more diverse than MGT"

Highlights & Insights

  • Transforming a practical problem into an elegant mathematical formulation: The feature-inversion phenomenon is precisely formulated as a Rayleigh quotient problem with a closed-form, interpretable solution. This route from observed phenomenon to formal theory is a pattern worth emulating
  • StyloCheck's "health check" approach has high practical value: without collecting large amounts of target-domain data, a handful of token-shuffled probe sets suffices to predict detector cross-domain performance at minimal deployment cost
  • Counter-intuitive finding: Personalized MGT is more "diverse" than original HWT, overturning the basic assumption in the MGT detection field that "machine-generated text is more monotonous"

Limitations & Future Work

  • Only English scenarios are studied; stylistic feature distributions may differ across languages
  • StyloBench contains only 7 authors and 4 blog generators, with limited scale
  • StyloCheck can only predict performance changes based on inverted features; if detectors degrade due to other factors, it cannot capture them
  • No fundamental fix is proposed—how to train detectors that do not rely on inverted features remains an open problem
Comparisons

  • vs RAID / M4 and other general benchmarks: These benchmarks focus on general-domain MGT detection without considering personalization; StyloBench fills this gap and reveals structural weaknesses of general detectors
  • vs Fast-DetectGPT: One of the strongest detectors in the general domain, but AUROC drops to 8.71% on personalized literary imitation, nearly completely inverting—showing high general-domain performance cannot guarantee robustness
  • vs Training-based detectors: In-domain fine-tuning can recover performance, but cross-domain generalization remains limited

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to reveal the feature-inversion trap with mathematical characterization; StyloCheck diagnostic framework is innovative
  • Experimental Thoroughness: ⭐⭐⭐⭐ 7 detectors, 11 generators, multi-domain testing, though dataset scale could be larger
  • Writing Quality: ⭐⭐⭐⭐ Clear logic from phenomenon to theory to application, though notation-heavy requiring cross-referencing
  • Value: ⭐⭐⭐⭐⭐ Serves as a warning for the MGT detection field; StyloCheck has direct practical value