AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

Conference: NeurIPS 2025
arXiv: 2506.22508
Code: GitHub
Area: AI Safety / Privacy Protection
Keywords: text anonymization, privacy protection, reinforcement learning, adversarial training, small language models

TL;DR

This paper proposes AgentStealth, a framework that trains a small language model (SLM) through a three-stage pipeline: an adversarial anonymization workflow, supervised fine-tuning (SFT), and online reinforcement learning. The resulting model anonymizes user-generated content effectively while preserving text utility, reporting a 12.3% improvement in anonymization performance and a 6.8% improvement in utility.

Background & Motivation

In the digital era, the large volume of text generated by users on social media, forums, and other platforms often contains implicit personal identity cues — such as writing style, habitual vocabulary, and topical preferences — which adversaries may exploit to infer sensitive personal attributes (e.g., age, gender, occupation, and geographic location).

Text anonymization aims to rewrite text so as to eliminate such identity cues while preserving the semantic content and utility of the original text. However, existing approaches face multiple challenges:

Rule-based substitution methods: Simple replacement of keywords (e.g., names, locations) tends to degrade readability and utility.

Cloud-based LLM methods: Large models such as GPT-4 deliver strong performance but incur high costs and introduce privacy risks of their own — uploading sensitive text to the cloud contradicts the original motivation for anonymization.

The dilemma of small models: Locally deployed SLMs suffer from insufficient training data and supervision signals, resulting in suboptimal anonymization performance.

Key Challenge: Effective anonymization demands the capabilities of large models, yet deploying large models in the cloud introduces privacy leakage risks. The central challenge is enabling locally deployed small models to achieve powerful anonymization capabilities.

Method

Overall Architecture

AgentStealth adopts a three-stage progressive training strategy:

Stage 1: Adversarial Anonymization Workflow
  • High-quality anonymization data is constructed using a large model (e.g., DeepSeek-V3).
  • The workflow involves adversarial interaction between two roles: an attacker and an anonymizer.
  • High-quality triples of (original text, anonymized text, attack signal) are collected.
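
To make the Stage 1 loop concrete, here is a minimal sketch of one adversarial round. The prompt templates, attribute list, and function names are illustrative assumptions, not the paper's exact implementation:

```python
# Hypothetical sketch of one Stage-1 adversarial round; `llm` stands in
# for a large model such as DeepSeek-V3. Prompts are illustrative only.

ATTRIBUTES = ["age", "gender", "occupation", "location"]

def attack(llm, text):
    """Attacker role: infer personal attributes from the text (the attack signal)."""
    prompt = f"Infer the author's {', '.join(ATTRIBUTES)} from the text:\n{text}"
    return llm(prompt)

def anonymize(llm, text, attack_signal):
    """Anonymizer role: rewrite the text to remove the cues the attacker used."""
    prompt = (f"An attacker inferred the following about the author: {attack_signal}\n"
              f"Rewrite the text to remove these cues while preserving its meaning:\n{text}")
    return llm(prompt)

def collect_triples(llm, text, n_rounds=3):
    """Alternate attack and anonymization, logging (original, anonymized, signal)."""
    triples, current = [], text
    for _ in range(n_rounds):
        signal = attack(llm, current)
        rewritten = anonymize(llm, current, signal)
        triples.append((current, rewritten, signal))
        current = rewritten
    return triples
```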

Stage 2: Supervised Fine-Tuning (SFT)
  • The SLM is fine-tuned on the high-quality data collected in Stage 1.
  • Base models include Llama-3.1-8B-Instruct and Qwen-2.5-1.5B-Instruct.
  • Training is conducted with the LLaMA-Factory framework.

Stage 3: Online Reinforcement Learning (RL)
  • The SLM performs self-reinforcement using its internal adversarial feedback.
  • The model acts simultaneously as both the anonymizer and the attacker in an adversarial game.
  • The anonymization policy is iteratively optimized via PPO or a similar algorithm.

Key Designs

1. In-context Contrastive Learning

Contrastive learning is incorporated into the workflow to enhance anonymization quality:
  • Positive examples: successfully anonymized texts (the attacker fails to infer the author's attributes).
  • Negative examples: unsuccessfully anonymized texts (the attacker still infers the author's attributes).
  • In-context learning helps the model internalize what constitutes effective anonymization.
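
A minimal sketch of how such contrastive exemplars might be folded into the anonymization prompt; the wording and the way examples are selected are assumptions:

```python
# Illustrative in-context contrastive prompt construction. `positives` and
# `negatives` are anonymized texts from earlier rounds, labeled by whether
# the attacker's inference failed or succeeded.

def build_contrastive_prompt(text, positives, negatives):
    parts = ["Examples of SUCCESSFUL anonymization (attacker inference failed):"]
    parts += [f"- {p}" for p in positives]
    parts.append("Examples of FAILED anonymization (attacker inference succeeded):")
    parts += [f"- {n}" for n in negatives]
    parts.append("Imitate the successful examples, avoid the failure modes, and "
                 "rewrite the following text:\n" + text)
    return "\n".join(parts)
```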

2. Adaptive Utility-Aware Control

Privacy protection and text utility are dynamically balanced during anonymization:
  • Utility evaluation: measures the degree of semantic preservation between the original and anonymized text.
  • Adaptive threshold: adjusts rewriting intensity according to the current anonymization difficulty.
  • Prevents over-rewriting that would cause the text to lose its original meaning.
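
One plausible reading of this control loop is sketched below; `semantic_sim`, the `intensity` knob, and the 0.85 threshold are illustrative assumptions, not the paper's actual values:

```python
# Hedged sketch of adaptive utility-aware control: rewrite more aggressively
# only while semantic similarity to the original stays above a threshold.

def utility_aware_rewrite(anonymizer, semantic_sim, text,
                          min_utility=0.85, max_rounds=3):
    current = text
    for intensity in range(max_rounds):
        candidate = anonymizer(current, intensity=intensity)  # stronger each round
        if semantic_sim(text, candidate) < min_utility:
            break  # further rewriting would destroy the original meaning
        current = candidate
    return current
```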

3. Dual-Signal Training Data

The SFT data incorporate both anonymization signals and attack signals:
  • Anonymization signal: teaches the model how to rewrite text to eliminate identity cues.
  • Attack signal: teaches the model which features are likely to reveal identity.
  • The dual-signal approach equips the SLM with knowledge of both the "defender" and the "attacker."
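
Concretely, each Stage 1 triple can yield two instruction-tuning samples, one per signal. Here is a sketch in the instruction/input/output format commonly used with LLaMA-Factory; the instruction wording is an assumption:

```python
# Illustrative dual-signal SFT sample construction from Stage-1 triples.

def make_sft_samples(triples):
    samples = []
    for original, anonymized, attack_signal in triples:
        # Anonymization signal: supervise the rewrite itself.
        samples.append({
            "instruction": "Anonymize the text while preserving its meaning.",
            "input": original,
            "output": anonymized,
        })
        # Attack signal: supervise which features leak identity.
        samples.append({
            "instruction": "Infer the personal attributes revealed by the text.",
            "input": original,
            "output": attack_signal,
        })
    return samples
```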

Loss & Training

SFT Stage: Standard language model fine-tuning loss

\[L_{SFT} = -\sum_{t} \log p(y_t | y_{<t}, x)\]

RL Stage: A composite reward function is employed

\[R = \alpha \cdot R_{anon} + \beta \cdot R_{utility} + \gamma \cdot R_{fluency}\]

where:
  • \(R_{anon}\): anonymization reward (positive when the attacker makes an incorrect prediction)
  • \(R_{utility}\): utility reward (based on semantic similarity to the original text)
  • \(R_{fluency}\): fluency reward (based on perplexity)
  • \(\alpha, \beta, \gamma\): trade-off coefficients
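
The reward above transcribes directly into code; the component scorers and the coefficient values below are placeholders:

```python
# Direct transcription of the composite RL reward; `semantic_sim` and
# `perplexity` are placeholder scorers, and the coefficients are illustrative.

def composite_reward(original, rewritten, attacker_pred, true_attrs,
                     semantic_sim, perplexity,
                     alpha=1.0, beta=0.5, gamma=0.1):
    r_anon = 1.0 if attacker_pred != true_attrs else -1.0  # attacker fooled?
    r_utility = semantic_sim(original, rewritten)          # meaning preserved?
    r_fluency = -perplexity(rewritten)                     # lower PPL, higher reward
    return alpha * r_anon + beta * r_utility + gamma * r_fluency
```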

Training is conducted using the Accelerate and TRL libraries for distributed RL training.
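
For orientation, a PPO update in TRL's classic interface might look like the sketch below. TRL's API has changed substantially across versions, so treat this as a shape illustration rather than the paper's actual training script; the model choice, generation settings, and `score_fn` are assumptions:

```python
# Hedged sketch of the RL stage using TRL's classic PPOTrainer interface
# (API details vary by TRL version).
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # the smaller AgentStealth base
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config = PPOConfig(learning_rate=1e-5, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

def score_fn(prompt: str, response: str) -> float:
    """Placeholder for the composite reward sketched above."""
    return 1.0

prompts = ["Anonymize: I grew up near Munich and work as a nurse..."]  # toy batch
queries = [tokenizer(p, return_tensors="pt").input_ids[0] for p in prompts]
responses = [ppo_trainer.generate(q, max_new_tokens=128, return_prompt=False)[0]
             for q in queries]
texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
rewards = [torch.tensor(score_fn(p, t)) for p, t in zip(prompts, texts)]
stats = ppo_trainer.step(queries, responses, rewards)  # one PPO update
```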

Key Experimental Results

Main Results

Anonymization performance comparison on two datasets (Reddit and Coding):

| Method | Reddit Anon. Rate (%) | Reddit Utility (%) | Coding Anon. Rate (%) | Coding Utility (%) |
|---|---|---|---|---|
| No Anonymization | 0.0 | 100.0 | 0.0 | 100.0 |
| Rule-based Substitution | 28.5 | 82.3 | 25.1 | 79.8 |
| Paraphrase (GPT-3.5) | 45.2 | 88.7 | 41.3 | 85.2 |
| LLM Anonymizer (GPT-4) | 62.8 | 91.5 | 58.4 | 89.1 |
| LLM Anonymizer (DeepSeek-V3) | 65.1 | 92.3 | 61.7 | 90.4 |
| AgentStealth-8B (SFT) | 68.3 | 93.8 | 64.2 | 91.7 |
| AgentStealth-8B (SFT+RL) | 73.1 | 95.5 | 69.8 | 93.6 |
| AgentStealth-1.5B (SFT+RL) | 67.5 | 92.1 | 63.8 | 90.2 |

Ablation Study

Contribution of individual components (Reddit dataset):

| Configuration | Anon. Rate (%) | Utility (%) | Overall Score |
|---|---|---|---|
| SLM direct inference | 38.2 | 85.6 | 61.9 |
| + SFT (anonymization signal only) | 58.7 | 90.3 | 74.5 |
| + SFT (dual signal) | 68.3 | 93.8 | 81.1 |
| + Contrastive learning | 70.1 | 94.2 | 82.2 |
| + Utility-aware control | 70.8 | 94.8 | 82.8 |
| + RL self-reinforcement | 73.1 | 95.5 | 84.3 |

Key Findings

  1. SLMs can match or surpass cloud-based LLMs: AgentStealth-8B achieves an anonymization rate approximately 10 percentage points higher than GPT-4, with superior utility.
  2. Each stage of the three-stage training contributes: The anonymization rate improves from 38.2% (direct inference) to 73.1% (full RL model).
  3. Dual-signal training substantially outperforms single-signal training: Learning both anonymization and attack knowledge improves the anonymization rate by approximately 10 percentage points.
  4. The 1.5B model is deployable: The Qwen-2.5-1.5B variant still outperforms the GPT-4 baseline, enabling edge device deployment.
  5. RL self-reinforcement yields significant gains: An additional ~5 percentage point improvement over SFT alone demonstrates the effectiveness of self-play training.
  6. Privacy and utility improve simultaneously: Unlike the typical privacy–utility trade-off, AgentStealth improves both through more intelligent rewriting strategies.

Highlights & Insights

  1. Resolves the core tension: Small models acquire the anonymization capability of large models, enabling local deployment and fundamentally avoiding cloud-side privacy leakage.
  2. Novel self-reinforcement training paradigm: The model simultaneously plays the roles of attacker and defender, continuously improving through self-play.
  3. Strong practical applicability: Supports model scales ranging from 1.5B to 8B parameters, accommodating diverse deployment environments.
  4. Open-source: Code and training configurations are fully open-sourced, ensuring reproducibility.

Limitations & Future Work

  1. Language coverage: Validation is conducted primarily on English data; anonymization effectiveness for Chinese and other languages remains unknown.
  2. Limited attribute coverage: The focus is on a few common personal attribute types; finer-grained identity inference (e.g., writing style analysis) is not fully addressed.
  3. Subjectivity in utility evaluation: Utility scoring partially relies on an LLM-as-judge setup, which may introduce bias.
  4. Review status: The arXiv page indicates the paper is still "under review"; the final version may differ.
  5. Attacker model ceiling: If the attacker employs a stronger model than that used during training, anonymization performance may degrade.
Related Work

  • Language Models are Advanced Anonymizers (Staab et al., ICLR 2025): This paper builds directly upon that work and adopts its attack evaluation framework.
  • Differential privacy-based text methods: Offer stronger theoretical guarantees but incur substantial utility degradation.
  • Style transfer methods: Achieve anonymization by altering writing style.
  • Self-play / RLHF: Successful applications of reinforcement learning to language model alignment.
  • Insights: The adversarial self-reinforcement training paradigm is generalizable to other safety tasks, such as content moderation and deepfake detection.

Rating

  • Novelty: ⭐⭐⭐⭐ (The self-reinforcing anonymization framework is novel, though individual components are relatively standard)
  • Technical Depth: ⭐⭐⭐⭐ (The three-stage training design is complete, with components well integrated)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Two datasets, detailed ablations, multi-scale model comparisons)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, well-motivated presentation)
  • Overall: ⭐⭐⭐⭐ (High practical value, solid technical contributions)