Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks

Conference: ICML 2025
arXiv: 2505.05190
Code: SIRA
Area: AI Safety
Keywords: Text Watermarking, Watermark Robustness, Self-Information, Rewrite Attack, LLM Safety

TL;DR

This paper proposes SIRA (Self-Information Rewrite Attack), which uses self-information to identify the high-entropy tokens that carry the watermark and replaces them in a targeted way. It achieves a near-100% attack success rate across 7 mainstream watermarking methods at a cost of only $0.88 per million tokens. The attack is fully black-box and can be run with any LLM, including models small enough for mobile devices.

Background & Motivation

1. The Importance of Text Watermarking

As LLMs such as ChatGPT and Claude become better at generating text, they also introduce risks such as the spread of misinformation and threats to academic integrity. Text watermarking embeds invisible statistical signals during generation so that a detector can verify whether a text was produced by a specific model.

2. Limitations of Prior Work

Text Manipulation Attacks (word deletion, emoji insertion, etc.): Crude, easily caught by filters, and often degrade semantic quality.
Informed Attacks (watermark-stealing): Require extensive access to the watermarked LLM or even the detector, which is an overly strong assumption.
Generic Rewrite Attacks (DIPPER, GPT paraphraser): Non-targeted brute-force rewriting, which is inefficient and ineffective against newer watermarking algorithms (e.g., SIR).

3. Key Insight

To maintain text quality, watermarking algorithms embed the watermark at high-entropy tokens (positions where the next token is highly uncertain). However, high-entropy tokens also tend to exhibit high self-information. SIRA exploits this "seemingly harmless but exploitable" design flaw:

- Use an arbitrary LLM to compute the self-information of each token.
- Tokens with high self-information are highly likely to be watermarked green-list tokens.
- Masking these tokens turns non-targeted rewriting into a targeted fill-in-the-blank task.
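
The link between entropy and self-information can be made explicit. As a sketch, write $P_k(\cdot) = P(\cdot \mid t_{<k})$ for the next-token distribution at position $k$ (a shorthand introduced here, not notation from the paper) and $I(t_k) = -\log P_k(t_k)$ for the self-information of the token actually emitted. Entropy is then the expected self-information at that position:

$$
H(P_k) = -\sum_{v} P_k(v)\,\log P_k(v) = \mathbb{E}_{t_k \sim P_k}\big[-\log P_k(t_k)\big] = \mathbb{E}_{t_k \sim P_k}\big[I(t_k)\big].
$$

A token sampled at a high-entropy position therefore has high self-information on average, while at a low-entropy position the sampled token is usually the dominant one, with $I(t_k) \approx 0$. Because the attacker's LLM only approximates the watermarked model's distribution, this is a heuristic link rather than an exact one.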

4. Core Advantages

Completely black-box (no access to the watermarking algorithm, key, or detector), transferable to any LLM, and executable even with a 3B mobile-scale model.

Method

Overall Architecture: Two-Step Pipeline

Step 1: Generate Masking Template

- For watermarked text $y_w$, an arbitrary LLM computes the self-information of each token $t_k$: $I(t_k) = -\log P(t_k \mid t_{<k})$.
- Set a threshold and replace every token whose self-information exceeds it with a placeholder, yielding the masked text.
- In parallel, have the LLM perform a generic rewrite of the source text $y_w$, yielding the reference text.
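
The masking step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the per-token probabilities $P(t_k \mid t_{<k})$ have already been obtained from some attack LLM, and the `[MASK]` placeholder, the function names, the example probabilities, and the threshold of 2.0 nats are all hypothetical.

```python
import math

MASK = "[MASK]"  # hypothetical placeholder string


def self_information(prob: float) -> float:
    """I(t) = -log P(t | context), measured in nats."""
    return -math.log(prob)


def mask_high_information(tokens, probs, threshold):
    """Replace tokens whose self-information exceeds `threshold` with MASK."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    return [
        MASK if self_information(p) > threshold else tok
        for tok, p in zip(tokens, probs)
    ]


# Made-up per-token probabilities, standing in for the attack LLM's output.
tokens = ["The", "river", "meanders", "through", "the", "valley"]
probs = [0.60, 0.05, 0.01, 0.70, 0.90, 0.08]

masked = mask_high_information(tokens, probs, threshold=2.0)
print(" ".join(masked))
```

With these made-up numbers, the three low-probability content words are masked, while the predictable function words are left untouched.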

Step 2: Targeted Fill-in-the-Blank

- Input both the masked text and the reference text into the LLM.
- Prompt the LLM to complete the masked positions while maintaining the semantic integrity of the reference text.
- The output text $y_p$ has watermarked tokens replaced with non-watermarked tokens at the masked positions, effectively removing the watermark.
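
One way to assemble the input for the fill-in step is sketched below. The prompt wording, the `[MASK]` placeholder, and the function name `build_fill_in_prompt` are assumptions made for illustration; the paper's actual prompt is not reproduced here, and the call to the LLM itself is left out.

```python
def build_fill_in_prompt(masked_text: str, reference_text: str,
                         mask: str = "[MASK]") -> str:
    """Assemble a fill-in-the-blank prompt from the masked and reference texts."""
    n_blanks = masked_text.count(mask)
    return (
        f"The following text contains {n_blanks} blanks marked {mask}.\n"
        "Fill in every blank so that the result stays faithful to the meaning "
        "of the reference text. Return only the completed text.\n\n"
        f"Masked text:\n{masked_text}\n\n"
        f"Reference text:\n{reference_text}\n"
    )


prompt = build_fill_in_prompt(
    "The [MASK] [MASK] through the [MASK]",
    "A river winds its way across the valley.",
)
print(prompt)
```

The resulting string would then be sent to whichever LLM serves as the attack model.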

Key Designs

1. Self-Information as Watermark Localization Signal

Function: Accurately locates tokens likely to carry the watermark without knowing the watermarking algorithm.
Mechanism: Watermarks are embedded at high-entropy positions $\to$ high-entropy tokens correspond to high self-information $\to$ self-information can be computed by any LLM.
Design Motivation: Transforms non-targeted rewriting (where the LLM arbitrarily decides which words to modify) into targeted fill-in-the-blank (accurately replacing green tokens).
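
To see why replacing green tokens is enough to defeat detection, the sketch below uses the one-proportion z-test of KGW-style detectors, $z = (n_G - \gamma T)/\sqrt{T\gamma(1-\gamma)}$, where $T$ is the number of scored tokens, $n_G$ the number of green tokens, and $\gamma$ the green-list fraction. The token counts, the best-case assumption that every masked token was green, and the detection threshold of 4.0 are illustrative assumptions, not figures from the paper.

```python
import math


def kgw_z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """z-statistic for observing `green_count` green tokens out of `total`."""
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std


Z_THRESHOLD = 4.0  # assumed detection threshold

# Watermarked text: 140 of 200 tokens are green (illustrative numbers).
z_before = kgw_z_score(140, 200)

# Suppose the attack masks 60 tokens, all of them green (best case), and the
# fill-in model, which knows nothing about the green list, lands on a green
# token for about half of them (gamma = 0.5): 140 - 60 + 30 = 110 remain.
z_after = kgw_z_score(110, 200)

print(f"before: z={z_before:.2f}, detected={z_before > Z_THRESHOLD}")
print(f"after:  z={z_after:.2f}, detected={z_after > Z_THRESHOLD}")
```

In this toy setting the green-token surplus shrinks from 40 to 10, which is enough to pull the statistic below the assumed threshold.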

2. Two-Step Paraphrasing Strategy

Design Motivation: Single-step rewriting may keep some watermarked tokens in order to preserve semantics. The two-step strategy first identifies the suspect tokens and then replaces them, ensuring high coverage.
Function: The reference text supplies semantic constraints for the fill-in-the-blank step, which avoids semantic drift after completion.

3. Transferability

The attack model can be any LLM (GPT-4, Llama-3, or even a 3B model).
It does not need to be the same as the watermarked LLM, since computing self-information is a universal language modeling capability.

Key Experimental Results

Main Results: Attack Success Rate

| Watermarking Method | Category | DIPPER Attack | GPT Rewrite | SIRA |
|---|---|---|---|---|
| KGW | KGW Family | 62.3% | 71.5% | 99.2% |
| Unigram | KGW Family | 58.7% | 65.2% | 98.8% |
| EXP | Christ Family | 45.2% | 52.1% | 99.5% |
| SIR | KGW Family (New) | 23.4% | 31.8% | 97.6% |
| EWD | KGW Family | 55.1% | 63.4% | 99.1% |
| DIP | Christ Family | 41.3% | 48.9% | 98.3% |
| UW | KGW Family | 51.8% | 59.7% | 99.0% |

SIRA achieves a near-100% success rate on all 7 methods, with the largest gain on SIR (where DIPPER achieves only 23.4%, versus 97.6% for SIRA).
The attack costs only $0.88 per million tokens, which is far cheaper than meeting the hardware requirements of DIPPER.

Ablation Study

| Configuration | Attack Success Rate | Semantic Preservation | Description |
|---|---|---|---|
| SIRA Full | ~99% | High | Targeted masking + reference fill-in-the-blank |
| Without self-information masking (pure rewrite) | ~55% | Medium | Degenerates to non-targeted attack |
| Without reference text (fill-in-the-blank only) | ~92% | Low-Medium | Lacks semantic constraints |
| Executed with 3B mobile-end model | ~95% | Medium-High | Proves transferability to small models |
| Different thresholds | Threshold ↑: success rate slightly ↓ | Threshold ↑: preservation ↑ | Trade-off between accuracy and conservativeness |
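
The threshold trade-off can be illustrated with a small sweep: raising the threshold masks fewer tokens, so fewer suspected watermark tokens are replaced but more of the original wording survives. This is a minimal sketch with made-up per-token probabilities and a hypothetical helper `masked_fraction`; it measures mask coverage only, not attack success or semantic preservation.

```python
import math


def masked_fraction(probs, threshold):
    """Fraction of tokens whose self-information -log(p) exceeds `threshold` (nats)."""
    masked = sum(1 for p in probs if -math.log(p) > threshold)
    return masked / len(probs)


# Made-up per-token probabilities, standing in for the attack LLM's output.
probs = [0.60, 0.05, 0.01, 0.70, 0.90, 0.08, 0.30, 0.02]

thresholds = [0.5, 1.0, 2.0, 3.0, 4.0]
fractions = [masked_fraction(probs, t) for t in thresholds]
for t, f in zip(thresholds, fractions):
    print(f"threshold={t:.1f} nats -> masked fraction={f:.2f}")
```

The masked fraction can only stay flat or fall as the threshold rises, which matches the direction of the trade-off reported in the table.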

Key Findings

Self-information masking is the core driver of attack success (the ablation shows an improvement from ~55% to ~99%).
The reference text primarily ensures semantic quality rather than the attack success rate.
Even with a 3B small model, the attack success rate remains around 95%, showing that the vulnerability does not depend on the capability of the attack model.
It remains effective against watermarks using dynamic keys (SIR), as self-information localization does not rely on knowledge of the key.

Highlights & Insights

Revealing a Fundamental Vulnerability: Watermarks must be embedded in high-entropy tokens to maintain quality, yet this very property is the signal that exposes them, an inherent contradiction in current watermarking schemes.
Paradigm Shift in Attacks: Moving from non-targeted brute-force rewriting to targeted fill-in-the-blank is a qualitative change in methodology.
Extremely Low Barrier: The attack runs at $0.88 per million tokens with a 3B small model, which implies that almost anyone can remove current watermarks.
Warning to Future Watermark Designs: Watermarking algorithms can no longer rely solely on high-entropy positions as their embedding strategy.

Limitations & Future Work

The attack assumes the attacker can access the full watermarked text; streaming output scenarios would require adaptation.
The self-information threshold must be set manually, and text from different domains may require different thresholds.
Only decoding-phase watermarks have been tested so far; effectiveness against encoding-phase or semantic watermarks remains to be verified.
Defenses are the more urgent open problem: how can watermarking schemes avoid concentrating the embedding in high-entropy tokens?

vs. DIPPER (Krishna et al. 2024): DIPPER relies on a specifically fine-tuned model, is non-targeted, and is ineffective against newer watermarks such as SIR.
vs. GPT Paraphraser: Generic rewriting with low efficiency and a low success rate.
vs. Watermark-stealing: Requires massive access to the watermarked LLM, which assumes too much.
Positioning of This Work: The first targeted black-box rewrite attack, balancing low cost, high success rate, and strong transferability.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to utilize self-information to locate watermark tokens, leading to an attack paradigm shift.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage of 7 watermarking methods, rigorous ablation.
Writing Quality: ⭐⭐⭐⭐⭐ Clear threat model, intuitive technical pipeline.
Value: ⭐⭐⭐⭐⭐ Significant warning value for the watermarking research community.