Robust Multi-bit Text Watermark with LLM-based Paraphrasers
Conference: ICML 2025
arXiv: 2412.03123
Code: github.com/xiaojunxu/multi-bit-text-watermark
Area: AI Safety / Text Watermarking
Keywords: text watermark, multi-bit, paraphrasing, PPO, co-training
TL;DR
Proposes a multi-bit text watermarking method built on LLM paraphrasers. A pair of paraphrasers trained to behave differently is co-trained with a decoding classifier, and the paraphrasers are optimized with PPO reinforcement learning using the decoder as the reward model. With a 1.1B-parameter model, the method reaches a detection AUC above 0.9999 (with best-of-5 paraphrasing) while largely preserving the semantics of the text.
Background & Motivation
- Text watermarking embeds imperceptible signals into text, with applications in copyright protection and tracking LLM-generated text.
- Existing methods have limitations:
    - Synonym substitution methods (such as NLW): Limited action space and poor robustness.
    - LLM-output watermarks (such as KGW, KTH): Only applicable to text generated by the LLM itself, unable to watermark arbitrary text.
    - Paraphrasing-based methods (such as RemarkLLM, Waterfall): Low multi-bit accuracy and insufficient detection AUC.
- Goal: Design a universal, highly accurate, and robust multi-bit text watermarking pipeline.
Method
Overall Architecture
The pipeline is split into encoding and decoding phases:
- Encoding: A pair of paraphrasers \((\theta_0, \theta_1)\) paraphrases the input text sentence by sentence, choosing the paraphraser for each sentence according to the current bit of the watermark code.
- Decoding: The text is segmented by sentence, and a text classifier \(\theta_d\) assigns each segment to class 0 or class 1. Concatenating the predicted classes recovers the watermark code.
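The encode/decode flow can be illustrated with a toy sketch. It is not the authors' implementation: `paraphrase_0`, `paraphrase_1`, and `classify` are hypothetical stand-ins (fixed sentence prefixes) for the fine-tuned paraphrasers \(\theta_0, \theta_1\) and the classifier \(\theta_d\), the regex sentence splitter is a simplification, and cycling through the code when the text has more sentences than bits is an assumption.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical stand-ins for the two fine-tuned paraphrasers theta_0 and theta_1.
def paraphrase_0(sentence: str) -> str:
    return "In short, " + sentence


def paraphrase_1(sentence: str) -> str:
    return "Notably, " + sentence


# Hypothetical stand-in for the decoder theta_d: predicts which paraphraser wrote a segment.
def classify(segment: str) -> int:
    return 1 if segment.startswith("Notably,") else 0


def encode(text: str, code: List[int]) -> str:
    paraphrasers = (paraphrase_0, paraphrase_1)
    out = []
    for i, sentence in enumerate(split_sentences(text)):
        bit = code[i % len(code)]  # assumption: reuse the code cyclically
        out.append(paraphrasers[bit](sentence))
    return " ".join(out)


def decode(text: str) -> List[int]:
    # One predicted bit per sentence-level segment.
    return [classify(segment) for segment in split_sentences(text)]


watermarked = encode("The cat sat. The dog ran! Did it rain? Yes.", [1, 0, 1, 1])
recovered = decode(watermarked)
```

In this toy setting the recovered bits match the embedded code exactly; with real paraphrasers the classifier's predictions are noisy, which is why bit accuracy is reported as a metric.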
Key Designs
Dual Paraphraser Encoding: Two LLM paraphrasers \(\theta_0, \theta_1\) are fine-tuned from the same base model, and a JS divergence loss encourages their outputs to differ during training.
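For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as in the minimal sketch below. The definition is the standard one; how it is applied to the two paraphrasers (for example, to their next-token distributions, with the divergence being pushed up) is an assumption here.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    # KL(p || q) in nats; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint distribution.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The divergence is symmetric, equals 0 for identical distributions, and is bounded above by \(\ln 2\) (reached when the two distributions have disjoint support), which makes it a bounded measure of how far apart two paraphrasers' distributions are.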
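Decoder Training (Cross Entropy): The decoder \(\theta_d\) is trained with a cross-entropy loss \(\ell_D\) to classify each segment of the watermarked text by the bit that selected its paraphraser.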
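One way to write this cross-entropy loss is sketched below. The notation is assumed for illustration: \(x^w_i\) is the \(i\)-th segment of the watermarked text \(x^w\), \(M_i\) is the watermark bit used to paraphrase it, and \(p_{\theta_d}(b \mid x^w_i)\) is the classifier's predicted probability of class \(b\).

\[
\ell_D(x^w, M) = -\sum_{i} \log p_{\theta_d}\!\left(M_i \mid x^w_i\right)
\]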
Encoder Training (PPO + Reward Model):
The decoder acts as the reward model, and the watermark reward is the number of watermark bits it decodes correctly.
The total reward adds a semantic similarity term to the watermark reward, which regularizes the paraphrasers toward preserving the meaning of the input.
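A plausible form of the two rewards is sketched below, under assumed notation: \(x\) is the original text, \(x^w\) the watermarked text with segments \(x^w_i\), \(p_{\theta_d}\) the decoder's class probability, and \(\mathrm{sim}(\cdot,\cdot)\) a hypothetical semantic similarity score. That \(\lambda_s\) is the weight on the similarity term is also an assumption.

\[
r_w(x^w, M) = \sum_{i} \mathbb{1}\!\left[\arg\max_{b \in \{0,1\}} p_{\theta_d}\!\left(b \mid x^w_i\right) = M_i\right],
\qquad
r(x, x^w, M) = r_w(x^w, M) + \lambda_s \, \mathrm{sim}(x, x^w)
\]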
The PPO loss updates each of the two paraphrasers on their respective generated tokens, while including a KL divergence regularization to prevent deviation from the reference model.
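The per-paraphraser update can be sketched with the standard clipped PPO surrogate plus a KL penalty toward the reference model. This is a generic illustration rather than the authors' training code: `TokenRecord`, `clip_eps`, and `kl_coef` are hypothetical names and values, the KL term is a crude single-sample estimate, and treating `kl_coef` as the role played by \(\lambda_k\) is an assumption.

```python
import math
from typing import Dict, List, NamedTuple


class TokenRecord(NamedTuple):
    owner: int        # which paraphraser (0 or 1) generated this token
    logp_new: float   # log-prob under the paraphraser being updated
    logp_old: float   # log-prob under the policy that sampled the token
    logp_ref: float   # log-prob under the frozen reference model
    advantage: float  # advantage estimate derived from the reward


def ppo_token_loss(t: TokenRecord, clip_eps: float = 0.2, kl_coef: float = 0.1) -> float:
    ratio = math.exp(t.logp_new - t.logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Clipped surrogate objective: take the more pessimistic of the two terms.
    surrogate = min(ratio * t.advantage, clipped * t.advantage)
    # Single-sample estimate of the KL divergence from the reference model.
    kl_estimate = t.logp_new - t.logp_ref
    return -surrogate + kl_coef * kl_estimate


def per_paraphraser_loss(
    tokens: List[TokenRecord], clip_eps: float = 0.2, kl_coef: float = 0.1
) -> Dict[int, float]:
    # Each paraphraser is updated only on the tokens it generated itself.
    buckets: Dict[int, List[float]] = {0: [], 1: []}
    for t in tokens:
        buckets[t.owner].append(ppo_token_loss(t, clip_eps, kl_coef))
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}
```

Routing tokens by `owner` mirrors the idea that a sentence written by \(\theta_0\) should only produce gradients for \(\theta_0\), and likewise for \(\theta_1\).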
Co-Training Framework
The encoder and decoder are updated alternately: at each step, a watermark code \(M\) is randomly sampled, watermarked text is generated, advantages are computed, and the decoder is updated using \(\ell_D\) while the encoder is updated using \(\ell_E\).
Key Experimental Results
Main Results (C4 RealNewsLike dataset, 128 tokens)
| Method | Bit Accuracy | Bit Count | AUC | TPR@FPR=1% | TPR@FPR=0.01% | Similarity |
|---|---|---|---|---|---|---|
| RemarkLLM (4bit) | 0.766 | 4.0 | 0.786 | 0.0% | 0.0% | 0.810 |
| KGW (multi-bit) | 0.638 | 4.46 | 0.833 | 22.9% | 6.3% | 0.812 |
| KTH (zero-bit) | - | - | 0.892 | 61.4% | 46.6% | 0.820 |
| Waterfall (κ=1) | - | - | 0.939 | 62.4% | 35.5% | 0.842 |
| Ours | 0.956 | 5.57 | 0.998 | 98.0% | 78.0% | 0.874 |
Robustness to Attacks
Word Substitution (20% substitution rate):
| Method | Bit Accuracy | AUC | TPR@1% |
|---|---|---|---|
| KTH (zero-bit) | - | 0.813 | 41.5% |
| Waterfall (κ=1) | - | 0.856 | 25.6% |
| Ours | 0.861 | 0.947 | 51.6% |
| Ours (AdvT) | 0.914 | 0.985 | 78.7% |
Sentence Paraphrasing Attack (LlamaPara):
| Method | Bit Accuracy | AUC | TPR@1% |
|---|---|---|---|
| Ours | 0.714 | 0.865 | 43.9% |
| Ours (AdvT) | 0.849 | 0.924 | 36.8% |
OOD Generalization
The model trained on C4 achieves AUC=0.999 on HH-RLHF and AUC=0.996 on PKU SafeRLHF, demonstrating good generalization.
Multi-run Improvement
Paraphrasing each input 5 times and keeping the best candidate raises bit accuracy above 0.99 and detection AUC above 0.9999.
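A best-of-n selection of this kind might look like the sketch below. The selection criterion (highest bit accuracy according to the decoder) is an assumption, and `encode` and `decode` are hypothetical callables standing in for the stochastic paraphraser pair and the classifier.

```python
from typing import Callable, List, Optional, Tuple


def bit_accuracy(decoded: List[int], code: List[int]) -> float:
    # Fraction of code bits recovered; bits the decoder did not produce count as wrong.
    if not code:
        return 0.0
    return sum(d == c for d, c in zip(decoded, code)) / len(code)


def best_of_n(
    text: str,
    code: List[int],
    encode: Callable[[str, List[int]], str],
    decode: Callable[[str], List[int]],
    n: int = 5,
) -> Tuple[Optional[str], float]:
    best_text, best_acc = None, -1.0
    for _ in range(n):
        candidate = encode(text, code)  # each call is assumed to sample a new paraphrase
        acc = bit_accuracy(decode(candidate), code)
        if acc > best_acc:
            best_text, best_acc = candidate, acc
    return best_text, best_acc
```

Because the embedding side has access to the decoder, it can check each candidate before releasing the text, so the extra cost is only the additional paraphrasing passes.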
Stealthiness Validation
GPT in-context learning reaches a classification accuracy of only 57%, close to random guessing (50%), suggesting that the watermark is hard to spot for an LLM-based detector that lacks access to the decoder.
Highlights & Insights
- Clever Design of PPO + Co-Training: The decoder acts as the reward model for the encoder, so the two co-evolve cooperatively, which is more stable than end-to-end training.
- High Information Density: Encodes about 1 bit per 23 tokens (128 / 5.57 ≈ 23), the highest bit count among the compared multi-bit baselines.
- Small Model, Strong Capability: TinyLlama, with only 1.1B parameters, achieves very high detection performance, and paraphrasing 5 times and keeping the best candidate improves it further.
- Adversarial Training for Enhanced Robustness: Introducing perturbations (word substitution or paraphrasing) during training significantly improves attack resistance.
Limitations & Future Work
- Sentence-level paraphrasing attacks still cause a relatively large performance drop. This is hard for any text watermark to avoid, since an attacker can always re-paraphrase the watermarked text, keeping its meaning while disturbing the embedded signal.
- Requires training specific encoder-decoder pairs, making the deployment cost higher than zero-training methods (e.g., KGW).
- Because segmentation is sentence-level, the information capacity of very short texts (1-2 sentences) is limited.
- Hyperparameters \(\lambda_s\) and \(\lambda_k\) need to be adjusted to balance detectability and fidelity.
Related Work & Insights
- Text Watermarking: Synonym substitution (Topkara et al., 2006), LSTM paraphrasing (Abdelnabi & Fritz, 2021), Gumbel softmax (RemarkLLM, Zhang et al., 2024b), invariant features (Yoo et al., 2023).
- LLM Output Watermarking: KGW (Kirchenbauer et al., 2023), KTH (Kuditipudi et al., 2023), semantic watermarking (Liu et al., 2023), Waterfall (Lau et al., 2024).
- Co-training of Paraphrasing Encoder + Classifier: The framework proposed by Xu et al. (2024), which this work extends to multi-bit scenarios.
Rating
⭐⭐⭐⭐ — Elegant method design (PPO co-training + dual-paraphraser), comprehensive experiments (robustness, OOD, stealthiness, ablation), and performance significantly leading the baselines. However, the vulnerability under sentence-level paraphrasing attacks and deployment costs remain bottlenecks for practical application.