Robust Multi-bit Text Watermark with LLM-based Paraphrasers¶

Conference: ICML 2025
arXiv: 2412.03123
Code: github.com/xiaojunxu/multi-bit-text-watermark
Area: AI Safety / Text Watermarking
Keywords: text watermark, multi-bit, paraphrasing, PPO, co-training

TL;DR¶

Proposes a multi-bit text watermarking method built on LLM paraphrasers. A pair of paraphrasers trained to behave differently (the encoder) and a text classifier (the decoder) are co-trained, with the paraphrasers optimized by PPO using the decoder as the reward model. With a 1.1B-parameter model, the method reaches a detection AUC of 0.998 in a single pass and above 0.9999 when paraphrasing is repeated 5 times, while preserving the semantics of the original text.

Background & Motivation¶

Text watermarking embeds an imperceptible signal into text, with applications in copyright protection and tracking of LLM-generated text.
Existing methods have limitations:
- Synonym substitution methods (such as NLW): Limited action space and poor robustness.
- LLM-output watermarks (such as KGW, KTH): Only applicable to text generated by the LLM itself, unable to watermark arbitrary text.
- Paraphrasing-based methods (such as RemarkLLM, Waterfall): Low multi-bit accuracy and insufficient detection AUC.
Goal: Design a universal, highly accurate, and robust multi-bit text watermarking pipeline.

Method¶

Overall Architecture¶

The pipeline is split into encoding and decoding phases:
- Encoding: The input text is paraphrased sentence by sentence with a pair of paraphrasers \((\theta_0, \theta_1)\); the current bit of the watermark code selects which paraphraser rewrites each sentence.
- Decoding: The text is segmented by sentence, and a text classifier \(\theta_d\) labels each segment as class 0 or class 1; concatenating the labels recovers the watermark code.
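The two phases can be illustrated with a minimal sketch. Everything here is a stand-in chosen for illustration: the regex sentence splitter, the toy "paraphrasers" that merely prepend a marker word, the keyword classifier, and the choice to cycle through the code when the text has more sentences than bits are all assumptions, not the trained models or the exact procedure from the paper.

```python
import re
from typing import Callable, List, Sequence

Paraphraser = Callable[[str], str]
Classifier = Callable[[str], int]


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace (assumption).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def encode(text: str, code: Sequence[int], paraphrasers: Sequence[Paraphraser]) -> str:
    """Rewrite sentence i with paraphraser number code[i]."""
    out = []
    for i, sentence in enumerate(split_sentences(text)):
        bit = code[i % len(code)]  # cycling is an assumption for long texts
        out.append(paraphrasers[bit](sentence))
    return " ".join(out)


def decode(text: str, classifier: Classifier) -> List[int]:
    """Classify every segment and concatenate the predicted bits."""
    return [classifier(segment) for segment in split_sentences(text)]


# Toy stand-ins for theta_0, theta_1 and theta_d (hypothetical, illustration only).
def toy_p0(s: str) -> str:
    return "Indeed, " + s[0].lower() + s[1:]


def toy_p1(s: str) -> str:
    return "Notably, " + s[0].lower() + s[1:]


def toy_classifier(s: str) -> int:
    return 1 if s.startswith("Notably") else 0


watermarked = encode("The cat sat. It was warm. Birds sang.", [1, 0, 1], (toy_p0, toy_p1))
recovered = decode(watermarked, toy_classifier)
```

In the real system the difference between the two paraphrasers is learned and is not a visible marker word; the sketch only shows how bits select a paraphraser and how per-segment labels are concatenated.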

Key Designs¶

Dual Paraphraser Encoding: Two LLM paraphrasers \(\theta_0, \theta_1\) are fine-tuned from the same base model and are pushed apart during training by a Jensen-Shannon (JS) divergence term that is subtracted from the supervised fine-tuning losses:

\[\ell_{init}(\theta_0, \theta_1) = \ell_{SFT}(\theta_0) + \ell_{SFT}(\theta_1) - \lambda_{JS} \cdot \text{JS}(\pi_{\theta_0}, \pi_{\theta_1})\]
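As a rough illustration of the initialization loss, the sketch below computes the JS divergence between two discrete next-token distributions and combines it with two given SFT loss values. The function names and the use of plain probability lists are assumptions for illustration; in practice the distributions would come from the two paraphrasers' logits.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def init_loss(sft0: float, sft1: float,
              p0: Sequence[float], p1: Sequence[float],
              lambda_js: float) -> float:
    """l_init = l_SFT(theta_0) + l_SFT(theta_1) - lambda_JS * JS(pi_0, pi_1)."""
    return sft0 + sft1 - lambda_js * js_divergence(p0, p1)
```

Because the JS term enters with a minus sign, minimizing the loss increases the divergence between the two paraphrasers, up to its maximum of \(\ln 2\) for distributions with disjoint support.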

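Decoder Training (Cross Entropy): For each segment \(\tilde{x}_i^w\) of the watermarked text \(x^w\), the decoder is trained to predict the bit \(M[i]\) that selected the paraphraser for that segment, where \(g_s(\cdot; \theta_d)\) denotes the classifier output: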

\[\ell_D(\theta_d; x^w, M) = \sum_{i} \text{CE}(g_s(\tilde{x}_i^w; \theta_d), M[i])\]
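A minimal sketch of this loss for a binary classifier, assuming the decoder outputs the probability of class 1 for each segment. The clamping constant `eps` and the function names are illustrative choices, not part of the source.

```python
import math
from typing import Sequence


def binary_cross_entropy(prob_one: float, bit: int, eps: float = 1e-12) -> float:
    """Cross entropy between a predicted P(class 1) and the target bit."""
    p = min(max(prob_one, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if bit == 1 else -math.log(1.0 - p)


def decoder_loss(segment_probs: Sequence[float], code: Sequence[int]) -> float:
    """Sum the per-segment cross entropy over all bits of the code M."""
    return sum(binary_cross_entropy(p, b) for p, b in zip(segment_probs, code))
```

An uninformative decoder that always outputs 0.5 pays \(\ln 2\) per bit, so the loss falls below that level only once the classifier starts to separate the two paraphrasers.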

Encoder Training (PPO + Reward Model):

The decoder acts as the reward model, and the watermark reward is the number of successfully decoded bits:

\[r_w(x^w, M) = \sum_{i} \mathbb{1}\{D(x^w)[i] = M[i]\}\]

The total reward adds a semantic similarity term \(r_s\) between the watermarked text \(x^w\) and the original text \(x^o\), weighted against the watermark reward:

\[r(x^w, x^o, M) = \lambda_w \cdot r_w + \lambda_s \cdot r_s(x^w, x^o)\]
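The two reward formulas translate almost directly into code. In this sketch the similarity score is passed in as a number, since the source does not specify here how \(r_s\) is computed; the function names are illustrative.

```python
from typing import Sequence


def watermark_reward(decoded: Sequence[int], code: Sequence[int]) -> int:
    """r_w: number of positions where the decoded bit equals the embedded bit."""
    return sum(int(d == m) for d, m in zip(decoded, code))


def total_reward(decoded: Sequence[int], code: Sequence[int],
                 similarity: float, lambda_w: float, lambda_s: float) -> float:
    """r = lambda_w * r_w + lambda_s * r_s."""
    return lambda_w * watermark_reward(decoded, code) + lambda_s * similarity
```

For example, decoding `[1, 0, 1, 1]` against the code `[1, 0, 0, 1]` gives a watermark reward of 3, and the two weights decide how much a drop in similarity can be traded for one more correctly decoded bit.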

The PPO loss updates each paraphraser only on the tokens it generated, and includes a KL divergence regularization that keeps both paraphrasers close to the reference model.
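The sketch below illustrates the two ideas in this step under stated assumptions. The per-token KL penalty with the sequence reward added on the last token is the formulation commonly used in RLHF-style PPO and is assumed here, not confirmed by the source; `beta`, the function names, and the token counts are hypothetical.

```python
from typing import Dict, List, Sequence


def token_owners(sentence_token_counts: Sequence[int], code: Sequence[int]) -> List[int]:
    """Label each generated token with the paraphraser (0 or 1) that wrote it."""
    owners: List[int] = []
    for count, bit in zip(sentence_token_counts, code):
        owners.extend([bit] * count)
    return owners


def kl_shaped_rewards(logp: Sequence[float], logp_ref: Sequence[float],
                      final_reward: float, beta: float) -> List[float]:
    """Per-token reward: KL penalty everywhere, sequence reward on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp, logp_ref)]
    rewards[-1] += final_reward
    return rewards


def split_by_owner(values: Sequence[float], owners: Sequence[int]) -> Dict[int, List[float]]:
    """Route per-token quantities to the paraphraser that produced each token."""
    return {b: [v for v, o in zip(values, owners) if o == b] for b in (0, 1)}
```

With this routing, paraphraser \(\theta_0\) receives gradient signal only from sentences encoded with bit 0, and \(\theta_1\) only from sentences encoded with bit 1.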

Co-Training Framework¶

The encoder and decoder are updated alternately. At each step, a watermark code \(M\) is randomly sampled, watermarked text is generated with the paraphrasers, and rewards and advantages are computed; the decoder is then updated with the cross-entropy loss \(\ell_D\), and the encoder is updated with the PPO loss, denoted \(\ell_E\).
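The control flow of one co-training run can be sketched as below. All model operations are injected as callables, so the skeleton only shows the ordering of the steps; the callable names, the 0.5 decision threshold, and folding advantage estimation into `update_encoder` are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def co_train(texts: Sequence[str], steps: int, code_len: int,
             generate: Callable[[str, List[int]], str],
             decode_probs: Callable[[str], List[float]],
             similarity: Callable[[str, str], float],
             update_decoder: Callable[[str, List[int]], None],
             update_encoder: Callable[[str, str, List[int], float], None],
             lambda_w: float = 1.0, lambda_s: float = 1.0,
             seed: int = 0) -> List[float]:
    """Alternate decoder and encoder updates; return the reward at each step."""
    rng = random.Random(seed)
    history = []
    for step in range(steps):
        x_o = texts[step % len(texts)]
        code = [rng.randint(0, 1) for _ in range(code_len)]  # sample M
        x_w = generate(x_o, code)                            # watermarked text
        decoded = [int(p >= 0.5) for p in decode_probs(x_w)]
        r_w = sum(int(d == m) for d, m in zip(decoded, code))
        reward = lambda_w * r_w + lambda_s * similarity(x_w, x_o)
        update_decoder(x_w, code)                # cross-entropy step on l_D
        update_encoder(x_o, x_w, code, reward)   # PPO step (advantages inside)
        history.append(reward)
    return history
```

The decoder is updated on text produced by the current paraphrasers, and the paraphrasers are rewarded by the current decoder, which is what couples the two training signals.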

Key Experimental Results¶

Main Results (C4 RealNewsLike dataset, 128 tokens)¶

Method	Bit Accuracy	Bit Count	AUC	TPR@FPR=1%	TPR@FPR=0.01%	Similarity
RemarkLLM (4bit)	0.766	4.0	0.786	0.0%	0.0%	0.810
KGW (multi-bit)	0.638	4.46	0.833	22.9%	6.3%	0.812
KTH (zero-bit)	-	-	0.892	61.4%	46.6%	0.820
Waterfall (κ=1)	-	-	0.939	62.4%	35.5%	0.842
Ours	0.956	5.57	0.998	98.0%	78.0%	0.874
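For readers unfamiliar with the detection metrics in the table, the following sketch shows how AUC and TPR at a fixed FPR are computed from detection scores of watermarked (positive) and non-watermarked (negative) texts. It is a generic metric implementation, not the paper's evaluation code, and it makes no assumption about how the detection score itself is derived from the decoder.

```python
import math
from typing import Sequence


def tpr_at_fpr(pos_scores: Sequence[float], neg_scores: Sequence[float],
               target_fpr: float) -> float:
    """Fraction of positives flagged when at most target_fpr of negatives are."""
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of negatives allowed above the threshold (small epsilon guards
    # against floating-point error in the product).
    k = math.floor(target_fpr * len(neg_sorted) + 1e-9)
    threshold = neg_sorted[k] if k < len(neg_sorted) else float("-inf")
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

A strict FPR target such as 0.01% requires a large pool of negative texts: with fewer than 10,000 negatives the threshold sits at the highest negative score, so no false positive is tolerated at all.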

Robustness to Attacks¶

Word Substitution (20% substitution rate):

Method	Bit Accuracy	AUC	TPR@FPR=1%
KTH (zero-bit)	-	0.813	41.5%
Waterfall (κ=1)	-	0.856	25.6%
Ours	0.861	0.947	51.6%
Ours (AdvT)	0.914	0.985	78.7%
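A word substitution attack at a given rate can be simulated roughly as follows. This sketch replaces the chosen words with random entries from a supplied vocabulary, which is a simplification assumed for illustration; the source does not state how replacement words are selected, and a realistic attack would likely use synonyms.

```python
import random
from typing import Sequence


def word_substitution_attack(text: str, rate: float, vocab: Sequence[str],
                             seed: int = 0) -> str:
    """Replace a fraction `rate` of whitespace-separated words with vocab words."""
    rng = random.Random(seed)
    words = text.split()
    n_sub = round(rate * len(words))
    # Sample distinct positions so exactly n_sub words are touched.
    for i in rng.sample(range(len(words)), n_sub):
        words[i] = rng.choice(vocab)
    return " ".join(words)
```

Running the decoder on the attacked text and comparing the recovered bits with the embedded code gives the bit accuracy under attack reported in the table above.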

Sentence Paraphrasing Attack (LlamaPara):

Method	Bit Accuracy	AUC	TPR@FPR=1%
Ours	0.714	0.865	43.9%
Ours (AdvT)	0.849	0.924	36.8%

OOD Generalization¶

The model trained on C4 achieves AUC=0.999 on HH-RLHF and AUC=0.996 on PKU SafeRLHF, demonstrating good generalization.

Multi-run Improvement¶

Paraphrasing the same input 5 times and keeping the best candidate raises bit accuracy above 0.99 and detection AUC above 0.9999.
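A best-of-N selection of this kind might look like the sketch below. The selection criterion, the number of correctly decoded bits, is an assumption, since the source only says the best candidate is chosen; `paraphrase_once` and `decode` are hypothetical callables standing in for the sampling encoder and the decoder.

```python
from typing import Callable, List, Sequence, Tuple


def best_of_n(text: str, code: Sequence[int],
              paraphrase_once: Callable[[str, Sequence[int], int], str],
              decode: Callable[[str], List[int]],
              n: int = 5) -> Tuple[str, int]:
    """Generate n watermarked candidates and keep the one that decodes best."""
    best_text, best_matches = "", -1
    for attempt in range(n):
        candidate = paraphrase_once(text, code, attempt)
        matches = sum(int(d == m) for d, m in zip(decode(candidate), code))
        if matches > best_matches:  # strict: the earliest best candidate wins
            best_text, best_matches = candidate, matches
    return best_text, best_matches
```

The encoder side can afford this check because it holds both the code and the decoder, so extra compute at embedding time is traded for higher decoding accuracy.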

Stealthiness Validation¶

A GPT classifier using in-context learning reaches only 57% accuracy at telling watermarked from non-watermarked text, close to random guessing (50%), which suggests that the watermark is hard to detect without access to the decoder.

Highlights & Insights¶

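Clever Design of PPO + Co-Training: The decoder acts as the reward model for the encoder, so the two components co-evolve cooperatively: the paraphrasers learn to produce text the decoder can separate, and the decoder learns to separate what the paraphrasers currently produce.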
High Information Density: Encodes 1 bit per 23 tokens (128/5.57), outperforming all baselines.
Small Model, Strong Capability: TinyLlama, with only 1.1B parameters, achieves a detection AUC of 0.998, and running the paraphrasing 5 times and keeping the best candidate improves this further.
Adversarial Training for Enhanced Robustness: Introducing perturbations (word substitution or paraphrasing) during training significantly improves attack resistance.
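The information-density figure quoted above can be checked against the main results table, assuming every method is evaluated on the same 128-token texts. Tokens per embedded bit for Ours, KGW (multi-bit), and RemarkLLM (4bit), respectively:

\[\frac{128}{5.57} \approx 22.98, \qquad \frac{128}{4.46} \approx 28.70, \qquad \frac{128}{4.0} = 32\]

Fewer tokens per bit means higher density, so the proposed method leads the two multi-bit baselines on this measure as well as on bit accuracy.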

Limitations & Future Work¶

Sentence-level paraphrasing attacks still cause a relatively large performance drop. This weakness is shared by text watermarking methods in general, because an attacker can always rewrite the watermarked text while keeping its meaning.
Requires training specific encoder-decoder pairs, making the deployment cost higher than zero-training methods (e.g., KGW).
Because the method segments text at the sentence level, the information capacity for ultra-short texts (1-2 sentences) is limited.
The hyperparameters \(\lambda_s\) (similarity reward weight) and \(\lambda_k\) (KL regularization weight) need to be tuned to balance detectability and fidelity.

Related Work¶

Text Watermarking: Synonym substitution (Topkara et al., 2006), LSTM paraphrasing (Abdelnabi & Fritz, 2021), Gumbel softmax (RemarkLLM, Zhang et al., 2024b), invariant features (Yoo et al., 2023).
LLM Output Watermarking: KGW (Kirchenbauer et al., 2023), KTH (Kuditipudi et al., 2023), semantic watermarking (Liu et al., 2023), Waterfall (Lau et al., 2024).
Co-training of Paraphrasing Encoder + Classifier: The framework proposed by Xu et al. (2024), which this work extends to multi-bit scenarios.

Rating¶

⭐⭐⭐⭐: Elegant method design (PPO co-training + dual-paraphraser), comprehensive experiments (robustness, OOD, stealthiness, ablation), and performance well ahead of the baselines. However, the vulnerability to sentence-level paraphrasing attacks and the deployment cost remain bottlenecks for practical application.