Robust Multi-bit Text Watermark with LLM-based Paraphrasers

Conference: ICML 2025
arXiv: 2412.03123
Code: github.com/xiaojunxu/multi-bit-text-watermark
Area: AI Safety / Text Watermarking
Keywords: text watermark, multi-bit, paraphrasing, PPO, co-training

TL;DR

Proposes a multi-bit text watermarking method built on LLM paraphrasers. A pair of behaviorally differentiated paraphrasers (the encoder) is co-trained with a decoding classifier (the decoder), and the encoder is optimized with PPO reinforcement learning using the decoder as its reward model. With a 1.1B-parameter model, the method reaches a detection AUC above 0.9999 when paraphrasing is repeated five times, while preserving the semantics of the text.

Background & Motivation

  • Text watermarking embeds imperceptible signals into text, with applications in copyright protection and tracking of LLM-generated text.
  • Existing methods have limitations:
    • Synonym substitution methods (such as NLW): Limited action space and poor robustness.
    • LLM-output watermarks (such as KGW, KTH): Only applicable to text generated by the LLM itself, unable to watermark arbitrary text.
    • Paraphrasing-based methods (such as RemarkLLM, Waterfall): Low multi-bit accuracy and insufficient detection AUC.
  • Goal: Design a universal, highly accurate, and robust multi-bit text watermarking pipeline.

Method

Overall Architecture

The pipeline is split into encoding and decoding phases:

  • Encoding: Uses a pair of paraphrasers \((\theta_0, \theta_1)\) to paraphrase the input text sentence by sentence, selecting the paraphraser for each sentence according to the current bit of the watermark code.
  • Decoding: Segments the text (by sentence) and uses a text classifier \(\theta_d\) to determine whether each segment belongs to class 0 or class 1, then concatenates the outputs to recover the watermark.
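The two phases can be illustrated with a minimal sketch. The functions `paraphrase_0`, `paraphrase_1`, and `classify` are hypothetical stand-ins (lower-casing, upper-casing, and a case check) for the fine-tuned paraphrasers and the classifier; the naive sentence splitter and the one-bit-per-sentence assumption are also illustrative choices, not details taken from the paper.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical stand-ins for the two fine-tuned LLM paraphrasers.
def paraphrase_0(sentence):
    return sentence.lower()

def paraphrase_1(sentence):
    return sentence.upper()

def classify(segment):
    # Hypothetical stand-in for the decoder: class 1 if the segment is upper-case.
    return 1 if segment.isupper() else 0

def encode(text, message):
    """Rewrite sentence i with the paraphraser selected by bit i of the message."""
    paraphrasers = (paraphrase_0, paraphrase_1)
    sentences = split_sentences(text)
    if len(sentences) != len(message):
        raise ValueError("this sketch assumes exactly one bit per sentence")
    return " ".join(paraphrasers[bit](s) for s, bit in zip(sentences, message))

def decode(watermarked):
    """Classify each sentence and concatenate the predicted bits."""
    return [classify(s) for s in split_sentences(watermarked)]

message = [1, 0, 1]
watermarked = encode("The cat sat. It was warm. Nobody moved.", message)
recovered = decode(watermarked)
```

In this toy setting `recovered` equals `message`; with real paraphrasers the two classes differ in subtle stylistic features rather than letter case.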

Key Designs

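Dual Paraphraser Encoding: Two LLM paraphrasers \(\theta_0, \theta_1\) are fine-tuned from the same base model. An initialization loss combines the supervised fine-tuning (SFT) loss of each paraphraser with a negatively weighted JS divergence term, so minimizing the loss pushes the two output distributions apart: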

\[\ell_{init}(\theta_0, \theta_1) = \ell_{SFT}(\theta_0) + \ell_{SFT}(\theta_1) - \lambda_{JS} \cdot \text{JS}(\pi_{\theta_0}, \pi_{\theta_1})\]
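As a rough numerical illustration of this loss, the sketch below computes the JS divergence between two discrete distributions and plugs it into \(\ell_{init}\). It assumes `p` and `q` are next-token distributions of the two paraphrasers at a single position and that the SFT losses are already available as scalars; how the paper aggregates over positions is not specified here.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions; terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: average KL of p and q to their mixture m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def init_loss(sft_0, sft_1, p, q, lambda_js):
    # SFT terms are minimized; the JS term enters with a minus sign,
    # so a larger divergence between the paraphrasers lowers the loss.
    return sft_0 + sft_1 - lambda_js * js_divergence(p, q)

identical = js_divergence([0.5, 0.5], [0.5, 0.5])   # no divergence
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])    # maximal divergence, log 2
```

The JS divergence is bounded above by \(\log 2\) (in nats), which keeps the differentiation term from dominating the SFT terms for moderate \(\lambda_{JS}\).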

Decoder Training (Cross Entropy):

\[\ell_D(\theta_d; x^w, M) = \sum_{i} \text{CE}(g_s(\tilde{x}_i^w; \theta_d), M[i])\]
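A minimal sketch of this loss, assuming the classifier outputs one probability of class 1 per segment (the name `probs_class1` is hypothetical), sums the binary cross-entropy over segments:

```python
import math

def decoder_loss(probs_class1, message):
    """Sum of per-segment cross-entropy between predicted class and message bit."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, bit in zip(probs_class1, message):
        p_true = p if bit == 1 else 1.0 - p  # probability assigned to the correct bit
        total += -math.log(max(p_true, eps))
    return total

confident_right = decoder_loss([0.9, 0.1], [1, 0])
confident_wrong = decoder_loss([0.1, 0.9], [1, 0])
```

A decoder that is confidently right yields a small loss, and one that is confidently wrong yields a large one.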

Encoder Training (PPO + Reward Model):

The decoder acts as the reward model, and the watermark reward is the number of successfully decoded bits:

\[r_w(x^w, M) = \sum_{i} \mathbb{1}\{D(x^w)[i] = M[i]\}\]

The total reward incorporates a semantic similarity regularization:

\[r(x^w, x^o, M) = \lambda_w \cdot r_w + \lambda_s \cdot r_s(x^w, x^o)\]
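The two reward formulas translate directly into code. In the sketch below, `similarity` is assumed to be a scalar score from some semantic similarity model, and the default weights are placeholders rather than values from the paper.

```python
def watermark_reward(decoded, message):
    # r_w: number of positions where the decoded bit matches the message bit.
    return sum(int(d == m) for d, m in zip(decoded, message))

def total_reward(decoded, message, similarity, lambda_w=1.0, lambda_s=1.0):
    # r = lambda_w * r_w + lambda_s * r_s, with r_s supplied as `similarity`.
    return lambda_w * watermark_reward(decoded, message) + lambda_s * similarity

r_w = watermark_reward([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 bits match
r = total_reward([1, 0, 1, 1], [1, 0, 0, 1], similarity=0.8, lambda_w=0.5, lambda_s=2.0)
```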

The PPO loss updates each of the two paraphrasers on their respective generated tokens, while including a KL divergence regularization to prevent deviation from the reference model.
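For readers unfamiliar with PPO, a per-token sketch of the standard clipped surrogate objective with an added KL penalty is shown below. The clipping range, the KL coefficient, and the choice to add the KL term to the loss (rather than to the reward) are assumptions for illustration; the summary does not state how the paper implements them.

```python
import math

def ppo_token_loss(logp_new, logp_old, logp_ref, advantage, clip_eps=0.2, kl_coef=0.1):
    """Clipped PPO loss for one generated token, plus a KL penalty to the reference model."""
    ratio = math.exp(logp_new - logp_old)               # pi_new / pi_old
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    kl_estimate = logp_new - logp_ref                   # single-sample estimate of KL(new || ref)
    return -surrogate + kl_coef * kl_estimate

# Each paraphraser would be updated only on tokens it generated itself.
unchanged_policy = ppo_token_loss(0.0, 0.0, 0.0, advantage=2.0)
clipped_update = ppo_token_loss(math.log(2.0), 0.0, math.log(2.0), advantage=1.0)
```

With a positive advantage, the clipping caps the benefit of raising a token's probability beyond \(1 + \epsilon\) times its old value, which limits the size of each policy update.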

Co-Training Framework

The encoder and decoder are updated alternately. At each step, a watermark code \(M\) is randomly sampled, watermarked text is generated, and advantages are computed; the decoder is then updated with the cross-entropy loss \(\ell_D\), and the encoder with the PPO loss (denoted \(\ell_E\)).
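The alternating structure can be seen on a deliberately tiny toy problem. In the sketch below, each "paraphraser" emits one of two symbols, the "decoder" is a single logit, and both are updated with exact expected gradients. This swaps PPO for plain gradient ascent on the expected reward and replaces sampling with closed-form expectations, so it illustrates only the co-training loop, not the paper's training procedure; all names and hyperparameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_bit_accuracy(theta, w):
    # theta[k]: logit that paraphraser k emits symbol "A".
    # w: decoder logit for P(bit = 1 | "A"); P(bit = 1 | "B") is 1 - sigmoid(w).
    q = sigmoid(w)
    p0, p1 = sigmoid(theta[0]), sigmoid(theta[1])
    acc_bit1 = p1 * q + (1 - p1) * (1 - q)
    acc_bit0 = p0 * (1 - q) + (1 - p0) * q
    return 0.5 * (acc_bit0 + acc_bit1)  # message bits are uniform

def co_train(steps=500, lr=1.0):
    theta = [-0.5, 0.5]  # slightly differentiated start, in the spirit of the JS-based initialization
    w = 0.0
    for _ in range(steps):
        q = sigmoid(w)
        p0, p1 = sigmoid(theta[0]), sigmoid(theta[1])
        # Decoder step: gradient descent on the expected cross-entropy.
        grad_w = -0.5 * (p1 - p0 + 1.0 - 2.0 * q)
        w -= lr * grad_w
        # Encoder step: gradient ascent on the expected reward, decoder held fixed.
        q = sigmoid(w)
        theta[1] += lr * 0.5 * (2.0 * q - 1.0) * p1 * (1 - p1)
        theta[0] += lr * 0.5 * (1.0 - 2.0 * q) * p0 * (1 - p0)
    return theta, w

start_accuracy = expected_bit_accuracy([-0.5, 0.5], 0.0)
theta, w = co_train()
final_accuracy = expected_bit_accuracy(theta, w)
```

The untrained decoder starts at chance level. As it learns to associate symbol "A" with bit 1, the reward pushes the two paraphrasers toward opposite symbols, which in turn makes the decoder's task easier.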

Key Experimental Results

Main Results (C4 RealNewsLike dataset, 128 tokens)

| Method | Bit Accuracy | Bit Count | AUC | TPR@FPR=1% | TPR@FPR=0.01% | Similarity |
|---|---|---|---|---|---|---|
| RemarkLLM (4bit) | 0.766 | 4.0 | 0.786 | 0.0% | 0.0% | 0.810 |
| KGW (multi-bit) | 0.638 | 4.46 | 0.833 | 22.9% | 6.3% | 0.812 |
| KTH (zero-bit) | - | - | 0.892 | 61.4% | 46.6% | 0.820 |
| Waterfall (κ=1) | - | - | 0.939 | 62.4% | 35.5% | 0.842 |
| Ours | 0.956 | 5.57 | 0.998 | 98.0% | 78.0% | 0.874 |

Robustness to Attacks

Word Substitution (20% substitution rate):

| Method | Bit Accuracy | AUC | TPR@FPR=1% |
|---|---|---|---|
| KTH (zero-bit) | - | 0.813 | 41.5% |
| Waterfall (κ=1) | - | 0.856 | 25.6% |
| Ours | 0.861 | 0.947 | 51.6% |
| Ours (AdvT) | 0.914 | 0.985 | 78.7% |

Sentence Paraphrasing Attack (LlamaPara):

| Method | Bit Accuracy | AUC | TPR@FPR=1% |
|---|---|---|---|
| Ours | 0.714 | 0.865 | 43.9% |
| Ours (AdvT) | 0.849 | 0.924 | 36.8% |

OOD Generalization

The model trained on C4 achieves AUC=0.999 on HH-RLHF and AUC=0.996 on PKU SafeRLHF, demonstrating good generalization.

Multi-run Improvement

Repeating paraphrasing 5 times and choosing the best: bit accuracy > 0.99, detection AUC > 0.9999.
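A best-of-n selection of this kind can be sketched as follows, assuming "best" means the candidate whose decoded bits agree most with the intended message. The `decode` argument is any function mapping text to a list of bits; the digit-string decoder in the usage lines is a toy placeholder.

```python
def bit_accuracy(decoded, message):
    # Fraction of message bits that the decoder recovers correctly.
    return sum(int(d == m) for d, m in zip(decoded, message)) / len(message)

def best_of_n(candidates, decode, message):
    # Keep the watermarked candidate that the decoder reads back most accurately.
    return max(candidates, key=lambda text: bit_accuracy(decode(text), message))

# Toy usage: each "text" is a string of digits and the decoder reads them as bits.
toy_decode = lambda text: [int(c) for c in text]
chosen = best_of_n(["0101", "1101", "1001"], toy_decode, [1, 1, 0, 1])
```

Since the encoder owns both the message and the decoder, this selection needs no extra information at watermarking time; the cost is n paraphrasing passes instead of one.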

Stealthiness Validation

A GPT classifier using in-context learning reaches only 57% accuracy at distinguishing watermarked from non-watermarked text, close to random guessing (50%). This suggests that the watermark is hard to spot, at least for this kind of detector.

Highlights & Insights

  1. Clever Design of PPO + Co-Training: The decoder acts as the reward model for the encoder, so the two components co-evolve cooperatively; the summary credits this with being more stable than end-to-end training.
  2. High Information Density: Encodes 1 bit per 23 tokens (128/5.57), outperforming all baselines.
  3. Small Model, Strong Capability: TinyLlama, with only 1.1B parameters, achieves very high detection performance, and paraphrasing can be run 5 times in parallel (keeping the best output) for further improvement.
  4. Adversarial Training for Enhanced Robustness: Introducing perturbations (word substitution or paraphrasing) during training significantly improves attack resistance.

Limitations & Future Work

  • Sentence-level paraphrasing attacks still cause a relatively large performance drop. This is an inherent limitation of text watermarking methods: an attacker can always paraphrase the watermarked text again while preserving its meaning.
  • Requires training specific encoder-decoder pairs, making the deployment cost higher than zero-training methods (e.g., KGW).
  • The method currently relies on sentence-level segmentation, so the information capacity for very short texts (1-2 sentences) is limited.
  • Hyperparameters \(\lambda_s\) and \(\lambda_k\) need to be adjusted to balance detectability and fidelity.

Related Work

  • Text Watermarking: Synonym substitution (Topkara et al., 2006), LSTM paraphrasing (Abdelnabi & Fritz, 2021), Gumbel softmax (RemarkLLM, Zhang et al., 2024b), invariant features (Yoo et al., 2023).
  • LLM Output Watermarking: KGW (Kirchenbauer et al., 2023), KTH (Kuditipudi et al., 2023), semantic watermarking (Liu et al., 2023), Waterfall (Lau et al., 2024).
  • Co-training of Paraphrasing Encoder + Classifier: The framework proposed by Xu et al. (2024), which this work extends to multi-bit scenarios.

Rating

⭐⭐⭐⭐ — Elegant method design (PPO co-training + dual-paraphraser), comprehensive experiments (robustness, OOD, stealthiness, ablation), and performance significantly leading the baselines. However, the vulnerability under sentence-level paraphrasing attacks and deployment costs remain bottlenecks for practical application.