Efficiently Identifying Watermarked Segments in Mixed-Source Texts¶
Conference: ACL 2025
arXiv: 2410.03600
Code: https://github.com/XuandongZhao/llm-watermark-location
Area: AI Safety / Text Watermark Detection
Keywords: watermark detection, mixed-source text, geometric cover, online learning, watermark localization
TL;DR¶
Proposes two efficient methods (Geometric Cover Detector and Adaptive Online Locator) to detect and precisely locate watermarked segments in long mixed-source texts, reducing the time complexity from \(O(n^2)\) to \(O(n \log n)\), significantly outperforming baselines across three mainstream watermarking techniques.
Background & Motivation¶
- Watermarking techniques for LLM-generated texts are increasingly used to detect synthetic text, preventing abuses such as fake news and academic misconduct.
- Existing watermark detection methods primarily focus on document-level classification (whether it contains a watermark), but overlook a common and important scenario: identifying which specific segments are generated by an LLM within longer, mixed-source documents.
- Practical application scenario: Malicious actors might use LLMs to modify certain paragraphs of news articles to spread misinformation, requiring functionality similar to Turnitin plagiarism detection systems to locate these segments.
- Key Challenge:
- The watermark signal is "diluted" by the unwatermarked parts in long texts, making document-level detection methods ineffective.
- Brute-force searching all possible subsequence intervals requires an \(O(n^2)\) time complexity, which is computationally prohibitive.
- The watermark score of a single token is highly noisy, making direct threshold-based determination unreliable.
Method¶
Overall Architecture¶
The paper proposes two complementary methods: 1. Geometric Cover Detector (GCD): Determines whether the document contains watermarked text (classification task). 2. Adaptive Online Locator (AOL): Precisely locates the start and end positions of the watermarked text (localization task).
Both methods are based on the Geometric Cover technique, with a time complexity of \(O(n \log n)\).
Key Designs¶
Three Target Watermarking Schemes¶
- KGW-Watermark: Splits the vocabulary into green/red groups based on prefix hashes, boosting the logits of green tokens during generation.
- Unigram-Watermark: Fixes the green/red lists independently of prefixes, offering stronger robustness.
- Gumbel-Watermark: Performs deterministic sampling based on Gumbel trick, achieving distortion-free watermarking.
Geometric Cover Detector (GCD)¶
- Utilizes the Geometric Cover (GC) technique to partition long text into multi-scale interval sets.
- GC Definition: \(\mathcal{I} = \bigcup_{k \in \mathbb{N} \cup 0} \mathcal{I}^{(k)}\), where \(\mathcal{I}^{(k)}\) is the set of all contiguous intervals of length \(2^k\).
- Each token belongs to \(\lfloor \log n \rfloor + 1\) different intervals, totaling \(O(n)\) intervals.
- Key guarantee (Daniely et al., 2015, Lemma 5): For any unknown watermarked interval, there always exists an interval in the GC that is completely contained within it and has at least one-quarter of its length.
- Performs watermarking detection independently for each interval; if any interval is detected, the entire document is classified as containing watermarked text.
- Controls the Family-Wise Error Rate (FWER) via Bonferroni correction: segment-level FPR \(\tau \rightarrow\) document-level FWER \(\le n\tau\).
- In practice, it starts from higher-order intervals (e.g., \(\mathcal{I}^{(5)}\), intervals with 32+ tokens) to avoid unreliable judgments from overly short intervals.
Adaptive Online Locator (AOL)¶
- Formulates the watermark localization problem as an online sequence denoising problem (online denoising / nonparametric regression).
- Core Idea: The watermark score of each token is a noisy observation of the expected score, requiring estimation of the expected score sequence.
- Red-Green Watermark: \(s_t = \mathbf{1}(y_t \in \text{Green})\), with the expected value in the watermarked segment \(> \gamma\), and in the non-watermarked segment \(= \gamma\).
- Gumbel Watermark: \(s_t = \log(1/(1-r_{y_t}))\), with the expected value in the watermarked segment \(> 1\), and in the non-watermarked segment \(= 1\).
- Employs the Aligator algorithm (Baby et al., 2021) for denoising.
- Aligator internally utilizes a Geometric Cover structure.
- Provides a minimax optimal estimation guarantee: competing with an oracle that knows the watermarked segment locations in terms of mean squared error.
- Time complexity is \(O(n \log n)\).
- Circular Aligator: Designed to address the boundary effects of online learning.
- Treats the text as a circular buffer, randomly selecting a starting point to traverse the entire sequence each time.
- Performs \(m\) iterations with different starting points, and finally takes the average of all predictions for each token.
- Effectively eliminates prediction inaccuracies at the start and end boundaries.
- Finally, applies a threshold \(\zeta\) to the denoised average scores; tokens with scores exceeding the threshold are identified as watermarked text.
Loss & Training¶
This paper presents a detection method and does not involve model training. The key aspects are: - The threshold for GCD is determined by the FPR calibration function \(F\). - The threshold \(\zeta\) for AOL is set according to the watermarking scheme (e.g., \(\zeta = 1.3\) in Gumbel watermarking). - The number of iterations for Circular Aligator is \(m = 10\) random starting points.
Key Experimental Results¶
Experimental Setup¶
- Datasets: C4 and Arxiv (real text serves as the non-watermarked part, with LLM-generated watermarked text embedded within).
- Models: LLaMA-7B and Mistral-7B.
- Watermarking Schemes: KGW-Watermark, Unigram-Watermark, Gumbel-Watermark.
- Evaluation Metrics: TPR (classification), IoU (localization).
Main Results — Classification Task (C4 + LLaMA-7B)¶
| Method | KGW TPR | Unigram TPR | Gumbel TPR |
|---|---|---|---|
| Vanilla (Document-level) | 0.602-0.692 | 0.006-0.058 | 0.650-0.918 |
| GCD | 0.912-0.934 | 0.874-0.958 | 1.000 |
- The Vanilla method almost completely fails on Unigram watermarking (TPR only 0.006) because the watermark signal is diluted by the long text.
- GCD significantly outperforms the Vanilla method across all watermarking schemes.
- Gumbel watermarking is the easiest to detect, with GCD achieving 100% TPR.
- The improvement for Unigram watermarking is the most significant: rising from near-random (0.006) to 0.874+.
Main Results — Localization Task¶
- AOL achieves an average IoU \(> 0.55\) across the three watermarking schemes, performing far exceeding baseline methods.
- The Circular initialization strategy (\(m=10\) random starting points) significantly improves localization accuracy at the boundaries.
- A single Aligator pass exhibits noticeable boundary artifacts, which are effectively eliminated by multiple circular passes.
Key Findings¶
- Document-level detection severely fails in mixed-source scenarios: Particularly for Unigram watermarking, where all signals are overwhelmed by non-watermarked text.
- Multi-scale analysis is key: The multi-scale interval partitioning of GC ensures that watermarked segments of varying lengths can be effectively captured.
- Online denoising is superior to simple thresholding: Individual token scores are highly noisy and must be denoised through appropriate window averaging.
- The Circular strategy is crucial for eliminating boundary effects: A single linear pass yields inaccurate predictions for the start and end positions.
- The method is highly generalizable across watermarking schemes: The same framework is applicable to three watermarking schemes with completely different mechanisms.
Highlights & Insights¶
- The problem formulation is highly practical: Advancing from "whether the entire document contains a watermark" to "which paragraphs contain a watermark" directly addresses real-world demands such as plagiarism detection.
- Elegant algorithmic design: Borrowing mature tools from Geometric Cover and online learning fields reduces the time complexity from \(O(n^2)\) to \(O(n \log n)\).
- Clever design of Circular Aligator: It simply but effectively solves the boundary effect problem inherent in online algorithms.
- Solid theoretical guarantees: Aligator's estimation error has a clear theoretical upper bound and possesses strong adaptivity at the segment level.
- Highly versatile framework: Does not rely on a specific watermarking scheme, making it extensible to future watermarking technologies.
Limitations & Future Work¶
- Assumption of contiguous watermarked segments: In practice, there may be multiple non-contiguous watermarked segments, and the current framework might require post-processing for segmentation.
- Threshold \(\zeta\) requires manual setting based on the watermarking scheme: The optimal thresholds vary across different watermarking schemes.
- Robustness under watermark attacks (e.g., paraphrasing) is not considered: In real-world scenarios, the watermarked text might be paraphrased.
- Evaluated only on 7B models: The effectiveness on watermarked texts generated by larger models (e.g., 70B, GPT-4 level) has not been verified.
- FPR control uses a conservative union bound: Tighter control methods might exist.
- Localization of multiple scattered watermarked segments is not discussed: Performance when there are multiple dispersed watermarked segments in a document is unaddressed.
Related Work & Insights¶
- Watermarking Schemes: KGW (Kirchenbauer et al., 2023), Unigram (Zhao et al., 2023), Gumbel (Aaronson, 2023, Kuditipudi et al., 2023).
- Partial Watermark Detection: Aaronson's plausibility score (\(O(n^{3/2})\)), Christ et al.'s Substring Completeness (\(O(n^2)\)), WinMax (\(O(n^2)\)).
- Online Learning: Strong adaptivity guarantees of Aligator (Baby et al., 2021).
- Insights: The framework can be extended to (1) localization and detection of code watermarks; (2) integration with plagiarism detection systems; (3) fine-grained annotation of AI-generated content; (4) adversarial evaluation of watermark robustness.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First to achieve efficient watermarked segment localization in mixed-source texts.
- Technical Depth: ⭐⭐⭐⭐⭐ — Elegantly combines GC and online learning theory with solid theoretical guarantees.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three watermarking schemes and two datasets, though the model size is limited.
- Practical Value: ⭐⭐⭐⭐⭐ — Directly addresses the needs of plagiarism detection and AI-generated content regulation.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear logic, exquisite illustrations, and excellent alignment between theory and experiments.
- Overall Rating: 9.0/10