Efficiently Identifying Watermarked Segments in Mixed-Source Texts¶

Conference: ACL 2025
arXiv: 2410.03600
Code: https://github.com/XuandongZhao/llm-watermark-location
Area: AI Safety / Text Watermark Detection
Keywords: watermark detection, mixed-source text, geometric cover, online learning, watermark localization

TL;DR¶

Proposes two efficient methods (Geometric Cover Detector and Adaptive Online Locator) to detect and precisely locate watermarked segments in long mixed-source texts, reducing the time complexity from \(O(n^2)\) to \(O(n \log n)\), significantly outperforming baselines across three mainstream watermarking techniques.

Background & Motivation¶

Watermarking techniques for LLM-generated texts are increasingly used to detect synthetic text, preventing abuses such as fake news and academic misconduct.
Existing watermark detection methods primarily focus on document-level classification (whether it contains a watermark), but overlook a common and important scenario: identifying which specific segments are generated by an LLM within longer, mixed-source documents.
Practical application scenario: Malicious actors might use LLMs to modify certain paragraphs of news articles to spread misinformation, requiring functionality similar to Turnitin plagiarism detection systems to locate these segments.
Key Challenge:
- The watermark signal is "diluted" by the unwatermarked parts in long texts, making document-level detection methods ineffective.
- Brute-force searching all possible subsequence intervals requires an \(O(n^2)\) time complexity, which is computationally prohibitive.
- The watermark score of a single token is highly noisy, making direct threshold-based determination unreliable.

Method¶

Overall Architecture¶

The paper proposes two complementary methods: 1. Geometric Cover Detector (GCD): Determines whether the document contains watermarked text (classification task). 2. Adaptive Online Locator (AOL): Precisely locates the start and end positions of the watermarked text (localization task).

Both methods are based on the Geometric Cover technique, with a time complexity of \(O(n \log n)\).

Key Designs¶

Three Target Watermarking Schemes¶

KGW-Watermark: Splits the vocabulary into green/red groups based on prefix hashes, boosting the logits of green tokens during generation.
Unigram-Watermark: Fixes the green/red lists independently of prefixes, offering stronger robustness.
Gumbel-Watermark: Performs deterministic sampling based on Gumbel trick, achieving distortion-free watermarking.

Geometric Cover Detector (GCD)¶

Utilizes the Geometric Cover (GC) technique to partition long text into multi-scale interval sets.
GC Definition: \(\mathcal{I} = \bigcup_{k \in \mathbb{N} \cup 0} \mathcal{I}^{(k)}\), where \(\mathcal{I}^{(k)}\) is the set of all contiguous intervals of length \(2^k\).
Each token belongs to \(\lfloor \log n \rfloor + 1\) different intervals, totaling \(O(n)\) intervals.
Key guarantee (Daniely et al., 2015, Lemma 5): For any unknown watermarked interval, there always exists an interval in the GC that is completely contained within it and has at least one-quarter of its length.
Performs watermarking detection independently for each interval; if any interval is detected, the entire document is classified as containing watermarked text.
Controls the Family-Wise Error Rate (FWER) via Bonferroni correction: segment-level FPR \(\tau \rightarrow\) document-level FWER \(\le n\tau\).
In practice, it starts from higher-order intervals (e.g., \(\mathcal{I}^{(5)}\), intervals with 32+ tokens) to avoid unreliable judgments from overly short intervals.

Adaptive Online Locator (AOL)¶

Formulates the watermark localization problem as an online sequence denoising problem (online denoising / nonparametric regression).
Core Idea: The watermark score of each token is a noisy observation of the expected score, requiring estimation of the expected score sequence.
- Red-Green Watermark: \(s_t = \mathbf{1}(y_t \in \text{Green})\), with the expected value in the watermarked segment \(> \gamma\), and in the non-watermarked segment \(= \gamma\).
- Gumbel Watermark: \(s_t = \log(1/(1-r_{y_t}))\), with the expected value in the watermarked segment \(> 1\), and in the non-watermarked segment \(= 1\).
Employs the Aligator algorithm (Baby et al., 2021) for denoising.
- Aligator internally utilizes a Geometric Cover structure.
- Provides a minimax optimal estimation guarantee: competing with an oracle that knows the watermarked segment locations in terms of mean squared error.
- Time complexity is \(O(n \log n)\).
Circular Aligator: Designed to address the boundary effects of online learning.
- Treats the text as a circular buffer, randomly selecting a starting point to traverse the entire sequence each time.
- Performs \(m\) iterations with different starting points, and finally takes the average of all predictions for each token.
- Effectively eliminates prediction inaccuracies at the start and end boundaries.
Finally, applies a threshold \(\zeta\) to the denoised average scores; tokens with scores exceeding the threshold are identified as watermarked text.

Loss & Training¶

This paper presents a detection method and does not involve model training. The key aspects are: - The threshold for GCD is determined by the FPR calibration function \(F\). - The threshold \(\zeta\) for AOL is set according to the watermarking scheme (e.g., \(\zeta = 1.3\) in Gumbel watermarking). - The number of iterations for Circular Aligator is \(m = 10\) random starting points.

Key Experimental Results¶

Experimental Setup¶

Datasets: C4 and Arxiv (real text serves as the non-watermarked part, with LLM-generated watermarked text embedded within).
Models: LLaMA-7B and Mistral-7B.
Watermarking Schemes: KGW-Watermark, Unigram-Watermark, Gumbel-Watermark.
Evaluation Metrics: TPR (classification), IoU (localization).

Main Results — Classification Task (C4 + LLaMA-7B)¶

Method	KGW TPR	Unigram TPR	Gumbel TPR
Vanilla (Document-level)	0.602-0.692	0.006-0.058	0.650-0.918
GCD	0.912-0.934	0.874-0.958	1.000

The Vanilla method almost completely fails on Unigram watermarking (TPR only 0.006) because the watermark signal is diluted by the long text.
GCD significantly outperforms the Vanilla method across all watermarking schemes.
Gumbel watermarking is the easiest to detect, with GCD achieving 100% TPR.
The improvement for Unigram watermarking is the most significant: rising from near-random (0.006) to 0.874+.

Main Results — Localization Task¶

AOL achieves an average IoU \(> 0.55\) across the three watermarking schemes, performing far exceeding baseline methods.
The Circular initialization strategy (\(m=10\) random starting points) significantly improves localization accuracy at the boundaries.
A single Aligator pass exhibits noticeable boundary artifacts, which are effectively eliminated by multiple circular passes.

Key Findings¶

Document-level detection severely fails in mixed-source scenarios: Particularly for Unigram watermarking, where all signals are overwhelmed by non-watermarked text.
Multi-scale analysis is key: The multi-scale interval partitioning of GC ensures that watermarked segments of varying lengths can be effectively captured.
Online denoising is superior to simple thresholding: Individual token scores are highly noisy and must be denoised through appropriate window averaging.
The Circular strategy is crucial for eliminating boundary effects: A single linear pass yields inaccurate predictions for the start and end positions.
The method is highly generalizable across watermarking schemes: The same framework is applicable to three watermarking schemes with completely different mechanisms.

Highlights & Insights¶

The problem formulation is highly practical: Advancing from "whether the entire document contains a watermark" to "which paragraphs contain a watermark" directly addresses real-world demands such as plagiarism detection.
Elegant algorithmic design: Borrowing mature tools from Geometric Cover and online learning fields reduces the time complexity from \(O(n^2)\) to \(O(n \log n)\).
Clever design of Circular Aligator: It simply but effectively solves the boundary effect problem inherent in online algorithms.
Solid theoretical guarantees: Aligator's estimation error has a clear theoretical upper bound and possesses strong adaptivity at the segment level.
Highly versatile framework: Does not rely on a specific watermarking scheme, making it extensible to future watermarking technologies.

Limitations & Future Work¶

Assumption of contiguous watermarked segments: In practice, there may be multiple non-contiguous watermarked segments, and the current framework might require post-processing for segmentation.
Threshold \(\zeta\) requires manual setting based on the watermarking scheme: The optimal thresholds vary across different watermarking schemes.
Robustness under watermark attacks (e.g., paraphrasing) is not considered: In real-world scenarios, the watermarked text might be paraphrased.
Evaluated only on 7B models: The effectiveness on watermarked texts generated by larger models (e.g., 70B, GPT-4 level) has not been verified.
FPR control uses a conservative union bound: Tighter control methods might exist.
Localization of multiple scattered watermarked segments is not discussed: Performance when there are multiple dispersed watermarked segments in a document is unaddressed.

Watermarking Schemes: KGW (Kirchenbauer et al., 2023), Unigram (Zhao et al., 2023), Gumbel (Aaronson, 2023, Kuditipudi et al., 2023).
Partial Watermark Detection: Aaronson's plausibility score (\(O(n^{3/2})\)), Christ et al.'s Substring Completeness (\(O(n^2)\)), WinMax (\(O(n^2)\)).
Online Learning: Strong adaptivity guarantees of Aligator (Baby et al., 2021).
Insights: The framework can be extended to (1) localization and detection of code watermarks; (2) integration with plagiarism detection systems; (3) fine-grained annotation of AI-generated content; (4) adversarial evaluation of watermark robustness.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — First to achieve efficient watermarked segment localization in mixed-source texts.
Technical Depth: ⭐⭐⭐⭐⭐ — Elegantly combines GC and online learning theory with solid theoretical guarantees.
Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three watermarking schemes and two datasets, though the model size is limited.
Practical Value: ⭐⭐⭐⭐⭐ — Directly addresses the needs of plagiarism detection and AI-generated content regulation.
Writing Quality: ⭐⭐⭐⭐⭐ — Clear logic, exquisite illustrations, and excellent alignment between theory and experiments.
Overall Rating: 9.0/10