Provable Watermarking for Data Poisoning Attacks

Conference: NeurIPS 2025 arXiv: 2510.09210 Code: None Area: AI Security Keywords: watermarking, data poisoning, backdoor attack, availability attack, provable guarantee

TL;DR

This paper proposes two provable watermarking schemes, post-poisoning watermarking and poisoning-concurrent watermarking, that let data-poisoning generators transparently declare the presence of poisoning to authorized users. The theoretical analysis shows that, under suitable watermark-length conditions, watermark detectability and poisoning effectiveness can be guaranteed simultaneously.

Background & Motivation

  1. The dual nature of data poisoning: Data poisoning attacks have traditionally been regarded as security threats, but have increasingly been adopted for legitimate purposes, including dataset ownership verification (backdoor attacks), prevention of unauthorized data use (unlearnable examples), and artistic copyright protection (NightShade/Glaze).

  2. Risk of misuse: Authorized users may inadvertently train on poisoned data, leading to misunderstandings and disputes. Poisoning generators therefore bear a responsibility to transparently disclose the presence of poisoning traces to authorized users.

  3. Limitations of existing detection methods: Existing backdoor and availability attack detection methods are diverse and mutually incompatible, making it difficult to distribute them as a unified framework to authorized users. More critically, these methods largely rely on heuristic training algorithms and lack provable poisoning declaration mechanisms.

  4. Applicability of watermarking: Watermarking techniques have been widely applied in copyright protection and AI-generated content detection, making them a promising solution that allows poisoning generators to declare the existence of poisoning in a provable manner.

  5. Two practical scenarios: Two distinct requirements arise in practice—one in which a poisoning generator delegates watermark creation to a third-party entity (post-poisoning watermarking), and one in which the generator simultaneously produces both the watermark and the poisoning (poisoning-concurrent watermarking).

  6. Theoretical gap: Despite extensive independent research on data poisoning and watermarking, applying watermarking schemes to data poisoning attacks remains unexplored, with no existing theoretical guarantees for such a setting.

Method

Overall Architecture

The paper models data poisoning as a mapping \(\delta^p: [0,1]^d \to [-\epsilon_p, \epsilon_p]^d\) and watermarking as a mapping \(\delta^w: [0,1]^d \to [-\epsilon_w, \epsilon_w]^q\), where \(q \leq d\) is the watermark length. Detection uses a secret key \(\zeta\): the inner product \(\zeta^T x\) is compared against a threshold to decide whether a sample carries the watermark.
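As a concrete sketch of this detection mechanism, assuming for illustration a watermark of the form \(\epsilon_w \cdot \mathrm{sign}(\zeta)\) on the keyed dimensions (a common construction; the paper's exact one may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, eps_w = 3072, 500, 16 / 255  # CIFAR-10 dimension and the paper's backdoor budget

# Secret key zeta: random +/-1 signs on q dimensions, zero elsewhere.
zeta = np.zeros(d)
wm_dims = rng.choice(d, size=q, replace=False)
zeta[wm_dims] = rng.choice([-1.0, 1.0], size=q)

def embed(x: np.ndarray) -> np.ndarray:
    """Add eps_w * sign(zeta) on the keyed dimensions, clipped back to [0, 1]."""
    return np.clip(x + eps_w * zeta, 0.0, 1.0)

def statistic(x: np.ndarray) -> float:
    """Keyed inner product zeta^T x; thresholding it declares watermarked vs. clean."""
    return float(zeta @ x)

x_clean = rng.random(d)
x_wm = embed(x_clean)
# The watermark shifts zeta^T x upward by roughly q * eps_w (minus small clipping
# losses), separating it from the clean statistic with high probability.
shift = statistic(x_wm) - statistic(x_clean)
```

Only the key holder can compute the statistic, which is what makes the declaration verifiable by authorized users alone.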

Post-Poisoning Watermarking

  • Scenario: A third-party entity generates watermarks for an already-poisoned dataset.
  • Total perturbation: \(\delta_x = \delta_x^p + \delta_x^w\), with perturbation budget \(\epsilon_w + \epsilon_p\).
  • Sample-level theory (Theorem 4.1): When the watermark length \(q > \frac{1}{\epsilon_w}\sqrt{2d\log\frac{1}{\omega}}\), i.e., \(q = \Omega(\sqrt{d}/\epsilon_w)\), watermarked data can be distinguished from clean data with high probability.
  • Universal version (Theorem 4.9): When \(q = \Theta(\sqrt{d}/\epsilon_w)\) and sample size \(N = \Omega(d)\), the guarantee holds for the majority of samples and extends to the entire distribution.
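Plugging the paper's experimental scale into the Theorem 4.1 bound gives a feel for the required length; the failure probability \(\omega = 0.01\) below is an illustrative choice, not a value from the paper:

```python
import math

d = 3072          # CIFAR-10 input dimension (3 * 32 * 32)
eps_w = 16 / 255  # watermark budget used in the backdoor experiments
omega = 0.01      # target failure probability (illustrative)

# Theorem 4.1: q > (1 / eps_w) * sqrt(2 * d * log(1 / omega))
q_min = math.sqrt(2 * d * math.log(1 / omega)) / eps_w
print(math.ceil(q_min))  # on the order of a few thousand
```

The resulting value lands in the same regime as the \(q = 2000\)–\(3000\) settings where post-poisoning detection AUROC saturates in the experiments.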

Poisoning-Concurrent Watermarking

  • Scenario: The watermark generator is also the poisoning generator, enabling control over dimension allocation.
  • Dimension separation: Watermarking and poisoning operate on disjoint dimensions, with poisoning dimensions \(\mathcal{P} = [d] \setminus \mathcal{W}\).
  • Total perturbation budget: \(\max\{\epsilon_w, \epsilon_p\}\), smaller than the \(\epsilon_w + \epsilon_p\) budget of the post-poisoning case.
  • Sample-level theory (Theorem 4.3): A watermark length of \(q = \Omega(1/\epsilon_w^2)\) suffices to ensure detectability, requiring a shorter watermark than the post-poisoning scheme.
  • Universal version (Theorem 4.13): \(q = \Theta(1/\epsilon_w^2)\) suffices for the guarantee to hold for the majority of samples.
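A minimal sketch of the disjoint-dimension embedding; the index sets and the sign-pattern watermark are illustrative choices, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 3072, 500
eps_w, eps_p = 16 / 255, 16 / 255

# Disjoint index sets: q watermark dimensions W, the rest P = [d] \ W for poisoning.
perm = rng.permutation(d)
W, P = perm[:q], perm[q:]

def embed_concurrent(x, delta_p, wm_signs):
    """Apply poisoning on P and the watermark on W; the budgets never add up."""
    out = x.copy()
    out[P] = np.clip(out[P] + delta_p[P], 0.0, 1.0)      # poisoning perturbation
    out[W] = np.clip(out[W] + eps_w * wm_signs, 0.0, 1.0)  # watermark perturbation
    return out

x = rng.random(d)
delta_p = rng.uniform(-eps_p, eps_p, size=d)
signs = rng.choice([-1.0, 1.0], size=q)
x_out = embed_concurrent(x, delta_p, signs)
# Because no dimension receives both perturbations, the total l_inf perturbation
# is at most max(eps_w, eps_p), not eps_w + eps_p.
```

This is the source of the tighter budget claimed above: each coordinate is touched by at most one of the two perturbations.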

Poisoning Effectiveness Guarantees

  • Post-poisoning case (Theorem 5.2): For an \(L\)-layer feedforward network with Xavier initialization, when \(d\) and \(N\) are sufficiently large, the influence of watermarking on poisoning effectiveness can be bounded by a negligible error term, without additional constraints on \(q\).
  • Concurrent case (Theorem 5.6): An additional condition \(q = O(\sqrt{d}/\epsilon_p)\) is required; otherwise, the watermark dominates and significantly degrades poisoning effectiveness.
  • Resulting watermark-length ranges: post-poisoning uses \(q = \Theta(\sqrt{d}/\epsilon_w)\); concurrent requires \(q\) between \(\Theta(1/\epsilon_w^2)\) and \(O(\sqrt{d}/\epsilon_p)\).
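Ignoring the constants hidden in the asymptotic notation, the two endpoints of the concurrent scheme's valid range can be evaluated at the experimental scale; these are order-of-magnitude guides only, not exact thresholds from the theorems:

```python
import math

d = 3072                  # CIFAR-10 input dimension
eps_w = eps_p = 16 / 255  # backdoor budgets in the experiments

# Endpoints of the valid range for the concurrent scheme (Theta/O constants omitted):
q_low = 1 / eps_w**2            # detectability needs q = Omega(1 / eps_w^2)
q_high = math.sqrt(d) / eps_p   # effectiveness needs q = O(sqrt(d) / eps_p)
print(round(q_low), round(q_high))
```

With these budgets the endpoints come out in the low hundreds, consistent with the experiments: concurrent-scheme AUROC is already high at \(q = 500\), while ASR has collapsed by \(q = 2000\).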

Threat Model Instantiation

Using an autonomous driving dataset as a running example, the paper constructs a complete deployment scenario: data owner Alice publishes a dataset with both poisoning and watermarking; authorized user Bob obtains the key via a secure channel to verify identity and remove poisoning; malicious user Chad can neither successfully train a model nor remove the watermark. An HMAC-based key management scheme is also designed to prevent forgery attacks.
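The HMAC-based key management can be sketched with the standard library; the names (`master`, `dataset_id`) and the derivation scheme are illustrative assumptions, not the paper's exact protocol:

```python
import hashlib
import hmac
import secrets

# Alice's master secret; per-dataset keys are derived via HMAC, so a valid tag
# cannot be forged without the master secret.
master = secrets.token_bytes(32)
dataset_id = b"driving-dataset-v1"  # hypothetical release identifier

def derive_key(master: bytes, dataset_id: bytes) -> bytes:
    """Derive the watermark-verification key for one dataset release."""
    return hmac.new(master, dataset_id, hashlib.sha256).digest()

def verify_tag(master: bytes, dataset_id: bytes, tag: bytes) -> bool:
    """Check a received tag with a constant-time comparison."""
    return hmac.compare_digest(derive_key(master, dataset_id), tag)

key = derive_key(master, dataset_id)
```

Bob, receiving `key` over the secure channel, can verify it against the dataset identifier, while Chad cannot reuse the tag for a different release.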

Key Experimental Results

Experimental Setup

  • Attack methods: Backdoor attacks (Narcissus, AdvSc) + availability attacks (UE, AP)
  • Datasets: CIFAR-10, CIFAR-100, Tiny-ImageNet
  • Models: ResNet-18, ResNet-50, VGG-19, DenseNet121, WRN34-10, MobileNet v2
  • Watermark length: 0–3000; watermark/poisoning budgets are 16/255 (backdoor) and 8/255 (availability)

Backdoor Attack Watermarking Results (Table 1 — ResNet-18, CIFAR-10)

| Watermark Length \(q\) | Narcissus Post-Poisoning (Acc/ASR/AUROC) | Narcissus Concurrent (Acc/ASR/AUROC) | AdvSc Post-Poisoning (Acc/ASR/AUROC) | AdvSc Concurrent (Acc/ASR/AUROC) |
|---|---|---|---|---|
| 0 (baseline) | 94.69 / 95.04 / – | 94.69 / 95.04 / – | 92.80 / 95.53 / – | 92.80 / 95.53 / – |
| 500 | 94.95 / 93.11 / 0.9509 | 94.70 / 95.03 / 0.9968 | 93.18 / 97.43 / 0.9218 | 92.89 / 95.79 / 0.9986 |
| 1000 | 94.40 / 92.43 / 0.9974 | 94.32 / 92.03 / 0.9992 | 93.05 / 94.41 / 0.9809 | 93.38 / 84.39 / 0.9995 |
| 2000 | 94.55 / 90.37 / 1.0000 | 94.89 / 22.46 / 1.0000 | 93.40 / 79.97 / 0.9994 | 92.38 / 30.05 / 1.0000 |
| 3000 | 94.93 / 90.02 / 1.0000 | 94.72 / 9.75 / 1.0000 | 93.10 / 74.82 / 1.0000 | 93.04 / 9.97 / 1.0000 |

Availability Attack Watermarking Results (Table 2 — ResNet-18, CIFAR-10)

| Watermark Length \(q\) | UE Post-Poisoning (Acc↓ / AUROC↑) | UE Concurrent (Acc↓ / AUROC↑) | AP Post-Poisoning (Acc↓ / AUROC↑) | AP Concurrent (Acc↓ / AUROC↑) |
|---|---|---|---|---|
| 0 (baseline) | 10.79 / – | 10.79 / – | 8.53 / – | 8.53 / – |
| 500 | 11.71 / 0.7810 | 10.02 / 0.9930 | 8.71 / 0.8623 | 15.84 / 0.8931 |
| 1000 | 11.37 / 0.9499 | 9.42 / 0.9991 | 10.58 / 0.9742 | 21.87 / 0.9949 |
| 2000 | 9.06 / 0.9992 | 10.03 / 1.0000 | 10.48 / 0.9987 | 38.62 / 1.0000 |
| 3000 | 9.99 / 1.0000 | 91.79 / 1.0000 | 13.52 / 1.0000 | 93.40 / 1.0000 |

Key Findings:

  • Post-poisoning watermarking: poisoning effectiveness is well-preserved across all values of \(q\) (ASR declines only marginally), validating Theorem 5.2.
  • Poisoning-concurrent watermarking: poisoning effectiveness degrades sharply at large \(q\) (e.g., ASR drops to ~22–30% at \(q = 2000\)), confirming the necessity of the condition \(q = O(\sqrt{d}/\epsilon_p)\).
  • At equal \(q\), the detection performance (AUROC) of the concurrent scheme consistently surpasses that of the post-poisoning scheme, corroborating the theory: the concurrent scheme needs only \(q = \Theta(1/\epsilon_w^2)\), which is smaller than the post-poisoning requirement of \(\Theta(\sqrt{d}/\epsilon_w)\).

Highlights & Insights

  • Pioneering contribution: This work is the first to introduce watermarking into the domain of data poisoning attacks, providing transparency guarantees for "poisoning for good" use cases.
  • Theoretical rigor: A complete theoretical guarantee chain is established, from sample-level to universal and distributional settings, with precise conditions ensuring both watermark detectability and poisoning effectiveness.
  • Complementary schemes: The post-poisoning scheme suits third-party trust scenarios (where larger watermark lengths are preferable), while the concurrent scheme is tailored for self-administered poisoning (requiring a balance between watermark detectability and poisoning effectiveness).
  • End-to-end deployment design: Beyond theory and algorithms, the paper also proposes engineering-level security measures such as HMAC-based key management.

Limitations & Future Work

  • Sufficient but not necessary conditions: The theory provides only sufficient conditions on watermark length; necessary conditions remain an open problem.
  • Linear detection mechanism: Detection relies on a simple inner-product threshold, which may lack robustness.
  • Narrow valid range in concurrent watermarking: The effective range of \(q\), i.e., \([\Theta(1/\epsilon_w^2), O(\sqrt{d}/\epsilon_p)]\), may be restrictively narrow in practice.
  • Limited adversarial robustness: Although the appendix discusses defenses such as data augmentation, image regeneration, differentially private noise, and diffusion-based purification, a systematic robustness analysis is not conducted in depth.

Comparison with Related Work

| Dimension | Ours | Li et al. (2023) Backdoor Dataset Watermarking |
|---|---|---|
| Watermark–poisoning relationship | Explicitly separates watermarking from poisoning; proposes two schemes | Treats the backdoor attack itself as the watermark signal |
| Theoretical guarantees | Provable conditions for detectability and poisoning effectiveness | Lacks formal theoretical guarantees |
| Attack coverage | Unified framework for backdoor and availability attacks | Limited to backdoor attacks |
| Generality | General; does not depend on specific attack algorithms | Tightly coupled to specific backdoor attack methods |

| Dimension | Ours | NightShade/Glaze |
|---|---|---|
| Objective | Declares the presence of poisoning; provides transparency | Protects artists' copyright from being learned by generative models |
| Watermark mechanism | Detectable watermark independent of poisoning | No explicit watermark mechanism |
| Theoretical framework | Rigorous probabilistic argumentation | Empirical approach |
| Detection capability | Authorized users can verify poisoning | No active detection mechanism |

Rating

  • ⭐⭐⭐⭐ Novelty: First to introduce watermarking into data poisoning scenarios; problem formulation is novel and practically motivated.
  • ⭐⭐⭐⭐ Technical Depth: Complete theoretical chain from sample-level to distributional guarantees; proofs are rigorous.
  • ⭐⭐⭐⭐ Experimental Thoroughness: Covers 4 attack types, multiple models and datasets, with thorough ablation studies.
  • ⭐⭐⭐ Practicality: Theory is promising, but constraints on the valid watermark length range and robustness concerns limit real-world deployment.