Robustness and Radioactivity of Watermarks in Federated Learning May Be at Odds¶
Conference: ICLR 2026 arXiv: 2510.17033 Area: Video Understanding Keywords: Federated Learning, LLM Watermarking, Data Provenance, Robust Aggregation, Radioactivity Detection
TL;DR¶
This work presents the first study on LLM watermark-based data provenance in federated learning (FL). It demonstrates that watermarks are radioactive (detectable) in FL, yet a malicious server can suppress watermark signals by employing strong robust aggregation algorithms to filter watermarked updates, revealing a fundamental trilemma among radioactivity, robustness, and model utility.
Background & Motivation¶
As LLM-generated synthetic data becomes increasingly prevalent in FL, data provenance has become a critical concern:
Proliferation of Synthetic Data: FL clients increasingly rely on LLM-generated synthetic datasets for local training.
Provenance Requirements: Without a provenance mechanism, malicious use of synthetic data for fine-tuning cannot be attributed or held accountable.
Regulatory Demands: Regulations such as the EU AI Act explicitly require transparency and traceability in AI systems.
Core Problem: Are LLM watermarks still effective in FL environments? Can a malicious server remove watermark signals while preserving model utility?
Key Observation¶
t-SNE visualization reveals that model updates from watermarked clients appear as outliers in high-dimensional space, clearly separated from the update distribution of clean clients (Figure 1). This separation is precisely what gives an adversarial server a handle for filtering them.
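The outlier behaviour can be reproduced in a toy setting. This is a minimal sketch with synthetic Gaussian updates, where an exaggerated per-coordinate mean shift stands in for the distributional shift induced by watermarked data; it is not the paper's actual t-SNE pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # toy parameter dimensionality

# Synthetic client updates: clean clients draw from N(0, I);
# watermarked clients are shifted (an exaggerated toy stand-in).
clean = rng.normal(0.0, 1.0, size=(20, d))
watermarked = rng.normal(1.0, 1.0, size=(5, d))

centroid = clean.mean(axis=0)
clean_d = np.linalg.norm(clean - centroid, axis=1)
wm_d = np.linalg.norm(watermarked - centroid, axis=1)

# Every watermarked update sits farther from the clean centroid
# than any clean update -- the separation the server exploits.
assert wm_d.min() > clean_d.max()
```

With a realistic (much smaller) shift the separation would be noisier, but the mechanism is the same: watermarking biases the update distribution in a consistent direction.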
Method¶
Overall Architecture¶
The study considers two FL scenarios:
- VanillaFL: A benign server that naively averages all client updates.
- ActiveFL: A malicious server that employs a robust aggregator to filter watermarked updates.
Key Definitions¶
Radioactivity: A dataset \(D^w\) is \(\alpha\)-radioactive with respect to statistical test \(T\) if \(T\) can reject the null hypothesis \(H_0\) (that the model was not trained on \(D^w\)) with a p-value below \(\alpha\).
FL Robustness: Dataset \(D^w_i\) is non-robust if there exists an adversary \(\mathcal{A}\) such that:
1. \(\mathcal{A}(U_\Delta, \theta^{t_\mathcal{A}}) \approx_\mathcal{E} \mathcal{T}(C_\Delta, \theta^{t_\mathcal{A}})\) (utility is preserved), and
2. \(\text{Detect}^{\mathcal{M}_{\theta^{t_\mathcal{A}+1}},\mathcal{A}}_s(D^w_i) \rightarrow \text{False}\) (the watermark becomes undetectable).
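To make the radioactivity definition concrete, here is a minimal one-sided z-test in the style of KGW green-list detection. It assumes a fixed green-list rate \(\gamma\) under the null hypothesis; the paper's actual detector may aggregate evidence differently.

```python
import math

def radioactivity_pvalue(green_hits: int, total_tokens: int,
                         gamma: float = 0.25) -> float:
    """One-sided z-test. H0: the model was not trained on the watermarked
    dataset, so green-list tokens appear only at the base rate gamma."""
    mean = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    z = (green_hits - mean) / std
    # One-sided p-value via the normal survival function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# 300 green tokens out of 1000 is far above the base rate of 250,
# so the test rejects H0 at small alpha: the dataset is alpha-radioactive.
assert radioactivity_pvalue(300, 1000) < 0.01
# At exactly the base rate the p-value is ~0.5 -- no signal.
assert radioactivity_pvalue(250, 1000) > 0.4
```

A p-value near 0.5, as reported for KTH+ and for ActiveFL, corresponds to the "no accumulated signal" case in this sketch.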
Attack Mechanism¶
The malicious server replaces simple averaging with a Byzantine-robust aggregator (e.g., RandEigen), which guarantees an upper bound \(\beta\) on the aggregation bias. Strong robust aggregators achieve \(\beta = O(1)\), independent of the parameter dimensionality \(d\), making them well suited to the high-dimensional parameter spaces of LLMs.
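The attack can be sketched with a simple distance-based filter standing in for RandEigen. This is illustrative only: the function name, the keep-k rule, and the shifted-Gaussian model of watermarked updates are assumptions, not the paper's algorithm.

```python
import numpy as np

def filtered_mean(updates: np.ndarray, n_keep: int) -> np.ndarray:
    """Toy robust aggregator: keep the n_keep updates closest to the
    coordinate-wise median, then average them. A stand-in for stronger
    aggregators such as RandEigen."""
    median = np.median(updates, axis=0)
    dists = np.linalg.norm(updates - median, axis=1)
    keep = np.argsort(dists)[:n_keep]
    return updates[keep].mean(axis=0)

rng = np.random.default_rng(1)
clean = rng.normal(0.0, 0.1, size=(18, 64))
wm = rng.normal(1.0, 0.1, size=(2, 64))  # watermarked updates are outliers
updates = np.vstack([clean, wm])

naive = updates.mean(axis=0)                 # VanillaFL: averages everything
robust = filtered_mean(updates, n_keep=18)   # ActiveFL: drops the outliers

# The robust aggregate stays near the clean centroid; the naive average
# is pulled toward the watermarked updates, carrying their signal.
assert np.linalg.norm(robust) < np.linalg.norm(naive)
```

The key point matches the paper's threat model: the server never inspects the watermark itself, it only swaps the aggregation function.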
Evaluation Metrics¶
- Escape Rate (ER): Fraction of watermarked clients whose updates survive aggregation without being filtered.
- Over-Filtering Rate (OFR): Fraction of filtered updates that belong to non-watermarked clients.
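Given the set of clients the aggregator filtered and the set of watermarked clients, the two metrics can be computed as follows (client IDs and the set-based interface are hypothetical):

```python
def escape_and_overfilter_rates(filtered: set, watermarked: set):
    """ER: fraction of watermarked clients whose updates survived
    aggregation unfiltered.
    OFR: fraction of filtered updates that actually came from
    non-watermarked clients."""
    er = len(watermarked - filtered) / len(watermarked)
    ofr = len(filtered - watermarked) / len(filtered) if filtered else 0.0
    return er, ofr

# 3 watermarked clients {0, 1, 2}; the aggregator filtered {1, 2, 3}.
er, ofr = escape_and_overfilter_rates({1, 2, 3}, {0, 1, 2})
assert abs(er - 1 / 3) < 1e-9   # only client 0 escaped
assert abs(ofr - 1 / 3) < 1e-9  # client 3 was filtered needlessly
```

A low ER means the watermark is easy to suppress; a high OFR means the aggregator is also discarding honest clean updates, which is where the utility cost enters.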
Key Experimental Results¶
Radioactivity Detection (VanillaFL, ε=6.6%)¶
| Dataset | Watermark | Model | p-value (pre-FT) | p-value (post-FT) |
|---|---|---|---|---|
| C4 | KGW+ | 160M | 0.397 | \(1.27\times10^{-3}\) |
| C4 | KGW+ | 410M | 0.877 | \(2.41\times10^{-8}\) |
| Alpaca | KGW+ | 410M | 0.302 | \(4.96\times10^{-24}\) |
| C4 | KTH+ | All | ~0.5 | ~0.5 |
Robustness Detection (ActiveFL vs. VanillaFL, ε=6.6%)¶
| Dataset | Model | VanillaFL p-value | ActiveFL p-value |
|---|---|---|---|
| C4 | 160M | \(1.27\times10^{-3}\) | 0.550 |
| C4 | 410M | \(2.41\times10^{-8}\) | 0.613 |
| Alpaca | 160M | \(1.59\times10^{-11}\) | 0.231 |
| Alpaca | 410M | \(4.96\times10^{-24}\) | 0.282 |
Key Findings¶
- KGW+ watermarks exhibit strong radioactivity in FL: Even with only 6.6% watermarked data, p-values can reach as low as \(10^{-24}\).
- KTH+ watermarks are not radioactive in FL: Their detectors cannot accumulate statistical signals across prompts.
- RandEigen effectively removes all watermarks: Under ActiveFL, p-values for all radioactive watermarks recover to ~0.5.
- Larger δ increases radioactivity but reduces robustness: ER drops from 60.2% (δ=0) to 0.7% (δ=5).
- Trilemma: Increasing ε simultaneously improves radioactivity and robustness but degrades model utility.
Highlights & Insights¶
- First federated data provenance study: Extends watermark detection from centralized to distributed FL settings.
- Adversarial insight: The distributional shift introduced by watermarking causes updates to appear as outliers, which are precisely the targets of aggregators originally designed to defend against Byzantine attacks.
- Revelation of a fundamental trilemma: Radioactivity (detectability), robustness (adversarial resistance), and utility (model quality) cannot be simultaneously satisfied.
- Practical threat model: A malicious server only needs to replace the aggregation function to suppress watermarks, without any knowledge of the watermarking scheme or its parameters.
Limitations & Future Work¶
- Evaluation is limited to the Pythia model family (70M–410M); validation on larger models is absent.
- Only two watermarking schemes (KGW+, KTH+) are considered, limiting coverage.
- The malicious server is assumed to have no knowledge of the watermark key or scheme; in practice, such information may leak.
- No effective defense against robust aggregation attacks is proposed.
- The watermarked client ratio ε is kept small (at most 30%); more extreme scenarios remain unexplored.
Rating ⭐⭐⭐⭐¶
The problem formulation is novel and the experimental design is rigorous, uncovering a fundamental tension in FL watermarking. Although no solution is proposed, the work charts a clear direction for future research. The connection to video understanding is tenuous; the paper is more squarely situated at the intersection of security and federated learning.