Guidance Watermarking for Diffusion Models¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=5ifzhjMCKq
Code: TBD
Area: Image Generation / AIGC Watermarking
Keywords: Diffusion model watermarking, guided diffusion, in-generation watermarking, robustness, PCGrad

TL;DR¶

This paper proposes "Guidance Watermarking": gradients backpropagated through any off-the-shelf post-hoc watermark decoder steer the diffusion sampling trajectory. This converts any post-hoc watermarking scheme into an in-generation watermark with no training cost (neither the diffusion model nor the decoder is retrained), while inheriting or even enhancing the decoder's robustness.

Background & Motivation¶

Background: Images generated by diffusion models are now hard to distinguish from real photographs. Regulation increasingly requires provenance tracking and identification of AI-generated content, and digital watermarking is a core tool for this. Existing solutions fall into two categories: post-hoc (adding a watermark after generation) and in-generation (embedding the watermark during generation, as in Stable Signature, which fine-tunes the VAE, or Tree-Ring/Gaussian Shading, which embed patterns in the seed).
Limitations of Prior Work: ① Stable Signature concentrates watermark energy on high-frequency details produced by VAE upsampling, resulting in poor robustness against low-pass processing like JPEG. ② Seed schemes like Tree-Ring/Gaussian Shading either alter image semantics or are fragile under geometric attacks (cropping, rotation), and they lack theoretical guarantees for probability of false alarm (PFA). ③ Post-hoc schemes are merely external add-ons, disconnected from the generation process. ④ Most in-generation schemes require retraining models or decoders.
Key Challenge: The aim is to combine the strong robustness and controllable false alarm rate of post-hoc decoders (trained with augmentation layers) with the main advantage of in-generation schemes, where "the watermark is embedded in the semantics rather than relying on weak signals." These two properties have previously been difficult to reconcile.
Goal: To enable the sampling process to directly generate images that are "judged as watermarked by a pre-trained detector" without retraining the diffusion model, the decoder, or depending on the original image, while embedding the watermark into the semantic layer as early as possible.
Core Idea: [Treating the watermark decoder as a classifier for gradient guidance] — A watermark detector is essentially a differentiable classifier. Borrowing the gradient guidance approach from counterfactual generation, the gradients of the watermark loss from the decoder are used to push the diffusion trajectory, ensuring the final image falls into the "watermarked" region. An augmentation layer is added to achieve robustness against attacks.

Method¶

Overall Architecture¶

Assume a latent diffusion model consists of a noise estimation network \(\epsilon_\theta\), a schedule \(\bar\alpha_t\), and a VAE. Given any differentiable watermark extraction function \(\phi: \mathcal{X}\to\mathbb{R}^M\) and a secret vector \(u_m\), the method injects the gradient of the decoding loss into the noise estimation at each step. This guides the sampling trajectory toward a "high cosine similarity" watermark region. Robustness is obtained via an augmentation layer and gradient aggregation, with truncation/clipping used for acceleration and intensity control.

flowchart LR
    A[Latent z_t] --> B[Complete diffusion z_t->z_0 + VAE to get x_0]
    B --> C[Augmentation layer T: JPEG/Crop/Contrast...]
    C --> D[Watermark decoder φ extracts logits]
    D --> E[Cosine loss L = 1 - cos<φ, u_m>]
    E -->|Backprop ∇L| F[PCGrad aggregates multi-aug gradients]
    F --> G[Corrected noise estimation ε̂ = ε_θ - ω√(1-ᾱ_t)·g]
    G --> A

Key Designs¶

1. Gradient guidance turns post-hoc decoders into diffusion conditions: The core of the method is to formulate the "detector judgment" as a differentiable loss and compute its gradient with respect to the latent variables. At each time step, the noise estimation is rewritten as \(\hat\epsilon(z_t,t)=\epsilon_\theta(z_t,t)-\omega\sqrt{1-\bar\alpha_t}\,\nabla_{z_t}\log L(z_t)\), where \(\omega\) controls the watermark strength. The loss unifies multi-bit decoding and zero-bit detection: \(L(z_t,u_m)=1-\cos\big(\phi(x_0(z_t)),\,u_m\big)\), i.e., minimizing it shrinks the angle \(\theta\) between the extracted features and the secret vector. Under zero-bit detection, this angle can be directly converted into a statistical p-value \(p=\tfrac12\big(1\pm I_{\cos^2\theta}(\tfrac12,\tfrac{M-1}{2})\big)\), where the sign is negative when \(\cos\theta>0\) and positive otherwise. Thus, "minimizing loss" is equivalent to "lowering the p-value and increasing the probability of correct detection," allowing the desirable post-hoc property of controllable false alarms to be fully inherited.

2. Augmentation layer + PCGrad aggregation for "training-free robustness": To resist attacks, the authors calculate losses not only on the original image but also on a set of image transformations \(T\in\mathcal{T}\) (Identity, JPEG QF50/80, Brightness +0.2, Contrast ×2, Center Crop 50%). The gradients are then aggregated: \(\hat\epsilon_{\mathcal T}(z_t,t)=\epsilon_\theta(z_t,t)-\omega\sqrt{1-\bar\alpha_t}\,\mathrm{Agg}(\{\nabla_{z_t}\log L(z_t,u_m;T)\}_{T\in\mathcal{T}})\). Since the gradient directions of different transformations often conflict, a simple average would let them partially cancel. Therefore, PCGrad (projecting out conflicting gradient components), a technique from multi-task learning, is borrowed for aggregation. A key benefit is that \(\mathcal{T}\) can include transformations to which the original decoder was not initially robust; the guidance process then builds this robustness into the generated images without retraining the decoder.

3. Fast and controllable approximate guidance: A naive implementation would require \(T(T+1)/2\) diffusion steps plus backpropagation (here \(T\) denotes the number of sampling steps, not a transformation), which is computationally expensive. The authors apply two simplifications: ① Watermark guidance is only enabled after a certain step \(T_w\) (\(0<T_w<T\)); ② An identity transformation is used to approximate gradient propagation in reverse diffusion, substituting \(\nabla_{z_t}\) with \(\nabla_{z_0}\). To avoid the hassle of searching for \(\omega\) for every model, norm clipping is applied to stabilize the watermark energy injected at each step: \(\hat\epsilon(z_t,t)=\epsilon_\theta(z_t,t)-\omega\sqrt{1-\bar\alpha_t}\,\dfrac{g}{\max(\eta,\lVert g\rVert)}\), where \(g=\mathrm{clip}_\tau(\nabla_{z_0}\log L(z_t))\). This makes the method plug-and-play across different diffusion models such as SD2, Flux, and Sana.

4. Full-spectrum watermark energy distribution: Unlike Stable Signature, which crowds watermarks into high frequencies, the guidance method imposes constraints throughout the entire generation trajectory. This ensures that the watermark energy is distributed across the entire spectrum (evidenced by the spectral difference map in Fig. 2 of the paper). This is the physical reason for its superior robustness against low-pass attacks like JPEG and explains why the watermark "enters the semantics," causing minimal changes to images from the same seed/prompt while remaining globally detectable.
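The aggregation in Key Design 2 and the normalized update in Key Design 3 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `pcgrad_aggregate` and `guided_noise` are hypothetical, gradients are plain lists instead of latent tensors, the projection order is fixed rather than random, the projected gradients are averaged, and \(\mathrm{clip}_\tau\) is assumed to be an element-wise clip to \([-\tau,\tau]\).

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pcgrad_aggregate(grads):
    """PCGrad-style aggregation: remove conflicting components, then average.

    For each gradient g_i, every other gradient g_j with a negative inner
    product is treated as conflicting, and g_i is projected onto the plane
    orthogonal to g_j. Averaging (rather than summing) is an assumption.
    """
    projected = []
    for i, g in enumerate(grads):
        g_i = list(g)
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            d = dot(g_i, g_j)
            if d < 0:  # conflict: drop the component of g_i along g_j
                sq = dot(g_j, g_j)
                g_i = [x - d / sq * y for x, y in zip(g_i, g_j)]
        projected.append(g_i)
    n = len(projected)
    return [sum(col) / n for col in zip(*projected)]


def guided_noise(eps, grad, omega, alpha_bar_t, eta, tau):
    """eps_hat = eps - omega * sqrt(1 - alpha_bar_t) * g / max(eta, ||g||)."""
    g = [max(-tau, min(tau, x)) for x in grad]  # assumed element-wise clip
    norm = math.sqrt(dot(g, g))
    scale = omega * math.sqrt(1.0 - alpha_bar_t) / max(eta, norm)
    return [e - scale * x for e, x in zip(eps, g)]


# Two toy gradients that conflict (negative inner product).
grads = [[1.0, 0.0], [-1.0, 1.0]]
g = pcgrad_aggregate(grads)
eps_hat = guided_noise([0.0, 0.0], g, omega=2.0, alpha_bar_t=0.75, eta=1.0, tau=10.0)
```

In this toy case a plain average of the two gradients would be `[0.0, 0.5]`, with the first coordinate cancelled entirely, whereas the projected average keeps a component along both directions. The `max(eta, norm)` term bounds the step length by \(\omega\sqrt{1-\bar\alpha_t}\) while letting small gradients shrink the step.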

Key Experimental Results¶

Main Results Table (Quality + Robustness, Multi-model, Gain over SSig/VS counterparts in parentheses)¶

Model	Method	FID↓	CLIP↑	PSNR↑	Capacity↑	PD @ 1e-10 PFA	−log10(PFA) @ PD=0.9
SD2	G-SSig	2.3	0.332	19.6	27.7 (+19.3)	0.99 (+0.5)	16.3 (+12.2)
SD2	G-VS	2.2	0.332	18.5	212.2 (+37.7)	1.0 (+0.0)	105.6 (+61.8)
Flux	G-VS	9.3	0.269	26.0	192.5 (+16.0)	1.0	72.8 (+24.3)
Sana	G-VS	4.1	0.346	23.5	207.5 (+28.8)	1.0	96.4 (+49.2)

FID/CLIP are almost on par with the no-watermark baseline; PSNR is lower (expected, as the watermark modifies semantics rather than adding weak signals).
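As a reading aid for the table, the baseline values implied by the parenthesized gains can be recovered by subtraction. The short sketch below does this for the SD2 / G-SSig row, assuming the gains are absolute differences over the SSig counterpart; the dictionary keys and the helper `implied_baseline` are illustrative names, and the results are only as precise as the rounded table entries.

```python
# (reported value, gain in parentheses) copied from the SD2 / G-SSig row.
g_ssig_row = {
    "capacity": (27.7, 19.3),
    "pd_at_pfa_1e-10": (0.99, 0.5),
    "neg_log10_pfa_at_pd_0.9": (16.3, 12.2),
}


def implied_baseline(value, gain, ndigits=2):
    """Baseline implied by a reported value and its absolute gain."""
    return round(value - gain, ndigits)


ssig_implied = {k: implied_baseline(v, g) for k, (v, g) in g_ssig_row.items()}

# Ratio of detection probabilities at PFA = 1e-10 (guided vs. implied baseline).
pd_ratio = g_ssig_row["pd_at_pfa_1e-10"][0] / ssig_implied["pd_at_pfa_1e-10"]
```

Under that assumption the implied SSig detection probability at PFA = 1e-10 is about 0.49, so the guided variant roughly doubles it.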

Ablation/Comparison Table (SD2, Attack-by-attack against in-generation schemes)¶

Method	Identity	Contrast×2	JPEG Q50	Gauss Blur3	Rotation 90	Crop 50%
Gaussian Shading (Multi-bit, bits)	221	211	181	216	0	0
G-VS (Multi-bit, bits)	222	219	197	220	194	206
Tree-Ring (Zero-bit, −log10 PFA)	11.7	6.5	4.3	9.1	0.8	0.4
G-VS (Zero-bit, −log10 PFA)	154.6	130.6	89.2	150.3	100.7	101.9

Key Findings¶

Geometric attacks are the watershed: Gaussian Shading/Tree-Ring drop to nearly zero under rotation and cropping, whereas G-VS maintains high capacity/detectability by inheriting the VideoSeal decoder's exposure to these transformations during training.
Zero-bit detection: In the extremely low false alarm region of PFA=1e-10, G-SSig doubles the detectability of SSig; G-VS remains perfectly detectable in PFA ranges where SSig is no longer detectable.
Adversarial attacks: G-VS is consistently more robust than Tree-Ring against VAE purification, re-generation, and averaging attacks. Because the guided watermark is content-dependent, the watermark residual averaged over many images has low magnitude, which makes it difficult to estimate and strip with an averaging attack.

Highlights & Insights¶

"Decoder as Condition" perspective shift: Treating the watermark detector as a differentiable classifier within guided diffusion provides a concise and universal bridge, allowing the post-hoc and in-generation technical routes to interconnect at low cost for the first time.
Inheritable and enhanceable robustness: By incorporating any attack into the augmentation layer and aggregating with PCGrad, the watermark can be supplemented with robustness that the original decoder lacks, without any training.
Statistically controllable PFA: The direct mapping of cosine loss to p-values gives this solution a theoretical footing in large-scale detection scenarios where extremely low false alarm rates are mandatory—a weakness in seed-based schemes.
Complementary to VAE schemes: Since the watermark energy is distributed across the full spectrum, it can be used in conjunction with schemes like Stable Signature that modify the VAE.
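The cosine-to-p-value mapping behind the "Statistically controllable PFA" point can be illustrated with the standard library alone. The sketch below assumes the usual null hypothesis that the extracted feature direction is uniform on the unit sphere in \(\mathbb{R}^M\); instead of the regularized incomplete beta function it integrates the equivalent angle density, proportional to \(\sin^{M-2}\), with Simpson's rule. The function names are hypothetical and this is not the authors' code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    d = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return d / (na * nb)


def watermark_loss(features, secret):
    """Cosine loss L = 1 - cos(features, secret)."""
    return 1.0 - cosine(features, secret)


def cosine_p_value(cos_theta, dim, n_intervals=4000):
    """P(cos >= cos_theta) for a direction uniform on the sphere in R^dim.

    Integrates sin(t)^(dim-2) over [0, theta] with Simpson's rule
    (n_intervals must be even) and divides by the integral over [0, pi].
    """
    if dim < 2:
        raise ValueError("dim must be at least 2")
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    k = dim - 2
    h = theta / n_intervals
    total = 0.0
    for i in range(n_intervals + 1):
        weight = 1 if i in (0, n_intervals) else (4 if i % 2 else 2)
        total += weight * math.sin(i * h) ** k
    partial = total * h / 3.0
    # Integral of sin^k over [0, pi] = sqrt(pi) * Gamma((k+1)/2) / Gamma(k/2 + 1)
    log_norm = 0.5 * math.log(math.pi) + math.lgamma((k + 1) / 2) - math.lgamma(k / 2 + 1)
    return partial / math.exp(log_norm)


# Toy check in 3 dimensions, where cos(theta) is uniform on [-1, 1].
p_toy = cosine_p_value(0.5, dim=3)
# The same cosine in 256 dimensions is far less likely under the null.
p_high_dim = cosine_p_value(0.5, dim=256)
```

A detector would flag an image as watermarked when this p-value falls below the target PFA. The same cosine yields a much smaller p-value as \(M\) grows, which is why lowering the cosine loss translates into very low false alarm rates for high-dimensional decoder outputs.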

Limitations & Future Work¶

Computational overhead: Even with \(T_w\) delayed activation and identity approximation, guidance still requires multiple backpropagations of decoder gradients during sampling, making it more expensive than pure sampling; hyperparameters \(\omega, \eta, \tau\) require grid searches.
Strength-quality trade-off sensitivity: Excessive guidance can produce artifacts (intensified hue/shape shifts), requiring calibration for each model.
Dependency on decoder quality: The upper bound of the method's robustness is determined by the chosen pre-trained decoder; attacks not seen by the decoder must still be explicitly included in the augmentation layer for coverage.
Flux was only tested at 256×256 due to computational constraints; scalability for higher resolutions and more diffusion backbones remains to be verified.

Related Work¶

Stable Signature (Fernandez et al., 2023): In-generation via fine-tuning VAE. This work borrows the idea of "utilizing robust decoders + controlling PFA" but avoids the need for retraining VAEs and the high-frequency concentration flaw.
Tree-Ring / Gaussian Shading (Wen 2023 / Yang 2024): In-generation at the seed level. This work adopts the view that "GenAI watermarks should not be treated as weak signals and PSNR is inappropriate," while solving their issues with geometric robustness and lack of PFA guarantees.
Classifier Guided Diffusion (Dhariwal & Nichol, 2021) and Counterfactual Generation (Jeanneret et al., 2022): The direct technical origins of the method.
PCGrad (Yu et al., 2020): A key tool for resolving multi-augmentation gradient conflicts, inspiring the treatment of robustness as a multi-task optimization.

Rating¶

Novelty: ⭐⭐⭐⭐ — The perspective of using the gradient of an arbitrary post-hoc decoder as a diffusion guidance signal is novel and universal, embedding watermarks via guidance for the first time.
Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three backbones (SD2/Flux/Sana), two types of decoders, zero/multi-bit, and various geometric and adversarial attacks, while correcting p-value calculations for Tree-Ring.
Writing Quality: ⭐⭐⭐⭐ — Motivations are developed step-by-step, the statistical link between loss and p-value is clear, and spectral difference plots strongly support the core arguments.
Value: ⭐⭐⭐⭐ — Addresses the practical need for AIGC provenance regulation; the "no-retraining + inheritable robustness + controllable false alarm" features are highly attractive for industrial deployment.