
Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking

Conference: CVPR 2026
arXiv: 2603.12949
Code: None
Area: Image Generation
Keywords: Diffusion Model Image Editing, Robust Watermarking, Watermark Degradation, Information-Theoretic Analysis, Digital Watermarking Security

TL;DR

This paper provides a unified theoretical and experimental analysis of how non-adversarial diffusion editing inadvertently destroys robust invisible watermarks: it derives bounds on watermark SNR and mutual-information decay, and validates systemic failures of watermark recovery across scenarios including instruction-based editing, drag-based editing, and training-free synthesis.

Background & Motivation

Challenges to Watermark Robustness Assumptions: Existing deep learning watermarks (StegaStamp, TrustMark, VINE, etc.) maintain high recovery rates under conventional post-processing like JPEG compression, scaling, and cropping through end-to-end training. However, their training distributions do not cover the new family of transformations introduced by diffusion editing.

Diffusion Editing Differs Fundamentally from Traditional Attacks: Diffusion editing reconstructs images by adding significant noise and then denoising, relying on strong generative priors. Watermarks, as low-amplitude structured perturbations, are treated as "unnatural residuals" and removed by the denoiser—even when the user has no intent to remove them.

Increasingly Diverse Editing Methods: From text-instructed editing (InstructPix2Pix, UltraEdit) to interactive drag-based editing (DragDiffusion, DragFlow) and training-free synthesis (TF-ICON, SHINE), the diffusion editing ecosystem continues to expand, posing a systemic threat to watermarking.

Lack of Unified Analysis in Prior Work: Previous research on diffusion regeneration attacks focused only on specific watermarks or attacks, lacking a comprehensive theoretical framework that treats standard editing workflows as systemic stress tests.

Reliability of Watermarking and Content Provenance Infrastructure in Doubt: Watermarking is being deployed as infrastructure for copyright protection and content provenance. If routine editing can unintentionally destroy watermarks, the reliability of downstream provenance claims is fundamentally questioned.

Core Problem: Under what conditions does diffusion image editing unintentionally impair robust watermark recovery? Which theoretical principles explain the observed failures?

Method

Overall Architecture

The paper formalizes diffusion editing as a Markov kernel acting on the watermarked image:

\[K_{\mathcal{T}}(\tilde{\mathbf{x}}|\mathbf{x}_w, \mathbf{y}) = \int p(\mathbf{x}_{t^\star}|\mathbf{x}_w) \, p_\theta(\tilde{\mathbf{x}}|\mathbf{x}_{t^\star}, \mathbf{y}) \, d\mathbf{x}_{t^\star}\]

Where \(p(\mathbf{x}_{t^\star}|\mathbf{x}_w)\) is the forward diffusion process (adding noise up to intensity \(t^\star\)), and \(p_\theta\) is the conditional reverse denoising process. Different editors correspond to different parameterizations of \(p_\theta\): instruction-based editing learns conditional denoisers, drag-based editing optimizes in latent space before sampling, and synthesis frameworks guide denoising via attention maps or adapters.
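To make the kernel concrete, here is a minimal NumPy sketch in which a trivial rescaling denoiser stands in for \(p_\theta\); the denoiser and the single noise level are our placeholders, not the paper's implementation:

```python
import numpy as np

def forward_noise(x_w, alpha_bar, rng):
    """Forward process p(x_{t*} | x_w): shrink toward 0 and add Gaussian noise."""
    eps = rng.standard_normal(x_w.shape)
    return np.sqrt(alpha_bar) * x_w + np.sqrt(1.0 - alpha_bar) * eps

def toy_denoiser(x_t, alpha_bar):
    """Stand-in for the conditional reverse process p_theta: a real editor runs a
    full (conditioned) reverse diffusion chain; here we just undo the scaling."""
    return x_t / np.sqrt(alpha_bar)

def editing_kernel(x_w, alpha_bar_tstar, rng):
    """One sample from the Markov kernel K_T: noise up to t*, then denoise."""
    return toy_denoiser(forward_noise(x_w, alpha_bar_tstar, rng), alpha_bar_tstar)
```

Even this toy kernel injects noise of variance \((1-\bar\alpha_{t^\star})/\bar\alpha_{t^\star}\) on top of the watermarked image, which is precisely what erodes a low-amplitude embedded signal.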

The watermark signal is modeled as an additive residual: \(\mathbf{x}_w = \mathbf{x} + \gamma \mathbf{s}(\mathbf{m}, \mathbf{k}, \mathbf{x})\), where \(\mathbf{s}\) is a bounded-energy embedding signal and \(\gamma\) controls the intensity.

Key Designs

1. SNR Decay Analysis: The forward diffusion process maps the watermarked image to \(\mathbf{x}_{t^\star} = \sqrt{\bar\alpha_{t^\star}} \mathbf{x}_w + \sqrt{1-\bar\alpha_{t^\star}} \epsilon\), where the watermark SNR decreases monotonically as \(t^\star\) increases. When \(\bar\alpha_{t^\star}\) is sufficiently small, the watermark signal is completely overwhelmed by noise.

2. Mutual Information Decay Bound: The paper derives an upper bound on the mutual information between the watermark payload and the observed image after denoising. Applying Fano's inequality, the authors establish a lower bound on the Bit Error Rate (BER), showing that once editing intensity exceeds a threshold, reliable recovery is information-theoretically impossible.

3. Frequency Domain Analysis: A spectral preservation ratio \(\rho_\Omega\) is defined to quantify the survival rate of watermark energy across low/mid/high frequency bands. Diffusion denoising exhibits the strongest suppression in high-frequency bands, where most watermarks concentrate energy to maintain invisibility, creating a structural contradiction.

4. DEW-ST Evaluation Protocol: A standardized Diffusion Editing Watermark Stress Test (Algorithm 1) is proposed, covering four categories: instruction-based, region-based, drag-based, and synthesis-based editing, each tested under multiple intensities \(t^\star \in \{0.2, 0.4, 0.6, 0.8\}\).
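The SNR claim in Design 1 is easy to verify numerically. The cosine-style schedule and watermark power below are illustrative assumptions, not the paper's values:

```python
import numpy as np

# Illustrative cosine-style noise schedule over normalized time t* (assumed).
t_star = np.linspace(0.05, 0.95, 19)
alpha_bar = np.cos(0.5 * np.pi * t_star) ** 2

# Watermark power gamma^2 * E[|s|^2] relative to unit-variance noise (illustrative).
wm_power = 1e-2

# After forward noising, watermark power scales by alpha_bar and the injected
# noise power by 1 - alpha_bar, so the watermark SNR is:
snr = alpha_bar * wm_power / (1.0 - alpha_bar)
```

Because \(\bar\alpha_{t^\star}\) decreases in \(t^\star\), `snr` is strictly decreasing, and as \(\bar\alpha_{t^\star} \to 0\) the watermark is drowned entirely, matching the monotone decay the paper states.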
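Design 2's Fano argument can be made operational: given an estimate of the surviving mutual information per payload bit, the BER floor follows by inverting the binary entropy. This is a sketch of the standard bound, not the paper's exact derivation:

```python
from math import log2

def h2(p):
    """Binary entropy (bits); increasing on [0, 1/2]."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def ber_lower_bound(mi_per_bit):
    """Fano-style floor: h2(BER) >= 1 - I(m; x_tilde)/L implies
    BER >= h2^{-1}(1 - I/L), solved here by bisection on [0, 1/2]."""
    target = max(0.0, 1.0 - mi_per_bit)
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if h2(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo
```

At the extremes the bound behaves as expected: with no surviving information the BER floor is 1/2 (random guessing), and with one full bit of information per payload bit the floor vanishes.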
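Design 3's spectral preservation ratio can be estimated directly from the watermark residuals before and after editing; the radial band definition and normalization below are our reading of the metric, and may differ from the paper's exact formulation:

```python
import numpy as np

def band_energy(img, r_lo, r_hi):
    """Energy of FFT coefficients with normalized radial frequency in [r_lo, r_hi)."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    return float(np.sum(np.abs(F[(r >= r_lo) & (r < r_hi)]) ** 2))

def spectral_preservation(residual_before, residual_after, band):
    """rho_Omega: fraction of watermark-residual energy in a band that survives."""
    e0 = band_energy(residual_before, *band)
    e1 = band_energy(residual_after, *band)
    return e1 / max(e0, 1e-12)
```

Computing this with, say, `band=(0.6, 1.0)` for the high-frequency band exposes the structural contradiction the paper identifies: watermarks hide energy in exactly the band the denoiser suppresses hardest.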
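Design 4's sweep can be sketched as a loop over editor × intensity × seed. The category names and intensity grid are taken from the text; the function signatures are hypothetical stand-ins for the protocol's components:

```python
import itertools

EDITORS = ["instruction", "region", "drag", "synthesis"]
INTENSITIES = [0.2, 0.4, 0.6, 0.8]
SEEDS = [0, 1, 2]

def dew_st(image, decode_bits, edit_fn):
    """Sketch of the DEW-ST stress test (Algorithm 1): edit the watermarked image
    under every (editor, t*, seed) cell and record decoded bit accuracy."""
    results = {}
    for editor, t_star, seed in itertools.product(EDITORS, INTENSITIES, SEEDS):
        edited = edit_fn(image, editor=editor, t_star=t_star, seed=seed)
        results[(editor, t_star, seed)] = decode_bits(edited)
    return results
```

The full grid is 4 × 4 × 3 = 48 diffusion edits per image, which already hints at the protocol's computational cost noted under Limitations.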

Loss & Training

The paper proposes a conceptual framework for Diffusion-Augmented Watermark Training (Algorithm 2):

\[\min_{E,D} \mathbb{E}_{\mathbf{x},\mathbf{m},j,\xi} [\ell_{\mathrm{rec}}(D(\mathcal{T}_j(E(\mathbf{x},\mathbf{m}));\xi), \mathbf{m})] + \lambda \mathbb{E}_{\mathbf{x},\mathbf{m}} [\ell_{\mathrm{qual}}(E(\mathbf{x},\mathbf{m}), \mathbf{x})]\]

Diffusion editors \(\mathcal{T}_j\) and intensities \(\xi\) are randomly sampled as data augmentation during training, teaching the watermark to survive generative transformations. However, the paper notes this is only a defensive template; practical deployment requires lightweight proxies to reduce computational cost.
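A single stochastic evaluation of this objective can be sketched as below; every callable is a placeholder we supply for illustration, not the paper's implementation:

```python
import random

INTENSITIES = [0.2, 0.4, 0.6, 0.8]

def augmented_loss_sample(x, m, encoder, decoder, editors, rec_loss, qual_loss, lam):
    """One Monte Carlo sample of the diffusion-augmented objective:
    embed, push through a randomly drawn editor T_j at a random intensity xi,
    then score payload recovery plus imperceptibility."""
    x_w = encoder(x, m)
    edit = random.choice(editors)
    xi = random.choice(INTENSITIES)
    x_edit = edit(x_w, xi)
    return rec_loss(decoder(x_edit), m) + lam * qual_loss(x_w, x)
```

In practice the expectation is approximated per minibatch; since running full diffusion editors inside the training loop is expensive, this is where the paper's suggested lightweight proxies would slot in.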

Key Experimental Results

Main Results

Table 4: Watermark Bit Accuracy (%) under different transformations; random guess ≈ 50%

Transformation  | Intensity         | StegaStamp | TrustMark | VINE
No Edit         | –                 | 99.4       | 99.7      | 99.8
JPEG            | Q=50              | 96.1       | 98.2      | 98.9
InstructPix2Pix | \(t^\star\)=0.4   | 71.5       | 76.1      | 85.4
InstructPix2Pix | \(t^\star\)=0.8   | 53.2       | 55.0      | 60.7
DragDiffusion   | Medium            | 63.4       | 67.9      | 78.6
DragFlow        | Medium            | 60.8       | 65.1      | 76.9
TF-ICON         | Synthesis         | 58.9       | 63.2      | 74.8
SHINE           | Insertion         | 55.6       | 60.4      | 72.2

Table 5: Breakdown by Editing Type (Medium Intensity)

Editing Type       | StegaStamp | TrustMark | VINE
Style Transfer     | 54.0       | 56.8      | 62.5
Lighting Changes   | 60.7       | 65.2      | 74.6
Object Replacement | 58.3       | 63.9      | 73.1
Local Inpainting   | 74.6       | 79.2      | 88.1
Drag Editing       | 63.4       | 67.9      | 78.6

Ablation Study

Impact of Editing Intensity \(t^\star\) (InstructPix2Pix): Bit accuracy for all methods decreases monotonically with \(t^\star\). StegaStamp drops from 86.7% at \(t^\star\)=0.2 to 53.2% at \(t^\star\)=0.8; VINE drops from 93.5% to 60.7%. Multi-seed voting provides only marginal improvement (~1%), indicating the failure is systemic signal contraction rather than random damage.

Impact of Resolution: Embedding at 256×256 followed by upsampling vs. direct embedding at 512×512 makes little difference under conventional post-processing, but both settings approach random guessing under strong editing.

Spectral Preservation Ratio: The high-frequency \(\rho_{\mathrm{high}}\) is below 0.22 (VINE) or 0.15 (StegaStamp) across all editors, confirming that diffusion denoising is a strong suppressor of high-frequency watermark residuals.

ECC Decoding: Error-correcting codes recover full messages under weak editing (for VINE, 85.4% bit accuracy yields 55.6% message accuracy) but fail completely under strong editing (60.7% bit accuracy yields only 2.1% message accuracy), because bit errors approach randomness.
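This collapse is easy to reproduce with a toy repetition code (the review does not specify the paper's ECC; this is our stand-in). Near-random bit errors overwhelm any code's correction capability:

```python
from math import comb

def corrected_ber(p, n=3):
    """Per-bit error after majority decoding an n-fold repetition code."""
    t = n // 2
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t + 1, n + 1))

def message_accuracy(ber, payload_bits=32, n=3):
    """Probability that every payload bit decodes correctly (independent errors)."""
    return (1.0 - corrected_ber(ber, n)) ** payload_bits

weak = message_accuracy(1 - 0.854)   # raw BER ~14.6% (VINE, weak editing)
strong = message_accuracy(1 - 0.607) # raw BER ~39.3% (VINE, strong editing)
```

Even this crude 3-repetition code shows orders-of-magnitude separation between the two regimes; a stronger code improves the weak-editing case, but no code rescues a BER drifting toward 50%.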

Key Findings

  1. Diffusion editing differs qualitatively from traditional post-processing: All three watermarks maintain >92% accuracy under JPEG/scaling, but drop to 60-85% under medium diffusion editing; strong editing approaches random guessing.
  2. "Local" editing does not imply "Watermark Safety": Because diffusion denoising couples pixels in latent space, even editing small regions can affect globally distributed watermark signals.
  3. Diffusion-native watermarks (Tree-Ring, Stable Signature) are equally fragile under cross-model editing: Same-model editing AUC reaches 0.89-0.92, but cross-model editing drops to 0.58-0.65.
  4. High visual fidelity does not equal watermark preservation: There is no positive correlation between post-edit PSNR/SSIM and watermark recovery rate.

Highlights & Insights

  • Theory-Experiment Alignment: The theoretical chain from SNR decay to mutual information decay to Fano bounds is clear and matches the experimental trend of bit accuracy dropping with editing intensity.
  • Broad Evaluation Coverage: Spanning instruction, drag, and synthetic editing paradigms, with three representative watermarks and four intensities, forming a comprehensive benchmark for diffusion-watermark interaction.
  • Frequency Analysis Provides Mechanism: The \(\rho_\Omega\) metric clearly reveals the structural reason: watermark high-frequency energy is prioritized for removal by the denoiser.
  • Clear Defensive Direction: Suggests that diffusion-resilient watermarks should (i) be integrated into the generation process or (ii) optimize semantic invariance, rather than just increasing robustness against traditional noise layers.

Limitations

  • The experimental data are "illustrative/hypothetical" values; although they are claimed to match trends in the literature, the lack of real experimental verification weakens the argument.
  • Theoretical analysis relies on the additive residual approximation (Assumption 3.1) of watermarks; applicability to non-linear embeddings (e.g., attention-based or VAE latent-based) is yet to be verified.
  • The DEW-ST protocol is computationally expensive (16 instructions × 4 intensities × 3 seeds per image), making practical deployment questionable.
  • Lacks in-depth discussion on video and multi-modal watermarks under diffusion editing.
  • The defense proposal (Algorithm 2) is a conceptual framework without actual training or validation.

Related Work

  • Robust Watermarking: HiDDeN, StegaStamp, TrustMark, VINE, RoSteALS, Watermark Anything—the paper selects StegaStamp, TrustMark, and VINE as representative baselines.
  • Diffusion Editing: SDEdit, Prompt-to-Prompt, InstructPix2Pix, UltraEdit, DragDiffusion, DragFlow, TF-ICON, SHINE—forming the editor ecosystem evaluated.
  • Diffusion-Native Watermarking: Tree-Ring, Stable Signature, SynthID—used for comparison to show that generator-integrated schemes are also fragile in cross-model scenarios.
  • Watermark Attack & Removal: Provable analysis of regeneration attacks (Zhao et al.), diffusion attacks (Ni et al.)—this work differs by focusing on unintentional removal rather than adversarial attacks.
  • Concept Erasure: MACE, ANT, EraseAnything—demonstrates that diffusion models can selectively suppress specific signals, implying structural risks for watermarks.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to unify diffusion editing as a Markov kernel and derive information-theoretic failure conditions.
  • Experimental Thoroughness: ⭐⭐⭐ — Broad coverage, but uses hypothetical values instead of verified experimental data.
  • Writing Quality: ⭐⭐⭐⭐ — Rigorous theoretical derivation, unified notation, and logical narrative.
  • Value: ⭐⭐⭐⭐ — Significant warning to the watermarking security community; the evaluation protocol is of reference value.