SPDMark: Selective Parameter Displacement for Robust Video Watermarking¶
Conference: CVPR 2026 · arXiv: 2512.12090 · Code: Available (mentioned in the paper) · Area: Diffusion Models / Video Watermarking · Keywords: Video watermarking, parameter displacement, LoRA, diffusion models, robustness
TL;DR¶
SPDMark proposes a video diffusion model watermarking framework based on Selective Parameter Displacement (SPD). By learning a low-rank basis shift dictionary in the decoder and selecting combinations according to the watermark key, it achieves per-frame watermark embedding with imperceptibility, high robustness, and low computational overhead, while supporting temporal tampering detection and localization.
Background & Motivation¶
- Background: The emergence of high-quality video generation models (e.g., Sora, SVD) has made the provenance of AI-generated video an increasingly pressing problem. Both the EU AI Act and U.S. AI executive orders recommend watermarking AI-generated content. Video watermarking must simultaneously satisfy imperceptibility, robustness, and computational efficiency.
- Limitations of Prior Work: (a) Post-processing methods (e.g., VideoSeal) introduce latency and cannot leverage generative priors; (b) noise-space methods (e.g., VideoShield) decode via DDIM inversion, which is computationally expensive and sensitive to perturbations; (c) model fine-tuning methods (e.g., LVMark) uniformly modulate all layers, limiting per-frame control, while VidSig embeds only a single fixed signature and cannot detect temporal tampering. All three categories exhibit trade-offs among imperceptibility, robustness, and efficiency.
- Key Challenge: How can one achieve efficient multi-key per-frame watermark embedding with frame-level temporal tampering detection, without sacrificing video quality?
- Goal: Design an in-generation video watermarking scheme that supports arbitrary keys, per-frame watermarking, and temporal tampering detection with negligible computational overhead.
- Key Insight: Rather than perturbing pixels or noise, the method learns a dictionary of low-rank basis shifts and selectively displaces the generative model's parameters according to the watermark key.
- Core Idea: Learn a fixed LoRA basis shift dictionary; the watermark key for each frame determines which basis shift is selected per layer, thereby embedding per-frame watermarks in the decoder parameter space, without inference overhead or per-key retraining.
Method¶
Overall Architecture¶
The SPDMark pipeline proceeds as follows: (1) Given a video-level key \(K_{base}\), a unique watermark message \(\kappa_t\) is generated for each frame via a cryptographic hash function; (2) each \(\kappa_t\) is mapped to a binary mask \(\mathbf{b}(\kappa_t)\) that selects one LoRA basis shift per decoder layer; (3) a watermarked video \(\tilde{\mathbf{x}}\) is generated using the displaced decoder; (4) after per-frame watermark extraction, maximum bipartite matching and hypothesis testing are applied to verify watermark validity and localize temporal tampering.
Key Designs¶
- Selective Parameter Displacement Framework:
  - Function: Encodes the watermark key as a parameter displacement in the generative model.
  - Mechanism: The model parameters are partitioned into an unmodified set \(\Phi_U\) and a modifiable set \(\Phi_M\) (decoder only). \(\Phi_M\) spans \(L\) layers, each with \(P\) basis shifts \(\zeta_{\ell,p}\); the displacement is \(\Delta\phi_\ell = \sum_{p=1}^P b_{\ell,p} \zeta_{\ell,p}\). The key-to-mask mapping splits an \(M = L\log_2 P\)-bit key into \(L\) chunks, where the decimal value of each chunk selects the basis shift for that layer. In practice, exactly one basis shift is selected per layer: \(\Delta\Phi_M(\kappa) = [\zeta_{1,i_1+1}, \ldots, \zeta_{L,i_L+1}]^T\).
  - Design Motivation: The full parameter displacement space is too large to be learnable directly. Decomposing it into a layer-wise basis shift selection problem drastically reduces the search space. A fixed dictionary supports arbitrary keys without per-key retraining.
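The key-to-mask mapping above can be sketched in a few lines. This is a minimal illustration under the paper's setting (\(L=14\), \(P=4\)); the function name and interface are hypothetical, not from the paper.

```python
def key_to_indices(key_bits, num_layers, shifts_per_layer):
    """Split an M = L*log2(P)-bit key into L chunks; each chunk's
    integer value selects one of the P basis shifts for its layer.
    Assumes shifts_per_layer is a power of two."""
    bits_per_layer = shifts_per_layer.bit_length() - 1  # log2(P)
    assert len(key_bits) == num_layers * bits_per_layer
    indices = []
    for layer in range(num_layers):
        chunk = key_bits[layer * bits_per_layer:(layer + 1) * bits_per_layer]
        indices.append(int("".join(map(str, chunk)), 2))
    return indices

# Paper setting: L=14 layers, P=4 shifts -> 28-bit key, 2 bits per layer.
key = [1, 0, 1, 1] * 7              # 28 bits
print(key_to_indices(key, 14, 4))   # one index in {0,...,3} per layer
```

Each returned index \(i_\ell\) picks the basis shift \(\zeta_{\ell,i_\ell+1}\) for layer \(\ell\), so any 28-bit key maps to a valid displacement without retraining.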
- LoRA-Based Parameter-Efficient Implementation:
  - Function: Implements basis shifts in a parameter-efficient manner.
  - Mechanism: Each basis shift is factorized as \(\zeta_{\ell,p} = A_{\ell,p} B_{\ell,p}\), where \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times d}\), and \(r \ll d\) (with \(r=32\) in the paper). The displaced layer output is \(\mathbf{h}_\ell = \mathcal{F}_{\phi_\ell}(\mathbf{h}_{\ell-1}) + \alpha \mathcal{F}_{\Delta\phi_\ell}(\mathbf{h}_{\ell-1})\). The method is applied to \(L=14\) spatial ResNet blocks in the decoder, each with \(P=4\) LoRA modules, yielding \(\log_2 4 = 2\) bits per layer and a per-frame payload of 28 bits.
  - Design Motivation: Learning full-rank shift parameters directly is prohibitively expensive. Low-rank LoRA decomposition preserves expressiveness while drastically reducing parameter count, making the scheme deployable on large-scale models.
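A toy NumPy sketch of the displaced forward pass for one linear layer, assuming a simplified layer \(\mathcal{F}_{\phi_\ell}(\mathbf{h}) = \mathbf{h}W\); the shapes and scales are illustrative, not the paper's actual decoder blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 32, 1.0

# Frozen base weight phi_l and a dictionary of P=4 low-rank
# basis shifts zeta_{l,p} = A_p B_p (rank r << d).
W = rng.standard_normal((d, d)) * 0.02
A = [rng.standard_normal((d, r)) * 0.01 for _ in range(4)]
B = [rng.standard_normal((r, d)) * 0.01 for _ in range(4)]

def displaced_layer(h, shift_idx):
    """h_l = F_phi(h) + alpha * F_{delta phi}(h); the watermark key
    selects which basis shift is active for this layer."""
    base = h @ W
    lora = (h @ A[shift_idx]) @ B[shift_idx]
    return base + alpha * lora

h = rng.standard_normal((1, d))
out = displaced_layer(h, shift_idx=2)
print(out.shape)  # (1, 64)
```

Because only the small \(A_p, B_p\) factors are stored per layer, switching keys amounts to indexing a different pair, with no extra inference cost beyond the LoRA branch itself.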
- Per-Frame Watermarking and Temporal Tampering Detection:
  - Function: Embeds a unique per-frame watermark message and supports frame-level tampering localization.
  - Mechanism: A frame-level message is generated as \(\kappa_t = \text{Trunc}_M(\mathcal{H}(K_{base}, t))\) using HMAC-SHA256. During extraction, a ResNet-50 produces 28-dimensional logits per frame. For verification, a bipartite graph is constructed from reference messages \(\mathbf{K}\) and extracted messages \(\hat{\mathbf{K}}\), with edge weights defined by Hamming similarity \(\bar{S}_{m,n} = 1 - \psi(\kappa_m, \hat{\kappa}_n)/M\). The Hungarian algorithm finds the maximum-weight matching, followed by binomial hypothesis testing (frame-level threshold \(\tau_f\) and video-level threshold \(\tau_v\)) to determine watermark validity. Unmatched frames are identified as tampered.
  - Design Motivation: Per-frame unique messages enable frame-level deletion, swapping, and insertion to be detected through matching failures, a capability unavailable to prior methods that embed only a single fixed signature.
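The message-generation and matching steps can be sketched with the standard library alone. This is a hedged toy version: the paper uses the Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment`), which a brute-force search over permutations stands in for here, and the encoding of the frame index into HMAC is an assumption.

```python
import hmac, hashlib
from itertools import permutations

M = 28  # per-frame payload bits

def frame_message(base_key: bytes, t: int, nbits: int = M):
    """kappa_t = Trunc_M(HMAC-SHA256(K_base, t)) as a bit list."""
    digest = hmac.new(base_key, str(t).encode(), hashlib.sha256).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    return [int(b) for b in bits[:nbits]]

def similarity(a, b):
    """Hamming similarity S = 1 - psi(a, b) / M."""
    return 1 - sum(x != y for x, y in zip(a, b)) / len(a)

def best_matching(refs, extracted):
    """Max-weight bipartite matching (brute force stands in for the
    Hungarian algorithm to keep the sketch dependency-free)."""
    best, best_score = None, -1.0
    for perm in permutations(range(len(extracted))):
        score = sum(similarity(refs[m], extracted[n])
                    for m, n in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return best

key = b"K_base"
refs = [frame_message(key, t) for t in range(3)]
extracted = [refs[2], refs[0], refs[1]]   # simulated frame swap
print(best_matching(refs, extracted))     # recovers the swap permutation
```

In the full scheme, matched frames whose similarity falls below \(\tau_f\), or frames left unmatched, are flagged as tampered, and a binomial test over the matched set decides video-level validity against \(\tau_v\).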
Loss & Training¶
The total loss is \(\min_{\zeta,\eta} \mathcal{L}_{imp}(\mathbf{x}, \tilde{\mathbf{x}}) + \mathcal{L}_{rec}(\mathcal{V}_\eta(\tilde{\mathbf{x}}), \kappa)\). Message recovery uses BCEWithLogitsLoss; the imperceptibility loss is \(\mathcal{L}_{imp} = \lambda_{ps} \mathbb{E}_t[\text{LPIPS}(x_t, \tilde{x}_t)] + \lambda_{tc} \mathbb{E}_t[\|\delta y_t - \delta \tilde{y}_t\|_1]\), where LPIPS ensures perceptual similarity and the temporal consistency term (L1 on luminance differences) suppresses flickering. Training is conducted on 10,000 videos from OpenVid-1M, optimizing over expectations of \(\kappa, \mathbf{c}, \mathbf{z}\). The extractor is a ResNet-50 (ImageNet pretrained); at inference, batch normalization over all frames of the test video is applied to stabilize predictions.
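The temporal consistency term can be illustrated directly: it compares successive luminance differences of the original and watermarked clips so that the watermark does not introduce flicker. A minimal NumPy sketch, assuming BT.601 luma coefficients (the paper does not specify the luminance formula):

```python
import numpy as np

def luminance(frames):
    """Approximate luma y = 0.299 R + 0.587 G + 0.114 B (BT.601)."""
    return frames @ np.array([0.299, 0.587, 0.114])

def temporal_consistency_loss(x, x_wm):
    """E_t || delta y_t - delta y~_t ||_1 : L1 between successive
    luminance differences of original and watermarked videos."""
    dy = np.diff(luminance(x), axis=0)        # (T-1, H, W)
    dy_wm = np.diff(luminance(x_wm), axis=0)
    return np.abs(dy - dy_wm).mean()

rng = np.random.default_rng(0)
x = rng.random((8, 16, 16, 3))                # T, H, W, C clip
x_wm = x + 0.01 * rng.random(x.shape)         # lightly perturbed clip
print(temporal_consistency_loss(x, x_wm))
```

The loss is zero when the watermark shifts every frame's luminance identically, i.e., it penalizes frame-to-frame variation of the residual rather than the residual itself.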
Key Experimental Results¶
Main Results (Video Quality + Watermark Detection)¶
SVD-XT Model:
| Method | Payload | Bit Acc↑ | SC↑ | BC↑ | MS↑ | IQ↑ |
|---|---|---|---|---|---|---|
| VideoShield | 512 | 0.979 | 0.954 | 0.954 | 0.956 | 0.695 |
| VideoSeal | 256 | 0.999 | 0.955 | 0.950 | 0.961 | 0.682 |
| VidSig | 48 | 0.958 | 0.951 | 0.953 | 0.956 | 0.693 |
| SPDMark | 28×25 | 0.995 | 0.966 | 0.958 | 0.975 | 0.690 |
Robustness Evaluation (SVD-XT Average Bit Acc)¶
| Method | Photometric Attacks | Temporal Attacks | Post-processing | Average |
|---|---|---|---|---|
| VideoShield | ~0.82 | ~0.94 | ~0.83 | 0.833 |
| VideoSeal | ~0.94 | ~1.00 | ~0.82 | 0.912 |
| VidSig | ~0.66 | ~0.96 | ~0.53 | 0.685 |
| SPDMark | ~0.94 | ~0.99 | ~0.89 | 0.935 |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| Full SPDMark | Avg Bit Acc 0.935 | Full model |
| SPDMark on ModelScope | High Avg Bit Acc | Generalizes across architectures (UNet→DiT) |
| Temporal tampering localization | High Precision/Recall/F1 | Detects frame deletion/insertion/swapping |
Key Findings¶
- SPDMark consistently outperforms all baselines on video quality metrics (SC/BC/MS), indicating that parameter displacement minimally affects visual quality.
- SPDMark achieves an average Bit Acc of 0.935 on robustness benchmarks, surpassing VideoSeal (0.912) and VideoShield (0.833).
- Under screen recording attacks, SPDMark achieves 0.837, far exceeding VideoSeal's 0.598, demonstrating that in-generation watermarks are more robust than post-processing approaches.
- Under the Crop&Drop compound attack, SPDMark (0.856) significantly outperforms competing methods (0.458–0.513).
- Per-frame watermarking enables detection and localization of temporal tampering (frame deletion, swapping, and insertion).
Highlights & Insights¶
- Watermarking in parameter space is an elegant paradigm shift: Rather than operating in pixel or noise space, embedding watermarks directly in the model parameter space inherits the model's generative quality by design, with negligible overhead.
- LoRA basis shift dictionary supports unlimited keys: Once the dictionary is trained, any new key requires only a different combination of basis shifts—no retraining needed. This is far more efficient than per-key fine-tuning.
- Cryptographic hash for per-frame messages + Hungarian matching for verification: Combining cryptographic tools with graph matching algorithms elegantly solves the temporal tampering detection problem. This framework is generalizable to other scenarios requiring sequence integrity verification.
Limitations & Future Work¶
- Each frame carries only 28 bits of payload (14 layers × 2 bits/layer), so capacity is limited; increasing it requires more LoRA bases per layer or additional modified layers.
- Watermarking is applied only to the decoder; if an adversary replaces the decoder, the watermark is invalidated (though this is unlikely in API-controlled deployment scenarios).
- The ResNet-50 extractor is relatively lightweight and may lack robustness under extreme attacks (e.g., high-ratio H.265 compression).
- Training requires paired watermarked/non-watermarked videos, incurring non-trivial data costs.
Related Work & Insights¶
- vs. VideoShield: Operates in noise space with DDIM inversion, incurring high computational cost; Bit Acc drops to only 0.521 under crop attacks. SPDMark avoids inversion entirely.
- vs. VideoSeal: A post-processing method that degrades severely under screen recording (0.598). SPDMark leverages generative priors for greater robustness.
- vs. VidSig: Freezes PAS layers with temporal alignment but embeds only a fixed signature, precluding temporal tampering detection. SPDMark's per-frame mechanism is significantly more flexible.
- vs. AQuaLoRA: An image-level LoRA watermarking method; SPDMark extends the paradigm to video and incorporates temporal consistency and tampering detection.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The parameter displacement framework and LoRA basis shift dictionary design are highly original; the temporal tampering detection mechanism is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers two generative architectures and multiple attack types; ablation experiments could be more extensive.
- Writing Quality: ⭐⭐⭐⭐ Formal derivations are clear, though the dense notation requires careful reading.
- Value: ⭐⭐⭐⭐⭐ A highly practical video watermarking scheme directly deployable in video generation API services.