TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking

Conference: CVPR2026 arXiv: 2602.18863 Code: To be confirmed Area: Object Detection (actually Multimedia Security / Watermarking) Keywords: Zero-watermarking, cross-modal alignment, learnable data augmentation, camera robustness, CLIP, adversarial training, invariant feature learning

TL;DR

This paper proposes TIACam, a framework that simulates camera distortions via a learnable auto-augmentor, learns invariant features through text-anchored cross-modal adversarial training, and binds binary messages to features via a zero-watermarking head—achieving camera-robust zero-watermarking without modifying any image pixels. TIACam attains state-of-the-art bit accuracy across three real-world scenarios: screen recapture, print-and-scan, and screenshot.

Background & Motivation

  1. Zero-watermarking paradigm: Conventional watermarking modifies images in the spatial or transform domain; zero-watermarking instead associates the watermark with intrinsic image features without altering pixels, balancing invisibility with verification reliability.
  2. Camera recapture challenge: Camera recapture introduces compound, spatially coupled degradations—perspective distortion, illumination variation, sensor noise, and Moiré patterns—making it one of the most challenging scenarios for watermark extraction.
  3. Limitations of handcrafted noise layers: Methods such as StegaStamp and PIMoG manually design camera noise layers, but real optical distortions vary with environment and are nonlinearly coupled, making fixed augmentations insufficient.
  4. Suboptimal pretrained features: The robustness of features from self-supervised models such as DINO is a byproduct of pretraining, not explicitly optimized for watermarking tasks.
  5. Insufficient single-source invariance: Invariance learning based solely on text guidance or solely on distortion adversarial training cannot simultaneously guarantee semantic consistency and distortion robustness.
  6. Lack of a unified framework: Existing methods treat augmentation, feature learning, and watermark binding as separate stages, lacking an end-to-end joint optimization mechanism.

Method

Overall Architecture

TIACam consists of three modules trained jointly in a tripartite adversarial loop:

  • Auto-Augmentor: A differentiable camera distortion simulation pipeline.
  • Text-Anchored Invariant Feature Learner: Cross-modal adversarial alignment based on CLIP.
  • Zero-Watermarking Head: Binding binary messages in the invariant feature space.

Learnable Auto-Augmentor

Six differentiable modules are cascaded:

| Module | Function | Key Parameters |
| --- | --- | --- |
| Geometric | Perspective/rotation/scaling transforms | Learnable 3×3 perspective matrix \(A\) |
| Photometric | Brightness/contrast/gamma | Learnable \(\alpha, \gamma, \beta\) |
| Additive Noise | Sensor noise | Reparameterized \(\sigma \cdot z,\ z \sim \mathcal{N}(0,1)\) |
| Filtering | Optical blur/lens smear | Learnable convolution kernel \(K\) |
| Compression | JPEG quantization & frequency masking | Smooth quantization + trainable mask \(M\) |
| Moiré | Sensor–display interference fringes | Learnable frequency \((f_x, f_y)\) and amplitude \(\alpha\) |

The composition formula is \(\hat{x} = \mathcal{T}_{\text{aug}}(x;\Theta) = \mathcal{T}_{\text{comp}} \circ \mathcal{T}_{\text{filter}} \circ \mathcal{T}_{\text{add}} \circ \mathcal{T}_{\text{photo}} \circ \mathcal{T}_{\text{geo}} \circ \mathcal{T}_{\text{moire}}(x)\).

Each module is pretrained on 10k samples of its corresponding distortion type using MSE+SSIM loss, then fine-tuned during overall adversarial training.
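To make the cascade concrete, here is a minimal PyTorch-style sketch of how three of the differentiable modules (photometric, additive noise, Moiré) and their composition could be implemented. The class names, parameterizations, and initial values are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class PhotometricJitter(nn.Module):
    """Illustrative photometric module with learnable brightness/contrast/gamma."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))    # contrast
        self.beta = nn.Parameter(torch.zeros(1))    # brightness
        self.gamma = nn.Parameter(torch.ones(1))    # gamma

    def forward(self, x):
        x = (self.alpha * x + self.beta).clamp(0.0, 1.0)
        return x.pow(self.gamma.clamp(min=0.1))


class AdditiveNoise(nn.Module):
    """Illustrative sensor-noise module using the reparameterization sigma * z."""

    def __init__(self, init_sigma=0.05):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.log(torch.tensor(init_sigma)))

    def forward(self, x):
        z = torch.randn_like(x)                     # z ~ N(0, 1)
        return (x + self.log_sigma.exp() * z).clamp(0.0, 1.0)


class MoirePattern(nn.Module):
    """Illustrative Moire module: a sinusoidal fringe with learnable
    frequency (f_x, f_y) and amplitude."""

    def __init__(self):
        super().__init__()
        self.freq = nn.Parameter(torch.tensor([8.0, 8.0]))  # cycles per image side
        self.amp = nn.Parameter(torch.tensor(0.05))

    def forward(self, x):
        _, _, h, w = x.shape
        ys = torch.linspace(0.0, 1.0, h, device=x.device).view(1, 1, h, 1)
        xs = torch.linspace(0.0, 1.0, w, device=x.device).view(1, 1, 1, w)
        fringe = torch.cos(2 * torch.pi * (self.freq[0] * xs + self.freq[1] * ys))
        return (x + self.amp * fringe).clamp(0.0, 1.0)


class AutoAugmentor(nn.Module):
    """Cascade of differentiable distortion modules (three of the six shown;
    geometric warp, filtering blur, and soft JPEG would slot in the same way)."""

    def __init__(self):
        super().__init__()
        self.pipeline = nn.Sequential(
            MoirePattern(),
            PhotometricJitter(),
            AdditiveNoise(),
        )

    def forward(self, x):                           # x: (B, 3, H, W) in [0, 1]
        return self.pipeline(x)
```

Because every operation is differentiable with respect to its parameters, gradients from the downstream adversarial losses can flow back into the augmentation strategy itself.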

Text-Anchored Invariant Feature Learner

  • Feature extractor \(f_\theta\): Frozen CLIP image encoder + trainable invariant feature extractor (3 residual blocks + projection head → 1024-dim).
  • Discriminator \(D_\psi\): 4-layer Transformer (8-head attention, hidden dim 512), receiving image–text feature pairs and judging semantic match.
  • Training objective: Image \(x\) and its augmented version \(\hat{x}\) are paired with positive/negative text anchors \(T^+/T^-\) to form real/fake pairs; the discriminator loss is \(\mathcal{L}_{\text{disc}}\) and the generator loss is \(\mathcal{L}_{\text{adv}}\).
  • Augmentor–extractor adversarial dynamic: The augmentor maximizes \(\mathcal{L}_{\text{inv}} - \lambda_{\text{sem}}\mathcal{L}_{\text{sem}}\); the extractor minimizes \(\mathcal{L}_{\text{inv}}\), where semantic fidelity is measured by cosine similarity from a frozen ViT.

Three-way alternating updates: ① Update \(D_\psi\) to improve paired discrimination → ② Update \(\Theta\) to generate stronger distortions → ③ Update \(f_\theta\) to align with positive text anchors while resisting distortions.
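A minimal sketch of the three-way alternating schedule follows, assuming generic `augmentor`, `extractor`, `discriminator`, and frozen-ViT interfaces; the loss helpers and weightings below are placeholders for the paper's \(\mathcal{L}_{\text{disc}}\), \(\mathcal{L}_{\text{adv}}\), \(\mathcal{L}_{\text{inv}}\), and \(\mathcal{L}_{\text{sem}}\), not their exact definitions.

```python
import torch
import torch.nn.functional as F


def inv_loss(f_clean, f_aug):
    """Invariance surrogate: 1 - cosine similarity between clean/distorted features."""
    return 1.0 - F.cosine_similarity(f_clean, f_aug, dim=-1).mean()


def sem_loss(frozen_vit, x, x_aug):
    """Semantic fidelity surrogate: feature drift under a frozen ViT."""
    with torch.no_grad():
        ref = frozen_vit(x)
    return 1.0 - F.cosine_similarity(frozen_vit(x_aug), ref, dim=-1).mean()


def disc_step(D, f_img, f_aug, t_pos, t_neg, opt):
    """Step 1: train D_psi to score matched (image, T+) pairs high and
    mismatched (distorted image, T-) pairs low."""
    real = D(f_img.detach(), t_pos)
    fake = D(f_aug.detach(), t_neg)
    loss = F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) + \
           F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))
    opt.zero_grad(); loss.backward(); opt.step()


def aug_step(augmentor, extractor, frozen_vit, x, opt, lambda_sem=1.0):
    """Step 2: train the augmentor Theta to maximise L_inv - lambda_sem * L_sem."""
    x_aug = augmentor(x)
    loss = -(inv_loss(extractor(x), extractor(x_aug))
             - lambda_sem * sem_loss(frozen_vit, x, x_aug))
    opt.zero_grad(); loss.backward(); opt.step()


def feat_step(extractor, augmentor, D, x, t_pos, opt):
    """Step 3: train f_theta to stay invariant to the (fixed) distortions and
    to align distorted features with the positive text anchor."""
    with torch.no_grad():
        x_aug = augmentor(x)
    f_img, f_aug = extractor(x), extractor(x_aug)
    adv = D(f_aug, t_pos)
    loss = inv_loss(f_img, f_aug) + \
           F.binary_cross_entropy_with_logits(adv, torch.ones_like(adv))
    opt.zero_grad(); loss.backward(); opt.step()
```

One pass over a batch would call `disc_step`, `aug_step`, and `feat_step` in that order, matching the ① → ② → ③ schedule above.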

Zero-Watermarking Head

  • Extract invariant features \(\tilde{F} = \Psi(f_\theta(x))\), where \(\Psi\) denotes global average pooling followed by linear projection.
  • Maintain a learnable reference matrix \(C \in \mathbb{R}^{k \times d}\), where the \(i\)-th row is the direction code for bit \(i\).
  • Prediction: \(\hat{W}_i = \sigma(\tilde{F} \cdot C_i)\)
  • Registration phase: For each image–message pair, optimize \(C\) and \(\Psi\) (BCE + L2 regularization) with \(f_\theta\) frozen.
  • Extraction phase: For a distorted image \(x'\), compute \(\tilde{F}' = \Psi(f_\theta(x'))\) and recover the binary message by thresholding at 0.5 (a minimal sketch of both phases follows below).
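Below is a minimal sketch of the zero-watermarking head, assuming \(\Psi\) is implemented as global average pooling over a backbone feature map followed by a linear projection; dimensions, step counts, and learning rates are illustrative placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZeroWatermarkHead(nn.Module):
    """Psi (GAP + linear projection) plus the learnable reference matrix C,
    whose i-th row is the direction code for bit i."""

    def __init__(self, feat_dim=1024, proj_dim=256, num_bits=100):
        super().__init__()
        self.proj = nn.Linear(feat_dim, proj_dim)
        self.C = nn.Parameter(torch.randn(num_bits, proj_dim))

    def forward(self, feat_map):                      # feat_map: (B, feat_dim, H, W)
        pooled = feat_map.mean(dim=(-2, -1))          # global average pooling
        f_tilde = self.proj(pooled)                   # F~ = Psi(f_theta(x))
        return torch.sigmoid(f_tilde @ self.C.t())    # W_i = sigma(F~ . C_i)


def register(head, feat_map, message, steps=200, lr=1e-2, weight_decay=1e-4):
    """Registration: optimise C and Psi (BCE + L2 via weight decay) for one
    image-message pair while the backbone f_theta stays frozen."""
    feat_map = feat_map.detach()
    opt = torch.optim.Adam(head.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(steps):
        loss = F.binary_cross_entropy(head(feat_map), message)
        opt.zero_grad(); loss.backward(); opt.step()


def extract(head, feat_map_distorted):
    """Extraction: recover the binary message from a distorted image's
    features by thresholding the predictions at 0.5."""
    with torch.no_grad():
        return (head(feat_map_distorted) > 0.5).float()
```

Verification then compares the recovered bits with the registered message, e.g. `(extract(head, feats) == message).float().mean()` gives the bit accuracy.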

Key Experimental Results

Feature Invariance (Cosine Similarity, Original vs. Distorted Images)

| Distortion Type | SimCLR | BYOL | Barlow | VICReg | VIbCReg | TIACam |
| --- | --- | --- | --- | --- | --- | --- |
| Additive Noise | 0.82 | 0.88 | 0.79 | 0.83 | 0.89 | 0.97 |
| Photometric Change | 0.84 | 0.84 | 0.81 | 0.76 | 0.88 | 0.93 |
| Perspective Transform | 0.87 | 0.85 | 0.87 | 0.83 | 0.88 | 0.95 |
| JPEG Compression | 0.79 | 0.80 | 0.87 | 0.81 | 0.73 | 0.98 |
| Moiré Pattern | 0.85 | 0.83 | 0.84 | 0.89 | 0.87 | 0.97 |
| Filtering Blur | 0.88 | 0.88 | 0.89 | 0.87 | 0.88 | 0.98 |
| All Combined | 0.74 | 0.71 | 0.74 | 0.77 | 0.77 | 0.94 |

Watermark Extraction Accuracy in Real-World Scenarios (Bit Accuracy %)

| Method | Screen (30-bit) | Screen (100-bit) | Print (30-bit) | Print (100-bit) | Screenshot (30-bit) | Screenshot (100-bit) |
| --- | --- | --- | --- | --- | --- | --- |
| HiDDeN | 70.6 | 68.8 | 67.1 | 65.7 | 74.5 | 70.6 |
| PIMoG | 82.3 | 80.1 | 75.7 | 72.3 | 79.7 | 78.6 |
| StegaStamp | 93.8 | 91.2 | 92.2 | 91.3 | 93.7 | 93.9 |
| TIACam | 99.1 | 98.2 | 96.6 | 95.1 | 97.4 | 95.2 |

Ablation Study: Contribution of the Invariant Feature Extractor

| Dataset | CLIP Only | CLIP + TIACam |
| --- | --- | --- |
| Visual Genome | 0.78 | 0.92 |
| Flickr | 0.84 | 0.93 |
| MSCOCO | 0.76 | 0.89 |
| ImageNet | 0.82 | 0.93 |

The feature extractor raises cosine similarity by roughly 0.09–0.14 (absolute) across the four datasets, demonstrating that the robustness gains stem from the proposed framework rather than from CLIP pretraining alone.

Feature Discriminability Test

Across 200 image pairs generated from identical captions, only the registered image achieves 100% watermark recovery; the unregistered image's features average 0.73 cosine similarity with the registered one's, and its extraction accuracy drops to ~84%, indicating that the framework preserves inter-instance visual discriminability alongside invariance.

Highlights & Insights

  • Tripartite adversarial unified framework: The augmentor, feature extractor, and discriminator are jointly optimized, representing the first unification of distortion simulation and cross-modal alignment into a single training loop.
  • Fully differentiable augmentation pipeline: Six differentiable modules cover geometric, photometric, noise, filtering, compression, and Moiré distortions, enabling gradient backpropagation to optimize augmentation strategies.
  • No pixel modification required: The zero-watermarking paradigm leaves images entirely unaltered; message extraction relies solely on dot products and thresholding in the feature space.
  • Thorough real-world validation: TIACam substantially outperforms prior SOTA under three physically realistic degradation scenarios: screen recapture, print-and-scan, and screenshot.
  • No localization step required: The strong robustness of the invariant feature space enables direct watermark extraction from the full image without prior detection of watermark regions.

Limitations & Future Work

  • The paper's area label is object detection, but the actual domain is multimedia security/watermarking; the classification should be corrected.
  • Images are uniformly resized to 128×128; the framework's ability to preserve local features in high-resolution images is not thoroughly discussed.
  • Zero-watermark registration requires individually optimizing \(C\) and \(\Psi\) for each image–message pair; batch registration efficiency may be a deployment bottleneck.
  • Experiments are conducted solely on an RTX 4090; inference latency and feasibility of deployment on mobile or embedded devices are not discussed.
  • Semantically similar but visually distinct images still achieve ~84% accuracy (ideally this should be lower), raising concerns about cross-instance feature leakage in the feature space.
  • Acquiring text anchors (captions) in practice requires additional modules or manual provision.

Comparison with Prior Methods

| Method | Type | Augmentation Strategy | Feature Source | Camera Robustness |
| --- | --- | --- | --- | --- |
| HiDDeN | Embedding | Fixed noise layer | Self-trained CNN | Low |
| StegaStamp | Embedding | Handcrafted camera noise layer | Self-trained CNN | Medium-High |
| PIMoG | Embedding | Handcrafted projection noise | Self-trained CNN | Medium |
| InvZW | Zero-watermark | Distortion adversarial | Adversarial training | Medium |
| DINO-based | Zero-watermark | None | Pretrained SSL | Medium |
| TIACam | Zero-watermark | Learnable auto-augmentation | CLIP + adversarial training | High |

Core distinction: TIACam is the first method to unify a learnable augmentor, cross-modal text anchoring, and zero-watermarking within a single adversarial training framework.

Rating

  • Novelty: ⭐⭐⭐⭐ — The tripartite adversarial training framework and differentiable augmentation pipeline design are novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Synthetic and real-world scenarios, ablation studies, and discriminability tests are relatively comprehensive, though runtime efficiency analysis is absent.
  • Writing Quality: ⭐⭐⭐⭐ — Structure is clear, mathematical derivations are complete, and illustrations are intuitive.
  • Value: ⭐⭐⭐⭐ — Represents significant progress in camera-robust zero-watermarking, though practical deployment feasibility warrants further validation.