Anti-I2V: Safeguarding your photos from malicious image-to-video generation¶
Conference: CVPR 2026 · arXiv: 2603.24570 · Code: None · Area: Video Generation · Keywords: Adversarial Attack, Video Diffusion Models, Image Protection, Dual-Space Perturbation, Deep Feature Collapse
TL;DR¶
Anti-I2V proposes a defense against malicious image-to-video generation. By optimizing perturbations in both the L*a*b* color space and the DCT frequency domain, and by designing Internal Representation Collapse (IRC) and Internal Representation Anchor (IRA) losses that disrupt semantic feature propagation inside the denoising network, the method achieves state-of-the-art protection across three architecturally distinct models: CogVideoX, DynamiCrafter, and Open-Sora.
Background & Motivation¶
Background: Video diffusion models (VDMs) are advancing rapidly. Models such as CogVideoX and Open-Sora can generate realistic videos from a single image and a text prompt, introducing severe risks of deepfake misuse.
Limitations of Prior Work:
- Existing defenses primarily target text-to-image generation or specific architectures (e.g., SVD), with limited validation on large-scale DiT/MMDiT models;
- RGB-space perturbations are easily eliminated during multi-step denoising, resulting in insufficient robustness;
- Most methods attack only the final output (the VAE encoding or the tail of the denoising network), neglecting intermediate feature propagation.
Key Challenge: VDMs possess larger capacity and stronger temporal modeling; conventional perturbation strategies struggle to effectively interfere with them—necessitating deeper disruption strategies.
Key Insight: A two-pronged approach—optimizing perturbations in a more robust non-RGB space, while identifying semantically rich layers within the network and selectively disrupting their feature propagation.
Core Idea: L*a*b* + frequency-domain dual-space perturbation + deep-to-shallow feature collapse + cross-layer semantic anchoring = effective attack against large-scale VDMs.
Method¶
Overall Architecture¶
Input image \(x\) → caption generation via LVLM → reference video → dual-space perturbation optimization (L*a*b* + DCT) → IRC and IRA losses + diffusion loss + auxiliary loss → output protected image \(x_\xi\), causing severe degradation of VDM-generated video quality.
Key Designs¶
- Dual-Space Perturbation (DSP):
- Function: Optimizes adversarial noise in two non-RGB spaces: the L*a*b* color space and the DCT frequency domain.
- Mechanism:
- L*a*b* stage: Only the \(a^*\) and \(b^*\) channels (chrominance) are perturbed, leaving the \(L^*\) luminance channel intact, making perturbations less perceptible to human observers.
- DCT stage: Noise is injected into low-frequency DCT coefficients (which encode structural and textural information), disrupting deeper representations via frequency-domain perturbation.
- The two stages alternate updates, and the final perturbation is projected back into RGB space under the \(\Delta_{RGB}\) budget (see the sketch after this block).
- Design Motivation: Pixel-level RGB perturbations are prone to being "washed away" during multi-step denoising. L*a*b* is more perceptually uniform; low-frequency DCT coefficients correspond to core image structure, yielding more persistent perturbation effects.
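A minimal sketch of one alternating DSP update, assuming the gradients of the attack loss have already been backpropagated into each space (`grad_lab`, `grad_dct`); the helper names, step sizes, L∞ budget, and the \(k \times k\) low-frequency block are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative DSP step: names, step sizes, and budgets are assumptions.
import numpy as np
from scipy.fft import dctn, idctn
from skimage.color import rgb2lab, lab2rgb

EPS_RGB = 8.0 / 255.0  # assumed L_inf budget Delta_RGB

def lab_step(x_adv, grad_lab, alpha=0.5):
    """Signed-gradient descent step in L*a*b*, touching only the a*/b* channels."""
    lab = rgb2lab(x_adv)                                 # H x W x 3 float RGB in [0, 1]
    lab[..., 1:] -= alpha * np.sign(grad_lab[..., 1:])   # leave L* (luminance) intact
    return np.clip(lab2rgb(lab), 0.0, 1.0)

def dct_step(x_adv, grad_dct, alpha=0.05, k=16):
    """Signed-gradient descent step on low-frequency DCT coefficients."""
    coeffs = dctn(x_adv, axes=(0, 1), norm="ortho")      # 2-D DCT per channel
    coeffs[:k, :k] -= alpha * np.sign(grad_dct[:k, :k])  # low frequencies = structure
    return np.clip(idctn(coeffs, axes=(0, 1), norm="ortho"), 0.0, 1.0)

def project_rgb(x_adv, x_clean, eps=EPS_RGB):
    """Project the accumulated perturbation back into the RGB L_inf ball."""
    return np.clip(x_clean + np.clip(x_adv - x_clean, -eps, eps), 0.0, 1.0)
```

Alternating `lab_step` and `dct_step`, with `project_rgb` applied after each round, mirrors the alternating-update scheme described above.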
- Internal Representation Collapse (IRC):
- Function: Forces deep (semantically rich) layer features to degenerate toward shallow (low-semantic) layer features.
- Mechanism:
- PCA visualization reveals that high-level semantic features emerge after layer 19 in Open-Sora and after layer 27 in CogVideoX, while layer 3 exhibits almost no semantics.
- Loss: \(\mathcal{L}_{IRC}^{i,j} = \mathbb{E}\|\epsilon_\theta^j(z_t, z_\xi, t, y) - \epsilon_\theta^i(z_t, z_\xi, t, y)\|_2^2\)
- Features from the last 3 layers are aligned to those of layer 3.
- Design Motivation: By collapsing deep semantic features, the denoising process loses the ability to reconstruct meaningful structure; the effect cascades to all frames via the attention mechanism.
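A minimal sketch of the IRC term, assuming the denoiser exposes its transformer blocks as a list so activations can be captured with forward hooks; `collect_block_outputs` is a hypothetical helper, and `mse_loss` averages where the paper writes a squared L2 norm (same optimum up to scale).

```python
import torch
import torch.nn.functional as F

def collect_block_outputs(blocks):
    """Register forward hooks that cache each denoiser block's output by index."""
    cache, handles = {}, []
    for idx, blk in enumerate(blocks):
        handles.append(blk.register_forward_hook(
            lambda mod, inp, out, i=idx: cache.__setitem__(i, out)))
    return cache, handles

def irc_loss(cache: dict, shallow_idx: int = 3, num_deep: int = 3) -> torch.Tensor:
    """L_IRC: pull the deepest blocks' features toward the shallow block's features."""
    n = len(cache)
    shallow = cache[shallow_idx]  # low-semantic reference (layer 3 in the paper)
    # Sum over the last `num_deep` blocks j of || eps_theta^j - eps_theta^i ||^2.
    return sum(F.mse_loss(cache[j], shallow) for j in range(n - num_deep, n))
```

After a forward pass of the denoiser at a sampled timestep, `irc_loss(cache)` implements the "last 3 layers → layer 3" alignment described above.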
- Internal Representation Anchor (IRA):
- Function: At each layer of both the denoising module and the VAE, anchors the protected image's features to those of an unrelated target image.
- Mechanism:
- Total loss: \(\mathcal{L}_{IRA} = \mathcal{L}_{IRA,\epsilon_\theta} + \mathcal{L}_{IRA,E}\)
- Denoising module level: \(\|\epsilon_\theta^m(z_t, z_\xi, t, y) - \epsilon_\theta^m(z_t, z_\psi, t, y)\|_2^2\)
- VAE level: \(\|E^n(x_\xi) - E^n(x_\psi)\|_2^2\)
- Design Motivation: Rather than merely collapsing semantics (IRC), IRA actively steers features toward an incorrect direction, providing a complementary and more effective dual disruption.
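A minimal sketch of the IRA term, under the assumption that per-layer activations for the protected image \(x_\xi\) and the unrelated target \(x_\psi\) have already been collected (e.g., with the same hook helper as above); detaching the target features is an implementation choice, since \(x_\psi\) is fixed.

```python
import torch
import torch.nn.functional as F

def ira_loss(denoiser_feats_xi, denoiser_feats_psi,
             vae_feats_xi, vae_feats_psi) -> torch.Tensor:
    """L_IRA: anchor every layer's features to those of an unrelated target image."""
    # Denoising-module term: sum_m || eps^m(z_t, z_xi, t, y) - eps^m(z_t, z_psi, t, y) ||^2
    l_eps = sum(F.mse_loss(a, b.detach())
                for a, b in zip(denoiser_feats_xi, denoiser_feats_psi))
    # VAE-encoder term: sum_n || E^n(x_xi) - E^n(x_psi) ||^2
    l_vae = sum(F.mse_loss(a, b.detach())
                for a, b in zip(vae_feats_xi, vae_feats_psi))
    return l_eps + l_vae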
Final Objective¶
\(\mathcal{L}_{Anti-I2V} = \mathcal{L}_{IRC} + \mathcal{L}_{IRA} + \mathcal{L}_{auxiliary} - \mathcal{L}_{DM}\)
- Auxiliary loss: maximization of CLIP feature distance and LPIPS perceptual distance
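A minimal PGD-style outer loop tying the terms together, written as minimization of the objective above (so the \(-\mathcal{L}_{DM}\) term maximizes the diffusion loss); `loss_fn`, the step size, budget, and iteration count are hypothetical stand-ins rather than the paper's settings.

```python
import torch

def protect_image(x: torch.Tensor, loss_fn, steps: int = 100,
                  alpha: float = 1 / 255, eps: float = 8 / 255) -> torch.Tensor:
    """PGD on the combined objective; `loss_fn(x_adv)` is assumed to run the VDM
    once (with feature hooks) and return L_IRC + L_IRA + L_auxiliary - L_DM."""
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(x_adv)
        (grad,) = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                      # descend the objective
            x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)   # project into the budget
    return x_adv
```

Since every step requires a full VDM forward/backward pass to populate the feature caches behind \(\mathcal{L}_{IRC}\) and \(\mathcal{L}_{IRA}\), this loop is also where the computational overhead noted under Limitations arises.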
Key Experimental Results¶
Main Results (CelebV-Text Dataset)¶
| Model | Method | ISM↓ | C-FIQA↓ | Q-A(F)↓ | Q-A(V)↓ | DINO↓ |
|---|---|---|---|---|---|---|
| CogVideoX | Clean | 0.721 | 0.522 | 0.746 | 0.802 | 0.828 |
| CogVideoX | MIST | 0.561 | 0.463 | 0.476 | 0.577 | 0.750 |
| CogVideoX | Anti-I2V | 0.448 | 0.433 | 0.447 | 0.532 | 0.722 |
| DynamiCrafter | Clean | 0.528 | 0.467 | 0.724 | 0.794 | 0.622 |
| DynamiCrafter | AdvDM | 0.269 | 0.370 | 0.167 | 0.207 | 0.397 |
| DynamiCrafter | Anti-I2V | 0.151 | 0.303 | 0.032 | 0.047 | 0.167 |
Ablation Study¶
| Configuration | ISM↓ | Q-A(V)↓ | Note |
|---|---|---|---|
| RGB perturbation only | 0.583 | 0.543 | Baseline (similar to AdvDM) |
| + L*a*b* | 0.521 | 0.511 | Color-space perturbation more effective |
| + DCT | 0.498 | 0.496 | Frequency domain further improves results |
| + IRC | 0.472 | 0.558 | Semantic collapse is effective |
| + IRA | 0.460 | 0.540 | Anchor loss provides complementary gains |
| Full Anti-I2V | 0.448 | 0.532 | All components synergize optimally |
Key Findings¶
- The method achieves the most significant results on DynamiCrafter (UNet architecture), reducing Q-A(V) from 0.794 to 0.047.
- It also proves effective on CogVideoX (DiT architecture), validating cross-architecture generalization.
- A simple layer selection strategy (last 3 layers → layer 3) generalizes across different architectures.
Highlights & Insights¶
- First systematic study of adversarial perturbation optimization in non-RGB spaces; the L*a*b* + frequency-domain combination represents a promising new direction.
- The IRC loss is empirically grounded in a PCA analysis of layer-wise features in the denoising network.
- Applicable to three mainstream architectures (UNet, DiT, MMDiT), demonstrating strong practical utility.
Limitations & Future Work¶
- Perturbation optimization still requires white-box access to the target model; black-box transferability has not been sufficiently validated.
- Robustness against image preprocessing (JPEG compression, blurring) warrants further analysis.
- Computational efficiency: PGD-based iterative perturbation optimization incurs significant computational overhead.
Related Work & Insights¶
- The feature-anchoring loss shares conceptual similarities with MIST's textural loss but extends the disruption to individual layers of the network.
- The DSP approach can be generalized to other adversarial attack and defense scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of dual-space perturbation and layer-wise feature collapse is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three VDM architectures × two datasets with comprehensive ablation.
- Writing Quality: ⭐⭐⭐⭐ Technical details are thorough; PCA analysis is intuitive.
- Value: ⭐⭐⭐⭐ Significant practical implications for AI safety and privacy protection.