From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models without Finetuning¶
Conference: AAAI 2026 arXiv: 2511.07049 Authors: Hui Lu, Yi Yu, Song Xia, Yiming Yang, Deepu Rajan, Boon Poh Ng, Alex Kot, Xudong Jiang (NTU, Singapore) Code: aloe101/TVA Area: Self-Supervised Learning Keywords: Adversarial Attack, Video Foundation Models, Transfer Attack, Contrastive Learning, Temporal Consistency, Multimodal Large Language Models
TL;DR¶
This paper proposes Transferable Video Attack (TVA), which generates adversarial perturbations solely by exploiting the embedding space of open-source Video Foundation Models (VFMs), without any knowledge of downstream tasks, and effectively attacks downstream models and multimodal LLMs across 24 video tasks.
Background & Motivation¶
State of the Field¶
Large-scale Video Foundation Models (VFMs) such as VideoMAE and InternVideo have achieved superior performance on video understanding tasks and are widely adopted for downstream fine-tuning or as visual encoders in multimodal LLMs. However, their open-source nature introduces security risks, as adversaries can exploit publicly available model parameters to mount adversarial attacks.
Limitations of Prior Work¶
- Traditional transfer attacks rely on task-aligned surrogate models: they assume the attacker knows the victim model's task type and training data distribution (e.g., action recognition on Kinetics-400), and train a similar surrogate model to generate adversarial examples.
- Privacy and legal constraints render such assumptions impractical, as sensitive data is protected and attackers often cannot access downstream training data.
- Existing video adversarial attacks primarily target the single task of action recognition and lack a unified attack framework for diverse video tasks (detection, segmentation, VQA, etc.).
- The growing scale of VFMs makes training task-aligned surrogate models computationally prohibitive, further limiting the practicality of conventional approaches.
Root Cause¶
The paper investigates a more realistic threat scenario: the attacker possesses only an open-source VFM (e.g., VideoMAE pre-trained weights), with no knowledge of the downstream task type, training data, model architecture, or outputs, and directly generates transferable adversarial perturbations in the VFM embedding space to attack diverse downstream applications.
Method¶
Overall Architecture¶
TVA consists of three complementary components: (1) a self-supervised embedding-layer attack that generates base perturbations; (2) a bidirectional temporally-aware contrastive loss that enhances cross-model transferability; and (3) a temporal consistency loss that disrupts inter-frame temporal coherence.
Component 1: Self-Supervised Embedding-Layer Attack¶
Given a frozen VFM \(f_\phi\) and input video \(\bm{x} \in \mathbb{R}^{T \times C \times H \times W}\), the embedding is extracted as \(\bm{z} = f_\phi(\bm{x}) \in \mathbb{R}^{T \times D}\). The attack objective is to deviate the adversarial embedding \(\bm{z}^{adv} = f_\phi(\bm{x} + \bm{\delta})\) from the clean embedding, maximizing an L1 loss \(\mathcal{L} = \|\bm{z}^{adv} - \bm{z}\|_1\) to promote sparse deviation.
Perturbations are updated iteratively via I-FGSM: \(\bm{\delta}_{t+1} = \text{clip}_\epsilon\{\bm{\delta}_t + \alpha \cdot \text{sign}(\nabla_{\bm{\delta}_t} \mathcal{L})\}\). This component requires no downstream labels or task outputs and is entirely self-supervised.
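As a minimal sketch of this component, the snippet below runs the L1 embedding-deviation attack with I-FGSM on a toy frozen encoder. A fixed linear map stands in for the VFM (the paper uses models such as VideoMAE), so the gradient is analytic and no autodiff is needed; all names and hyperparameter values are illustrative, not the paper's.

```python
import numpy as np

# Toy stand-in for a frozen VFM encoder: a fixed linear map f(x) = W @ x.
# Loss L(delta) = ||f(x + delta) - f(x)||_1 = ||W @ delta||_1, maximized
# with I-FGSM under an l_inf budget. All values here are illustrative.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))            # frozen "encoder" weights
x = rng.standard_normal(16)                 # flattened clean video
eps, alpha, steps = 8 / 255, 2 / 255, 10    # l_inf budget, step size, iters

delta = rng.uniform(-alpha, alpha, size=x.shape)  # small random start
for _ in range(steps):
    # Analytic gradient of ||W @ delta||_1 w.r.t. delta: W^T sign(W @ delta)
    g = W.T @ np.sign(W @ delta)
    delta = np.clip(delta + alpha * np.sign(g), -eps, eps)  # I-FGSM step

# Resulting embedding deviation of the adversarial video
deviation = np.abs(W @ (x + delta) - W @ x).sum()
```

Note that the update is gradient *ascent* (the loss is maximized), and `clip` keeps every step inside the \(\ell_\infty\) ball, mirroring the I-FGSM rule above.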
Component 2: Bidirectional Temporally-Aware Contrastive Loss (Bi-con Loss)¶
Gradient mismatch problem: The paper theoretically analyzes the perturbation update deviation between surrogate and victim models. For fine-tuned downstream models (Form (a)), the deviation arises from accumulated gradients through residual transformations of per-layer parameter changes; for models with a frozen backbone and task head (Form (b)), it stems from the gradient contribution of the task head.
Gradient asymmetry in unidirectional contrastive loss: The paper proves (Theorem 2) that the gradient prefix factors differ between \(\mathcal{L}_{clean \to adv}\) (anchored on clean features) and \(\mathcal{L}_{adv \to clean}\) (anchored on adversarial features): the reverse gradient contains an additional weighted negative-sample term \(\sum_{j \neq i} q_j \bm{z}_{(j)}\), so the two directional gradients are unequal.
Bidirectional loss design: Gradient asymmetry is eliminated by averaging the contrastive losses from the two directions, \(\mathcal{L}_{bi} = \tfrac{1}{2}\big(\mathcal{L}_{clean \to adv} + \mathcal{L}_{adv \to clean}\big)\).
This design operates at the frame level: each clean frame is contrasted against all adversarial frames in the batch, which enlarges negative-sample diversity, enhances temporal saliency, and avoids the neglect of temporal information inherent in video-level methods.
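A frame-level bidirectional InfoNCE loss of this kind can be sketched as below. The temperature, the cosine normalization, and all names are our assumptions for illustration, not taken from the paper.

```python
import numpy as np

def info_nce(anchors, candidates, tau=0.1):
    """InfoNCE over T frames: match anchor frame i to candidate frame i
    among all candidates (temperature tau is an assumed value)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = (a @ c.T) / tau                     # (T, T) frame similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(logp)))

def bi_con_loss(z_clean, z_adv, tau=0.1):
    # Averaging the clean->adv and adv->clean directions removes the
    # gradient asymmetry of the unidirectional loss (sketch only).
    return 0.5 * (info_nce(z_clean, z_adv, tau) + info_nce(z_adv, z_clean, tau))

rng = np.random.default_rng(1)
z = rng.standard_normal((4, 8))   # T=4 frame embeddings of dimension 8
aligned = bi_con_loss(z, z)       # identical features: low loss
opposed = bi_con_loss(z, -z)      # flipped features: high loss
```

The attack would maximize this loss with respect to the perturbation, pushing each adversarial frame away from its clean counterpart while using all other frames in the batch as negatives.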
Component 3: Temporal Consistency Loss (TC Loss)¶
Video models rely on inter-frame temporal coherence for understanding. TVA disrupts this coherence by penalizing the similarity between embeddings of adjacent adversarial frames.
This loss forces adversarial features of adjacent frames to diverge directionally, undermining the temporal priors upon which video models depend.
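One way to sketch this term is as the mean similarity of adjacent adversarial frame embeddings; using cosine similarity, and the sign convention in the comment, are our assumptions rather than the paper's exact formulation.

```python
import numpy as np

def tc_loss(z_adv):
    """Mean cosine similarity between adjacent adversarial frame embeddings
    (shape: T x D). Subtracting this term from the attack objective, which
    is maximized, drives neighbouring frames apart; cosine similarity and
    this sign convention are assumptions for the sketch."""
    z = z_adv / np.linalg.norm(z_adv, axis=1, keepdims=True)
    return float(np.mean(np.sum(z[:-1] * z[1:], axis=1)))

smooth = tc_loss(np.ones((4, 8)))                      # identical frames -> 1.0
jagged = tc_loss(np.array([[1.0, 0.0], [-1.0, 0.0],
                           [1.0, 0.0], [-1.0, 0.0]]))  # alternating frames -> -1.0
```

Driving this quantity toward -1 makes consecutive adversarial frames point in opposing directions in embedding space, which is exactly the temporal-prior disruption described above.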
Joint Optimization Objective¶
The three losses are combined into a single joint attack objective.
These terms enhance perturbation transferability along three dimensions: spatial deviation, semantic alignment disruption, and temporal coherence destruction.
Key Experimental Results¶
Experiment 1: Transfer Attack on Temporal Action Detection (TAD)¶
Surrogate model: VideoMAE-Base (original pre-trained weights). Victim models: ActionFormer, Tridet, DyFaDet (frozen backbone), and AdaTAD (fine-tuned backbone, SOTA end-to-end method).
| Attack Method | ActionFormer | Tridet | DyFaDet | AdaTAD | Avg. mAP (%)↓ |
|---|---|---|---|---|---|
| No Attack | 50.40 | 49.85 | 49.85 | 53.17 | 50.07 |
| I-FGSM | 7.59 | 6.94 | 8.29 | 21.08 | 10.98 |
| MI-FGSM | 6.73 | 6.77 | 7.02 | 19.18 | 9.93 |
| FTM (prev. strongest) | 3.55 | 3.40 | 4.60 | 14.17 | 6.43 |
| BSR | 46.11 | 46.99 | 50.30 | 52.50 | 48.98 |
| TVA + MI-FGSM | 0.12 | 0.44 | 0.29 | 4.07 | 1.23 |
| TVA + FTM | 0.79 | 0.40 | 0.45 | 3.05 | 1.17 |
TVA drives the frozen-backbone models to near-zero mAP and reduces the SOTA end-to-end model AdaTAD from 53.17% to 3.05% mAP, far below the previous strongest attack FTM (14.17%).
Experiment 2: Multimodal Video Task Attack on MVBench¶
Surrogate model: LanguageBind. Victim model: VideoLLaVA. Results reported as Attack Success Rate (ASR%↑).
| Task | I-FGSM | MI-FGSM | BSR | TVA+MI |
|---|---|---|---|---|
| Action Sequence | 38.04 | 28.19 | 11.96 | 47.83 |
| Fine-grained Action | 53.09 | 38.50 | 18.52 | 79.01 |
| Object Interaction | 37.11 | 35.50 | 12.37 | 68.04 |
| Scene Transition | 19.41 | 12.50 | 3.53 | 52.94 |
| Character Order | 34.57 | 31.00 | 22.22 | 54.32 |
| Avg. over 20 tasks | 29.52 | 25.79 | 15.76 | 42.10 |
TVA substantially outperforms all baselines across all 20 video understanding sub-tasks, achieving an average ASR of 42.10%, exceeding the strongest baseline by 12.58 percentage points.
Experiment 3: Cross-Model Transfer on SEEDBench¶
| Surrogate→Victim | Task | I-FGSM | MI-FGSM | X-Transfer | AnyAttack | TVA |
|---|---|---|---|---|---|---|
| SigLIP→LLaVA-NeXT | AR | 38.03 | 67.61 | 32.39 | 22.54 | 76.06 |
| SigLIP→LLaVA-NeXT | AP | 40.00 | 70.00 | 38.57 | 52.86 | 68.57 |
| SigLIP→LLaVA-NeXT | Avg. | 36.86 | 59.05 | 32.18 | 33.66 | 61.39 |
| LanguageBind→VideoLLaVA | Avg. | 33.31 | 3.84 | - | - | 53.87 |
TVA also transfers to commercial models, achieving 48.8% ASR against Gemini-2.0-flash and 33.3% ASR against GPT5-mini under \(\epsilon=16/255\) using SigLIP as surrogate.
Key Findings¶
- Ablation study: The three components are complementary—removing any one degrades performance, with Bi-con Loss contributing most (removal raises average mAP from 1.23% to 9.93%).
- Frame-level vs. video-level contrastive: Frame-level bidirectional contrastive significantly outperforms video-level and unidirectional variants, validating the importance of fine-grained temporal-aware attacks.
- BSR is nearly ineffective: BSR (block-wise transformation) achieves negligible attack performance in this setting (mAP barely changes), indicating that conventional augmentation-based methods fail in VFM embedding-space attacks.
- End-to-end fine-tuned models are more robust: AdaTAD (fine-tuned backbone) is harder to attack than frozen-backbone models, consistent with the theoretical analysis showing larger gradient deviation for Form (a).
Highlights & Insights¶
- Novel threat model: This is the first systematic study of attacking video downstream models and multimodal LLMs using only open-source VFMs without any downstream task knowledge, representing a weaker and more realistic assumption than conventional transfer attacks.
- Strong theoretical grounding: Theorem 1 quantifies the perturbation update deviation between surrogate and victim models; Theorem 2 proves the gradient asymmetry of unidirectional contrastive loss, providing theoretical motivation for the bidirectional design.
- Large-scale validation across 24 tasks: Experiments cover temporal action detection, 20 MVBench sub-tasks, and 3 SEEDBench sub-tasks, spanning frozen-backbone, fine-tuned backbone, and multimodal LLM deployment scenarios.
- Plug-and-play: Bi-con Loss and TC Loss can be combined with arbitrary gradient-based attack methods including I-FGSM, MI-FGSM, and FTM.
- Transfer to commercial models: The paper demonstrates attack capability against Gemini and GPT5-mini.
Limitations & Future Work¶
- Only \(\ell_\infty\) perturbations are considered: Attack efficacy under \(\ell_2\) or perceptual constraints (e.g., LPIPS) is not evaluated; perceptual quality may be more critical in practical scenarios.
- Insufficient defense evaluation: Only data augmentation defenses are briefly tested in the appendix; stronger defenses such as adversarial training and certified defenses are not assessed.
- Sensitivity to iteration count: TAD experiments use only 4 iterations while others use 20; the trade-off between iteration count and transferability is not thoroughly analyzed.
- Surrogate and victim models must share the same model family: TVA's transferability relies on downstream models being initialized from the same VFM; cross-architecture-family transfer (e.g., ViT→CNN) is not validated.
- No evaluation on video generative models: Coverage is limited to discriminative tasks; emerging applications such as video generation and editing are not considered.
Related Work & Insights¶
- X-Transfer / AnyAttack: Require training a super-ensemble or retraining a surrogate model at high computational cost; TVA directly uses a frozen VFM without additional training.
- FTM: Improves transferability by mixing attacked and clean features, but still operates under a task-aligned assumption; TVA is entirely task-agnostic.
- AdvCLIP / Downstream-agnostic: Downstream-agnostic attacks designed for the image domain that do not address video temporal characteristics; TVA's Bi-con and TC Loss are specifically designed for the temporal structure of video.
- Traditional augmentation methods (DI/TI/SI/BSR): Exhibit limited effectiveness in VFM embedding-space attack settings (BSR nearly fails); TVA achieves more fundamental feature disruption through contrastive learning in the embedding space.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic study of task-agnostic video attacks in the VFM embedding space; the threat model is well-defined and novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 24 tasks, diverse surrogate–victim combinations, complete ablations, and commercial model evaluations.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear, experimental organization is systematic, and figures are highly informative.
- Value: ⭐⭐⭐⭐ — Reveals important security vulnerabilities in VFM deployment; the plug-and-play method offers strong practical utility.