
Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection

Conference: ICCV 2025 · arXiv: 2501.01184 · Code: GitHub · Area: AI Safety · Keywords: deepfake detection, spatio-temporal learning, data synthesis, multi-task learning, vulnerability-aware

TL;DR

This paper proposes FakeSTormer, a fine-grained deepfake video detection framework that models temporal and spatial vulnerability regions simultaneously via multi-task learning, coupled with a Self-Blended Video (SBV) data-synthesis strategy that generates high-quality forgery samples. Trained exclusively on real data, it achieves state-of-the-art generalization across multiple cross-dataset benchmarks.

Background & Motivation

With advances in generative AI, deepfake videos have become increasingly realistic, posing serious threats to societal security. Existing methods face the following core challenges:

Insufficient generalization: Most methods rely on binary classifiers that perform well on seen forgery types but degrade significantly on unseen manipulation methods, because the binary classification objective encourages overfitting to the specific forgery artifacts seen during training.

Poor robustness to high-quality forgeries: As forgery techniques improve, spatio-temporal artifacts become increasingly subtle, and existing implicit attention mechanisms fail to reliably capture these fine-grained artifacts.

The image-level vs. video-level gap: In image-level forgery detection, multi-task learning combined with vulnerability awareness and refined data synthesis has proven effective (e.g., LAA-Net), but extending this to the video level is non-trivial — it requires simultaneously handling fundamentally different temporal and spatial artifacts while maintaining temporal consistency in video-level data synthesis.

Core Idea: Reformulate deepfake video detection as a fine-grained detection task, employing a three-branch multi-task framework to separately learn classification, spatial vulnerability, and temporal vulnerability, complemented by high-quality Self-Blended Video (SBV) data synthesis using only real videos for training.

Method

Overall Architecture

FakeSTormer comprises three main modules:

  1. SBV Data Synthesis: Extends SBI to the video level, generating temporally consistent forgery videos with corresponding annotations.
  2. Modified TimeSformer Backbone: Decouples spatial and temporal tokens to disentangle feature learning.
  3. Three-Branch Multi-Task Head: A classification head \(f\), a temporal head \(h\) (regressing temporal vulnerability derivatives), and a spatial head \(g\) (predicting per-frame spatial vulnerability soft labels).
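To make the three-branch design concrete, here is a minimal PyTorch-style sketch of how the heads might consume the backbone outputs. All names are illustrative, and which token feeds the classification head \(f\) is an assumption here, not the paper's specification.

```python
import torch
import torch.nn as nn

class ThreeBranchHeads(nn.Module):
    """Illustrative sketch: three heads over disentangled backbone outputs.

    Z   : (B, T, N, C) patch embeddings after L Transformer layers
    z_s : (B, T, C)    per-frame spatial token
    z_t : (B, C)       temporal token
    """
    def __init__(self, dim=768):
        super().__init__()
        self.cls_head = nn.Linear(dim, 1)        # f: real/fake logit (fed by z_t -- an assumption)
        self.spatial_head = nn.Linear(dim, 1)    # g: per-frame vulnerability soft label
        # h: two 3D convs with temporal kernel 3 and spatial kernel 1, as described
        self.temporal_head = nn.Sequential(
            nn.Conv3d(dim, dim // 2, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.GELU(),
            nn.Conv3d(dim // 2, 1, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
        )

    def forward(self, Z, z_s, z_t):
        B, T, N, C = Z.shape
        H = W = int(N ** 0.5)                           # assumes a square patch grid
        logit = self.cls_head(z_t).squeeze(-1)          # (B,)         classification f
        p_hat = self.spatial_head(z_s).squeeze(-1)      # (B, T)       spatial head g
        vol = Z.permute(0, 3, 1, 2).reshape(B, C, T, H, W)
        d_hat = self.temporal_head(vol).squeeze(1)      # (B, T, H, W) temporal head h
        return logit, p_hat, d_hat
```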

Key Designs

  1. Self-Blended Video (SBV) Data Synthesis:

    • Function: Generates high-quality forgery videos from real videos, providing annotation-free training signals.
    • Mechanism: Extends SBI (Self-Blended Image) with two temporal consistency modules:
      • Consistent Synthesized Parameters (CSP): Samples all blending parameters \(\theta^{(sbi)}\) (ConvexHull type, mask deformation kernel, blending ratio, etc.) once on the first frame and reuses them for every subsequent frame.
      • Landmark Interpolation (LI): When the landmark displacement between adjacent frames is too large, smooths the change via interpolation: \(\mathbf{l}_i(t) = \mathbf{l}_i(t-1) + \frac{\mathbf{l}_i(t) - \mathbf{l}_i(t-1)}{\text{round}(d/\bar{d})}\), where the \(\text{round}\) operation deliberately leaves slight errors to preserve subtle temporal artifacts (a minimal sketch follows this list).
    • Design Motivation: Existing video-level synthesis methods (e.g., STC, VB) introduce exaggerated temporal distortions inconsistent with real high-quality forgeries; SBV generates more realistic forgery samples by maintaining temporal consistency.
  2. Vulnerability-Driven Cutout Augmentation:

    • Function: Occludes regions most likely to contain blending artifacts, preventing the model from overfitting to specific artifact locations.
    • Mechanism: Computes the blending boundary \(\mathbf{B} = (\mathbf{1} - \mathbf{M}) * \mathbf{M} * 4\), quantifies patch-level vulnerability values \(\bar{\mathbf{B}}\) via max pooling, draws a random threshold \(\tau_{cutout} \in (0.5, 1.0]\), and occludes patches exceeding the threshold at the same positions across all frames (a code sketch follows this list).
    • Design Motivation: Models tend to overfit to blending-boundary regions; occluding them forces the model to learn features from other regions.
  3. Modified TimeSformer Backbone:

    • Function: Provides disentangled spatio-temporal features for the three-branch framework.
    • Mechanism: Augments TimeSformer with a spatial token \(\mathbf{z}_s^0\) and a temporal token \(\mathbf{z}_t^0\) in each attention dimension, each interacting exclusively with patch embeddings along its corresponding axis (see the attention sketch after this list). After \(L=12\) Transformer layers, the outputs \([\mathbf{Z}^L, \mathbf{z}_s^L, \mathbf{z}_t^L]\) are fed into the separate heads.
    • Design Motivation: The global CLS token in the original TimeSformer conflates spatio-temporal features, hindering disentangled learning; computational complexity is reduced from \(\mathcal{O}(T^2 \cdot N^2)\) to \(\mathcal{O}(T^2 + N^2)\).
  4. Temporal Head \(h\) — Temporal Vulnerability Regression:

    • Function: Predicts the temporal derivative of blending boundaries to capture high-variation regions of temporal artifacts.
    • Mechanism: Reshapes patch embeddings into 3D features and applies two layers of 3D convolutions (temporal kernel size 3, spatial kernel size 1) to regress the normalized temporal derivative of the blending boundary, \(\tilde{\mathbf{D}} = \partial\tilde{\mathbf{B}}/\partial t\). Loss is MSE between the prediction \(\hat{\mathbf{D}}\) and this target: \(\mathcal{L}_h = \frac{1}{T \times N}\|\hat{\mathbf{D}} - \tilde{\mathbf{D}}\|_2^2\).
    • Design Motivation: Temporally varying blending boundaries reflect the presence of temporal artifacts, a critical signal for video-level detection.
  5. Spatial Head \(g\) — Spatial Vulnerability Prediction:

    • Function: Predicts per-frame spatial artifact intensity soft labels.
    • Mechanism: Applies an MLP to the spatial token to predict per-frame soft labels \(\tilde{\mathbf{p}} = \text{MLP}(\mathbf{z}_s^L)\), with targets \(p(t) = \max_{l,m}\tilde{\mathbf{B}}_{l,m}(t)\). Loss is BCE: \(\mathcal{L}_g = \text{BCE}(\tilde{\mathbf{p}}, \mathbf{p})\).
    • Design Motivation: Spatial artifact detection complements temporal artifact detection; their synergy enables comprehensive capture of forgery traces.
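The first three designs lend themselves to short code sketches. First, the Landmark Interpolation (LI) step from design 1, assuming NumPy arrays of 2D facial landmarks; the function name interpolate_landmarks and the running-average argument d_bar are illustrative, not from the paper's code.

```python
import numpy as np

def interpolate_landmarks(prev: np.ndarray, curr: np.ndarray, d_bar: float) -> np.ndarray:
    """Landmark Interpolation (LI) sketch: when the mean displacement d between
    consecutive frames exceeds the running average d_bar, pull the current
    landmarks back toward the previous frame. Integer rounding of d / d_bar
    deliberately leaves small residual jitter, preserving subtle temporal artifacts.

    prev, curr: (num_landmarks, 2) landmark coordinates for frames t-1 and t.
    """
    disp = curr - prev                        # per-landmark displacement
    d = np.linalg.norm(disp, axis=1).mean()   # mean displacement this step
    if d <= d_bar:
        return curr                           # small motion: keep landmarks as-is
    k = max(int(round(d / d_bar)), 1)         # integer step count from round(d / d_bar)
    return prev + disp / k                    # l(t) = l(t-1) + (l(t) - l(t-1)) / round(d / d_bar)
```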
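Next, a sketch of the vulnerability-driven cutout from design 2, assuming the blending mask \(\mathbf{M}\) is available at frame resolution and patches are 16 × 16; vulnerability_cutout is a hypothetical helper, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vulnerability_cutout(frames: torch.Tensor, mask: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Occlude the patches most likely to contain blending artifacts.

    frames: (T, C, H, W) video clip; mask: (H, W) blending mask M in [0, 1].
    """
    B = (1.0 - mask) * mask * 4.0                    # blending boundary, peaks where M = 0.5
    # patch-level vulnerability via max pooling, as described in the paper
    B_bar = F.max_pool2d(B[None, None], kernel_size=patch)[0, 0]   # (H/p, W/p)
    tau = 0.5 + 0.5 * torch.rand(1).item()           # tau_cutout ~ Uniform(0.5, 1.0)
    for i, j in (B_bar > tau).nonzero().tolist():    # patches above the threshold
        # occlude the same spatial position in every frame
        frames[:, :, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
    return frames
```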
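Finally, a rough sketch of the token decoupling in design 3. It only illustrates the axis-restricted attention pattern; how the temporal token aggregates its per-location summaries is an assumption here.

```python
import torch
import torch.nn as nn

class DecoupledTokenAttention(nn.Module):
    """Illustrative sketch (not the authors' code): the spatial token attends
    only over each frame's patches, and the temporal token attends only along
    the time axis, keeping the two feature streams disentangled."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, Z, z_s, z_t):
        # Z: (B, T, N, C), z_s: (B, T, C), z_t: (B, C)
        B, T, N, C = Z.shape
        # spatial token: one query per frame over that frame's N patches
        q_s = z_s.reshape(B * T, 1, C)
        kv_s = Z.reshape(B * T, N, C)
        z_s = self.spatial_attn(q_s, kv_s, kv_s)[0].reshape(B, T, C)
        # temporal token: attends over the T patches at each spatial location,
        # then averages the per-location summaries (the averaging is an assumption)
        kv_t = Z.permute(0, 2, 1, 3).reshape(B * N, T, C)
        q_t = z_t[:, None, None, :].expand(B, N, 1, C).reshape(B * N, 1, C)
        out = self.temporal_attn(q_t, kv_t, kv_t)[0]   # (B*N, 1, C)
        z_t = out.reshape(B, N, C).mean(dim=1)         # (B, C)
        return z_s, z_t
```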

Loss & Training

Total loss: \(\mathcal{L} = \lambda_c \mathcal{L}_c + \lambda_h \mathcal{L}_h + \lambda_g \mathcal{L}_g\)

  • \(\mathcal{L}_c\): Binary classification BCE loss (real/fake discrimination)
  • \(\mathcal{L}_h\): Temporal vulnerability MSE regression loss
  • \(\mathcal{L}_g\): Spatial vulnerability BCE loss
  • Training uses only real data (forgery data synthesized online via SBV)
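A hedged sketch of how the three terms and their targets could be assembled from the definitions above; fakestormer_loss and the unit \(\lambda\) defaults are placeholders, not the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def fakestormer_loss(logit, p_hat, d_hat, B_tilde, y,
                     lam_c=1.0, lam_h=1.0, lam_g=1.0):
    """Total objective L = lam_c*L_c + lam_h*L_h + lam_g*L_g (lambdas are placeholders).

    logit  : (Bz,)          classification logits from head f
    p_hat  : (Bz, T)        per-frame soft-label logits from head g
    d_hat  : (Bz, T, H, W)  predicted temporal derivatives from head h
    B_tilde: (Bz, T, H, W)  patch-level blending-boundary maps from SBV
                            (all zeros for pristine clips)
    y      : (Bz,)          real/fake labels
    """
    # Temporal target: first-order finite difference of B-tilde along time.
    D_tilde = B_tilde[:, 1:] - B_tilde[:, :-1]
    # Spatial target: per-frame maximum of B-tilde, i.e. p(t) = max_{l,m} B(t).
    p = B_tilde.flatten(2).amax(dim=2)                    # (Bz, T)
    loss_c = F.binary_cross_entropy_with_logits(logit, y.float())
    loss_h = F.mse_loss(d_hat[:, 1:], D_tilde)            # align with the T-1 differences
    loss_g = F.binary_cross_entropy_with_logits(p_hat, p)
    return lam_c * loss_c + lam_h * loss_h + lam_g * loss_g
```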

Key Experimental Results

Main Results (Cross-Dataset Generalization — Trained on FF++(c23))

| Method | Training Data | CDF | DFD | DFDCP | DFDC | DFW | DiffSwap |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SBI | Real | 90.6 | - | - | 72.4 | - | - |
| AltFreezing | Real+Fake | 89.5 | 98.5 | - | - | - | - |
| Swin+TALL | Real+Fake | 90.8 | - | 76.8 | - | - | - |
| LFGDIN | Real+Fake | 90.4 | - | 80.8 | - | - | 85.7 |
| LAA-Net | Real | 95.4 | 86.9 | 92.1 | - | - | - |
| FakeSTormer (T=16) | Real | 92.8 | 98.6 | 90.2 | 75.1 | 75.3 | 97.2 |

All values are AUC (%).

Ablation Study

| Configuration | CDF (AUC%) | DFDCP (AUC%) | Notes |
| --- | --- | --- | --- |
| Baseline (classification head + SBI only) | 89.2 | 85.3 | No multi-task learning |
| + SBV replacing SBI | 91.0 | 88.1 | SBV synthesis yields significant gains |
| + Temporal head \(h\) | 91.8 | 89.3 | Temporal vulnerability regression is effective |
| + Spatial head \(g\) | 92.4 | 90.0 | Spatial vulnerability prediction further improves |
| + Cutout augmentation | 92.8 | 90.2 | Prevents overfitting to specific regions |
| T=4 / T=8 / T=16 | 92.4 / 92.4 / 92.8 | 90.0 / 90.0 / 90.2 | More frames are marginally beneficial |

Key Findings

  • Training a baseline classifier with SBV alone matches existing SOTA, demonstrating the power of data synthesis.
  • The temporal and spatial heads each contribute independently, and their combination outperforms either branch used alone.
  • Cutout augmentation yields particularly notable improvements on datasets featuring novel forgery methods (e.g., DF40).
  • The token decoupling design in the modified TimeSformer is critical for the three-branch framework.

Highlights & Insights

  • Temporal extension of the vulnerability concept: Extending image-level vulnerability (pixels/patches most likely to embed blending artifacts) to video-level temporal vulnerability (blending boundaries with high temporal variation) is a natural and effective generalization.
  • Data synthesis quality is paramount: The two temporal consistency modules in SBV, though simple, are the primary driver of large performance gains.
  • Training on real data only: Avoids overfitting to specific forgery methods, enhancing generalization.
  • Plug-and-play design: SBV can be applied to any existing video-level forgery detection method.

Limitations & Future Work

  • TimeSformer incurs substantial computational overhead, which may become a bottleneck for real-world deployment.
  • SBV-generated forgery samples remain blending-based and may diverge significantly from non-blending forgeries (e.g., purely generative methods, SAM-based face swapping).
  • Temporal vulnerability is modeled using only first-order derivatives, potentially missing higher-order temporal patterns.
  • Performance on ultra-high-quality AI-generated videos (e.g., Sora, Kling) has not been evaluated.
  • The three loss weights \(\lambda_c, \lambda_h, \lambda_g\) require manual tuning.
Future directions & transferable ideas:

  • The extension path from image-level LAA-Net to video-level FakeSTormer is clear, offering a paradigm for adapting other image-level methods to the video domain.
  • Vulnerability-driven attention mechanisms may generalize to other visual anomaly detection tasks.
  • The SBV approach of temporal smoothing with slight error retention offers reference value for other video augmentation tasks.
  • The token decoupling design in the multi-branch framework can be adapted to other multi-task video understanding problems.

Rating

  • Novelty: ⭐⭐⭐⭐ The temporal extension of the vulnerability concept and SBV are core contributions, though the overall framework follows the LAA-Net paradigm.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation across 6 test sets, comparison with 15+ methods, comprehensive ablations, and visualization analysis.
  • Writing Quality: ⭐⭐⭐⭐ Figures and tables are clear, method descriptions are complete, though some notation is slightly redundant.
  • Value: ⭐⭐⭐⭐⭐ Establishes a strong baseline for video-level forgery detection; the data synthesis and multi-task framework designs have broad applicability.