Unleashing Vision-Language Semantics for Deepfake Video Detection

Conference: CVPR 2026
arXiv: 2603.24454
Code: https://github.com/mala-lab/VLAForge
Area: Face Understanding / Deepfake Detection
Keywords: Deepfake Detection, Vision-Language Alignment, CLIP, Attention Module, Identity-Aware

TL;DR

This paper proposes VLAForge, which pairs a ForgePerceiver that independently learns diverse forgery cues and forgery localization maps with an identity-aware Vision-Language Alignment (VLA) scoring mechanism that exploits the cross-modal semantics of VLMs for deepfake video detection. The paper reports state-of-the-art performance across 9 datasets.

Background & Motivation

Background: Deepfake video detection (DFD) aims to determine the authenticity of facial videos. Conventional approaches primarily focus on detecting spatial artifacts or temporal inconsistencies. Recently, methods based on pre-trained vision-language models (VLMs) such as CLIP have attracted growing attention due to their strong generalization capability.
Limitations of Prior Work: Existing VLM-based methods enhance the visual encoder itself through adapter tuning, bias correction, or spatiotemporal modeling, while overlooking the most distinctive advantage of VLMs—the rich vision-language semantics embedded in the latent space. These methods exploit only unimodal visual features, failing to leverage the discriminative potential of cross-modal semantics.
Key Challenge: The visual encoder of a VLM learns to understand semantic objects in images during pre-training, rather than to detect forgery artifacts. When directly applied to DFD, attention tends to be distributed over objects irrelevant to forgery. Meanwhile, manipulated facial regions often exhibit diverse and heterogeneous low-level artifacts (boundary inconsistencies, texture distortions), which are difficult for semantics-oriented VLM visual encoders to effectively capture.
Goal: (1) How to enhance the VLM's visual perception of forgery artifacts without disrupting its pre-trained knowledge? (2) How to exploit the VLM's intrinsic vision-language alignment to provide complementary fine-grained discriminative cues?
Key Insight: Injecting identity priors into the text prompts adapts the visual-text alignment into a finer-grained form, enabling the model to capture authenticity cues tailored to each individual.
Core Idea: A standalone ForgePerceiver learns diverse forgery cues to modulate VLM visual tokens, while identity-prior-enhanced text prompts unleash the VLM's cross-modal semantics for patch-level authenticity judgment. The two branches are fused to achieve global and local discrimination.

Method

Overall Architecture

VLAForge is built upon CLIP and comprises two core components: ForgePerceiver and Identity-Aware VLA Scoring. ForgePerceiver serves as an independent visual forgery learner for the VLM, generating forgery-aware masks to modulate the VLM's class token (global discrimination) and outputting forgery localization maps (local cues). Identity-Aware VLA Scoring constructs identity-prior-enhanced text prompts, computes patch-level VLA attention maps, and fuses them with the forgery localization maps to produce local authenticity scores. The final authenticity score is a weighted combination of the global and local branches.

Key Designs

ForgePerceiver — Forgery-Aware Mask Learning:
- Function: Independently learns diverse forgery cues from the VLM and modulates the VLM's global representation via masks.
- Mechanism: A lightweight ViT processes the VLM's visual tokens \(\mathbf{V}\) together with learnable query tokens \(\mathbf{Q}\). \(H\) groups of per-head forgery-aware masks are computed from the similarity between queries and visual features: \(\mathcal{M}_i = \hat{\mathbf{Q}} \hat{\mathbf{V}}_i^\top\). To ensure that different queries capture complementary artifact priors, an orthogonality constraint \(\mathcal{L}_{orth}\) is imposed on the query-level masks. The resulting masks are injected as attention biases into the self-attention of the VLM visual encoder: \(\mathbf{z}_j^{(l)} = \text{softmax}\left(\frac{\mathbb{Q}_j^{(l)} \mathbb{K}_P^{(l)\top}}{\sqrt{d}} + \mathcal{M}_{i,j}\right) \mathbb{V}_P^{(l)}\), guiding the class token to accumulate more forgery-relevant semantics.
- Design Motivation: The original VLM class token is insensitive to subtle forgery artifacts. Modulating the attention distribution with diverse forgery-aware masks enables the class token to capture forgery information from multiple complementary perspectives, while preserving pre-trained VLM knowledge since ForgePerceiver operates independently.
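The mask-modulation step described above can be illustrated with a minimal sketch. It shows how an additive bias on the attention logits shifts the class token's attention toward selected patches, and one possible form of an orthogonality penalty over query-level masks. Plain Python lists stand in for tensors; the function names, the toy values, and the squared-cosine form of the penalty are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biased_cls_attention(q_cls, keys, values, mask_bias):
    """Compute softmax(q K^T / sqrt(d) + M) V for a single query (the class token)."""
    d = len(q_cls)
    logits = [
        sum(qi * ki for qi, ki in zip(q_cls, k)) / math.sqrt(d) + b
        for k, b in zip(keys, mask_bias)
    ]
    weights = softmax(logits)
    out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
    return out, weights

def orthogonality_penalty(masks):
    """Mean squared cosine similarity between distinct masks (assumed form of L_orth)."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    units = [unit(m) for m in masks]
    pairs = [
        sum(a * b for a, b in zip(units[i], units[j])) ** 2
        for i in range(len(units)) for j in range(i + 1, len(units))
    ]
    return sum(pairs) / len(pairs) if pairs else 0.0

# Toy example: 3 patches with 2-dimensional features (values are made up).
q_cls = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # identical keys: uniform attention without a bias
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
no_bias = [0.0, 0.0, 0.0]
forgery_bias = [0.0, 4.0, 0.0]                # hypothetical mask favouring patch 1

_, w_plain = biased_cls_attention(q_cls, keys, values, no_bias)
_, w_biased = biased_cls_attention(q_cls, keys, values, forgery_bias)
```

In this toy case the unbiased attention is uniform, while the biased version concentrates on patch 1. The VLM's own weights are not modified; only the logits receive an additive term.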
ForgePerceiver — Forgery Localization:
- Function: Generates coarse-grained region-aware forgery localization maps to provide spatial guidance.
- Mechanism: Visual tokens are projected into a task-adaptive space via another projection function \(g_3(\cdot)\). Query-level localization maps are computed and aggregated through a convolutional head: \(\mathbf{M}_{loc} = h([\tilde{\mathcal{M}}_1, \ldots, \tilde{\mathcal{M}}_q])\). An MSE loss supervises the maps against ground-truth forgery masks.
- Design Motivation: This component provides auxiliary spatial guidance to help the model learn more accurate forgery priors without sacrificing the diversity of forgery masks, while also supplying local cues for subsequent VLA scoring.
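As a rough illustration of the localization branch, the sketch below aggregates several query-level maps into a single map and scores it against a ground-truth mask with MSE. The aggregation head \(h\) is approximated here by a per-pixel weighted sum followed by a sigmoid (a 1x1-convolution-style stand-in); that choice, the weights, and the toy maps are assumptions made for illustration only.

```python
import math

def aggregate_maps(query_maps, weights, bias=0.0):
    """Per-pixel weighted sum over query-level maps, then sigmoid (stand-in for the head h)."""
    height, width = len(query_maps[0]), len(query_maps[0][0])
    result = []
    for r in range(height):
        row = []
        for c in range(width):
            logit = sum(w * m[r][c] for w, m in zip(weights, query_maps)) + bias
            row.append(1.0 / (1.0 + math.exp(-logit)))
        result.append(row)
    return result

def mse_loss(pred, target):
    """Mean squared error between two 2D maps of equal shape."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)

# Toy example: two 2x2 query-level maps that both respond to the top-left patch.
query_maps = [
    [[5.0, -5.0], [-5.0, -5.0]],
    [[5.0, -5.0], [-5.0, -5.0]],
]
gt_mask = [[1.0, 0.0], [0.0, 0.0]]            # hypothetical ground-truth forgery mask

m_loc = aggregate_maps(query_maps, weights=[0.5, 0.5])
loc_loss = mse_loss(m_loc, gt_mask)
```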
Identity-Aware VLA Scoring:
- Function: Exploits the VLM's intrinsic vision-language alignment to provide fine-grained patch-level authenticity discrimination.
- Mechanism: Text templates of the form "This is a real/fake photo of \<id> person." are constructed, with the \<id> placeholder replaced by the class token embedding \(\mathbf{z}^{(L)}\) from the last layer of the VLM visual encoder, thereby injecting identity priors. ID-aware features \(\mathbf{F}_r\)/\(\mathbf{F}_f\) are obtained via the text encoder, and softmax is applied against patch tokens to produce the VLA attention map: \(\mathbf{M}_{VLA}(i,j) = \frac{\exp(\phi(\mathbf{P}(i,j))\mathbf{F}_f^\top)}{\sum_{c}\exp(\phi(\mathbf{P}(i,j))\mathbf{F}_c^\top)}\). The VLA attention map is element-wise fused with the forgery localization map to generate the VLA score.
- Design Motivation: Existing VLM-based detection methods perform only image-level global alignment, lacking fine-grained patch-level authenticity correspondence. Injecting identity priors makes the text-visual alignment more discriminative—precisely highlighting forged regions in fake samples while suppressing spurious attention in real samples.
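The scoring path above can be sketched in a few lines. The example replaces the \<id> slot of a prompt's token embeddings with a class-token embedding, computes a per-patch softmax over the real and fake text features, and fuses the result with a localization map. The text encoder itself is not modelled: `f_real` and `f_fake` stand in for its outputs, the projection \(\phi\) is assumed to be already applied to the patch features, and the element-wise product plus mean pooling used for the final score are assumptions, since the summary does not specify the fusion operator.

```python
import math

def inject_identity(prompt_embeddings, id_position, cls_embedding):
    """Return a copy of the prompt token embeddings with the <id> slot replaced."""
    out = [list(tok) for tok in prompt_embeddings]
    out[id_position] = list(cls_embedding)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vla_attention_map(patch_feats, f_real, f_fake):
    """Per-patch softmax over {real, fake}; each cell holds the 'fake' probability."""
    grid = []
    for row in patch_feats:
        out_row = []
        for p in row:
            s_r, s_f = dot(p, f_real), dot(p, f_fake)
            m = max(s_r, s_f)                  # subtract the max for numerical stability
            e_r, e_f = math.exp(s_r - m), math.exp(s_f - m)
            out_row.append(e_f / (e_r + e_f))
        grid.append(out_row)
    return grid

def fuse_and_score(vla_map, loc_map):
    """Element-wise product of the two maps, mean-pooled into one score (assumed fusion)."""
    fused = [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(vla_map, loc_map)]
    flat = [x for row in fused for x in row]
    return fused, sum(flat) / len(flat)

# Toy example with made-up 2-dimensional embeddings.
prompt = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]  # middle token is the <id> placeholder
injected = inject_identity(prompt, 1, [0.9, 0.8])

f_real, f_fake = [0.0, 1.0], [1.0, 0.0]        # stand-ins for text-encoder outputs
patches = [[[2.0, 0.0], [0.0, 2.0]]]           # a 1x2 grid of patch features
m_vla = vla_attention_map(patches, f_real, f_fake)
fused, s_vla = fuse_and_score(m_vla, [[1.0, 0.0]])
```

Because the softmax is taken per patch, each cell of the map is a local real-versus-fake judgment rather than a single image-level similarity.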

Loss & Training

Total loss: \(\mathcal{L}_{final} = \mathcal{L}_{loc} + \mathcal{L}_{VLA} + \mathcal{L}_G + \mathcal{L}_L\)
\(\mathcal{L}_G\): Global-level binary cross-entropy loss based on the forgery-mask-modulated class token.
\(\mathcal{L}_L\): Local-level binary cross-entropy loss based on the VLA fusion score.
\(\mathcal{L}_{loc}\): MSE loss supervising the forgery localization map.
\(\mathcal{L}_{VLA}\): Dice loss supervising the VLA attention map.
Final inference score: \(s(x') = \alpha s_g' + (1-\alpha)s_{VLA}'\), where \(\alpha\) balances global and local contributions.
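A minimal sketch of how the four loss terms and the final score combine, using flattened maps and scalar scores. The helper names are hypothetical, the default `alpha=0.5` is a placeholder rather than a value from the paper, and the Dice formulation shown is one common variant.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mse(pred, target):
    """Mean squared error over flattened maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss over flattened maps (one common formulation)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def total_loss(s_g, s_vla, y, loc_map, vla_map, gt_mask):
    """L_final = L_loc + L_VLA + L_G + L_L, with all weights equal to 1."""
    return mse(loc_map, gt_mask) + dice_loss(vla_map, gt_mask) + bce(s_g, y) + bce(s_vla, y)

def final_score(s_g, s_vla, alpha=0.5):
    """s = alpha * s_g + (1 - alpha) * s_VLA; alpha=0.5 is a placeholder."""
    return alpha * s_g + (1.0 - alpha) * s_vla

# Toy example: a fake sample (y = 1) with a 4-pixel flattened mask.
gt = [1.0, 0.0, 0.0, 0.0]
inverted = [0.0, 1.0, 1.0, 1.0]
good = total_loss(0.99, 0.99, 1, gt, gt, gt)
bad = total_loss(0.01, 0.01, 1, inverted, inverted, gt)
blend = final_score(0.9, 0.7, alpha=0.5)
```

Setting `alpha` to 1 reduces the score to the global branch alone, and 0 to the VLA branch alone.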

Key Experimental Results

Main Results

Dataset	Level	VLAForge AUROC (%)	Prev. SOTA (ForAda) AUROC (%)	Gain (pp)
CDF-v1	frame	93.9	91.4	+2.5
CDF-v2	frame	91.2	90.0	+1.2
DFDC	frame	87.0	84.3	+2.7
DFD	frame	93.6	93.3	+0.3
CDF-v2	video	96.8	95.7	+1.1
DFDC	video	89.6	87.2	+2.4
DFD	video	97.2	96.5	+0.7
VQGAN	frame	98.4	93.9	+4.5
SiT	frame	77.4	69.0	+8.4

Ablation Study

Configuration	CDF-v2 (frame AUROC)	DFDC (frame AUROC)	DFD (frame AUROC)	Description
Base (CLIP)	58.3%	64.0%	77.5%	Baseline CLIP encoder
+T1 (forgery mask)	76.3%	76.0%	74.6%	+ forgery-aware mask modulation
+T2 (forgery localization)	82.3%	80.9%	87.4%	+ forgery localization supervision
+T3 (VLA scoring)	90.8%	86.5%	92.8%	+ identity-aware VLA
+T4 (orthogonality constraint)	91.2%	87.0%	93.6%	full model

Key Findings

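The components add up to a large cumulative gain: frame-level AUROC on CDF-v2 rises from 58.3% for the CLIP baseline to 91.2% for the full model.
Forgery-aware masks (+T1) yield the largest single-step gain on CDF-v2 (58.3%→76.3%) and DFDC (64.0%→76.0%), underscoring that enhancing VLM visual perception is critical. On DFD, however, +T1 alone lowers AUROC from 77.5% to 74.6%, and the largest step there comes from adding localization supervision (+T2: 74.6%→87.4%).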
VLA scoring provides important complementary gains (+T3 on CDF-v2: 82.3%→90.8%), validating the discriminative value of cross-modal semantics.
Gains over ForAda are more pronounced on full-face generation forgeries (GAN/Diffusion): on SiT, frame-level AUROC is 77.4% versus 69.0% for ForAda, suggesting that VLA semantics are more robust to novel forgery types.
The orthogonality constraint yields only a modest improvement (+0.4 to +0.8 points across the three datasets) but is intended to ensure that different queries learn complementary forgery priors.
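The per-step gains behind these findings can be recomputed directly from the ablation table. The snippet below is only arithmetic on the reported numbers; the variable and function names are illustrative.

```python
# Frame-level AUROC (%) from the ablation table: (configuration, CDF-v2, DFDC, DFD).
ablation = [
    ("Base (CLIP)", 58.3, 64.0, 77.5),
    ("+T1", 76.3, 76.0, 74.6),
    ("+T2", 82.3, 80.9, 87.4),
    ("+T3", 90.8, 86.5, 92.8),
    ("+T4", 91.2, 87.0, 93.6),
]
datasets = ("CDF-v2", "DFDC", "DFD")

def step_gains(rows):
    """Difference in percentage points between each configuration and the previous one."""
    gains = {}
    for prev, cur in zip(rows, rows[1:]):
        gains[cur[0]] = tuple(round(c - p, 1) for p, c in zip(prev[1:], cur[1:]))
    return gains

gains = step_gains(ablation)
for name, g in gains.items():
    print(name, dict(zip(datasets, g)))
```

The output shows that +T1 is the largest step on CDF-v2 and DFDC but negative on DFD, and that +T4 adds less than one point on every dataset.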

Highlights & Insights

The idea of unleashing VLM cross-modal semantics is distinctive: rather than merely enhancing the visual encoder, the work uses vision-language alignment itself as a discriminative signal, a direction that prior methods have left largely unexplored at the patch level.
The identity-prior injection into text prompts is particularly elegant: using the VLM class token as the embedding for the \<id> placeholder simultaneously encodes identity information and conforms to the VLM's text encoding space.
Designing ForgePerceiver as a standalone learner preserves the VLM's pre-trained knowledge: forgery information is injected through mask modulation rather than by directly modifying the VLM's parameters.
Visualizations of VLA attention maps show clear differences between fake and real samples, with forged regions accurately highlighted in fake samples and little activation on real ones.

Limitations & Future Work

The identity prior is derived from the VLM's own class token; if the VLM's visual features are insufficiently discriminative, the quality of the identity prior will be constrained accordingly.
Although cross-dataset evaluation is comprehensive, training is primarily conducted on FF++, and real-world training data distributions are considerably more complex.
All loss weights are set to 1, leaving the relative importance of different loss terms unexplored.
Only CLIP is used as the VLM backbone; stronger VLMs (e.g., SigLIP, EVA-CLIP) may yield further improvements.

vs. ForAda: ForAda fine-tunes the CLIP visual encoder via adapters, representing a purely visual enhancement; VLAForge additionally leverages vision-language alignment semantics, surpassing ForAda by 2.7 percentage points in frame-level AUROC on DFDC (87.0% vs. 84.3%).
vs. RepDFD: RepDFD reprograms the VLM using externally generated, sample-specific text prompts from face embeddings, but performs only image-level global alignment; VLAForge achieves fine-grained patch-level alignment.
vs. FFTG: FFTG augments interpretability with synthesized image-text pairs and masks, but the text descriptions are auxiliary rather than adapted to the VLM's intrinsic alignment; VLAForge directly unleashes the VLM's inherent cross-modal discriminative capability.

Rating

Novelty: ⭐⭐⭐⭐⭐ First systematic exploitation of VLM cross-modal semantics for DFD; the identity-prior injection design is highly creative.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 datasets, frame- and video-level evaluation, covering both face-swapping and full-face generation forgeries.
Writing Quality: ⭐⭐⭐⭐ Method description is clear and visualizations are convincing.
Value: ⭐⭐⭐⭐⭐ Opens a new direction for VLM application in DFD; achieves comprehensive state-of-the-art results.