Visual Language Models as Zero-Shot Deepfake Detectors
Conference: ICML 2025
arXiv: 2507.22469
Code: None
Area: Image Generation
Keywords: Deepfake Detection, Vision-Language Models, Zero-shot Classification, VLM Probability Calibration, InstructBLIP
TL;DR
The paper proposes an image classification framework based on normalizing VLM token probabilities, moving deepfake detection from binary decisions to probability estimation. In the zero-shot setting, InstructBLIP outperforms most dedicated deepfake detectors, and after fine-tuning it achieves near-perfect performance on DFDC-P.
Background & Motivation
Background: Most deepfake detection methods train dedicated classifiers (such as the XceptionNet baseline trained on FaceForensics++, SBI, and MAT), which rely heavily on labeled data and generalize poorly to novel deepfakes.
Limitations of Prior Work: (a) existing detectors degrade sharply on out-of-distribution data; (b) existing VLM-based deepfake studies only make binary yes/no decisions and cannot output confidence probabilities; (c) as a result, they cannot report practical deployment metrics such as FAR/FRR (false acceptance and false rejection rates).
Key Challenge: Real-world deployment requires probabilistic outputs to adjust thresholds (balancing false acceptance and false rejection rates), whereas the argmax output of VLMs can only yield 0/1 binary decisions.
Goal: Extract a meaningful classification confidence from the token distribution of a VLM.
Key Insight: Use the relative probability of the "yes" and "no" tokens in response to "Is this photo real?" as the confidence score.
Core Idea: Normalize the yes/no token probabilities into \(\tilde{P}_{\text{fake}} = P_{\text{no}} / (P_{\text{no}} + P_{\text{yes}})\) to obtain continuous confidence scores suitable for ROC analysis.
Method
Overall Architecture
Given an image and a prompt (e.g., "Is this photo real?"), the VLM performs a single forward pass to obtain the token distribution. The probabilities of tokens like "yes"/"Yes"/"no"/"No" are extracted, summed by group, and normalized to obtain the fake confidence score for downstream decision-making.
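The scoring step described above can be sketched in a few lines. This is only an illustration, assuming the next-token logits are already available as a plain list and that the token ids of the four answer variants are known. The vocabulary, ids, and logits below are made up for the example and do not come from any real tokenizer or model.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fake_score(logits, yes_ids, no_ids):
    """Return P_no / (P_no + P_yes), the normalized fake confidence."""
    probs = softmax(logits)
    p_yes = sum(probs[i] for i in yes_ids)  # "yes" + "Yes"
    p_no = sum(probs[i] for i in no_ids)    # "no" + "No"
    return p_no / (p_no + p_yes)

# Hypothetical 6-token vocabulary: ids 0/1 stand for "yes"/"Yes",
# ids 2/3 for "no"/"No"; the remaining ids are unrelated tokens.
toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0, 3.0]
score = fake_score(toy_logits, yes_ids=[0, 1], no_ids=[2, 3])
print(f"fake confidence: {score:.3f}")
```

Because the score is renormalized over the yes/no groups only, probability mass assigned to unrelated tokens (id 5 in the toy example) does not affect it.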
Key Designs
- Token Probability Normalized Classification:
  - Function: Extract classification confidence from the VLM's token distribution.
  - Mechanism: \(P(I \in D) \approx \frac{P_{\text{no}}}{P_{\text{no}} + P_{\text{yes}}}\), where \(P_{\text{no}} = p(\text{"no"}) + p(\text{"No"})\) and \(P_{\text{yes}} = p(\text{"yes"}) + p(\text{"Yes"})\).
  - Design Motivation: Compared to argmax (0/1 outputs), normalized probabilities support AUC/EER evaluation and threshold adjustment.
- Multi-token/Multi-class Extension (Algorithm 1):
  - Function: Support multi-token answers (e.g., "Yes for sure!") and multi-class classification.
  - Mechanism: For all candidate answer strings \(s \in \mathcal{S}_c\) of class \(c\), compute the autoregressive probability \(P(s|I,Q) = \prod_k p(t_k|I,Q,t_{1:k-1}) \cdot p(\text{EOS}|I,Q,s)\), followed by summation and normalization.
  - Design Motivation: Because tokenizer vocabularies differ across VLMs, it is necessary to cover all potential answer formats.
- Prompt Engineering:
  - Function: Design customized prompts for different VLMs.
  - Mechanism: InstructBLIP only requires "Is this photo real?"; LLaVA needs an additional "Answer using a single word"; GPT-4o requires role-play-style long prompts.
  - Design Motivation: Ensure models consistently return answers in a yes/no format.
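The multi-token, multi-class extension can be sketched as follows. This is a hedged illustration of the formula given for Algorithm 1, not the authors' implementation. The `next_token` callback, the `<eos>` marker, and the toy probability table are all hypothetical stand-ins for a real VLM forward pass conditioned on the image and prompt.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: given the answer tokens generated so far, return
# the next-token distribution as {token: probability}.
NextTokenFn = Callable[[Sequence[str]], Dict[str, float]]

EOS = "<eos>"  # assumed end-of-sequence marker

def sequence_prob(tokens: Sequence[str], next_token: NextTokenFn) -> float:
    """Probability of emitting exactly `tokens` and then stopping."""
    prob = 1.0
    prefix: List[str] = []
    for tok in tokens:
        prob *= next_token(prefix).get(tok, 0.0)
        prefix.append(tok)
    return prob * next_token(prefix).get(EOS, 0.0)

def class_probs(candidates: Dict[str, List[List[str]]],
                next_token: NextTokenFn) -> Dict[str, float]:
    """Sum sequence probabilities per class, then normalize across classes."""
    raw = {c: sum(sequence_prob(s, next_token) for s in seqs)
           for c, seqs in candidates.items()}
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("no candidate answer has non-zero probability")
    return {c: v / total for c, v in raw.items()}

def toy_next_token(prefix: Sequence[str]) -> Dict[str, float]:
    """Made-up distributions standing in for a VLM."""
    table = {
        (): {"Yes": 0.5, "No": 0.3, "Maybe": 0.2},
        ("Yes",): {EOS: 0.6, "for": 0.4},
        ("Yes", "for"): {"sure": 1.0},
        ("Yes", "for", "sure"): {EOS: 1.0},
        ("No",): {EOS: 1.0},
    }
    return table.get(tuple(prefix), {})

candidates = {
    "real": [["Yes"], ["Yes", "for", "sure"]],
    "fake": [["No"]],
}
probs = class_probs(candidates, toy_next_token)
print(probs)
```

In the toy table, the "real" class collects the mass of both "Yes" and "Yes for sure", so answers that differ only in phrasing contribute to the same class before normalization.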
Key Experimental Results
Main Results (Zero-Shot vs. Dedicated Detectors, CelebA-HQ SimSwap Dataset)
| Method | AUC ↑ | ACC ↑ | EER ↓ |
|---|---|---|---|
| FF++ (XceptionNet) | 58.9 | 59.2 | 44.5 |
| MAT | 49.0 | 50.0 | 50.6 |
| RECCE | 46.9 | 49.1 | 50.8 |
| SBI (SOTA Dedicated) | 93.6 | 85.2 | 14.0 |
| InstructBLIP (Zero-Shot) | 81.3 | 75.3 | 26.9 |
| InstructBLIP FT | 92.1 | 85.0 | 12.2 |
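To make the threshold-based metrics concrete, the sketch below computes FAR, FRR, AUC, and an approximate EER from continuous fake scores. It is a minimal illustration under stated assumptions: labels use 1 for fake and 0 for real, an image is flagged as fake when its score reaches the threshold, FAR counts fakes accepted as real, and FRR counts real images rejected as fake. The scores are invented for the example and are unrelated to the numbers in the table.

```python
def split(scores, labels):
    """Separate scores into (fake, real) lists; labels: 1 = fake, 0 = real."""
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    return fakes, reals

def far_frr(scores, labels, threshold):
    """FAR: fakes accepted as real. FRR: real images rejected as fake."""
    fakes, reals = split(scores, labels)
    far = sum(s < threshold for s in fakes) / len(fakes)
    frr = sum(s >= threshold for s in reals) / len(reals)
    return far, frr

def auc(scores, labels):
    """Chance that a random fake outscores a random real (ties count half)."""
    fakes, reals = split(scores, labels)
    wins = sum((f > r) + 0.5 * (f == r) for f in fakes for r in reals)
    return wins / (len(fakes) * len(reals))

def approx_eer(scores, labels):
    """Scan observed scores as thresholds; take the point where FAR and FRR are closest."""
    points = [far_frr(scores, labels, t) for t in sorted(set(scores))]
    far, frr = min(points, key=lambda p: abs(p[0] - p[1]))
    return (far + frr) / 2

# Invented fake-confidence scores for four fakes and four real images.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1, 0.4]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print("FAR/FRR at 0.5:", far_frr(scores, labels, 0.5))
print("AUC:", auc(scores, labels))
print("approx. EER:", approx_eer(scores, labels))
```

With binary 0/1 outputs only a single operating point exists, which is why continuous scores are needed to trade FAR against FRR or to report AUC and EER.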
Method Comparison (Normalization vs. Softmax vs. Binary)
| VLM | Binary ACC | Normalize AUC | Softmax AUC |
|---|---|---|---|
| InstructBLIP | 68.0 | 81.3 | 80.9 |
| Idefics2 | 74.2 | 80.6 | 75.2 |
| LLaVA-1.6 | 58.3 | 74.2 | 74.2 |
Key Findings
- The normalization method improves on binary argmax for every VLM tested: normalized AUC exceeds binary accuracy by up to ~16 points (LLaVA-1.6, from 58.3 to 74.2).
- Zero-shot InstructBLIP outperforms most dedicated detectors, trailing only SBI and CADDM.
- Fine-tuning InstructBLIP yields an AUC of 92.1, approaching SBI's 93.6, with a lower EER (12.2 vs. 14.0).
Highlights & Insights
- Practical Framework: The token probability normalization method is generally applicable to any classification task using VLMs, not limited to deepfake detection.
- Demonstration of Zero-Shot Capability: The pre-trained knowledge of VLMs is sufficient to achieve viable performance on novel deepfakes.
- Multi-Token Extension: Algorithm 1 multiplies the autoregressive token probabilities of each candidate answer, so it supports answers of arbitrary length.
Related Work & Insights
- vs AntifakePrompt: AntifakePrompt fine-tunes soft prompts on InstructBLIP for deepfake VQA but only outputs 0/1; ours requires no fine-tuning and outputs continuous probabilities.
- vs SHIELD/ChatGPT deepfake: These works qualitatively evaluate the deepfake detection capabilities of GPT-4V/Gemini but do not systematically quantify token probabilities; ours proposes a complete probabilistic framework.
- vs SBI (SOTA): SBI trains highly generalizable classifiers via self-blending data augmentation; ours is completely zero-shot and requires no deepfake training data, at the cost of a lower AUC (81.3 vs. 93.6 on CelebA-HQ SimSwap).
- The proposed token probability normalization framework can be directly applied to other scenarios requiring classification confidence from VLMs (e.g., medical image analysis, content moderation).
Limitations & Future Work
- Only face-swap deepfakes were tested; other manipulation types such as full-face generation (StyleGAN) and expression manipulation (Face2Face) were not covered.
- Token probabilities cannot be accessed for GPT-4o, restricting it to binary evaluations and limiting the application of closed-source models.
- VLM inference speed is significantly slower than lightweight classifiers (e.g., EfficientNet), which presents latency challenges for real-world deployment.
- Evaluation on recent deepfake methods (e.g., full-body deepfakes generated by Flux) is missing.
- Although the multi-token answer extension is mathematically described, it has not been systematically validated in experiments.
Rating
- Novelty: ⭐⭐⭐⭐ The token probability normalization classification is a simple yet effective innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison across multiple VLMs and detectors.
- Writing Quality: ⭐⭐⭐⭐ The methodological derivation is clear, with complete prompt and algorithm details.
- Value: ⭐⭐⭐⭐ Opens up a new application paradigm for VLMs in security detection.