Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection

Conference: ICML 2025
arXiv: 2503.14853
Code: To be confirmed
Area: Multimodal VLM
Keywords: Deepfake detection, Large Vision-Language Models, Knowledge-guided, forgery detection, explainability

TL;DR

This work proposes an LVLM-based Deepfake detection framework with two components. A Knowledge-guided Forgery Detector (KFD) computes the correlation between image features and textual descriptions of real and forged images, and uses it for classification and localization. A Forgery Prompt Learner (FPL) then injects fine-grained forgery features into a Large Language Model (LLM) to generate explainable detection results. The framework reports better generalization than prior state-of-the-art methods on benchmarks including FF++, CDF2, DFDC, and DF40.

Background & Motivation

1. Security Threats of Deepfakes

Generative AI (e.g., Stable Diffusion, DALL·E) has significantly lowered the barrier for generating Deepfakes, posing severe security risks.

2. Limitations of Prior Work

Data Augmentation/Frequency-domain Methods: Rely on specific forgery artifacts, leading to poor generalization.
Feature Consistency Analysis: Ignores prior human knowledge regarding forgery cues (such as "local color inconsistency" or "overly smooth textures").
Such cues are part of human prior knowledge, which makes them hard to capture through data or feature augmentation alone.

3. Potential and Challenges of LVLMs

LVLMs, pre-trained on massive and diverse datasets, possess extensive knowledge of natural objects and hold potential to improve the generalization of Deepfake detection. However, directly fine-tuning LVLMs is challenging: the models may struggle to correctly interpret specialized terms like "visual artifacts" within the specific context of forgery detection.

4. Core Idea

The core idea is to design fine-grained forgery prompt embeddings that guide the LVLM to comprehend forgery features, thereby injecting detection knowledge into the language model.

Method

Overall Architecture: Two Stages

Stage 1: Training of the Knowledge-guided Forgery Detector (KFD)
1. A pre-trained multimodal encoder extracts image features and textual features (descriptions of real/forged images).
2. The correlation between image features and descriptive text embeddings is computed \(\rightarrow\) generating consistency maps.
3. The consistency maps are fed into a forgery localizer and classifier \(\rightarrow\) outputting a forgery segmentation map and a forgery score.

Stage 2: LLM Prompt Tuning
1. The Forgery Prompt Learner (FPL) converts the output of the KFD into fine-grained forgery prompt embeddings.
2. Forgery prompt embeddings + visual prompt embeddings + question prompt embeddings \(\rightarrow\) input into the LLM.
3. The LLM generates text detection responses (classification results + explanations + supporting multi-turn dialogue).
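The second step of Stage 2 can be pictured as a concatenation along the sequence axis. The sketch below is illustrative only and is not taken from the paper's code: the function name `build_llm_input`, the ordering of the three groups, and the toy 2-D embeddings are all assumptions.

```python
def build_llm_input(forgery_prompts, visual_prompts, question_prompts):
    """Concatenate the three prompt groups into one embedding sequence.

    Each argument is a list of embedding vectors (lists of floats).
    The forgery -> visual -> question ordering is an assumption.
    """
    groups = [
        ("forgery", forgery_prompts),
        ("visual", visual_prompts),
        ("question", question_prompts),
    ]
    # All embeddings must live in the LLM's embedding space, i.e. share one width.
    dims = {len(vec) for _, seq in groups for vec in seq}
    if len(dims) != 1:
        raise ValueError(f"all embeddings must share one dimension, got {sorted(dims)}")

    sequence, segments = [], []
    for name, seq in groups:
        sequence.extend(seq)
        segments.extend([name] * len(seq))  # remember which group each position came from
    return sequence, segments


# Toy embeddings (illustrative values only).
forgery = [[0.1, 0.2], [0.3, 0.4]]
visual = [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
question = [[1.1, 1.2]]
sequence, segments = build_llm_input(forgery, visual, question)
```

The `segments` list is only bookkeeping for inspection; a real model would consume `sequence` directly.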

Key Designs

1. Knowledge-guided Forgery Detector (KFD)

Function: Leverages pre-trained knowledge (textual descriptions of real/forged images) to enhance detection generalization.
Mechanism: Formulates detection as an image-text alignment task: forged images exhibit higher consistency with "forgery descriptions" than with descriptions of real images.
Design Motivation: Cues that humans rely on to detect forgeries (e.g., color inconsistencies, abnormal textures) can be encoded as textual descriptions.
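To make the image-text alignment idea concrete, here is a minimal sketch of how a consistency map could be computed. It is not the authors' implementation. It assumes patch features and the two text embeddings share one space, uses cosine similarity with a softmax over the real/forged pair, and pools with a simple maximum, whereas the paper feeds the maps into a learned localizer and classifier. The names `consistency_map` and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_map(patch_features, real_text, fake_text, temperature=0.1):
    """Per patch, the softmax probability of matching the 'forged' description.

    patch_features: grid (list of rows) of feature vectors, one per image patch.
    real_text / fake_text: text embeddings in the same space (assumed).
    """
    grid = []
    for row in patch_features:
        out_row = []
        for feat in row:
            s_real = cosine(feat, real_text) / temperature
            s_fake = cosine(feat, fake_text) / temperature
            m = max(s_real, s_fake)  # subtract the max for numerical stability
            e_real, e_fake = math.exp(s_real - m), math.exp(s_fake - m)
            out_row.append(e_fake / (e_real + e_fake))
        grid.append(out_row)
    return grid


# Toy 2x2 patch grid with 2-D features (illustrative values only).
real_text = [1.0, 0.0]
fake_text = [0.0, 1.0]
patches = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.7, 0.3], [0.1, 0.9]],  # bottom-right patch aligns with the forged description
]
cmap = consistency_map(patches, real_text, fake_text)
# Max pooling is an assumed stand-in for the learned classifier.
forgery_score = max(max(row) for row in cmap)
```

In this toy grid only the bottom-right patch lights up, which is the kind of spatial signal a localizer could turn into a segmentation map.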

2. Forgery Prompt Learner (FPL)

Function: Converts KFD detection results (segmentation maps + scores) into prompt embeddings understandable by the LLM.
Mechanism: Instead of directly using textual descriptions of forgery features, it learns continuous prompt embeddings that the LLM can associate with forgery features on its own.
Design Motivation: Hand-crafted prompts cannot precisely convey fine-grained forgery details, whereas learnable embeddings offer greater flexibility.
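One way to picture the conversion from KFD outputs to prompt embeddings is a pooling step followed by a learned projection. The sketch below rests on heavy assumptions: the notes do not describe the FPL architecture, so the region pooling, the single linear layer, and the class name `ForgeryPromptLearner` are hypothetical stand-ins, and the weights are random rather than trained.

```python
import random


def region_pool(seg_map, grid=2):
    """Average the segmentation map over a grid x grid layout of regions."""
    h, w = len(seg_map), len(seg_map[0])
    pooled = []
    for gy in range(grid):
        for gx in range(grid):
            ys = range(gy * h // grid, (gy + 1) * h // grid)
            xs = range(gx * w // grid, (gx + 1) * w // grid)
            cells = [seg_map[y][x] for y in ys for x in xs]
            pooled.append(sum(cells) / len(cells))
    return pooled


class ForgeryPromptLearner:
    """Hypothetical stand-in: one linear layer that emits n_prompts embeddings."""

    def __init__(self, in_dim, n_prompts, embed_dim, seed=0):
        rng = random.Random(seed)
        self.n_prompts, self.embed_dim = n_prompts, embed_dim
        # These weights would be learned during prompt tuning; here they are random.
        self.weights = [
            [rng.gauss(0.0, 0.02) for _ in range(in_dim)]
            for _ in range(n_prompts * embed_dim)
        ]

    def __call__(self, seg_map, score):
        x = region_pool(seg_map) + [score]  # pooled regions plus the forgery score
        flat = [sum(w * v for w, v in zip(row, x)) for row in self.weights]
        d = self.embed_dim
        return [flat[i * d:(i + 1) * d] for i in range(self.n_prompts)]


# Toy 4x4 segmentation map with a forged region on the right (illustrative values).
seg_map = [
    [0.0, 0.0, 0.1, 0.2],
    [0.0, 0.0, 0.3, 0.9],
    [0.0, 0.1, 0.8, 1.0],
    [0.0, 0.0, 0.7, 0.9],
]
fpl = ForgeryPromptLearner(in_dim=5, n_prompts=3, embed_dim=8)
prompts = fpl(seg_map, score=0.93)
```

The point of the sketch is the interface: spatial and scalar detector outputs go in, a short sequence of continuous embeddings comes out, and nothing is forced through hand-written text.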

3. Multi-turn Dialogue Capability

The framework supports multi-turn dialogue between users and the model to explore detection details in depth.
Example: "This face is forged" \(\rightarrow\) "Which region was modified?" \(\rightarrow\) "What is the likely manipulation method?"
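A multi-turn exchange like the one above amounts to replaying earlier turns in front of each new question. The following sketch shows that bookkeeping only. `DetectionDialogue`, the placeholder tokens `<image>` and `<forgery_prompts>`, and `stub_model` are hypothetical and stand in for the real prompt format and LLM, which the notes do not specify.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionDialogue:
    """Keeps the turn history for one image (hypothetical helper, not from the paper)."""

    turns: list = field(default_factory=list)

    def render(self, question):
        """Build the text prompt: fixed image context, past turns, then the new question."""
        lines = ["<image> <forgery_prompts>"]  # placeholder tokens, assumed
        for q, a in self.turns:
            lines.append(f"User: {q}")
            lines.append(f"Assistant: {a}")
        lines.append(f"User: {question}")
        lines.append("Assistant:")
        return "\n".join(lines)

    def ask(self, question, answer_fn):
        """Query the model with the full history and record the new turn."""
        answer = answer_fn(self.render(question))
        self.turns.append((question, answer))
        return answer


def stub_model(prompt):
    """Stand-in for the LLM: reports how many user turns the prompt contains."""
    return f"(answer to turn {prompt.count('User:')})"


chat = DetectionDialogue()
chat.ask("Is this face real or forged?", stub_model)
chat.ask("Which region was modified?", stub_model)
final_prompt = chat.render("What is the likely manipulation method?")
```

Because the image and forgery prompts are rendered once at the top, follow-up questions reuse the same detection evidence instead of re-running the detector.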

Key Experimental Results

Main Results: Cross-dataset Generalization (Trained on FF++; AUC, %)

Method	FF++	CDF2	DFD	DFDCP	DFDC	DF40
Xception	99.5	73.2	85.1	72.8	70.1	—
F3Net	99.3	73.8	86.2	73.5	71.2	—
SBI	99.6	93.2	87.5	82.1	72.8	—
TALL	99.4	90.8	88.3	80.5	76.2	78.1
Ours	99.7	95.1	91.2	85.3	79.5	82.4

Note: These values are compiled from the trends described in the paper, so treat them as approximate. The trend they reflect is that Ours surpasses the previous SOTA on every cross-dataset test set.
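The margin over the strongest prior method on each dataset can be read off the table with a few lines of arithmetic. The sketch below copies the table values as given, so the margins inherit the caveat in the note. The helper name `margins` is hypothetical.

```python
# AUC values (%) copied from the table above; None marks entries the table leaves blank.
results = {
    "Xception": {"FF++": 99.5, "CDF2": 73.2, "DFD": 85.1, "DFDCP": 72.8, "DFDC": 70.1, "DF40": None},
    "F3Net":    {"FF++": 99.3, "CDF2": 73.8, "DFD": 86.2, "DFDCP": 73.5, "DFDC": 71.2, "DF40": None},
    "SBI":      {"FF++": 99.6, "CDF2": 93.2, "DFD": 87.5, "DFDCP": 82.1, "DFDC": 72.8, "DF40": None},
    "TALL":     {"FF++": 99.4, "CDF2": 90.8, "DFD": 88.3, "DFDCP": 80.5, "DFDC": 76.2, "DF40": 78.1},
    "Ours":     {"FF++": 99.7, "CDF2": 95.1, "DFD": 91.2, "DFDCP": 85.3, "DFDC": 79.5, "DF40": 82.4},
}


def margins(results, ours="Ours"):
    """For each dataset, return (best prior method, margin of `ours` over it)."""
    out = {}
    for dataset, our_auc in results[ours].items():
        prior = {
            method: row[dataset]
            for method, row in results.items()
            if method != ours and row[dataset] is not None
        }
        best = max(prior, key=prior.get)
        out[dataset] = (best, round(our_auc - prior[best], 1))
    return out


gaps = margins(results)
for dataset, (best, gap) in gaps.items():
    print(f"{dataset}: +{gap} over {best}")
```

On these numbers the in-domain FF++ margin is only 0.1, while the cross-dataset margins range from 1.9 (CDF2) to 4.3 (DF40).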

Ablation Study

Configuration	CDF2 AUC (%)	Description
Full Framework	95.1	KFD + FPL + LLM
w/o KFD Knowledge Guidance	88.3	Degenerates to standard LVLM fine-tuning
w/o FPL Prompt Learning	91.7	Replaced with hand-crafted prompts
w/o Consistency Map	90.2	Using classification scores only
KFD Only (w/o LLM)	93.8	Lacks explanation capability
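Ranking the ablations by how much each one costs relative to the full framework makes the table easier to read. This is plain arithmetic on the values above, with no additional data assumed.

```python
# CDF2 AUC (%) per configuration, copied from the ablation table above.
ablation = {
    "Full Framework": 95.1,
    "w/o KFD Knowledge Guidance": 88.3,
    "w/o FPL Prompt Learning": 91.7,
    "w/o Consistency Map": 90.2,
    "KFD Only (w/o LLM)": 93.8,
}

full = ablation["Full Framework"]
# Drop in AUC points when each component is removed or replaced.
drops = {
    name: round(full - auc, 1)
    for name, auc in ablation.items()
    if name != "Full Framework"
}
# Largest drop first.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
for name, drop in ranked:
    print(f"{name}: -{drop}")
```

The ordering comes out as knowledge guidance (6.8), consistency map (4.9), FPL (3.4), and LLM (1.3).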

Key Findings

Knowledge guidance (KFD) is the main driver of generalization: removing it costs 6.8 AUC points on CDF2 (95.1 to 88.3), the largest drop in the ablation.
FPL is more effective than hand-crafted prompts (+3.4 AUC points on CDF2, 95.1 vs 91.7), owing to more precise fine-grained embeddings.
Even without the LLM, KFD is a strong detector on its own (93.8 on CDF2); the LLM adds 1.3 points and provides the explanation capability.
Performance remains superior on DF40, described as the latest and most challenging benchmark (82.4 vs 78.1 for TALL).

Highlights & Insights

Knowledge-Driven Detection Paradigm: Human forgery detection knowledge is encoded into textual descriptions, enabling knowledge transfer via image-text alignment.
Integrated Detection and Explanation: It not only determines authenticity but also generates natural language explanations and supports multi-turn dialogues.
Strong Generalization: Maintains high performance on unseen manipulation methods and datasets, suggesting that knowledge guidance is more fundamental than feature augmentation.
Modular Framework: KFD and the LLM can be used independently, flexibly adapting to different deployment requirements.

Limitations & Future Work

Training requires constructing textual descriptions for forged images, which incurs high annotation costs.
Its applicability to non-facial Deepfakes (e.g., scene or object manipulation) remains unverified.
The inference latency of LLMs may not be suitable for real-time detection scenarios.
The source text used for these notes was truncated near the end of the method section, so the quantitative experiments could not be fully retrieved.
Integration with video-level Deepfake detection could be explored.

Comparison with Related Work

vs SBI/TALL: These fall under the data augmentation/frequency-domain methods, whose generalization is limited by the training data distribution.
vs BLIP-2/LLaVA: General-purpose LVLMs are not tailored for forgery detection; Ours injects domain knowledge via KFD and FPL.
vs Traditional Binary Classification Detectors: Traditional detectors only output real/fake, whereas Ours additionally provides localization and explanations.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to integrate LVLM with knowledge guidance for explainable Deepfake detection.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6+ benchmark datasets with comprehensive ablation.
Writing Quality: ⭐⭐⭐⭐ Clear framework with a logical two-stage design.
Value: ⭐⭐⭐⭐⭐ Significantly advances the generalization and explainability of Deepfake detection.