Towards General Visual-Linguistic Face Forgery Detection

Conference: CVPR 2025
arXiv: 2307.16545
Code: None
Area: AI Safety
Keywords: Deepfake Detection, Vision-Language, Fine-grained Prompt, Multimodal, Generalizability

TL;DR

VLFFD proposes a vision-language paradigm for deepfake detection. It automatically generates blended forgery images with fine-grained text descriptions using a Prompt Forgery Image Generator (PFIG), and then jointly trains on coarse- and fine-grained data using a Coarse-and-Fine Co-training (C2F) framework, significantly enhancing both the generalizability and interpretability of the detection model.

Background & Motivation

Background: The rapid development of Deepfake technology poses a serious threat to security, privacy, and trust. Most existing face forgery detection methods formulate the task as a binary classification problem (real vs. fake), utilizing binary labels (0/1) or mask signals to train the detection models.

Limitations of Prior Work: (1) Lack of semantic information: Binary labels only inform the model "whether it is fake" without indicating "where it is fake and how it is fake." Consequently, models overfit to specific forgery artifacts in the training data, limiting their generalizability. (2) Poor interpretability: Detection systems deployed in real-world scenarios must explain "why an image is classified as forged." However, binary classification-based methods fail to provide such explanations. (3) Difficulty in cross-method generalization: Detectors trained on FaceSwap struggle to generalize to unseen forgery methods such as FaceShifter.

Key Challenge: Existing supervision signals (binary labels/masks) are too coarse-grained and lack semantic information, restricting models to learning superficial statistical features rather than essential forgery patterns. However, manually writing fine-grained text descriptions for large-scale forgery samples is impractical.

Goal: (1) Design a method to automatically generate fine-grained language descriptions for forged images as supervision signals. (2) Leverage language supervision to improve the generalization and interpretability of detection models without introducing additional manual annotation costs.

Key Insight: Because existing datasets lack text annotations, the annotations can be created together with the data: a controllable forgery image generator synthesizes forged images and produces the matching text descriptions at the same time. When the manipulation parameters are known (which region is modified, which method is used, and the blending ratio), accurate linguistic descriptions can be constructed automatically.

Core Idea: Create vision-language forgery detection data by utilizing the "known manipulation parameters \(\rightarrow\) automatic text label generation" paradigm, and then jointly train the model with coarse- and fine-grained data to simultaneously learn generic binary classification capabilities and fine-grained semantic understanding.

Method

Overall Architecture

The VLFFD framework consists of two core phases: (1) Data Generation Phase: The Prompt Forgery Image Generator (PFIG) takes real face images and generates forged images through controllable blending forgery operations, automatically producing corresponding sentence-level text prompts to form fine-grained image-text pairs. (2) Joint Training Phase: The Coarse-and-Fine Co-training (C2F) framework simultaneously uses the original dataset's coarse-grained annotations (binary classification) and the fine-grained data (image-text pairs) generated by PFIG to train the vision-language detection model.

Key Designs

Prompt Forgery Image Generator (PFIG):
- Function: Controllably generate forged face images and automatically produce corresponding fine-grained text descriptions.
- Mechanism: Starting with a real face image, PFIG performs a sequence of controllable forgery manipulations. Specifically, face parsing is first applied to segment the face into multiple semantic regions (eyes, nose, mouth, skin, etc.). Then, one or more regions are randomly selected for replacement—extracting content from the corresponding regions of another identity's face, and blending them using a controllable blending ratio \(\alpha\): \(I_{forge} = \alpha \cdot I_{source}^{region} + (1-\alpha) \cdot I_{target}^{region}\). Crucially, because all manipulation parameters (replaced regions, source identities, blending ratios) are known, precise textual descriptions can be automatically constructed, such as "The eyes and nose regions are replaced from another identity with a blending ratio of 0.7, showing subtle boundary artifacts around the nose bridge".
- Design Motivation: Conventional data augmentation only increases image diversity, whereas PFIG simultaneously enriches the semantic diversity of annotations. The concept of converting known manipulation parameters to automatically generated text avoids expensive manual annotation.
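
The PFIG steps above can be sketched in a few lines. This is a minimal plain-Python illustration under stated assumptions, not the authors' implementation: the names `blend_region`, `describe_forgery`, and `make_pair`, the nested-list image and mask format, the caption template, and the sampling range for the blending ratio are all hypothetical. A real pipeline would obtain the masks from a face parser and operate on image arrays.

```python
import random

Image = list[list[float]]  # grayscale image: rows of pixel intensities in [0, 1]
Mask = list[list[int]]     # 1 inside the semantic region, 0 elsewhere


def blend_region(target: Image, source: Image, mask: Mask, alpha: float) -> Image:
    """Inside the mask, output alpha * source + (1 - alpha) * target; elsewhere keep target."""
    return [
        [alpha * s + (1 - alpha) * t if m else t for t, s, m in zip(t_row, s_row, m_row)]
        for t_row, s_row, m_row in zip(target, source, mask)
    ]


def describe_forgery(regions: list[str], alpha: float) -> str:
    """Fill a caption template (hypothetical wording) from the known parameters."""
    noun, verb = ("regions", "are") if len(regions) > 1 else ("region", "is")
    return (f"The {' and '.join(regions)} {noun} {verb} replaced from another identity "
            f"with a blending ratio of {alpha:.1f}.")


def make_pair(target: Image, source: Image, region_masks: dict[str, Mask],
              rng: random.Random) -> tuple[Image, str]:
    """Sample regions and a blending ratio, then return (forged image, caption)."""
    count = rng.randint(1, len(region_masks))
    regions = sorted(rng.sample(sorted(region_masks), count))
    alpha = round(rng.uniform(0.3, 1.0), 1)  # assumed range, not taken from the paper
    forged = target
    for name in regions:
        forged = blend_region(forged, source, region_masks[name], alpha)
    return forged, describe_forgery(regions, alpha)


# Toy 2x2 example: black target face, white source face, two hand-made masks.
target = [[0.0, 0.0], [0.0, 0.0]]
source = [[1.0, 1.0], [1.0, 1.0]]
masks = {"eyes": [[1, 1], [0, 0]], "nose": [[0, 0], [1, 0]]}
forged, caption = make_pair(target, source, masks, random.Random(0))
print(caption)
```

The caption is built only from values the generator already holds, which is why no manual annotation is needed in this scheme.
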

Coarse-and-Fine Co-training (C2F) Framework:
- Function: Jointly leverage coarse-grained (binary) and fine-grained (image-text pairs) supervision signals within a unified framework.
- Mechanism: The C2F framework consists of a shared vision encoder and two branches: the coarse-grained branch takes images with binary labels from the original dataset to perform real/fake discrimination via a standard classification head; the fine-grained branch takes the image-text pairs generated by PFIG and trains the vision encoder to understand the semantic features of forged regions via contrastive learning (similar to CLIP's image-text matching). The two branches share the vision encoder but employ their own training objectives, optimized jointly through a weighted loss: \(L = L_{cls} + \beta \cdot L_{contrastive}\). The coarse-grained branch guarantees basic detection capabilities, while the fine-grained branch infuses semantic understanding to boost generalization.
- Design Motivation: Training solely with fine-grained data causes bias due to the limited forgery types generated by PFIG, whereas training solely with coarse-grained data lacks semantic information. C2F allows both types of information to complement each other through joint training, with coarse-grained data providing a baseline and fine-grained data bringing enhancements.
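
To make the weighted objective concrete, the sketch below computes \(L = L_{cls} + \beta \cdot L_{contrastive}\) on toy values. It is an illustration under assumptions, not the paper's code: the function names, the one-directional (image-to-text) contrastive form, the temperature of 0.07, and the plain-list embeddings are all hypothetical choices. The coarse batch (logits and labels) and the fine batch (image and text embeddings) are passed separately because they come from different data sources.

```python
import math


def bce_loss(logits: list[float], labels: list[int]) -> float:
    """Mean binary cross-entropy on raw logits (1 = fake, 0 = real), in a stable form."""
    return sum(max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
               for z, y in zip(logits, labels)) / len(logits)


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(image_embs: list[list[float]], text_embs: list[list[float]],
             temperature: float = 0.07) -> float:
    """Image-to-text contrastive loss: the i-th text is the positive for the i-th image."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        peak = max(logits)  # subtract the max before exponentiating for stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(imgs)


def c2f_loss(cls_logits: list[float], labels: list[int],
             image_embs: list[list[float]], text_embs: list[list[float]],
             beta: float = 1.0) -> float:
    """Weighted sum of the coarse (classification) and fine (contrastive) terms."""
    return bce_loss(cls_logits, labels) + beta * info_nce(image_embs, text_embs)


# Toy batches: two coarse-labelled images and two PFIG image-text pairs.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.2, 0.8]]
print(c2f_loss([2.0, -1.5], [1, 0], imgs, txts, beta=0.5))
```

In this sketch `beta` controls how strongly the fine-grained signal shapes the shared encoder; setting it to 0 reduces the objective to the coarse-only baseline.
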

Integration with Multimodal Large Language Models (MLLM):
- Function: Extend the VLFFD paradigm to Multimodal Large Language Models (MLLMs) to further improve interpretability.
- Mechanism: Use the image-text pairs generated by PFIG as instruction tuning data to adapt multimodal large models. The fine-tuned MLLM can not only classify real/fake images but also generate natural language explanations describing the specific locations and methods of forgery, a form of interpretability that binary classifiers cannot offer.
- Design Motivation: The text generation capability of MLLMs allows detection results to be presented in a human-understandable manner directly, which is especially vital for application scenarios such as digital forensics and content moderation.
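
As an illustration of how a PFIG image-text pair might be turned into instruction tuning data, the sketch below wraps one pair as a single-turn record. The schema (`image`, `instruction`, `response`), the prompt wording, and the file path are hypothetical; the source does not specify the format the authors used.

```python
import json


def to_instruction_sample(image_path: str, is_fake: bool, description: str = "") -> dict:
    """Wrap one image and its PFIG description as a single-turn instruction record."""
    verdict = "fake" if is_fake else "real"
    return {
        "image": image_path,
        "instruction": "Is this face image real or fake? Explain your answer.",
        # Real images carry no manipulation description, so only the verdict remains.
        "response": f"This image is {verdict}. {description}".strip(),
    }


sample = to_instruction_sample(
    "pfig/000001.png",  # hypothetical path
    is_fake=True,
    description="The eyes and nose regions are replaced from another identity "
                "with a blending ratio of 0.7.",
)
print(json.dumps(sample, indent=2))
```

Keeping the verdict and the explanation in one response is a design choice of this sketch: it asks the model to justify its decision in the same turn in which it makes it.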

Loss & Training

The coarse-grained branch uses the standard binary cross-entropy loss \(L_{cls}\), while the fine-grained branch adopts the InfoNCE contrastive loss \(L_{contrastive}\). The total loss is a weighted sum of the two. Training is divided into two phases: first pre-training the fine-grained branch on PFIG-generated data, and then jointly fine-tuning the entire framework. Data augmentation includes randomized numbers of forged regions, blending ratios, and source identities within PFIG.
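
The source names the two losses without writing them out. One common way to spell them out, given here as an assumption rather than the paper's exact formulation, is the following. Here \(p_i\) is the predicted fake probability for the \(i\)-th of \(N\) coarse-grained images with label \(y_i\), \(v_i\) and \(t_j\) are L2-normalized image and text embeddings for \(M\) PFIG pairs, and \(\tau\) is a temperature (a hypothetical hyperparameter that the source does not mention):

\[
L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log p_i + (1-y_i)\log(1-p_i)\Big]
\]
\[
L_{contrastive} = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{\exp(v_i^{\top} t_i/\tau)}{\sum_{j=1}^{M}\exp(v_i^{\top} t_j/\tau)}
\]
\[
L = L_{cls} + \beta \cdot L_{contrastive}
\]

A quick sanity check under this form: if every image-text similarity is equal, each softmax term is \(1/M\) and \(L_{contrastive} = \log M\), while well-separated matching pairs drive it toward 0.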

Key Experimental Results

Main Results: Cross-method Generalization (AUC %)

Training Data	Method	FF++ (In-domain)	Celeb-DF	DFDC	DeeperForensics	Avg Cross-domain
FF++	Xception	99.1	73.2	67.8	72.5	71.2
FF++	RECCE	99.3	76.5	70.1	75.8	74.1
FF++	SBI	98.8	79.3	72.6	77.2	76.4
FF++	VLFFD (Ours)	99.5	84.7	76.3	82.1	81.0
FF++	VLFFD + MLLM	99.2	86.1	78.8	83.5	82.8
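
The Avg Cross-domain column appears to be the mean of the three unseen-dataset AUCs (Celeb-DF, DFDC, DeeperForensics). The short script below recomputes it from the table values and derives VLFFD's margin over each baseline; the variable and function names are illustrative only.

```python
# AUC (%) on Celeb-DF, DFDC, DeeperForensics, copied from the table above.
cross_domain_auc = {
    "Xception": (73.2, 67.8, 72.5),
    "RECCE": (76.5, 70.1, 75.8),
    "SBI": (79.3, 72.6, 77.2),
    "VLFFD (Ours)": (84.7, 76.3, 82.1),
    "VLFFD + MLLM": (86.1, 78.8, 83.5),
}


def mean_auc(scores: tuple[float, ...]) -> float:
    """Average AUC over the unseen datasets, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


averages = {method: mean_auc(scores) for method, scores in cross_domain_auc.items()}
baselines = ("Xception", "RECCE", "SBI")
gains = {b: round(averages["VLFFD (Ours)"] - averages[b], 1) for b in baselines}

for method, avg in averages.items():
    print(f"{method}: {avg}")
print("VLFFD gain over baselines:", gains)
```

The recomputed means match the reported column, and the margins over Xception, RECCE, and SBI come out at 9.8, 6.9, and 4.6 points respectively.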

Ablation Study

Configuration	FF++ AUC (%)	Celeb-DF AUC (%)	DFDC AUC (%)	Description
Coarse-only (Baseline)	99.1	73.2	67.8	Standard binary classification training
Fine-only (PFIG only)	97.5	80.3	73.1	Strong semantic understanding but weaker discriminative power
C2F Joint Training	99.5	84.7	76.3	Best performance with complementary coarse and fine features
w/o Region-level Forgery	99.2	79.8	72.4	Whole-face replacement only, lacking local manipulation diversity
w/o Blending Ratio Randomization	99.3	82.1	74.5	Fixed blending ratio reduces data diversity
w/o Contrastive Learning	99.4	78.6	71.9	Fine-grained information cannot be effectively injected without contrastive learning
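
One way to read the ablation table is to rank each configuration by how much cross-domain AUC it loses relative to full C2F joint training. The sketch below does this arithmetic on the table values; averaging the Celeb-DF and DFDC drops into a single number is a choice made here for illustration, not a metric reported in the source.

```python
# (Celeb-DF AUC, DFDC AUC) in %, copied from the ablation table above.
ablation = {
    "Coarse-only (Baseline)": (73.2, 67.8),
    "Fine-only (PFIG only)": (80.3, 73.1),
    "w/o Region-level Forgery": (79.8, 72.4),
    "w/o Blending Ratio Randomization": (82.1, 74.5),
    "w/o Contrastive Learning": (78.6, 71.9),
}
full = (84.7, 76.3)  # C2F Joint Training


def mean_drop(scores: tuple[float, float]) -> float:
    """Average AUC loss versus the full model over the two unseen datasets."""
    return round(sum(f - s for f, s in zip(full, scores)) / len(full), 2)


drops = {name: mean_drop(scores) for name, scores in ablation.items()}
ranked = sorted(drops, key=drops.get, reverse=True)  # largest loss first

for name in ranked:
    print(f"{name}: -{drops[name]:.2f} AUC points")
```

Under this reading, dropping fine-grained supervision entirely costs the most, followed by removing contrastive learning, while fixing the blending ratio costs the least.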

Key Findings

Significant Cross-domain Generalization Improvements: VLFFD raises the average AUC across the three unseen datasets by 4.6 to 9.8 percentage points over the baselines (81.0% vs. 76.4% for SBI, the strongest baseline, and 71.2% for Xception), which suggests that language supervision helps the model learn more fundamental forgery features.
C2F Joint Training is Crucial: Training solely on fine-grained data lowers in-domain AUC from 99.1% to 97.5% (due to the limited forgery patterns generated by PFIG), whereas joint training gives the best in-domain result (99.5%) and the best cross-domain results among the ablation settings.
Contrastive Learning acts as a Key Bridge for Fine-grained Injection: Removing contrastive learning lowers Celeb-DF AUC from 84.7% to 78.6% and DFDC AUC from 76.3% to 71.9%, indicating that simple multi-task learning is less effective than contrastive learning at injecting fine-grained information.
MLLM Integration Further Expands Capabilities: The quantitative gain on unseen datasets is modest (1.4 to 2.5 AUC points, 1.8 on average, with a 0.3-point drop in-domain); the qualitative breakthrough lies in the model's ability to generate human-readable explanations for its decisions.

Highlights & Insights

The "known manipulation parameters \(\rightarrow\) automatic annotation generation" paradigm: This concept is not limited to forgery detection; any task with a controllable synthesis process can benefit from it—such as image editing detection (where describing edits is possible because the applied edits are known) and data tampering detection.
Coarse-and-fine joint training is a practical strategy for signal utilization: existing coarse labels are not wasted, and newly generated fine labels serve as the icing on the cake, lowering the threshold for real-world deployment.
Pioneers the injection of MLLMs into forgery detection and demonstrates the potential of interpretability, paving the way for the field to progress from "detection" to "explanation".

Limitations & Future Work

The forgery methods of PFIG (region replacement + blending) are relatively simple and do not cover high-quality forgeries generated by GANs and diffusion models.
The text descriptions are highly templated, lacking the ability to describe subtle visual artifacts in natural language.
The video dimension is not yet addressed—it currently processes only single-frame images without utilizing temporal artifact cues.
Integration with MLLMs is still in its infancy, with inference speeds far from meeting real-time detection demands.

Comparison with Related Work

vs. SBI (Self-Blended Images): SBI also trains detectors through self-synthesized forged data but only employs binary labels. VLFFD builds on this by adding a layer of language supervision, achieving stronger generalizability.
vs. CLIP-based Methods: Some works attempt to directly use CLIP features for forgery detection but lack fine-grained alignment specialized for the forgery domain. VLFFD achieves more effective alignment by generating domain-specific image-text pairs via PFIG.
vs. Face X-ray: Face X-ray focuses on detecting blending boundaries. VLFFD not only detects the presence of forgery but also describes the method and location of the forgery, offering richer dimensions of information.

Rating

Novelty: ⭐⭐⭐⭐ Introducing the vision-language paradigm to forgery detection is a creative direction, and PFIG is designed simply and effectively.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated on multiple cross-domain benchmarks with comprehensive ablations; the MLLM integration is an added bonus.
Writing Quality: ⭐⭐⭐⭐ The logical flow is clear, offering a natural transition from problem definition to solution.
Value: ⭐⭐⭐⭐ Opens up a new direction of language supervision for forgery detection, and the combination with MLLM is highly forward-looking.