Rethinking VLMs for Image Forgery Detection and Localization¶
Conference: CVPR 2026 arXiv: 2603.12930 Code: github.com/sha0fengGuo/IFDL-VLM Area: Multimodal VLM Keywords: Image forgery detection, VLM semantic bias, decoupled optimization, SAM localization, interpretability
TL;DR¶
This work reveals that VLMs inherently favor semantic plausibility over authenticity (CLIP cosine similarity between forged and original images remains at 96.3–98.5%), and proposes IFDL-VLM, which decouples detection/localization from language explanation into two stages: Stage-1 employs ViT+SAM for detection and localization, and Stage-2 feeds the resulting mask as auxiliary input to a VLM to enhance interpretability. The method achieves state-of-the-art performance across 9 benchmarks.
Background & Motivation¶
Background: In the AIGC era, image forgery detection and localization (IFDL) faces highly realistic synthetic images and hybrid forgeries (AI-generated content combined with traditional editing and post-processing), rendering conventional artifact-based methods increasingly ineffective. Recent approaches such as SIDA and FakeShield integrate VLMs (CLIP+LLM+SAM) into end-to-end pipelines to improve interpretability.
Limitations of Prior Work: (1) The CLIP visual encoder in VLMs is pre-trained on large-scale natural images with an objective of semantic alignment rather than authenticity discrimination. (2) As long as a forged image is "semantically coherent," CLIP features remain nearly unchanged — cosine similarity stays at 96.3% after object replacement and 98.5% after object insertion. (3) End-to-end training propagates CLIP's semantic bias directly into the detection and localization modules, yielding performance inferior to dedicated models.
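The bias check above can be reproduced in spirit with a few lines of code. The sketch below is a minimal illustration only: `encode` is a fixed random linear projection standing in for CLIP's image encoder (the paper's 96.3%/98.5% figures come from the actual CLIP-ViT features), and the "forgery" is a toy local object replacement. The point it demonstrates is that a feature built from global image content barely moves under a local edit.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-in for a CLIP-style image encoder: a fixed linear projection of
# flattened pixels (NOT the real CLIP model, just a global-feature proxy).
proj = rng.standard_normal((512, 3 * 64 * 64)) / np.sqrt(3 * 64 * 64)

def encode(img: np.ndarray) -> np.ndarray:
    return proj @ img.reshape(-1)

original = rng.random((3, 64, 64))
forged = original.copy()
forged[:, 20:44, 20:44] = rng.random((3, 24, 24))  # toy local object replacement

sim = cosine_similarity(encode(original), encode(forged))
print(f"cosine similarity after local edit: {sim:.3f}")  # stays close to 1.0
```

Because the edit touches only a small region, the globally pooled features of the original and forged images remain nearly collinear, mirroring the paper's observation that CLIP representations are insensitive to local tampering.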
Key Challenge: A fundamental conflict exists between VLMs' pursuit of "semantic plausibility" and IFDL's requirement to "perceive inauthenticity."
Goal: To answer two questions — does VLM prior knowledge genuinely benefit IFDL (no), and can detection/localization results in turn help VLMs generate better explanations (yes)?
Key Insight: Fully decouple detection/localization from language explanation, using dedicated ViT+SAM for the former and feeding the localization mask back to the VLM as explicit forgery concept encoding.
Core Idea: Rather than involving VLMs in detection and localization (where they underperform), the proposed approach lets detection results inform VLMs about what to explain — a task at which VLMs excel.
Method¶
Overall Architecture¶
A two-stage decoupled design: Stage-1 trains a ViT backbone (initialized from CLIP-ViT-L/14) with a frozen SAM for detection and localization. The global CLS token is fed into a linear classifier for three-class classification (real / fully synthetic / tampered), while patch tokens are aggregated via multi-head attention into a SEG token that is passed to the SAM mask decoder for localization mask generation. Stage-2 takes the Stage-1 localization mask as auxiliary input, fusing global semantics and local forgery cues via region-aware visual feature enhancement: \(T_{vis} = \alpha \cdot \text{CLIP}(x) + (1-\alpha) \cdot \text{CLIP}(x \odot M)\), and fine-tunes Vicuna-13B to generate language explanations.
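The region-aware visual feature enhancement can be sketched as follows. This is a minimal illustration, not the authors' implementation: `encode` here is a toy mean-pooling stand-in for the frozen CLIP image encoder, and the mask is assumed to be a binary array broadcastable over the image channels.

```python
import numpy as np

def fuse_features(encode, x, mask, alpha=0.5):
    """Region-aware visual feature enhancement (Stage-2 input):
    T_vis = alpha * CLIP(x) + (1 - alpha) * CLIP(x ⊙ M)."""
    global_feat = encode(x)         # global semantics of the full image
    local_feat = encode(x * mask)   # low-level cues from the forged region
    return alpha * global_feat + (1 - alpha) * local_feat

# Toy stand-in for the frozen CLIP image encoder: mean-pool each channel.
def encode(x):
    return x.mean(axis=(1, 2))

x = np.random.rand(3, 224, 224)           # image (C, H, W)
mask = np.zeros((1, 224, 224))
mask[:, 64:128, 64:128] = 1.0             # Stage-1 localization mask
t_vis = fuse_features(encode, x, mask)    # fused visual token, shape (3,)
```

Note how the two extremes recover the ablation settings in the paper: `alpha=1` reduces to the full-image feature (no mask guidance), and `alpha=0` keeps only the masked region (no global context).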
Key Designs¶
- Discovery and Verification of VLM Semantic Plausibility Bias:
- Function: Systematically verifies the negative impact of VLM priors on IFDL.
- Mechanism: CLIP cosine similarity between forged and original images reaches 96.3%–98.5%, demonstrating that CLIP optimizes for high-level scene consistency rather than visual authenticity, rendering forged and authentic image representations indistinguishable.
- Design Motivation: This finding motivates the decoupled design — since CLIP is insensitive to authenticity, it should not participate in detection and localization.
- Reverse Information Flow: Localization Masks Encoding Forgery Concepts:
- Function: Uses Stage-1 detection and localization results as explicit input to the Stage-2 VLM.
- Mechanism: The localization mask \(M\) explicitly indicates "where the forgery is," relieving the VLM from implicitly learning forgery concepts from data. Via \(T_{vis} = \alpha \cdot \text{CLIP}(x) + (1-\alpha) \cdot \text{CLIP}(x \odot M)\) (\(\alpha=0.5\)), both global semantics and low-level cues from the local forged region are preserved.
- Design Motivation: This liberates the VLM from the difficult task of "finding where the forgery is" and refocuses it on "explaining why a given region is forged" — precisely where VLMs are capable.
- Dedicated ViT+SAM Detection and Localization in Stage-1:
- Function: Bypasses VLM bias by using a dedicated model for detection and localization.
- Mechanism: A trainable ViT backbone produces a CLS token for three-class classification and a SEG token for a frozen SAM to generate masks. The loss is \(\mathcal{L}_{st-1} = \mathcal{L}_{bce}(\hat{M},M) + \mathcal{L}_{dice}(\hat{M},M) + \mathcal{L}_{ce}(\hat{D},D)\).
- Design Motivation: A dedicated model is unaffected by CLIP's semantic bias and is better suited to learning low-level forgery traces.
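The Stage-1 objective \(\mathcal{L}_{st-1} = \mathcal{L}_{bce} + \mathcal{L}_{dice} + \mathcal{L}_{ce}\) is standard and can be written out directly. The sketch below uses NumPy for clarity (a real implementation would use differentiable framework ops); the function signatures are illustrative, not the authors' code.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy over the predicted mask."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|)."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def ce_loss(logits, label):
    """Cross-entropy for the 3-way real / fully-synthetic / tampered head."""
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def stage1_loss(pred_mask, gt_mask, logits, label):
    # All three terms carry weight 1.0, matching the paper's setup.
    return (bce_loss(pred_mask, gt_mask)
            + dice_loss(pred_mask, gt_mask)
            + ce_loss(logits, label))
```

A perfect mask prediction drives the BCE and Dice terms toward zero, while the CE term supervises the CLS-token classifier independently of localization quality.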
Loss & Training¶
Stage-1: BCE + Dice loss (localization) + CE loss (classification), all with weight 1.0. Stage-2: Standard language modeling CE loss. Optimizer: AdamW (lr=1e-5, β=(0.9, 0.95)), cosine decay after 100-step warmup, batch size=4 with gradient accumulation=10, mixed-precision training. The SAM image encoder is kept frozen; only the mask decoder is fine-tuned.
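The learning-rate schedule described above (100-step linear warmup, then cosine decay) can be sketched as a plain function of the step index. The total step count is an assumption here, as the paper's summary does not state it; note also that batch size 4 with gradient accumulation 10 yields an effective batch of 40.

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-5, warmup_steps=100):
    """Linear warmup for `warmup_steps`, then cosine decay to 0.
    A sketch of the schedule described above; `total_steps` is assumed,
    not stated in the paper."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Effective batch size: per-step batch 4 x gradient accumulation 10 = 40.
EFFECTIVE_BATCH = 4 * 10
```

The schedule peaks at `base_lr` exactly at the end of warmup and decays smoothly to zero at the final step.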
Key Experimental Results¶
Main Results¶
| Dataset / Task | Metric | IFDL-VLM | SIDA-13B | FakeShield | Gain |
|---|---|---|---|---|---|
| SID-Set Detection | Overall ACC | 0.997 | 0.94 | — | +5.7% |
| SID-Set Detection | Overall F1 | 0.998 | 0.94 | — | +5.8% |
| SID-Set Localization | IoU | 0.65 | 0.44 | — | +21% abs |
| SID-Set Localization | AUC | 0.99 | 0.87 | — | +12% abs |
| 8-dataset Cross-dataset Avg | Avg IoU | 0.47 | 0.38 | 0.34–0.39 | +13% |
| 8-dataset Cross-dataset Avg | Avg F1 | 0.58 | 0.45 | 0.39–0.45 | +19% |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| α=0.5 (default) | CSS 0.853 | Optimal balance between global semantics and local forgery cues |
| α=0 (mask region only) | CSS 0.821 | Lacks global semantic context |
| α=1 (full image only) | CSS 0.798 | No mask guidance; degrades to standard VLM |
| Unfrozen CLIP fine-tuning | Language quality ↓ | Disrupts cross-modal alignment |
| Predicted mask vs. GT | CSS 0.842 | Negligible gap from GT (0.853); framework is robust |
Key Findings¶
- Decoupling outperforms end-to-end training — the seemingly "simpler" pipeline achieves stronger performance by eliminating CLIP bias interference.
- Localization masks substantially improve VLM interpretability: GPT-5 scores 2.36 (Ours) vs. 1.44 (SIDA), a +63.9% improvement.
- In a human preference study, 65.2% of evaluators preferred the explanations generated by the proposed method.
Highlights & Insights¶
- The core finding that VLMs favor semantic plausibility over authenticity may have broad implications for the entire AIGC detection field — any work using VLMs for anomaly or forgery detection should be aware of this bias.
- The reverse information flow design — "detection results aid explanation" — is elegant: masks explicitly encode forgery concepts, simplifying VLM training optimization.
- Strong cross-dataset generalization is demonstrated across 8 unseen datasets.
- The framework is robust to Stage-1 localization errors (predicted masks and GT masks yield nearly equivalent results).
Limitations & Future Work¶
- Severe Stage-1 localization failures can cascade and degrade Stage-2 explanation quality (though experiments demonstrate a degree of robustness).
- Two-stage training increases engineering complexity; future work may explore single-stage decoupled designs.
- Stage-2 still uses a frozen CLIP encoder, which may limit language generation quality in fine-grained scenarios.
- The reliability and potential bias of GPT-5 automatic evaluation warrant further validation.
Related Work & Insights¶
- vs. SIDA: An end-to-end VLM pipeline; detection ACC 0.94 vs. 0.997 (Ours), localization IoU 0.44 vs. 0.65 (Ours). End-to-end training is inferior to the decoupled design.
- vs. FakeShield: An IFDL method using SAM and MLLM; 8-dataset Avg IoU 0.34–0.39 vs. 0.47 (Ours).
- vs. MVSS-Net/CAT-Net: Traditional IFDL methods without language explanation capability; localization performance also falls substantially short of the proposed method.
- The decoupling principle generalizes broadly: when pre-trained model priors conflict with downstream task objectives, forcing end-to-end training is inadvisable — decoupling the strengths of each component is preferable.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The insight on VLM semantic plausibility bias is highly valuable; the decoupled design is counter-intuitive yet effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 datasets, three evaluation dimensions (detection, localization, interpretability), human preference studies, and GPT-based evaluation.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation analysis (cosine similarity verification) is convincing; method description is clear.
- Value: ⭐⭐⭐⭐⭐ High practical value for image authenticity verification in the AIGC era; the core insight is transferable.