The Coherence Trap: MLLM-Crafted Narratives Exploit Manipulated Visual Contexts¶
- Conference: CVPR 2026
- arXiv: 2505.17476
- Code: https://github.com/YcZhangSing/AMD
- Area: Multimodal VLM
- Keywords: multimodal manipulation detection, MLLM-driven disinformation, semantic alignment, deepfake grounding, dataset
TL;DR¶
This paper identifies two fundamental flaws in existing multimodal disinformation detection—underestimating semantically coherent fake narratives generated by MLLMs and over-reliance on simple misalignment artifacts—and constructs the 441k-sample MDSM dataset (image manipulation + MLLM-generated semantically aligned text). The proposed AMD framework (Artifact Pre-perception + Manipulation-Oriented Reasoning) achieves 88.18 ACC / 60.25 mAP / 61.02 mIoU on cross-domain detection.
Background & Motivation¶
Multimodal fake news detection faces new challenges: (1) Existing methods (DGM4/HAMMER) primarily address rule-based text manipulation (e.g., simple named entity replacement), overlooking the ability of MLLMs to dynamically generate fluent, contextually plausible yet misleading narratives conditioned on manipulated images—this "semantic coherence trap" renders traditional contrastive learning ineffective; (2) In existing datasets, image and text manipulations are performed independently, producing semantic inconsistencies that are easily identified by the public. Real-world attackers deliberately maintain visual-textual consistency to maximize deceptive effect.
Core Problem¶
How to detect and localize semantically coherent, MLLM-driven multimodal manipulation—where image editing is followed by MLLM-based text regeneration to preserve visual-textual alignment?
Method¶
MDSM Dataset Construction¶
- Data Sources: GoodNews / VisualNews / N24News; 2.1M+ image-text pairs filtered to retain samples containing faces and named entities
- Image Manipulation: Face Swap (SimSwap / e4s) and Face Attribute editing (StyleCLIP / HFGI, reversing emotional expression)
- Text Manipulation: Qwen2-VL generates semantically aligned fake narratives—given the manipulated image and a named entity list (meta-information), the MLLM produces text that is visually consistent yet factually false
- 5 Manipulation Combinations: FS / FS&TF / FA / FA&TF / TF
- Scale: 441k samples from Guardian / NYT / USA Today / Washington Post / BBC, supporting cross-domain evaluation
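The authors deliberately do not release their generation pipeline or prompts, so the following is a purely illustrative sketch of the text-regeneration step: assembling an instruction for an MLLM (e.g., Qwen2-VL) from a sample's named-entity meta-information and original caption. The function name and prompt wording are hypothetical, not the authors'.

```python
def build_alignment_prompt(entities, original_caption):
    """Compose a caption-regeneration instruction from a sample's
    meta-information (named-entity list) and its original caption.
    Illustrative only; the paper's actual prompts are withheld."""
    entity_list = ", ".join(entities)
    return (
        "You are given a news photo and its metadata.\n"
        f"Named entities: {entity_list}\n"
        f"Original caption: {original_caption}\n"
        "Write a fluent news caption consistent with what the image "
        "now shows, reusing the named entities where appropriate."
    )

# Hypothetical sample; the manipulated image itself would be passed
# to the MLLM alongside this text instruction.
prompt = build_alignment_prompt(
    entities=["Jane Doe", "Downing Street"],
    original_caption="Jane Doe smiles outside Downing Street.",
)
```

The key design point is that the MLLM conditions on the *manipulated* image, so the regenerated text stays visually consistent while becoming factually false.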
AMD Framework¶
Built upon Florence-2; three-stage pipeline:
- Multi-modal Input Embedding: Learnable artifact tokens \(E_a \in \mathbb{R}^{n_a \times d}\) are concatenated between the image and text embeddings: \(S_{inp} = [E_v; E_a; E_t]\)
- Artifact Pre-perception Encoding (APE): The input sequence is processed by a frozen pre-perception encoder \(\mathcal{E}_{mp}\) to extract artifact tokens \(\hat{E}_a\); a binary classification head (weighted pooling + classifier) provides manipulation detection supervision \(\mathcal{L}_{APE}\), injecting artifact cues into the artifact tokens. Critically, the encoder parameters are frozen to preserve the original MLLM knowledge, and the original image-text embeddings are restored to maintain reasoning capability.
- Manipulation-Oriented Reasoning (MOR):
    - Visual Artifact Aggregation (VAA): Artifact tokens serve as queries in cross-attention over visual features to aggregate spatially localized manipulation information for bounding box prediction (\(\mathcal{L}_{IMG}\))
    - Dual-Branch Manipulation (DBM): Visual+artifact features and text features each serve as queries in cross-modal cross-attention, with dual-branch binary classification (\(\mathcal{L}_{DBM}\)); mAP improves from 47.18 to 66.47
- Language Modeling: Autoregressive generation of textual answers (options + coordinates)
- Token Redundancy Penalty (TRP): Orthogonality constraint \(\mathcal{L}_{orth}\) + KL-divergence uniformity constraint \(\mathcal{L}_{mod}\), preventing the artifact tokens from collapsing into redundant copies
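The token layout and the VAA step above can be sketched in plain numpy. The sizes and the single-head, projection-free attention are illustrative simplifications for exposition, not the actual Florence-2 implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_v, n_a, n_t = 64, 196, 8, 32          # illustrative dimensions

E_v = rng.standard_normal((n_v, d))        # visual embeddings
E_a = rng.standard_normal((n_a, d))        # learnable artifact tokens
E_t = rng.standard_normal((n_t, d))        # text embeddings

# Input layout: artifact tokens sit between image and text embeddings,
# S_inp = [E_v; E_a; E_t].
S_inp = np.concatenate([E_v, E_a, E_t], axis=0)   # (n_v + n_a + n_t, d)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention
    (learned Q/K/V projections omitted for clarity)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

# VAA: artifact tokens query the visual features to aggregate
# spatially localized manipulation cues for box prediction.
agg = cross_attention(E_a, E_v)            # (n_a, d)
```

In DBM the same cross-attention pattern is reused with visual+artifact features and text features taking turns as queries, each branch feeding a binary classifier.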
Total Loss¶
\(\mathcal{L} = \mathcal{L}_{APE} + \mathcal{L}_{DBM} + \mathcal{L}_{IMG} + \mathcal{L}_{TRP} + \mathcal{L}_{LM}\)
All auxiliary heads are discarded at inference; only text generation is retained.
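The TRP term can be illustrated as follows. The paper's exact formulation is not reproduced here; this minimal sketch assumes a Frobenius-style penalty on the off-diagonal cosine similarities of the artifact tokens, which captures the intent of \(\mathcal{L}_{orth}\) (keep tokens from collapsing into redundant copies):

```python
import numpy as np

def orthogonality_penalty(E_a):
    """Penalize pairwise similarity between artifact tokens so they
    do not collapse into redundancy (one plausible form of L_orth)."""
    E = E_a / np.linalg.norm(E_a, axis=1, keepdims=True)  # L2-normalize rows
    gram = E @ E.T                                        # cosine similarities
    off_diag = gram - np.eye(E.shape[0])                  # drop self-similarity
    return np.sum(off_diag ** 2)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 64))         # diverse tokens: low penalty
redundant = np.tile(tokens[:1], (8, 1))       # 8 identical tokens: high penalty

print(orthogonality_penalty(tokens) < orthogonality_penalty(redundant))  # True
```

The KL-divergence uniformity constraint \(\mathcal{L}_{mod}\) plays a complementary role, encouraging the tokens' attention to spread rather than concentrate.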
Key Experimental Results¶
MDSM (trained on Guardian):

| Method | AVG ACC↑ | AVG mAP↑ | AVG mIoU↑ |
|---|---|---|---|
| ViLT | 76.61 | 49.90 | 35.67 |
| HAMMER++ | 75.10 | 49.01 | 48.49 |
| FKA-Owl (7B) | 84.12 | 58.13 | 52.20 |
| AMD (0.27B) | 88.18 | 60.25 | 61.02 |
DGM4 (trained on Guardian):

| Method | AVG ACC↑ | AVG mAP↑ | AVG mIoU↑ |
|---|---|---|---|
| HAMMER++ | 65.61 | 47.36 | 46.19 |
| FKA-Owl | 71.96 | 42.68 | 44.15 |
| AMD | 74.47 | 52.91 | 51.87 |
- Zero-shot general-purpose models (GPT-4o / Gemini-2.0 / Qwen3-VL-235B) achieve only ~33% ACC on MDSM—semantically coherent manipulation poses extreme challenges to current MLLMs
- Cross-MLLM generalization: AMD maintains 53+ AP when evaluated on text generated by Qwen-VL / LLaVA / mPLUG-Owl / X-InstructBLIP
- Efficiency: 276M parameters vs. FKA-Owl's 6,771M; inference speed 13.38 p/s vs. FKA-Owl's 1.33 p/s
Ablation Study¶
- APE: ACC 76.92 → 82.93 (+6)—pre-perceiving artifact cues is essential
- DBM: mAP 47.18 → 66.47 (+19)—dual-branch cross-modal discrimination substantially improves classification
- IMG: mIoU 60.13 → 61.78—grounding auxiliary task provides additional benefit
- TRP: consistent marginal gains across ACC and mIoU—reducing token redundancy helps
- t-SNE visualization: artifact token class clusters become progressively more distinct across the three processing stages
Highlights & Insights¶
- Defines a highly practically relevant new problem: MLLM-driven semantically coherent multimodal manipulation—harder to detect than rule-based text replacement
- MDSM fills a critical gap: 441k samples, 5 media domains, semantic alignment, cross-domain evaluation support
- AMD with only 276M parameters outperforms the 7B-scale FKA-Owl—the unified seq2seq framework is more efficient than multi-head architectures
- APE's "freeze encoder + train artifact tokens only" strategy elegantly preserves MLLM knowledge
- Thorough ethical consideration: generation pipeline and prompts are not released, access is restricted to research use, and images are watermarked
Limitations & Future Work¶
- Only face-related manipulations (face swap / attribute editing) are considered; manipulations of other objects or scene-level edits are not covered
- Text manipulation relies solely on Qwen2-VL for generation—cross-MLLM generalization is validated, but coverage of the latest reasoning-oriented models is lacking
- Florence-2 is a relatively small backbone; larger backbones may yield further improvements
- Noticeable cross-domain generalization gaps remain on certain domain pairs (e.g., performance drops when training on NYT and testing on USA Today)
Related Work & Insights¶
- vs. DGM4/HAMMER: DGM4's independent image-text manipulation produces semantic inconsistencies that are easier to detect; MDSM's aligned manipulation is harder, and HAMMER achieves only 44 mAP on MDSM
- vs. MMFakeBench: Only ~30% of its samples are semantically aligned, and its 11k scale is insufficient for training; MDSM provides 100% alignment at 441k scale
- vs. FKA-Owl: A 7B model that performs only binary classification, with no fine-grained categorization or localization; AMD unifies detection, classification, and grounding in a 0.27B model
- vs. general MLLMs (GPT-4o, etc.): Zero-shot ~33% ACC—demonstrating the necessity of task-specific training for semantically coherent manipulation detection
Broader Implications¶
- MLLM-generated disinformation constitutes a genuine societal security threat—this work provides foundational infrastructure for defending against such attacks
- The artifact token pre-perception design is generalizable to other MLLM applications requiring detection of specific signals
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First systematic definition and treatment of MLLM-driven semantically coherent multimodal manipulation
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 441k dataset, 5-domain cross-domain evaluation, 4-MLLM generalization, zero-shot general model comparison, comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clear, dataset construction is detailed, ethical considerations are thorough
- Value: ⭐⭐⭐⭐⭐ Dataset and method together define a new paradigm for disinformation detection in the MLLM era