The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts

Conference: CVPR 2026 · arXiv: 2505.17476 · Code: https://github.com/YcZhangSing/AMD
Area: AI Safety / Multimodal Disinformation Detection
Keywords: multimodal manipulation detection, MLLM-driven disinformation, semantic-aligned forgery, deepfake grounding, artifact token

TL;DR

This work identifies a critical and overlooked threat: existing multimodal manipulation detection methods fail to account for MLLMs' ability to generate semantically coherent deceptive narratives. The authors construct MDSM, a semantically aligned manipulation dataset of 441k samples, and propose AMD, a framework based on Artifact Tokens and manipulation-oriented reasoning. With only 0.27B parameters, AMD achieves state-of-the-art cross-domain generalization of 88.18 ACC / 60.25 mAP / 61.02 mIoU.


Background & Motivation

Real-World Threat

Advances in generative AI have made image manipulation (face swapping, attribute editing) increasingly convincing. More critically, attackers no longer merely alter images — they leverage MLLMs (e.g., Qwen2-VL) to dynamically generate semantically consistent, contextually plausible yet factually false textual narratives conditioned on manipulated images. This "Coherence Trap" renders traditional detection methods — which rely on visual-textual inconsistency — entirely ineffective.

Two Fundamental Limitations of Existing Methods

Underestimation of MLLM-driven deception: Mainstream methods such as DGM⁴ and HAMMER target rule-based text manipulation (e.g., simple entity substitution) and are defenseless against fluent, context-adapted false narratives generated by MLLMs. Their core assumption — that detectable semantic inconsistency exists between image and text — no longer holds under semantic-aligned manipulation.

Unrealistic misalignment artifacts: In existing datasets (e.g., DGM⁴), image and text manipulations are performed independently, producing semantically incoherent samples that are readily identifiable by the public without any detection model. Real-world attackers carefully maintain visual-textual consistency to maximize deceptive impact.

Root Cause of Contrastive Learning Failure

In MDSM scenarios, the manipulated image and the MLLM-generated text are inherently well-matched. Consequently, contrastive learning-based detection paradigms — as employed by ASAP and HAMMER — cannot extract meaningful signals from image-text alignment. Models must instead rely on external knowledge and artifact traces (e.g., unnatural textures from face swapping, statistical patterns in MLLM-generated text) to make judgments.


Method

Overall Architecture: AMD (Artifact-aware Manipulation Diagnosis)

AMD is built upon Florence-2 and adopts a sequence-to-sequence architecture that unifies detection and localization as text generation. The framework consists of three stages:

  1. Multi-modal Input Embedding: Image, text, and learnable Artifact Tokens are concatenated into a unified input sequence.
  2. Artifact Pre-perception Encoding (APE): A shallow encoder extracts manipulation artifact cues and injects them into the Artifact Tokens.
  3. Manipulation-Oriented Reasoning (MOR): A deep encoder-decoder performs detection reasoning, generating text output containing verdicts and coordinates.

Key Design 1: Artifact Token Embedding

Learnable Artifact Tokens \(E_a \in \mathbb{R}^{n_a \times d}\) are introduced and concatenated with image embeddings \(E_v\) and text embeddings \(E_t\) to form the input sequence \(S_{inp} = [E_v; E_a; E_t]\). The Artifact Tokens serve as "artifact containers" that progressively accumulate manipulation-related pattern information during training, compensating for the absence of inconsistency signals in semantic-aligned scenarios.
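Concretely, the concatenation can be sketched in a few lines of PyTorch (shapes, batch handling, and the initialization scale are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ArtifactTokenEmbedding(nn.Module):
    """Minimal sketch of the Artifact Token design: n_a learnable tokens
    E_a are inserted between image and text embeddings to form
    S_inp = [E_v; E_a; E_t]."""

    def __init__(self, n_a: int = 8, d: int = 64):
        super().__init__()
        # Learnable "artifact containers", E_a in R^{n_a x d}
        self.E_a = nn.Parameter(torch.randn(n_a, d) * 0.02)

    def forward(self, E_v: torch.Tensor, E_t: torch.Tensor) -> torch.Tensor:
        # E_v: (B, n_v, d) image embeddings; E_t: (B, n_t, d) text embeddings
        B = E_v.size(0)
        E_a = self.E_a.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([E_v, E_a, E_t], dim=1)  # (B, n_v + n_a + n_t, d)
```

Because the tokens are plug-in parameters rather than edits to the backbone, they can be trained while the pre-trained weights stay untouched.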

Key Design 2: Artifact Pre-perception Encoding (APE)

After the input sequence passes through the pre-perception encoder \(\mathcal{E}_m^p\), the updated \(\hat{E}_a\) is extracted and a global artifact representation \(\bar{E}_a\) is obtained via weighted pooling:

\[\mathcal{W} = m^\top \text{ReLU}(\mathcal{M}\hat{E}_a^\top + b)\]

A binary classifier then determines whether manipulation artifacts are present. Key strategies:

  • Freezing encoder parameters: When optimizing the classification loss \(\mathcal{L}_{APE}\), \(\mathcal{E}_m^p\) is frozen so that more artifact cues accumulate into the Artifact Tokens while the MLLM's original world knowledge is preserved.
  • Replacing input embeddings: The image and text embeddings in \(\hat{S}\) are replaced with the original \(E_v, E_t\), retaining only the enhanced \(\hat{E}_a\), forming \(S_a = [E_v; \hat{E}_a; E_t]\).
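A minimal PyTorch sketch of APE, assuming a small Transformer stands in for the pre-perception encoder \(\mathcal{E}_m^p\) and that the pooling weights \(\mathcal{W}\) are softmax-normalized (both are assumptions; the actual model builds on Florence-2's layers):

```python
import torch
import torch.nn as nn

class APE(nn.Module):
    """Sketch of Artifact Pre-perception Encoding."""

    def __init__(self, d: int = 64):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # E_m^p
        # Weighted pooling: W = m^T ReLU(M E_a_hat^T + b)
        self.M = nn.Linear(d, d)
        self.m = nn.Linear(d, 1, bias=False)
        self.classifier = nn.Linear(d, 2)  # artifact present / absent

    def freeze_encoder(self):
        # When optimizing L_APE the pre-perception encoder is frozen, so
        # artifact cues accumulate in the tokens rather than the weights.
        for p in self.encoder.parameters():
            p.requires_grad_(False)

    def forward(self, E_v, E_a, E_t):
        S_hat = self.encoder(torch.cat([E_v, E_a, E_t], dim=1))
        n_v, n_a = E_v.size(1), E_a.size(1)
        E_a_hat = S_hat[:, n_v:n_v + n_a]            # updated artifact tokens
        w = self.m(torch.relu(self.M(E_a_hat)))      # (B, n_a, 1) scores
        w = torch.softmax(w, dim=1)                  # normalization assumed
        E_a_bar = (w * E_a_hat).sum(dim=1)           # global artifact rep.
        logits = self.classifier(E_a_bar)            # input to L_APE
        # Embedding replacement: keep only the enhanced artifact tokens,
        # restoring the original image/text embeddings.
        S_a = torch.cat([E_v, E_a_hat, E_t], dim=1)  # S_a = [E_v; E_a_hat; E_t]
        return logits, S_a
```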

Key Design 3: Manipulation-Oriented Reasoning (MOR)

MOR incorporates two auxiliary tasks to guide reasoning:

Visual Artifact Capture via Grounding (VAA): The Artifact Tokens \(\hat{E}_a^m\) are aggregated via attention pooling into a query vector \(q_a\), which then aggregates spatial manipulation cues from image features \(\hat{E}_v^m\) via cross-attention. The result is fed into a bounding box detector to generate manipulation region coordinates. The localization loss is \(\mathcal{L}_{IMG} = \mathcal{L}_1 + \mathcal{L}_{IoU}\).
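The VAA pipeline might be sketched as follows (head counts, dimensions, and the learnable pooling query are assumptions):

```python
import torch
import torch.nn as nn

class VisualArtifactGrounding(nn.Module):
    """Sketch of the VAA grounding head: artifact tokens are attention-pooled
    into a single query q_a, which gathers spatial cues from image features
    via cross-attention before a small MLP regresses box coordinates."""

    def __init__(self, d: int = 64):
        super().__init__()
        self.pool_q = nn.Parameter(torch.randn(1, 1, d) * 0.02)
        self.pool_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.bbox_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))

    def forward(self, E_a_m, E_v_m):
        B = E_a_m.size(0)
        # Attention pooling: a learnable query attends over the artifact tokens
        q_a, _ = self.pool_attn(self.pool_q.expand(B, -1, -1), E_a_m, E_a_m)
        # q_a then queries the image features for spatial manipulation cues
        ctx, _ = self.cross_attn(q_a, E_v_m, E_v_m)
        return self.bbox_head(ctx).sigmoid().squeeze(1)  # (B, 4) normalized box
```

The predicted box would then be supervised with \(\mathcal{L}_{IMG} = \mathcal{L}_1 + \mathcal{L}_{IoU}\) against the ground-truth manipulation region.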

Dual-Branch Manipulation Guidance (DBM): Image+Artifact features and text features are each used as queries in cross-attention interactions, forming dual-branch classification:

\[u_v = \text{Attention}(\hat{E}_{v+a}^m, \hat{E}_t^m, \hat{E}_t^m), \quad u_t = \text{Attention}(\hat{E}_t^m, \hat{E}_{v+a}^m, \hat{E}_{v+a}^m)\]

Each branch independently classifies manipulation, enhancing the model's sensitivity to forged media.
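A hedged sketch of the dual-branch interaction (mean pooling before each classifier is an assumption):

```python
import torch
import torch.nn as nn

class DualBranchGuidance(nn.Module):
    """Sketch of DBM: each modality queries the other via cross-attention,
    and each branch classifies manipulation independently."""

    def __init__(self, d: int = 64, n_cls: int = 2):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.attn_t = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.cls_v = nn.Linear(d, n_cls)
        self.cls_t = nn.Linear(d, n_cls)

    def forward(self, E_va, E_t):
        # u_v = Attn(E_{v+a}, E_t, E_t); u_t = Attn(E_t, E_{v+a}, E_{v+a})
        u_v, _ = self.attn_v(E_va, E_t, E_t)
        u_t, _ = self.attn_t(E_t, E_va, E_va)
        # Pool each branch and classify independently (pooling is assumed)
        return self.cls_v(u_v.mean(dim=1)), self.cls_t(u_t.mean(dim=1))
```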

Key Design 4: Token Redundancy Penalty (TRP)

To prevent redundant or repetitive representations within Artifact Tokens, two regularization terms are designed:

  • Orthogonality constraint \(\mathcal{L}_{orth}\): Based on the Gram matrix, penalizes non-orthogonality among column vectors of \(E_a\) (encouraging different tokens to encode distinct information).
  • Distribution modulation \(\mathcal{L}_{mod}\): Uses KL divergence to drive each token's energy distribution toward a uniform distribution, avoiding information loss due to checkerboard patterns.
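The two regularizers could be implemented roughly as follows (the token orientation of the Gram matrix, the normalization, and the direction of the KL divergence are all assumptions):

```python
import torch
import torch.nn.functional as F

def trp_losses(E_a: torch.Tensor):
    """Sketch of the Token Redundancy Penalty. E_a: (n_a, d) artifact tokens."""
    n_a, d = E_a.shape
    # L_orth: penalize off-diagonal entries of the Gram matrix of the
    # (normalized) tokens so different tokens encode distinct information.
    E_n = F.normalize(E_a, dim=1)
    gram = E_n @ E_n.t()                               # (n_a, n_a)
    off_diag = gram - torch.diag(torch.diagonal(gram))
    l_orth = off_diag.pow(2).sum() / (n_a * (n_a - 1))
    # L_mod: drive each token's energy distribution over dimensions toward
    # a uniform distribution via KL divergence.
    energy = F.softmax(E_a.pow(2), dim=1)              # per-token energy dist.
    uniform = torch.full_like(energy, 1.0 / d)
    l_mod = F.kl_div(energy.log(), uniform, reduction="batchmean")
    return l_orth, l_mod
```

Perfectly orthogonal tokens incur zero \(\mathcal{L}_{orth}\); tokens whose energy concentrates in a few dimensions incur a large \(\mathcal{L}_{mod}\).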

Loss & Training

The total loss is the sum of five terms:

\[\mathcal{L} = \mathcal{L}_{APE} + \mathcal{L}_{DBM} + \mathcal{L}_{IMG} + \mathcal{L}_{TRP} + \mathcal{L}_{LM}\]

At inference, all auxiliary heads (APE, DBM, IMG, TRP) are discarded and only the language modeling output is retained, making inference highly efficient. The model outputs detection results (real/fake verdict, manipulation type, coordinates) as plain text via a heuristic QA prompt.
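As a trivial sketch of the objective, the five terms are summed with equal weights (per the loss equation above); the heads producing the first four terms exist only at training time:

```python
import torch

def amd_training_loss(l_ape, l_dbm, l_img, l_trp, l_lm):
    """Equal-weight sum of the five training objectives. At inference the
    auxiliary heads (APE, DBM, IMG, TRP) are discarded and only the
    language-modeling output is decoded."""
    return l_ape + l_dbm + l_img + l_trp + l_lm
```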


Key Experimental Results

MDSM Dataset Statistics

  • Total scale: 441,423 samples across 5 news domains (NYT, Guardian, USA Today, Washington Post, BBC)
  • Manipulation types: Face Swap (FS), Face Attribute (FA), Text Fabrication (TF), FS&TF, FA&TF
  • Comparison with DGM⁴: MDSM is the first multimodal manipulation detection benchmark to simultaneously feature MLLM involvement, semantic alignment, large scale, and multi-source domains.

Main Results: MDSM Cross-Domain Detection (Table 2)

| Method | Train Domain | Params | AVG ACC | AVG mAP | AVG mIoU |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-72B (zero-shot) | — | 72B | 33.72 | 33.47 | 0.06 |
| GPT-4o (zero-shot) | — | — | 33.92 | 33.33 | 1.17 |
| Gemini-2.0 (zero-shot) | — | — | 38.83 | 32.03 | 1.72 |
| ViLT | Guardian | 121M | 76.61 | 49.90 | 35.67 |
| HAMMER | Guardian | 441M | 74.32 | 48.33 | 43.23 |
| HAMMER++ | Guardian | 441M | 75.10 | 49.01 | 48.49 |
| FKA-Owl | Guardian | 6,771M | 84.12 | 58.13 | 52.20 |
| AMD (Ours) | Guardian | 277M | 88.18 | 60.25 | 61.02 |

Key finding: AMD with only 277M parameters surpasses FKA-Owl at 6.8B (ACC +4.06, mAP +2.12, mIoU +8.82). Zero-shot large models nearly completely fail on this task (mIoU close to 0).

DGM⁴ Cross-Domain Detection (Table 3)

| Method | AVG ACC | AVG mAP | AVG P_tok | AVG mIoU |
| --- | --- | --- | --- | --- |
| HAMMER | 65.45 | 47.10 | 77.41 | 45.97 |
| HAMMER++ | 65.61 | 47.36 | 77.34 | 46.19 |
| FKA-Owl | 71.96 | 42.68 | 83.31 | 44.15 |
| AMD (Ours) | 74.47 | 52.91 | 80.01 | 51.87 |

AMD also achieves the best overall performance on the conventional DGM⁴ dataset, demonstrating that the framework generalizes not only to the new MDSM scenario but also to traditional manipulation settings.

Ablation Study (Table 4a)

| LM | APE | IMG | DBM | TRP | NYT ACC | NYT mAP | NYT mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ |  |  |  |  | 76.92 | 46.38 | 58.77 |
| ✓ | ✓ |  |  |  | 82.93 | 47.12 | 60.13 |
| ✓ | ✓ | ✓ |  |  | 82.97 | 47.18 | 61.78 |
| ✓ | ✓ | ✓ | ✓ |  | 83.42 | 66.47 | 62.14 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 83.96 | 69.39 | 63.56 |

  • APE contributes most: ACC improves from 76.92 to 82.93 (+6.01), confirming that artifact pre-perception is critical for MLLM adaptation.
  • DBM yields the largest mAP gain: 47.18 → 66.47 (+19.29); dual-branch guidance substantially enhances manipulation type discrimination.
  • TRP provides consistent gains: Small but consistent improvements across all metrics, validating the effectiveness of token redundancy reduction.

Key Findings

  • Text-modality manipulation is harder to detect: In-domain AP is 88.45 for FA versus 79.84 for TF; cross-domain AP is 71.37 for FA versus 57.53 for TF. MLLM-generated narratives are thus more deceptive than visual edits, underscoring the difficulty of the MDSM benchmark.
  • Cross-MLLM generalization: AMD trained on NYT and tested on narratives generated by Qwen-VL, X-InstructBLIP, LLaVA, and mPLUG-Owl achieves in-domain AP >76 and cross-domain AP >53, indicating that AMD does not overfit to generation patterns of any specific MLLM.
  • Efficiency advantage: AMD has only 277M parameters with an inference throughput of 13.38 pairs/s, far outperforming FKA-Owl's 6,771M parameters / 1.33 pairs/s.

Highlights & Insights

  1. Forward-looking problem formulation: This work is the first to explicitly define "MLLM-driven semantically aligned multimodal manipulation" as a new threat scenario. Traditional methods assume that image-text inconsistency can be captured by contrastive learning, but this assumption completely breaks down when attackers deliberately maintain consistency — a long-overlooked yet highly practically relevant gap.

  2. Elegant Artifact Token design: Rather than directly modifying pre-trained MLLM parameters, the method introduces learnable plug-in tokens to accumulate artifact information, simultaneously preserving world knowledge and injecting domain-specific capability. The strategy of freezing the encoder while replacing embeddings is an elegant knowledge-preservation mechanism.

  3. Advantages of unified text output: Unifying detection (real/fake), classification (manipulation type), and localization (bbox coordinates) into text output is simpler, more general, and more extensible than multi-head architectures such as HAMMER. Discarding auxiliary heads at inference also avoids train-inference discrepancy.

  4. Noteworthy dataset construction pipeline: The approach — first manipulating images, then feeding manipulation metadata (e.g., swapped-in identity names) to an MLLM to generate aligned text — can be viewed as a general paradigm for adversarial data augmentation applicable to any scenario requiring semantically consistent attacks. The idea of learnable plug-in tokens combined with frozen pre-trained parameters is also worth exploring in other domain adaptation settings.


Limitations & Future Work

  1. Focus limited to face manipulation: The current MDSM dataset covers only face swapping and facial attribute editing, excluding broader manipulation types such as scene editing (e.g., background replacement, object removal) and full-image synthesis. Extending to non-face-centric manipulation is an important future direction.

  2. Coarse text detection granularity: Although text manipulation is annotated at the sample level, word-level or sentence-level fine-grained annotations are absent (unlike DGM⁴'s fake token grounding), limiting precise localization of specific fabricated content within MLLM-generated text.

  3. Evaluation confined to news domain: All experiments are conducted on news data; generalization to informal text environments such as social media, forums, and instant messaging remains unverified.

  4. Backbone model choice: AMD is based on Florence-2 (0.27B); adopting a larger MLLM backbone may further improve performance but requires re-examining the efficiency-effectiveness trade-off.

  5. Adversarial robustness not explored: Attackers may design adaptive attacks targeting AMD's Artifact Token mechanism; robustness analysis in this regard is absent.


Related Works & Context

  • DGM⁴ / HAMMER: Representative works in multimodal manipulation detection, but their assumption of image-text inconsistency leads to significant performance degradation in MDSM scenarios.
  • FKA-Owl: An MLLM-based detection method (6.8B parameters) that approaches AMD on some metrics but requires 24× more parameters, highlighting the importance of lightweight design.
  • Florence-2: The backbone of AMD, providing strong vision-language pre-training knowledge and a unified seq2seq architecture.
  • Broader implications: For any scenario requiring "using MLLMs to detect MLLM-generated content" (e.g., AI-generated text detection, synthetic image detection), the Artifact Token + knowledge-preservation strategy proposed here offers a reusable design paradigm.

Rating

| Dimension | Score (1–10) | Notes |
| --- | --- | --- |
| Problem Importance | 9 | MLLM-driven semantic-consistency manipulation is a real and overlooked threat |
| Novelty | 8 | Artifact Token + APE + MOR + TRP combination is elegantly designed |
| Experimental Thoroughness | 8 | Cross-domain, cross-MLLM, ablation, and efficiency comparisons are comprehensive |
| Dataset Contribution | 9 | 441k large-scale semantically aligned multimodal manipulation benchmark fills a critical gap |
| Writing Quality | 8 | Motivation is clearly articulated; figures and tables are professional |
| Overall | 8.4 | Precise problem definition with dual contributions in dataset and method; a significant advancement in the field |