# The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts
- Conference: CVPR 2026
- arXiv: 2505.17476
- Code: https://github.com/YcZhangSing/AMD
- Area: AI Safety / Multimodal Disinformation Detection
- Keywords: multimodal manipulation detection, MLLM-driven disinformation, semantic-aligned forgery, deepfake grounding, artifact token
## TL;DR
This work identifies a critical and overlooked threat: existing multimodal manipulation detection methods fail to account for MLLMs' ability to generate semantically coherent deceptive narratives. The authors construct MDSM, a semantically aligned manipulation dataset of 441k samples, and propose AMD, a framework based on Artifact Tokens and manipulation-oriented reasoning. With only 0.27B parameters, AMD achieves state-of-the-art cross-domain generalization of 88.18 ACC / 60.25 mAP / 61.02 mIoU.
## Background & Motivation
### Real-World Threat
Advances in generative AI have made image manipulation (face swapping, attribute editing) increasingly convincing. More critically, attackers no longer merely alter images — they leverage MLLMs (e.g., Qwen2-VL) to dynamically generate semantically consistent, contextually plausible yet factually false textual narratives conditioned on manipulated images. This "Coherence Trap" renders traditional detection methods — which rely on visual-textual inconsistency — entirely ineffective.
### Two Fundamental Limitations of Existing Methods
Underestimation of MLLM-driven deception: Mainstream methods such as DGM⁴ and HAMMER target rule-based text manipulation (e.g., simple entity substitution) and are defenseless against fluent, context-adapted false narratives generated by MLLMs. Their core assumption — that detectable semantic inconsistency exists between image and text — no longer holds under semantic-aligned manipulation.
Unrealistic misalignment artifacts: In existing datasets (e.g., DGM⁴), image and text manipulations are performed independently, producing semantically incoherent samples that are readily identifiable by the public without any detection model. Real-world attackers carefully maintain visual-textual consistency to maximize deceptive impact.
### Root Cause of Contrastive Learning Failure
In MDSM scenarios, the manipulated image and the MLLM-generated text are inherently well-matched. Consequently, contrastive learning-based detection paradigms — as employed by ASAP and HAMMER — cannot extract meaningful signals from image-text alignment. Models must instead rely on external knowledge and artifact traces (e.g., unnatural textures from face swapping, statistical patterns in MLLM-generated text) to make judgments.
## Method
### Overall Architecture: AMD (Artifact-aware Manipulation Diagnosis)
AMD is built upon Florence-2 and adopts a sequence-to-sequence architecture that unifies detection and localization as text generation. The framework consists of three stages:
- Multi-modal Input Embedding: Image, text, and learnable Artifact Tokens are concatenated into a unified input sequence.
- Artifact Pre-perception Encoding (APE): A shallow encoder extracts manipulation artifact cues and injects them into the Artifact Tokens.
- Manipulation-Oriented Reasoning (MOR): A deep encoder-decoder performs detection reasoning, generating text output containing verdicts and coordinates.
### Key Design 1: Artifact Token Embedding
Learnable Artifact Tokens \(E_a \in \mathbb{R}^{n_a \times d}\) are introduced and concatenated with image embeddings \(E_v\) and text embeddings \(E_t\) to form the input sequence \(S_{inp} = [E_v; E_a; E_t]\). The Artifact Tokens serve as "artifact containers" that progressively accumulate manipulation-related pattern information during training, compensating for the absence of inconsistency signals in semantic-aligned scenarios.
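The concatenation step admits a minimal NumPy sketch; the embedding dimension and token counts below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768                        # embedding dimension (assumed)
n_v, n_a, n_t = 576, 16, 64    # image patches, artifact tokens, text tokens (assumed)

E_v = rng.standard_normal((n_v, d))         # image embeddings from the vision encoder
E_t = rng.standard_normal((n_t, d))         # text embeddings
E_a = 0.02 * rng.standard_normal((n_a, d))  # learnable Artifact Tokens (a trainable parameter)

# S_inp = [E_v; E_a; E_t]: the artifact tokens sit between the two modalities
S_inp = np.concatenate([E_v, E_a, E_t], axis=0)
print(S_inp.shape)  # (656, 768)
```

At training time only `E_a` (together with the rest of the network) is updated; the sketch just makes the sequence layout concrete.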
### Key Design 2: Artifact Pre-perception Encoding (APE)
After the input sequence passes through the pre-perception encoder \(\mathcal{E}_m^p\), the updated \(\hat{E}_a\) is extracted and a global artifact representation \(\bar{E}_a\) is obtained via weighted pooling, schematically \(\bar{E}_a = \sum_i \alpha_i \hat{E}_{a,i}\) with learned pooling weights \(\alpha_i\).
A binary classifier then determines whether manipulation artifacts are present. Key strategies:
- Freezing encoder parameters: When optimizing the classification loss \(\mathcal{L}_{APE}\), \(\mathcal{E}_m^p\) is frozen so that more artifact cues accumulate into the Artifact Tokens while the MLLM's original world knowledge is preserved.
- Replacing input embeddings: The image and text embeddings in \(\hat{S}\) are replaced with the original \(E_v, E_t\), retaining only the enhanced \(\hat{E}_a\), forming \(S_a = [E_v; \hat{E}_a; E_t]\).
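One APE pass can be sketched as follows (NumPy; `encoder`, the pooling vector `w`, and the classifier weights `clf_W`/`clf_b` are hypothetical stand-ins, not the paper's actual modules):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ape_step(S_inp, n_v, n_a, w, clf_W, clf_b, encoder):
    """One Artifact Pre-perception Encoding pass (illustrative sketch).
    `encoder` stands in for the frozen shallow encoder E_m^p."""
    S_hat = encoder(S_inp)                        # encoder is frozen w.r.t. L_APE
    _, E_a_hat, _ = np.split(S_hat, [n_v, n_v + n_a], axis=0)
    # weighted pooling of the updated artifact tokens into a global vector
    alpha = softmax(E_a_hat @ w)                  # (n_a,) pooling weights
    E_a_bar = alpha @ E_a_hat                     # (d,) global artifact representation
    logits = E_a_bar @ clf_W + clf_b              # binary "artifacts present?" classifier
    # replace image/text slots with the ORIGINAL embeddings; keep only enhanced E_a
    E_v, _, E_t = np.split(S_inp, [n_v, n_v + n_a], axis=0)
    S_a = np.concatenate([E_v, E_a_hat, E_t], axis=0)
    return logits, S_a
```

The point the sketch encodes is the embedding replacement: `S_a` reuses the original \(E_v, E_t\), so only the artifact tokens carry information forward from the pre-perception pass.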
### Key Design 3: Manipulation-Oriented Reasoning (MOR)
MOR incorporates two auxiliary tasks to guide reasoning:
Visual Artifact Capture via Grounding (VAA): The Artifact Tokens \(\hat{E}_a^m\) are aggregated via attention pooling into a query vector \(q_a\), which then aggregates spatial manipulation cues from image features \(\hat{E}_v^m\) via cross-attention. The result is fed into a bounding box detector to generate manipulation region coordinates. The localization loss is \(\mathcal{L}_{IMG} = \mathcal{L}_1 + \mathcal{L}_{IoU}\).
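A shape-level sketch of this grounding head (NumPy; all projection weights are hypothetical, and the real model presumably uses multi-head attention rather than this single-query form):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ground_artifacts(E_a, E_v, w_pool, W_box, b_box):
    """VAA sketch: artifact tokens -> pooled query q_a -> cross-attention
    over image features -> bounding-box regression (weights illustrative)."""
    d = E_v.shape[1]
    # attention-pool the artifact tokens into a single query vector q_a
    alpha = softmax(E_a @ w_pool)                 # (n_a,) pooling weights
    q_a = alpha @ E_a                             # (d,)
    # q_a aggregates spatial manipulation cues from the image features
    attn = softmax(q_a @ E_v.T / np.sqrt(d))      # (n_v,) attention over patches
    ctx = attn @ E_v                              # (d,) aggregated cues
    # bbox head: predict a normalized (cx, cy, w, h) box in [0, 1]
    box = 1.0 / (1.0 + np.exp(-(ctx @ W_box + b_box)))
    return box
```

Training would supervise `box` against the ground-truth manipulated region via \(\mathcal{L}_{IMG} = \mathcal{L}_1 + \mathcal{L}_{IoU}\).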
Dual-Branch Manipulation Guidance (DBM): Image+artifact features and text features each serve as the query in cross-attention over the other modality, forming two parallel classification branches. Each branch independently classifies manipulation, enhancing the model's sensitivity to forged media.
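The dual-branch structure can be sketched as two symmetric cross-attention passes followed by independent classifiers (NumPy; single-head attention and mean pooling are my simplifying assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_branch_logits(E_va, E_t, W_img, W_txt):
    """DBM sketch: each modality queries the other, then classifies independently."""
    d = E_t.shape[1]
    # branch 1: image+artifact features query the text features
    h_img = (softmax(E_va @ E_t.T / np.sqrt(d), axis=-1) @ E_t).mean(axis=0)
    # branch 2: text features query the image+artifact features
    h_txt = (softmax(E_t @ E_va.T / np.sqrt(d), axis=-1) @ E_va).mean(axis=0)
    # two independent manipulation classifiers (weights illustrative)
    return h_img @ W_img, h_txt @ W_txt
```

Because the two branches are supervised separately, neither can hide behind the other's signal, which is the stated motivation for the dual-branch design.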
### Key Design 4: Token Redundancy Penalty (TRP)
To prevent redundant or repetitive representations within Artifact Tokens, two regularization terms are designed:
- Orthogonality constraint \(\mathcal{L}_{orth}\): Based on the Gram matrix, penalizes non-orthogonality among column vectors of \(E_a\) (encouraging different tokens to encode distinct information).
- Distribution modulation \(\mathcal{L}_{mod}\): Uses KL divergence to drive each token's energy distribution toward a uniform distribution, avoiding information loss due to checkerboard patterns.
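The two penalties admit a direct sketch (NumPy; the exact normalization and the definition of the per-token "energy distribution" are my assumptions, not necessarily the paper's formulation):

```python
import numpy as np

def trp_losses(E_a, eps=1e-8):
    """Token Redundancy Penalty sketch: Gram-matrix orthogonality
    plus KL-to-uniform distribution modulation."""
    n_a, d = E_a.shape
    # L_orth: penalize off-diagonal Gram entries of the L2-normalized tokens
    E_n = E_a / (np.linalg.norm(E_a, axis=1, keepdims=True) + eps)
    G = E_n @ E_n.T                                # (n_a, n_a) Gram matrix
    off = G - np.diag(np.diag(G))
    L_orth = (off ** 2).sum() / (n_a * (n_a - 1))
    # L_mod: KL(p_i || uniform) of each token's squared-magnitude "energy" profile
    p = E_a ** 2 / ((E_a ** 2).sum(axis=1, keepdims=True) + eps)
    L_mod = (p * np.log(p * d + eps)).sum(axis=1).mean()  # KL vs. uniform 1/d
    return L_orth, L_mod
```

Mutually orthogonal tokens drive `L_orth` to zero, while tokens whose energy is spread evenly across dimensions drive `L_mod` to zero, matching the two stated goals.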
### Loss & Training
The total loss is the sum of five terms, schematically \(\mathcal{L} = \mathcal{L}_{LM} + \mathcal{L}_{APE} + \mathcal{L}_{IMG} + \mathcal{L}_{DBM} + \mathcal{L}_{TRP}\) with \(\mathcal{L}_{TRP} = \mathcal{L}_{orth} + \mathcal{L}_{mod}\) (per-term weights, if any, are omitted in this summary).
At inference, all auxiliary heads (APE, DBM, IMG, TRP) are discarded and only the language modeling output is retained, making inference highly efficient. The model outputs detection results (real/fake verdict, manipulation type, coordinates) as plain text via a heuristic QA prompt.
## Key Experimental Results
### MDSM Dataset Statistics
- Total scale: 441,423 samples across 5 news domains (NYT, Guardian, USA Today, Washington Post, BBC)
- Manipulation types: Face Swap (FS), Face Attribute (FA), Text Fabrication (TF), FS&TF, FA&TF
- Comparison with DGM⁴: MDSM is the first multimodal manipulation detection benchmark to simultaneously feature MLLM involvement, semantic alignment, large scale, and multi-source domains.
### Main Results: MDSM Cross-Domain Detection (Table 2)
| Method | Train Domain | Params | AVG ACC | AVG mAP | AVG mIoU |
|---|---|---|---|---|---|
| Qwen2.5-VL-72B (zero-shot) | — | 72B | 33.72 | 33.47 | 0.06 |
| GPT-4o (zero-shot) | — | — | 33.92 | 33.33 | 1.17 |
| Gemini-2.0 (zero-shot) | — | — | 38.83 | 32.03 | 1.72 |
| ViLT | Guardian | 121M | 76.61 | 49.90 | 35.67 |
| HAMMER | Guardian | 441M | 74.32 | 48.33 | 43.23 |
| HAMMER++ | Guardian | 441M | 75.10 | 49.01 | 48.49 |
| FKA-Owl | Guardian | 6,771M | 84.12 | 58.13 | 52.20 |
| AMD (Ours) | Guardian | 277M | 88.18 | 60.25 | 61.02 |
Key finding: with only 277M parameters, AMD surpasses the 6.8B-parameter FKA-Owl (ACC +4.06, mAP +2.12, mIoU +8.82), while zero-shot large models fail almost completely on this task (mIoU near 0).
### DGM⁴ Cross-Domain Detection (Table 3)
| Method | AVG ACC | AVG mAP | AVG P_tok | AVG mIoU |
|---|---|---|---|---|
| HAMMER | 65.45 | 47.10 | 77.41 | 45.97 |
| HAMMER++ | 65.61 | 47.36 | 77.34 | 46.19 |
| FKA-Owl | 71.96 | 42.68 | 83.31 | 44.15 |
| AMD (Ours) | 74.47 | 52.91 | 80.01 | 51.87 |
AMD also achieves the best overall performance on the conventional DGM⁴ dataset, demonstrating that the framework generalizes not only to the new MDSM scenario but also to traditional manipulation settings.
### Ablation Study (Table 4a)
| LM | APE | IMG | DBM | TRP | NYT ACC | NYT mAP | NYT mIoU |
|---|---|---|---|---|---|---|---|
| ✓ | | | | | 76.92 | 46.38 | 58.77 |
| ✓ | ✓ | | | | 82.93 | 47.12 | 60.13 |
| ✓ | ✓ | ✓ | | | 82.97 | 47.18 | 61.78 |
| ✓ | ✓ | ✓ | ✓ | | 83.42 | 66.47 | 62.14 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 83.96 | 69.39 | 63.56 |
- APE contributes most: ACC improves from 76.92 to 82.93 (+6.01), confirming that artifact pre-perception is critical for MLLM adaptation.
- DBM yields the largest mAP gain: 47.18 → 66.47 (+19.29); dual-branch guidance substantially enhances manipulation type discrimination.
- TRP provides consistent gains: Small but consistent improvements across all metrics, validating the effectiveness of token redundancy reduction.
### Key Findings
- Text-modality manipulation is harder to detect: In-domain AP is 88.45 for FA versus 79.84 for TF, and cross-domain AP is 71.37 for FA versus 57.53 for TF. This indicates that MLLM-generated narratives are more deceptive and underscores the difficulty of the MDSM benchmark.
- Cross-MLLM generalization: AMD trained on NYT and tested on narratives generated by Qwen-VL, X-InstructBLIP, LLaVA, and mPLUG-Owl achieves in-domain AP >76 and cross-domain AP >53, indicating that AMD does not overfit to generation patterns of any specific MLLM.
- Efficiency advantage: AMD has only 277M parameters with an inference throughput of 13.38 pairs/s, far outperforming FKA-Owl's 6,771M parameters / 1.33 pairs/s.
## Highlights & Insights
- Forward-looking problem formulation: This work is the first to explicitly define "MLLM-driven semantically aligned multimodal manipulation" as a new threat scenario. Traditional methods assume that image-text inconsistency can be captured by contrastive learning, but this assumption completely breaks down when attackers deliberately maintain consistency — a long-overlooked yet highly practically relevant gap.
- Elegant Artifact Token design: Rather than directly modifying pre-trained MLLM parameters, the method introduces learnable plug-in tokens to accumulate artifact information, simultaneously preserving world knowledge and injecting domain-specific capability. The strategy of freezing the encoder while replacing embeddings is an elegant knowledge-preservation mechanism.
- Advantages of unified text output: Unifying detection (real/fake), classification (manipulation type), and localization (bbox coordinates) into text output is simpler, more general, and more extensible than multi-head architectures such as HAMMER. Discarding auxiliary heads at inference also avoids train-inference discrepancy.
- Noteworthy dataset construction pipeline: The approach — first manipulating images, then feeding manipulation metadata (e.g., swapped-in identity names) to an MLLM to generate aligned text — can be viewed as a general paradigm for adversarial data augmentation applicable to any scenario requiring semantically consistent attacks. The idea of learnable plug-in tokens combined with frozen pre-trained parameters is also worth exploring in other domain adaptation settings.
## Limitations & Future Work
- Focus limited to face manipulation: The current MDSM dataset covers only face swapping and facial attribute editing, excluding broader manipulation types such as scene editing (e.g., background replacement, object removal) and full-image synthesis. Extending to non-face-centric manipulation is an important future direction.
- Coarse text detection granularity: Although text manipulation is annotated at the sample level, word-level or sentence-level fine-grained annotations are absent (unlike DGM⁴'s fake token grounding), limiting precise localization of specific fabricated content within MLLM-generated text.
- Evaluation confined to news domain: All experiments are conducted on news data; generalization to informal text environments such as social media, forums, and instant messaging remains unverified.
- Backbone model choice: AMD is based on Florence-2 (0.27B); adopting a larger MLLM backbone may further improve performance but requires re-examining the efficiency-effectiveness trade-off.
- Adversarial robustness not explored: Attackers may design adaptive attacks targeting AMD's Artifact Token mechanism; robustness analysis in this regard is absent.
## Related Work & Insights
- DGM⁴ / HAMMER: Representative works in multimodal manipulation detection, but their assumption of image-text inconsistency leads to significant performance degradation in MDSM scenarios.
- FKA-Owl: An MLLM-based detection method (6.8B parameters) that approaches AMD on some metrics but requires 24× more parameters, highlighting the importance of lightweight design.
- Florence-2: The backbone of AMD, providing strong vision-language pre-training knowledge and a unified seq2seq architecture.
- Broader implications: For any scenario requiring "using MLLMs to detect MLLM-generated content" (e.g., AI-generated text detection, synthetic image detection), the Artifact Token + knowledge-preservation strategy proposed here offers a reusable design paradigm.
## Rating
| Dimension | Score (1–10) | Notes |
|---|---|---|
| Problem Importance | 9 | MLLM-driven semantic consistency manipulation is a real and overlooked threat |
| Novelty | 8 | Artifact Token + APE + MOR + TRP combination is elegantly designed |
| Experimental Thoroughness | 8 | Cross-domain, cross-MLLM, ablation, and efficiency comparisons are comprehensive |
| Dataset Contribution | 9 | 441k large-scale semantically aligned multimodal manipulation benchmark fills a critical gap |
| Writing Quality | 8 | Motivation is clearly articulated; figures and tables are professional |
| Overall | 8.4 | Precise problem definition with dual contributions in dataset and method; a significant advancement in the field |