ASAP: Advancing Semantic Alignment for Multi-Modal Manipulation Detection

Conference: CVPR 2025
Institution: Hefei University of Technology
Keywords: Multimodal Manipulation Detection, Semantic Alignment, LLM-Assisted, Cross-Attention

Background & Motivation

With the rapid development of AI-generated content (AIGC) technologies, multimodal misinformation has become a severe societal challenge. Unlike traditional single-modal manipulation (e.g., only image editing or only text fabrication), modern misinformation often involves joint manipulation of both images and texts:

Image Manipulation: Modifying images with techniques such as deepfakes and inpainting

Text Manipulation: Fabricating or modifying the descriptive text that accompanies the image

Cross-Modal Inconsistency: Pairing authentic images with false text, or manipulated images with text crafted to make them seem plausible

The DGM4 (Detecting and Grounding Multi-Modal Media Manipulation) task requires models not only to determine whether an image-text pair has been manipulated but also to ground the manipulated regions (Which regions in the image? Which words/phrases in the text?).
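The two-part output described above (a detection decision plus grounding in both modalities) can be pictured as a small record. The following is only an illustrative sketch: the class name `ManipulationPrediction`, its fields, and the box format are hypothetical and are not taken from the paper or the DGM4 codebase.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ManipulationPrediction:
    """Illustrative container for one image-text pair (names are assumptions)."""
    is_manipulated: bool                                    # detection decision
    image_boxes: List[Box] = field(default_factory=list)    # grounded image regions
    token_flags: List[bool] = field(default_factory=list)   # one flag per text token

    def manipulated_tokens(self, tokens: List[str]) -> List[str]:
        """Return the tokens that are flagged as manipulated."""
        return [tok for tok, flag in zip(tokens, self.token_flags) if flag]


pred = ManipulationPrediction(
    is_manipulated=True,
    image_boxes=[(40.0, 32.0, 120.0, 110.0)],
    token_flags=[False, False, True, False],
)
flagged = pred.manipulated_tokens(["The", "president", "celebrates", "today"])
print(flagged)
```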

The main issue with current methods lies in insufficient visual-linguistic semantic alignment:

- Pre-trained models like CLIP learn coarse-grained image-text matching, failing to capture subtle manipulation clues.
- Fine-grained correspondences between image patches and text tokens are not fully utilized.
- There is no explicit manipulation guidance mechanism, leaving the model unaware of "what to focus on."

The core motivation of ASAP is to enhance manipulation detection and grounding capabilities through superior semantic alignment, utilizing the knowledge of large language models to assist in understanding "what constitutes an authentic image-text relationship."

Method

Overall Architecture

ASAP comprises three core modules: LMA (Large Model-Assisted Alignment), MGCA (Manipulation-Guided Cross-Attention), and PMM (Patch Manipulation Modeling), each addressing alignment problems at different levels.

Module 1: LMA - Large Model-Assisted Alignment

Design Motivation: The text encoder of CLIP performs well on short descriptions but lacks sufficient understanding of complex semantic relations. In contrast, large language models possess stronger reasoning and description capabilities.

Workflow:

1. MLLM Description Generation: A multimodal large language model (such as GPT-4V) generates detailed descriptive text for each image.
2. LLM Explanation Generation: An LLM analyzes the discrepancy between the original text and the MLLM description and generates explanatory text.
3. VLC Contrastive Loss: Multi-way contrastive learning is conducted between the three types of text (original text, MLLM description, LLM explanation) and the image.

\[\mathcal{L}_{\text{VLC}} = -\log \frac{\exp(\text{sim}(v, t^+) / \tau)}{\sum_j \exp(\text{sim}(v, t_j) / \tau)}\]

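Here \(t^+\) is a positive text for image \(v\), \(t_j\) ranges over all candidate texts, and \(\tau\) is the temperature. Positive pairs include matching image-text pairs and the image paired with its MLLM description, while negative samples include mismatched texts and manipulated samples.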
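As a minimal sketch of this contrastive objective for a single image, the snippet below evaluates the formula with cosine similarity standing in for \(\text{sim}\). The function names, the toy two-dimensional features, and the default temperature of 0.07 are assumptions made for illustration, not values reported by the paper.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, used here as sim(v, t); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def vlc_loss(image, texts, positive_index, tau=0.07):
    """-log( exp(sim(v, t+)/tau) / sum_j exp(sim(v, t_j)/tau) ) for one image."""
    logits = [cosine(image, t) / tau for t in texts]
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[positive_index]


image = [1.0, 0.0]
texts = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]  # index 0 plays the positive text
loss_aligned = vlc_loss(image, texts, positive_index=0)
loss_misaligned = vlc_loss(image, texts, positive_index=1)
print(loss_aligned < loss_misaligned)
```

The loss is small when the positive text is the one closest to the image and grows when a dissimilar text is treated as the positive, which is the pressure that pulls matched pairs together.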

Key Insight: The explanation texts generated by the LLM provide reasoning clues regarding "why this image-text pair is inconsistent," assisting the model in learning deeper semantic alignment.

Module 2: MGCA - Manipulation-Guided Cross-Attention

Design Motivation: Standard cross-attention treats all patches and tokens equally. However, manipulated regions typically occupy only a small portion, requiring guidance to focus the attention.

Design:

- Introducing a guidance mask \(G \in \{0, 1\}^{N_v \times N_t}\) to mark suspected manipulated image-text corresponding areas.
- During cross-attention computation, the guidance mask modulates the attention weights:

\[\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}} + \lambda \cdot G\right) V\]
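The effect of the additive mask can be seen in a small pure-Python sketch of the attention formula. It assumes that queries come from image patches (\(N_v\) rows) and that keys and values come from text tokens (\(N_t\) rows); the function names and toy inputs are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(Q, K, V, G, lam=1.0):
    """Return (outputs, weights) for softmax(QK^T / sqrt(d) + lam * G) V."""
    d = len(Q[0])
    outputs, weights = [], []
    for q, g_row in zip(Q, G):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + lam * g
            for k, g in zip(K, g_row)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[c] for wi, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


Q = [[0.0, 0.0]]              # one patch query; a zero vector gives uniform raw scores
K = [[1.0, 0.0], [0.0, 1.0]]  # two text-token keys
V = [[1.0, 0.0], [0.0, 1.0]]  # two text-token values
G = [[1, 0]]                  # the mask flags the (patch 0, token 0) pair

_, plain = guided_attention(Q, K, V, [[0, 0]])
_, guided = guided_attention(Q, K, V, G, lam=2.0)
print(plain[0], guided[0])
```

With an all-zero mask the two tokens share the attention equally; flagging one pair shifts most of the weight onto it, and \(\lambda\) controls how strongly.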

Component	Input	Output	Function
Visual Encoder	Image patches	Visual features \(V\)	Extract image region features
Text Encoder	Text tokens	Text features \(T\)	Extract text semantic features
Guidance Mask Generator	\(V, T\)	Guidance mask \(G\)	Localize suspected manipulated regions
MGCA Layer	\(V, T, G\)	Enhanced features \(V', T'\)	Manipulation-aware cross-modal fusion

The guidance mask is computed from mismatches between shallow visual and text features and is progressively refined in deeper layers of the network.
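The source does not spell out how the mismatch is turned into a binary mask, so the following is only one plausible sketch: it assumes patch and token features live in a shared space, scores each pair by cosine similarity, and flags pairs that fall below a threshold. The function `guidance_mask` and the threshold value of 0.5 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def guidance_mask(patches, tokens, threshold=0.5):
    """G[i][j] = 1 when patch i and token j look mismatched (similarity below threshold)."""
    return [
        [1 if cosine(p, t) < threshold else 0 for t in tokens]
        for p in patches
    ]


patches = [[1.0, 0.0], [0.0, 1.0]]  # toy shallow patch features
tokens = [[1.0, 0.0]]               # toy shallow token feature
mask = guidance_mask(patches, tokens)
print(mask)
```

In a layered model, the same computation could be repeated on the features of each deeper layer to refine the mask, which is one way to read the progressive refinement described above.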

Module 3: PMM - Patch Manipulation Modeling

Design Motivation: Manipulation detection requires not only global judgment but also patch-level localization capability.

Workflow:

1. Hard Negative Construction: Semantically similar patches from other sources in the training data are swapped in, producing manipulated samples that are hard to distinguish from authentic ones.
2. Patch-level Classification: A binary label ("authentic/manipulated") is predicted for each image patch.
3. Contrastive Enhancement: The distance between authentic patches within the same image is minimized, while the distance between authentic and manipulated patches is maximized.

\[\mathcal{L}_{\text{PMM}} = \text{BCE}(p_{\text{patch}}, y_{\text{patch}}) + \lambda \cdot \mathcal{L}_{\text{contrast}}\]
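A rough sketch of this loss is given below. The BCE term follows the formula directly; the exact form of the contrastive term is not given here, so the snippet assumes a simple margin-based version (pull authentic patches together, push authentic and manipulated patches at least `margin` apart). All function names and default values are illustrative assumptions.

```python
import math


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over patches (label 1 = manipulated)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(values):
    return sum(values) / len(values) if values else 0.0


def contrast_term(features, labels, margin=1.0):
    """Assumed margin-based form: small authentic-authentic distances,
    authentic-manipulated distances of at least `margin`."""
    auth = [f for f, y in zip(features, labels) if y == 0]
    manip = [f for f, y in zip(features, labels) if y == 1]
    pull = [l2(a, b) for i, a in enumerate(auth) for b in auth[i + 1:]]
    push = [max(0.0, margin - l2(a, m)) for a in auth for m in manip]
    return mean(pull) + mean(push)


def pmm_loss(probs, labels, features, lam=0.5, margin=1.0):
    """BCE(p_patch, y_patch) + lam * L_contrast."""
    return bce(probs, labels) + lam * contrast_term(features, labels, margin)


far = pmm_loss([0.5, 0.5], [0, 1], [[0.0, 0.0], [3.0, 4.0]])
near = pmm_loss([0.5, 0.5], [0, 1], [[0.0, 0.0], [0.3, 0.4]])
print(far < near)
```

When the manipulated patch is already far from the authentic one, only the BCE term contributes; when the two are close in feature space, the contrastive term adds a penalty.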

Hard Negative Selection Strategy: Replacement patches are chosen as those most similar in feature space to the current patch, rather than at random. This forces the model to learn more subtle manipulation clues.
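The selection step can be sketched as a nearest-neighbour lookup. The snippet assumes cosine similarity as the feature-similarity measure and a pool of donor patches from other images; `hardest_negative` and `replace_patch` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def hardest_negative(patch, donors):
    """Index of the donor patch most similar to `patch`."""
    return max(range(len(donors)), key=lambda i: cosine(patch, donors[i]))


def replace_patch(image_patches, index, donors):
    """Swap one patch for its most similar donor; return new patches and patch labels."""
    j = hardest_negative(image_patches[index], donors)
    manipulated = list(image_patches)
    manipulated[index] = donors[j]
    labels = [0] * len(image_patches)
    labels[index] = 1  # 1 marks the replaced (manipulated) patch
    return manipulated, labels


image_patches = [[1.0, 0.0], [0.0, 1.0]]
donors = [[0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]]  # patch features from other images
manipulated, labels = replace_patch(image_patches, 0, donors)
print(manipulated, labels)
```

A random choice would often pick an obviously different patch such as `[-1.0, 0.0]`; picking the closest donor yields a replacement that is much harder to spot.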

Loss & Training

\[\mathcal{L} = \mathcal{L}_{\text{cls}} + \alpha \mathcal{L}_{\text{VLC}} + \beta \mathcal{L}_{\text{grounding}} + \gamma \mathcal{L}_{\text{PMM}}\]

where \(\mathcal{L}_{\text{cls}}\) is the global manipulation classification loss, and \(\mathcal{L}_{\text{grounding}}\) is the pixel/token-level grounding loss.
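To make the relative weighting explicit, the PMM loss from Module 3 can be substituted into the overall objective. This is only an algebraic expansion of the formulas already given; it assumes that \(\lambda\) here is the PMM weight, not the \(\lambda\) that scales the guidance mask in MGCA.

\[\mathcal{L} = \mathcal{L}_{\text{cls}} + \alpha \mathcal{L}_{\text{VLC}} + \beta \mathcal{L}_{\text{grounding}} + \gamma\, \text{BCE}(p_{\text{patch}}, y_{\text{patch}}) + \gamma \lambda\, \mathcal{L}_{\text{contrast}}\]

The patch-level contrastive term therefore enters the total loss with an effective weight of \(\gamma \lambda\).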

Key Experimental Results

Main Results on DGM4 Dataset

Method	AUC	mAP	Image F1	Text F1
HAMMER	91.53	83.45	72.34	67.89
DGM4-baseline	92.15	85.23	74.56	70.12
MMFED	93.19	86.22	76.12	71.35
ASAP	94.38	88.53	78.34	76.52
Gain vs MMFED	+1.19	+2.31	+2.22	+5.17

The +5.17 gain in Text F1 is the largest improvement over MMFED among the four metrics, which points to the benefit of the LMA module for text manipulation grounding.

Ablation Study

Configuration	AUC	Text F1
Baseline	92.15	70.12
+ LMA	94.28	74.89
+ LMA + MGCA	94.34	75.67
+ LMA + MGCA + PMM	94.38	76.52

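LMA is the most critical module: adding it alone raises AUC by 2.13 and Text F1 by 4.77. MGCA and PMM barely change AUC (+0.06 and +0.04) but further improve text grounding, adding +0.78 and +0.85 Text F1 respectively.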

Cross-Dataset Generalization

Training Set	Test Set	AUC
DGM4	NewsCLIPpings	84.56
DGM4	COSMOS	81.23
DGM4	VERITE	79.87

When trained only on DGM4, ASAP reaches AUCs of 84.56, 81.23, and 79.87 on NewsCLIPpings, COSMOS, and VERITE, respectively. These results suggest that it learns manipulation cues that transfer across datasets rather than overfitting to DGM4.

Highlights & Insights

LLM-Assisted LMA: Introducing MLLM description and LLM reasoning into alignment learning for manipulation detection for the first time.
MGCA-Guided Attention: Directing cross-attention to focus on manipulation-related regions using a guidance mask.
PMM Hard Negative Strategy: Feature similarity-based hard negative selection enhances patch-level detection accuracy.

Limitations & Future Work

The LMA stage relies on external large models like GPT-4V, which increases inference cost (though it only requires one-time pre-computation).
In scenarios with short texts (e.g., Twitter captions), the performance of text manipulation grounding may degrade.
Performance is sensitive to the choice of the guidance mask threshold.

Conclusion

ASAP systematically enhances multimodal manipulation detection through a three-level semantic alignment mechanism (global -> region -> patch). The LLM-assisted alignment strategy is its most prominent highlight: it leverages the reasoning capabilities of LLMs to model "image-text consistency", thereby providing richer semantic anchors for manipulation detection. Its gains over prior methods on all four reported DGM4 metrics demonstrate the effectiveness of the proposed method.