OpenSDI: Spotting Diffusion-Generated Images in the Open World

Conference: CVPR 2025
arXiv: 2503.19653
Code: https://github.com/iamwangyabin/OpenSDI
Area: Diffusion Models / AI Safety
Keywords: AI-Generated Image Detection, Diffusion Image Localization, Open World, Foundation Model Collaboration, CLIP+MAE

TL;DR

OpenSDI defines the open-world diffusion image detection challenge, constructs a large-scale dataset OpenSDID containing multi-VLM-generated instructions and various diffusion models, and proposes MaskCLIP—which synergizes CLIP and MAE through a Synergizing Pretrained Models (SPM) framework, significantly outperforming existing methods on both detection and localization tasks.

Background & Motivation

Background: With the popularization of advanced diffusion models such as Stable Diffusion and FLUX, the realism of AI-generated content continues to improve, making the differentiation of real and generated images a crucial challenge. Existing detection methods primarily target traditional manipulations (such as splicing, copy-move) or specific generators.

Limitations of Prior Work: Existing methods and datasets fail to address three core dimensions of open-world scenarios: (1) user diversity: styles and intents vary widely across users; (2) model innovation: diffusion models iterate rapidly, with new models constantly emerging; (3) manipulation spectrum: the full range from global synthesis to local editing. Existing datasets usually cover only one or two of these dimensions.

Key Challenge: Detection and localization are heterogeneous tasks—detection requires image-level semantic judgment, while localization requires pixel-level precise segmentation. Existing methods typically excel at only one of these. Furthermore, overfitting to specific generators severely restricts generalization capabilities.

Goal: Define the OpenSDI challenge, construct a comprehensive benchmark dataset, and propose a unified generalization solution for both detection and localization.

Key Insight: Leverage large-scale VLMs to simulate real user behavior to generate diverse manipulation instructions, and maintain generalization capabilities by synergizing multiple pretrained foundation models instead of training them in isolation.

Core Idea: Adopt a "Synergizing Pretrained Models" (SPM) strategy that uses prompting and attending mechanisms to coordinate CLIP (strong at semantic judgment) and MAE (strong at spatial reconstruction), performing both detection and localization while preserving each model's pretrained generalization.

Method

Overall Architecture

MaskCLIP consists of three core components: (1) a CLIP vision encoder to extract semantic features, (2) a CLIP text encoder to provide "real/fake" class embeddings, and (3) an MAE encoder to extract spatial reconstruction features. Synergy is achieved through three attention modules: VCA, TVCA, and VSA. Pixel-level predictions are generated based on an FPN decoder. The detection output is derived from the cosine similarity between the CLIP global feature and the text embeddings, while the localization output is obtained from the segmentation map generated by MAE+TVCA.

Key Designs

Prompt-Tuning Preserving CLIP Generalization:
- Function: Learn continuous prompt vectors for "real/fake" concepts to avoid modifying CLIP pretrained weights.
- Mechanism: Learn a pair of continuous prompts \(\mathbf{V}_c \in \mathbb{R}^{M \times D}\) (\(c \in \{\text{real}, \text{fake}\}\)), which are passed through the text encoder to generate class embeddings \(\mathbf{t}_{\text{real}}\), \(\mathbf{t}_{\text{fake}}\). This is analogous to a hand-crafted prompt such as "a photo of a [real/fake]", but the vectors are optimized directly so they can capture the semantics more effectively. The CLIP parameters stay fully frozen.
- Design Motivation: Global fine-tuning would destroy CLIP's broad vision-language knowledge, whereas prompt-tuning preserves generalization with minimal parameters.
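
The prompt-tuning idea above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frozen text encoder is replaced by a hypothetical stand-in (mean pooling plus a fixed random projection), the sizes `M` and `D` are toy values, and the temperature of 0.07 is an assumed constant. Only the `prompts` would receive gradients in training; everything else stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 4, 8  # prompt length and embedding width (toy values, assumed)

# Learnable prompt vectors V_c, one (M, D) matrix per class.
prompts = {c: rng.normal(size=(M, D)) for c in ("real", "fake")}

# Hypothetical stand-in for the frozen CLIP text encoder weights.
W_frozen = rng.normal(size=(D, D))

def encode_text(prompt_tokens):
    """Toy frozen encoder: mean-pool the tokens, project, L2-normalise."""
    t = prompt_tokens.mean(axis=0) @ W_frozen
    return t / np.linalg.norm(t)

def classify(g, temperature=0.07):
    """Return [P(real), P(fake)] from cosine similarity with the class embeddings."""
    g = g / np.linalg.norm(g)
    sims = np.array([g @ encode_text(prompts[c]) for c in ("real", "fake")])
    logits = sims / temperature
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

g = rng.normal(size=D)  # stand-in for a global image feature
probs = classify(g)
print(probs)
```

Because the score is a cosine similarity, rescaling the image feature leaves the prediction unchanged; only its direction relative to the two class embeddings matters.
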

Visual Cross-Attention (VCA) Aligning CLIP and MAE:
- Function: Inject CLIP's semantic understanding into MAE's spatial features.
- Mechanism: Deployed across multiple layers. Bilinear interpolation and a 1x1 convolution resize and project the CLIP patch tokens to match the MAE features, and the aligned tokens act as the queries. MAE tokens act as keys/values to perform cross-attention \(\mathbf{G}^l = \text{Softmax}(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}})\mathbf{V}\). The result updates MAE features via residual connections \(V_m^{l+1} = V_m^l + \mathbf{G}^l\).
- Design Motivation: CLIP is skilled at extracting "fake or not" semantic judgments, while MAE excels at spatial reconstruction to find local anomalies. VCA allows their strengths to complement each other.
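
A single VCA step, as described by the two formulas above, can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the CLIP tokens are assumed to be already resampled to the MAE token grid, the 1x1 convolution is modelled as a per-token linear map `W_align` (a hypothetical name), and the learned query/key/value projections and multi-head structure of a real attention layer are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vca_layer(clip_tokens, mae_tokens, W_align):
    """One VCA step: aligned CLIP tokens query the MAE tokens, added residually.

    clip_tokens: (N, D_clip), assumed already resampled to the N MAE tokens
    mae_tokens:  (N, d)
    W_align:     (D_clip, d), stands in for the 1x1 convolution
    """
    d = mae_tokens.shape[-1]
    Q = clip_tokens @ W_align               # queries from CLIP
    K = V = mae_tokens                      # keys and values from MAE
    G = softmax(Q @ K.T / np.sqrt(d)) @ V   # G^l = Softmax(QK^T / sqrt(d)) V
    return mae_tokens + G                   # V_m^{l+1} = V_m^l + G^l

rng = np.random.default_rng(0)
clip = rng.normal(size=(16, 12))   # 16 tokens, CLIP width 12 (toy sizes)
mae = rng.normal(size=(16, 8))     # 16 tokens, MAE width 8
out = vca_layer(clip, mae, rng.normal(size=(12, 8)))
print(out.shape)
```

The residual form means the MAE features are only nudged by the CLIP-guided signal rather than replaced, which fits the stated goal of keeping MAE's spatial representation intact.
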

TVCA + VSA for Dual-Task Detection and Localization:
- Function: TVCA injects textual semantics into decoding features for localization; VSA aggregates multi-layer CLS tokens for detection.
- Mechanism: TVCA treats class text embeddings as queries and FPN decoded features as keys/values for cross-attention, generating segmentation logits \(M_{\text{fake}}\). VSA collects CLIP multi-layer CLS token embeddings \(V_{cls} = \{v_{cls}^l\}_{l \in L}\), yielding a global representation \(\mathbf{g}\) through self-attention + global pooling + linear projection. The detection result is computed using the cosine similarity between \(\mathbf{g}\) and text embeddings.
- Design Motivation: Different layers of CLS tokens encode varying granularities of information—lower layers focus on low-level artifacts, while higher layers focus on semantic consistency. TVCA introduces textual semantic constraints into pixel-level predictions.
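
The two heads can be sketched roughly as below. Both functions are illustrative assumptions rather than the paper's exact layers: `vsa` uses plain single-head self-attention without learned projections, mean pooling as the "global pooling", and a hypothetical output matrix `W_out`; `tvca_mask_logits` shows one plausible reading of TVCA, in which the scaled dot product between the "fake" text embedding and each decoded pixel feature is taken as the mask logit.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vsa(cls_tokens, W_out):
    """Aggregate multi-layer CLS tokens (L, D) into one global feature g."""
    D = cls_tokens.shape[-1]
    attn = softmax(cls_tokens @ cls_tokens.T / np.sqrt(D))  # (L, L) self-attention
    mixed = attn @ cls_tokens                               # (L, D)
    pooled = mixed.mean(axis=0)                             # global pooling over layers
    return pooled @ W_out                                   # linear projection

def tvca_mask_logits(t_fake, pixel_feats):
    """Score every pixel feature (H, W, d) against the 'fake' text embedding (d,)."""
    d = pixel_feats.shape[-1]
    return pixel_feats @ t_fake / np.sqrt(d)                # (H, W) logits

rng = np.random.default_rng(0)
g = vsa(rng.normal(size=(3, 8)), rng.normal(size=(8, 4)))
mask_logits = tvca_mask_logits(rng.normal(size=8), rng.normal(size=(5, 6, 8)))
print(g.shape, mask_logits.shape)
```

Applying a sigmoid to `mask_logits` would give a per-pixel probability map, while `g` feeds the cosine-similarity detection step.
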

Loss & Training

Total loss = \(\mathcal{L}_{CE}\) (detection cross-entropy) + \(\mathcal{L}_{BCE}\) (localization binary cross-entropy) + \(\mathcal{L}_{EDG}\) (edge-weighted loss), with all weights set to 1. CLIP is completely frozen; only VCA/TVCA/VSA/FPN/prompt vectors and the MAE encoder are trained.
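
The equal-weight sum can be sketched as below. The summary does not define the edge-weighted term, so this example assumes it is a binary cross-entropy restricted to pixels on the boundary of the ground-truth mask; `edge_map`, `bce`, and `total_loss` are hypothetical helper names, and the 4-neighbour boundary rule is an illustrative choice.

```python
import numpy as np

def bce(p, y, w=None, eps=1e-7):
    """Binary cross-entropy, optionally averaged with per-pixel weights w."""
    p = np.clip(p, eps, 1 - eps)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    if w is None:
        return loss.mean()
    return (loss * w).sum() / max(w.sum(), eps)

def edge_map(mask):
    """Mark pixels that have a 4-neighbour with a different label (assumed edge rule)."""
    p = np.pad(mask, 1, mode="edge")
    c = p[1:-1, 1:-1]
    diff = ((c != p[:-2, 1:-1]) | (c != p[2:, 1:-1]) |
            (c != p[1:-1, :-2]) | (c != p[1:-1, 2:]))
    return diff.astype(float)

def total_loss(cls_probs, label, mask_probs, mask):
    """L = L_CE + L_BCE + L_EDG, all weights equal to 1."""
    l_ce = -np.log(max(cls_probs[label], 1e-7))           # detection cross-entropy
    l_bce = bce(mask_probs, mask)                         # localization BCE
    l_edg = bce(mask_probs, mask, w=edge_map(mask))       # edge-weighted BCE (assumed form)
    return l_ce + l_bce + l_edg

mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1  # a 2x2 manipulated region
perfect = total_loss(np.array([0.0, 1.0]), 1, mask.copy(), mask)
bad = total_loss(np.array([0.5, 0.5]), 1, np.full((4, 4), 0.5), mask)
print(perfect, bad)
```

An uninformative prediction (0.5 everywhere) is penalized by all three terms, whereas a perfect prediction drives the total close to zero.
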

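Key Experimental Results

Main Results (OpenSDID Cross-Domain Testing)

Pixel-level localization IoU per test generator. The Average column is not the mean of the four columns shown (that mean would be about 0.395 for MaskCLIP), so it presumably also covers test data omitted from this table.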

Method	SD1.5 IoU	SDXL IoU	SD3 IoU	Flux.1 IoU	Average IoU
CAT-Net	0.664	0.255	0.356	0.050	0.374
TruFor	0.634	0.266	0.323	0.076	0.369
IML-ViT	0.665	0.215	0.236	0.061	0.325
MaskCLIP	0.671	0.310	0.438	0.162	0.427

Ablation Study

Configuration	Effect Description
W/o VCA (Simple concatenation of CLIP+MAE)	Localization IoU drops by ~7%, cross-domain generalization degrades
W/o TVCA	Loses textual semantic guidance, reducing localization accuracy
W/o VSA (Only using the last-layer CLS)	Detection accuracy drops, losing low-level artifact info
Full fine-tune CLIP (vs prompt-tune)	Generalization deteriorates significantly, leading to training set overfitting

Key Findings

MaskCLIP has a clear advantage in cross-domain generalization: on Flux.1, the newest generator tested, its IoU is roughly double that of the best baseline in the results table (0.162 vs 0.076 for TruFor), suggesting that the SPM framework effectively preserves generalization ability.
Newer diffusion models are harder to detect: Flux.1 yields an IoU of only 0.16, significantly lower than SD1.5's 0.67.
Localization is far more challenging than detection: Global detection accuracy can reach 70%+, but pixel-level localization IoU is usually under 50%.
The localization task secures relative improvements of 14.23% (IoU) and 14.11% (F1), while the detection task achieves gains of 2.05% (Acc) and 2.38% (F1).
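
As a rough sanity check, the relative localization gain can be re-derived from the rounded Average IoU values in the main results table. The sketch below assumes the reported figure compares MaskCLIP against the best baseline average; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(ours, best_baseline):
    """Relative improvement in percent."""
    return 100.0 * (ours - best_baseline) / best_baseline

# Average IoU values copied from the main results table.
avg_iou = {"CAT-Net": 0.374, "TruFor": 0.369, "IML-ViT": 0.325, "MaskCLIP": 0.427}

best = max(v for name, v in avg_iou.items() if name != "MaskCLIP")
gain = relative_gain(avg_iou["MaskCLIP"], best)
print(f"{gain:.2f}%")  # about 14.17% from the rounded table values
```

The result is close to the reported 14.23%; the small gap is within what three-decimal rounding of the table entries can explain.
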

Highlights & Insights

The definition of the OpenSDI challenge systematically organizes the three dimensions of open-world diffusion image detection (user diversity, model innovation, manipulation spectrum), providing a clear problem formulation for future research.
Simulating user behavior with VLMs to generate manipulation instructions is a clever data construction strategy: it is far more scalable than human annotation and more natural and diverse than template generation.
The SPM framework follows a "synergy over replacement" philosophy: it preserves the pretrained knowledge of each foundation model and coordinates the models through lightweight modules. This design paradigm can be extended to other tasks requiring multi-model collaboration.

Limitations & Future Work

Localization performance on the latest diffusion models (e.g., Flux.1) remains weak (IoU of only 0.16), showing that open-world detection is far from solved.
The dataset's training split is generated with SD1.5 only, which limits how well the model can adapt to entirely new generators.
MaskCLIP requires running two encoders (CLIP and MAE) concurrently, resulting in high inference costs.
Future work could explore online learning or few-shot adaptation strategies for rapid adaptation to newly emerging diffusion models.
Generalization to GAN-generated images is not fully validated (primary focus is on diffusion models).

vs TruFor: TruFor fuses spatial, frequency, and noise domain features for localization but generalizes poorly to diffusion-generated images. MaskCLIP achieves better generalization via foundation model synergy.
vs CLIP-based detectors (e.g., DeCLIP): These only use CLIP for binary classification and lack pixel-level localization capabilities. MaskCLIP complements this with MAE for spatial precision.
vs IML-ViT: Based on ImageNet-pretrained ViT for localization, which fails to capture diffusion-specific artifacts. MaskCLIP's VCA mechanism introduces CLIP's semantic knowledge to supplement this.

Rating

Novelty: ⭐⭐⭐⭐ The definition of the OpenSDI challenge is valuable, and the design of the SPM framework is elegant.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive construction of a large-scale dataset, dual-task evaluation of detection + localization, and thorough cross-domain generalization analysis.
Writing Quality: ⭐⭐⭐⭐ Clearly structured, though the methodology section is somewhat dense.
Value: ⭐⭐⭐⭐⭐ An important contribution to AI safety, with high community demand for both the dataset and the methodology.