Is Artificial Intelligence Generated Image Detection a Solved Problem?¶
Conference: NeurIPS 2025 arXiv: 2505.12335 Code: HorizonTEL/AIGIBench Area: Image Generation Keywords: AI-generated image detection, benchmark, robustness evaluation, data augmentation, generalization
TL;DR¶
This paper proposes AIGIBench, a comprehensive benchmark that systematically evaluates 11 state-of-the-art detectors across four tasks: multi-source generalization, multi-degradation robustness, data augmentation sensitivity, and test-time preprocessing impact. The results reveal that existing AIGI detection methods suffer severe performance degradation in real-world scenarios, demonstrating that the problem is far from solved.
Background & Motivation¶
With the rapid advancement of GANs and diffusion models, synthetic images have become increasingly photorealistic, giving rise to serious concerns regarding misinformation, deepfakes, and copyright infringement. Although numerous AIGI detectors have reported detection accuracies exceeding 95%, such high performance is predominantly achieved under idealized experimental conditions. Existing benchmarks suffer from the following deficiencies:
- Incomplete coverage of generation methods: Most benchmarks cover only methods prior to 2022 and lack the latest diffusion models (e.g., FLUX, Imagen-3, SD-3).
- Narrow evaluation dimensions: Only generalization is assessed, while robustness, data augmentation effects, and test-time preprocessing are overlooked.
- Absence of real-world data: Samples from social media and AI art communities that reflect authentic distribution shifts are not included.
- Outdated detector lineups: The detectors evaluated by most benchmarks are concentrated in the pre-2022 era.
The paper thus poses a central question: Is AI-generated image detection already a solved problem?
Core Problem¶
The paper answers this question by constructing the AIGIBench benchmark, which simulates an end-to-end real-world AIGI detection pipeline covering the complete workflow from training and augmentation to inference preprocessing. Four core evaluation tasks are defined:
- Task 1 — Multi-Source Generalization: Evaluates detectors' ability to generalize to images from unknown generative models.
- Task 2 — Multi-Degradation Robustness: Assesses stability under degradations such as JPEG compression, Gaussian noise, and up/downsampling.
- Task 3 — Data Augmentation Variation: Examines the impact of augmentation strategies (rotation, color jitter, random masking) on detection performance.
- Task 4 — Test-Time Preprocessing: Analyzes the effect of Resize vs. Crop preprocessing strategies on detection outcomes.
Method¶
Dataset Construction¶
Training Settings:
- Setting-I: 72K images generated by ProGAN (four categories: car, cat, chair, horse), covering a single GAN source.
- Setting-II: 144K images (ProGAN + SD-v1.4), same four categories, introducing diffusion models to expand the training distribution.
Test Sets (23 subsets + 2 real-world subsets):
| Category | Generation Methods |
|---|---|
| GAN Noise-to-Image | ProGAN, StyleGAN3, StyleGAN-XL, StyleSwin, R3GAN, WFIR |
| Diffusion Text-to-Image | SD-XL, SD-3, DALLE-3, Midjourney-v6, FLUX.1-dev, Imagen-3, GLIDE |
| GAN Deepfake | BlendFace, E4S, FaceSwap, InSwap, SimSwap |
| Diffusion Personalized Generation | InstantID, Infinite-ID, PhotoMaker, BLIP-Diffusion, IP-Adapter |
| Open Platforms | SocialRF (social media), CommunityAI (AI art community) |
Data Quality Control:
- Near-duplicate images removed using CLIP embeddings (cosine similarity threshold 0.98); a minimal sketch of this step follows the list.
- Low-quality images filtered via CLIP aesthetic scores.
- Manual review to remove overtly fabricated images.
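A minimal sketch of the near-duplicate filtering step, assuming an OpenCLIP ViT-B/32 encoder; the paper states only that CLIP embeddings with a 0.98 cosine-similarity threshold are used, so the model choice and the greedy filtering order here are assumptions.

```python
import torch
import open_clip

# Assumed encoder; the paper only specifies "CLIP embeddings".
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
model.eval()

@torch.no_grad()
def embed(images):
    """Encode a list of PIL images into L2-normalized CLIP features."""
    batch = torch.stack([preprocess(img) for img in images])
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

def deduplicate(images, threshold=0.98):
    """Greedily keep an image only if its cosine similarity to every
    previously kept image stays below the threshold (0.98 in the paper)."""
    feats = embed(images)
    kept_idx, kept_feats = [], []
    for i, f in enumerate(feats):
        if kept_feats and (torch.stack(kept_feats) @ f).max() > threshold:
            continue  # near-duplicate of an already-kept image
        kept_idx.append(i)
        kept_feats.append(f)
    return kept_idx  # indices of images that survive deduplication
```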
Real Image Sources: FFHQ, CelebA-HQ, and Open Images V7, paired one-to-one with fake images to ensure class balance.
Evaluation Metrics¶
- Acc.: Overall accuracy.
- A.P.: Average precision.
- R.Acc.: Real image detection accuracy (correct classification rate for real images).
- F.Acc.: Fake image detection accuracy (correct classification rate for fake images).
Decomposing accuracy into R.Acc. and F.Acc. is a key design choice of this paper, enabling more granular diagnosis of detector bias.
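As a reference, the four metrics can be computed from binary labels and detector scores as below; this follows the definitions above (0 = real, 1 = fake) rather than the authors' released evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(y_true, scores, threshold=0.5):
    """Compute Acc., A.P., R.Acc., and F.Acc. for one test subset.

    y_true : labels, 0 = real, 1 = fake
    scores : detector "fakeness" scores; the 0.5 threshold is an assumption
    """
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)

    real, fake = (y_true == 0), (y_true == 1)
    r_acc = float((y_pred[real] == 0).mean())     # accuracy on real images
    f_acc = float((y_pred[fake] == 1).mean())     # accuracy on fake images
    acc = float((y_pred == y_true).mean())        # overall accuracy
    ap = average_precision_score(y_true, scores)  # threshold-free A.P.
    return {"Acc": acc, "A.P.": ap, "R.Acc": r_acc, "F.Acc": f_acc}
```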
Evaluated Detectors (11 Total)¶
Representative methods spanning 2016–2025: ResNet-50, CNNDetection, Gram-Net, LGrad, CLIPDetection, FreqNet, NPR, LaDeDa, DFFreq, AIDE, and SAFE. More than half were published in 2024 or later.
Loss & Training¶
Training Settings¶
- Setting-I: 72K ProGAN-generated images (car/cat/chair/horse), covering a single GAN source only.
- Setting-II: 144K images (ProGAN + SD-v1.4), introducing diffusion models to expand the training distribution.
- All 11 detectors are retrained using original hyperparameters to ensure fair comparison.
- Transitioning from Setting-I to Setting-II by adding SD-v1.4 significantly improves R.Acc., but often at the cost of F.Acc., indicating a trade-off between real-image and fake-image accuracy.
Inference Pipeline¶
- Test images originate from unknown generative models and may have undergone unknown degradation.
- Preprocessing (Resize or Crop) is applied prior to inference to match the training resolution; a minimal illustration follows this list.
- Resize inadvertently smooths local correlations in synthetic images, attenuating subtle discriminative artifacts in low-level feature spaces.
- Crop better preserves fine-grained textures and local structures, but may discard discriminative cues such as boundary artifacts.
- Detectors produce binary outputs (real/fake); performance is decomposed into R.Acc. and F.Acc. during evaluation to avoid overall accuracy masking bias.
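A minimal illustration of the two preprocessing options with torchvision; the 224×224 target resolution is an assumption, since each detector in the benchmark uses its own training resolution.

```python
from torchvision import transforms

SIZE = 224  # assumed target resolution; detector-specific in practice

# Resize: rescales the full image, which can smooth the local pixel
# correlations that low-level artifact detectors rely on.
resize_tf = transforms.Compose([
    transforms.Resize((SIZE, SIZE)),
    transforms.ToTensor(),
])

# Crop: preserves native-resolution textures inside the crop window, but
# may discard cues (e.g., boundary artifacts) that fall outside it.
crop_tf = transforms.Compose([
    transforms.CenterCrop(SIZE),
    transforms.ToTensor(),
])
```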
Key Experimental Results¶
Task 1: Generalization Evaluation (Setting-II)¶
| Detector | Avg. R.Acc. | Avg. F.Acc. | Acc. | A.P. |
|---|---|---|---|---|
| SAFE | 96.8% | 63.0% | 79.9% | 82.6% |
| AIDE | 88.1% | 67.0% | 77.6% | 82.7% |
| LaDeDa | 91.7% | 54.9% | 73.4% | 79.3% |
| CLIPDetection | 73.3% | 71.5% | 72.5% | 75.6% |
| DFFreq | 89.6% | 51.9% | 71.1% | 75.7% |
| CNNDetection | 98.2% | 11.6% | 54.9% | 67.0% |
Key Finding: Even the best-performing SAFE exhibits extremely low or near-zero F.Acc. on deepfake datasets (FaceSwap, SimSwap) and on DALLE-3, SocialRF, and CommunityAI, demonstrating severe failure under real-world distribution shifts.
Task 2: Robustness Evaluation¶
| Degradation Type | Typical Consequence |
|---|---|
| JPEG Compression | F.Acc. of nearly all detectors drops to ~0%, while R.Acc. remains ~100% (severe bias toward predicting real) |
| Gaussian Noise | F.Acc. broadly falls below 35% |
| Up/Downsampling | Relatively moderate impact; some methods maintain reasonable performance |
The most perturbation-resistant methods are CLIPDetection and FreqNet, each for distinct reasons:
- CLIPDetection: Performs binary classification in the feature space of a large-scale pretrained CLIP-ViT using a nearest-neighbor + linear probing strategy; because it trains no forgery-specific features, its decision space is largely decoupled from the degradation type.
- FreqNet: Operates in the frequency domain, capturing forgery patterns that are inherently insensitive to spatial-domain perturbations (compression, noise), thereby providing better robustness.
- Overall Trend: The mean row shows that all detectors maintain R.Acc. ≥ 90% but F.Acc. < 35% under perturbation, indicating that detectors severely bias toward predicting real in degraded conditions, raising serious concerns about practical detection reliability.
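For reference, the three degradation types in the table above can be simulated as follows; the JPEG quality factor, noise sigma, and resampling factor are illustrative assumptions, not the benchmark's exact settings.

```python
import io
import numpy as np
from PIL import Image

def jpeg_compress(img: Image.Image, quality: int = 60) -> Image.Image:
    """Re-encode the image through JPEG at the given quality (value assumed)."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_noise(img: Image.Image, sigma: float = 10.0) -> Image.Image:
    """Add zero-mean Gaussian noise in pixel space (sigma assumed)."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def down_up_sample(img: Image.Image, factor: float = 0.5) -> Image.Image:
    """Downsample then upsample back to the original size (factor assumed)."""
    w, h = img.size
    small = img.resize((max(1, int(w * factor)), max(1, int(h * factor))))
    return small.resize((w, h))
```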
Task 3: Data Augmentation Evaluation¶
Three augmentation strategies—RandomRotation, Color-Jitter, and RandomMask—are evaluated on five advanced detectors (a transform sketch follows this list):
- Common augmentation strategies yield limited benefit for AIGI detection and may even introduce performance trade-offs.
- Augmentation typically improves R.Acc. but can reduce F.Acc.—for example, applying Rotation to CLIPDetection raises R.Acc. from 73.3% to 86.1% but lowers F.Acc. from 71.5% to 54.9%.
- Combining all three augmentations offers no clear advantage: FreqNet's F.Acc. drops to 62.5% (from 66.4%), and NPR's falls to 32.5% (from 41.9%).
- Augmentation effectiveness is highly detector-dependent: SAFE is least sensitive to augmentation strategies, with Acc. fluctuating by only ~2% across combinations, while frequency-domain models such as FreqNet and DFFreq are more vulnerable to semantic or frequency perturbations.
- Core Conclusion: Data augmentation is not a silver bullet for AIGI detection; augmentation-aware training pipelines tailored to specific detectors are required.
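For concreteness, the three augmentations map onto standard torchvision transforms plus a simple random-masking operation; the parameter values below are illustrative and not taken from the paper.

```python
import torch
from torchvision import transforms

class RandomMask:
    """Zero out a random square patch of an image tensor (illustrative
    implementation; the paper's masking parameters are not specified)."""
    def __init__(self, ratio: float = 0.25):
        self.ratio = ratio

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        _, h, w = x.shape
        mh, mw = int(h * self.ratio), int(w * self.ratio)
        top = int(torch.randint(0, h - mh + 1, (1,)))
        left = int(torch.randint(0, w - mw + 1, (1,)))
        x = x.clone()
        x[:, top:top + mh, left:left + mw] = 0.0
        return x

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),       # RandomRotation
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.1),  # Color-Jitter
    transforms.ToTensor(),
    RandomMask(ratio=0.25),                      # RandomMask
])
```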
Task 4: Preprocessing Evaluation¶
| Crop vs. Resize | Conclusion |
|---|---|
| R.Acc. | Crop yields significant improvement (e.g., SAFE: 63.3% → 96.8%) |
| F.Acc. | Crop shows negligible or slightly negative effect |
| Overall Acc. | Crop is generally superior |
Core Explanation: Real images have a concentrated and consistent modal distribution; Crop preserves high-frequency local features and texture details that facilitate recognition of real content. Fake images originate from diverse generative models with large modal variance, and Crop may remove discriminative cues such as boundary artifacts, resulting in unstable improvements for fake image detection. This modal asymmetry explains why Crop consistently improves R.Acc. but provides limited benefit for F.Acc.
Highlights & Insights¶
- First end-to-end evaluation framework: Covers generalization, robustness, data augmentation, and preprocessing in a unified benchmark, substantially surpassing existing benchmarks in systematic coverage.
- R.Acc./F.Acc. decomposition: Reveals severe bias concealed by overall accuracy—most detectors exhibit high R.Acc. but extremely low F.Acc.
- Coverage of the latest generation methods: 11 of the 25 test subsets derive from methods published in 2024 or later (FLUX, Imagen-3, SD-3, etc.).
- Introduction of real-world data: SocialRF and CommunityAI fill the evaluation gap for socially distributed content.
- Modal asymmetry analysis of preprocessing: Provides a clear mechanistic explanation for the differential effects of Crop and Resize.
Limitations & Future Work¶
- Limited training settings: Only ProGAN and SD-v1.4 are used for training; multi-source joint training or large-scale dataset training remains unexplored.
- Extensible detector coverage: Although 11 detectors cover representative methods, the latest multimodal and foundation model-based approaches could be incorporated.
- Video and multimodal extension: The current work focuses exclusively on static images; video deepfake and multimodal forgery detection are not addressed.
- Absence of adversarial robustness evaluation: Detector performance under adversarial attacks (e.g., adversarial perturbations, steganography) is not assessed.
- Prompt diversity in text-guided generation: While Gemini is used to generate diverse descriptions, whether the prompt distribution faithfully reflects real-world usage remains to be validated.
Related Work & Insights¶
| Benchmark | # Gen. Methods | # 2024+ Methods | # Detectors | Task Dims | Real-World Data |
|---|---|---|---|---|---|
| GenImage (NeurIPS'23) | 8 | 0 | 7 | 1 | ✗ |
| DeepfakeBench (NeurIPS'23) | 9 | 0 | 34 | 1 | ✗ |
| WildFake (AAAI'25) | 22 | 0 | 6 | 2 | ✗ |
| Chameleon (ICLR'25) | — | — | 10 | 2 | AI community |
| DF40 (NeurIPS'24) | 40 | 3 | 7 | 1 | ✗ |
| AIGIBench (Ours) | 25 | 11 | 11 | 4 | Social + AI community |
AIGIBench substantially leads in coverage of recent generation methods, comprehensiveness of evaluation task dimensions, and diversity of real-world data sources.
My Notes¶
- Frequency-domain features merit deeper investigation: FreqNet and DFFreq demonstrate superior generalization and robustness via frequency-domain features, suggesting that frequency-domain analysis is a promising direction for improving detector stability. FreqNet's relatively high F.Acc. under JPEG compression in particular indicates that frequency-domain artifacts are more persistent than spatial-domain artifacts.
- Large-scale pretraining is critical: CLIPDetection and AIDE leverage feature spaces from large-scale models such as CLIP-ViT to achieve cross-distribution generalization; future work may explore even larger vision foundation models. AIDE achieves the highest A.P. (82.7%) under Setting-II, suggesting that self-supervised feature spaces hold inherent advantages for AIGI detection.
- Detectors require "debiasing" training: The severe imbalance between R.Acc. and F.Acc. indicates that existing methods are excessively biased toward predicting real, necessitating dedicated bias mitigation strategies. CNNDetection is an extreme case—R.Acc. of 98.2% but F.Acc. of only 11.6%, rendering it nearly non-functional as a detector.
- Real-world deployment demands end-to-end consideration: Both data augmentation and inference preprocessing affect final performance; detection pipelines must be designed holistically for practical deployment. The modal asymmetry revealed by this paper (concentrated real-image distribution vs. dispersed fake-image distribution) provides a key framework for understanding the impact of each component.
- Connection to AI safety: This work provides a critical technical evaluation foundation for AI-generated content governance, and can be integrated with complementary security techniques such as watermarking and provenance tracking.
- Methodological implications for benchmark design: The R.Acc./F.Acc. decomposition combined with the four-dimensional evaluation framework offers a methodology worth emulating in other detection domains (e.g., deepfake video, AI-generated text detection); single-metric Acc. is highly misleading in imbalanced scenarios.
- SocialRF/CommunityAI are the hardest subsets: Nearly all detectors achieve F.Acc. < 20% on these two real-world subsets, implying that current detection technology remains far from practical deployment readiness and that domain adaptation or test-time adaptation strategies are urgently needed.
Rating¶
- Novelty: ⭐⭐⭐ — The core contribution lies in evaluation framework design rather than a novel algorithm, though the four-dimensional assessment and R.Acc./F.Acc. decomposition are valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 11 detectors × 25 test subsets × 4 evaluation task dimensions constitute an extensive experimental effort with in-depth analysis.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured, with information-rich tables and insightful analysis.
- Value: ⭐⭐⭐⭐ — Provides a sober assessment of the field's current state and offers meaningful guidance for future research directions.