
Beyond Semantic Features: Pixel-Level Mapping for Generalized AI-Generated Image Detection

Conference: AAAI 2026 arXiv: 2512.17350 Code: None Area: Image Generation Keywords: AI-generated image detection, pixel-level mapping, semantic bias, high-frequency artifacts, cross-model generalization

TL;DR

This paper proposes a pixel-level mapping preprocessing method that suppresses low-frequency semantic bias and enhances high-frequency generation artifacts by breaking the monotonic ordering of pixel values, achieving a cross-model generalization accuracy of 98.4% in AI-generated image detection.

Background & Motivation

The central challenge facing current AI-generated image detectors is generalization failure: while detectors perform well within the training distribution, their performance degrades sharply when confronted with unseen generative models. The root cause is that detectors overfit to semantic biases specific to the training set—model-specific artifacts such as blurring and texture anomalies arising from architectural and training differences—rather than learning more fundamental, cross-model generation traces.

Existing approaches for reducing semantic influence exhibit notable shortcomings:

  • High-pass filtering: Removing low-frequency components inevitably discards useful generation artifact information and cannot fully eliminate semantic interference.
  • Patch Shuffling: Randomly shuffling image patches to constrain the receptive field still leaves semantic information recoverable — an ImageNet classifier converges even at the minimum patch size of 2 — so semantics are not fully suppressed.
  • NPR residual operations: Neighboring-pixel residual operations still cannot fully decouple the entanglement between semantic content and generation artifacts.

Method

Overall Architecture

The overall pipeline is remarkably simple: input image → pixel-level mapping module (preprocessing) → classification head (ResNet-50) → binary output. The pixel-level mapping is a fixed lookup-table transformation that introduces no learnable parameters.

Key Designs

1. Fixed Pixel-Level Mapping

A transformation is applied to each pixel value \(v \in \{0, 1, \dots, 255\}\):

\[\phi_f(v) = v - \text{round}\left(\frac{v}{256}, 2\right) \times 256\]

where round(·, 2) denotes rounding to 2 decimal places. The effects of this simple formula are:

  • Breaking the monotonic ordering of pixel values: Adjacent pixel values (e.g., 127, 128) may be mapped far apart, converting originally smooth low-frequency regions into high-frequency information.
  • Preserving local inter-pixel correlations: Unlike shuffling, which permutes spatial positions, pixel mapping preserves the spatial arrangement of pixels.
  • Automatic normalization: The rounding error \(|v/256 - \text{round}(v/256, 2)|\) is at most 0.005, so mapped values fall within \([-1.28, 1.28]\) (\(0.005 \times 256 = 1.28\)).

The choice of decimals=2 is well-justified: with decimals=1, \(v/256\) is quantized in steps of 0.1 (about 25.6 pixel levels, versus the pixel step \(1/256 \approx 0.0039\)), so the mapping stays piecewise-linear and locally monotonic; decimals ≥ 2 breaks monotonicity at nearly every pixel level, and decimals=2 additionally yields outputs already normalized to \([-1.28, 1.28]\).
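Since the transformation depends only on the 256 possible pixel values, it can be precomputed once as a lookup table. A minimal NumPy sketch of the formula above (the paper's code is unreleased, so function names here are illustrative):

```python
import numpy as np

def fixed_mapping_table(decimals: int = 2) -> np.ndarray:
    """Precompute phi_f(v) = v - round(v/256, decimals) * 256 for v = 0..255."""
    v = np.arange(256, dtype=np.float64)
    return v - np.round(v / 256.0, decimals) * 256.0

def apply_fixed_mapping(image: np.ndarray) -> np.ndarray:
    """Map a uint8 image through the fixed table (no learnable parameters)."""
    return fixed_mapping_table()[image]
```

For example, the table sends 127 to -1.0 but 128 to 0.0: adjacent pixel values land on distant outputs, which is exactly the monotonicity break the paper targets.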

2. Random Pixel-Level Mapping

To verify the hypothesis that the specific mapping relationship is unimportant—what matters is breaking the monotonic ordering—a random mapping variant is designed:

\[T_c \sim \mathcal{U}(-1, 1)^{256}, \quad c \in \{0, 1, 2\}\]

A random lookup table is independently generated per channel per sample, and the transformation is applied as \(I'_c[x,y] = T_c[I_c[x,y]]\).

A key finding is that random mapping achieves detection accuracy comparable to fixed mapping, confirming that the effective mechanism lies not in any particular mapping relationship but in the disruption of monotonic pixel ordering.
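The random variant swaps the fixed table for fresh uniform noise, drawn independently per channel and per sample. A hedged sketch of the formulation \(I'_c[x,y] = T_c[I_c[x,y]]\) (helper names are mine):

```python
import numpy as np

def random_mapping(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply an independent lookup table T_c ~ U(-1, 1)^256 to each channel.

    image: uint8 array of shape (H, W, C); a new set of tables is drawn
    per call, i.e., per sample, matching the paper's formulation.
    """
    channels = image.shape[-1]
    tables = rng.uniform(-1.0, 1.0, size=(channels, 256))
    out = np.empty(image.shape, dtype=np.float64)
    for c in range(channels):
        out[..., c] = tables[c, image[..., c]]
    return out
```

Note that, like the fixed mapping, this preserves spatial arrangement: two pixels with the same value in the same channel always map to the same output within one sample.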

3. Theoretical Analysis of Effectiveness

  • Semantic bias resides primarily in the low-frequency components of images (smooth regions).
  • Generation artifacts correlate with high-frequency details.
  • CNN classifiers exhibit an inductive bias toward low-frequency features, causing semantics to dominate during training.
  • Pixel mapping amplifies differences between adjacent pixel values, converting low-frequency information into high-frequency while preserving inter-pixel correlations.
  • Spectral analysis confirms that after mapping, the energy gap between low and high frequencies is significantly reduced, yielding a more uniform spectral energy distribution.
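The spectral claim can be sanity-checked on a toy example: a smooth ramp image is almost purely low-frequency, but after the fixed mapping its energy shifts decisively into high frequencies. A small sketch, using my own (illustrative) definition of the high-frequency energy fraction:

```python
import numpy as np

def high_freq_energy_fraction(img: np.ndarray) -> float:
    """Fraction of DC-removed spectral energy outside the central
    low-frequency quarter of the shifted 2-D spectrum."""
    f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    energy = np.abs(f) ** 2
    h, w = energy.shape
    low = energy[h // 2 - h // 8: h // 2 + h // 8,
                 w // 2 - w // 8: w // 2 + w // 8].sum()
    return float(1.0 - low / energy.sum())

# Smooth horizontal ramp: a stand-in for a low-frequency semantic region.
ramp = np.tile(np.arange(256, dtype=np.uint8), (256, 1))

# Fixed mapping phi_f as a lookup table.
table = np.arange(256, dtype=np.float64)
table = table - np.round(table / 256.0, 2) * 256.0
mapped = table[ramp]
```

On this example the mapped ramp's high-frequency fraction should be far larger than the original's, illustrating (not proving) the low-to-high frequency conversion the paper describes.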

Loss & Training

  • Standard binary cross-entropy loss: \(\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log f_\theta(x_i) + (1-y_i)\log(1-f_\theta(x_i))]\)
  • ResNet-50 backbone, Adam optimizer, lr = \(2 \times 10^{-4}\)
  • 200 epochs, batch size 128, 8× NVIDIA RTX 3090 GPUs
  • Random crop to 128×128 during training to avoid resize bias; center crop during testing
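Putting the training details together, the preprocessing pipeline as described (random 128×128 crop to avoid resize bias, then the fixed mapping) might look like the following; the crop logic and names are my own sketch, not the authors' code:

```python
import numpy as np

# Fixed mapping phi_f precomputed once as a 256-entry lookup table.
_TABLE = np.arange(256, dtype=np.float64)
_TABLE = _TABLE - np.round(_TABLE / 256.0, 2) * 256.0

def preprocess_train(image: np.ndarray, rng: np.random.Generator,
                     size: int = 128) -> np.ndarray:
    """Random crop (avoids resize bias), then pixel-level mapping."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    crop = image[top:top + size, left:left + size]
    return _TABLE[crop]

def preprocess_test(image: np.ndarray, size: int = 128) -> np.ndarray:
    """Center crop, then the same fixed mapping."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    crop = image[top:top + size, left:left + size]
    return _TABLE[crop]
```

The mapped output is already in \([-1.28, 1.28]\), so no further normalization is needed before feeding the ResNet-50 classifier.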

Key Experimental Results

Main Results

Table 1: Cross-GAN generalization (trained on 4 ProGAN categories, tested on 9 GAN models; ACC shown, table abridged)

| Method | S3GAN | SNGAN | STGAN | Mean ACC |
| --- | --- | --- | --- | --- |
| UnivFD | 85.2 | 77.6 | 74.2 | 77.6 |
| NPR | 79.0 | 88.8 | 98.0 | 93.2 |
| Fixed-mapping | 85.2 | 99.1 | 99.9 | 97.9 |
| Random-mapping | 77.4 | 98.3 | 99.9 | 96.9 |

Table 2: Cross-Model Generalization on GenImage (trained on SDv1.4, ACC)

| Method | Midjourney | SDv1.4 | SDv1.5 | ADM | GLIDE | Wukong | VQDM | BigGAN | mAcc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UnivFD | 93.9 | 96.4 | 96.2 | 71.9 | 85.4 | 94.3 | 81.6 | 90.5 | 88.8 |
| C2P-CLIP | 88.2 | 90.9 | 97.9 | 96.4 | 99.0 | 98.8 | 96.5 | 98.7 | 95.8 |
| Fixed-mapping | 96.8 | 98.9 | 98.8 | 98.7 | 98.4 | 98.2 | 98.8 | 98.8 | 98.4 |

Table 3: Comparison of Semantic Reduction Methods (GenImage, trained on SDv1.4)

| Method | mAcc | mAP |
| --- | --- | --- |
| ResNet-50 (baseline) | 67.0 | 76.9 |
| High-pass filtering | 64.4 | 70.2 |
| Patch shuffle (size=8) | 70.7 | 80.6 |
| Patch shuffle (size=2) | 50.5 | 51.0 |
| NPR | 88.6 | 93.7 |
| Fixed-mapping | 98.4 | 99.8 |

Ablation Study

  • High-pass filtering underperforms the baseline: Discarding low-frequency components simultaneously removes useful information.
  • Extremely small patch shuffle (size=2) completely fails: Excessive fragmentation prevents meaningful feature learning.
  • Random vs. fixed mapping yields comparable performance: This validates that breaking monotonicity is the core mechanism, not the specific mapping relationship.
  • Spectral analysis: The proposed mapping method significantly equalizes the low- and high-frequency energy distribution, whereas shuffle methods reduce low-frequency energy without enhancing high-frequency components.

Key Findings

  • A simple pixel-value lookup-table transformation outperforms state-of-the-art methods on mean accuracy, including C2P-CLIP, which leverages large pretrained models.
  • The method remains effective on high-resolution commercial models such as Midjourney, demonstrating robustness to resolution bias.
  • t-SNE visualizations show that mapped features effectively separate real from generated images, with a higher degree of separation than NPR.

Highlights & Insights

  1. Extreme simplicity: The core method is merely a lookup-table operation with zero additional parameters and negligible computational overhead (one table lookup per pixel), yet delivers substantial performance gains.
  2. Incisive experimental insight: The paper demonstrates that an ImageNet classifier can still learn semantic information from shuffled patches at size=2, revealing the fundamental limitations of existing semantic reduction approaches.
  3. Random mapping experiment: This elegantly validates the causal mechanism—what matters is not the specific mapping but the disruption of pixel monotonicity.
  4. Thorough frequency-domain analysis: Visualizing spectral energy distributions intuitively illustrates how the method balances low- and high-frequency energy.

Limitations & Future Work

  • Only ResNet-50 is used as the backbone; integration with large pretrained models (e.g., CLIP) is not explored.
  • The fixed mapping applies an identical transformation to all samples, potentially creating vulnerabilities to adversarial attacks.
  • Mapped images are unreadable to the human eye, precluding manual quality inspection.
  • Robustness to post-processing operations (JPEG compression, social media transmission, etc.) is not discussed.
  • The selection space for the decimals parameter is limited, and deeper theoretical guidance is lacking.
  • The method provides a methodological contrast to frequency-domain and shuffle-based approaches such as NPR and BSA, offering a third paradigm.
  • The core idea—breaking low-frequency structure while preserving high-frequency signals—is transferable to other detection tasks such as deepfake detection.
  • This work highlights the importance of accounting for the inductive biases of classifiers when designing detectors, as such biases significantly affect generalization.

Rating

  • Novelty: ⭐⭐⭐⭐ — A minimalist method achieves state-of-the-art results; the pixel-mapping perspective is original.
  • Technical Depth: ⭐⭐⭐ — The method itself is simple, but the analyses (spectral, t-SNE, random mapping validation) are thorough.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers both GAN- and diffusion-based generative models across multiple datasets and baselines.
  • Practical Value: ⭐⭐⭐⭐⭐ — A zero-cost preprocessing step directly integrable into any detector, offering extremely high practical utility.