
Diversity over Uniformity: Rethinking Representation in Generated Image Detection

Conference: CVPR 2026 | arXiv: 2603.00717 | Code: GitHub | Area: Image Forensics / AI-Generated Image Detection | Keywords: Generated image detection, feature collapse, representation diversity, information bottleneck, CLIP

TL;DR

This paper proposes an Anti-Feature-Collapse Learning (AFCL) framework that filters task-irrelevant features via an information bottleneck and suppresses excessive overlap among heterogeneous forgery cues, thereby preserving diversity and complementarity in discriminative representations. The method achieves significant improvements in cross-model generated image detection.

Background & Motivation

The core problem with existing generated image detectors is not insufficient features but representation homogenization: during training, models tend to compress multi-source information into a small number of salient discriminative patterns, forming shortcut-based decision rules. This feature collapse phenomenon leads to:

  • Discriminative capacity concentrated in a few principal component directions (effective rank of only 1–2)
  • Detection accuracy saturating after a small number of principal components, with additional features left unexploited
  • Severe performance degradation when the generative method or image distribution shifts

The authors validate this hypothesis through UMAP visualization and principal component analysis: the feature space of pre-trained models inherently contains rich discriminative cues, yet after fine-tuning these are compressed into an extremely low-rank subspace. Upon introducing representation heterogeneity constraints, the detector effectively leverages more principal components and exhibits substantially improved generalization.

Method

Overall Architecture

Built upon a frozen CLIP ViT-L/14 image encoder, the framework extracts multi-stage CLS features. Features at each stage are first filtered by a CIB module, then decorrelated by an AFCL module, and finally aggregated via weighted pooling with class-specific prompt learning for real/fake classification.
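As a concrete illustration of the multi-stage extraction, below is a minimal PyTorch sketch that pulls CLS tokens from several transformer blocks of a frozen CLIP ViT-L/14 via forward hooks (using open_clip for illustration). The tapped block indices and the layout-normalization heuristic are assumptions; the paper does not specify which stages are used.

```python
import torch
import open_clip

# Frozen CLIP ViT-L/14 image encoder (open_clip weights used for illustration).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

B = 4                      # toy batch size
STAGES = [5, 11, 17, 23]   # tapped block indices are an assumption
cls_feats = {}

def make_hook(name):
    def hook(module, inputs, output):
        x = output
        # Depending on the open_clip version, blocks run seq-first (L, B, D)
        # or batch-first (B, L, D); normalize to batch-first here.
        if x.shape[0] != B:
            x = x.permute(1, 0, 2)
        cls_feats[name] = x[:, 0, :]  # CLS token assumed at sequence index 0
    return hook

for i in STAGES:
    model.visual.transformer.resblocks[i].register_forward_hook(make_hook(i))

images = torch.randn(B, 3, 224, 224)  # stand-in for a preprocessed batch
with torch.no_grad():
    model.encode_image(images)

multi_stage = torch.stack([cls_feats[i] for i in STAGES], dim=1)  # (B, N, D)
```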

Key Designs

  1. Cue Information Bottleneck (CIB): Applies information bottleneck filtering to each stage's features, maximizing mutual information with label \(y\) while minimizing mutual information with input \(x\):

    \(\max_{\{\mathrm{CIB}_i\}} \sum_{i=1}^{N} [I(\tilde{v}_i; y) - \beta I(\tilde{v}_i; x)]\)

A tractable CIB loss is derived via a variational lower bound:

\(\mathcal{L}_{\mathrm{CIB}} = \sum_{i=1}^{N} D_{\mathrm{KL}}[p(y \mid \tilde{\mathcal{V}}) \,\|\, p(y \mid \tilde{\mathcal{V}} \setminus \tilde{v}_i)]\)

This ensures each cue carries indispensable discriminative information, yielding a purified and complementary feature set.
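Concretely, the leave-one-out KL term can be estimated per batch. A minimal sketch follows, assuming a shared classifier head produces \(p(y \mid \cdot)\) and that cues are aggregated by mean pooling; both are assumptions, as the paper's exact head is not reproduced here.

```python
import torch
import torch.nn.functional as F

def cib_loss(cues, classifier):
    """Leave-one-out KL term from L_CIB.

    cues: (B, N, D) filtered stage features v~_i.
    classifier: any head mapping an aggregated (B, D) feature to class logits;
                mean pooling over cues is an assumption made for this sketch.
    """
    B, N, D = cues.shape
    log_p_full = F.log_softmax(classifier(cues.mean(dim=1)), dim=-1)
    loss = 0.0
    for i in range(N):
        keep = [j for j in range(N) if j != i]
        loo = cues[:, keep, :].mean(dim=1)  # aggregate with cue i dropped
        log_p_loo = F.log_softmax(classifier(loo), dim=-1)
        # KL[p(y | V) || p(y | V \ v_i)], batch-averaged
        loss = loss + F.kl_div(log_p_loo, log_p_full,
                               log_target=True, reduction="batchmean")
    return loss
```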

  2. Anti-Feature-Collapse Learning (AFCL): Employs the Hilbert-Schmidt Independence Criterion (HSIC) as a kernel-based dependency measure to enforce statistical independence among features across different stages:

    \(\mathcal{L}_{\mathrm{AFCL}} = \frac{1}{N(N-1)} \sum_{i \neq j} \mathrm{HSIC}(\tilde{v}_i, \tilde{v}_j)\)

where \(\mathrm{HSIC}(\tilde{v}_i, \tilde{v}_j) = \frac{1}{(B-1)^2} \mathrm{Tr}(K_i H K_j H)\). A weight uniformity regularizer \(\mathcal{L}_{\mathrm{reg}} = (\sum_{i=1}^{N} \alpha_i^2 - 1/N)^2\) is further introduced to prevent aggregation weight collapse.
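A minimal sketch of the batch HSIC estimate and the pairwise penalty follows. The Gaussian RBF kernel with a median-heuristic bandwidth is an assumption; the paper only specifies a kernel-based measure.

```python
import torch

def rbf_kernel(x, sigma=None):
    """Gaussian kernel matrix; median-heuristic bandwidth is an assumption."""
    d2 = torch.cdist(x, x).pow(2)
    if sigma is None:
        sigma = d2.detach().median().sqrt().clamp(min=1e-6)
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    """Biased batch HSIC estimate: Tr(Kx H Ky H) / (B - 1)^2."""
    B = x.shape[0]
    H = torch.eye(B, device=x.device) - torch.full((B, B), 1.0 / B,
                                                   device=x.device)
    Kx, Ky = rbf_kernel(x), rbf_kernel(y)
    return torch.trace(Kx @ H @ Ky @ H) / (B - 1) ** 2

def afcl_loss(cues):
    """L_AFCL: mean pairwise HSIC over distinct stage cues. cues: (B, N, D)."""
    N = cues.shape[1]
    total = 0.0
    for i in range(N):
        for j in range(N):
            if i != j:
                total = total + hsic(cues[:, i, :], cues[:, j, :])
    return total / (N * (N - 1))
```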

  3. Class-Specific Prompt Learning (CSP): Learns a set of trainable context vectors for each of the "real" and "fake" classes, aligning the final visual representation with text prototypes via cosine similarity:

    \(s_c = \frac{\tilde{v}_{\mathrm{final}} \cdot e_c}{\|\tilde{v}_{\mathrm{final}}\| \|e_c\|}\)

Optimized using a cross-entropy loss \(\mathcal{L}_{\mathrm{CSP}}\).
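A compact sketch of the CSP scoring and loss; the temperature scaling is an assumption borrowed from CLIP-style heads, not a detail confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def csp_loss(v_final, class_embeds, labels, temperature=0.01):
    """Cosine-similarity scoring against 'real'/'fake' prototypes + CE.

    v_final: (B, D) aggregated visual feature.
    class_embeds: (2, D) prompt-derived prototypes e_c (trainable in the paper).
    temperature: assumed CLIP-style scaling, not specified in the source.
    """
    v = F.normalize(v_final, dim=-1)
    e = F.normalize(class_embeds, dim=-1)
    logits = v @ e.t() / temperature  # cosine similarities s_c
    return F.cross_entropy(logits, labels)
```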

Loss & Training

The total loss jointly optimizes four terms (a combination sketch follows the training details below):

\[\mathcal{L} = \mathcal{L}_{\mathrm{CSP}} + \lambda_1 \mathcal{L}_{\mathrm{CIB}} + \lambda_2 \mathcal{L}_{\mathrm{AFCL}} + \lambda_3 \mathcal{L}_{\mathrm{reg}}\]
  • Backbone: CLIP ViT-L/14 (frozen)
  • Optimizer: Adam, learning rate \(1 \times 10^{-4}\), batch size 512
  • Training data: SD v1.4 subset of GenImage
  • Early stopping applied to prevent overfitting
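Putting the pieces together, here is a sketch of the total objective reusing the loss functions sketched above; the \(\lambda\) values are placeholders, not the paper's tuned hyperparameters.

```python
# Placeholder weights, not the paper's tuned values.
lambda1, lambda2, lambda3 = 0.1, 1.0, 0.01

def total_loss(v_final, cues, alphas, class_embeds, labels, classifier):
    """v_final: (B, D) weighted-pooled feature; cues: (B, N, D) stage cues;
    alphas: (N,) aggregation weights; class_embeds: (2, D) CSP prototypes."""
    l_csp = csp_loss(v_final, class_embeds, labels)   # sketched earlier
    l_cib = cib_loss(cues, classifier)                # sketched earlier
    l_afcl = afcl_loss(cues)                          # sketched earlier
    # Weight uniformity regularizer: (sum_i alpha_i^2 - 1/N)^2
    l_reg = (alphas.pow(2).sum() - 1.0 / alphas.numel()).pow(2)
    return l_csp + lambda1 * l_cib + lambda2 * l_afcl + lambda3 * l_reg
```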

Key Experimental Results

Main Results

| Metric | Ours (AFCL) | VIB-Net | CLIPping | Gain (vs. VIB-Net) |
|---|---|---|---|---|
| Mean AP | 99.52% | 96.13% | 87.32% | +3.39% |
| Mean ACC | 92.81% | 87.13% | 87.81% | +5.68% |
| Cross-model ACC | 90.02% | — | — | — |

Evaluation covers 21 generative models (11 from UniversalFakeDetect + 7 from GenImage + 6 from AIGI-Holmes), spanning both GAN and diffusion model paradigms.

Ablation Study

| Configuration | Mean ACC | Cross-model ACC | Mean AP | Note |
|---|---|---|---|---|
| Baseline | 87.81% | 85.56% | 87.32% | No CIB / AFCL / reg |
| +CIB | 89.72% | 85.99% | 99.38% | Information bottleneck filtering |
| +AFCL+reg | 91.60% | 91.15% | 99.40% | Decorrelation + regularization |
| Full model | 92.81% | 90.02% | 99.52% | All components combined |

Key Findings

  • The proposed method achieves an effective rank of 67.38, far exceeding CNNDet (1.37) and VIB-Net (1.92); see the sketch after this list
  • Fine-tuning with the proposed method reduces the number of principal components needed to explain 90% of the variance by only 26 relative to the pre-trained backbone, whereas other detectors reduce it by hundreds
  • With only 0.1% of training data (320 samples), the method achieves 80.98% ACC / 90.81% AP
  • Maintains state-of-the-art robustness under JPEG compression and Gaussian blur perturbations
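The effective rank figures above can in principle be reproduced with the entropy-based definition (Roy & Vetterli, 2007): the exponential of the Shannon entropy of the normalized singular values. Whether the paper uses exactly this definition is an assumption; a minimal sketch:

```python
import torch

def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of a feature matrix.

    features: (num_samples, dim) matrix of detector embeddings.
    Returns exp(H(p)), where p are the normalized singular values.
    """
    # Center the features so rank reflects variance structure (assumption).
    x = features - features.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(x)
    p = s / (s.sum() + eps)
    entropy = -(p * (p + eps).log()).sum()
    return entropy.exp().item()
```

A collapsed detector concentrates singular values in one or two directions, driving this quantity toward 1-2, while a diverse representation keeps it high.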

Highlights & Insights

  • Precise problem diagnosis: Through effective rank and principal component analysis, the paper quantitatively identifies feature collapse as the root cause of generalization failure, rather than insufficient information capacity
  • HSIC-based decorrelation: Using kernel methods to measure feature independence is more flexible than simple orthogonality constraints, capturing nonlinear dependencies
  • Dual strategy of purification and diversity: CIB denoises and AFCL decorrelates — the two are complementary and neither is dispensable

Limitations & Future Work

  • Training is conducted solely on SD v1.4; whether the advantages persist under larger-scale multi-source training remains unverified
  • How the choice of the number of multi-stage features \(N\) affects performance is not thoroughly discussed
  • Detection capability for video generation models (e.g., Sora) is not addressed

Relation to Prior Work

  • Complementary to the variational information bottleneck mechanism in VIB-Net: VIB-Net compresses redundancy but does not address cross-layer collapse
  • DRCT and Bias-Free methods mitigate the issue indirectly through debiasing operations, whereas the proposed approach targets the representational structure directly, offering a more fundamental solution
  • Inspiration: the anti-collapse idea underlying AFCL can transfer to other discriminative tasks that require generalization, such as deepfake video detection and cross-domain classification

Rating

  • Novelty: ⭐⭐⭐⭐ Re-examines the detection problem from the perspective of representation collapse — a novel and theoretically grounded angle
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers 21 generative models with complete ablations and thorough robustness/few-shot experiments
  • Writing Quality: ⭐⭐⭐⭐ Motivation analysis and visualizations are convincing; presentation is clear and well-structured
  • Value: ⭐⭐⭐⭐ Provides important insights for the field of generated image detection; the AFCL idea is broadly transferable