Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection

Conference: CVPR 2026
arXiv: 2603.10598
Code: https://github.com/yywencs/LTD
Area: Image Generation / AI-Generated Image Detection
Keywords: Synthetic Image Detection, Layer Transition Discrepancy, CLIP-ViT, Cross-Domain Generalization, Dynamic Layer Selection

TL;DR

This paper observes that, inside a frozen CLIP ViT, real images show stable layer-to-layer transitions in their intermediate feature representations, whereas synthetic images show abrupt attention shifts at intermediate layers. Based on this observation, it proposes Layer Transition Discrepancy (LTD) to model the difference, reaching a mean accuracy (Acc) of 96.90% on UFD, 99.54% on DRCT-2M, and 91.62% on GenImage, ahead of the compared baselines such as ForgeLens and FatFormer.

Background & Motivation

Background: Generative models (GANs, diffusion models) produce increasingly photorealistic images, making the distinction between real and synthetic images an urgent challenge. Existing detection methods fall into three categories: (1) spatial texture/frequency-based methods (CNNSpot, NPR, FreqNet)—relying on model-specific artifacts with poor cross-domain generalization; (2) diffusion-model-specific detectors (DRCT, LaRE2)—underperforming on GAN-generated images; (3) frozen CLIP-based methods (UnivFD, RINE, FatFormer)—leveraging pretrained semantic features.

Limitations of Prior Work: Low-level artifacts (frequency, texture) shift as generative models evolve, causing detectors to learn model-specific biases. Among CLIP-based methods, UnivFD uses only the final-layer features, ignoring low-level information; RINE and similar methods aggregate all layer features but introduce noise from irrelevant information.

Key Challenge: A generalization-capable detection cue is needed that is agnostic to specific generative models—one that neither relies on particular artifacts nor fails to capture the essential differences between real and synthetic images.

Core Finding: By analyzing cosine similarity and L2 distance between features across CLIP ViT layers, the paper finds that real images maintain stable semantic attention consistency across intermediate layers (smooth inter-layer feature transitions), whereas synthetic images exhibit abrupt foreground/background attention shifts at intermediate layers (large inter-layer transition discrepancy). This phenomenon likely arises because generative models prioritize pixel-level fidelity and high-level semantic alignment but lack strict physical constraints, preventing them from maintaining continuous spatial correlations when integrating texture into structure at intermediate layers.
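As a rough illustration of this kind of analysis, the sketch below computes the cosine similarity and L2 distance between adjacent layers for two hand-made feature sequences. The helper names (`cosine_similarity`, `l2_distance`, `transition_profile`) and the toy 2-D vectors are assumptions made for illustration; they are not taken from the paper or its code, and real CLIP features would be high-dimensional.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def transition_profile(layer_features):
    """Return (cosine similarity, L2 distance) for each adjacent layer pair."""
    return [
        (cosine_similarity(a, b), l2_distance(a, b))
        for a, b in zip(layer_features, layer_features[1:])
    ]


# Toy 2-D "features" for four layers (made-up numbers, illustration only).
smooth = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [1.0, 0.3]]
abrupt = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]

smooth_profile = transition_profile(smooth)
abrupt_profile = transition_profile(abrupt)
```

In this toy case the smooth sequence keeps cosine similarity close to 1 at every step, while the abrupt sequence has one step with low similarity and a large L2 distance, which is the pattern the paper associates with synthetic images.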

Core Idea: Leverage Layer Transition Discrepancy (LTD)—the discrepancy between features of adjacent intermediate layers—as a model-agnostic detection signal, jointly modeling global structural consistency and local inter-layer variation.

Method

Overall Architecture

A frozen CLIP ViT-L/14 is used as the backbone to extract hierarchical features. A dynamic layer selection strategy adaptively selects the most discriminative subset of consecutive intermediate layers. LTD features are computed as differences between adjacent-layer features. The detector has two branches: one processes the raw intermediate-layer features to model overall consistency, and the other processes the LTD difference features to amplify inter-layer variation. Both branches pass through weight-sharing Transformer blocks; their outputs are concatenated and fed into an MLP classifier.
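The data flow into the two branches can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: `select_window` and `layer_transition_discrepancy` are hypothetical helper names, the per-layer features are made-up 4-D vectors standing in for CLIP CLS features, and treating the start index 11 as zero-based is an assumption.

```python
def select_window(layer_features, start, n):
    """Return n consecutive layers beginning at index `start`."""
    if start < 0 or start + n > len(layer_features):
        raise ValueError("window falls outside the available layers")
    return layer_features[start:start + n]


def layer_transition_discrepancy(window):
    """Element-wise difference between each pair of adjacent layers."""
    return [
        [b_i - a_i for a_i, b_i in zip(a, b)]
        for a, b in zip(window, window[1:])
    ]


# Stand-in for 24 per-layer CLS features of dimension 4 (made-up values).
num_layers, dim = 24, 4
features = [
    [float(layer * (j + 1)) for j in range(dim)]
    for layer in range(num_layers)
]

window = select_window(features, start=11, n=5)  # raw branch input: 5 vectors
ltd = layer_transition_discrepancy(window)       # LTD branch input: 4 vectors
```

A window of 5 layers yields 4 difference vectors, so the LTD branch always sees one token fewer than the raw-feature branch.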

Key Designs

Dynamic Layer-wise Selection:
Function: Adaptively selects \(n\) consecutive intermediate layers with the highest discriminability from the ViT's 24 layers.
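Mechanism: Learnable logits \(\boldsymbol{\pi} \in \mathbb{R}^C\) score the \(C = l - n + 1\) candidate windows, where \(l\) is the number of ViT layers (24 here); for example, \(n = 5\) gives \(C = 20\). Gumbel-Softmax sampling over \(\boldsymbol{\pi}\) determines the starting layer index \(s\) while keeping the selection differentiable during training.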
Design Motivation: The most discriminative layers may vary across images (experiments show layers 11–19 are optimal); fixed layer selection lacks flexibility. Gumbel-Softmax enables end-to-end differentiable discrete selection.
Layer Transition Discrepancy (LTD):
Function: Captures inter-layer transition differences between real and synthetic images in ViT intermediate layers.
Mechanism: For the \(n\) selected consecutive layers with CLS features \(\{\mathbf{f}_s^{(k)}\}_{k=1}^n\), adjacent-layer differences are computed as \(\mathbf{d}_s^{(k)} = \mathbf{f}_s^{(k+1)} - \mathbf{f}_s^{(k)}\) for \(k = 1, \dots, n-1\), yielding \(n-1\) LTD difference vectors.
Design Motivation: Compared to using raw features directly, difference features focus on inter-layer change patterns and suppress irrelevant redundant information. Real images exhibit small, stable differences; synthetic images exhibit large, abrupt differences.
Dual-Branch Weight-Sharing Detection Architecture:
Function: Jointly models global structural consistency and local inter-layer variation.
Mechanism: The raw feature branch \(\mathbf{F}_s = [\mathbf{f}_s, \mathbf{f}_{cls}, \mathbf{f}_p]\) and the LTD branch \(\mathbf{D} = [\mathbf{d}, \mathbf{d}_{cls}, \mathbf{d}_p]\) each incorporate a CLS token and positional encoding, then interact through weight-sharing trainable Transformer blocks.
Design Motivation: Weight sharing enforces feature alignment, mapping spatial consistency and inter-layer transitions into a unified semantic space and preventing distributional divergence.
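The window selection step can be illustrated with a small Gumbel-Softmax sampler. This is a sketch under stated assumptions, not the paper's code: it returns soft weights only (whether the paper uses a straight-through hard variant is not stated in this note), the temperature `tau`, the function names, and the hand-set logit values are hypothetical, and plain Python carries no gradients, so a real implementation would rely on an autograd framework.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Sample relaxed one-hot weights over the candidate windows."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def inference_start(logits):
    """Without noise, selection reduces to the arg-max candidate."""
    return max(range(len(logits)), key=lambda i: logits[i])


num_layers, n = 24, 5
num_candidates = num_layers - n + 1  # C = l - n + 1 = 20
logits = [0.0] * num_candidates      # pi; learnable in a real model
logits[11] = 3.0                     # pretend training favoured start index 11

weights = gumbel_softmax(logits, tau=0.5, rng=random.Random(0))
start = inference_start(logits)
```

The noise-free arg-max in `inference_start` picks the same window for every input once the logits are fixed.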

Loss & Training

The model is trained with a standard binary cross-entropy loss. Training uses only 2 categories of ProGAN data (chair and tvmonitor) and converges within 5 epochs. All CLIP ViT parameters are frozen; only the layer selection logits, the positional encodings, the dual-branch Transformer blocks, and the MLP classifier are trained.
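For reference, a numerically stable form of binary cross-entropy on a raw classifier logit is sketched below. The label convention (1.0 for synthetic, 0.0 for real) and the function names are assumptions for illustration; the note does not say how the paper encodes labels or whether the loss is applied to logits or probabilities.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross-entropy for one sample, computed from a raw logit.

    Uses the stable identity max(z, 0) - z * y + log(1 + exp(-|z|)).
    `label` is 1.0 for synthetic and 0.0 for real (assumed convention).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_bce(logits, labels):
    """Average the per-sample loss over a batch."""
    return sum(bce_with_logits(z, y) for z, y in zip(logits, labels)) / len(logits)


confident_right = bce_with_logits(4.0, 1.0)  # strongly predicts the true class
confident_wrong = bce_with_logits(4.0, 0.0)  # strongly predicts the wrong class
uncertain = bce_with_logits(0.0, 1.0)        # logit 0 means probability 0.5
```

A logit of 0 gives a loss of \(\ln 2\), and confident mistakes are penalized far more than confident correct predictions are rewarded.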

Key Experimental Results

Main Results

Dataset	Metric	Ours (LTD)	ForgeLens	FatFormer	Gain vs. best baseline (pp)
UFD	Mean Acc	96.90%	95.56%	95.98%	+0.92
UFD	Mean AP	99.51%	99.11%	99.15%	+0.36
DRCT-2M	Mean Acc	99.54%	98.22%	-	+1.32
DRCT-2M	Mean AP	99.99%	99.76%	-	+0.23
GenImage	Mean Acc	91.62%	89.18%	84.34%	+2.44
GenImage	Mean AP	97.17%	96.76%	95.01%	+0.41
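The Gain column is consistent with subtracting the strongest reported baseline in each row from the LTD result, in percentage points. The short check below reproduces it from the table values; the dictionary layout and the function name `gain_over_best_baseline` are illustrative choices, and `None` stands for the FatFormer entries the table leaves blank.

```python
# (ours, ForgeLens, FatFormer) copied from the table above.
results = {
    ("UFD", "Mean Acc"): (96.90, 95.56, 95.98),
    ("UFD", "Mean AP"): (99.51, 99.11, 99.15),
    ("DRCT-2M", "Mean Acc"): (99.54, 98.22, None),
    ("DRCT-2M", "Mean AP"): (99.99, 99.76, None),
    ("GenImage", "Mean Acc"): (91.62, 89.18, 84.34),
    ("GenImage", "Mean AP"): (97.17, 96.76, 95.01),
}


def gain_over_best_baseline(ours, *baselines):
    """Difference in percentage points to the best reported baseline."""
    reported = [b for b in baselines if b is not None]
    return round(ours - max(reported), 2)


gains = {key: gain_over_best_baseline(*vals) for key, vals in results.items()}
```

Note that the best baseline differs by row: FatFormer on UFD, ForgeLens on DRCT-2M and GenImage.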

Ablation Study

Configuration	UFD Acc	DRCT-2M Acc	Mean Acc	Note
Raw ML. only	84.92%	92.75%	88.84%	Raw intermediate-layer features only
Raw ML. + Pos.Enc	94.22%	96.12%	95.17%	With positional encoding
LTD only	86.42%	93.50%	89.96%	LTD difference features only
LTD + Pos.Enc	92.43%	94.01%	93.22%	LTD with positional encoding
Full model	96.90%	99.54%	98.22%	Complete dual-branch model
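The Mean Acc column appears to be the unweighted average of the UFD and DRCT-2M accuracies. The note does not state this explicitly, but the reported numbers are consistent with it, for example:

\[
\frac{96.90 + 99.54}{2} = 98.22, \qquad \frac{94.22 + 96.12}{2} = 95.17
\]

Under that reading, the full dual-branch model improves on the best single-branch configuration (raw features with positional encoding) by \(98.22 - 95.17 = 3.05\) percentage points.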

Key Findings

Real images maintain stable attention consistency across ViT intermediate layers (approximately layers 11–19), while synthetic images exhibit significant attention shifts in the same range.
Shallow layers (0–7) and deep layers (16–23) offer limited discriminability; intermediate layers (8–15) are most discriminative.
The optimal window comprises 5 consecutive layers starting at layer 11; performance degrades with either more or fewer layers.
Training on only 2 categories generalizes to 16 different GAN and diffusion model generators.
The method is robust to JPEG compression (QF 60–100) and downsampling (0.5×–1.0×).

Highlights & Insights

A previously overlooked detection cue is identified: Inter-layer transition discrepancy is model-agnostic and independent of generator-specific artifacts, endowing it with inherent cross-domain generalization.
Exceptional training efficiency: Only 2 training categories and 5 epochs are required; training completes within minutes on an RTX 4090.
Fastest inference speed: The frozen CLIP backbone combined with lightweight dual branches yields significantly higher FPS compared to methods such as FatFormer.
Physical prior insight: Generative models optimize primarily for pixel-level realism and high-level semantic alignment, but intermediate-layer structural continuity is left unconstrained, making it a revealing "window" that leaks information about generative origin.

Limitations & Future Work

Accuracy on Midjourney within GenImage reaches only 62.97%, indicating room for improvement on certain high-quality commercial models.
The method depends heavily on CLIP ViT pretrained representations; if future generators incorporate CLIP into the generation process itself, detection performance may degrade.
Dynamic layer selection via Gumbel-Softmax degenerates to a fixed selection at inference time, failing to achieve truly per-image adaptive selection.
Only CLS token LTD is exploited; local inter-layer discrepancy information from spatial tokens remains unutilized.

Related Work & Insights

vs. UnivFD: UnivFD performs linear probing using only the final CLIP layer, neglecting intermediate-layer information; LTD leverages inter-layer discrepancy in intermediate layers, improving mean Acc by 11%.
vs. FatFormer/RINE: These methods aggregate all layer features but introduce irrelevant information and noise; LTD focuses on inter-layer variation to suppress redundancy.
vs. NPR/FreqNet: These methods rely on low-level statistical artifacts (upsampling fingerprints, frequency spectra) and generalize poorly to diffusion models; LTD exploits inter-layer structural consistency and is effective against both GANs and diffusion models.
Insight: Intermediate-layer representations of large pretrained models contain rich forensic signals; inter-layer dynamics are more discriminative than single-layer features.

Rating

Novelty: ⭐⭐⭐⭐ — Identifies inter-layer transition discrepancy as a novel detection cue; the observation is insightful and highly inspiring.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across three major benchmarks, 16+ generators, with extensive robustness and ablation experiments.
Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated and visualizations are convincing, though the mathematical formulation of the method is relatively straightforward.
Value: ⭐⭐⭐⭐ — High practical value; simple and efficient to train, strong generalization, and well-suited for deployment.