Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection

Conference: CVPR 2026
arXiv: 2603.10598
Code: https://github.com/yywencs/LTD
Area: Image Generation / AI-Generated Image Detection
Keywords: Synthetic Image Detection, Layer Transition Discrepancy, CLIP-ViT, Cross-domain Generalization, Dynamic Layer Selection

TL;DR

Real images exhibit stable inter-layer transitions in the intermediate representations of a frozen CLIP ViT, whereas synthetic images show abrupt attention shifts between layers. The paper proposes Layer Transition Discrepancy (LTD) to model this difference, reaching a mean Acc of 96.90% on UFD, 99.54% on DRCT-2M, and 91.62% on GenImage, and outperforming the compared state-of-the-art detectors on all three benchmarks.

Background & Motivation

Background: Images synthesized by generative models (GANs, Diffusion Models) are becoming increasingly realistic, making fake image detection urgent. Existing methods include: (1) Spatial/frequency-based methods (CNNSpot, NPR, FreqNet), which rely on specific artifacts and generalize poorly; (2) Diffusion-specific detectors (DRCT, LaRE2), which perform poorly on GANs; (3) Frozen CLIP-based methods (UnivFD, RINE, FatFormer), which utilize pre-trained semantic features.

Limitations of Prior Work: Low-level artifacts (frequency, texture) evolve with generative models, leading to model-specific biases. Among CLIP-based methods, UnivFD uses only the last layer, ignoring lower-level information, while RINE fuses all layers but introduces noise from irrelevant information.

Key Challenge: A model-agnostic, universal detection cue is needed—one that does not rely on specific artifacts but captures the essential difference between real and synthetic images.

Key Insight: By analyzing cosine similarity and L2 distance of features across CLIP ViT layers, it was found that real images maintain stable semantic attention consistency (smooth transitions), while synthetic images exhibit abrupt foreground/background attention jumps (high discrepancy). This phenomenon may stem from generative models prioritizing pixel-level realism and high-level semantic alignment while lacking strict physical constraints, causing a failure to maintain spatial correlation when integrating texture into structure.
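The layer-wise analysis described above can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: the function name `layer_transition_stats` is hypothetical, and the inputs are random stand-ins for per-layer CLS features (24 layers, width 1024 assumed for ViT-L/14), not real CLIP activations.

```python
import numpy as np

def layer_transition_stats(cls_feats):
    """cls_feats: (num_layers, dim) array, one CLS vector per layer.
    Returns cosine similarity and L2 distance for each adjacent layer pair."""
    a, b = cls_feats[:-1], cls_feats[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    cos = np.sum(a * b, axis=1) / (norms + 1e-12)
    l2 = np.linalg.norm(b - a, axis=1)
    return cos, l2

rng = np.random.default_rng(0)
num_layers, dim = 24, 1024  # assumed ViT-L/14 depth and hidden width

# Stand-in trajectories (NOT real CLIP features):
# "smooth" drifts slowly from layer to layer; "jumpy" has one abrupt step.
base = rng.normal(size=dim)
steps = 0.05 * rng.normal(size=(num_layers, dim))
smooth = base + np.cumsum(steps, axis=0)

jump_steps = steps.copy()
jump_steps[12] += 2.0 * rng.normal(size=dim)  # abrupt change entering layer 12
jumpy = base + np.cumsum(jump_steps, axis=0)

cos_s, l2_s = layer_transition_stats(smooth)
cos_j, l2_j = layer_transition_stats(jumpy)
print("largest L2 step (smooth):", l2_s.max())
print("largest L2 step (jumpy): ", l2_j.max(), "at transition", int(np.argmax(l2_j)))
```

With real features, the same two curves would be plotted per layer for real and synthetic images to look for the mid-layer discrepancy the paper reports.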

Core Idea: Leverage Layer Transition Discrepancy (LTD)—the difference between adjacent intermediate layers—as a model-agnostic signal to model both global structural consistency and local inter-layer changes.

Method

Overall Architecture

The method identifies a detection cue independent of specific generator artifacts. It starts from the observation that when images pass through a frozen CLIP ViT-L/14, the CLS features of real images transition smoothly, while those of synthetic images jump abruptly in middle layers. The pipeline directly models "how layers change" rather than specific fingerprints.

Input images pass through frozen CLIP ViT to extract 24 layers of features. A dynamic layer selection module picks the most discriminative consecutive segment (typically layers 11–19). One branch retains raw intermediate features to monitor global consistency, while the other computes the Layer Transition Discrepancy (LTD) by subtracting adjacent layers to amplify the jump. Both branches, augmented with CLS tokens and position encodings, are processed by a weight-sharing Transformer block for interaction before being concatenated for MLP-based binary classification.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Input Image"] --> B["Frozen CLIP ViT-L/14<br/>24-layer features"]
    B --> C["Dynamic Layer Selection<br/>Gumbel-Softmax picks n layers"]
    C --> D["Raw Mid-layer Branch<br/>Global structural consistency"]
    C --> E["LTD Branch<br/>Adjacent CLS differences"]
    D --> G["Shared-weight Branch<br/>Single Transformer block"]
    E --> G
    G --> H["Concatenate → MLP → Binary Classification"]

Key Designs

1. Dynamic Layer Selection: Learning the Most Discriminative Layers

Fixed layer selection is risky because artifacts vary across generative models. Signals are concentrated in the middle layers, while shallow (0–7) and deep (16–23) layers have limited discriminative power. The starting index \(s\) of a window of \(n\) consecutive layers is therefore made learnable. With learnable logits \(\boldsymbol{\pi} \in \mathbb{R}^C\), where \(l\) is the total number of layers and \(C = l - n + 1\) is the number of candidate windows, the Gumbel-Softmax trick enables end-to-end differentiable selection of the window during training.
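A forward-pass sketch of this window selection is given below in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the helper names are hypothetical, the window size `n = 5` is taken from the ablation finding, the soft mixing of candidate windows during training is one possible relaxation (a straight-through hard sample is another), and gradients are omitted because NumPy does not track them.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Sample relaxed one-hot weights over the candidate start indices."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))   # Gumbel(0, 1) noise
    z = (logits + gumbel) / tau
    z = z - z.max()                # numerical stability
    e = np.exp(z)
    return e / e.sum()

def select_window(feats, logits, n, tau=1.0, rng=None, training=True):
    """feats: (l, dim) per-layer CLS features; logits: (l - n + 1,) window scores."""
    num_layers = feats.shape[0]
    num_candidates = num_layers - n + 1        # C = l - n + 1
    assert logits.shape == (num_candidates,)
    if not training:
        # At inference the choice collapses to a fixed window.
        s = int(np.argmax(logits))
        return feats[s:s + n], s
    weights = gumbel_softmax(logits, tau, rng)
    # Stack all C candidate windows, then mix them with the sampled weights.
    windows = np.stack([feats[s:s + n] for s in range(num_candidates)])
    return np.tensordot(weights, windows, axes=1), weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(24, 1024))   # stand-in features, not real CLIP outputs
logits = np.zeros(24 - 5 + 1)         # C = 20 candidate windows
logits[11] = 5.0                      # pretend training favoured start index 11

soft_window, weights = select_window(feats, logits, n=5, tau=0.5, rng=rng)
hard_window, start = select_window(feats, logits, n=5, training=False)
print(soft_window.shape, start)
```

In an autograd framework the logits would be trainable parameters, and lowering `tau` would push the sampled weights closer to a one-hot choice.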

2. Layer Transition Discrepancy (LTD): Focusing on Inter-layer Change

Using raw features can introduce irrelevant semantic noise, so LTD focuses on the difference between adjacent layers. For the \(n\) selected layers \(\{\mathbf{f}_s^{(k)}\}_{k=1}^n\), the CLS tokens of neighbouring layers are subtracted: \(\mathbf{d}_s^{(k)} = \mathbf{f}_s^{(k+1)} - \mathbf{f}_s^{(k)}\) for \(k = 1, \dots, n-1\). This operation cancels content shared by both layers and highlights how the representation "moves" from one layer to the next.
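The subtraction itself is a one-liner. The sketch below (hypothetical function name, random stand-in features) also checks the "cancels out shared content" claim: adding the same vector to every layer leaves the differences unchanged.

```python
import numpy as np

def layer_transition_discrepancy(selected):
    """selected: (n, dim) CLS features of the chosen consecutive layers.
    Returns the (n - 1, dim) differences d^(k) = f^(k+1) - f^(k)."""
    return selected[1:] - selected[:-1]

rng = np.random.default_rng(0)
n, dim = 5, 1024                      # assumed window size and feature width
selected = rng.normal(size=(n, dim))  # stand-in features, not real CLIP outputs

ltd = layer_transition_discrepancy(selected)

# A component shared by all layers disappears in the differences.
shared = rng.normal(size=dim)
ltd_shifted = layer_transition_discrepancy(selected + shared)
print(ltd.shape, np.allclose(ltd, ltd_shifted))
```

The same result is given by `np.diff(selected, axis=0)`.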

3. Dual-branch Weight Sharing: Aligning Consistency and Jumps

To preserve both global structure and local transitions, the raw branch \(\mathbf{F}_s = [\mathbf{f}_s, \mathbf{f}_{cls}, \mathbf{f}_p]\) and the LTD branch \(\mathbf{D} = [\mathbf{d}, \mathbf{d}_{cls}, \mathbf{d}_p]\) are fed into the same trainable Transformer block. Weight sharing forces the two kinds of features into a unified semantic space, which facilitates joint discrimination. In the ablations, either branch alone reaches a mean Acc of only about 89–90% without position encoding (93–95% with it), while the full dual-branch model reaches 98.22%.
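The weight-sharing idea can be illustrated with a toy block. Everything below is an assumption made for the sketch: a single-head self-attention layer with a residual connection stands in for the Transformer block, the CLS token is prepended and position encodings are added (the source only states that both are included), a single linear layer replaces the MLP head, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SharedBlock:
    """Toy single-head self-attention with a residual connection."""
    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.wq, self.wk, self.wv = (
            scale * rng.normal(size=(dim, dim)) for _ in range(3)
        )

    def __call__(self, tokens):  # tokens: (T, dim)
        q, k, v = tokens @ self.wq, tokens @ self.wk, tokens @ self.wv
        attn = softmax(q @ k.T / np.sqrt(tokens.shape[1]))
        return tokens + attn @ v

def build_sequence(feats, cls_token, pos_embed):
    """Prepend a branch-specific CLS token, then add position encodings."""
    seq = np.vstack([cls_token[None, :], feats])
    return seq + pos_embed[: seq.shape[0]]

rng = np.random.default_rng(0)
n, dim = 5, 64                         # small width to keep the demo light
raw = rng.normal(size=(n, dim))        # stand-in for selected layer features
ltd = raw[1:] - raw[:-1]               # adjacent-layer differences

block = SharedBlock(dim, rng)          # ONE set of weights for both branches
raw_cls, ltd_cls = rng.normal(size=dim), rng.normal(size=dim)
pos = 0.02 * rng.normal(size=(n + 1, dim))

raw_out = block(build_sequence(raw, raw_cls, pos))
ltd_out = block(build_sequence(ltd, ltd_cls, pos))

# Concatenate the two CLS outputs and score them with a linear head.
fused = np.concatenate([raw_out[0], ltd_out[0]])
w_head = rng.normal(size=2 * dim) / np.sqrt(2 * dim)
prob_fake = 1.0 / (1.0 + np.exp(-(fused @ w_head)))
print(raw_out.shape, ltd_out.shape, fused.shape)
```

Because `block` is called on both sequences, any gradient update would have to serve both branches at once, which is the mechanism the paragraph above credits for aligning them.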

Loss & Training

Standard binary cross-entropy loss is used. Training utilizes only 2 classes of ProGAN data (chair + tvmonitor) and converges within 5 epochs. With a frozen CLIP backbone, the few trainable parameters allow training to complete in minutes on a single RTX 4090.
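For reference, binary cross-entropy on raw logits can be written in a numerically stable form as below. The label convention (1 = synthetic, 0 = real) and the function name are assumptions for illustration; the source only states that standard BCE is used.

```python
import numpy as np

def bce_with_logits(logits, labels):
    """Mean binary cross-entropy computed from raw logits.
    Uses max(x, 0) - x * y + log(1 + exp(-|x|)) to avoid overflow."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    per_sample = (
        np.maximum(logits, 0.0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_sample.mean()

labels = np.array([1.0, 0.0, 1.0, 0.0])        # assumed: 1 = synthetic, 0 = real
confident = np.array([6.0, -6.0, 5.0, -5.0])   # logits that agree with the labels
uninformed = np.zeros(4)                        # logits of 0 give probability 0.5

print(bce_with_logits(confident, labels), bce_with_logits(uninformed, labels))
```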

Key Experimental Results

Main Results

| Dataset | Metric | Ours (LTD) | ForgeLens | FatFormer | Gain over best baseline |
|---|---|---|---|---|---|
| UFD | Mean Acc | 96.90% | 95.56% | 95.98% | +0.92% |
| UFD | Mean AP | 99.51% | 99.11% | 99.15% | +0.36% |
| DRCT-2M | Mean Acc | 99.54% | 98.22% | - | +1.32% |
| DRCT-2M | Mean AP | 99.99% | 99.76% | - | +0.23% |
| GenImage | Mean Acc | 91.62% | 89.18% | 84.34% | +2.44% |
| GenImage | Mean AP | 97.17% | 96.76% | 95.01% | +0.41% |
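The Gain column is consistent with taking the absolute difference (in percentage points) between LTD and the stronger of the two listed baselines. A short check using only the numbers from the table above:

```python
# Values copied from the main results table; "-" entries are simply omitted.
results = {
    ("UFD", "Acc"): {"LTD": 96.90, "ForgeLens": 95.56, "FatFormer": 95.98},
    ("UFD", "AP"): {"LTD": 99.51, "ForgeLens": 99.11, "FatFormer": 99.15},
    ("DRCT-2M", "Acc"): {"LTD": 99.54, "ForgeLens": 98.22},
    ("DRCT-2M", "AP"): {"LTD": 99.99, "ForgeLens": 99.76},
    ("GenImage", "Acc"): {"LTD": 91.62, "ForgeLens": 89.18, "FatFormer": 84.34},
    ("GenImage", "AP"): {"LTD": 97.17, "ForgeLens": 96.76, "FatFormer": 95.01},
}

def gain_over_best_baseline(row):
    """Percentage-point gap between LTD and the best listed baseline."""
    best = max(score for name, score in row.items() if name != "LTD")
    return round(row["LTD"] - best, 2)

gains = {key: gain_over_best_baseline(row) for key, row in results.items()}
for key, gain in gains.items():
    print(key, f"+{gain:.2f}")
```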

Ablation Study

| Configuration | UFD Acc | DRCT-2M Acc | Mean Acc | Description |
|---|---|---|---|---|
| Raw ML. only | 84.92% | 92.75% | 88.84% | Raw mid-layer features only |
| Raw ML. + Pos.Enc | 94.22% | 96.12% | 95.17% | Added position encoding |
| LTD only | 86.42% | 93.50% | 89.96% | LTD features only |
| LTD + Pos.Enc | 92.43% | 94.01% | 93.22% | LTD with position encoding |
| Full model | 96.90% | 99.54% | 98.22% | Complete dual-branch model |
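The Mean Acc column here is the average of the UFD and DRCT-2M accuracies, not a three-benchmark mean. The check below recomputes it from the table and reports how much the full model adds over each single-branch variant; the tolerance allows for rounding in the reported values.

```python
# (UFD Acc, DRCT-2M Acc, reported Mean Acc), copied from the ablation table.
ablation = {
    "Raw ML. only": (84.92, 92.75, 88.84),
    "Raw ML. + Pos.Enc": (94.22, 96.12, 95.17),
    "LTD only": (86.42, 93.50, 89.96),
    "LTD + Pos.Enc": (92.43, 94.01, 93.22),
    "Full model": (96.90, 99.54, 98.22),
}

recomputed = {name: (ufd + drct) / 2 for name, (ufd, drct, _) in ablation.items()}
consistent = all(
    abs(recomputed[name] - reported) < 0.006   # reported values have 2 decimals
    for name, (_, _, reported) in ablation.items()
)

full_mean = ablation["Full model"][2]
improvement = {
    name: round(full_mean - reported, 2)
    for name, (_, _, reported) in ablation.items()
    if name != "Full model"
}
print(consistent, improvement)
```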

Key Findings

  • Real images maintain stable attention consistency in ViT middle layers (approx. Layers 11-19), while synthetic images show jumps.
  • Middle layers (8-15) are more discriminative than shallow (0-7) or deep (16-23) layers.
  • The optimal window is 5 consecutive layers; more or fewer layers degrade performance.
  • Training on just 2 categories generalizes to 16 different GAN and DM generators.
  • Robust against JPEG compression (QF 60-100) and downsampling (0.5x-1.0x).

Highlights & Insights

  • Novel Detection Cue: Identified inter-layer transition discrepancy as a model-agnostic feature, providing inherent cross-domain generalization.
  • High Efficiency: Training completes in minutes with only 2 categories and 5 epochs.
  • Superior Speed: Faster inference than FatFormer due to the lightweight dual-branch head and frozen backbone.
  • Physical Prior: Generative models lack structural continuity constraints in intermediate layers, creating a "window" for forensic detection.

Limitations & Future Work

  • Midjourney Acc on GenImage is only 62.97%, showing room for improvement on high-quality commercial models.
  • Heavy reliance on CLIP ViT representations; if CLIP is integrated into the generation process, performance may drop.
  • Gumbel-Softmax reduces to a fixed selection during inference, lacking per-image adaptivity.
  • LTD currently uses only CLS tokens, ignoring potential local discrepancy information in spatial tokens.

Comparison with Related Work

  • vs UnivFD: UnivFD uses only the last layer for linear probing; LTD uses inter-layer differences, improving mean Acc by 11%.
  • vs FatFormer/RINE: These methods fuse all layers, introducing noise; LTD focuses on changes to suppress redundancy.
  • vs NPR/FreqNet: These rely on low-level artifacts (upsampling/spectrum) and fail on diffusion models; LTD utilizes structural consistency effective for both GANs and DMs.
  • Insight: Intermediate representations of large pre-trained models contain rich forensic signals; inter-layer dynamics are more discriminative than single-layer features.

Rating

  • Novelty: ⭐⭐⭐⭐ Discovers a new, insightful detection cue.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive testing on 3 benchmarks and 16+ generators.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and convincing visualizations.
  • Value: ⭐⭐⭐⭐ High practical value due to training efficiency and generalization.