
Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification

Conference: CVPR 2026 · arXiv: 2602.18842 · Code: To be confirmed
Area: Image Segmentation / AI Forgery Detection
Keywords: AI-generated image detection, manifold deviation, MAE reconstruction, iterative amplification, image forgery localization

TL;DR

This paper proposes IFA-Net, which detects AI-generated forgeries from the perspective of "modeling what is real" rather than "learning what is fake." A frozen MAE reconstructs the input to produce residuals that expose regions deviating from the natural image manifold. A two-stage closed-loop pipeline—coarse detection → task-adaptive prior injection → residual amplification → refinement—iteratively amplifies manifold deviation, achieving state-of-the-art performance on both diffusion inpainting and traditional image tampering detection.

Background & Motivation

With the rapid proliferation of AI image generation technologies such as Stable Diffusion and DALL-E, detecting and localizing AI-generated content (AIGC) forgeries has become critically important. Most existing methods follow the paradigm of "learning what is fake," extracting forgery-specific artifacts (e.g., spectral anomalies, GAN fingerprints) from forged samples. However, such methods suffer from fundamental limitations:

Poor generalization: Detectors trained on specific generators struggle to generalize to unseen generators.

Adversarial fragility: Forgers can bypass artifact-based detection by simply fine-tuning the generation process.

Data dependency: Large amounts of annotated real-fake paired data are required.

Core shift: Rather than learning "what fake images look like," if we precisely model "what real images should look like," any region deviating from the real image manifold becomes suspicious. This approach inherently generalizes across generators, as it models the statistical regularities of natural images rather than the artifacts of specific forgery methods.

Pre-trained MAEs (Masked Autoencoders), having learned powerful natural image manifold priors from massive real image datasets, produce large reconstruction residuals for forged regions (which deviate from the manifold) while accurately reconstructing authentic regions (which lie on the manifold). The residual map thus serves as a natural "spotlight" for forged regions.
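The "residual as spotlight" idea can be illustrated with a toy NumPy sketch. Since no model is involved here, the MAE reconstruction is faked: small noise on authentic pixels (on-manifold, well reconstructed) and a large offset inside a hypothetical forged region (off-manifold, poorly reconstructed).

```python
import numpy as np

# Toy stand-ins: an image patch and a fake "MAE reconstruction" of it.
rng = np.random.default_rng(0)
image = rng.random((64, 64))

reconstruction = image + rng.normal(0.0, 0.01, image.shape)  # authentic: tiny error
reconstruction[20:40, 20:40] += 0.5                          # "forged" block: large deviation

residual = np.abs(image - reconstruction)  # R = |I - Î|

# The residual acts as a spotlight: forged pixels carry far larger values.
forged_mean = residual[20:40, 20:40].mean()
authentic_mean = residual[:20, :].mean()
print(forged_mean > 10 * authentic_mean)  # True
```

In the real pipeline the separation is much less clean (high-quality inpainting shrinks the residual), which is exactly what motivates the amplification stage described below.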

Method

Overall Architecture

IFA-Net adopts a two-stage closed-loop architecture:

  • Stage 1: Frozen MAE reconstruction → residual map → DSSN dual-stream segmentation → coarse mask \(M_{\text{crs}}\)
  • Stage 2: Coarse mask injected into the MAE via TAPI → amplified residual → shared DSSN refinement → final mask \(M_{\text{ref}}\)
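The closed loop can be sketched in a few lines. Every module name below (`mae_frozen`, `mae_tapi`, `dssn`, `prompt_encoder`) is a placeholder of my own choosing, since no official code has been released; the smoke run wires in identity stand-ins just to exercise the data flow.

```python
import numpy as np

def ifa_net_forward(image, mae_frozen, mae_tapi, dssn, prompt_encoder):
    """Sketch of IFA-Net's two-stage closed loop (signatures are assumptions)."""
    # Stage 1: frozen MAE reconstruction -> residual map -> coarse mask
    residual = np.abs(image - mae_frozen(image))
    m_crs = dssn(image, residual)                 # content + artifact streams

    # Stage 2: TAPI injects the coarse mask to amplify forged-region residuals
    context = prompt_encoder(m_crs)               # global context vector
    residual_amp = np.abs(image - mae_tapi(image, context))
    m_ref = dssn(image, residual_amp)             # DSSN weights shared with Stage 1
    return m_crs, m_ref

# Smoke run with identity stand-ins for every module.
img = np.random.default_rng(0).random((64, 64))
m_crs, m_ref = ifa_net_forward(
    img,
    mae_frozen=lambda x: x,
    mae_tapi=lambda x, ctx: x,
    dssn=lambda content, artifact: artifact,
    prompt_encoder=lambda m: m.mean(),
)
```

With identity modules the residuals are all zero, so both masks come out empty; the sketch only shows how the two stages chain and where the coarse mask feeds back into the MAE.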

Key Designs

  1. Stage 1 — MAE Residual-Based Coarse Detection:

    • Frozen MAE reconstruction: The potentially tampered image \(I\) is reconstructed as \(\hat{I}\) through a frozen MAE encoder-decoder.
    • Residual map computation: \(R = |I - \hat{I}|\); authentic regions yield small residuals (accurate MAE reconstruction), while forged regions yield large residuals (manifold deviation).
    • DSSN (Dual-Stream Segmentation Network):
      • Content Stream: encodes semantic content of the original image (SegFormer backbone).
      • Artifact Stream: encodes forgery cues from the residual map.
      • Cross-Attention Fusion: the two streams exchange information via cross-attention; the content stream provides "where to look" and the artifact stream provides "what anomaly is observed."
    • Output: coarse mask \(M_{\text{crs}}\).
  2. Stage 2 — TAPI (Task-Adaptive Prior Injection) Iterative Amplification:

    • Motivation: Stage 1 residuals may be insufficiently prominent (higher generation quality yields weaker residuals), necessitating amplification.
    • Prompt Encoder: encodes the coarse mask \(M_{\text{crs}}\) into a global context vector via convolutional downsampling and linear projection.
    • FiLM Modulation: the global context modulates intermediate features of the frozen MAE encoder via Feature-wise Linear Modulation: \(\tilde{Z} = \gamma \odot Z + \beta\), where \(\gamma\) and \(\beta\) are generated from the context vector produced by the Prompt Encoder, and \(Z\) denotes intermediate features of the frozen MAE encoder.
    • Core effect: TAPI directs the MAE to "focus on these regions," enabling it to allocate more reconstruction capacity to suspected areas and produce larger residual deviations.
    • Trainable MAE Decoder: the MAE decoder in Stage 2 is trainable (unlike the frozen decoder in Stage 1), further amplifying reconstruction error in forged regions.
    • The amplified residual \(R_{\text{amp}} = |I - \hat{I}_{\text{amp}}|\) is fed into the shared DSSN to obtain the refined mask \(M_{\text{ref}}\).
  3. DSSN Architecture Details:

    • Based on the SegFormer architecture with a dual-stream design.
    • Content Stream input: original image \(I\).
    • Artifact Stream input: residual map \(R\) (Stage 1) or amplified residual map \(R_{\text{amp}}\) (Stage 2).
    • Cross-attention fusion modules are applied after each SegFormer stage.
    • DSSN weights are shared across both stages (parameter-efficient, and Stage 1 gradients also benefit Stage 2 learning).
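The FiLM step at the heart of TAPI is simple to write down. All dimensions and the near-identity initialization below are illustrative assumptions (the summary does not give the exact configuration), but the broadcast structure matches \(\tilde{Z} = \gamma \odot Z + \beta\).

```python
import numpy as np

# FiLM modulation: Z~ = gamma * Z + beta, with gamma/beta derived from the
# Prompt Encoder's global context. Dimensions are illustrative assumptions.
rng = np.random.default_rng(0)

d_ctx, d_feat, n_tokens = 16, 32, 49   # context dim, feature dim, MAE tokens

context = rng.standard_normal(d_ctx)                   # from the Prompt Encoder
W_gamma = 0.01 * rng.standard_normal((d_ctx, d_feat))  # linear heads for gamma/beta
W_beta = 0.01 * rng.standard_normal((d_ctx, d_feat))

gamma = 1.0 + context @ W_gamma   # near-identity at init, so the frozen
beta = context @ W_beta           # encoder's manifold prior is preserved

Z = rng.standard_normal((n_tokens, d_feat))   # frozen-encoder intermediate features
Z_mod = gamma * Z + beta                      # per-channel scale/shift, broadcast over tokens
```

Because only `W_gamma`/`W_beta` (and the Prompt Encoder) are trained, the frozen encoder's weights, and hence its natural-image prior, stay untouched.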

Loss & Training

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{ref}} + 0.5 \cdot \mathcal{L}_{\text{crs}}\]

Each stage loss comprises: \(\mathcal{L}_{\text{stage}} = \mathcal{L}_{\text{BCE}} + \mathcal{L}_{\text{Dice}}\)

  • BCE loss handles pixel-level classification.
  • Dice loss addresses class imbalance (forged regions are typically far smaller than authentic regions).
  • The refined mask \(M_{\text{ref}}\) is weighted at 1.0 and the coarse mask \(M_{\text{crs}}\) at 0.5, guiding the network to prioritize optimization of the final output.
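A minimal NumPy sketch of this objective, assuming the network outputs sigmoid probabilities; the exact reductions and smoothing constant are my assumptions, as the summary does not specify them.

```python
import numpy as np

def bce_dice_loss(pred, target, eps=1e-6):
    """Per-stage loss L_stage = L_BCE + L_Dice.

    `pred`: predicted probabilities in (0, 1); `target`: binary ground-truth mask.
    """
    pred = np.clip(pred, eps, 1 - eps)
    # Pixel-level binary cross-entropy.
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    # Soft Dice loss, robust to the small-forged-region class imbalance.
    inter = np.sum(pred * target)
    dice = 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return bce + dice

def total_loss(pred_ref, pred_crs, target):
    # L_total = L_ref + 0.5 * L_crs: the refined mask dominates optimization.
    return bce_dice_loss(pred_ref, target) + 0.5 * bce_dice_loss(pred_crs, target)

# A near-perfect refined prediction yields a small total loss.
target = np.array([[0.0, 1.0], [1.0, 0.0]])
pred = np.clip(target, 0.05, 0.95)
loss = total_loss(pred, pred, target)
```

The Dice term matters here because forged regions are often a few percent of the image, so a pure BCE loss would be dominated by easy authentic pixels.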

Key Experimental Results

Main Results — Diffusion Inpainting Detection

Average results across four diffusion inpainting benchmarks:

| Method | IoU (%) | F1 (%) |
| --- | --- | --- |
| MVSS-Net | 41.2 | 52.7 |
| ObjectFormer | 43.8 | 55.1 |
| SAFIRE | 47.3 | 59.6 |
| UnionFormer | 49.1 | 61.3 |
| IFA-Net (Ours) | 55.6 (+6.5) | 69.4 (+8.1) |

Key Findings:

  • IFA-Net outperforms the best baseline by an average of +6.5% in IoU and +8.1% in F1.
  • The most significant gains are observed on Stable Diffusion v2 inpainting, indicating that the manifold deviation approach is more effective against higher-quality generation.

Generalization — Traditional Image Tampering Detection

Results on CASIA, Columbia, NIST, and other traditional copy-move/splicing datasets:

| Method | CASIA F1 | Columbia F1 | NIST F1 |
| --- | --- | --- | --- |
| ManTra-Net | 48.2 | 72.5 | 35.8 |
| SPAN | 52.1 | 76.3 | 39.2 |
| IFA-Net | 56.8 | 79.1 | 43.7 |

Key Finding: IFA-Net surpasses dedicated tampering detection methods in a zero-shot setting without any training on traditional tampering data, validating the generalization advantage of the "model real rather than learn fake" paradigm.

Ablation Study

| Configuration | MAE Residual | TAPI Amplification | Dual-Stream DSSN | IoU (%) |
| --- | --- | --- | --- | --- |
| Content stream only | | | | 38.5 |
| + MAE residual | ✓ | | | 46.2 |
| + Dual-stream fusion | ✓ | | ✓ | 50.8 |
| + TAPI (full model) | ✓ | ✓ | ✓ | 55.6 |
  • Introducing MAE residuals yields +7.7% IoU, confirming the effectiveness of the manifold deviation signal.
  • Dual-stream DSSN contributes an additional +4.6%, demonstrating complementarity between content and artifact information.
  • TAPI iterative amplification adds a further +4.8%, establishing the critical role of the residual amplification mechanism.

Highlights & Insights

  • Paradigm shift: Moving from "learning fake" to "modeling real," the approach leverages pre-trained MAE manifold priors to achieve natural cross-generator generalization.
  • Closed-loop amplification design: The pipeline of coarse mask → MAE injection → residual amplification → refined mask forms an elegant "detect → focus → amplify → refine" closed loop.
  • Frozen encoder + modulation: The MAE encoder remains frozen to preserve manifold priors, with task information injected solely through FiLM modulation, achieving parameter efficiency.
  • Zero-shot generalization: Training on diffusion inpainting and zero-shot transfer to traditional copy-move/splicing demonstrates that manifold deviation serves as a unified forgery indicator.

Limitations & Future Work

  • The MAE's reconstruction capacity is limited; residuals for very small forged regions (<32×32 pixels) may not be sufficiently prominent.
  • Two-stage sequential inference increases latency; efficiency optimization is needed for real-time video forgery detection.
  • TAPI performs only one iteration (Stage 1 → Stage 2); whether multiple iterations could yield further improvements remains unexplored.
  • Detection capability for fully AI-generated images (as opposed to local inpainting) has not been thoroughly validated.
  • Shared DSSN weights may introduce optimization conflicts between the two stages.
Comparisons & Connections

  • Compared to ObjectFormer (which learns object-level artifacts): IFA-Net does not learn specific artifacts but instead models manifold deviation.
  • The MAE reconstruction residual approach shares theoretical grounding with anomaly detection methods (e.g., PatchCore)—both follow the paradigm of "learn normal → identify anomaly."
  • The FiLM modulation in TAPI likely draws inspiration from the prompt encoder in SAM (Segment Anything Model).
  • Insight: The manifold deviation amplification paradigm is extensible to deepfake video detection (temporal manifold deviation) and AI-generated text detection.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The paradigm shift of "modeling real rather than learning fake" combined with closed-loop residual amplification is original within the forgery detection domain.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers diffusion inpainting, traditional tampering, and comprehensive ablations, but lacks evaluation on deepfake face scenarios.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; the concept of "manifold deviation" is intuitively presented.
  • Value: ⭐⭐⭐⭐⭐ Cross-generator generalization makes the method practically deployable; the paradigm is broadly extensible.