LEGION: Learning to Ground and Explain for Synthetic Image Detection

Conference: ICCV 2025 · arXiv: 2503.15264 · Code: opendatalab.github.io/LEGION · Area: Image Segmentation · Keywords: Synthetic Image Detection, Artifact Localization, MLLM, Explainability, Image Refinement

TL;DR

This paper proposes the LEGION framework and the SynthScars dataset, leveraging a multimodal large language model (MLLM) to unify artifact detection, pixel-level segmentation, and textual explanation for synthetic image detection. It further extends the detector's role from "Defender" to "Controller," using detection feedback to guide generative models toward higher-quality images.

Background & Motivation

The rapid advancement of generative techniques (GAN → Diffusion → Autoregressive models) has made synthetic images increasingly photorealistic, giving rise to serious risks including privacy violations, copyright disputes, and misinformation. Existing synthetic image detection methods suffer from three major limitations:

Outdated Datasets: Datasets such as OpenForensics primarily contain low-quality or anime-style images generated by early GANs, making it difficult for models to generalize to modern generators such as Stable Diffusion 3.5 and FLUX. RichHF-18K relies solely on point annotations with low spatial precision; SID-Set is designed exclusively for tampered images.

Methodological Limitations: Traditional methods (e.g., PAL4VST) rely on low-level structural cues and struggle to handle artifacts requiring global reasoning, such as violations of physical lighting and shadow laws. Existing MLLM-based methods focus primarily on tampered images, leaving fully synthetic images understudied.

Disconnect Between Detection and Generation: Existing detection methods serve only as "Defenders" and do not explore leveraging detection feedback to improve generation quality.

The core motivation of this paper is to upgrade the detection paradigm from Defender to Controller—not only detecting artifacts but also guiding generative models to eliminate them.

Method

Overall Architecture

LEGION comprises four core components: (i) a global image encoder (ViT-H/14 CLIP), (ii) an LLM (Vicuna-based), (iii) a grounding image encoder (SAM encoder), and (iv) a pixel decoder (a variant of the SAM decoder). The framework supports three tasks: artifact detection (binary classification), artifact localization (pixel-level segmentation), and explanation generation (natural language).
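To make the data flow concrete, here is a minimal PyTorch-style skeleton of how these four components could be wired together; every class and argument name is an illustrative placeholder rather than the authors' released code.

```python
import torch.nn as nn

class LegionSkeleton(nn.Module):
    """Illustrative wiring of LEGION's four components (not the official code)."""

    def __init__(self, clip_encoder, llm, sam_encoder, sam_decoder,
                 vl_proj, lp_proj, det_head):
        super().__init__()
        self.clip_encoder = clip_encoder  # (i) global image encoder (ViT-H/14 CLIP)
        self.llm = llm                    # (ii) Vicuna-based LLM
        self.sam_encoder = sam_encoder    # (iii) grounding image encoder (SAM)
        self.sam_decoder = sam_decoder    # (iv) pixel decoder (SAM-decoder variant)
        self.vl_proj = vl_proj            # vision-language (V-L) projection
        self.lp_proj = lp_proj            # language-pixel (L-P) projection
        self.det_head = det_head          # two-layer MLP on the CLS token

    def forward(self, image, prompt_ids):
        cls_tok, img_toks = self.clip_encoder(image)   # CLS + 256 patch tokens
        y_d = self.det_head(cls_tok)                   # detection: real/fake logits
        text, seg_embeds = self.llm(prompt_ids, self.vl_proj(img_toks))
        masks = self.sam_decoder(self.sam_encoder(image),
                                 self.lp_proj(seg_embeds))  # one mask per <SEG>
        return y_d, text, masks                        # detect / explain / localize
```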

Key Designs

  1. Deepfake Detection: The CLS token from the CLIP global encoder is passed through a two-layer MLP for real/fake binary classification: \(y_d = \text{MLP}(\text{CLS}(I_x))\). This simple design is effective because it leverages the strong feature representations of pre-trained CLIP. (Designs 1–3 are sketched in code after this list.)

  2. Explanation Generation: The 256 image tokens (excluding CLS) from the CLIP global encoder are projected into the LLM input space via a vision-language (V-L) projection layer, and textual explanations are generated conditioned on a forgery analysis prompt: \(y_e = \mathcal{L}(x_p, \mathcal{P}_{vl}(I'_x))\). The prompt template is "The <image> provides an overview of the image." followed by a forgery-analysis-specific instruction.

  3. Artifact Localization: After each artifact description in the LLM output, a <SEG> token is appended. Its embedding is transformed into the decoder feature space via a language-pixel (L-P) projection layer, and the SAM decoder generates binary masks: \(M = \mathcal{D}(\mathcal{E}_l(x_i), \mathcal{P}_{lp}(v_{seg}))\). This achieves language-guided pixel-level artifact segmentation.

  4. Image Refinement Pipeline (sketched in code after this list):

    • Regeneration: LEGION detects artifacts → explanations are recorded in a memory bank → a text modifier revises the prompt → the T2I model regenerates the image.
    • Inpainting: LEGION outputs region-level triplets \((L_i, M_i, E_i)\) (location, mask, explanation) → inpainting is applied region by region, preserving artifact-free areas.
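The three task heads from designs 1–3 could look roughly as follows; the dimensions (e.g., the 1280-wide ViT-H/14 CLS token), helper names, and splicing logic are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Design 1: y_d = MLP(CLS(I_x)). Two-layer MLP on the CLIP CLS token.
    The 1280-dim input matches ViT-H/14 width; the hidden size is illustrative."""
    def __init__(self, in_dim=1280, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, 2))   # real/fake logits

    def forward(self, cls_token):
        return self.mlp(cls_token)

def build_llm_inputs(tokenizer, embed_tokens, vl_proj, img_feats, prompt):
    """Design 2: splice the 256 projected CLIP patch embeddings into the slot
    of the <image> placeholder (LLaVA-style splicing; a sketch only)."""
    pre, post = prompt.split("<image>")
    pre_ids = tokenizer(pre, return_tensors="pt").input_ids
    post_ids = tokenizer(post, add_special_tokens=False,
                         return_tensors="pt").input_ids
    vis = vl_proj(img_feats)                             # (1, 256, llm_dim)
    return torch.cat([embed_tokens(pre_ids), vis,
                      embed_tokens(post_ids)], dim=1)    # pass as inputs_embeds

def decode_artifact_masks(output_ids, hidden_states, seg_token_id,
                          lp_proj, sam_feats, sam_decoder):
    """Design 3: each <SEG> token the LLM emits becomes one binary mask,
    M = D(E_l(x_i), P_lp(v_seg))."""
    seg_pos = (output_ids[0] == seg_token_id).nonzero(as_tuple=True)[0]
    seg_embeds = hidden_states[0, seg_pos]               # states at <SEG> slots
    masks = [sam_decoder(sam_feats, lp_proj(e)) for e in seg_embeds]
    return [m.sigmoid() > 0.5 for m in masks]            # binarize
```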
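And a minimal sketch of the two refinement modes from design 4; `legion`, `modifier`, `t2i`, and `inpainter` are hypothetical callables standing in for the paper's components.

```python
def refine_by_regeneration(image, prompt, legion, modifier, t2i, max_rounds=3):
    """Regeneration mode: explanations accumulate in a memory bank, a text
    modifier revises the prompt, and the T2I model regenerates the image."""
    memory = []                                    # memory bank of explanations
    for _ in range(max_rounds):
        _, triplets = legion(image)                # [(location, mask, expl), ...]
        if not triplets:                           # no artifacts left -> done
            break
        memory.extend(expl for _, _, expl in triplets)
        prompt = modifier(prompt, memory)          # revise prompt from feedback
        image = t2i(prompt)                        # regenerate
    return image

def refine_by_inpainting(image, legion, inpainter):
    """Inpainting mode: repair each (L_i, M_i, E_i) region in place,
    leaving artifact-free areas untouched."""
    _, triplets = legion(image)
    for location, mask, explanation in triplets:
        image = inpainter(image, mask, explanation)  # region-level edit
    return image
```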

Loss & Training

The model is trained in two independent stages:

Stage 1 (Localization + Explanation): \(\mathcal{L}_{s1} = \lambda_{bce}\mathcal{L}_{BCE}(M, \hat{M}) + \lambda_{dice}\mathcal{L}_{Dice}(M, \hat{M}) + \lambda_{ce}\mathcal{L}_{CE}(y_e, \hat{y}_e)\), where \(\lambda_{ce}=1.0\), \(\lambda_{dice}=0.2\), and \(\lambda_{bce}=0.4\).

Stage 2 (Detection): A standard cross-entropy loss is used for classification: \(\mathcal{L}_{s2} = \mathcal{L}_{CE}(y_d, \hat{y}_d)\).
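For concreteness, here is a minimal PyTorch sketch of both stage losses under the weights above; the Dice implementation and the `ignore_index` convention are standard choices, not details taken from the paper.

```python
import torch.nn.functional as F

L_CE, L_DICE, L_BCE = 1.0, 0.2, 0.4                # weights stated in the paper

def dice_loss(logits, target, eps=1.0):
    """Standard soft-Dice over sigmoid probabilities."""
    p = logits.sigmoid().flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def stage1_loss(mask_logits, gt_mask, text_logits, gt_text_ids):
    """L_s1 = λ_bce·BCE + λ_dice·Dice (masks) + λ_ce·CE (explanation tokens)."""
    l_mask = (L_BCE * F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
              + L_DICE * dice_loss(mask_logits, gt_mask))   # gt_mask is float 0/1
    l_text = L_CE * F.cross_entropy(text_logits.flatten(0, 1),
                                    gt_text_ids.flatten(), ignore_index=-100)
    return l_mask + l_text

def stage2_loss(det_logits, labels):
    """L_s2: plain cross-entropy on the real/fake detection head."""
    return F.cross_entropy(det_logits, labels)
```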

The model is fine-tuned from GLaMM pre-trained weights using LoRA (\(\alpha=8\)) on 8×A100 GPUs.
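A sketch of how the LoRA setup could look with HuggingFace `peft`; only \(\alpha=8\) is stated in the paper, so the rank, dropout, and target modules below are illustrative assumptions typical of LISA/GLaMM-style models.

```python
from peft import LoraConfig, get_peft_model

# Only lora_alpha=8 is given in the paper; r, dropout, and target_modules
# are assumed values, not confirmed hyperparameters.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections of the LLM
    task_type="CAUSAL_LM",
)
# llm = get_peft_model(llm, lora_cfg)      # wrap the Vicuna-based LLM
```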

Key Experimental Results

Main Results (Artifact Localization)

| Method | Type | SynthScars F1 | SynthScars mIoU | LOKI F1 | RichHF-18K F1 |
|---|---|---|---|---|---|
| PAL4VST | Traditional expert | 50.46~52.55 | 11.58~21.61 | 49.88 | 14.78 |
| LISA-v1-7B* | MLLM | 31.10~37.56 | 9.29~23.70 | 35.90 | 21.94 |
| InternVL2-8B | MLLM | 41.08~42.03 | 3.91~13.36 | 39.90 | 9.58 |
| LEGION | Ours | 48.66~60.82 | 16.71~39.44 | 50.07 | 17.41 |

On SynthScars, LEGION surpasses the strongest traditional expert (PAL4VST) by +3.31% in mIoU and +7.75% in F1.

Explanation Quality

| Method | Parameters | SynthScars ROUGE-L | SynthScars CSS | LOKI ROUGE-L |
|---|---|---|---|---|
| Qwen2-VL | 72B | 25.84 | 58.15 | 11.80 |
| LLaVA-v1.6 | 7B | 29.61 | 61.75 | 16.07 |
| LEGION | - | Best | Best | Best |

SynthScars Dataset Statistics

| Statistic | Value |
|---|---|
| Total fully synthetic images | 12,236 |
| Image content categories | 4 (Object / Animal / Human / Scene) |
| Artifact categories | 3 (physics / distortion / structure) |
| Annotation completeness | 100% (pixel-level masks + textual explanations + artifact types) |

Key Findings

  • LEGION achieves state-of-the-art performance on the majority of metrics across three benchmarks, with F1 exceeding PAL4VST by 10.65 points on the Object category.
  • General-purpose MLLMs (Ferret, Griffon, Qwen2-VL) exhibit two extremes in artifact localization: either complete failure or marking most of the image as artifacts.
  • The image refinement pipeline operating as a Controller significantly outperforms baselines in Human Preference Score (HPS).
  • LEGION demonstrates strong robustness under various perturbations (compression, noise, blurring).

Highlights & Insights

  • Paradigm Shift from Defender to Controller: This is the first work to systematically leverage artifact detection feedback to guide higher-quality image generation, opening a new research direction.
  • SynthScars Fills a Critical Gap: It is the first benchmark for fully synthetic images that provides pixel-level masks, textual explanations, and artifact type labels simultaneously.
  • Unified Multi-Task Framework: Detection, localization, and explanation are unified within a single MLLM, offering greater efficiency than disjoint approaches.
  • Language-Guided Segmentation via <SEG> Tokens: Natural language descriptions from the LLM are seamlessly coupled with pixel-level segmentation.

Limitations & Future Work

  • The model relies on pre-trained SAM and CLIP; adapting to images from entirely new generators may require updating the foundation models.
  • The two-stage training increases complexity; end-to-end joint training is worth exploring.
  • The image refinement pipeline requires multiple iterations, and efficiency remains to be improved.
  • F1 on RichHF-18K is lower than that of LISA-v1-7B, indicating room for improvement in cross-domain generalization.

Related Work Notes

  • The language-guided segmentation approach is similar to LISA (CVPR 2024) but is specifically tailored to artifact detection.
  • The refinement pipeline shares conceptual similarities with the iterative optimization in Idea2Img but incorporates additional spatial localization information.
  • The annotation pipeline of SynthScars employs Qwen2-VL-72B for quality filtering, which serves as a valuable reference for future dataset construction.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Defender→Controller paradigm + high-quality dataset
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 4 benchmarks + 19 comparison methods + robustness analysis
  • Practical Value: ⭐⭐⭐⭐ — Directly applicable to content moderation and generation quality improvement
  • Overall: ⭐⭐⭐⭐⭐