LEGION: Learning to Ground and Explain for Synthetic Image Detection

Conference: ICCV 2025 · arXiv: 2503.15264 · Code: opendatalab.github.io/LEGION · Area: Image Segmentation · Keywords: Synthetic Image Detection, Artifact Localization, MLLM, Explainability, Image Refinement

TL;DR

This paper proposes the LEGION framework and the SynthScars dataset, leveraging a multimodal large language model (MLLM) to unify artifact detection, pixel-level segmentation, and textual explanation for synthetic image detection. It further extends the detector's role from "Defender" to "Controller," using detection feedback to guide generative models toward higher-quality images.

Background & Motivation

The rapid advancement of generative techniques (GAN → Diffusion → Autoregressive models) has made synthetic images increasingly photorealistic, giving rise to serious risks including privacy violations, copyright disputes, and misinformation. Existing synthetic image detection methods suffer from three major limitations:

Outdated Datasets: Datasets such as OpenForensics primarily contain low-quality or anime-style images generated by early GANs, making it difficult for models to generalize to modern generators such as Stable Diffusion 3.5 and FLUX. RichHF-18K relies solely on point annotations with low spatial precision; SID-Set is designed exclusively for tampered images.

Methodological Limitations: Traditional methods (e.g., PAL4VST) rely on low-level structural cues and struggle to handle artifacts requiring global reasoning, such as violations of physical lighting and shadow laws. Existing MLLM-based methods focus primarily on tampered images, leaving fully synthetic images understudied.

Disconnect Between Detection and Generation: Existing detection methods serve only as "Defenders" and do not explore leveraging detection feedback to improve generation quality.

The core motivation of this paper is to upgrade the detection paradigm from Defender to Controller—not only detecting artifacts but also guiding generative models to eliminate them.

Method

Overall Architecture

LEGION comprises four core components: (i) a global image encoder (ViT-H/14 CLIP), (ii) an LLM (Vicuna-based), (iii) a grounding image encoder (SAM encoder), and (iv) a pixel decoder (a variant of the SAM decoder). The framework supports three tasks: artifact detection (binary classification), artifact localization (pixel-level segmentation), and explanation generation (natural language).
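To make the data flow concrete, here is a minimal PyTorch-style skeleton of how these four components could be wired together; every class and argument name is an illustrative placeholder rather than the authors' released code.

```python
import torch.nn as nn

class LegionSkeleton(nn.Module):
    """Illustrative wiring of LEGION's four components (not the official code)."""

    def __init__(self, clip_encoder, llm, sam_encoder, sam_decoder,
                 vl_proj, lp_proj, det_head):
        super().__init__()
        self.clip_encoder = clip_encoder  # (i) global image encoder (ViT-H/14 CLIP)
        self.llm = llm                    # (ii) Vicuna-based LLM
        self.sam_encoder = sam_encoder    # (iii) grounding image encoder (SAM)
        self.sam_decoder = sam_decoder    # (iv) pixel decoder (SAM-decoder variant)
        self.vl_proj = vl_proj            # vision-language (V-L) projection
        self.lp_proj = lp_proj            # language-pixel (L-P) projection
        self.det_head = det_head          # two-layer MLP on the CLS token

    def forward(self, image, prompt_ids):
        cls_tok, img_toks = self.clip_encoder(image)   # CLS + 256 patch tokens
        y_d = self.det_head(cls_tok)                   # detection: real/fake logits
        text, seg_embeds = self.llm(prompt_ids, self.vl_proj(img_toks))
        masks = self.sam_decoder(self.sam_encoder(image),
                                 self.lp_proj(seg_embeds))  # one mask per <SEG>
        return y_d, text, masks                        # detect / explain / localize
```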

Key Designs

  1. Deepfake Detection: The CLS token from the CLIP global encoder is passed through a two-layer MLP for real/fake binary classification: \(y_d = \text{MLP}(\text{CLS}(I_x))\). This simple design is effective because it leverages the strong feature representations of pre-trained CLIP. (Designs 1–3 are sketched in code after this list.)

  2. Explanation Generation: The 256 image tokens (excluding CLS) from the CLIP global encoder are projected into the LLM input space via a vision-language (V-L) projection layer, and textual explanations are generated conditioned on a forgery analysis prompt: \(y_e = \mathcal{L}(x_p, \mathcal{P}_{vl}(I'_x))\). The prompt template is "The <image> provides an overview of the image." followed by a forgery-analysis-specific instruction.

  3. Artifact Localization: After each artifact description in the LLM output, a <SEG> token is appended. Its embedding is transformed into the decoder feature space via a language-pixel (L-P) projection layer, and the SAM decoder generates binary masks: \(M = \mathcal{D}(\mathcal{E}_l(x_i), \mathcal{P}_{lp}(v_{seg}))\). This achieves language-guided pixel-level artifact segmentation.

  4. Image Refinement Pipeline (sketched in code after this list):

    • Regeneration: LEGION detects artifacts → explanations are recorded in a memory bank → a text modifier revises the prompt → the T2I model regenerates the image.
    • Inpainting: LEGION outputs region-level triplets \((L_i, M_i, E_i)\) (location, mask, explanation) → inpainting is applied region by region, preserving artifact-free areas.
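The three task heads from designs 1–3 could look roughly as follows; the dimensions (e.g., the 1280-wide ViT-H/14 CLS token), helper names, and splicing logic are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Design 1: y_d = MLP(CLS(I_x)). Two-layer MLP on the CLIP CLS token.
    The 1280-dim input matches ViT-H/14 width; the hidden size is illustrative."""
    def __init__(self, in_dim=1280, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, 2))   # real/fake logits

    def forward(self, cls_token):
        return self.mlp(cls_token)

def build_llm_inputs(tokenizer, embed_tokens, vl_proj, img_feats, prompt):
    """Design 2: splice the 256 projected CLIP patch embeddings into the slot
    of the <image> placeholder (LLaVA-style splicing; a sketch only)."""
    pre, post = prompt.split("<image>")
    pre_ids = tokenizer(pre, return_tensors="pt").input_ids
    post_ids = tokenizer(post, add_special_tokens=False,
                         return_tensors="pt").input_ids
    vis = vl_proj(img_feats)                             # (1, 256, llm_dim)
    return torch.cat([embed_tokens(pre_ids), vis,
                      embed_tokens(post_ids)], dim=1)    # pass as inputs_embeds

def decode_artifact_masks(output_ids, hidden_states, seg_token_id,
                          lp_proj, sam_feats, sam_decoder):
    """Design 3: each <SEG> token the LLM emits becomes one binary mask,
    M = D(E_l(x_i), P_lp(v_seg))."""
    seg_pos = (output_ids[0] == seg_token_id).nonzero(as_tuple=True)[0]
    seg_embeds = hidden_states[0, seg_pos]               # states at <SEG> slots
    masks = [sam_decoder(sam_feats, lp_proj(e)) for e in seg_embeds]
    return [m.sigmoid() > 0.5 for m in masks]            # binarize
```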
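And a minimal sketch of the two refinement modes from design 4; `legion`, `modifier`, `t2i`, and `inpainter` are hypothetical callables standing in for the paper's components.

```python
def refine_by_regeneration(image, prompt, legion, modifier, t2i, max_rounds=3):
    """Regeneration mode: explanations accumulate in a memory bank, a text
    modifier revises the prompt, and the T2I model regenerates the image."""
    memory = []                                    # memory bank of explanations
    for _ in range(max_rounds):
        _, triplets = legion(image)                # [(location, mask, expl), ...]
        if not triplets:                           # no artifacts left -> done
            break
        memory.extend(expl for _, _, expl in triplets)
        prompt = modifier(prompt, memory)          # revise prompt from feedback
        image = t2i(prompt)                        # regenerate
    return image

def refine_by_inpainting(image, legion, inpainter):
    """Inpainting mode: repair each (L_i, M_i, E_i) region in place,
    leaving artifact-free areas untouched."""
    _, triplets = legion(image)
    for location, mask, explanation in triplets:
        image = inpainter(image, mask, explanation)  # region-level edit
    return image
```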

Loss & Training

The model is trained in two independent stages:

Stage 1 (Localization + Explanation): \(\mathcal{L}_{s1} = \lambda_{bce}\mathcal{L}_{BCE}(M, \hat{M}) + \lambda_{dice}\mathcal{L}_{Dice}(M, \hat{M}) + \lambda_{ce}\mathcal{L}_{CE}(y_e, \hat{y}_e)\), where \(\lambda_{ce}=1.0\), \(\lambda_{dice}=0.2\), and \(\lambda_{bce}=0.4\).

Stage 2 (Detection): A standard cross-entropy loss is used for classification: \(\mathcal{L}_{s2} = \mathcal{L}_{CE}(y_d, \hat{y}_d)\).
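For concreteness, here is a minimal PyTorch sketch of both stage losses under the weights above; the Dice implementation and the `ignore_index` convention are standard choices, not details taken from the paper.

```python
import torch.nn.functional as F

L_CE, L_DICE, L_BCE = 1.0, 0.2, 0.4                # weights stated in the paper

def dice_loss(logits, target, eps=1.0):
    """Standard soft-Dice over sigmoid probabilities."""
    p = logits.sigmoid().flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def stage1_loss(mask_logits, gt_mask, text_logits, gt_text_ids):
    """L_s1 = λ_bce·BCE + λ_dice·Dice (masks) + λ_ce·CE (explanation tokens)."""
    l_mask = (L_BCE * F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
              + L_DICE * dice_loss(mask_logits, gt_mask))   # gt_mask is float 0/1
    l_text = L_CE * F.cross_entropy(text_logits.flatten(0, 1),
                                    gt_text_ids.flatten(), ignore_index=-100)
    return l_mask + l_text

def stage2_loss(det_logits, labels):
    """L_s2: plain cross-entropy on the real/fake detection head."""
    return F.cross_entropy(det_logits, labels)
```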

The model is fine-tuned from GLaMM pre-trained weights using LoRA (\(\alpha=8\)) on 8×A100 GPUs.
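A sketch of how the LoRA setup could look with HuggingFace `peft`; only \(\alpha=8\) is stated in the paper, so the rank, dropout, and target modules below are illustrative assumptions typical of LISA/GLaMM-style models.

```python
from peft import LoraConfig, get_peft_model

# Only lora_alpha=8 is given in the paper; r, dropout, and target_modules
# are assumed values, not confirmed hyperparameters.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections of the LLM
    task_type="CAUSAL_LM",
)
# llm = get_peft_model(llm, lora_cfg)      # wrap the Vicuna-based LLM
```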

Key Experimental Results

Main Results (Artifact Localization)

| Method | Type | SynthScars F1 | SynthScars mIoU | LOKI F1 | RichHF-18K F1 |
|---|---|---|---|---|---|
| PAL4VST | Traditional expert | 50.46~52.55 | 11.58~21.61 | 49.88 | 14.78 |
| LISA-v1-7B* | MLLM | 31.10~37.56 | 9.29~23.70 | 35.90 | 21.94 |
| InternVL2-8B | MLLM | 41.08~42.03 | 3.91~13.36 | 39.90 | 9.58 |
| LEGION | Ours | 48.66~60.82 | 16.71~39.44 | 50.07 | 17.41 |

On SynthScars, LEGION surpasses the strongest traditional expert (PAL4VST) by +3.31% in mIoU and +7.75% in F1.

Explanation Quality

| Method | Parameters | SynthScars ROUGE-L | SynthScars CSS | LOKI ROUGE-L |
|---|---|---|---|---|
| Qwen2-VL | 72B | 25.84 | 58.15 | 11.80 |
| LLaVA-v1.6 | 7B | 29.61 | 61.75 | 16.07 |
| LEGION | - | Best | Best | Best |

SynthScars Dataset Statistics

| Statistic | Value |
|---|---|
| Total fully synthetic images | 12,236 |
| Image content categories | 4 (Object / Animal / Human / Scene) |
| Artifact categories | 3 (physics / distortion / structure) |
| Annotation completeness | 100% (pixel-level masks + textual explanations + artifact types) |

Key Findings

  • LEGION achieves state-of-the-art performance on the majority of metrics across three benchmarks, with F1 exceeding PAL4VST by 10.65 points on the Object category.
  • General-purpose MLLMs (Ferret, Griffon, Qwen2-VL) exhibit two extremes in artifact localization: either complete failure or marking most of the image as artifacts.
  • The image refinement pipeline operating as a Controller significantly outperforms baselines in Human Preference Score (HPS).
  • LEGION demonstrates strong robustness under various perturbations (compression, noise, blurring).

Highlights & Insights

  • Paradigm Shift from Defender to Controller: This is the first work to systematically leverage artifact detection feedback to guide higher-quality image generation, opening a new research direction.
  • SynthScars Fills a Critical Gap: It is the first benchmark for fully synthetic images that provides pixel-level masks, textual explanations, and artifact type labels simultaneously.
  • Unified Multi-Task Framework: Detection, localization, and explanation are unified within a single MLLM, offering greater efficiency than disjoint approaches.
  • Language-Guided Segmentation via <SEG> Tokens: Natural language descriptions from the LLM are seamlessly coupled with pixel-level segmentation.

Limitations & Future Work

  • The model relies on pre-trained SAM and CLIP; adapting to images from entirely new generators may require updating the foundation models.
  • The two-stage training increases complexity; end-to-end joint training is worth exploring.
  • The image refinement pipeline requires multiple iterations, and efficiency remains to be improved.
  • F1 on RichHF-18K is lower than that of LISA-v1-7B, indicating room for improvement in cross-domain generalization.

Related Work Notes

  • The language-guided segmentation approach is similar to LISA (CVPR 2024) but is specifically tailored to artifact detection.
  • The refinement pipeline shares conceptual similarities with the iterative optimization in Idea2Img but incorporates additional spatial localization information.
  • The annotation pipeline of SynthScars employs Qwen2-VL-72B for quality filtering, which serves as a valuable reference for future dataset construction.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Defender→Controller paradigm + high-quality dataset
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 4 benchmarks + 19 comparison methods + robustness analysis
  • Practical Value: ⭐⭐⭐⭐ — Directly applicable to content moderation and generation quality improvement
  • Overall: ⭐⭐⭐⭐⭐