PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Conference: ICML 2026
arXiv: 2509.26272
Code: https://github.com/tuanrpt/PRPO (available)
Area: AI Safety / Multimodal VLM / Deepfake Detection / RLHF
Keywords: deepfake detection, GRPO, paragraph-level reward, visual grounding, MLLM reasoning

TL;DR

The authors introduce DF-R5, a 115k-sample dataset with reasoning annotations, replace CLIP ViT with ConvNeXT in the DX-LLaVA architecture, and propose PRPO, a paragraph-level GRPO variant. Each paragraph is rewarded based on CLIP text-image similarity (VCR) and on majority-vote consistency between reasoning and conclusion (PCR). This raises average cross-domain deepfake detection F1 from the previous best of 75.26% (SIDA) to 89.91%, and the reasoning quality score from 4.20/5 (Gemini-2.5) to 4.55/5.

Background & Motivation

Background: Deepfake images generated by diffusion models/GANs are nearly indistinguishable from real images. Binary detectors (CLIP-ViT, ConvNeXT, frequency-domain features) are strong but lack interpretability. MLLMs (LLaVA, GPT-4o, Gemini) have strong reasoning abilities and could theoretically provide "why this image is fake" explanations, but their detection accuracy is poor in practice.

Limitations of Prior Work: (1) Data scarcity—existing deepfake datasets lack high-quality reasoning annotations, and direct QA distillation only yields "short answer" predictions; (2) Architectural issues—LLaVA's CLIP ViT captures global semantics but is insensitive to local high-frequency textures (hair, pores, background discontinuities) crucial for deepfake detection; (3) Reasoning quality—MLLMs often "conclude first, justify later," leading to conclusions decoupled from image evidence and even hallucinated flaws.

Key Challenge: RL algorithms (GRPO/PPO) typically reward only the final label, while deepfake reasoning is structured as "multiple clues per paragraph + integrated conclusion." Token-level/sequence-level rewards cannot provide inter-paragraph consistency signals or directly align each paragraph with image evidence.

Goal: (1) Construct a large-scale dataset with high-quality paragraph-level reasoning annotations; (2) Use a backbone sensitive to local textures for supervised fine-tuning; (3) Design a paragraph-level RL method to maintain visual alignment and inter-paragraph consistency during reasoning, enabling continual self-improvement at test time with label-free rewards.

Key Insight: Leverage the natural "paragraph-structured" nature of reasoning, treating each paragraph as an independent trajectory unit in RL, and reward both visual consistency (VCR) and semantic consistency with the final conclusion (PCR) per paragraph, using GRPO's group-relative advantage for weighted learning.

Core Idea: Move GRPO's advantage, which is shared by every token of a sampled output, down to the paragraph level. Each paragraph's reward combines a text-image similarity computed by the frozen CLIP-ConvNeXT with an inter-paragraph majority-vote consistency term, so that reasoning paragraphs both describe real evidence and are mutually coherent.

Method

Overall Architecture

The pipeline has three stages:
(1) DF-R5 Data Synthesis: pool 200 candidate deepfake features using 4 MLLMs → Gemini scores each image → cluster scores into ≤7 semantic groups → generate 115k paragraph-level reasoning samples.
(2) DX-LLaVA Fine-tuning: replace LLaVA's CLIP ViT with the CLIP ConvNeXT Stage-3 output (10×10 pixel-level features flattened to 100 tokens), and jointly train the projector, Vicuna, and a binary classification head with loss \(\mathcal L_{\text{lm}}+\alpha\mathcal L_{\text{binary}}\) (\(\alpha=10\)).
(3) PRPO Test-time RL: for each image, sample \(L\) complete reasoning outputs, split each into paragraphs \(\{p_1^{(i)},\dots,p_{M_i+1}^{(i)}\}\) (the last is the final answer), compute the per-paragraph reward \(R(p_j^{(i)})=\tfrac12(R_{\text{VCR}}+R_{\text{PCR}})\), normalize within the group to get the advantage \(A_j^{(i)}\), and update the policy with PPO-clip plus KL regularization.
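
The paragraph bookkeeping in stage (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: splitting on blank lines and the helper names `split_paragraphs` and `paragraph_reward` are assumptions made here; only the averaging rule \(R=\tfrac12(R_{\text{VCR}}+R_{\text{PCR}})\) comes from the text above.

```python
import re


def split_paragraphs(output: str) -> list[str]:
    """Split one sampled output into paragraphs.

    Assumption: paragraphs are separated by blank lines. The last
    paragraph is treated as the final answer, the rest as reasoning.
    """
    parts = re.split(r"\n\s*\n", output.strip())
    return [p.strip() for p in parts if p.strip()]


def paragraph_reward(r_vcr: float, r_pcr: float) -> float:
    """Per-paragraph reward: the plain average of VCR and PCR."""
    return 0.5 * (r_vcr + r_pcr)


sample = (
    "The hairline shows smeared strands.\n\n"
    "The background tiles repeat in an inconsistent way.\n\n"
    "The image is not authentic."
)
paragraphs = split_paragraphs(sample)
reasoning, final_answer = paragraphs[:-1], paragraphs[-1]
```

With this split, `reasoning` holds the \(M_i\) evidence paragraphs and `final_answer` holds paragraph \(M_i+1\).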

Key Designs

Visual Consistency Reward (VCR):
- Function: Encourages each reasoning paragraph to describe actual visual evidence in the image, discouraging hallucinated flaws.
- Mechanism: Use YAKE unsupervised keyword extraction to obtain \(s_j^{(i)}\) from paragraph \(p_j^{(i)}\), feed into frozen CLIP-ConvNeXT text encoder, compute cosine similarity with image encoder output: \(R_{\text{VCR}}(p_j^{(i)})=\tfrac12[\text{sim}(\text{CLIP}_{\text{txt}}(s_j^{(i)}),\text{CLIP}_{\text{img}}(x))+1]\in[0,1]\).
- Design Motivation: Feeding the entire paragraph to CLIP exceeds length limits and dilutes semantics; extracting keywords with YAKE focuses the signal on "specific features mentioned," fitting CLIP's input constraints. Reusing the same ConvNeXT (already in the architecture) as the reward model avoids external models and extra computation.
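
The VCR formula maps a cosine similarity in \([-1,1]\) to a reward in \([0,1]\). The sketch below shows only that mapping in plain Python. It assumes the two embedding vectors have already been produced elsewhere (by YAKE keyword extraction followed by the frozen CLIP text and image encoders); the function names are hypothetical and no CLIP or YAKE call is made here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vcr_reward(text_emb: list[float], image_emb: list[float]) -> float:
    """R_VCR = (sim + 1) / 2, which lies in [0, 1]."""
    return 0.5 * (cosine_similarity(text_emb, image_emb) + 1.0)
```

Aligned embeddings give a reward of 1, orthogonal ones 0.5, and opposite ones 0, so a paragraph whose keywords are unrelated to the image sits in the middle of the range rather than at the bottom.
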
Prediction Consistency Reward (PCR):
- Function: Ensures the final conclusion is consistent with the majority of reasoning paragraphs, mitigating internal contradictions (e.g., evidence points to fake but conclusion says real).
- Mechanism: Use predefined vocabularies \(\mathcal F\) (unnatural, inconsistent…), \(\mathcal R\) (authentic, natural…), \(\mathcal N\) (no, not…) to map to paragraph-level labels \(\hat y(p_j^{(i)})\); intermediate paragraphs always get reward 1 (assumed consistent with image), final paragraph reward is \(\mathbb I[\hat y(p_{M_i+1}^{(i)})=\hat y_{\text{maj}}^{(i)}]\), where \(\hat y_{\text{maj}}^{(i)}\) is the majority vote of previous paragraphs.
- Design Motivation: In deepfake reasoning, there is no "step-wise gold" as in mathematical reasoning; using model's own inter-paragraph consistency as a label-free signal avoids external models/annotation costs, aligning with test-time RL needs.
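
A minimal sketch of the PCR rule follows. The vocabularies contain only the example words named above (the full lists are not reproduced here), and two details are assumptions of this sketch rather than statements from the source: a cue word directly preceded by a negation flips its label, and a paragraph with no cue words or a tie gets no label.

```python
import re
from collections import Counter

# Abbreviated vocabularies: only the example words quoted in the text.
FAKE_WORDS = {"unnatural", "inconsistent"}
REAL_WORDS = {"authentic", "natural"}
NEGATIONS = {"no", "not"}


def paragraph_label(paragraph: str):
    """Return "fake", "real", or None (no cue words, or a tie)."""
    tokens = re.findall(r"[a-z]+", paragraph.lower())
    votes = Counter()
    for i, tok in enumerate(tokens):
        if tok in FAKE_WORDS:
            label = "fake"
        elif tok in REAL_WORDS:
            label = "real"
        else:
            continue
        # Assumed rule: a negation right before the cue flips it.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            label = "real" if label == "fake" else "fake"
        votes[label] += 1
    if votes["fake"] == votes["real"]:
        return None
    return "fake" if votes["fake"] > votes["real"] else "real"


def pcr_rewards(paragraphs: list[str]) -> list[float]:
    """Reasoning paragraphs get 1; the final one gets 1 only if it
    matches the majority label of the reasoning paragraphs."""
    *reasoning, final = paragraphs
    counts = Counter(
        lab for lab in map(paragraph_label, reasoning) if lab is not None
    )
    # most_common breaks count ties by first occurrence.
    majority = counts.most_common(1)[0][0] if counts else None
    consistent = majority is not None and paragraph_label(final) == majority
    return [1.0] * len(reasoning) + [1.0 if consistent else 0.0]
```

For example, two "fake" paragraphs and one "real" paragraph followed by "The image is not authentic." yield rewards of 1 everywhere, while ending with "The image is authentic." sets the final reward to 0.
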
Paragraph-level GRPO Loss (PRPO):
- Function: Upgrades GRPO from token/sequence granularity to paragraph granularity, allowing each paragraph to independently receive advantage and be updated via PPO-clip.
- Mechanism: For group \(\mathcal O=\{o^{(1)},\dots,o^{(L)}\}\), compute the mean \(\mu_R\) and standard deviation \(\sigma_R\) of the rewards across all paragraphs in the group, then normalize \(A_j^{(i)}=(R(p_j^{(i)})-\mu_R)/(\sigma_R+\epsilon)\), where \(\epsilon\) is a small constant for numerical stability; policy ratio \(r_j^{(i)}=\pi_\theta(p_j^{(i)}|v,z)/\pi_{\text{old}}(p_j^{(i)}|v,z)\); clipped objective \(\mathcal L_{\text{PRPO}}=\mathbb E\sum_{i,j}\min(r_j^{(i)}A_j^{(i)},\text{clip}(r_j^{(i)},1-\epsilon,1+\epsilon)A_j^{(i)})\), where \(\epsilon\) inside the clip denotes the clipping range; add a paragraph-level KL term \(\mathcal L_{\text{KL}}\) to stay close to the reference model; total objective \(\max_\theta\mathcal J=\mathcal L_{\text{PRPO}}-\beta\mathcal L_{\text{KL}}\) (\(\beta=0.01\)).
- Design Motivation: Token-level GRPO shares the same advantage across an entire reasoning chain, causing both "good" and "bad" paragraphs to be rewarded/penalized equally; paragraph-level rewards precisely target the relevant text, and group normalization prevents reward drift.
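
The group normalization and the clipped term can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the population standard deviation, and the default `clip_eps=0.2` is a common PPO choice picked here for the example; the source does not state either detail.

```python
import math


def group_advantages(rewards_per_output, eps=1e-8):
    """Normalize paragraph rewards over ALL paragraphs in the group.

    rewards_per_output[i][j] is R(p_j^(i)); outputs may have
    different numbers of paragraphs.
    """
    flat = [r for output in rewards_per_output for r in output]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in flat) / len(flat))
    return [[(r - mu) / (sigma + eps) for r in output]
            for output in rewards_per_output]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one paragraph."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Two sampled outputs with three paragraphs each.
rewards = [[0.8, 0.9, 1.0], [0.7, 0.6, 0.0]]
advantages = group_advantages(rewards)
```

Because the statistics are pooled over the whole group, a weak paragraph inside an otherwise good output still receives a below-average advantage of its own.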

Loss & Training

Fine-tuning stage: \(\mathcal L_{\text{total}}=\mathcal L_{\text{lm}}+\alpha\mathcal L_{\text{binary}}\) (\(\alpha=10\); the binary head applies global average pooling (GAP) followed by a linear classifier). PRPO stage: \(\mathcal J=\mathcal L_{\text{PRPO}}-\beta\mathcal L_{\text{KL}}\) (\(\beta=0.01\)). Learning rates: \(2\times 10^{-5}\) for fine-tuning, \(3\times 10^{-7}\) for PRPO. CLIP ConvNeXT is frozen; only the projector, Vicuna, and the classification head are fine-tuned. Training uses 8 H200 GPUs and the verl framework.
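
The fine-tuning objective can be illustrated with a toy, dependency-free sketch. The source says only that the binary head is GAP followed by a linear classifier; using binary cross-entropy for \(\mathcal L_{\text{binary}}\), the label convention (1 = fake), and all function names are assumptions of this example.

```python
import math


def global_average_pool(tokens):
    """Average a list of visual tokens (each a list of C floats)."""
    n = len(tokens)
    channels = len(tokens[0])
    return [sum(tok[c] for tok in tokens) / n for c in range(channels)]


def binary_head_loss(tokens, weights, bias, label):
    """GAP -> linear layer -> sigmoid -> binary cross-entropy.

    label is 1.0 for fake and 0.0 for real (assumed convention).
    """
    pooled = global_average_pool(tokens)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def total_loss(lm_loss, binary_loss, alpha=10.0):
    """L_total = L_lm + alpha * L_binary, with alpha = 10 as in the text."""
    return lm_loss + alpha * binary_loss
```

With \(\alpha=10\), a binary loss of 0.1 contributes as much to the total as a language-modeling loss of 1.0, which shows how strongly the classification signal is weighted.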

Key Experimental Results

Main Results

Leave-one-domain-out cross-domain testing on DF-40 (train on 4 domains, test on the 5th). Values are F1 (%); a column →X reports results with X as the held-out test domain:

Method	→DDIM	→PixArt	→SD	→SiT	→StyleGAN	Avg
LLaVA	49.86	65.46	26.54	15.36	57.03	42.85
DE-FAKE	8.83	86.45	95.80	4.55	76.50	54.43
FakeShield	31.84	88.57	92.28	33.22	98.70	68.92
UnivCLIP	74.85	89.31	74.81	40.01	86.46	73.09
SIDA	70.07	73.86	92.37	46.53	94.98	75.26
DX-LLaVA (ours, SFT)	92.34	83.11	89.35	26.46	99.13	78.08
PRPO (ours, RL)	95.88	88.10	94.99	71.26	99.32	89.91

Cross-domain average F1 improves by 14.65 pp over SIDA, with a 24.7 pp jump (46.53→71.26) on the hardest SiT domain.
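
The headline numbers can be rechecked from the table with a short Python script. All values below are copied from the table; nothing is re-measured.

```python
from statistics import mean

# Per-domain F1 (%) in column order: DDIM, PixArt, SD, SiT, StyleGAN.
f1 = {
    "SIDA": [70.07, 73.86, 92.37, 46.53, 94.98],
    "DX-LLaVA (SFT)": [92.34, 83.11, 89.35, 26.46, 99.13],
    "PRPO (RL)": [95.88, 88.10, 94.99, 71.26, 99.32],
}
reported_avg = {"SIDA": 75.26, "DX-LLaVA (SFT)": 78.08, "PRPO (RL)": 89.91}

recomputed_avg = {name: round(mean(vals), 2) for name, vals in f1.items()}
gain_over_sida = round(reported_avg["PRPO (RL)"] - reported_avg["SIDA"], 2)
sit_gain_over_sida = round(f1["PRPO (RL)"][3] - f1["SIDA"][3], 2)
sit_gain_over_sft = round(f1["PRPO (RL)"][3] - f1["DX-LLaVA (SFT)"][3], 2)

for name in f1:
    print(name, "reported:", reported_avg[name], "recomputed:", recomputed_avg[name])
```

The DX-LLaVA and PRPO rows reproduce their listed averages. The SIDA row as listed averages to 75.56 rather than 75.26, so one of its entries or its average may differ slightly from the original paper; the 14.65 pp gain uses the listed 75.26. On SiT, PRPO gains 24.73 pp over SIDA and 44.80 pp over the SFT-only DX-LLaVA.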

Ablation Study

Configuration	F1 / Key Metric	Notes
LLaVA + \(\mathcal L_{\text{lm}}\) (language loss only)	35.82 (inter-domain)	High precision, low recall; model predicts all real
LLaVA + \(\mathcal L_{\text{lm}}+\alpha\mathcal L_{\text{binary}}\)	61.66	Binary head significantly improves discrimination
Replace with ConvNeXT backbone (DX-LLaVA)	78.08	Local texture advantage
+ PRPO	89.91	Paragraph-level RL further tightly couples reasoning and image
Reasoning quality score (Gemini judge)	4.55/5 (PRPO) vs 4.20/5 (Gemini-2.5)	First time surpassing teacher model

Key Findings

PRPO yields the largest gains on the hardest, nearly indistinguishable SiT domain, indicating that paragraph-level rewards shift the model from "texture-based discrimination" to "systematic multi-clue reasoning."
SFT + ConvNeXT alone is insufficient: it reaches 78.08 average F1, and RL is needed to reach 89.91.
PRPO uses purely label-free rewards (CLIP similarity + majority vote), yet it adds 11.83 pp of average F1 over the SFT model (78.08→89.91) and achieves a higher average F1 than every baseline in the main table, demonstrating the effectiveness of continual self-consistency and self-alignment at test time.
Reasoning quality score (4.55) surpasses Gemini-2.5 (4.20) for the first time, showing that structured rewards improve explanation quality more than simple scaling.

Highlights & Insights

Elevating reward granularity from token to "paragraph = semantic unit" is a natural extension of RLHF/GRPO frameworks for long, structured reasoning—this approach is transferable to tasks like legal documents, medical reports, code review, or any "paragraph-structured, internally consistent" task.
VCR uses off-the-shelf frozen CLIP as the reward model, avoiding the cost and instability of training a reward model; PCR uses predefined dictionaries + majority vote for prediction consistency, making the entire reward scheme nearly "zero-cost" and highly suitable for test-time RL.
RL fine-tuning of open-source models like LLaVA outperforms closed-source models like GPT-4o/Gemini-2.5 on this vertical task, suggesting that "appropriate reward structure + RL" can be more cost-effective than simply scaling parameters.

Limitations & Future Work

Cross-domain evaluation covers only 5 generator domains (DDIM / PixArt / SD / SiT / StyleGAN); generalization to newer models (SD-3 / Flux / Sora / video deepfake) is unknown.
PCR relies on manually designed keyword dictionaries \(\mathcal F/\mathcal R/\mathcal N\), which may need redesign for different languages or forgery types; majority vote can still yield high consistency rewards even if all paragraphs are wrong, posing risks.
VCR uses CLIP-ConvNeXT as the judge, essentially using the detector itself as the reward—this coupling may amplify backbone biases (reward hacking risk).
Does not cover video/audio deepfakes, nor discuss reward robustness under adversarial perturbations.

vs GRPO (Shao et al. 2024): GRPO normalizes each sampled output's reward within its group and applies the resulting advantage to every token of that output; PRPO computes a separate advantage per paragraph, which is better suited to long, structured reasoning.
vs TTRL (Zuo et al. 2026) / self-certainty reward (Zhao et al. 2026): Also label-free, but TTRL uses overall majority vote as reward, while PRPO further refines to paragraph × visual consistency, providing denser signals.
vs SIDA / FakeShield: Traditional deepfake detection focuses on binary classification + local features; PRPO's "reasoning + reflection" structure improves both detection accuracy and interpretability.

Rating

Novelty: ⭐⭐⭐⭐ PRPO elevates reward granularity to paragraphs + fully label-free reward, a practical adaptation of the GRPO family
Experimental Thoroughness: ⭐⭐⭐⭐ 5-domain leave-one-out + multiple MLLM baselines + reasoning quality scoring + detailed ablation
Writing Quality: ⭐⭐⭐⭐ Three-stage pipeline is clearly explained, with formulas and algorithms well-placed
Value: ⭐⭐⭐⭐ Achieves SOTA on the safety-critical "explainable deepfake detection" task with clear engineering reproducibility