PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
Conference: ICML 2026
arXiv: 2509.26272
Code: https://github.com/tuanrpt/PRPO (Available)
Area: AI Safety / Multimodal VLM / Deepfake Detection / RLHF
Keywords: deepfake detection, GRPO, paragraph-level reward, visual grounding, MLLM reasoning
TL;DR
The authors build DF-R5, a dataset of 115k samples with reasoning annotations, and DX-LLaVA, an architecture that replaces LLaVA's CLIP ViT with CLIP ConvNeXT. On top of these they propose PRPO, a paragraph-level GRPO variant. Each paragraph is rewarded for CLIP text-image similarity (VCR) and for consistency between the conclusion and the majority vote of the reasoning paragraphs (PCR). This raises cross-domain deepfake detection F1 from the previous SOTA of 75.26% to 89.91% and improves the reasoning quality score from 4.20/5 to 4.55/5.
Background & Motivation
Background: Deepfake images generated by diffusion models or GANs are nearly indistinguishable from real images. Binary detectors (CLIP-ViT, ConvNeXT, frequency-domain features) are strong but offer no explanation for their decisions. MLLMs (LLaVA, GPT-4o, Gemini) reason well and should in principle be able to explain why an image is fake, but their actual detection accuracy is poor.
Limitations of Prior Work: (1) Data scarcity: existing deepfake datasets lack high-quality reasoning annotations, and direct distillation from QA pairs only teaches short-answer predictions; (2) Architecture: LLaVA's CLIP ViT captures global semantics, making it insensitive to the local high-frequency textures that matter for deepfakes (hair, pores, background discontinuities); (3) Reasoning quality: MLLMs frequently "conclude before justifying," so conclusions decouple from the visual evidence, or the model hallucinates flaws that do not exist in the image.
Key Challenge: RL algorithms (GRPO/PPO) generally focus on the final label. Deepfake reasoning is a structured text consisting of "paragraph-wise descriptions of multiple fake clues + synthesis of conclusion." Token-level/sequence-level rewards can neither provide inter-paragraph consistency signals nor directly force the model to align each paragraph with image evidence.
Goal: (1) Create a large-scale dataset with high-quality paragraph-style reasoning annotations; (2) Perform supervised fine-tuning with a backbone sensitive to local textures; (3) Design a paragraph-level RL objective that keeps reasoning visually aligned and consistent across paragraphs, allowing self-improvement at test time with label-free rewards.
Key Insight: Starting from the natural structure where reasoning is partitioned into paragraphs, each paragraph is treated as an independent trajectory unit in RL. Each unit is rewarded for its visual consistency (VCR) with the image and semantic consistency (PCR) with the final conclusion, utilizing GRPO's group-relative advantage for weighted learning.
Core Idea: Move GRPO's advantage from the whole response to the paragraph level. Each paragraph's reward combines image-text similarity computed by a frozen CLIP-ConvNeXT with majority-vote consistency across paragraphs, so reasoning paragraphs must describe real evidence in the image and remain self-consistent.
Method
Overall Architecture
Three stages:
- DF-R5 data synthesis: aggregate 200 candidate deepfake features using 4 MLLMs, have Gemini score each image, cluster the scores into \(\leq 7\) semantic groups, and generate 115k paragraph-style reasonings.
- DX-LLaVA SFT: replace LLaVA's CLIP ViT with the CLIP ConvNeXT Stage-3 output (the 10×10 feature map is flattened into 100 tokens), then train a projector, Vicuna, and a binary classification head on top, with loss \(\mathcal L_{\text{lm}}+\alpha\mathcal L_{\text{binary}}\) (\(\alpha=10\)).
- PRPO test-time RL: sample \(L\) complete reasonings per image, segment each into paragraphs \(\{p_1^{(i)},\dots,p_{M_i+1}^{(i)}\}\) (the last paragraph is the final answer), compute \(R(p_j^{(i)})=\tfrac12(R_{\text{VCR}}+R_{\text{PCR}})\) for each paragraph, normalize within the group to obtain the advantage \(A_j^{(i)}\), and update the policy via PPO-clip with KL regularization.
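The segmentation step of the PRPO stage can be illustrated with a minimal sketch. The delimiter is an assumption: the summary does not say how paragraphs are separated, so this example splits on blank lines and treats the last paragraph as the final answer. The function name `split_paragraphs` and the sample text are hypothetical.

```python
import re


def split_paragraphs(response: str):
    """Split one sampled response into reasoning paragraphs and a final answer.

    Assumption (not from the paper): paragraphs are separated by blank lines,
    and the last paragraph carries the final answer.
    """
    parts = [p.strip() for p in re.split(r"\n\s*\n", response) if p.strip()]
    if not parts:
        raise ValueError("empty response")
    return parts[:-1], parts[-1]


# Hypothetical sampled response with M_i = 2 reasoning paragraphs.
sample = (
    "The hair strands blend into the background unnaturally.\n\n"
    "Skin texture looks overly smooth with no visible pores.\n\n"
    "Final answer: fake"
)
reasoning, final_answer = split_paragraphs(sample)
```

Here `reasoning` holds the paragraphs \(p_1,\dots,p_{M_i}\) and `final_answer` plays the role of \(p_{M_i+1}\).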
Key Designs
- Visual Consistency Reward (VCR):
  - Function: Ensures each paragraph of reasoning text actually describes visual evidence present in the image, discouraging hallucinations.
  - Mechanism: Use YAKE unsupervised keyword extraction to extract keywords \(s_j^{(i)}\) from paragraph \(p_j^{(i)}\), feed them into the frozen CLIP-ConvNeXT text encoder, and compute cosine similarity with the image encoder output: \(R_{\text{VCR}}(p_j^{(i)})=\tfrac12[\text{sim}(\text{CLIP}_{\text{txt}}(s_j^{(i)}),\text{CLIP}_{\text{img}}(x))+1]\in[0,1]\).
  - Design Motivation: Feeding the entire paragraph into CLIP would exceed length limits and dilute semantics; computing similarity on extracted keywords fits CLIP's input constraints and focuses the signal on the specific features mentioned. Reusing the ConvNeXT already in the architecture as the reward model avoids an extra external model and its compute.
- Prediction Consistency Reward (PCR):
  - Function: Ensures the final conclusion matches the majority of reasoning paragraphs, alleviating internal contradictions where the evidence points to fake but the conclusion says real.
  - Mechanism: Use predefined vocabularies \(\mathcal F\) (unnatural, inconsistent...), \(\mathcal R\) (authentic, natural...), \(\mathcal N\) (no, not...) to map each paragraph to a paragraph-level label \(\hat y(p_j^{(i)})\). The reward for intermediate paragraphs is a constant 1 (consistent by default), and the reward for the final paragraph is \(\mathbb I[\hat y(p_{M_i+1}^{(i)})=\hat y_{\text{maj}}^{(i)}]\), where \(\hat y_{\text{maj}}^{(i)}\) is the majority vote of all preceding paragraphs.
  - Design Motivation: Deepfake reasoning has no "step-wise gold" labels, so process rewards from mathematical reasoning cannot be borrowed. Using the model's own inter-paragraph consistency as a label-free signal avoids external model and annotation costs, which suits test-time RL.
- Paragraph-level GRPO Loss (PRPO):
  - Function: Moves GRPO from token/sequence granularity to paragraph granularity, so each paragraph obtains its own advantage for weighted updates via PPO-clip.
  - Mechanism: Compute the mean and standard deviation \(\mu_R, \sigma_R\) jointly over all paragraphs within a group \(\mathcal O=\{o^{(1)},\dots,o^{(L)}\}\) and normalize \(A_j^{(i)}=(R(p_j^{(i)})-\mu_R)/(\sigma_R+\epsilon)\); the policy ratio is \(r_j^{(i)}=\pi_\theta(p_j^{(i)}|v,z)/\pi_{\text{old}}(p_j^{(i)}|v,z)\); the clipped objective is \(\mathcal L_{\text{PRPO}}=\mathbb E\sum_{i,j}\min(r_j^{(i)}A_j^{(i)},\text{clip}(r_j^{(i)},1-\epsilon,1+\epsilon)A_j^{(i)})\), where the clipping range \(\epsilon\) is a separate constant from the stabilizer \(\epsilon\) in the normalization; an additional paragraph-level KL term \(\mathcal L_{\text{KL}}\) keeps the policy close to the reference model; the total objective is \(\max_\theta\mathcal J=\mathcal L_{\text{PRPO}}-\beta\mathcal L_{\text{KL}}\) (\(\beta=0.01\)).
  - Design Motivation: Standard GRPO makes every token in a reasoning sequence share the same advantage, so "excellent paragraphs" and "incorrect paragraphs" receive identical rewards/penalties. Paragraph-level rewards attach the signal to the corresponding text, and intra-group normalization prevents reward value drift.
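The two rewards and the group normalization described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the cue lists are tiny hypothetical stand-ins for \(\mathcal F\), \(\mathcal R\), \(\mathcal N\); the negation rule (a negation word directly before a cue flips it) and the tie-break toward "real" are assumptions; the population standard deviation is an assumption; and the VCR values are placeholder numbers rather than real CLIP scores.

```python
import math
from collections import Counter

# Hypothetical, tiny cue lists for illustration; the real vocabularies differ.
FAKE_CUES = {"fake", "unnatural", "inconsistent"}
REAL_CUES = {"real", "authentic", "natural"}
NEGATIONS = {"no", "not"}


def vcr(text_emb, image_emb):
    """Rescale the cosine similarity of two embeddings from [-1, 1] to [0, 1]."""
    dot = sum(a * b for a, b in zip(text_emb, image_emb))
    norms = math.sqrt(sum(a * a for a in text_emb)) * math.sqrt(
        sum(b * b for b in image_emb)
    )
    return 0.5 * (dot / norms + 1.0)


def paragraph_label(paragraph):
    """Label a paragraph by counting cue words; a preceding negation flips a cue."""
    words = [w.strip(".,;:!?").lower() for w in paragraph.split()]
    votes = Counter()
    for prev, word in zip([""] + words, words):
        if word in FAKE_CUES:
            votes["real" if prev in NEGATIONS else "fake"] += 1
        elif word in REAL_CUES:
            votes["fake" if prev in NEGATIONS else "real"] += 1
    # Assumption: ties (including paragraphs with no cue words) default to "real".
    return "fake" if votes["fake"] > votes["real"] else "real"


def pcr(paragraphs):
    """1.0 for each reasoning paragraph; the final one gets 1.0 only if it
    agrees with the majority label of the preceding paragraphs."""
    labels = [paragraph_label(p) for p in paragraphs]
    majority = Counter(labels[:-1]).most_common(1)[0][0]
    return [1.0] * (len(labels) - 1) + [float(labels[-1] == majority)]


def paragraph_rewards(vcr_scores, pcr_scores):
    """R(p) = 0.5 * (R_VCR + R_PCR) for every paragraph of one response."""
    return [0.5 * (v + c) for v, c in zip(vcr_scores, pcr_scores)]


def group_advantages(group_rewards, eps=1e-6):
    """Normalize rewards with one mean/std shared by all paragraphs in the group."""
    flat = [r for rewards in group_rewards for r in rewards]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in flat) / len(flat))
    return [[(r - mu) / (sigma + eps) for r in rewards] for rewards in group_rewards]


# Two hypothetical sampled responses for the same image.
group = [
    ["The hair looks unnatural.", "Lighting is inconsistent.", "Final answer: fake"],
    ["Skin texture looks natural.", "Shadows are not inconsistent.", "Final answer: fake"],
]
toy_vcr = [[0.8, 0.6, 0.7], [0.4, 0.5, 0.7]]  # placeholders for CLIP-based scores
rewards = [paragraph_rewards(v, pcr(p)) for v, p in zip(toy_vcr, group)]
advantages = group_advantages(rewards)
```

In this toy group, the second response argues for "real" in both reasoning paragraphs but concludes "fake", so its final paragraph receives a PCR of 0 and ends up with the lowest advantage in the group, while its reasoning paragraphs are scored separately.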
Loss & Training
SFT stage: \(\mathcal L_{\text{total}}=\mathcal L_{\text{lm}}+\alpha\mathcal L_{\text{binary}}\) (\(\alpha=10\); the binary head is a linear classifier applied after GAP). PRPO stage: \(\mathcal J=\mathcal L_{\text{PRPO}}-\beta\mathcal L_{\text{KL}}\) (\(\beta=0.01\)). Learning rates: SFT \(2\times 10^{-5}\), PRPO \(3\times 10^{-7}\). CLIP ConvNeXT is frozen; only the projector, Vicuna, and the classification head are fine-tuned. Hardware: 8× H200 GPUs, using the verl framework.
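As a rough illustration of the PRPO-stage objective, the sketch below evaluates the clipped paragraph-level surrogate for given ratios and advantages. Several details are assumptions rather than facts from the summary: the clipping range `clip_eps=0.2` is a common PPO default and is not stated here, the sum over paragraphs is averaged, and the KL term is passed in as a precomputed number because its exact paragraph-level form is not given. The paragraph ratio is computed from token log-probabilities, which follows from a paragraph's probability being the product of its token probabilities.

```python
import math


def paragraph_ratio(new_token_logps, old_token_logps):
    """pi_theta(p) / pi_old(p) for one paragraph, from per-token log-probs."""
    return math.exp(sum(new_token_logps) - sum(old_token_logps))


def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-clip surrogate for one paragraph: min(r * A, clip(r) * A)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def prpo_objective(ratios, advantages, kl, beta=0.01, clip_eps=0.2):
    """J = L_PRPO - beta * L_KL, to be maximized.

    Assumptions: the expectation is taken as a plain average over paragraphs,
    and `kl` is a precomputed estimate of the paragraph-level KL term.
    """
    terms = [clipped_term(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms) - beta * kl


# Hypothetical values for three paragraphs of one response.
ratios = [1.5, 1.0, 0.5]
advantages = [1.0, 0.2, -1.0]
objective = prpo_objective(ratios, advantages, kl=0.05)
```

Clipping removes the incentive to push a paragraph's ratio further once it has already moved past the trust region in the direction its advantage favors: the first paragraph above contributes 1.2 rather than 1.5, and the last contributes -0.8 rather than -0.5.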
Key Experimental Results
Main Results
Leave-one-domain-out cross-domain testing on DF-40 (trained on 4 domains, tested on the 5th), F1:
| Method | →DDIM | →PixArt | →SD | →SiT | →StyleGAN | Mean |
|---|---|---|---|---|---|---|
| LLaVA | 49.86 | 65.46 | 26.54 | 15.36 | 57.03 | 42.85 |
| DE-FAKE | 8.83 | 86.45 | 95.80 | 4.55 | 76.50 | 54.43 |
| FakeShield | 31.84 | 88.57 | 92.28 | 33.22 | 98.70 | 68.92 |
| UnivCLIP | 74.85 | 89.31 | 74.81 | 40.01 | 86.46 | 73.09 |
| SIDA | 70.07 | 73.86 | 92.37 | 46.53 | 94.98 | 75.26 |
| DX-LLaVA (Ours, SFT) | 92.34 | 83.11 | 89.35 | 26.46 | 99.13 | 78.08 |
| PRPO (Ours, RL) | 95.88 | 88.10 | 94.99 | 71.26 | 99.32 | 89.91 |
PRPO's mean cross-domain F1 is 14.65 pp above SIDA (75.26 → 89.91), and on the most difficult domain, SiT, it is 24.73 pp higher (46.53 → 71.26).
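The same arithmetic can be applied to the two "Ours" rows to see where the RL stage helps most. The short sketch below recomputes the mean F1 and the per-domain gain of PRPO over DX-LLaVA (SFT), using only the numbers from the table above; the variable names are illustrative.

```python
DOMAINS = ["DDIM", "PixArt", "SD", "SiT", "StyleGAN"]

# Per-domain F1 copied from the main results table.
sft_f1 = [92.34, 83.11, 89.35, 26.46, 99.13]   # DX-LLaVA (SFT)
prpo_f1 = [95.88, 88.10, 94.99, 71.26, 99.32]  # PRPO (RL)

sft_mean = round(sum(sft_f1) / len(sft_f1), 2)
prpo_mean = round(sum(prpo_f1) / len(prpo_f1), 2)

# Gain of the RL stage over SFT, in percentage points, per held-out domain.
gains = {d: round(p - s, 2) for d, s, p in zip(DOMAINS, sft_f1, prpo_f1)}
largest = max(gains, key=gains.get)

print(f"mean F1: SFT {sft_mean}, PRPO {prpo_mean}")
print(f"largest gain: {largest} (+{gains[largest]} pp)")
```

The recomputed means match the table (78.08 and 89.91), and the largest SFT-to-RL gain falls on SiT (26.46 → 71.26, +44.80 pp).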
Ablation Study
| Configuration | F1 / Key Metrics | Description |
|---|---|---|
| LLaVA + \(\mathcal L_{\text{lm}}\) | 35.82 (cross-domain) | High precision, low recall; the model predicts most images as real |
| LLaVA + \(\mathcal L_{\text{lm}}+\alpha\mathcal L_{\text{binary}}\) | 61.66 | Binary head significantly enhances discrimination |
| ConvNeXT backbone (DX-LLaVA) | 78.08 | Local texture advantage |
| + PRPO | 89.91 | Paragraph-level RL further locks reasoning to image |
| Reasoning Quality Score (Gemini judge) | 4.55/5 (PRPO) vs 4.20/5 (Gemini-2.5) | Exceeds the teacher model's score |
Key Findings
- PRPO achieves its largest gain over DX-LLaVA (SFT) on SiT, the most difficult and nearly indistinguishable domain (26.46 → 71.26), suggesting that paragraph-level rewards shift the model from "relying on backbone textures" to "relying on systematic multi-clue reasoning."
- SFT with ConvNeXT alone is insufficient: the RL stage accounts for the jump in mean F1 from 78.08 to 89.91.
- PRPO uses purely label-free rewards (CLIP similarity + majority vote) yet brings significant downstream gains over traditional supervised baselines, supporting the effectiveness of self-consistency and self-alignment at test time.
- The reasoning quality score (4.55/5) exceeds that of Gemini-2.5 (4.20/5), suggesting that structured rewards can improve explanation quality more than simple scaling.
Highlights & Insights
- Pushing reward granularity from tokens to "paragraph = one semantic unit" is a natural extension of the RLHF/GRPO framework for long structured reasoning. The approach could apply to any task whose output is naturally partitioned into sections and must be internally consistent, such as legal documents, medical reports, or code reviews.
- VCR uses an existing frozen CLIP as the reward model, avoiding the cost and instability of training a separate reward model. PCR uses predefined dictionaries + majority votes as prediction consistency signals. The entire reward system is nearly "zero-cost" and extremely friendly to test-time RL.
- RL fine-tuning of an open-source model such as LLaVA to surpass closed-source models like GPT-4o/Gemini-2.5 shows that "appropriate reward structure + RL" can be much more cost-effective than stacking parameters for vertical tasks.
Limitations & Future Work
- Cross-domain evaluation only covered 5 generator domains (DDIM / PixArt / SD / SiT / StyleGAN). Latest models like SD-3 / Flux / Sora and video deepfakes were not covered; generalization capability remains unknown.
- PCR relies on hand-designed keyword dictionaries \(\mathcal F/\mathcal R/\mathcal N\), which may need redesigning for different languages or forgery types. The majority vote rule can still provide high consistency rewards even if all paragraphs are incorrect.
- VCR uses CLIP-ConvNeXT as a judge, essentially treating the detector's own backbone as its reward model. This couples the reward with the training objective and may amplify the backbone's inherent bias (risk of reward hacking).
- Audio deepfakes were not covered, nor was reward robustness under adversarial perturbations discussed.
Related Work & Insights
- vs GRPO (Shao et al. 2024): GRPO normalizes sequence-level rewards within a group and applies the resulting advantage to every token; PRPO refines the granularity to paragraphs, making it better suited for long structured reasoning.
- vs TTRL (Zuo et al. 2026) / self-certainty reward (Zhao et al. 2026): Also label-free, but TTRL uses an overall majority vote as the reward. PRPO further refines this to paragraph \(\times\) visual consistency, providing denser signals.
- vs SIDA / FakeShield: Traditional deepfake detection mainly relies on binary classification + local features; PRPO improves both detection accuracy and interpretability using a "reasoning + reflection" structure.
Rating
- Novelty: ⭐⭐⭐⭐ PRPO elevates reward granularity to paragraphs + uses fully label-free rewards, representing a practical modification to the GRPO family.
- Experimental Thoroughness: ⭐⭐⭐⭐ 5-domain leave-one-out + multiple MLLM baselines + reasoning quality scoring + detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ The three-stage pipeline is explained clearly, and the placement of formulas and algorithms is logical.
- Value: ⭐⭐⭐⭐ Provides a SOTA for "interpretable deepfake detection," a security-critical task, with clear engineering reproducibility.