ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection¶
Conference: ICLR 2026 arXiv: 2510.08630 Code: GitHub Area: Interpretability Keywords: hate speech detection, multimodal, GRPO, curriculum learning, conditional decision entropy, interpretability
TL;DR¶
ExPO-HM is a training recipe for hateful meme detection inspired by the pipeline used to train human content moderators. By combining policy-manual SFT warm-up, GRPO curriculum learning, and a Conditional Decision Entropy (CDE) reward, it becomes the first Explain-then-Detect system to comprehensively surpass direct detection baselines across binary classification, fine-grained classification, and reasoning quality, with F1 gains of up to 15-17%.
Background & Motivation¶
Hateful meme detection is a highly challenging online content moderation task. Existing approaches fall into two main paradigms:
Direct Detection: Outputs only a binary label (hateful/benign). Representative methods such as RA-HMD build on CLIP and achieve strong accuracy, but they provide no explanations, which falls short of real-world moderation requirements.
Explain-then-Detect: Generates natural language explanations before classification. However, existing systems of this type (e.g., LOREHM, U-CoT+), which use CoT prompting or agent frameworks, perform worse than simple SFT baselines. Even post-training methods such as GRPO fail to close this gap.
The authors identify two key issues:
Model explanations omit critical cues: the model fails to surface policy-relevant information, such as attack targets and attack types, as explanatory hypotheses.
Binary reward signals are insufficient to guide reasoning: Just as human annotators cannot learn effectively from yes/no labels alone, models also require more fine-grained feedback.
The central analogy is the human moderator training pipeline—first studying a detailed policy manual, then practicing from fine-grained category judgments to binary decisions—which motivates the design of ExPO-HM.
Method¶
Overall Architecture¶
ExPO-HM consists of three stages that simulate the human moderator training process:
- SFT-PM Warm-up: Policy manual-enhanced supervised fine-tuning, teaching the model to understand moderation policies.
- GRPO-CL: GRPO with curriculum learning, progressing from fine-grained to binary classification.
- CDE Reward: Conditional Decision Entropy as a proxy reward for reasoning quality.
Key Designs¶
SFT Policy Manual Warm-up (SFT-PM): Fine-grained dataset labels are converted into structured policy manuals used as input prompts, with each policy item accompanied by descriptions from the annotation guidelines. The LMM is trained on these policy-manual-augmented inputs, with fine-grained labels as the target responses. Human-written gold-standard explanations are deliberately not used, since training on them is off-policy and degrades performance.
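To make the SFT-PM input concrete, here is a minimal sketch of how fine-grained labels might be rendered into a policy-manual prompt; the policy items, descriptions, and function name are illustrative placeholders, not the paper's actual manual or code.

```python
# Sketch of policy-manual prompt construction for SFT-PM.
# The policy items and descriptions below are hypothetical examples,
# not the paper's actual annotation guidelines.

POLICY_MANUAL = {
    "attack_types": {
        "dehumanizing": "Compares the target to animals, objects, or diseases.",
        "inciting_violence": "Calls for or celebrates harm against the target.",
    },
    "protected_targets": {
        "religion": "Attacks based on religious affiliation.",
        "ethnicity": "Attacks based on race or ethnic origin.",
    },
}

def build_policy_prompt(meme_text: str) -> str:
    """Render the policy manual plus the meme content into one SFT input."""
    lines = ["You are a content moderator. Apply the policy manual below."]
    for section, items in POLICY_MANUAL.items():
        lines.append(f"\n## {section}")
        for name, desc in items.items():
            lines.append(f"- {name}: {desc}")
    lines.append(f"\nMeme text: {meme_text}")
    lines.append("Identify the attack type and target, then give a verdict.")
    return "\n".join(lines)

print(build_policy_prompt("<ocr text of the meme>"))
```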
GRPO Curriculum Learning (GRPO-CL): A simple 50/50/50 schedule is adopted:
- First 50% of training steps: fine-grained data only (encouraging reasoning exploration)
- Remaining 50% of steps: a 50/50 mixture of fine-grained and binary classification data
Key effect: Standard GRPO produces an average response length of only 28 tokens for binary classification, whereas GRPO-CL nearly doubles this to 52 tokens, indicating more detailed reasoning.
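The schedule is mechanical enough to state in code; below is a minimal sketch of a task sampler implementing the 50/50/50 rule (the function is illustrative, not the authors' implementation):

```python
import random

def sample_task(step: int, total_steps: int) -> str:
    """50/50/50 curriculum: fine-grained tasks only for the first half
    of training, then an even mix of fine-grained and binary tasks."""
    if step < total_steps // 2:
        return "fine_grained"  # first 50% of steps: fine-grained only
    return random.choice(["fine_grained", "binary"])  # 50/50 mixture
```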
Conditional Decision Entropy (CDE):
Definition: Given input \(\mathbf{x}\) and explanation \(\mathbf{e}\), CDE is the entropy of the decision \(d\) conditioned on the explanation:

\[
\mathrm{CDE}(\mathbf{x}, \mathbf{e}) = H(d \mid \mathbf{x}, \mathbf{e}) = -\sum_{d \in \mathcal{D}} p(d \mid \mathbf{x}, \mathbf{e}) \log p(d \mid \mathbf{x}, \mathbf{e})
\]
Rationale: Good reasoning should lead to clear and correct decisions (low entropy), while poor reasoning produces confusion (high entropy).
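One natural way to estimate CDE is from the probabilities the model assigns to the decision tokens after emitting its explanation; the readout below is an assumption about how one might implement it, not the paper's exact procedure.

```python
import math

def conditional_decision_entropy(decision_probs: dict[str, float]) -> float:
    """Entropy (in nats) of p(d | x, e): the decision distribution the
    model assigns after producing explanation e for input x."""
    return -sum(p * math.log(p) for p in decision_probs.values() if p > 0)

# A confident explanation yields low CDE ...
print(conditional_decision_entropy({"hateful": 0.98, "benign": 0.02}))  # ~0.10
# ... while a confused one yields high CDE.
print(conditional_decision_entropy({"hateful": 0.5, "benign": 0.5}))    # ~0.69
```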
CDE Reward Design: Let \(\delta = \mathbf{1}[d = d^*]\) indicate prediction correctness. The reward \(r_{\text{CDE}}\) distinguishes three cases: correct and confident (low CDE) is rewarded; incorrect but confident is penalized with coefficient \(\rho\); incorrect but uncertain is tolerated (no penalty).
Loss & Training¶
Total reward: \(r = r_{\text{format}} + r_{\text{acc}} + w \cdot r_{\text{CDE}}\)
where \(r_{\text{format}} \in \{0,1\}\) checks output format and \(r_{\text{acc}} \in [0,1]\) measures prediction correctness. The standard GRPO clipped surrogate loss and KL regularization are applied. Default hyperparameters: \(a=0.1\), \(b=0.5\), \(w=0.2\), \(\rho=0.25\).
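Putting the pieces together, here is a minimal sketch of the reward computation. Treating \(a\) and \(b\) as CDE thresholds for the correct and incorrect branches is an assumption about how the defaults above are used, not the paper's exact formula.

```python
def cde_reward(cde: float, correct: bool,
               a: float = 0.1, b: float = 0.5, rho: float = 0.25) -> float:
    """Sketch of r_CDE under one plausible reading of the three cases.
    Interpreting a and b as CDE thresholds is an assumption."""
    if correct:
        return 1.0 if cde <= a else 0.0   # correct and confident -> reward
    return -rho if cde <= b else 0.0      # incorrect but confident -> penalty
                                          # incorrect but uncertain -> tolerated

def total_reward(r_format: float, r_acc: float, cde: float,
                 correct: bool, w: float = 0.2) -> float:
    """r = r_format + r_acc + w * r_CDE, with the paper's default w."""
    return r_format + r_acc + w * cde_reward(cde, correct)
```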
Key Experimental Results¶
Main Results¶
Evaluation is conducted on three datasets—HatefulMemes, MAMI, and PrideMM—using Qwen2.5-VL-3B and 7B as backbone models.
Qwen2.5-VL-7B results on HatefulMemes:
| Method | Binary F1 | Attack F1 | Target F1 | LLM Judge | CDE ↓ |
|---|---|---|---|---|---|
| Zero-shot | 65.9 | 44.7 | 64.5 | 5.0 | 0.33 |
| SFT | 74.5 | 58.4 | 69.4 | 5.0 | 0.33 |
| DPO | 73.6 | 63.2 | 66.6 | 4.9 | 0.32 |
| GRPO | 74.5 | 61.2 | 64.5 | 5.2 | 0.26 |
| RA-HMD (SOTA direct detection) | 80.2 | — | — | 5.5 | — |
| ExPO-HM | 81.1 | 75.6 | 77.2 | 6.2 | 0.03 |
ExPO-HM is the first Explain-then-Detect system to comprehensively surpass the direct detection SOTA (RA-HMD), while substantially leading in reasoning quality.
Cross-dataset consistency (7B model):
| Dataset | GRPO Binary F1 | ExPO-HM Binary F1 | Gain |
|---|---|---|---|
| HatefulMemes | 74.5 | 81.1 | +6.6 |
| MAMI | 76.8 | 82.3 | +5.5 |
| PrideMM | 73.2 | 78.7 | +5.5 |
Ablation Study¶
| # | SFT-PM | GRPO-CL | CDE | Binary F1 | Attack F1 | Target F1 | LLM ↑ | CDE ↓ |
|---|---|---|---|---|---|---|---|---|
| 1 | - | - | - | 74.5 | 61.2 | 64.5 | 5.2 | 0.263 |
| 2 | ✓ | - | - | 75.8 | 70.8 | 70.2 | 5.6 | 0.092 |
| 3 | ✓ | ✓ | - | 78.4 | 74.3 | 76.1 | 5.8 | 0.056 |
| 4 | ✓ | ✓ | ✓ | 81.1 | 75.6 | 77.2 | 6.2 | 0.026 |
All three components contribute: SFT-PM substantially improves fine-grained metrics, GRPO-CL further advances all metrics, and CDE markedly enhances reasoning quality (LLM Judge 5.8→6.2).
Key Findings¶
- Explain-then-Detect surpasses Direct Detection for the first time: all prior Explain-then-Detect systems fell short of even the simple SFT baseline.
- CDE correlates strongly (negatively) with the LLM-Judge score: Pearson \(r=-0.78\), Spearman \(\rho=-0.81\) (\(p<0.001\)), i.e., lower CDE accompanies better-rated reasoning.
- SFT warm-up strategy is critical: Binary-only SFT performs worse than the no-warm-up baseline after the RL stage.
- CDE does not cause policy entropy collapse: Overall policy entropy is comparable to baselines without CDE.
- Human evaluation validates the approach: ExPO-HM achieves 100% logical consistency vs. 96% for GRPO, with helpfulness scores of 2.2 vs. 1.6.
Highlights & Insights¶
- The human-moderator analogy is carried through precisely: the progressive pipeline (policy manual → fine-grained practice → binary judgment) translates naturally into the algorithm design.
- CDE is an excellent proxy for reasoning quality: conceptually simple (a conditional entropy), strongly correlated with human evaluation, and directly computable from the model's decision probabilities as a reward signal.
- Key finding: good SFT does not necessarily lead to good RL: Binary SFT achieves the best SFT-stage performance but yields the worst results after RL.
- Consistency across three datasets: The method generalizes well across different hateful content types.
- Exceptionally comprehensive evaluation: Ablations, warm-up strategy comparisons, CDE analysis, calibration analysis, and human evaluation are all included.
Limitations & Future Work¶
- Limited dataset scale: Annotated hateful meme data, especially with explanations, is extremely scarce.
- Single-turn interaction: Only single-turn reasoning is evaluated; multi-turn moderation dialogue scenarios are not considered.
- Cultural dependency: Moderation policies are highly culture-specific; cross-cultural applicability is not validated.
- Backbone model constraints: Validation is limited to Qwen2.5-VL 3B/7B.
- The framework is extensible to other content moderation tasks (e.g., misinformation detection, cyberbullying identification).
Related Work & Insights¶
- RA-HMD (Mei et al., 2025): The previous SOTA direct detection method, which ExPO-HM is the first to surpass.
- LOREHM (Huang et al., 2024): A reasoning agent framework based on LLaVA-Next-34B.
- GRPO (Shao et al., 2024): ExPO-HM builds upon this by incorporating the CDE reward and curriculum learning.
- Insight: The "reasoning quality proxy" concept underlying CDE is generalizable to other tasks requiring interpretable reasoning.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The CDE concept is original, and the curriculum learning strategy is elegantly designed.
- Technical Depth: ⭐⭐⭐⭐ — The mapping from the human training pipeline to concrete algorithmic design is complete and coherent.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets, multiple baselines, ablations, and human evaluation make for an exceptionally thorough evaluation.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clear and experiments are well organized.
- Value: ⭐⭐⭐⭐ — Directly applicable to content moderation in practice.
- Overall Recommendation: ⭐⭐⭐⭐ (4/5)