SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models
Conference: ACL 2026
arXiv: 2509.15174
Code: https://github.com/hnghiem-nlp/hate_dpo_public
Area: Social Computing / Content Moderation / Explainable NLP
Keywords: Toxicity detection, Explainable classification, Self-augmenting training, DPO, Cross-model refinement
TL;DR
SMARTER utilizes a small number of labeled samples to allow LLMs to generate explanations for both correct and incorrect labels. It then enhances explainable toxicity detection through preference optimization and cross-model refinement, achieving 86%-100% of full-training performance using only 6%-57% of the training data across three datasets.
Background & Motivation
Background: Social platforms require the detection of hate speech, offensive content, and implicit harmful expressions. While traditional classifiers provide labels, they often lack human-readable explanations. LLMs can simultaneously output classifications and explanations, making them more suitable for content moderation scenarios requiring transparency and human review.
Limitations of Prior Work: Toxicity detection involves fine-grained label spaces and fuzzy boundaries; implicit toxicity in particular depends heavily on context and on how each label is defined. High-quality annotated data is expensive, and moderation standards differ across platforms and shift with linguistic trends. Zero-shot or few-shot prompting of commercial LLMs is convenient, but it is costly, hard to control, shows high variance, and may not consistently produce compliant explanations.
Key Challenge: Low-resource scenarios require both high accuracy and human-understandable explanations. Direct SFT on small samples is prone to overfitting, while pure prompting is unstable. The core problem is how to enable models to leverage their inherent generation capabilities to construct useful training signals under minimal supervision.
Goal: The authors propose a two-stage framework: first, enhancing classification through self-augmented explanations and DPO for a single model; second, allowing models of different architectures to learn explanation styles and reasoning patterns from each other. This results in explainable, deployable moderation models with minimal data.
Key Insight: The paper exploits the structured preference that "an explanation given the correct label should be superior to an explanation given an incorrect label." For each labeled post, the model generates an explanation for the gold label and counterfactual explanations for incorrect labels, naturally forming chosen/rejected data.
Core Idea: Transform self-generated explanations for correct/incorrect labels into preference optimization data, then absorb complementary explanation capabilities through cross-model refinement.
Method
Overall Architecture
SMARTER consists of two phases. Phase 1 is individual model self-augmentation: for each task, \(K\) training samples are drawn, with \(K \in \{16,32,64,128,256\}\). The model is first fine-tuned (SFT) on these samples, then generates explanations for both correct and incorrect labels, and is finally preference-optimized with DPO or KTO. Phase 2 is cross-model refinement: explanations generated by one model at \(K=128\) are used to train the other model, allowing a weaker model to absorb the reasoning style of a stronger or complementary one.
The experiments use three tasks: HateXplain, Latent Hate, and Implicit Hate. The models are Llama-3.1-8B-Instruct and COT-T5-XL. Macro-F1 is the evaluation metric because the tasks are multi-class and class-imbalanced, so every class should count equally regardless of its frequency.
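To make the choice of metric concrete, here is a minimal sketch of macro-F1 in plain Python. The toy labels are made up for illustration, and averaging only over classes present in the gold labels is an assumption of this sketch rather than something the paper specifies.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes that appear in `gold`."""
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy, made-up data: a classifier that always predicts the majority class.
gold = ["normal"] * 8 + ["hate"] * 2
pred = ["normal"] * 10
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred)
print(f"accuracy={accuracy:.2f}, macro-F1={score:.2f}")
```

The majority-class predictor looks acceptable under accuracy but is penalized by macro-F1, because the minority class contributes an F1 of zero to the average.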
Key Designs
- Self-augmenting Explanations for Preference Data:
  - Function: Expands training signals without additional human annotation.
  - Mechanism: For each post, the model generates a preferred explanation based on the gold label and dispreferred explanations based on each incorrect label. For DPO, these are paired as chosen/rejected; for KTO, the correct explanation is treated as a positive sample and incorrect ones as negative.
  - Design Motivation: Content moderation explanations are naturally tied to label definitions. Explanations for incorrect labels often reveal points of confusion; using them in contrastive learning helps the model learn category boundaries rather than just memorizing a few positive examples.
- Preferring DPO over KTO for Contrastive Signals:
  - Function: Selects a preference optimization method better suited for fine-grained toxicity detection.
  - Mechanism: DPO uses paired preferred/rejected explanations to directly reinforce "which label explanation is more reasonable for the same post," whereas KTO uses unpaired, per-example binary signals (desirable vs. undesirable), which are coarser.
  - Design Motivation: The challenge in toxicity detection often lies in subtle semantic differences between labels. Paired comparisons highlight class boundaries more effectively than single-point judgments. In experiments, DPO consistently improved at \(K=256\), while KTO was often ineffective or even detrimental.
- Cross-model Refinement and Consistency Checks:
  - Function: Allows models of different architectures to share explanation styles and reasoning strengths while monitoring explanation quality.
  - Mechanism: Llama and T5 generate gold-label explanations for each other on held-out 128-shot data, followed by SFT and DPO on the partner model's output. NLI is then used to check whether each explanation entails the predicted label and conforms to the label definition.
  - Design Motivation: Llama's explanations may be human-preferred for certain categories, while T5's encoder-decoder architecture might be more stable. Cross-model training transfers these advantages but may introduce inconsistencies, necessitating additional consistency audits.
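The data construction behind the first two designs can be sketched as follows. This is an illustrative sketch, not the authors' code: `explain`, `build_prompt`, the label set, and the example post are hypothetical placeholders, and the dictionary keys simply follow the `prompt`/`chosen`/`rejected` and `prompt`/`completion`/`label` layouts commonly used for DPO-style and KTO-style training data.

```python
from dataclasses import dataclass

LABELS = ["normal", "offensive", "hate"]  # illustrative label set (assumption)

@dataclass
class Example:
    post: str
    gold: str

def explain(post: str, label: str) -> str:
    """Hypothetical stand-in for the SFT model generating a label-conditioned explanation."""
    return f"[explanation arguing that the post is '{label}']"

def build_prompt(post: str) -> str:
    """Hypothetical prompt template."""
    return f"Classify the post and explain your decision.\nPost: {post}"

def build_dpo_pairs(examples, labels=LABELS):
    """One (chosen, rejected) pair per incorrect label: gold explanation vs. counterfactual."""
    pairs = []
    for ex in examples:
        chosen = explain(ex.post, ex.gold)
        for wrong in labels:
            if wrong == ex.gold:
                continue
            pairs.append({
                "prompt": build_prompt(ex.post),
                "chosen": chosen,
                "rejected": explain(ex.post, wrong),
            })
    return pairs

def build_kto_rows(examples, labels=LABELS):
    """Unpaired rows: each explanation is marked desirable (True) or undesirable (False)."""
    rows = []
    for ex in examples:
        for label in labels:
            rows.append({
                "prompt": build_prompt(ex.post),
                "completion": explain(ex.post, label),
                "label": label == ex.gold,
            })
    return rows

seed = [Example(post="some labeled post", gold="hate")]
dpo_pairs = build_dpo_pairs(seed)
kto_rows = build_kto_rows(seed)
print(len(dpo_pairs), "DPO pairs;", len(kto_rows), "KTO rows")
```

With three labels, each seed post yields two contrastive pairs for DPO, whereas the KTO rows carry the same explanations without tying a gold explanation to a specific counterfactual.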
Loss & Training
Base SFT uses LoRA with rank 64, alpha 128, and dropout 0.05, applied to the query and value projection layers (\(q\) and \(v\)). It is trained for 3 epochs with a learning rate of \(3 \times 10^{-4}\). DPO uses the default sigmoid loss from TRL with \(\beta=0.1\); KTO also uses \(\beta=0.1\). Inference uses temperature 0, with a maximum of 512 generated tokens for explainable classification and 20 tokens for label-only classification.
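For reference, the sigmoid DPO objective for a single pair is \(-\log \sigma\big(\beta[(\log \pi_\theta(y_w|x) - \log \pi_{ref}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{ref}(y_l|x))]\big)\), where \(y_w\) is the chosen and \(y_l\) the rejected explanation. The sketch below evaluates it in plain Python on made-up sequence log-probabilities; the function name and the numbers are illustrative only.

```python
import math

def dpo_sigmoid_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities under the policy and reference."""
    # How much more the policy favors the chosen explanation than the reference does.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Made-up log-probabilities. At initialization the policy equals the reference.
at_init = dpo_sigmoid_loss(-10.0, -10.0, -10.0, -10.0)
# After training, the policy raises the chosen and lowers the rejected explanation.
improved = dpo_sigmoid_loss(-8.0, -12.0, -10.0, -10.0)
print(f"at init: {at_init:.4f}, after improvement: {improved:.4f}")
```

When the policy matches the reference the loss equals \(\log 2\); it decreases as the policy widens the gap between gold-label and incorrect-label explanations, with \(\beta\) controlling how strongly that gap is rewarded.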
Key Experimental Results
Main Results
Results at \(K=256\) show that SMARTER matches or outperforms commercial-model zero-shot and 16-shot ICL while using only a fraction of the training data, and that it exceeds the full-training baseline on Latent Hate.
| Method | HateXplain F1 / Data Ratio | Latent Hate F1 / Data Ratio | Implicit Hate F1 / Data Ratio | Observation |
|---|---|---|---|---|
| Llama_DPO-256 | 0.64 / 6% | 0.69 / 7% | 0.60 / 57% | Strongest overall; best on Latent Hate |
| T5_DPO-256 | 0.62 / 6% | 0.65 / 7% | 0.59 / 57% | Weaker than Llama, but highly data-efficient |
| Llama_Full | 0.72 / 100% | 0.62 / 100% | 0.67 / 100% | Full training strongest on HateXplain/Implicit Hate |
| ModernBERT | 0.70 / 100% | 0.61 / 100% | 0.64 / 100% | Strong classifier, but generates no explanations |
| GPT-4o-chat zero-shot | 0.56 / - | 0.51 / - | 0.58 / - | Commercial model zero-shot is inconsistent |
| GPT-4o-chat 16-shot ICL | 0.62±0.01 / - | 0.60±0.06 / - | 0.40±0.11 / - | Few-shot ICL has high variance on complex tasks |
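The ratios implied by the table can be checked with a few lines of arithmetic. The F1 values below are copied from the table; using the Llama_Full row as the reference for both models is an assumption of this sketch, and the paper may normalize differently.

```python
# F1 values copied from the table above.
full = {"HateXplain": 0.72, "Latent Hate": 0.62, "Implicit Hate": 0.67}  # Llama_Full
few_shot = {
    "Llama_DPO-256": {"HateXplain": 0.64, "Latent Hate": 0.69, "Implicit Hate": 0.60},
    "T5_DPO-256": {"HateXplain": 0.62, "Latent Hate": 0.65, "Implicit Hate": 0.59},
}

# Percentage of the full-training F1 reached by each 256-shot model.
relative = {
    model: {task: round(100 * f1 / full[task]) for task, f1 in scores.items()}
    for model, scores in few_shot.items()
}

for model, ratios in relative.items():
    print(model, ratios)
```

Under this assumption the lowest ratio is 86% (T5 on HateXplain), and both models exceed 100% on Latent Hate, which matches the observation that SMARTER surpasses full training on that dataset.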
The authors also performed a contribution breakdown: on HateXplain, off-the-shelf Llama scored 0.52. Adding HateCOT pre-training brought the \(K=256\) baseline to 0.58, and SMARTER’s DPO self-augmentation further increased it to 0.64, indicating gains are not merely from general pre-training or seed explanations.
Ablation Study
Cross-model refinement showed that T5 benefits from Llama's explanations, while Llama's performance often regresses when learning from T5. This suggests that "mutual learning" is not unconditionally effective and that the direction of transfer should be selected on a validation set.
| Setting | HateXplain F1 | Latent Hate F1 | Implicit Hate F1 | Conclusion |
|---|---|---|---|---|
| T5 Single Model DPO-256 | 0.62 | 0.65 | 0.59 | T5 self-augmented baseline |
| T5 + Llama Outputs SFT+DPO | 0.66 | 0.66 | 0.61 | Outperforms single T5 and some Llama results |
| Llama Single Model DPO-256 | 0.64 | 0.69 | 0.60 | Llama self-augmented baseline |
| Llama + T5 Outputs | No improvement | Marginal gain | No stable gain | T5 output may not suit Llama enhancement |
Regarding explanation quality, a human evaluation on 342 HateXplain samples compared T5 and Llama. For the "Normal" category, Llama's explanations were preferred 73 times vs. 25 for T5; for "Offensive" and "Hate," the two were comparable. NLI consistency checks showed that most explanations align with predicted labels and definitions (Entail > 96%). Cross-model training (XMOD) raised the Contradiction rate in most settings, with XMOD contradiction rates ranging from 0.9% to 3.2% in the table below.
| Dataset/Model | Training | Label Consistency Entail↑ | Label Contra.↓ | Def. Consistency Entail↑ | Def. Contra.↓ |
|---|---|---|---|---|---|
| HateXplain T5 | DPO | 99.2 | 0.8 | 98.2 | 1.5 |
| HateXplain T5 | XMOD | 96.7 | 1.5 | 99.0 | 0.9 |
| Latent Hate Llama | DPO | 97.3 | 2.7 | 97.3 | 2.3 |
| Latent Hate Llama | XMOD | 96.8 | 3.2 | 97.5 | 2.5 |
| Implicit Hate T5 | DPO | 99.6 | 0.4 | 99.0 | 1.0 |
| Implicit Hate T5 | XMOD | 98.3 | 1.4 | 97.5 | 2.0 |
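The percentages in the table above are rates over NLI verdicts. A minimal sketch of that aggregation is shown here, assuming a pluggable `nli_fn(premise, hypothesis)` that returns `"entailment"`, `"neutral"`, or `"contradiction"`. The hypothesis template, the keyword-based `toy_nli` stub, and the records are all hypothetical; a real audit would call an actual NLI model instead.

```python
from collections import Counter

VERDICTS = ("entailment", "neutral", "contradiction")

def label_hypothesis(label: str) -> str:
    """Hypothetical template turning a predicted label into an NLI hypothesis."""
    return f"The post is {label}"

def consistency_rates(records, nli_fn):
    """Percentage of explanations whose NLI verdict against the predicted label is each class."""
    counts = Counter(
        nli_fn(r["explanation"], label_hypothesis(r["predicted_label"]))
        for r in records
    )
    total = sum(counts.values())
    return {v: 100.0 * counts[v] / total for v in VERDICTS}

def toy_nli(premise: str, hypothesis: str) -> str:
    """Keyword stub standing in for a real NLI model; for illustration only."""
    claimed = hypothesis.split()[-1]
    if claimed in premise:
        return "entailment"
    return "contradiction" if "not" in premise else "neutral"

# Made-up records pairing a generated explanation with the model's predicted label.
records = [
    {"explanation": "The text attacks a group, so it is hate.", "predicted_label": "hate"},
    {"explanation": "No one is targeted; this is normal.", "predicted_label": "normal"},
    {"explanation": "This is not harmful at all.", "predicted_label": "offensive"},
    {"explanation": "The post mentions a sports team.", "predicted_label": "hate"},
]
rates = consistency_rates(records, toy_nli)
print(rates)
```

The same function could be reused for definition consistency by swapping the hypothesis for the label's definition text, which is one way to track contradiction rates before and after cross-model training.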
Key Findings
- DPO self-augmentation provides gains even at \(K \le 64\), showing significant low-resource value; T5 narrowed the gap with Llama on Latent Hate and Implicit Hate.
- DPO continued to improve at \(K=256\), while KTO was largely ineffective on HateXplain and hindered performance on the other two datasets.
- Commercial model 16-shot ICL is not necessarily better than zero-shot; for example, GPT-4o-mini dropped from 0.50 to 0.29 on HateXplain, indicating that few-shot prompts are fragile for fine-grained harmful content tasks.
- Cross-model training allows weaker models to absorb stronger reasoning styles but can introduce slight explanation inconsistencies, requiring periodic human or NLI audits in deployment.
Highlights & Insights
- The most ingenious aspect of the paper is turning "incorrect label explanations" into training resources. This is more than simple data augmentation; it explicitly contrasts category boundaries for the same input.
- The superiority of DPO over KTO aligns with task intuition: moderation often requires comparing which explanation fits a definition better between two similar labels, rather than just judging if an explanation "makes sense."
- Cross-model refinement reveals that models differ not just in performance but in explanation styles. T5 can absorb Llama's reasoning patterns, but Llama does not necessarily benefit from T5.
- The authors did not merely pursue F1 scores but also validated explanation quality via human preference and NLI consistency, which is critical for explainable content moderation.
Limitations & Future Work
- Data is entirely in English; cross-lingual toxicity, cultural contexts, and platform norms could significantly shift label boundaries.
- Only Llama and T5 were compared; cross-model refinement conclusions may not generalize to more architectures or larger models.
- Human evaluation of explanations was limited to HateXplain due to budget constraints; the other two fine-grained tasks lack equivalent human validation.
- Self-augmentation relies on the model's initial ability to generate explanations; strong initial bias may reinforce incorrect boundaries or prejudices.
- Content moderation carries risks of misuse; model outputs must complement human review, appeal mechanisms, and bias audits rather than acting as the sole basis for automated suppression.
Related Work & Insights
- vs. Traditional Toxicity Classifiers: Classifiers like ModernBERT are powerful but limited in interpretability. SMARTER is better suited for workflows requiring transparent reasoning and human oversight.
- vs. Commercial ICL: Commercial models are easy to deploy but suffer from high few-shot variance, costs, and lack of control. SMARTER provides a stable, controllable local solution using open-source models.
- vs. Self-Instruct / Self-Refine: While these methods typically generate more positive data, SMARTER's uniqueness lies in generating explanations for both correct and incorrect labels to form preference pairs.
- Insight: For other fine-grained social computing tasks—such as rumor classification, stance detection, and policy violation judgment—"counterfactual explanations" under label definitions could serve as effective low-resource alignment signals.
Rating
- Novelty: ⭐⭐⭐⭐☆ The idea of self-augmenting explanations with DPO is straightforward but highly practical for explainable toxicity detection.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Comprehensive across three tasks, two model types, commercial baselines, cross-model training, and consistency analysis; cross-lingual and further human evaluations are still needed.
- Writing Quality: ⭐⭐⭐⭐☆ The methodology is clear and tables are informative, though some figure results require close reading of the text.
- Value: ⭐⭐⭐⭐⭐ Directly applicable to low-resource, explainable, and controllable moderation model training; offers a transferable paradigm for preference optimization using incorrect explanations.