ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

Conference: ACL 2026
arXiv: 2604.12321
Code: https://huggingface.co/ArdLi/ToxiTrace
Area: Social Computing
Keywords: Chinese Toxicity Detection, Explainability, Gradient Constraint, Fine-grained Evidence Extraction, Contrastive Learning

TL;DR

ToxiTrace proposes an explainable Chinese toxicity detection method for BERT-class encoders, combining CuSA (LLM-guided weak annotation), GCLoss (gradient-constrained loss), and ARCL (adversarial reasoning contrastive learning) to achieve both high sentence-level classification accuracy and contiguous toxic span extraction while maintaining efficient encoder inference.

Background & Motivation

Background: Existing Chinese toxicity detection methods primarily target sentence-level classification and have achieved strong performance through pre-trained language models (e.g., RoBERTa, MacBERT) and large language models.

Limitations of Prior Work:

  • Most methods only perform sentence-level classification and cannot pinpoint which specific spans within a sentence are toxic, lacking explainability
  • Chinese uses character-level tokenization, causing gradient/attention attribution signals to fragment across individual characters and fail to form human-readable contiguous spans
  • LLMs have strong explanation capabilities but underperform encoders in direct classification and incur high inference costs

Key Challenge: Encoder models are accurate classifiers but poor explainers (their attributions are fragmented), while LLMs explain well but classify worse and infer slowly; no prior approach combines the strengths of both.

Goal: Maintain efficient encoder inference while enabling the model to both classify accurately and extract contiguous, readable toxic spans as explanations.

Key Insight: By explicitly constraining gradient signals during training, the encoder's token-level attributions naturally focus on toxic evidence, enabling direct contiguous span extraction from saliency maps at inference time.

Core Idea: Elevate gradient attribution from "post-hoc explanation" to "training objective"—use LLM-generated weak annotations to guide gradients toward toxic tokens while sharpening the toxic/non-toxic semantic boundary through contrastive learning.

Method

Overall Architecture

ToxiTrace follows a four-stage pipeline: (1) warm-up training for baseline classification; (2) CuSA leverages encoder attribution cues + LLM refinement to generate weakly annotated toxic spans; (3) GCLoss explicitly boosts gradient responses on toxic tokens and suppresses non-toxic ones; (4) ARCL sharpens the semantic boundary between toxic and non-toxic content. At inference, the model first classifies, then applies the BiCSE algorithm to extract contiguous spans from the saliency map for toxic inputs.
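
The classify-then-extract inference path described above can be sketched as follows; `classify`, `saliency_map`, and `extract_spans` are placeholder callables standing in for the trained encoder's prediction, gradient attribution, and BiCSE steps (names assumed, not the paper's API):

```python
def infer(text, classify, saliency_map, extract_spans):
    """Sketch of ToxiTrace's two-step inference: classify first, and only
    run span extraction on the saliency map for inputs judged toxic."""
    label = classify(text)            # sentence-level prediction
    if label != "toxic":
        return label, []              # non-toxic inputs need no evidence spans
    sal = saliency_map(text)          # per-character attribution scores
    return label, extract_spans(sal)  # contiguous toxic evidence spans
```

This keeps the expensive attribution/extraction step off the non-toxic majority of inputs, which is part of the efficiency story.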

Key Designs

  1. CuSA (Clue-guided Span Annotation):

    • Function: Automatically generates weak supervision signals for toxic spans without fine-grained annotations
    • Mechanism: First warm-up trains the encoder for baseline classification, then computes token-level saliency scores. The BiCSE (Bidirectional Cliff Scan and Extraction) algorithm extracts initial high-saliency spans as clues, which are fed to an LLM (Gemini 2.5 Pro) to refine span boundaries
    • Design Motivation: Existing Chinese toxicity datasets have only coarse-grained labels, but direct LLM annotation lacks localization cues. Using the encoder's own attribution signals to provide candidate regions, then leveraging the LLM's comprehension ability to refine them, achieves complementary strengths
  2. GCLoss (Gradient-Constrained Loss):

    • Function: Explicitly shapes token-level gradient distributions so that toxic tokens have higher gradient responses than non-toxic ones
    • Mechanism: Comprises two components—PGR Loss enforces a margin between toxic/non-toxic token gradients; PPT Loss uses within-sample statistics (15th percentile + α·max) to cap the gradient upper bound for non-toxic tokens and set a gradient lower bound for toxic tokens
    • Design Motivation: Models trained solely with classification loss produce scattered, inaccurate token-level attributions. Directly constraining gradients concentrates attributions on toxic evidence, making span extraction at inference more reliable
  3. ARCL (Adversarial Reasoning Contrastive Learning):

    • Function: Sharpens the semantic boundary between toxic and non-toxic text
    • Mechanism: Uses an LLM (Gemini 2.5 Flash) to generate pro and con reasoning arguments ("Assume the text is toxic/non-toxic, generate supporting rationales"), which serve as positive and negative samples for adaptive InfoNCE contrastive learning
    • Design Motivation: GCLoss only constrains intra-sentence token-level gradient relationships and cannot capture inter-sentence semantic differences. Reasoning content generated via the LLM debate mechanism is more targeted and semantically deeper than simple data augmentation
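
The paper's exact GCLoss formulas are not reproduced in this summary, so the following numpy sketch fills in assumed details: PGR is taken as a hinge margin between mean toxic and mean non-toxic gradient magnitudes, and PPT as a single shared threshold built from the within-sample 15th percentile plus α·max, used both as a cap for non-toxic tokens and a floor for toxic ones:

```python
import numpy as np

def gc_loss(grad_norms, toxic_mask, margin=0.1, alpha=0.5):
    """Illustrative GCLoss sketch (assumed formulation, not the paper's).
    grad_norms: 1-D array of per-token gradient magnitudes
    toxic_mask: boolean array, True where CuSA marks a token as toxic
    """
    tox = grad_norms[toxic_mask]
    non = grad_norms[~toxic_mask]
    # PGR: hinge margin between mean toxic and mean non-toxic responses
    pgr = max(0.0, margin - (tox.mean() - non.mean()))
    # PPT: within-sample threshold = 15th percentile + alpha * max
    thr = np.percentile(grad_norms, 15) + alpha * grad_norms.max()
    ppt = np.maximum(non - thr, 0.0).sum()   # cap non-toxic gradients
    ppt += np.maximum(thr - tox, 0.0).sum()  # floor toxic gradients
    return pgr + ppt
```

A sample whose toxic tokens already carry clearly higher gradient responses than its non-toxic tokens incurs zero penalty, so the loss only acts where attributions are scattered.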

Loss & Training

Overall training objective: \(\mathcal{L} = \mathcal{L}_{CE} + \lambda_{grad}(\mathcal{L}_{PGR} + \mathcal{L}_{PPT}) + \lambda_{sem}\mathcal{L}_{con}\)

Training procedure: warm-up for 3 epochs (cross-entropy only) → introduce GCLoss + ARCL for joint training. A warm-up that is either too long or too short degrades final performance.
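
A minimal sketch of this objective and schedule, with illustrative λ values and a hard warm-up switch (the paper may set or ramp the weights differently):

```python
def loss_weights(epoch, warmup_epochs=3, lam_grad=0.5, lam_sem=0.1):
    """During warm-up only cross-entropy trains the encoder; afterwards
    the GCLoss and ARCL terms are switched on jointly. The lambda values
    here are placeholders, not the paper's hyperparameters."""
    if epoch < warmup_epochs:
        return 0.0, 0.0
    return lam_grad, lam_sem

def total_loss(ce, pgr, ppt, con, epoch):
    # L = L_CE + lam_grad * (L_PGR + L_PPT) + lam_sem * L_con
    lg, ls = loss_weights(epoch)
    return ce + lg * (pgr + ppt) + ls * con
```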

Key Experimental Results

Main Results (Classification)

Dataset   Metric     Ours (RoBERTa+ToxiTrace)   Prev. SOTA (RoBERTa)   Gain (pp)
COLD      Macro-F1   83.68%                     82.56%                 +1.12
COLD      Acc        83.84%                     82.68%                 +1.16
ToxiCN    Macro-F1   83.83% (MacBERT)           82.81%                 +1.02

Span Extraction (CNTP)

Model                 Overlap F1   Character F1   IoU      Inference Time
RoBERTa+ToxiTrace*    77.90%       77.63%         61.56%   1m 58s
Qwen3-8B              77.87%       74.74%         59.67%   14m 33s
Gemini 2.5 Pro        80.39%       79.67%         66.22%   ~1.5h

Ablation Study

Config             Classification Macro-F1   Extraction F1   Note
Full model         83.68%                    77.90%
w/o CuSA           82.90%                    71.96%          Weak annotation degrades to raw BiCSE; extraction recall drops sharply
w/o ARCL           83.12%                    75.16%          Missing semantic contrast; both classification and extraction decline
w/o GCLoss         83.36%                    65.15%          Largest extraction F1 drop (−12.75 pp)
RoBERTa baseline   82.76%                    65.08%          Baseline

Key Findings

  • GCLoss contributes far more to span extraction than ARCL (extraction F1 drops of −12.75 pp vs −2.74 pp when ablated), making it the core component
  • Encoder+ToxiTrace achieves span extraction F1 comparable to the strongest LLM (Qwen3-8B) in ~1/7 of the inference time
  • The BiCSE algorithm significantly outperforms traditional top-k selection for extraction (RoBERTa 52.34→65.08 F1)
  • Masking toxic spans causes a sharp drop in model confidence, validating the causal faithfulness of gradient attributions
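
BiCSE itself is only named in this summary; as an illustration of the underlying idea it improves on top-k selection with (grow one contiguous span outward from the saliency peak until scores fall off a "cliff"), with the cliff ratio and all other details assumed:

```python
import numpy as np

def bicse(sal, cliff=0.5):
    """Illustrative guess at a bidirectional cliff scan: start at the
    saliency peak and extend the span left and right, stopping once the
    next score drops below `cliff` * peak. Unlike top-k selection, this
    always yields a contiguous, human-readable span."""
    peak = int(np.argmax(sal))
    thr = cliff * sal[peak]
    lo = peak
    while lo > 0 and sal[lo - 1] >= thr:
        lo -= 1
    hi = peak
    while hi < len(sal) - 1 and sal[hi + 1] >= thr:
        hi += 1
    return lo, hi  # inclusive character span
```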

Highlights & Insights

  • The idea of elevating gradient attribution from a passive analysis tool to an active training objective is novel, closing the loop from "shaping gradients during training → extracting spans at inference"
  • Clever dual use of LLMs: as a weak annotation refiner in CuSA and as an adversarial reasoning generator in ARCL—neither role involves direct classification, avoiding the LLM's classification weakness
  • The BiCSE bidirectional cliff scan algorithm addresses the practical problem of attribution fragmentation under Chinese character-level tokenization
  • Clear efficiency advantage: encoder inference ~2 min vs LLM ~15 min, with comparable span extraction quality

Limitations & Future Work

  • Does not handle homophone substitution, pinyin obfuscation, or other "implicit toxic expressions"
  • Validated only on Chinese; applicability to other character-level languages (Japanese, Korean) requires further investigation
  • LoRA-based transfer to decoder LLMs yields limited results; deeper parameter-efficient gradient shaping strategies may be needed
  • CuSA depends on an external LLM for annotation refinement, introducing additional cost

Comparison with Alternatives

  • vs Traditional Attribution Methods (LIME/IG/Attention): Traditional methods provide post-hoc explanations with scattered token selections; ToxiTrace shapes gradients during training to extract contiguous spans
  • vs Direct LLM Detection: LLMs explain well but classify poorly and slowly; ToxiTrace gives encoders the best of both worlds
  • vs CRF Sequence Labeling: CRF requires explicit span annotations for training; ToxiTrace achieves comparable results via weak supervision + gradient constraints

Rating

  • Novelty: ⭐⭐⭐⭐ Gradient attribution as a training objective + LLM debate contrastive learning is a creative combination
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dataset, multi-model comparisons, ablations, and faithfulness verification are comprehensive
  • Writing Quality: ⭐⭐⭐⭐ Clear framework with logically sound motivation