Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?

Conference: ICLR 2026
arXiv: 2509.22291
Code: https://github.com/Ewanwong/fairness_x_explainability
Area: AI Safety / Fairness
Keywords: fairness, explainability, hate speech detection, input attribution, bias mitigation

TL;DR

The first systematic large-scale quantitative study on the relationship between input-based explanations and fairness: explanations can effectively detect biased predictions and serve as training regularizers to reduce bias, but cannot be reliably used for automatic fair model selection.

Background & Motivation

Background: NLP models in sensitive tasks such as hate speech detection often reproduce or amplify social biases present in training data. Explainability is widely regarded as a key enabler of fairness—if explanations can reveal that a model relies on sensitive features (e.g., race- or gender-related tokens), such reliance can be detected and constrained.

Limitations of Prior Work: (a) Some studies question the faithfulness of explanation methods, as they do not necessarily reflect the true decision process. (b) Reducing reliance on sensitive features may simultaneously degrade both performance and fairness. (c) Models can be deliberately trained to conceal their use of sensitive features from explanations. Existing research is largely qualitative or limited in scale.

Key Challenge: The relationship between explainability and fairness is oversimplified—the assumption that "explanations reveal bias → bias can be eliminated" lacks large-scale quantitative validation.

Goal: Three research questions: (RQ1) Can explanations detect biased predictions? (RQ2) Can explanations select fair models? (RQ3) Can explanations reduce bias during training?

Key Insight: Large-scale experiments on hate speech detection using 16 explanation methods × encoder/decoder models × multiple debiasing techniques × two datasets.

Core Idea: Input attribution explanations are effective for bias detection and training-time debiasing, but unreliable for model selection—the relationship between explainability and fairness is task-specific and sensitive to method choice.

Method

Overall Architecture

Three experimental pipelines corresponding to the three RQs: (RQ1) compute sensitive token reliance scores from explanations and measure Pearson correlation with individual unfairness; (RQ2) rank models by sensitive token reliance on the validation set and evaluate whether this predicts test-set fairness; (RQ3) incorporate sensitive token reliance as a regularization term in training to produce debiased models.
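
The RQ1 pipeline reduces to a per-sample correlation between two lists of numbers. The snippet below is a minimal sketch of that step, not the authors' code: the helper name `pearson` and the toy values are assumptions chosen for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Toy per-sample values (made up for illustration, not results from the paper).
reliance = [0.10, 0.40, 0.35, 0.80, 0.05]    # sensitive token reliance per sample
unfairness = [0.02, 0.20, 0.15, 0.45, 0.01]  # individual unfairness per sample

fairness_correlation = pearson(reliance, unfairness)
print(round(fairness_correlation, 3))
```

A value near 1 would mean that samples with high reliance on sensitive tokens are also the ones whose predictions change most under group substitution, which is the property RQ1 tests for.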

Key Designs

Sensitive Token Reliance:
Function: Quantifies the degree to which a model relies on sensitive tokens (e.g., "black", "female", "Muslim") present in the input.
Mechanism: For token-level attribution scores generated by 16 explanation methods, the maximum absolute attribution value among sensitive tokens is taken as the reliance score for each sample.
Usage: Correlated with individual unfairness in RQ1; used as a model ranking metric in RQ2; used as a regularization target in RQ3.
Individual Unfairness (IU):
Function: Measures the change in model predictions when social group references in a sample are replaced.
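Mechanism: \(IU(\mathbf{x}_i) = \left| f_{\hat{y}_i}(\mathbf{x}_i) - \frac{1}{|G|-1}\sum_{g' \in G \setminus \{g\}} f_{\hat{y}_i}(\mathbf{x}_i^{(g')}) \right|\), where \(G\) is the set of social groups, \(g\) is the group referenced in \(\mathbf{x}_i\), \(f_{\hat{y}_i}\) is the model's output for the predicted class \(\hat{y}_i\), and \(\mathbf{x}_i^{(g')}\) is the counterfactual version of \(\mathbf{x}_i\) with the group reference replaced by \(g'\).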
Distinction from group fairness: IU is defined at the sample level, enabling per-sample correlation with explanation scores.
Explanation Regularization for Debiasing (RQ3):
Function: Minimize model reliance on sensitive tokens during training.
Loss: \(L = L_{task} + \alpha L_{debias}\), where \(L_{debias}\) penalizes attribution scores on sensitive tokens (L1 or L2 norm).
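Hyperparameter: \(\alpha \in \{0.01, 0.1, 1, 10, 100\}\) is selected using a fairness-balanced metric (a harmonic mean of accuracy and fairness, the latter presumably derived from the unfairness score so that lower unfairness scores higher).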
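
The three definitions above can be sketched in plain Python as scalar arithmetic. This is an illustrative sketch under stated assumptions, not the paper's implementation: all function names are hypothetical, attributions are assumed to be one score per token, the model outputs are assumed to be probabilities for the predicted class, and the L2 penalty is taken to be the unsquared norm (the summary does not say which variant is used).

```python
def sensitive_token_reliance(attributions, sensitive_positions):
    """Maximum absolute attribution among the sensitive tokens of one sample."""
    if not sensitive_positions:
        return 0.0  # assumption: a sample without sensitive tokens has zero reliance
    return max(abs(attributions[i]) for i in sensitive_positions)

def individual_unfairness(p_original, p_counterfactuals):
    """|f(x) - mean of f(x^(g'))| over the |G|-1 counterfactual substitutions."""
    if not p_counterfactuals:
        raise ValueError("at least one counterfactual prediction is required")
    mean_cf = sum(p_counterfactuals) / len(p_counterfactuals)
    return abs(p_original - mean_cf)

def debias_penalty(attributions, sensitive_positions, norm="l1"):
    """L1 or (unsquared) L2 norm of the attributions on sensitive tokens."""
    values = [attributions[i] for i in sensitive_positions]
    if norm == "l1":
        return sum(abs(v) for v in values)
    if norm == "l2":
        return sum(v * v for v in values) ** 0.5
    raise ValueError("norm must be 'l1' or 'l2'")

def total_loss(task_loss, attributions, sensitive_positions, alpha, norm="l1"):
    """L = L_task + alpha * L_debias."""
    return task_loss + alpha * debias_penalty(attributions, sensitive_positions, norm)

# Toy sample: five tokens, positions 1 and 2 hold sensitive tokens (made-up numbers).
attributions = [0.05, -0.60, 0.30, 0.01, 0.20]
sensitive_positions = [1, 2]

reliance = sensitive_token_reliance(attributions, sensitive_positions)
iu = individual_unfairness(0.90, [0.70, 0.60, 0.65])
loss = total_loss(0.5, attributions, sensitive_positions, alpha=0.1, norm="l1")
print(reliance, round(iu, 2), round(loss, 2))
```

In actual training the penalty would have to be computed on differentiable attributions inside an autograd framework so that gradients flow back to the model; the scalar version here only shows how the terms combine.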

Experimental Scale

16 explanation methods × 2 model families (encoder: BERT/RoBERTa; decoder: Llama3.2/Qwen3) × 7 debiasing methods × 2 datasets × 3 bias types.

Key Experimental Results

RQ1: Bias Detection (Fairness Correlation)

Explanation Method	BERT (Race)	BERT (Gender)	Qwen3-4B (Race)	Qwen3-4B (Gender)
Grad L2	High	Medium	High	High
Occlusion	High	High	Medium	Medium
IxG L2	High	Medium	High	High
Attention	Low	Low	Low	Low

The best-performing methods (Occlusion / L2-norm variants) achieve statistically significant fairness correlation across most settings.

RQ2: Model Selection

The Spearman correlation between validation-set explanation metrics and test-set fairness is unstable, and MRR@1 consistently falls below the baseline of directly using validation-set IU. Conclusion: explanations are not reliable for model selection.
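
To make the model-selection test concrete, the sketch below ranks a handful of candidate models by a validation-set explanation score and checks how well that ranking agrees with test-set unfairness. It is a generic illustration: the function names and numbers are assumptions, and the reciprocal-rank helper is a plain reciprocal rank of the fairest model, since the exact definition of the paper's MRR@1 is not given in this summary.

```python
import math

def ranks(values):
    """Ranks starting at 1 (smallest value first); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result

def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def reciprocal_rank_of_fairest(val_scores, test_unfairness):
    """1 / position of the truly fairest model in the validation-score ranking."""
    fairest = min(range(len(test_unfairness)), key=lambda i: test_unfairness[i])
    ranking = sorted(range(len(val_scores)), key=lambda i: val_scores[i])
    return 1.0 / (ranking.index(fairest) + 1)

# Four hypothetical candidate models (numbers invented for illustration).
val_reliance = [0.30, 0.10, 0.50, 0.20]  # lower = looks fairer by explanation
test_unfairness = [1.2, 0.9, 0.6, 1.5]   # lower = actually fairer on the test set

rho = spearman(val_reliance, test_unfairness)
rr = reciprocal_rank_of_fairest(val_reliance, test_unfairness)
print(round(rho, 2), rr)
```

In this toy case the model with the highest validation reliance is the fairest on the test set, so the ranking is negatively correlated with the outcome. That is the kind of failure that makes explanation-based selection unreliable.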

RQ3: Training Debiasing

Method	Race AvgIU↓	Gender AvgIU↓	Religion AvgIU↓
Default BERT	3.17	0.66	1.27
Best Explanation Regularization	~1.5	~0.4	~0.8
CDA (Best Traditional Debiasing)	0.50	0.50	0.90

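Explanation regularization substantially reduces AvgIU, most visibly for race bias (from 3.17 to roughly 1.5). In this table the best explanation-regularized model is ahead of CDA on gender and religion, while CDA remains lower on race, so its debiasing performance is comparable to traditional techniques in some settings and better in others.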
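
As a quick arithmetic check on the table, the relative reduction in AvgIU against Default BERT can be computed directly. The values are copied from the table above; the entries marked "~" are approximate, so the resulting percentages are rough as well.

```python
# AvgIU values copied from the RQ3 table; the "~" entries are approximate.
avg_iu = {
    "Default BERT": {"race": 3.17, "gender": 0.66, "religion": 1.27},
    "Best Explanation Regularization": {"race": 1.5, "gender": 0.4, "religion": 0.8},
    "CDA": {"race": 0.50, "gender": 0.50, "religion": 0.90},
}

def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

baseline = avg_iu["Default BERT"]
reductions = {
    method: {
        bias: round(relative_reduction(baseline[bias], value), 1)
        for bias, value in scores.items()
    }
    for method, scores in avg_iu.items()
    if method != "Default BERT"
}

for method, per_bias in reductions.items():
    print(method, per_bias)
```

With these numbers, explanation regularization cuts race AvgIU by roughly half and gender and religion AvgIU by a bit under 40%, while CDA is stronger on race but weaker on the other two bias types.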

Key Findings

RQ1 ✓: Occlusion and L2-norm methods effectively detect biased predictions, with statistically significant fairness correlation. Detection capability is maintained even after debiasing training, refuting the concern that "debiasing renders explanations ineffective."
RQ2 ✗: Explanation methods cannot replace direct validation-set fairness computation for model selection. Debiasing alters model behavior and attribution patterns, making cross-model explanation comparisons unreliable, though within-model comparisons remain valid.
RQ3 ✓: Using sensitive token reliance as a regularization term effectively reduces bias, achieving performance on par with or superior to some traditional debiasing methods.
LLM-generated rationales are less reliable than input attributions: Natural language explanations from LLMs underperform Occlusion/L2 methods in bias detection.

Highlights & Insights

Three-dimensional systematic evaluation: This is the first work to decompose the "explanation → fairness" relationship into three distinct dimensions—detection, selection, and debiasing—yielding differentiated conclusions (2 out of 3 effective) rather than a blanket verdict.
Mean vs. L2 aggregation: Mean-aggregated attribution scores are significantly inferior to L2 aggregation and Occlusion for bias detection. This is because mean aggregation requires accurate estimation of the directional contribution of each token, whereas L2 and Occlusion are direction-agnostic. This finding directly informs the selection of explainability methods.
Faithfulness ≠ bias detection capability: Appendix analysis shows that the faithfulness of an explanation method is uncorrelated with its ability to detect bias—an "unfaithful" explanation may still effectively capture sensitive feature usage.
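
The mean-versus-L2 point can be illustrated with a toy example. The sketch assumes that aggregation collapses a token's per-dimension attribution vector (for example, gradients over embedding dimensions) into a single score; the vectors are invented and only show one way that sign sensitivity can hide a strongly used token.

```python
import math

def mean_aggregate(vector):
    """Signed mean over the attribution vector of one token."""
    return sum(vector) / len(vector)

def l2_aggregate(vector):
    """L2 norm over the attribution vector of one token (direction-agnostic)."""
    return math.sqrt(sum(v * v for v in vector))

# Hypothetical per-dimension attributions for two tokens.
sensitive_token = [0.8, -0.7, 0.6, -0.7]  # large magnitudes, mixed signs
neutral_token = [0.05, 0.04, 0.06, 0.05]  # small magnitudes, same sign

for name, vec in [("sensitive", sensitive_token), ("neutral", neutral_token)]:
    print(name, round(mean_aggregate(vec), 3), round(l2_aggregate(vec), 3))
```

Under mean aggregation the large positive and negative components of the sensitive token cancel, so it scores below the neutral token; the L2 norm keeps the magnitude and ranks the sensitive token far higher.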

Limitations & Future Work

Validation is limited to hate speech detection; generalization to other classification tasks (e.g., hiring, loan approval) requires further experiments.
Explanation regularization relies on a predefined sensitive word list and cannot address implicit bias arising from proxy features.
Reasoning models and chain-of-thought (CoT) prompting are not covered—attributions for such models concentrate on intermediate reasoning steps rather than the input, necessitating a different analytical framework.
Computational cost varies substantially across the 16 explanation methods (KernelSHAP is extremely slow); efficiency trade-offs are not analyzed.

Comparison with Related Work

vs. Dimanov et al. (2020): They found that explanation regularization may simultaneously harm both performance and fairness. This paper demonstrates, through larger-scale experiments and more careful hyperparameter search (using a fairness metric rather than accuracy alone), that explanation regularization can effectively debias models.
vs. Slack et al. (2020) / Pruthi et al. (2020): They showed that models can be trained to conceal bias from explanations. This paper finds that explanations can still detect residual bias even after debiasing training, while confirming that cross-model explanation comparisons remain unreliable.
Implications for ASIDE/AlphaSteer: ASIDE structurally separates instructions from data, which may also affect attribution distributions. The analytical framework proposed here can be used to assess whether such safety methods simultaneously improve fairness.

Rating

Novelty: ⭐⭐⭐⭐ — First large-scale quantitative study; the three-dimensional design is systematic
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 16 methods × 5 models × 7 debiasing techniques × 2 datasets × 3 bias types; exceptionally comprehensive
Writing Quality: ⭐⭐⭐⭐ — Clear structure; RQ-driven narrative is logically coherent
Value: ⭐⭐⭐⭐ — Provides clear practical guidance on where explainability works and where it does not in fair AI applications