GRADIEND: Feature Learning within Neural Networks Exemplified through Biases¶
- Conference: ICLR 2026
- arXiv: 2502.01406
- Code: https://github.com/aieng-lab/gradiend
- Area: Social Computing
- Keywords: monosemantic feature learning, gender debiasing, gradient encoder-decoder, Transformer debiasing, interpretability
TL;DR¶
This paper proposes GRADIEND — a gradient-based encoder-decoder architecture that learns interpretable monosemantic features (exemplified by gender) from model gradients via a single bottleneck neuron. The framework not only identifies which weights encode a specific feature, but also directly modifies model weights through the decoder to mitigate bias. Combined with INLP, it achieves state-of-the-art debiasing results across all baseline models.
Background & Motivation¶
AI systems frequently exhibit and amplify social biases (e.g., gender bias), causing harmful effects in critical domains such as law, healthcare, and recruitment. Amazon's AI recruiting tool favoring male candidates is a well-known example.
Existing Transformer debiasing methods include:

- Counterfactual Data Augmentation (CDA): retraining after swapping gender-related words, at high cost
- Dropout Augmentation: increasing the dropout rate during pretraining
- INLP: Iterative Nullspace Projection, repeatedly training linear classifiers and projecting embeddings onto their nullspace
- SentDebias / SelfDebias: post-hoc methods adjusting embeddings or output distributions
Key Challenge: Existing unsupervised sparse autoencoder methods (e.g., Bricken et al., 2023) can extract interpretable features but require learning a large number of latent features before searching for meaningful explanations, with no guarantee that a desired feature (e.g., "gender") will emerge. Meanwhile, most debiasing methods are post-hoc and do not genuinely modify the internal representations of already-trained models.
Key Insight: The paper exploits the feature information encoded in model gradients — gradients naturally indicate which parameters need to be updated to change a given feature. By designing a minimal encoder-decoder structure, a monosemantic feature neuron with desired semantics can be learned from the difference between factual and counterfactual gradients.
Method¶
Overall Architecture¶
Input: a pretrained Transformer language model + template sentences containing names and pronouns. Output: (1) a scalar feature neuron \(h\) encoding gender information; (2) a decoding vector indicating how to modify model weights to adjust the degree of gender bias.
Key Designs¶
- Factual/Counterfactual Gradient Construction: For a template sentence such as "Alice explained the vision as best [MASK] could", MLM gradients are computed with the correct pronoun ("she") and the counterfactual pronoun ("he") as targets, respectively:
  - Factual gradient \(\nabla^+ W_m\): targeting the correct gendered pronoun
  - Counterfactual gradient \(\nabla^- W_m\): targeting the opposite gendered pronoun
  - Gradient difference \(\nabla^{\pm}W_m := \nabla^+ W_m - \nabla^- W_m\)

  The gradient difference cancels shared update components unrelated to gender, retaining only gender-relevant directions.
- GRADIEND Encoder-Decoder: A minimal architecture with a single hidden neuron as the bottleneck:

  \(\text{enc}(\nabla^+ W_m) = \tanh(W_e^T \cdot \nabla^+ W_m + b_e) =: h \in \mathbb{R}\)

  \(\text{dec}(h) = h \cdot W_d + b_d \approx \nabla^{\pm} W_m\)

  where \(W_e, W_d, b_d \in \mathbb{R}^n\) and \(b_e \in \mathbb{R}\), yielding only \(3n+1\) total parameters. The encoder maps the factual gradient to a scalar \(h\) (the gender factor); the decoder reconstructs the gradient difference from \(h\). The training objective is the MSE reconstruction loss.
- Gender Debiasing Application: Given a selected gender factor \(h\) and learning rate \(\alpha\), model weights are directly modified as:

  \(\tilde{W}_m := W_m + \alpha \cdot \text{dec}(h)\)

  When \(h\) and \(\alpha\) share the same sign, the model is steered toward male bias; opposite signs steer toward female bias. Values near \(h=0\) correspond to the debiasing direction, which is learned via the decoder bias \(b_d\) (since \(\text{dec}(0) = b_d\)).
- Three Composite Metrics:
  - BPI (Balanced Prediction Index): measures debiasing degree while accounting for language modeling capability, gender prediction balance, and prediction plausibility
  - FPI (Female Prediction Index): measures degree of female bias
  - MPI (Male Prediction Index): measures degree of male bias
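The pieces above can be sketched end to end. The following is a minimal illustrative sketch, not the authors' implementation: it uses a toy dimension \(n\), random vectors in place of real MLM gradients, and untrained GRADIEND parameters, just to make the shapes and the \(3n+1\) parameter count concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # flattened size of the model weights W_m (toy dimension)

# GRADIEND parameters: 3n + 1 in total
W_e = rng.normal(scale=0.1, size=n)   # encoder weights, R^n
b_e = 0.0                             # encoder bias, scalar
W_d = rng.normal(scale=0.1, size=n)   # decoder weights, R^n
b_d = np.zeros(n)                     # decoder bias, R^n

def enc(grad_fact):
    """Map a (flattened) factual gradient to the scalar gender factor h."""
    return np.tanh(W_e @ grad_fact + b_e)

def dec(h):
    """Reconstruct the gradient difference from the scalar h."""
    return h * W_d + b_d

# Toy factual/counterfactual gradients for one template sentence
grad_plus = rng.normal(size=n)        # target: correct pronoun ("she")
grad_minus = rng.normal(size=n)       # target: counterfactual pronoun ("he")
grad_diff = grad_plus - grad_minus    # shared gender-neutral components cancel

h = enc(grad_plus)
mse = np.mean((dec(h) - grad_diff) ** 2)  # the training objective

# Debiasing application: directly edit the model weights
W_m = rng.normal(size=n)
alpha = 0.5
W_m_debiased = W_m + alpha * dec(0.0)  # dec(0) = b_d, the debiasing direction
```

Note that the weight edit is a single additive update, so steering the model back (or toward the opposite gender) only requires changing \(h\) or the sign of \(\alpha\).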
Loss & Training¶
- Optimizer: Adam, learning rate 1e-5, weight decay 1e-2
- Batch size: 32, MSE loss
- Training steps: 23,653 (equal to the number of templates in the Genter training set)
- Evaluated every 250 steps using \(\text{Cor}_{\text{Genter}}^{\text{val}}\); best model selected
- At each step, a gender is randomly selected and a name is sampled from NAMExact
- Custom initialization of the decoder weights (with an initialization range based on the same dimension \(n\) as the encoder)
- The prediction layer is excluded from GRADIEND parameters, ensuring debiasing acts on the language model itself
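Since GRADIEND has only \(3n+1\) parameters, one training step on the MSE objective can be written out by hand. The sketch below is a toy version of such a step: it uses plain SGD with an exaggerated learning rate on a single fixed sample instead of the paper's Adam setup (lr 1e-5, weight decay 1e-2, batch size 32), purely to show the backward pass through the bottleneck.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
W_e, b_e = rng.normal(scale=0.1, size=n), 0.0
W_d, b_d = rng.normal(scale=0.1, size=n), np.zeros(n)
lr = 0.1  # toy rate; the paper uses Adam with lr 1e-5 and weight decay 1e-2

def step(grad_plus, grad_diff):
    """One SGD step on L = mean((dec(enc(grad_plus)) - grad_diff)^2)."""
    global W_e, b_e, W_d, b_d
    z = W_e @ grad_plus + b_e
    h = np.tanh(z)
    recon = h * W_d + b_d
    err = recon - grad_diff
    loss = np.mean(err ** 2)
    # Manual backprop through all 3n + 1 parameters
    d_recon = 2.0 * err / n
    dW_d = h * d_recon
    db_d = d_recon
    dh = W_d @ d_recon
    dz = dh * (1.0 - h ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW_e = dz * grad_plus
    db_e = dz
    W_e -= lr * dW_e; b_e -= lr * db_e
    W_d -= lr * dW_d; b_d -= lr * db_d
    return loss

g_plus = rng.normal(size=n)
g_diff = rng.normal(size=n)
losses = [step(g_plus, g_diff) for _ in range(200)]
```

On this single fixed sample the reconstruction loss drops quickly, since the decoder bias alone can absorb the target; the real training loop instead samples a fresh name and gender per step, so the encoder is forced to carry the gender signal.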
Key Experimental Results¶
Encoder Evaluation (H1: Learning Gender Features)¶
| Model | \(\text{Acc}_{\text{Genter}}\) | \(\text{Cor}_{\text{Genter}}\) | \(\text{Acc}_{\text{Enc}}\) | \(\text{Cor}_{\text{Enc}}\) |
|---|---|---|---|---|
| BERT-base | 1.000 | 0.957 | 0.612 | 0.669 |
| BERT-large | 1.000 | 0.908 | 0.578 | 0.616 |
| DistilBERT | 1.000 | 1.000 | 0.758 | 0.838 |
| RoBERTa | 1.000 | 1.000 | 0.909 | 0.935 |
All models achieve near-perfect separation of \(\pm 1\) on gender-relevant data; gender-neutral inputs are mapped to values close to 0.
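Assuming the natural reading of these metrics (accuracy as sign agreement between the encoded scalar and the \(\pm 1\) gender label, correlation as Pearson correlation between the two; the exact definitions are the paper's), they can be computed as follows. This is an illustrative sketch with toy values, not the evaluation code.

```python
import numpy as np

def encoder_metrics(h_values, labels):
    """Accuracy: fraction of encoded values whose sign matches the +/-1
    gender label. Correlation: Pearson correlation between h and label.
    (Assumed metric definitions for illustration.)"""
    h = np.asarray(h_values, dtype=float)
    y = np.asarray(labels, dtype=float)
    acc = float(np.mean(np.sign(h) == np.sign(y)))
    cor = float(np.corrcoef(h, y)[0, 1])
    return acc, cor

# Perfectly sign-separated toy encoder outputs, as in the Genter columns
acc, cor = encoder_metrics([0.9, 0.8, -0.95, -0.7], [1, 1, -1, -1])
```

An accuracy of 1.0 only requires correct signs, while the correlation additionally rewards mapping gendered inputs close to \(\pm 1\), which is why the two columns can diverge in the table above.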
Debiasing Comparison (H2: Modifying Gender Bias)¶
| Method | SS(%) | SEAT | CrowS(%) | LMS(%) | GLUE(%) |
|---|---|---|---|---|---|
| BERT-base | baseline | baseline | baseline | baseline | baseline |
| + GRADIEND-BPI | improved | — | — | maintained | maintained |
| + GRADIEND-BPI + INLP | significantly improved | improved | — | maintained | maintained |
| CDA / Dropout / INLP / SentDebias | partially improved | inconsistent | inconsistent | partially degraded | partially degraded |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| Different base models | All 4 models succeed | BERT/DistilBERT/RoBERTa all learn gender features |
| Gender factor \(h=0\) | BPI optimum approximated here | Decoder bias \(b_d\) automatically learns debiasing direction |
| Overfitting analysis | No significant difference across train/val/test names | Generalizes to unseen names |
| Generalization to woman/man | he/she generalize to woman/man | Cross-lexical generalization of gender concept |
Key Findings¶
- GRADIEND-BPI + INLP is the only combined method that achieves significant improvement on the SS metric across all baseline models, demonstrating strong robustness
- After introducing confidence intervals, the effectiveness of existing debiasing methods is far less conclusive than prior studies suggest
- RoBERTa unexpectedly exhibits a female bias (\(\mathbb{P}(F) > \mathbb{P}(M)\)), contrary to the commonly assumed male bias
- Steering toward a specific gender (FPI/MPI) is easier to achieve than debiasing (BPI)
- The effect of weight modification is approximately point-symmetric with respect to the signs of \(h\) and \(\alpha\)
Highlights & Insights¶
- Learning features with desired semantics directly from gradients: Unlike unsupervised sparse autoencoders (which learn many features and interpret them post-hoc), GRADIEND can learn "desired" interpretable features (e.g., gender), representing an important paradigm shift
- Minimal yet elegant design: Only a single scalar bottleneck neuron with \(3n+1\) parameters effectively encodes the complex concept of gender. The architectural simplicity facilitates analysis and understanding
- Introduction of bootstrap confidence intervals: Reveals a long-overlooked issue in the field — prior debiasing method comparisons lack statistical rigor
- Interesting finding regarding decoder bias: Even at \(h=0\) (no gender information), the decoder bias \(b_d\) itself learns an effective debiasing direction
Limitations & Future Work¶
- Only binary gender features are validated; generalization to continuous features (e.g., sentiment), multi-valued features (e.g., German articles der/die/das), or other types of bias (race, religion) requires further exploration
- Only tested on encoder-only models; applicability to generative Transformers (GPT-family) remains unverified
- Factual/counterfactual gradient construction relies on the MLM objective; adaptation for CLM tasks has yet to be established
- Debiasing trade-off: aggressive debiasing degrades language modeling capability, requiring careful selection within the \(h\) and \(\alpha\) search grid
- Gender is treated as binary, without accounting for non-binary gender identities
Related Work & Insights¶
- Monosemantic features / sparse autoencoders (Bricken et al., 2023; Templeton et al., 2024): unsupervised methods that decompose interpretable features from high-dimensional feature spaces; gender-bias-sensitive features have been identified in Claude 3
- INLP (Ravfogel et al., 2020): iterative nullspace projection debiasing; most effective when combined with GRADIEND
- Movement Pruning (Joniak & Aizawa, 2022): reducing gender bias through pruning
- Grad-CAM / Integrated Gradients: pioneering work in gradient-based interpretation
- Insight: gradients can serve not only for explanation (attribution) but also for encoding and manipulating semantic features within models. This notion of "gradients as feature representations" may hold significant value for model editing and machine unlearning
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Learning desired semantic features from gradients is a highly novel paradigm)
- Experimental Thoroughness: ⭐⭐⭐⭐ (4 base models, multiple metrics, but only gender is evaluated)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, rigorous formulations, comprehensive appendix)
- Value: ⭐⭐⭐⭐ (High proof-of-concept value, though practical scope awaits expansion)