Obliviator Reveals the Cost of Nonlinear Guardedness in Concept Erasure¶
Conference: NeurIPS 2025 · arXiv: 2603.07529 · Code: None · Area: Fairness / Concept Erasure · Keywords: concept erasure, HSIC, RKHS, fairness, nonlinear guardedness
TL;DR¶
This paper proposes Obliviator, a post-processing concept erasure method based on HSIC minimization in an RKHS, which iteratively deforms the feature space through a two-step optimization procedure. It is the first method to achieve complete guardedness against nonlinear adversaries, and it quantifies the utility-erasure trade-off that nonlinear guardedness incurs. Obliviator substantially outperforms existing methods across multiple PLMs and datasets.
Background & Motivation¶
Background: Pretrained language models (PLMs) widely encode sensitive demographic attributes, leading to biased and unfair predictions. Concept erasure aims to remove such information from representations while preserving task-relevant utility.
Limitations of Prior Work:
- Linear methods (INLP, R-LACE, LEACE, SAL) guard only against linear adversaries; nonlinear classifiers can still recover the sensitive attribute.
- Existing nonlinear methods (kSAL, KCE, AdS, FaRM, KRaM) attempt to handle nonlinear dependencies but fail to capture them fully, so they remain vulnerable to nonlinear adversaries.
- Even costly fine-tuning of PLMs (e.g., AdS, FaRM) yields incomplete erasure.
Key Challenge: The two objectives of concept erasure — removing sensitive attributes vs. preserving task utility — are fundamentally competing. Existing methods either fail to fully erase (i.e., are not immune to nonlinear adversaries) or sacrifice too much utility during erasure. More critically, the dynamics of the utility-erasure trade-off have never been studied.
Goal: (1) Achieve complete guardedness against nonlinear adversaries (i.e., true statistical independence); (2) Reveal and quantify the trade-off dynamics between utility and guardedness throughout the erasure process.
Key Insight: Adopting a functional-analytic perspective, the paper employs the Hilbert-Schmidt Independence Criterion (HSIC) in a reproducing kernel Hilbert space (RKHS) as a measure of nonlinear statistical dependence, formalizes erasure as a cascaded kernel optimization problem, and solves it via an iterative procedure.
Core Idea: Use HSIC to measure nonlinear statistical dependence, and iteratively deform the feature space through encoder HSIC minimization combined with RKHS eigendecomposition, achieving complete nonlinear concept erasure while preserving utility.
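For concreteness, here is a minimal NumPy sketch of the biased HSIC estimator \(\widehat{\mathrm{HSIC}} = \frac{1}{n^2}\,\text{trace}(\mathbf{K}_z \mathbf{H} \mathbf{K}_s \mathbf{H})\), the quantity Obliviator drives toward zero; the RBF kernel and bandwidth here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def rbf_kernel(A, sigma=1.0):
    """Pairwise RBF (Gaussian) kernel matrix over the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(Kz, Ks):
    """Biased HSIC estimate (1/n^2) trace(Kz H Ks H), H the centering matrix."""
    n = Kz.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(Kz @ H @ Ks @ H) / n ** 2

# Toy check: HSIC is near zero for an independent attribute and clearly
# positive when the attribute is a nonlinear function of the representation.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 8))
S_indep = rng.normal(size=(200, 1))
S_dep = np.sin(Z[:, :1]) + 0.1 * rng.normal(size=(200, 1))
print(hsic_biased(rbf_kernel(Z), rbf_kernel(S_indep)))  # ~0
print(hsic_biased(rbf_kernel(Z), rbf_kernel(S_dep)))    # substantially larger
```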
Method¶
Overall Architecture¶
Obliviator is a post-processing, iterative concept erasure method that alternates between two steps (see Figure 2); a schematic of the loop follows this list:
- Step 1 (Encoder Training): Train an encoder to minimize HSIC between the representation and the sensitive attribute, while maximizing HSIC with the task label / original representation to preserve utility.
- Step 2 (RKHS Disentanglement): Solve a constrained eigenvalue problem in RKHS to find directions that maximize the visibility of task-relevant information while remaining orthogonal to the sensitive attribute.
- Each iteration produces an intermediate representation, progressively deforming the feature space toward a state where the sensitive attribute is undetectable.
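A Python schematic of this alternation; `train_encoder` and `rkhs_disentangle` are hypothetical stubs standing in for Steps 1 and 2 (no official implementation is released), so this shows only the control flow, not the authors' code.

```python
import numpy as np

def train_encoder(X_i, X, Y, S):
    """Step 1 stub: would train an encoder to minimize HSIC(Z, S) while
    keeping HSIC(Z, X), HSIC(Z, X_i), and HSIC(Z, Y) high."""
    return X_i  # placeholder: identity encoder

def rkhs_disentangle(Z_i, X_i, X, Y, S, m=64):
    """Step 2 stub: would solve the null-space-constrained eigenproblem
    and project onto the top-m eigendirections."""
    return Z_i[:, :m]  # placeholder: truncation

def obliviator(X, Y, S, n_iters=5):
    X_i = X
    for _ in range(n_iters):
        Z_i = train_encoder(X_i, X, Y, S)          # Step 1: HSIC minimization
        X_i = rkhs_disentangle(Z_i, X_i, X, Y, S)  # Step 2: disentanglement
        # X_i feeds the next iteration's encoder; each pass deforms the
        # feature space a little further toward independence from S.
    return X_i

X = np.random.default_rng(0).normal(size=(256, 768))
print(obliviator(X, Y=None, S=None).shape)  # (256, 64)
```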
Key Designs¶
- Cascaded Kernel Problem (fundamental distinction from kSAL/KCE):
  - kSAL/KCE assume that mapping representations into an RKHS and performing linear erasure there suffices for nonlinear guardedness; in reality this only guards against linear adversaries within that RKHS and remains vulnerable to nonlinear adversaries in the same space.
  - Obliviator instead seeks representations \(\varepsilon(X)\) such that the sensitive attribute \(S\) remains undetectable even after a subsequent adversarial feature map \(\phi(\cdot)\). This leads to the cascaded kernel problem: \(\inf_\theta \sup_g \sup_f \mathbb{E}[\bar{g}(S) \bar{f}(\varepsilon(\theta; X))]\)
  - When the resulting HSIC \(\to 0\), this is equivalent to \(Z_\theta \perp\!\!\perp S\) (true statistical independence, given characteristic kernels).
- Step 1 (Encoder: Imposing Independence via RKHS): At iteration \(i\), the encoder \(\varepsilon^i\) is trained with the loss \(\inf_{\theta^i} \frac{1}{n^2} \text{trace}\Big(\mathbf{K}_{z^i} \mathbf{H} (\mathbf{K}_s - \tau_x \mathbf{K}_x - \tau_{x^i} \mathbf{K}_{x^i} - \tau_y \mathbf{K}_y) \mathbf{H}\Big)\), where \(\mathbf{K}_\bullet\) denotes the kernel matrix of the corresponding variable, \(\mathbf{H}\) is the centering matrix, and the \(\tau\) are balancing weights. A key innovation is the use of not only \(Y\) to explicitly preserve task information, but also \(X\) (the original representation) and \(X^i\) (the current iteration's input) as implicit proxies, since HSIC aggregates different "visibility patterns" with different weights depending on the reference variable.
- Step 2 (RKHS Disentanglement: an Eigenvalue Problem): A constrained optimization is solved to find RKHS functions that maximize the correlation of \(Z^i\) with \((X^i, X, Y)\), subject to zero correlation with \(S\): \(\mathbf{Q}^T \Big(\hat{\mathbf{C}}_{x^i z^i}^T \hat{\mathbf{C}}_{x^i z^i} + \tau_y \hat{\mathbf{C}}_{y z^i}^T \hat{\mathbf{C}}_{y z^i} + \tau_x \hat{\mathbf{C}}_{x z^i}^T \hat{\mathbf{C}}_{x z^i}\Big) \mathbf{Q} \mathbf{v} = \lambda \mathbf{v}\), where \(\mathbf{Q}\) is an orthonormal basis for the null space of \(\hat{\mathbf{C}}_{s z^i}\). The top \(m\) eigenvectors are selected to project the representation, which serves as input to the encoder in the next iteration; a finite-dimensional sketch of this step follows the list.
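A finite-dimensional sketch of Step 2, as referenced above: the RKHS cross-covariance operators \(\hat{\mathbf{C}}_{\bullet z^i}\) are replaced by empirical cross-covariance matrices over explicit features, which is a simplification of the paper's kernelized construction; all names, shapes, and \(\tau\) values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import null_space, eigh

def disentangle(Z, Xi, X, Y, S, m=32, tau_x=1.0, tau_y=1.0):
    """Null-space-constrained eigenproblem: keep the projected Z maximally
    correlated with (Xi, X, Y) under zero empirical correlation with S."""
    n = Z.shape[0]
    center = lambda A: A - A.mean(axis=0)
    cov = lambda A, B: center(A).T @ center(B) / n  # empirical cross-covariance

    Q = null_space(cov(S, Z))  # orthonormal basis of null(C_sz): no S leakage
    M = (cov(Xi, Z).T @ cov(Xi, Z)
         + tau_y * cov(Y, Z).T @ cov(Y, Z)
         + tau_x * cov(X, Z).T @ cov(X, Z))
    w, V = eigh(Q.T @ M @ Q)             # symmetric eigenproblem
    top = V[:, np.argsort(w)[::-1][:m]]  # top-m eigenvectors (m <= dim - rank(C_sz))
    return Z @ (Q @ top)                 # projected input for the next iteration

# Toy usage with random stand-ins for the representations and attributes.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 64))
out = disentangle(Z, Z, rng.normal(size=(500, 64)),
                  rng.normal(size=(500, 1)), rng.normal(size=(500, 1)))
print(out.shape)  # (500, 32)
```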
Loss & Training¶
- The coefficients \(\tau_x, \tau_{x^i}, \tau_y\) in the multi-objective loss control the balance between utility preservation and erasure; a training-step sketch of the Step 1 loss follows this list.
- Iterative rather than one-shot optimization: each step deforms the feature space incrementally, yielding more utility-preserving erasure.
- Both supervised (using \(Y\) labels) and unsupervised (using only \(X, X^i\) as proxies) modes are supported.
- Compatible with both frozen representations (post-hoc) and fine-tuned representations.
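A minimal PyTorch sketch of one Step 1 update with the trace loss above, using differentiable RBF kernels; the encoder architecture, bandwidth, and \(\tau\) values are illustrative assumptions rather than the paper's configuration.

```python
import torch

def rbf(A, sigma=1.0):
    """Differentiable RBF kernel matrix over the rows of A."""
    return torch.exp(-torch.cdist(A, A).pow(2) / (2 * sigma ** 2))

def centered(K):
    """H K H with the centering matrix H = I - 11^T / n."""
    n = K.shape[0]
    H = torch.eye(n) - torch.ones(n, n) / n
    return H @ K @ H

def step1_loss(Z, Ks, Kx, Kxi, Ky, tau_x=0.5, tau_xi=0.5, tau_y=1.0):
    """(1/n^2) trace(K_z H (K_s - tau_x K_x - tau_xi K_xi - tau_y K_y) H):
    drives HSIC(Z, S) down while keeping HSIC(Z, {X, X^i, Y}) up."""
    n = Z.shape[0]
    target = Ks - tau_x * Kx - tau_xi * Kxi - tau_y * Ky
    return torch.trace(rbf(Z) @ centered(target)) / n ** 2

n, d = 256, 768
X = torch.randn(n, d)        # original PLM representation
Xi = X.clone()               # input to the current iteration
Y = torch.randn(n, 1)        # task labels (one-hot in practice)
S = torch.randn(n, 1)        # sensitive attribute
Ks, Kx, Kxi, Ky = rbf(S), rbf(X), rbf(Xi), rbf(Y)  # fixed reference kernels

enc = torch.nn.Sequential(torch.nn.Linear(d, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 128))
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
loss = step1_loss(enc(Xi), Ks, Kx, Kxi, Ky)
opt.zero_grad(); loss.backward(); opt.step()
```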
Key Experimental Results¶
Main Results — BERT Finetuned + Supervised Erasure (Gap Between Baselines and Obliviator)¶
| Dataset | Task Y | Sensitive Attr. S | Best Baseline Residual S Acc. | Obliviator Residual S Acc. | Gap (pp) |
|---|---|---|---|---|---|
| Dial-Mention | Mention | Race | ~62% | ~50% (chance) | 12% |
| Dial-Sentiment | Sentiment | Race | ~63% | ~50% (chance) | 13% |
| Bias in Bios | Profession (28 classes) | Gender | ~64% | ~50% (chance) | 14% |
Cross-PLM Generalization — Frozen + Supervised on Bias in Bios¶
| PLM | Embedding Dim. | Obliviator Trade-off | INLP | FaRM | KRaM |
|---|---|---|---|---|---|
| BERT | 768 | Full erasure + high utility | Residual leakage | Residual leakage | Residual leakage |
| GPT-2 | 768 | Comparable to BERT | Degraded | Task accuracy collapses | Task accuracy collapses |
| LLaMA-3.2-1B | 2048 | Better than BERT | Unchanged | Unchanged | Improved but incomplete |
| DeepSeek-7B | 4096 | Substantially better than BERT | — | Unchanged | Accuracy degrades |
Ablation Study — Supervised vs. Unsupervised × Frozen vs. Fine-tuned¶
| Setting | Utility Preservation | Full Erasure | Trade-off Severity |
|---|---|---|---|
| Finetuned + Supervised | ✅ Best | ✅ | Minimal trade-off |
| Frozen + Supervised | ✅ Good | ✅ | Slight trade-off |
| Finetuned + Unsupervised | ✅ Good | ✅ | Moderate trade-off |
| Frozen + Unsupervised | ⚠️ Some degradation | ✅ | Most pronounced trade-off |
Fairness Metrics — Dial-Sentiment (DP & Gap_rms)¶
| PLM | Erasure Mode | DP (lower is better) | Gap_rms (lower is better) |
|---|---|---|---|
| BERT | Supervised | Near 0 | Near 0 |
| BERT | Unsupervised | Low | Low |
| DeepSeek | Supervised | Lower (better disentanglement) | Lower |
| DeepSeek | Unsupervised | Low | Low |
Key Findings¶
- Obliviator is the only method capable of reducing nonlinear adversary accuracy on the sensitive attribute to chance level (true statistical independence).
- Stronger PLMs (DeepSeek > LLaMA > GPT-2 ≈ BERT) yield better-disentangled representations, which Obliviator directly leverages for more utility-preserving erasure.
- Supervised erasure (using \(Y\) labels) retains more utility than unsupervised erasure, since \(Y\) provides an explicit proxy for task-relevant patterns.
- Skewed data distributions substantially worsen the trade-off (80% skew incurs greater utility loss than the 50% balanced case), revealing the dependence of post-processing erasure methods on data representativeness.
- Even fine-tuning PLMs (e.g., AdS) fails to achieve complete guardedness against nonlinear adversaries — Obliviator achieves complete erasure in a purely post-processing setting.
Highlights & Insights¶
- Solid Theoretical Foundations: The derivation proceeds coherently from linear covariance to statistical independence in a nonlinear RKHS; the guarantee that HSIC = 0 implies independence (for characteristic kernels) grounds the method's claim of complete erasure.
- Iterative Rather Than One-Shot: The design of progressively deforming the feature space is elegant — it simultaneously produces utility-erasure trade-off curves for analysis and substantively improves erasure quality.
- Novel RKHS Disentanglement Step: Solving an eigenvalue problem under null-space constraints elegantly unifies "preventing additional \(S\) leakage" and "re-aligning \(Y\) information" into a single optimization.
- Generalization Finding: The chain — stronger PLM → better disentanglement → better trade-off — is insightful, suggesting that the observed utility-erasure trade-off may serve as a diagnostic indicator of representational quality.
Limitations & Future Work¶
- The iterative procedure requires multiple rounds of encoder training and eigendecomposition, incurring substantial computational overhead, especially for the 4096-dimensional DeepSeek embeddings.
- The choice of kernel function (e.g., RBF bandwidth) affects results, but sensitivity analysis is not sufficiently discussed.
- Validation is limited to NLP tasks (text classification, sentiment, occupation); visual or multimodal settings are not explored.
- As a post-processing method that does not modify PLM parameters, utility loss may be unavoidable when task information and sensitive attributes are highly entangled in the original representations.
Related Work & Insights¶
- vs. INLP/R-LACE/LEACE: These linear methods only guard against linear adversaries; Obliviator achieves complete erasure precisely in the scenarios where they fail entirely (nonlinear probing).
- vs. kSAL/KCE: Both are kernel-based, but kSAL performs linear erasure only within RKHS, guarding against linear adversaries in that specific RKHS only; Obliviator addresses this fundamental limitation through the cascaded kernel formalization and iterative optimization.
- vs. AdS/FaRM: These methods require fine-tuning PLMs at higher computational cost, yet still fail to achieve complete nonlinear erasure; Obliviator achieves more thorough erasure as a post-processing method.
- vs. KRaM: Also a post-processing kernel method, but KRaM's rate-distortion maximization framework likewise fails to achieve complete erasure and causes task accuracy collapse on GPT-2.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The cascaded kernel formalization exposes the fundamental flaw of methods such as kSAL; the two-step iterative framework is elegant and original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 PLMs × 3 datasets × 4 settings × trade-off curves × fairness metrics × skew analysis; exceptionally comprehensive.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear, but the dense notation presents a non-trivial barrier on first reading.
- Value: ⭐⭐⭐⭐⭐ — The first method to achieve complete nonlinear concept erasure; the trade-off analysis framework provides an important benchmark for future work.