Fair Representation Learning with Controllable High Confidence Guarantees via Adversarial Inference
Conference: NeurIPS 2025 | arXiv: 2510.21017 | Code: https://github.com/JamesLuoyh/FRG | Area: AI Safety / Fairness | Keywords: fair representation learning, high-confidence guarantees, adversarial inference, statistical testing, demographic parity
TL;DR
This paper proposes FRG (Fair Representation learning with high-confidence Guarantees), the first fair representation learning framework that allows users to specify a fairness threshold \(\varepsilon\) and confidence level \(1-\delta\). By combining VAE-based candidate selection, adversarial inference that maximizes covariance, and a Student's t-test to construct a high-confidence upper bound, FRG guarantees that \(\Delta_{DP} \leq \varepsilon\) holds with probability at least \(1-\delta\) for any downstream model and task.
Background & Motivation
Background: Fair representation learning (FRL) aims to learn representations on the data producer's side that remove sensitive-attribute information, so that any downstream consumer using these representations will not produce unfair predictions. Methods such as LAFTR, ICVAE, LMIFR, CFAIR, and FCRL estimate upper bounds on \(\Delta_{DP}\) from training data, but these estimates carry no guarantee that they hold on unseen test sets.
Limitations of Prior Work: (a) Existing methods' fairness guarantees are based on training/validation set estimates, which may fail on test sets due to overfitting — at least 10% of trials across 6 baseline methods violate fairness constraints; (b) FARE provides high-confidence certificates but does not support user-specified \(\varepsilon\) and \(\delta\), and the certificates are often very loose (typically several times the expected \(\varepsilon\)); (c) No existing framework allows users to explicitly control both the fairness threshold and the confidence level.
Key Challenge: In representation learning, the data producer does not know the downstream model — including potentially adversarial downstream models. Fairness must be guaranteed even under worst-case downstream usage.
Goal: To construct a "data-producer-side insurance" — a framework that provides \(1-\delta\) confidence guarantees at representation learning time, ensuring \(\Delta_{DP} \leq \varepsilon\) for any downstream task and model.
Key Insight: Establish an equivalence between \(\Delta_{DP}\) and \(|\text{Cov}(\hat{Y}, S)|\) (Theorem 5.2), thereby transforming fairness verification into a statistical hypothesis testing problem.
Core Idea: Train an adversarial model to find the worst-case downstream model that maximizes covariance; construct a \(1-\delta\) upper bound on \(g_\varepsilon(\phi)\) via a t-test on held-out data; certify fairness if the upper bound \(\leq 0\), otherwise honestly return "No Solution Found."
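For binary \(S\) and \(\hat{Y}\), the Theorem 5.2 identity is easy to check numerically. The NumPy snippet below (a toy check with a hypothetical, deliberately unfair predictor, not code from the FRG repository) computes the demographic parity gap and the covariance form on the same samples and shows they coincide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary sensitive attribute S and downstream predictions Y_hat.
s = rng.integers(0, 2, size=10_000)
p = np.where(s == 1, 0.7, 0.5)               # deliberately unfair acceptance rates
y_hat = (rng.random(10_000) < p).astype(int)

# Demographic parity gap: |P(Y_hat = 1 | S = 1) - P(Y_hat = 1 | S = 0)|.
dp_gap = abs(y_hat[s == 1].mean() - y_hat[s == 0].mean())

# Covariance form from Theorem 5.2: |Cov(Y_hat, S)| / Var(S).
cov = ((y_hat - y_hat.mean()) * (s - s.mean())).mean()
cov_form = abs(cov) / s.var()

print(f"Delta_DP     = {dp_gap:.6f}")
print(f"|Cov|/Var(S) = {cov_form:.6f}")       # matches dp_gap up to float rounding
```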
Method
Overall Architecture
FRG partitions the data into \(D_c\) (for candidate selection) and \(D_f\) (for fairness testing) and consists of three components:

1. Candidate Selection: a VAE is optimized on \(D_c\) to search for representation models \(\phi_c\) that are likely to pass the fairness test.
2. Adversarial Inference: an adversarial model is trained to predict the sensitive attribute from the representation, maximizing \(|\text{Cov}(\hat{Y}, S)|\) to approximate the worst-case downstream model.
3. Fairness Test: a \(1-\delta\) confidence upper bound on \(\Delta_{DP}\) is constructed from the adversarial model's predictions on \(D_f\); if the bound is \(\leq \varepsilon\), the model is returned, otherwise NSF is reported.
Key Designs
- Equivalence between \(\Delta_{DP}\) and Covariance (Theorem 5.2):
  - Function: Transforms the fairness metric into a statistic amenable to optimization and testing.
  - Mechanism: When \(S, \hat{Y} \in \{0,1\}\), \(\Delta_{DP}(\tau, \phi) = |\text{Cov}(\hat{Y}, S)| / \text{Var}(S)\). Hence the worst-case downstream model is \(\tau^*_{adv} = \arg\max_\tau |\text{Cov}(\hat{Y}, S)|\).
  - Design Motivation: Directly optimizing \(\Delta_{DP}\) would require enumerating all downstream models; the equivalence reformulates the problem as maximizing a covariance, which is tractable with standard gradient optimization.
- Adversarial Inference:
  - Function: Trains a "worst-case" downstream model to detect residual sensitive-attribute information in the representation.
  - Mechanism: \(\tau_{adv}\) is trained on \(D_c\) via gradient optimization to maximize \(|\text{Cov}(\hat{Y}, S)|\), approximating the optimal adversary \(\tau^*_{adv}\).
  - Design Motivation: Unlike the joint adversarial training in LAFTR and related methods, FRG trains the adversary independently, which is more reliable and avoids the instability of joint optimization.
- Fairness Test:
  - Function: Constructs a \(1-\delta\) confidence upper bound on \(g_\varepsilon(\phi) = \sup_\tau \Delta_{DP}(\tau, \phi) - \varepsilon\) using held-out data \(D_f\).
  - Mechanism: The adversarial model's predictions on \(D_f\) are used to estimate \(\Pr(\hat{Y}=1 \mid S=s)\), from which \(m\) unbiased estimates are constructed; a Student's t-test then yields a \(1-\delta\) confidence interval \([c_l, c_u]\) and the upper bound \(U_\varepsilon = \max(|c_l|, |c_u|) - \varepsilon\). The test passes if \(U_\varepsilon \leq 0\) (see the sketch after this list).
  - Design Motivation: Performing the statistical test on \(D_f\), which is never used during training, avoids overfitting the bound; the Student's t-test is a well-established tool in the scientific community.
- Honest NSF Mechanism:
  - Function: Returns "No Solution Found" when \(\varepsilon\)-fairness cannot be guaranteed at confidence \(1-\delta\), rather than making a false claim.
  - Design Motivation: Avoids the multiple comparisons problem: each candidate is tested only once, and a failed test directly triggers NSF.
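A minimal sketch of this fairness test, assuming the \(m\) unbiased estimates are obtained by splitting \(D_f\) into \(m\) disjoint blocks and computing one \(\text{Cov}(\hat{Y}, S)/\text{Var}(S)\) estimate per block (the paper's exact estimator construction may differ); `fairness_upper_bound`, the block scheme, and the synthetic data are illustrative rather than the repository's API:

```python
import numpy as np
from scipy import stats

def fairness_upper_bound(y_hat, s, eps, delta, m=20):
    """1-delta upper bound U_eps on Delta_DP - eps from adversary predictions on D_f."""
    ests = []
    for yb, sb in zip(np.array_split(y_hat, m), np.array_split(s, m)):
        cov = ((yb - yb.mean()) * (sb - sb.mean())).mean()
        ests.append(cov / sb.var())            # per-block estimate of Cov(Y_hat, S)/Var(S)
    ests = np.asarray(ests)

    mean = ests.mean()
    sem = ests.std(ddof=1) / np.sqrt(m)
    t_crit = stats.t.ppf(1 - delta / 2, df=m - 1)
    c_l, c_u = mean - t_crit * sem, mean + t_crit * sem   # 1-delta Student's t interval
    return max(abs(c_l), abs(c_u)) - eps                  # certify only if U_eps <= 0


# Toy usage with synthetic held-out data standing in for D_f.
rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=5_000)
y_hat = (rng.random(5_000) < 0.5 + 0.01 * s).astype(float)   # nearly fair predictions
u = fairness_upper_bound(y_hat, s, eps=0.08, delta=0.1)
print(f"U_eps = {u:.4f} ->", "certify" if u <= 0 else "No Solution Found")
```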
Loss & Training
- Candidate selection uses a VAE objective with a Lagrange multiplier enforcing \(\hat{U}_\varepsilon(\phi, D_c) \leq 0\): \(\mathcal{L} = -\text{ELBO} + \lambda \hat{U}_\varepsilon(\phi, D_c)\) (see the training sketch after this list).
- An inflation factor \(\alpha \geq 1\) enlarges the confidence interval during candidate selection, reducing cases where a candidate passes selection but fails the fairness test.
- The adversary performs only \(t \in [1,10]\) gradient update steps per round for efficiency.
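A minimal PyTorch sketch of this alternating training scheme on \(D_c\), assuming a toy linear VAE and using \(\alpha \cdot |\text{Cov}(\hat{Y}, S)| / \text{Var}(S) - \varepsilon\) as a crude stand-in for the inflated candidate-selection bound \(\hat{U}_\varepsilon(\phi, D_c)\); the architectures, learning rates, and the fixed \(\lambda\) are illustrative, not the paper's actual settings:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy batch from D_c: features X and a binary sensitive attribute S (illustrative sizes).
n, x_dim, z_dim = 512, 20, 8
x = torch.randn(n, x_dim)
s = torch.randint(0, 2, (n, 1)).float()

encoder = nn.Linear(x_dim, 2 * z_dim)                          # outputs [mu, log-variance]
decoder = nn.Linear(z_dim, x_dim)
adversary = nn.Sequential(nn.Linear(z_dim, 1), nn.Sigmoid())   # tau_adv: predicts S from z

opt_vae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)

lam, alpha, eps = 1.0, 2.0, 0.08    # fixed Lagrange multiplier, inflation factor, threshold

def encode(x):
    mu, logvar = encoder(x).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterization trick
    return z, mu, logvar

def cov_ratio(y_hat, s):
    cov = ((y_hat - y_hat.mean()) * (s - s.mean())).mean()
    return cov.abs() / s.var()                                  # |Cov(Y_hat, S)| / Var(S)

for step in range(200):
    # Adversary round: a few gradient steps (t in [1, 10]) increasing the covariance ratio,
    # i.e. maximizing |Cov(Y_hat, S)| since Var(S) is a constant scale.
    for _ in range(5):
        z, _, _ = encode(x)
        opt_adv.zero_grad()
        (-cov_ratio(adversary(z.detach()), s)).backward()
        opt_adv.step()

    # Representation round: minimize -ELBO + lambda * U_hat_eps.
    z, mu, logvar = encode(x)
    recon = ((decoder(z) - x) ** 2).mean()                      # Gaussian reconstruction term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()  # KL to a standard normal
    u_hat = alpha * cov_ratio(adversary(z), s) - eps            # inflated fairness surrogate
    loss = recon + kl + lam * u_hat
    opt_vae.zero_grad()
    loss.backward()
    opt_vae.step()
```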
Key Experimental Results
Main Results
| Method | Adult Violation Rate | Income Violation Rate | Health Violation Rate | Adversarial Task Violation Rate |
|---|---|---|---|---|
| FRG | <10% | <10% | <10% | <10% |
| LAFTR | >10% | >10% | >10% | Significantly violated |
| LMIFR | >10% | — | >10% | Significantly violated |
| FCRL | >10% | >10% | >10% | Significantly violated |
| FARE | <10% | <10% | <10% | <10% |
| ICVAE | Partially <10% | >10% | Partially <10% | Significantly violated |
Under \(\varepsilon \in \{0.04, 0.08, 0.12, 0.16\}\) and \(\delta=0.1\): FRG consistently satisfies constraints (violation rate <10%) with AUC comparable to or better than SOTA.
Ablation Study
| Configuration | Description |
|---|---|
| FRG vs. FRG_supervised | Supervised FRG yields marginal improvement — stronger representations may expose more sensitive information |
| Varying \(\delta\) | \(\delta \in \{0.01, 0.05, 0.1, 0.15\}\); larger \(\delta\) marginally improves AUC at the cost of confidence |
| Varying \(\alpha\) | The inflation factor affects the NSF rate; too small → frequent NSF; too large → loose confidence intervals |
Key Findings
- All 6 baseline methods fail to consistently satisfy fairness constraints: particularly under small \(\varepsilon\) and adversarial tasks, violation rates for LAFTR/LMIFR/FCRL/CFAIR far exceed 10%, demonstrating that training-set upper bound estimates do not hold on test sets.
- FARE controls violation rates but produces extremely loose certificates: at \(\varepsilon=0.04\), FARE's actual certified value may be 0.12–0.16, preventing fine-grained control.
- FRG's NSF rate is reasonable: at \(\varepsilon \geq 0.08\), the solution return rate is \(\geq 90\%\); NSF rate is higher only for very small \(\varepsilon\) — an honest trade-off.
- Adversarial tasks provide the most stringent evaluation: all methods without high-confidence guarantees fail markedly on adversarial tasks, while FRG maintains fairness even against adversarial downstream models.
- Unsupervised FRG is more stable in transfer learning: supervised methods (LAFTR/CFAIR/FARE) suffer significant AUC drops on non-target tasks, whereas unsupervised FRG remains consistent.
Highlights & Insights
- First user-controllable high-confidence fair representation learning framework: allowing users to explicitly specify \(\varepsilon\) and \(\delta\) represents a qualitative shift from "empirical fairness" to "certified fairness," with direct implications for legal compliance (e.g., NYC Local Law 144).
- The equivalence \(\Delta_{DP} = |\text{Cov}(\hat{Y}, S)|/\text{Var}(S)\) is particularly elegant: it reduces fairness verification to a standard statistical quantity, enabling direct application of mature tools such as the t-test.
- The "honest" design philosophy: the NSF mechanism — preferring silence over false guarantees when certification is infeasible — is a design principle worth adopting broadly in trustworthy AI.
- Independently trained adversary: more stable and reliable than joint adversarial training, and theoretically closer to the optimal adversary.
Limitations & Future Work
- Assumes near-optimal adversary approximation: the trained \(\tau_{adv}\) is only an approximation; if it deviates significantly from the optimum, the guarantee may not hold.
- Binary sensitive attributes only (main paper); multi-valued extension is discussed in the appendix with limited experiments.
- High NSF rate under small \(\varepsilon\) and \(\delta\): practical deployment may require large datasets to find solutions under strict constraints.
- t-test assumes normality: sufficiently large samples are required for the Central Limit Theorem to apply.
- Distribution shift not considered: guarantees hold only under the i.i.d. assumption; robustness to covariate or sensitive attribute shift warrants further investigation.
Related Work & Insights
- vs. LAFTR/LMIFR/CFAIR: These methods estimate fairness upper bounds on training data, but test-set violation rates exceed 10% — the absence of statistical guarantees is the fundamental issue.
- vs. FARE: FARE also provides high-confidence guarantees, but its certificates are very loose (often several times \(\varepsilon\)) and do not support user-specified thresholds; FRG achieves tighter upper bounds via adversarial inference.
- vs. Seldonian algorithms: FRG belongs to the Seldonian algorithm family — the "safety first, performance second" framework is generalizable to other guarantees such as privacy and robustness.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First controllable high-confidence fair representation learning framework with outstanding theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ 3 datasets × 6 baselines × multiple \(\varepsilon\)/\(\delta\) configurations, including adversarial task and transfer learning evaluations.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is precise and theoretical derivations are clear, though some details are deferred to the appendix.
- Value: ⭐⭐⭐⭐⭐ Significant implications for trustworthy AI and fairness compliance; the framework is extensible to other safety guarantees.