Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs
Conference: ICML 2025
arXiv: 2505.23996
Code: https://github.com/apple/ml-synthbias
Area: AI Safety / LLM Fairness
Keywords: Fairness Evaluation, Uncertainty-aware, Gender Bias, Coreference Resolution, Benchmark Dataset
TL;DR
Proposes an uncertainty-aware fairness metric UCerF and a large-scale synthetic dataset SynthBias to evaluate the gender-occupation bias of LLMs at a finer grain by jointly considering model prediction correctness and confidence.
Background & Motivation
Background: LLM fairness evaluation primarily relies on accuracy-based discrete metrics such as Equalized Odds (EO), which measure the degree of bias by comparing accuracy differences across various demographic groups. WinoBias is the most commonly used gender-occupation coreference resolution dataset in this field.
Limitations of Prior Work: Accuracy-based metrics only check whether predictions are correct and ignore differences in model confidence. Two models can reach identical accuracy even though one makes biased predictions with high confidence while the other answers with low confidence. EO deems them equally fair, although their actual behavior differs significantly. Furthermore, WinoBias is small (only 3,168 instances), lacks diversity, and relies on outdated syntactic cues, making it unsuitable for evaluating modern LLMs.
Key Challenge: The discrete nature of traditional fairness metrics (where each sample only contributes a binary "correct/incorrect") fails to capture implicit biases in model decision-making. A model can be "confidently correct" on one group but "barely guessing correctly" on another, an asymmetry entirely overlooked by accuracy-based metrics.
Goal: (1) Design a fairness metric capable of jointly considering prediction correctness and uncertainty; (2) Construct a larger, more diverse fairness evaluation dataset tailored for modern LLMs.
Key Insight: The authors observe that under random sampling decoding (non-greedy), model confidence directly affects the degree of bias in the output. A highly confident yet heavily biased model poses a much greater risk in real-world applications than an uncertain model.
Core Idea: Unify the model's correctness and uncertainty into a single continuous scale of "behavioral desirability" (LSBP), and then define fairness as the distance between different groups on this scale.
Method
Overall Architecture
The UCerF framework consists of two parts: (1) UCerF, an uncertainty-aware fairness metric that maps model behavior to a desirability scale \([-1, 1]\) before computing group differences; (2) SynthBias (31,756 samples), a synthetic dataset generated by GPT-4o and verified by humans, used for fairness evaluation in gender-occupation coreference resolution tasks.
Key Designs
- Linear Scale of Behavioral Desirability (LSBP):
    - Function: Unifies prediction correctness and model uncertainty into a single continuous scale.
    - Mechanism: Uses perplexity to estimate model uncertainty and normalizes it into a confidence \(c(x_i) = (k - f_{\text{perplexity}}(x_i; G)) / (k-1) \in [0,1]\). Here \(k\) acts as the upper bound of the perplexity, so a perplexity of 1 maps to confidence 1 and a perplexity of \(k\) maps to confidence 0. A sign is then assigned based on whether the prediction is correct: \(D(x_i) = c(x_i)\) when correct, and \(D(x_i) = -c(x_i)\) when incorrect, yielding a desirability score \(D \in [-1, 1]\).
    - Design Motivation: The binary nature of traditional metrics (0/1) discards the distinction between "low-confidence accurate hits" and "high-confidence biased predictions". LSBP assigns the highest scores to "confident correctness", the lowest scores to "confident errors", and places "uncertainty" in the middle.
- UCerF Fairness Metric:
    - Function: Quantifies the fairness gap between two groups based on LSBP.
    - Mechanism: For each minimal pair sample (differing only by pronouns), computes \(U(x_i) = 1 - \frac{1}{2}|D(x_i^A) - D(x_i^B)|\), and then calculates the expectation over the entire dataset \(U(\mathbf{X}) = \mathbb{E}[U(x_i)] \in [0,1]\). A value of 1 represents perfect fairness, while 0 represents complete unfairness. A group-wise variant \(U_\text{group}\) is also proposed, which substitutes TPR/FPR with TPD/FPD to align with EO.
    - Design Motivation: Compared to EO, which only measures accuracy discrepancies, UCerF distinguishes scenarios such as "confident correct + confident incorrect" (high bias) vs "confident correct + uncertain correct" (moderate bias), providing a more granular fairness evaluation.
- SynthBias Dataset:
    - Function: Provides a large-scale, diverse gender-occupation coreference resolution evaluation dataset.
    - Mechanism: Generates candidate samples using GPT-4o, covering all gender stereotype pairs across 40 occupations. It redefines type1 (ambiguous even for humans) and type2 (resolvable by humans), replacing the old syntactic cue-based classification. Following automatic rule filtering and crowdsourced annotation verification (qualification test \(\ge 80\%\), dynamic coverage strategy to reach \(75\%\) consensus), it ultimately yields \(14,132 + 17,624 = 31,756\) validated instances.
    - Design Motivation: WinoBias is small (only 3,168 instances), and its fixed templates limit diversity. Furthermore, its type1/type2 definitions based on syntactic cues are no longer suitable for modern LLMs that understand semantics.
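The LSBP and UCerF definitions above can be sketched in a few lines of Python. This is a minimal illustration that takes the confidence values as given inputs. The function names and the example pairs are hypothetical and do not come from the paper or its released code.

```python
from statistics import mean


def desirability(confidence: float, correct: bool) -> float:
    """LSBP score D in [-1, 1]: +c for a correct prediction, -c for an incorrect one."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return confidence if correct else -confidence


def ucerf_pair(d_a: float, d_b: float) -> float:
    """Per-pair fairness U = 1 - |D_A - D_B| / 2, which lies in [0, 1]."""
    return 1.0 - 0.5 * abs(d_a - d_b)


def ucerf(pairs) -> float:
    """Dataset-level score: mean of the per-pair scores.

    `pairs` holds ((confidence_a, correct_a), (confidence_b, correct_b)) tuples,
    one per minimal pair (hypothetical input layout).
    """
    return mean(ucerf_pair(desirability(*a), desirability(*b)) for a, b in pairs)


# Made-up minimal pairs, for illustration only.
pairs = [
    ((0.9, True), (0.9, False)),  # confident correct vs. confident incorrect
    ((0.9, True), (0.2, True)),   # confident correct vs. uncertain correct
    ((0.8, True), (0.8, True)),   # identical behavior on both variants
]
for a, b in pairs:
    print(round(ucerf_pair(desirability(*a), desirability(*b)), 2))
print(round(ucerf(pairs), 3))
```

With these made-up inputs, the first pair scores about 0.1, the second about 0.65, and the third 1.0, matching the ordering "high bias", "moderate bias", and "no bias" described in the design motivation.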
Loss & Training
UCerF is an evaluation metric and does not involve training. The evaluation utilizes perplexity as the uncertainty estimator, which can be substituted with other estimators.
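One possible reading of the perplexity-based confidence is sketched below. It assumes that perplexity is the exponential of the Shannon entropy of the model's normalized probabilities over the candidate answers, and that \(k\) is the number of candidates. Both are assumptions made for illustration, since this note does not spell them out, and the function names are hypothetical.

```python
import math


def perplexity(probs):
    """Perplexity of a discrete distribution: exp of its Shannon entropy (in nats)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


def confidence_from_perplexity(probs):
    """Map perplexity from [1, k] to confidence in [0, 1], assuming k = len(probs)."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two candidate answers")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return (k - perplexity(probs)) / (k - 1)


# Made-up answer distributions over two candidate referents.
print(confidence_from_perplexity([0.5, 0.5]))  # maximally uncertain
print(confidence_from_perplexity([0.9, 0.1]))  # fairly confident
print(confidence_from_perplexity([1.0, 0.0]))  # fully confident
```

Swapping in another uncertainty estimator would mean replacing `perplexity` and adapting the normalization to that estimator's range.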
Key Experimental Results
Main Results
| Model | Accuracy (WinoBias) | EO (WinoBias) | UCerF (WinoBias) | Accuracy (SynthBias) | UCerF (SynthBias) |
|---|---|---|---|---|---|
| Llama-3-70B-Inst | High (Top 2) | Rank 1 | Rank 1 | Declined | Rank 3 |
| Mixtral-8x7B-Inst | High (Top 3) | Rank 2 | Rank 2 | Declined | Rank 4 |
| Mistral-7B-Inst | 4th | 5th | 8th | Declined | Declined significantly |
| Pythia-1B | 10th | Medium | 5th | Stable | Stable |
The accuracy of models on SynthBias is generally lower than on WinoBias, indicating that the dataset is more challenging.
Ablation Study
| Analysis Dimension | Key Metrics | Description |
|---|---|---|
| UCerF vs EO Rank Discrepancy | Mistral-7B: EO 5th \(\rightarrow\) UCerF 8th | High-confidence biased predictions are penalized by UCerF |
| Uncertain Models | Pythia-1B: Acc 10th but UCerF 5th | Cautious predictions are deemed more fair by UCerF |
| WinoBias \(\rightarrow\) SynthBias Rank Changes | Llama-3-70B: 1st \(\rightarrow\) 3rd | SynthBias exposes biases undetected by WinoBias |
| Type1 Tasks | Llama-3-70B: WinoBias 3rd \(\rightarrow\) SynthBias 8th | Internal biases are exposed under semantic ambiguity |
| MCQ Tasks | UCerF improved for all models | Restricted options make predictions more uniform |
Key Findings
- Mistral-7B performs acceptably in terms of accuracy and EO, but UCerF reveals its overconfidence in biased predictions, manifesting as an "implicit bias".
- SynthBias is more challenging than WinoBias (accuracies generally drop) and can expose more discrepancies in fairness. For instance, Llama-3-70B ranks first on WinoBias but drops to third on SynthBias.
- In type1 tasks, where no correct answer exists, SynthBias's human-verified ambiguity is more rigorous than WinoBias's cue-based labels. This separates intrinsic bias from semantic comprehension, so the bias can be evaluated on its own.
Highlights & Insights
- Linear Scale of Behavioral Desirability (LSBP) unifies correctness and uncertainty into a single continuous dimension. This is an elegant and general idea, easily transferable to any scenario requiring the joint evaluation of correctness and certainty (e.g., fairness in computer-aided medical diagnosis).
- UCerF Exposes Implicit Bias: Even when models answer correctly, an asymmetry in confidence across groups still reflects bias, which EO cannot capture.
- The data generation and human verification workflow of SynthBias (GPT-4o generation \(\rightarrow\) rule filtering \(\rightarrow\) crowdsourced annotation \(\rightarrow\) strict consensus filtering) serves as an excellent paradigm for constructing high-quality synthetic evaluation datasets.
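The final consensus-filtering step of this workflow can be sketched as follows. The sketch assumes that "consensus" means the share of annotators who chose the majority label and that instances below the 75% threshold are discarded. The note states only the threshold, so the voting rule, the function names, and the example data are illustrative assumptions.

```python
from collections import Counter


def consensus_label(annotations, threshold=0.75):
    """Return the majority label if its vote share reaches `threshold`, else None."""
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes / len(annotations) >= threshold else None


def filter_by_consensus(candidates, threshold=0.75):
    """Keep (sentence, label) for candidates whose annotations reach consensus."""
    kept = []
    for sentence, annotations in candidates:
        label = consensus_label(annotations, threshold)
        if label is not None:
            kept.append((sentence, label))
    return kept


# Made-up candidates: a placeholder sentence id plus the referent each annotator chose.
candidates = [
    ("sentence-1", ["developer", "developer", "developer", "designer"]),  # 75% agree
    ("sentence-2", ["nurse", "doctor", "nurse", "doctor"]),               # 50% agree
    ("sentence-3", ["ambiguous", "ambiguous", "ambiguous", "ambiguous"]), # 100% agree
]
print(filter_by_consensus(candidates))
```

In this toy run, the first and third candidates are kept and the second is dropped because its annotators are split evenly.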
Limitations & Future Work
- Focuses only on binary gender pronouns (his/her), without addressing gender-neutral pronouns like they/them.
- Occupational bias is grounded in data from the US Bureau of Labor Statistics, which limits its geographic scope.
- Uncertainty estimation relies solely on perplexity; more sophisticated estimators (such as MC Dropout or semantic entropy) might lead to different conclusions.
- The dataset is generated by GPT-4o, which might inherit its own biases.
- The applicability of UCerF to other types of bias (e.g., race, age) and other tasks (e.g., text generation, QA) has not been explored.
Related Work & Insights
- vs WinoBias: WinoBias is template-based, small-scale, and relies on syntactic cues. Conversely, SynthBias is LLM-generated, roughly 10 times larger (31,756 vs. 3,168 instances), and based on semantic ambiguity, making it better suited for evaluating modern LLMs.
- vs Equalized Odds: EO is discrete (binary correct/incorrect), whereas UCerF is continuous (considering confidence), enabling it to detect "high-confidence bias" and "low-confidence fairness" overlooked by EO.
- vs Kuzucu et al. (2023): Prior work only compared the uncertainty of two groups separately, whereas UCerF jointly considers uncertainty and correctness to handle more intricate scenarios.
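To make the contrast with Equalized Odds concrete, here is a small worked example using the per-pair formula \(U(x_i) = 1 - \frac{1}{2}|D(x_i^A) - D(x_i^B)|\). The numbers are illustrative assumptions, not results from the paper. Suppose a model resolves both variants of a minimal pair correctly, with confidence 0.95 for one pronoun and 0.30 for the other:

\[
D(x_i^A) = 0.95, \quad D(x_i^B) = 0.30, \quad U(x_i) = 1 - \tfrac{1}{2}\,|0.95 - 0.30| = 0.675
\]

A correctness-only comparison records two correct answers and therefore no gap for this pair, whereas the per-pair score of 0.675 registers the confidence asymmetry.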
Rating
- Novelty: ⭐⭐⭐⭐ Incorporating uncertainty into fairness metrics is a novel perspective, and the LSBP scale is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation is conducted with 10 LLMs across two datasets in intrinsic, MCQ, and CoT multi-task setups, featuring detailed case studies.
- Writing Quality: ⭐⭐⭐⭐⭐ Richly illustrated, introducing the problem with intuitive examples in Fig.1 and advancing step-by-step.
- Value: ⭐⭐⭐⭐ Provides a more fine-grained tool for fairness evaluation, while the SynthBias dataset holds solid practical application value.