Understanding and Mitigating Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks
Conference: ACL 2026
arXiv: 2502.04419
Code: https://github.com/MiaomiaoLi2/bias-inheritance
Area: LLM Security / Fairness / Data Augmentation
Keywords: Bias inheritance, Synthetic data, Data augmentation, Fairness evaluation, Bias mitigation
TL;DR
This paper systematically investigates how bias in LLM-generated augmented data is inherited and amplified during supervised fine-tuning, and how that affects downstream tasks. Using six bias-generation frameworks, ten tasks, and three classes of mitigation methods, it shows that "more synthetic data does not necessarily mean greater safety."
Background & Motivation
Background: LLM-based data augmentation has become a common practice in low-resource tasks and instruction fine-tuning. Compared to manual labeling, LLMs can rapidly generate large volumes of samples, but these samples inevitably carry social biases from the model's pre-training, alignment, and prompt design.
Limitations of Prior Work: Existing fairness studies typically measure the output bias of models directly, with fewer studies examining "what happens when biased synthetic data is reused for training." If LLM-generated data is again used to fine-tune LLMs, biases may not only persist but also propagate in more covert ways across downstream tasks such as classification, recruitment, salary recommendation, and story generation.
Key Challenge: Data augmentation pursues scale and diversity, yet safety and fairness require controlling sample distributions. If synthetic data supplements biased patterns, more data may instead make the model more certain of these patterns, especially when biases are intertwined with professions, cultures, names, and group identities, making them difficult to resolve through simple filtering.
Goal: To define and quantify bias inheritance, systematically compare inheritance effects across bias types, bias ratios, task types, and model scales, and explore whether it can be mitigated through token-based, mask-based, or loss-based methods.
Key Insight: The authors decompose bias generation into three dimensions: contextual vs. contrastive, single vs. intersectional, and explicit vs. implicit. By combining these dimensions, they construct six types of controllable biases, allowing for an analysis of "where the bias comes from and how it affects the task."
Core Idea: Treat the bias in LLM synthetic data as a controllable variable to observe how it propagates and amplifies across tasks, groups, and rounds in the fine-tuned model.
Method
The workflow has four steps. First, an LLM generates augmented data carrying gender or cultural bias from preset prompts. Second, the original data \(D_o\) and augmented data \(D_a\) are mixed into a training set \(D=D_o\cup D_a\). Third, the share of biased augmented data is controlled by the bias ratio \(\gamma=|D_a|/|D|\). Finally, the model is fine-tuned with supervision and evaluated for performance, fairness, and generation tendencies across multiple downstream tasks.
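The mixing step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the original set is kept fixed and augmented samples are added until \(\gamma=|D_a|/(|D_o|+|D_a|)\) is reached, and the function names `n_augmented_for_ratio` and `mix_datasets` are hypothetical.

```python
import random


def n_augmented_for_ratio(n_original: int, gamma: float) -> int:
    """Augmented sample count that yields bias ratio gamma for a fixed original set."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    # gamma = n_a / (n_o + n_a)  =>  n_a = gamma * n_o / (1 - gamma)
    return round(gamma * n_original / (1.0 - gamma))


def mix_datasets(original, augmented_pool, gamma, seed=0):
    """Return D = D_o ∪ D_a with |D_a| chosen to match the requested bias ratio."""
    n_a = n_augmented_for_ratio(len(original), gamma)
    if n_a > len(augmented_pool):
        raise ValueError("augmented pool is too small for this bias ratio")
    rng = random.Random(seed)
    # Sample without replacement, then shuffle so the two sources are interleaved.
    mixed = list(original) + rng.sample(list(augmented_pool), n_a)
    rng.shuffle(mixed)
    return mixed
```

Under this assumption, a 50% ratio doubles the training set, since as many augmented samples as original ones are added. A variant that fixes the total size and swaps out original samples would give the same \(\gamma\) but a different amount of real data.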
Overall Architecture
The experiments primarily use Llama-3.1-8B-Instruct as the base model, with GPT-4o-mini used for large-scale validation; the appendix also includes cross-architecture verification for the Qwen and DeepSeek series. Gender bias experiments center on six occupations: architect, dentist, nurse, painter, professor, and software engineer, evaluating occupation classification, hiring recommendations, and salary recommendations. Cultural bias experiments cover four cultures: Arabic, Chinese, Portuguese, and Spanish, evaluating directly and indirectly related classification tasks, as well as the proportion of negative adjectives in story generation. Bias ratios are set at 0, 5%, 10%, 20%, and 50%.
Key Designs
- Six Multidimensional Bias Generation Frameworks:
    - Function: Generate augmented data with different forms of bias using controllable prompts.
    - Mechanism: Combine biases from three sets of dimensions: contextual bias influences answers through background descriptions, while contrastive bias creates differences through direct comparisons between two groups or cultures; single bias involves only one identity dimension, while intersectional bias involves overlapping identities like age, gender, and culture; explicit bias explicitly states group attributes, while implicit bias expresses identity through signals like names.
    - Design Motivation: Real-world bias is not a single label; explicit vs. implicit, single vs. intersectional, and contextual vs. contrastive all alter the patterns learned by the model. Decomposing these dimensions allows for a comparison of which types of bias are most easily inherited.
- Bias Inheritance Evaluation Protocol:
    - Function: Quantify the impact of biased synthetic data in the training set on downstream model behavior.
    - Mechanism: Keep the original unbiased data fixed, vary the augmented data ratio \(\gamma\), and evaluate the fine-tuned model \(f^*\) for within-group performance, between-group gaps, and open-ended generation tendencies. Classification tasks use accuracy or macro-F1; recruitment tasks look at candidate selection proportions; salary tasks look at average recommended salaries for male vs. female candidates; story generation looks at negative adjectives across dimensions like agency, beliefs, and communion.
    - Design Motivation: Looking only at overall accuracy can mask bias inheritance. The authors group directly related tasks, indirectly related tasks, open generation, and multi-round self-augmentation to observe if bias propagates across tasks.
- Three Mitigation Strategies:
    - Function: Attempt to reduce bias inheritance by addressing different sources of misalignment.
    - Mechanism: Token-based methods prepend a warning such as "the following text may contain bias" to the augmented text to encourage self-correction; mask-based methods replace sensitive cues such as culture, names, and pronouns with `[MASK]` or neutral words; loss-based methods add the distance between the mean representations of original and augmented data to the training objective, for example aligning distributions with \(\mathcal{L}_{align}=(\mathbb{E}_{P_o}[\phi(x,y)]-\mathbb{E}_{P_a}[\phi(x,y)])^2\).
    - Design Motivation: The paper argues that bias inheritance stems from value misalignment, group generation imbalance, and real/generated data distribution misalignment; therefore, mitigation starts from the prompt, surface features, and representation distribution levels, respectively.
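The two data-level mitigations can be illustrated with a small sketch. It is an assumption-laden example rather than the paper's implementation: the warning text follows the wording quoted above, while the function names `add_bias_token` and `mask_sensitive` and the list of sensitive terms passed in are hypothetical and would need to be chosen per task.

```python
import re

# Warning text taken from the description above; exact wording is an assumption.
BIAS_WARNING = "The following text may contain bias."


def add_bias_token(text: str, warning: str = BIAS_WARNING) -> str:
    """Token-based mitigation: prepend a warning to an augmented sample."""
    return f"{warning}\n{text}"


def mask_sensitive(text: str, terms, mask: str = "[MASK]") -> str:
    """Mask-based mitigation: replace whole-word sensitive cues with a mask."""
    terms = [t for t in terms if t]
    if not terms:
        return text
    # Longest terms first so multi-word cues are matched before their prefixes.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    # \b keeps "her" from matching inside words such as "there".
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)
    return pattern.sub(mask, text)
```

A fixed term list only catches explicit cues. Implicit signals that are not in the list pass through unchanged, which matches the observation that masking is insufficient for implicit or distributional bias.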
Loss & Training
In gender bias experiments, Llama-3.1-8B-Instruct is fine-tuned using LoRA for 3 epochs with a learning rate of \(1\times10^{-5}\). In cultural bias experiments, the learning rate is \(1\times10^{-6}\), with Arabic data trained for 5 epochs and other cultures trained for 3 epochs. Loss-based mitigation adds a constraint on the difference between the mean representations of original and augmented data to the standard fine-tuning loss, with the representations taken from the last hidden layer to characterize the distribution difference.
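A framework-free sketch of the alignment term is given below. It makes several assumptions that the summary does not confirm: the square is read as a squared Euclidean norm over the feature dimension, the term is added with a scalar `weight`, and features are plain lists of floats standing in for last-layer hidden states. The function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length feature vectors."""
    if not vectors:
        raise ValueError("need at least one feature vector")
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def alignment_loss(orig_feats, aug_feats):
    """Squared distance between the mean original and mean augmented features."""
    mu_o = mean_vector(orig_feats)
    mu_a = mean_vector(aug_feats)
    return sum((o - a) ** 2 for o, a in zip(mu_o, mu_a))


def total_loss(sft_loss, orig_feats, aug_feats, weight=1.0):
    """Standard fine-tuning loss plus the weighted alignment penalty.

    `weight` is an assumed hyperparameter; the source does not state its value.
    """
    return sft_loss + weight * alignment_loss(orig_feats, aug_feats)
```

In an actual training loop the same computation would be done on batched tensors so that gradients flow through the representations; the pure-Python version only shows the arithmetic.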
Key Experimental Results
Main Results
The paper covers 10 downstream tasks and 17 datasets. The focus is not on comparing single SOTA scores but on how bias ratio, bias type, and task attributes change model behavior.
| Experimental Dimension | Setting | Metric | Main Observation |
|---|---|---|---|
| Gender Classification | BiasinBios, six occupations, balanced test | male/female accuracy | Biased augmented data usually improves performance for the majority (male) and decreases it for the minority (female). |
| Gender Salary | 60 male/female biographies per occupation | Mean recommended salary | After augmentation, salaries for both may rise, but the increase for males is larger, widening the gender pay gap. |
| Gender Hiring | 4 cultures × male/female name candidates | Candidate selection ratio | Spanish male increases are more pronounced, Arabic candidates consistently decline; cross-task bias diffusion occurs. |
| Cultural Classification | 16 public test sets, 16,980 samples | macro-F1 | Indirectly related tasks may improve at low bias ratios (10%-20%), but directly related tasks decline significantly even at low ratios. |
| Cultural Story Generation | Arabic/Chinese/Portuguese/Spanish names | Negative adjective ratio | Spanish negative adjectives overall decrease, while Arabic negative words increase at 20%-50% bias ratios. |
| Multi-round Inheritance | 3,600 unbiased + 50% neutral biased synthetic/round | Class., Hiring, Salary | Bias accumulates across rounds; male salaries rise, female salaries fall, Arabic candidates decline, Spanish candidates rise. |
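The group-level metrics in the table above (per-group accuracy, mean recommended salary, candidate selection ratio) all reduce to a per-group aggregate followed by a between-group difference. The sketch below shows that pattern; the record layouts and function names are assumptions made for illustration, not the paper's evaluation code.

```python
from collections import defaultdict


def per_group_accuracy(records):
    """records: iterable of (group, gold_label, predicted_label) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, gold, pred in records:
        total[group] += 1
        correct[group] += int(gold == pred)
    return {group: correct[group] / total[group] for group in total}


def per_group_mean(values):
    """values: iterable of (group, number), e.g. recommended salaries."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group, value in values:
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in counts}


def selection_ratio(selected_groups):
    """selected_groups: list with the group of each selected candidate."""
    counts = defaultdict(int)
    for group in selected_groups:
        counts[group] += 1
    n = len(selected_groups)
    return {group: count / n for group, count in counts.items()}


def gap(metric_by_group, group_a, group_b):
    """Signed between-group gap; positive means group_a scores higher."""
    return metric_by_group[group_a] - metric_by_group[group_b]
```

Computing these per bias ratio \(\gamma\), and comparing against the \(\gamma=0\) baseline, shows whether a gap widens even when overall accuracy looks stable.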
Ablation Study
The analytical experiments attribute bias inheritance to three types of misalignment and compare the applicability of three mitigation strategies.
| Analysis / Mitigation | Evidence or Significance | Conclusion |
|---|---|---|
| Value Misalignment | LLM answers on GlobalOpinionQA differ significantly from real human responses, worse for Eastern cultures. | Models cannot reliably simulate the values of different cultural groups; cultural bias causes more harm to directly related tasks. |
| Group Generation Imbalance | Under neutral prompts, Llama generates more female biographies for most jobs, except architect. | Even without explicit bias in the prompt, generated data may naturally be imbalanced. |
| Real/Generated Distribution Misalignment | In embedding space, augmented and original data are often clearly separated; Arabic Bias #5 p-value reaches \(2.06\times10^{-56}\). | Distribution misalignment is a key mechanism for performance degradation and bias inheritance. |
| Statistical Significance | Gender classification between-group \(p=9.62\times10^{-15}\); cultural classification direct vs. indirect \(p=8.46\times10^{-24}\). | Bias inheritance is not a random fluctuation but exists significantly across tasks. |
| token-based | Overall mitigation \(p=0.0359\) | More effective in simple bias and classification tasks; depends on the model's self-recognition of bias. |
| mask-based | Overall mitigation \(p=0.0485\) | Useful in low bias ratio and explicit sensitive word scenarios, but insufficient for implicit/distributional bias. |
| loss-based | Overall mitigation \(p=0.0215\) | Most effective for large distribution distances, coarse-grained classification, or generation tasks like salary; most robust overall. |
Key Findings
- Bias inheritance is task-dependent: Indirect cultural classification sometimes improves due to extra cultural information, but tasks directly identifying discrimination/bias are significantly harmed.
- Bias type is critical: Contrastive explicit bias and contextual implicit bias are often the most dangerous; the former directly strengthens between-group differences, while the latter is more covert and more easily absorbed by the model as a natural pattern.
- Multi-round self-augmentation amplifies the problem. After repeatedly training on biased synthetic data, the bias not only persists but also spreads to the majority group and causes overall performance degradation.
- Strong aligned models do not necessarily exhibit bias in the same direction. In large-scale experiments, GPT-4o-mini showed a decrease in male selection and an increase in female selection, suggesting RLHF/alignment can change the direction of bias inheritance.
Highlights & Insights
- The paper makes "synthetic data safety" very concrete. It does not just say LLMs are biased; it defines a bias ratio and systematically compares 5 ratios, 6 bias types, 2 types of social bias, and 10 tasks.
- The most striking finding is that "bias inheritance can improve certain metrics." Low ratios of cultural bias improve macro-F1 on indirectly related tasks, which indicates that biased data sometimes carries useful cultural cues, so mitigation cannot simply mean deleting all group information.
- The analysis of three misalignments is highly actionable: Value misalignment explains cultural Q&A, group generation imbalance explains why neutral prompts still lead to bias, and distribution misalignment explains performance volatility when mixing real and generated data.
- It matters that the mitigation results are not packaged as a universal solution. Token, mask, and loss methods each have their own applicable conditions, so fairness repair has to consider the task, bias type, and augmentation ratio rather than apply a fixed filter.
Limitations & Future Work
- The scope of social bias is still limited: the paper mainly covers gender and culture and does not yet systematically study dimensions like race, socioeconomic status, religion, or disability.
- The training method is primarily supervised fine-tuning; bias inheritance in RLHF, DPO, or synthetic preference data remains an open question.
- Core analysis focuses on Llama-3.1 and GPT-4o-mini; although Qwen/DeepSeek are added in the appendix, the interactions between different model families, alignment strategies, and data generators are not fully covered.
- Current mitigation strategies primarily process augmented data during training without systematic research into data selection, generator constraints, active auditing, or human feedback loops.
Related Work & Insights
- vs. Traditional Fairness Evaluation: Traditional work mostly tests if model output is biased; this paper tests how biased data alters the downstream model after entering the training set, which is more relevant to the risks of synthetic data pipelines.
- vs. Data Augmentation Methods: General data augmentation focuses on accuracy and robustness; this paper reminds us that group distribution, values, and representation distribution in augmented data can all alter fairness.
- vs. Debiasing Methods: Masking sensitive words alone handles only surface bias; the loss-based method in this paper suggests that representation alignment between real and generated data is also necessary.
- Transferable Insight: Any system using LLMs to generate training data should record the bias ratio, group distribution, and generator prompts, and perform inheritance-based auditing on downstream tasks rather than just auditing the generated samples themselves.
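As one way to act on the last point, a pipeline could store a small audit record next to each augmented training set. The sketch below is purely illustrative: the class name `AugmentationAuditRecord`, its fields, and the `group` key used by `count_groups` are hypothetical choices, not something the paper prescribes.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class AugmentationAuditRecord:
    """Minimal metadata to keep alongside an augmented training set."""
    generator: str          # name of the model that produced the data
    prompt: str             # generator prompt used for augmentation
    n_original: int
    n_augmented: int
    group_counts: dict = field(default_factory=dict)

    @property
    def bias_ratio(self) -> float:
        """gamma = |D_a| / (|D_o| + |D_a|)."""
        total = self.n_original + self.n_augmented
        return self.n_augmented / total if total else 0.0

    def to_json(self) -> str:
        payload = asdict(self)
        payload["bias_ratio"] = self.bias_ratio
        return json.dumps(payload, sort_keys=True)


def count_groups(samples, group_key="group"):
    """Group distribution of augmented samples given as dicts."""
    return dict(Counter(sample[group_key] for sample in samples))
```

Keeping the record in JSON makes it easy to compare runs at different ratios when auditing downstream behavior.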
Rating
- Novelty: ⭐⭐⭐⭐⭐ "Bias inheritance" as a systematic definition and evaluation of synthetic data retraining risk is highly valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Tasks, ratios, and bias types are broadly covered, though charts are mostly trend analyses and cross-model depth could be further strengthened.
- Writing Quality: ⭐⭐⭐⭐☆ Clear structure with rich explanations of phenomena; some numerical values in figures are not easily reproducible directly from the text.
- Value: ⭐⭐⭐⭐⭐ Direct warning significance for LLM data augmentation, safety fine-tuning, and fairness auditing.