Mitigating Selection Bias with Node Pruning and Auxiliary Options¶

Conference: ACL 2025
arXiv: 2409.18857
Institution: University of Wisconsin-Madison, Amazon
Keywords: selection bias, bias node pruning, auxiliary option injection, multiple-choice questions, LLM debiasing

TL;DR¶

This paper proposes two complementary methods, Bias Node Pruning (BNP) and Auxiliary Option Injection (AOI), that mitigate the selection bias of LLMs on multiple-choice questions (MCQs) from inside and outside the model, respectively. BNP (white-box) localizes and prunes bias-carrying parameters in the model's output layer (about 0.002% of all model parameters). AOI injects an "I don't know" auxiliary option into the prompt and therefore also applies to black-box models. The paper additionally introduces CKLD, a distribution-level bias metric. Combined, the two methods raise Llama-3's ARC-Challenge accuracy from 52.3% to 65.3%.

Background & Motivation¶

Core Problem: LLMs exhibit systematic selection bias when answering MCQs—tending to select specific positions (such as the last option) or specific labels (such as "A") regardless of the option content, which severely undermines accuracy and reliability.
Limitations of Prior Work: Previous works focus on input reformatting (e.g., Split-and-Merge, Li et al. 2023) or output probability calibration and decoding (e.g., PriDe, Zheng et al. 2024; DoLa, Chuang et al. 2023). These methods patch the model externally and leave the internal mechanisms that generate the bias unexamined.
Real-world Impact: The voting experiments in Figure 2 show that, for all four LLMs tested, majority voting over all option permutations is significantly more accurate than a single inference, which indicates that selection bias is common across models rather than specific to one of them.
Key Finding 1: By analyzing the selection frequency distribution under option permutations, it is observed that the bias of incorrect samples is far greater than that of correct samples—when the model answers incorrectly, the selection distribution exhibits a sharper imbalance (e.g., Llama-3 prefers "D" and Bloomz prefers "A").
Key Finding 2: By extracting embeddings from various layers and token positions of the model and calculating the \(L_2\) norm of the difference vector between correct and incorrect answers, it is found that selection bias is primarily concentrated in the final layers of the decoder, particularly at the interaction between the output projection matrix and the final-layer embeddings.
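The permutation-voting analysis can be illustrated with a minimal sketch. It assumes a hypothetical `predict` callable that stands in for one LLM inference and returns the chosen position; the toy predictor and the example question are invented for illustration and are not from the paper.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Sequence, Tuple


def permutation_vote(
    question: str,
    options: Sequence[str],
    predict: Callable[[str, Sequence[str]], int],
) -> Tuple[str, Counter, Counter]:
    """Run `predict` on every option ordering and majority-vote over option content.

    Returns the winning option text, the votes per option text, and the
    number of times each position was chosen.
    """
    content_votes: Counter = Counter()
    position_votes: Counter = Counter()
    for order in permutations(options):
        pos = predict(question, order)
        content_votes[order[pos]] += 1
        position_votes[pos] += 1
    winner = content_votes.most_common(1)[0][0]
    return winner, content_votes, position_votes


def toy_biased_predict(question: str, order: Sequence[str]) -> int:
    """Toy stand-in for an LLM (hypothetical): it finds "Paris" only in the
    first two slots and otherwise falls back to the last position."""
    idx = order.index("Paris")
    return idx if idx < 2 else len(order) - 1


winner, content_votes, position_votes = permutation_vote(
    "What is the capital of France?",
    ["Paris", "London", "Rome", "Berlin"],
    toy_biased_predict,
)
total = sum(content_votes.values())
print("majority-vote answer:", winner)
print("single-inference accuracy:", content_votes["Paris"] / total)
print("position frequencies:", dict(position_votes))
```

In this toy setup the vote recovers the correct answer even though individual orderings fail, and the position counts expose the preference for the last slot, which mirrors the kind of gap the voting experiment is described as measuring.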

Method¶

Overall Architecture¶

This paper proposes two complementary debiasing methods and a new evaluation metric:

Component	Type	Applicability	Core Idea
BNP (Bias Node Pruning)	Parameter Pruning	White-box models	Prune rows in the output projection matrix that interact most strongly with the bias vector
AOI (Auxiliary Option Injection)	Input Prompting	White-box + Black-box	Inject an "I don't know" auxiliary option to absorb uncertainty
CKLD (Choice KL Divergence)	Evaluation Metric	Universal	Measure the discrepancy between the predicted distribution and the ground-truth label distribution using KL divergence

Key Designs¶

1. Bias Node Pruning (BNP)

The core idea is to model the biased LLM's output as an unbiased model's output \(D(x)\) plus a bias vector \(b\), multiplied by the output projection matrix \(W\): \(F(x) \approx (D(x) + b) \cdot W\), where the bias term \(b \cdot W\) directly causes the output shift. The specific steps are as follows:

Bias Vector Calculation: For each question \(x\), all permutations of options are fed into the model. The final-layer embeddings for correct and incorrect permutations are collected to calculate the difference vector \(b_x = \text{mean}(z_-) - \text{mean}(z_+)\). The global bias vector \(b\) is then obtained by averaging over 32 training samples.
Bias Node Identification: For each row \(i\) of the output projection matrix \(W \in \mathbb{R}^{d \times |V|}\), the interaction strength with the bias vector \(b\), \(\sum_j b_i \times W_{ij}\), is calculated. The top-\(k\) rows with the strongest interaction are identified as the set of bias nodes \(K\).
Parameter Zeroing: All rows in \(K\) are set to zero to obtain \(\tilde{W}\), which replaces \(W\) in all subsequent inference. This modifies only ~0.002% of the model parameters (e.g., 32 pruned nodes for the 8-billion-parameter Llama-3).
Hyperparameter Selection: 32 nodes are pruned for Llama-3 and Mistral, and 128 nodes for Bloomz, selected via a simple grid search over {16, 32, 64, 128}.
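The three BNP steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the embeddings are assumed to be already extracted as arrays, and rows are ranked by the signed score \(\sum_j b_i W_{ij}\) (ranking by magnitude would be an equally plausible reading of "strongest interaction").

```python
import numpy as np


def global_bias_vector(z_incorrect, z_correct):
    """Average the per-question bias b_x = mean(z_-) - mean(z_+).

    Each argument is a list with one array per question, shaped
    (num_permutations_in_group, d), holding final-layer embeddings.
    """
    per_question = [zm.mean(axis=0) - zp.mean(axis=0)
                    for zm, zp in zip(z_incorrect, z_correct)]
    return np.mean(per_question, axis=0)


def bias_node_scores(b, W):
    """Interaction strength of row i: sum_j b_i * W_ij, with W shaped (d, |V|)."""
    return b * W.sum(axis=1)


def prune_bias_nodes(W, b, k):
    """Return a copy of W with the k highest-scoring rows zeroed, plus their indices.

    Assumption: rows are ranked by signed score, not by absolute value.
    """
    scores = bias_node_scores(b, W)
    top_k = np.sort(np.argsort(scores)[-k:])
    W_pruned = W.copy()
    W_pruned[top_k, :] = 0.0
    return W_pruned, top_k


# Toy dimensions for illustration only (real models have far larger d and |V|).
rng = np.random.default_rng(0)
d, vocab, n_questions = 8, 5, 4
z_minus = [rng.normal(size=(3, d)) for _ in range(n_questions)]
z_plus = [rng.normal(size=(2, d)) for _ in range(n_questions)]
W = rng.normal(size=(d, vocab))

b = global_bias_vector(z_minus, z_plus)
W_tilde, bias_nodes = prune_bias_nodes(W, b, k=2)
print("pruned rows:", bias_nodes)
```

Zeroing a row of \(W\) removes that embedding dimension's contribution to every logit, so inference with `W_tilde` needs no extra computation compared with the original matrix.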

2. Auxiliary Option Injection (AOI)

Based on the observation that "incorrect samples are more prone to bias," this mechanism is designed to let the model actively express uncertainty:

Option Expansion: An auxiliary option \(o_{\text{aux}}\), "I don't know", is appended to the end of the original option set \(A\).
Answer Selection: The model's output probabilities (derived from the logits) are restricted to the original options, and the most probable one is returned as the final answer, so \(o_{\text{aux}}\) can absorb probability mass but is never selected: \(\hat{a} = \operatorname{argmax}_{a \in A \setminus \{o_{\text{aux}}\}} P(\hat{y}=a|x_A)\).
Black-box Adaptation: For black-box models where logits are inaccessible, the Jaccard similarity between the generated text and each option is used instead of probability ranking.
Ablation Study: Among alternative auxiliary texts such as "None of the above" and "I know the answer", "I don't know" performs best in most scenarios. Adding multiple "I don't know" options yields further gains for Llama-3 but not for the other models.
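A minimal sketch of AOI, under stated assumptions: the prompt template, the letter labels, and the whitespace tokenization used for the Jaccard similarity are illustrative choices that the summary does not specify, and all function names are hypothetical.

```python
import string
from typing import Dict, Sequence

AUX_OPTION = "I don't know"


def build_prompt(question: str, options: Sequence[str]) -> str:
    """Append the auxiliary option after the original options (hypothetical template)."""
    expanded = list(options) + [AUX_OPTION]
    lines = [f"Question: {question}"]
    lines += [f"{string.ascii_uppercase[i]}. {text}" for i, text in enumerate(expanded)]
    lines.append("Answer:")
    return "\n".join(lines)


def select_answer(label_probs: Dict[str, float], num_original: int) -> str:
    """White-box case: argmax over the original labels only; the auxiliary label is ignored."""
    original_labels = string.ascii_uppercase[:num_original]
    return max(original_labels, key=lambda label: label_probs.get(label, 0.0))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between lowercased whitespace-token sets (an assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_answer_black_box(generated: str, options: Sequence[str]) -> int:
    """Black-box case: index of the original option most similar to the generated text."""
    scores = [jaccard(generated, option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


options = ["London", "Paris", "Rome", "Berlin"]
print(build_prompt("What is the capital of France?", options))

# The auxiliary label "E" holds the most probability mass but is never returned.
probs = {"A": 0.10, "B": 0.30, "C": 0.15, "D": 0.05, "E": 0.40}
print(select_answer(probs, num_original=len(options)))
print(select_answer_black_box("I think the answer is Paris", options))
```

The example probabilities are made up to show the mechanism: the auxiliary option may receive the highest probability, yet the returned answer always comes from the original option set.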

3. Choice KL Divergence (CKLD)

Existing metrics, RStd (standard deviation) and RSD (relative standard deviation), only measure how much accuracy varies across classes. They do not account for imbalanced ground-truth label distributions (e.g., A accounts for 22% of the labels and D for 28%), so they can be misleading:

Definition: \(\text{CKLD} = \sum_i p_i \log(p_i/q_i)\), where \(p_i\) is the ratio of the \(i\)-th option in the ground-truth labels, and \(q_i\) is the ratio of the \(i\)-th option in the model's predictions.
Theoretical Guarantee: Using the Lagrangian method, it is proven that CKLD reaches its minimum value of 0 if and only if \(q_i = p_i\) (i.e., the predicted distribution matches the ground-truth distribution).
Contrast of Advantages: Synthetic data experiments show that under label imbalance, the minimum point of RSD deviates from the true value (always residing at \(1/k\)), whereas CKLD accurately reflects the point of minimum bias.
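The CKLD definition above can be computed directly from label counts. The sketch below assumes the natural logarithm and clamps \(q_i\) at a small `eps` so that an option the model never predicts does not cause a division by zero; both choices are assumptions, since the summary does not state how these cases are handled.

```python
import math
from collections import Counter
from typing import List, Sequence


def choice_distribution(choices: Sequence[str], labels: Sequence[str]) -> List[float]:
    """Fraction of `choices` equal to each label, in the order given by `labels`."""
    counts = Counter(choices)
    return [counts[label] / len(choices) for label in labels]


def ckld(ground_truth: Sequence[str], predictions: Sequence[str],
         labels: Sequence[str], eps: float = 1e-12) -> float:
    """CKLD = sum_i p_i * log(p_i / q_i); p from ground truth, q from predictions."""
    p = choice_distribution(ground_truth, labels)
    q = choice_distribution(predictions, labels)
    # Terms with p_i == 0 contribute nothing; q_i is clamped at eps (assumption).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


labels = ["A", "B", "C", "D"]
# Imbalanced ground truth, as in the 22% / 28% example above.
truth = ["A"] * 22 + ["B"] * 25 + ["C"] * 25 + ["D"] * 28

matched = list(truth)                                       # q equals p
uniform = ["A"] * 25 + ["B"] * 25 + ["C"] * 25 + ["D"] * 25  # ignores the imbalance
skewed = ["A"] * 10 + ["B"] * 10 + ["C"] * 10 + ["D"] * 70   # strong preference for D

for name, preds in [("matched", matched), ("uniform", uniform), ("skewed", skewed)]:
    print(name, ckld(truth, preds, labels))
```

With imbalanced labels, a perfectly uniform prediction distribution still has nonzero CKLD, and only predictions that match the ground-truth ratios reach zero, which is the property the theoretical guarantee describes.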

Key Experimental Results¶

Main Results (BNP + AOI)¶

Validated across 3 models \(\times\) 3 datasets:

Model + Method	ARC Acc↑	ARC CKLD↓	MMLU Acc↑	MMLU CKLD↓	CSQA Acc↑	CSQA CKLD↓
Llama-3	52.3	0.494	41.8	0.589	65.4	0.095
Llama-3 + BNP	56.7	0.302	43.1	0.501	66.6	0.074
Llama-3 + AOI	60.7	0.231	47.3	0.321	67.4	0.065
Llama-3 + BNP+AOI	65.3	0.124	48.3	0.288	68.1	0.049
Bloomz	43.9	0.283	28.0	0.661	58.5	0.136
Bloomz + BNP+AOI	48.8	0.088	32.0	0.205	64.9	0.052
Mistral	67.4	0.040	46.4	0.186	63.6	0.042
Mistral + BNP+AOI	69.5	0.019	48.6	0.140	66.8	0.016
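The relative gains implied by the Llama-3 ARC rows can be recomputed from the table with a few lines of arithmetic. The helper name `relative_change` is introduced here for illustration only.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Llama-3 on ARC, values copied from the table above.
baseline_acc, baseline_ckld = 52.3, 0.494
rows = {
    "BNP": (56.7, 0.302),
    "AOI": (60.7, 0.231),
    "BNP+AOI": (65.3, 0.124),
}

for method, (acc, ckld_value) in rows.items():
    acc_gain = relative_change(baseline_acc, acc)
    ckld_drop = relative_change(baseline_ckld, ckld_value)
    print(f"{method}: accuracy {acc_gain:+.1f}%, CKLD {ckld_drop:+.1f}%")
```

The combined row gives a relative accuracy gain of about 24.9% over the baseline, while each method alone accounts for a smaller share of that gain.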

Compatibility with Existing Methods¶

BNP+AOI is orthogonally compatible with existing methods such as CoT, ICL, and DoLa:

Method (Llama-3, ARC)	Acc↑	CKLD↓
CoT	66.2	0.050
CoT + Ours (BNP+AOI)	69.2	0.024
ICL	62.2	0.169
ICL + Ours (BNP+AOI)	70.0	0.054
DoLa	51.1	0.524
DoLa + Ours (BNP+AOI)	64.1	0.139

Black-Box Model Validation¶

AOI is also effective in black-box scenarios where model parameters are inaccessible, with gains ranging from +0.7 to +10.6 accuracy points:

Model	ARC Acc	ARC+AOI Acc	CSQA Acc	CSQA+AOI Acc
Claude-3-Haiku	65.3	71.4 (+6.1)	36.4	47.0 (+10.6)
Claude-3-Sonnet	86.9	87.6 (+0.7)	71.0	73.1 (+2.1)

Key Findings¶

BNP is robust to the number of pruned nodes: Performance consistently exceeds the baseline for 8 to 128 pruned nodes, although tuning this number per model can yield further gains.
Transferability of bias vectors across datasets: The bias vector calculated using the ARC dataset reduces CKLD on CSQA by 36%, which even outperforms the bias vector computed from CSQA itself (22%), indicating that the bias vector captures inherent model properties.
BNP largely preserves general capabilities: On sentiment analysis and text summarization tasks, pruning 32 nodes leads to only a minor performance drop (F1: 32.7 \(\rightarrow\) 31.3), so the impact on general language capabilities is small.
Visualization of distribution shifts: After applying BNP+AOI, the model's option choice frequency distribution tends toward uniformity (approaching the ideal uniform ratio marked by the dashed lines).

Highlights & Insights¶

Mechanistic Approach: First to localize the source of selection bias at the parameter level of the output projection matrix, rather than merely performing external calibration.
Minimalist Pruning: BNP modifies only 0.002% of the parameters and adds virtually no computational overhead; combined with AOI, it yields up to a +24.9% relative accuracy improvement (Llama-3 on ARC, 52.3 \(\rightarrow\) 65.3).
Elegant Design of AOI: Appending a harmless "I don't know" option significantly reduces bias, remaining highly effective for black-box models.
CKLD Fills the Metric Gap: While existing metrics like RStd/RSD fail under class imbalance, CKLD accurately measures distribution-level bias via KL divergence.
Full Scenario Coverage: BNP targets white-box setups, while AOI also works in black-box ones; the two are complementary, and the combination can be stacked with CoT/ICL for further gains.

Limitations & Future Work¶

BNP requires a small amount of annotated calibration data (all permutations of 32 samples), and its generalization to out-of-distribution (OOD) scenarios has not been fully verified.
The hyperparameter \(k\) (number of pruned nodes) must be manually searched for each model, with no automatic determination method provided.
The methods are only validated on MCQ tasks; whether they apply to biases in open-ended generation tasks remains an open question.
Calculating the bias vector requires \(N!\) permutation inferences (where \(N\) is the number of options), causing the cost to escalate rapidly as the number of options increases.
The root causes of bias (e.g., human cognitive biases in training data? symbol encoding in the tokenizer?) remain unresolved.
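The permutation cost noted in the limitations can be made concrete with a short calculation. It assumes one forward pass per permutation and the 32 calibration samples mentioned above; the function name is hypothetical.

```python
import math


def calibration_forward_passes(num_options: int, num_samples: int = 32) -> int:
    """Forward passes needed to see every option ordering of every calibration sample."""
    return num_samples * math.factorial(num_options)


for n in (4, 5, 6, 10):
    print(f"{n} options: {calibration_forward_passes(n):,} forward passes")
```

Four options need 768 passes for 32 samples and five options need 3,840, but ten options would already need more than 116 million, so exhaustive permutation stops being practical well before that point.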

Related Work¶

Input-side Debiasing: Split-and-Merge (Li et al. 2023), option order shuffling (Robinson et al. 2023).
Output-side Calibration: PriDe (Zheng et al. 2024), DoLa contrastive decoding (Chuang et al. 2023), label bias quantification (Reif & Schwartz 2024).
Structured Pruning: LLM-Pruner (Ma et al. 2023), Deep Compression (Han et al. 2016).
Symbol Binding: Enhancing MCQ symbol binding training (Xue et al. 2024).
Inspiration from Survey Science: The inclusion of an "I don't know" option in human surveys improves data quality (Schuman & Presser 1996).

Rating¶

Novelty: ⭐⭐⭐⭐ — The approach combining bias localization and parameter-level pruning is novel; AOI draws inspiration from survey science.
Technical Depth: ⭐⭐⭐⭐ — Features a complete logical progression from embedding analysis to bias modeling and pruning.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation with 3 models \(\times\) 3 datasets, combined with CoT/ICL/DoLa, black-box validation, and extensive ablation studies.
Value: ⭐⭐⭐⭐⭐ — Plug-and-play, applicable to both white-box and black-box models, and orthogonal to existing methods.
Overall Rating: ⭐⭐⭐⭐ — Simple yet effective methodology, addressing a practical pain point in LLM MCQ debiasing.