Assessing the Vulnerability of LLMs to Cognitive Biases in Scientific Research

Conference: ACL 2025
Code: None
Area: AI Safety / LLM Reliability
Keywords: Cognitive Bias, Large Language Models, Scientific Research, Confirmation Bias, Anchoring Effect

TL;DR

This paper systematically evaluates the vulnerability of Large Language Models (LLMs) to various cognitive biases in scientific research scenarios. By constructing a scientific reasoning test suite covering confirmation bias, anchoring effect, availability bias, and others, the study reveals the risks of systemic biases that LLMs may introduce when assisting scientific research, and proposes mitigation strategies.

Background & Motivation

Background: LLMs are increasingly applied across the stages of scientific research, including literature review, hypothesis generation, experimental design, and data analysis and interpretation. Researchers have begun relying on tools like ChatGPT and Claude to accelerate scientific workflows. Meanwhile, cognitive science research has shown that cognitive biases are ubiquitous in human decision-making (e.g., Kahneman's System 1/System 2 theory) and can lead to irrational judgments.

Limitations of Prior Work: Existing research on LLM biases primarily focuses on social biases (such as gender and racial biases) and factual hallucinations, with less attention paid to cognitive biases. However, cognitive biases are especially dangerous in scientific research scenarios: if an LLM exhibits confirmation bias when assisting in hypothesis evaluation, it may lead researchers to ignore counter-evidence; if it exhibits availability bias in literature reviews, it may over-recommend popular methods.

Key Challenge: LLM training data itself contains a vast amount of text shaped by human cognitive biases (e.g., papers that overstate the merits of the authors' own methods, low reporting rates of negative results). This implies that LLMs may not only fail to help researchers correct biases, but might instead amplify them.

Goal: (1) Construct a test suite for scientific research scenarios covering 6 primary cognitive biases; (2) quantitatively evaluate the susceptibility of mainstream LLMs to biases in these scenarios; (3) explore effective debiasing strategies.

Key Insight: Starting from classical experimental paradigms in cognitive psychology, the authors adapt traditional cognitive bias test questions to scientific research contexts, constructing test pairs with controlled variables.

Core Idea: Transfer the bias detection methodology from cognitive psychology to LLM evaluation, quantitatively measuring the impact of each bias through carefully designed controlled experiments.

Method

Overall Architecture

The framework consists of three stages: bias test suite construction, multidimensional bias evaluation, and debiasing strategy exploration. In the test suite construction stage, test cases contextualized in scientific research are designed for 6 categories of cognitive biases. In the evaluation stage, tests are run on multiple mainstream LLMs to quantify the level of bias. In the debiasing stage, the effectiveness of prompt-based and post-hoc debiasing methods is explored.

Key Designs

  1. Scientific Contextualized Cognitive Bias Dataset (SciCogBias):

    • Function: Provide a standardized evaluation benchmark for scientific cognitive biases in LLMs.
    • Mechanism: Cover 6 cognitive biases: (a) Confirmation Bias: given a hypothesis and evidence from both supporting and opposing sides, test whether the LLM tends to support the given hypothesis; (b) Anchoring Effect: after providing an initial numerical value, test whether the LLM's numerical estimate is pulled toward the anchor; (c) Availability Bias: test whether the LLM tends to recommend more popular or recent methods while neglecting lesser-known methods that are equally good or even superior; (d) Bandwagon Effect: after informing the LLM that "most experts believe X," test whether its judgment changes; (e) Framing Effect: express the same scientific discovery in positive and negative ways, testing whether the LLM's evaluation remains consistent; (f) Sunk Cost Bias: in a situation where a research project has already consumed substantial resources, test whether the LLM will advocate continuing an unpromising project. For each bias, 100-200 paired test cases are constructed.
    • Design Motivation: Existing LLM bias benchmarks lack specificity for scientific research scenarios, and general bias evaluations cannot reflect the actual risks of LLMs assisting in scientific research.
  2. Paired Controlled Assessment:

    • Function: Accurately quantify the magnitude of influence of each bias.
    • Mechanism: Design paired conditions for each test case: a bias-inducing condition (containing bias triggers, e.g., presenting a hypothesis before evidence) and a neutral condition (removing bias triggers, e.g., presenting only evidence without a hypothesis). The bias score \(B_{score}\) quantifies the difference in LLM outputs between the two conditions: the probability difference for classification tasks, and the change in semantic similarity for generation tasks. In large-scale experiments, each test case is sampled 10 times to reduce the impact of randomness.
    • Design Motivation: Only through rigorous controlled experiments can one differentiate between biased behavior and reasonable reasoning behavior of LLMs—since prior information should indeed influence judgment in some cases.
  3. Multi-Level Debiasing:

    • Function: Alleviate cognitive biases in LLMs during scientific reasoning.
    • Mechanism: Three levels of debiasing strategies are designed—(a) Prompting level: explicitly warn the LLM about specific biases in the prompt (e.g., "please ensure you are not influenced by the anchoring of the initial value"), or require the LLM to list supporting and opposing arguments first before making a judgment (Devil's Advocate Prompting); (b) Reasoning level: require the LLM to generate multiple independent reasoning paths and take a majority vote (Self-Consistency Debiasing), or ask the LLM to identify potential bias types before answering (Meta-Cognitive Prompting); (c) Verification level: use another LLM as a "reviewer" to check the first LLM's response for signs of bias.
    • Design Motivation: A single debiasing strategy has difficulty handling all types of biases. Since the generation mechanisms of different biases vary (cognitive vs. statistical vs. linguistic levels), multi-level defense is required.
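
The three debiasing levels can be sketched as thin wrappers around a model call. This is a minimal illustration, not the authors' implementation: `ask` and `review` are hypothetical callables standing in for LLM API calls, the prompt wording other than the quoted anchoring warning is assumed, and retrying when the reviewer flags bias is just one possible policy.

```python
from collections import Counter
from typing import Callable

# Hypothetical prompt prefixes. Only the anchoring warning is quoted from the
# summary; the other two wordings are assumptions for illustration.
EXPLICIT_WARNING = "Please ensure you are not influenced by the anchoring of the initial value.\n\n"
DEVILS_ADVOCATE = "First list the supporting and opposing arguments, then give your judgment.\n\n"
META_COGNITIVE = "Before answering, identify the cognitive biases that could affect your answer.\n\n"

Ask = Callable[[str], str]           # prompt -> answer (stand-in for an LLM call)
Review = Callable[[str, str], bool]  # (question, answer) -> True if bias is suspected


def prompt_level(ask: Ask, question: str, prefix: str) -> str:
    """Prompting level: prepend a warning or devil's-advocate instruction."""
    return ask(prefix + question)


def self_consistency(ask: Ask, question: str, n_paths: int = 5) -> str:
    """Reasoning level: sample several independent answers and majority-vote."""
    answers = [ask(question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]


def verified(ask: Ask, review: Review, question: str) -> str:
    """Verification level: a second model reviews the first answer.

    Retrying with the meta-cognitive prefix when the reviewer flags bias is
    an assumed policy; the summary does not specify what happens on a flag.
    """
    answer = ask(question)
    if review(question, answer):
        answer = ask(META_COGNITIVE + question)
    return answer
```

Passing the model as a plain callable keeps the sketch independent of any particular API client and makes each level easy to test with a stub.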

Loss & Training

This is a purely evaluative study and does not involve model training. In the evaluation, temperature=0 is used for the deterministic setting, while repeated sampling (temperature=0.7, 10 samples per test case) is used to evaluate the robustness of the measured biases.
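
For classification-style test cases, the paired score described under Paired Controlled Assessment can be approximated from repeated samples. The sketch below assumes a signed difference in the rate of the bias-consistent answer between the two conditions; the summary does not state whether the paper uses a signed or absolute difference, and all function and variable names are illustrative.

```python
from typing import Sequence, Tuple


def choice_rate(answers: Sequence[str], target: str) -> float:
    """Fraction of sampled answers equal to the bias-consistent option."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return sum(a == target for a in answers) / len(answers)


def paired_bias_score(biased: Sequence[str], neutral: Sequence[str], target: str) -> float:
    """Signed difference in target rate: bias-inducing minus neutral condition.

    Assumption: a positive value means the bias trigger pushed the model
    toward the target answer; 0 means no measurable shift.
    """
    return choice_rate(biased, target) - choice_rate(neutral, target)


def mean_bias_score(cases: Sequence[Tuple[Sequence[str], Sequence[str], str]]) -> float:
    """Average the per-case scores over (biased, neutral, target) triples."""
    if not cases:
        raise ValueError("need at least one test case")
    scores = [paired_bias_score(b, n, t) for b, n, t in cases]
    return sum(scores) / len(scores)


# Toy example with 10 samples per condition, mirroring the sampling setup above.
biased_samples = ["support"] * 8 + ["reject"] * 2   # hypothesis shown before evidence
neutral_samples = ["support"] * 5 + ["reject"] * 5  # evidence only
score = paired_bias_score(biased_samples, neutral_samples, target="support")
print(f"{score:.2f}")  # prints 0.30
```

For generation tasks, the same pairing would apply with a semantic similarity measure in place of `choice_rate`; that variant is not sketched here because the summary does not name the similarity model.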

Key Experimental Results

Main Results

| Bias Type | GPT-4 Bias Score | Claude-3 Bias Score | Llama-3-70B Bias Score | Human Baseline |
|---|---|---|---|---|
| Confirmation Bias | 0.42 | 0.38 | 0.51 | 0.45 |
| Anchoring Effect | 0.56 | 0.48 | 0.63 | 0.52 |
| Availability Bias | 0.61 | 0.55 | 0.72 | 0.38 |
| Bandwagon Effect | 0.47 | 0.43 | 0.58 | 0.50 |
| Framing Effect | 0.35 | 0.31 | 0.44 | 0.40 |
| Sunk Cost | 0.39 | 0.35 | 0.49 | 0.55 |

Ablation Study

| Debiasing Strategy | Confirmation Bias Reduction | Anchoring Effect Reduction | Availability Bias Reduction | Average Reduction |
|---|---|---|---|---|
| No Debiasing | 0 | 0 | 0 | 0 |
| Explicit Warning | -8.2% | -5.1% | -4.3% | -5.9% |
| Devil's Advocate | -18.6% | -7.2% | -11.5% | -12.4% |
| Self-Consistency | -12.3% | -15.8% | -9.7% | -12.6% |
| Meta-Cognitive | -15.1% | -12.4% | -13.2% | -13.6% |
| Multi-Level (All) | -24.7% | -21.3% | -19.8% | -21.9% |
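
The Average Reduction column appears to be the unweighted mean of the three per-bias reductions, which can be checked directly against the reported values. The snippet below is only an arithmetic check on the table; the dictionary layout and names are illustrative.

```python
# Reported reductions in percent: (confirmation, anchoring, availability).
reductions = {
    "Explicit Warning": (8.2, 5.1, 4.3),
    "Devil's Advocate": (18.6, 7.2, 11.5),
    "Self-Consistency": (12.3, 15.8, 9.7),
    "Meta-Cognitive": (15.1, 12.4, 13.2),
    "Multi-Level (All)": (24.7, 21.3, 19.8),
}

# Unweighted mean per strategy, rounded to one decimal as in the table.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in reductions.items()}

for name, avg in averages.items():
    print(f"{name}: -{avg}%")
```

Each computed mean matches the reported column, so the average weights the three biases equally.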

Key Findings

  • LLMs are most vulnerable to availability bias (GPT-4: 0.61) and the anchoring effect (GPT-4: 0.56). Availability bias is well above the human baseline for all three models (0.55-0.72 vs. 0.38), whereas anchoring is only slightly above it for GPT-4 (0.56 vs. 0.52) and below it for Claude-3 (0.48). The prominence of availability bias might be because popular methods appear much more frequently in pre-training data than niche methods.
  • Interestingly, all three LLMs score lower than humans on sunk cost bias (e.g., GPT-4 at 0.39 vs. 0.55), suggesting that LLMs might act more rationally on biases that involve emotional factors.
  • Single debiasing strategies have limited effectiveness (5-14%), but using them in a multi-level combination shows significant improvement (approx. 22%), indicating that different levels of debiasing mechanisms indeed address biases from different sources.
  • Larger and newer models generally show lower levels of bias (GPT-4 scores lower than Llama-3-70B on all six biases), but the gap is closing.
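
The comparison with the human baseline can be made explicit by subtracting the human score from each model's score in the Main Results table. This is a small arithmetic sketch over the reported numbers; the variable names are illustrative, and a positive gap means the model scores as more biased than the human baseline.

```python
MODELS = ("GPT-4", "Claude-3", "Llama-3-70B")

# Bias scores from the Main Results table: (GPT-4, Claude-3, Llama-3-70B, human).
scores = {
    "Confirmation Bias": (0.42, 0.38, 0.51, 0.45),
    "Anchoring Effect": (0.56, 0.48, 0.63, 0.52),
    "Availability Bias": (0.61, 0.55, 0.72, 0.38),
    "Bandwagon Effect": (0.47, 0.43, 0.58, 0.50),
    "Framing Effect": (0.35, 0.31, 0.44, 0.40),
    "Sunk Cost": (0.39, 0.35, 0.49, 0.55),
}

# Gap to the human baseline per bias and model, rounded to two decimals.
gaps = {
    bias: {m: round(row[i] - row[3], 2) for i, m in enumerate(MODELS)}
    for bias, row in scores.items()
}

# Number of bias types on which each model exceeds the human baseline.
worse_than_human = {
    m: sum(gaps[bias][m] > 0 for bias in gaps) for m in MODELS
}

print(gaps["Availability Bias"])  # {'GPT-4': 0.23, 'Claude-3': 0.17, 'Llama-3-70B': 0.34}
print(worse_than_human)           # {'GPT-4': 2, 'Claude-3': 1, 'Llama-3-70B': 5}
```

On these numbers, availability bias is the only category where all three models exceed the human baseline, and sunk cost is the only one where all three fall below it.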

Highlights & Insights

  • Transferring the paired experimental paradigm from cognitive psychology to LLM evaluation is elegant—it eliminates confounding variables through strict controls, making the bias measurement more credible. This methodology can be applied to evaluate any systematic behavioral biases in LLMs.
  • The finding that "LLMs are worse than humans regarding availability bias" has significant practical implications: this means that when using LLMs for scientific literature reviews, they may amplify recommendation bias toward popular methods, requiring researchers to be particularly vigilant.
  • The effectiveness of the Meta-Cognitive Prompting strategy (prompting LLMs to perform self-review first) suggests that LLMs possess a certain degree of "meta-cognitive" ability, which can be guided for self-correction through appropriate prompting.

Limitations & Future Work

  • Although the evaluation suite covers 6 biases, the types of known biases in cognitive psychology are far more numerous (e.g., publication bias, survivorship bias), requiring expansion in future work.
  • The absolute value of bias scores is difficult to directly interpret as "severity of harm", and correlation analysis with actual scientific research error rates is lacking.
  • Debiasing strategies increase inference costs, introducing latency issues in real-time research assistance scenarios.
  • Future work can develop specialized "scientific assistant security guardrails" to automatically screen for biases before LLMs provide scientific advice.

Related Work

  • vs. CogBias (Echterhoff et al., 2024): CogBias also evaluates cognitive biases in LLMs but uses general scenarios. The scientific research-contextualized design in this work makes the evaluation results more targeted and practical.
  • vs. Social Bias Research (BBQ, WinoBias): Traditional bias benchmarks focus on demographic factors, whereas the cognitive biases addressed in this study are more insidious and have a greater impact in scientific research settings.
  • vs. Red Teaming for Science: Red-teaming methods already exist in the safety alignment domain; this work can be viewed as cognitive safety red-teaming in scientific scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ The perspective of cognitive bias evaluation in scientific research scenarios is novel, and the test suite is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation covering 6 biases, multiple models, and ablation studies on debiasing strategies.
  • Writing Quality: ⭐⭐⭐⭐ Clear interdisciplinary analysis between cognitive psychology and NLP.
  • Value: ⭐⭐⭐⭐⭐ High practical value, offering important warnings regarding the reliability of LLMs in scientific research.