Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models

Conference: ACL 2025
arXiv: 2506.11068
Area: LLM Alignment / AI Safety
Keywords: Deontological Keyword Bias, Modal Expressions, Normative Judgments, LLM Bias, Debias Methods

TL;DR

This paper shows that LLMs exhibit a "Deontological Keyword Bias" (DKB): when a prompt contains a modal deontic expression such as "must" or "ought to", the models misclassify over 90% of commonsense scenarios as obligations. The authors propose a debiasing strategy that combines few-shot examples with reasoning prompts.

Background & Motivation

The moral reasoning capability of LLMs is increasingly important: as LLMs are deployed more widely in the real world, the normative decisions they make as agents may shape society's understanding of "right" and "wrong."

The importance of deontic (obligation) judgments: Obligation judgments are a core component of how LLMs decide what to do. Unlike factual judgments, they rest on criteria that are often ambiguous, even for humans.

Key differences between humans and LLMs:
- Humans learn normative judgments through real-world interactions and by simulating or imagining consequences.
- LLMs learn concepts of obligation indirectly from textual patterns, without direct exposure to real-world consequences.
- As a result, LLMs may over-rely on linguistic cues (such as modal expressions) rather than on contextual understanding.

Core Hypothesis: LLM obligation judgments are driven largely by modal deontic expressions (hereafter ME), even in scenarios that carry no obligation.

Real-world Risk Example: In "You should have an umbrella when it rains," carrying an umbrella is a reasonable recommendation, not a genuine obligation, yet an LLM may judge it to be one.

Method

Overall Architecture

The study unfolds around two core concepts:

DKE (Deontological Keyword Effect): The general phenomenon where modal deontic expressions lead to an increase in obligation judgments.
DKB (Deontological Keyword Bias): A special case of DKE in which the model judges a scenario to be an obligation because a modal expression is present, even though humans do not consider it an obligation.

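Mathematical definition: Given a semantic frame $S$, a modal enhancement $Z$, and a question format $Q$, let $Y$ denote the model's obligation judgments and $f$ an aggregate of them (e.g., the ratio of positive judgments). DKE holds when $f(Y_{\text{with ME}}) > f(Y_{\text{without ME}})$ consistently across instances. DKB is the special case of DKE in which $S$ carries no obligation-related semantics.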
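
The definition can be sketched as a small test in Python. This is a minimal sketch under assumptions not stated in the source: $f$ is taken to be the ratio of positive (obligation) judgments, and "consistently across instances" is read as "in every instance group", with one group per question format $Q$. The function names are hypothetical.

```python
# Sketch of the DKE/DKB test. Assumptions (not from the paper): f is the
# share of "obligation" judgments, and "consistently" means the inequality
# holds in every instance group (here, one group per question format Q).

def positive_ratio(judgments: list[bool]) -> float:
    """f(Y): fraction of responses that judge the scenario an obligation."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def dke_holds(groups: dict[str, tuple[list[bool], list[bool]]]) -> bool:
    """DKE: f(Y_with_ME) > f(Y_without_ME) in every group."""
    return all(
        positive_ratio(with_me) > positive_ratio(without_me)
        for with_me, without_me in groups.values()
    )


def dkb_holds(groups: dict[str, tuple[list[bool], list[bool]]],
              frame_has_obligation: bool) -> bool:
    """DKB: DKE on a semantic frame S that carries no obligation."""
    return not frame_has_obligation and dke_holds(groups)


# Toy judgments for illustration only (not data from the paper).
groups = {
    "general": ([True, True, True, False], [False, False, False, False]),
    "strict": ([True, True, False, False], [False, False, False, True]),
}
print(dke_holds(groups))                              # True
print(dkb_holds(groups, frame_has_obligation=False))  # True
print(dkb_holds(groups, frame_has_obligation=True))   # False
```

Reading "consistently" as "in every group" is the strictest choice; a looser variant could require the inequality in a chosen share of groups instead.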

Key Designs

Experimental Dataset Design:
- Deontological Dataset (Positive Labels): The deontology dataset from Hendrycks et al. (2021).
- Commonsense Dataset (Negative Labels): Serves as the control group for non-deontic contexts.
- Moral Dataset: From Scherrer et al. (2023), containing high-ambiguity and low-ambiguity sub-datasets.
- Each dataset has 445 samples, using four modal expressions: "must", "ought to", "should", and "have to".
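
The datasets come with their own sentences; the sketch below only illustrates how without-ME, with-ME, and negated-ME variants of one scenario could be produced from the four modal expressions. The sentence template, the helper name make_variants, and every negated form other than "must not" are hypothetical.

```python
# Hypothetical stimulus builder: the sentence template and the negated forms
# below are illustrative assumptions, not the paper's construction procedure.
MODALS = ("must", "ought to", "should", "have to")

# Assumed negations. Only "must not" is named in the source; note that
# "do not have to" denies an obligation rather than imposing a prohibition.
NEGATED = {
    "must": "must not",
    "ought to": "ought not to",
    "should": "should not",
    "have to": "do not have to",
}


def make_variants(subject: str, action: str) -> dict[str, str]:
    """Return the no-ME baseline plus affirmative and negated ME variants.

    Assumes a subject ("You", "I", "We") whose base-form verb agrees in the
    baseline sentence, e.g. "You carry ..." but not "She carry ...".
    """
    variants = {"without_me": f"{subject} {action}."}
    for modal in MODALS:
        variants[f"with:{modal}"] = f"{subject} {modal} {action}."
        variants[f"neg:{modal}"] = f"{subject} {NEGATED[modal]} {action}."
    return variants


for key, sentence in make_variants("You", "carry an umbrella when it rains").items():
    print(f"{key:>16}: {sentence}")
```

Holding the rest of the sentence fixed means any change in the model's judgment across variants can be attributed to the modal expression alone.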

Multi-dimensional Validation:
- Three levels of questions: general, explicit, and strict.
- Two answer formats: binary and continuous rating.
- Impact of negative modal expressions (e.g., "must not").
- Comparison of different modal expression strengths.
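
A minimal sketch of how the question levels and answer formats combine, and how each format might be reduced to the kind of score reported in the tables below: binary answers become a ratio of positive judgments, continuous answers a mean 0–5 rating. The response-parsing rules and the names SETTINGS and score are assumptions for illustration; the paper's exact prompts and parsing are not given here.

```python
from itertools import product
from statistics import mean

QUESTION_LEVELS = ("general", "explicit", "strict")
ANSWER_FORMATS = ("binary", "continuous")

# Every question level is crossed with every answer format: 3 x 2 = 6 settings.
SETTINGS = list(product(QUESTION_LEVELS, ANSWER_FORMATS))


def score(responses: list[str], answer_format: str) -> float:
    """Aggregate one condition's responses into a single number.

    binary     -> ratio of positive (obligation) judgments, in [0, 1]
    continuous -> mean rating on the 0-5 scale
    The parsing rules are assumptions for illustration.
    """
    if answer_format == "binary":
        positives = [r.strip().lower().startswith("yes") for r in responses]
        return sum(positives) / len(positives)
    if answer_format == "continuous":
        ratings = [float(r) for r in responses]
        if any(not 0 <= x <= 5 for x in ratings):
            raise ValueError("rating outside the 0-5 scale")
        return mean(ratings)
    raise ValueError(f"unknown answer format: {answer_format!r}")


print(len(SETTINGS))                                 # 6
print(score(["Yes", "no", "yes, it is"], "binary"))  # 0.6666666666666666
print(score(["5", "4", "0"], "continuous"))          # 3.0
```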

Debiasing Method (In-Context Reasoning):
- A hybrid method combining few-shot learning and reasoning prompts.
- Few-shot examples are annotated based on deontological semantics (rather than keywords).
- Modal expressions are removed from the examples drawn from the deontological dataset.
- Negative examples, drawn from the commonsense dataset, keep their modal expressions.
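
The prompt recipe above can be sketched as follows. Everything concrete here is hypothetical: the instruction wording, both example scenarios, and the build_prompt helper illustrate the described design (labels based on semantics, modal removed from the deontic example, modal kept in the commonsense one). This is not the authors' actual prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    scenario: str
    reasoning: str
    label: str  # "Yes" = obligation, "No" = no obligation


# Hypothetical 2-shot set following the recipe: the deontic example has no
# modal keyword, while the non-deontic example keeps "should".
SHOTS = (
    Shot("As the lifeguard on duty, I watch the swimmers.",
         "The lifeguard role creates a duty to watch the swimmers, "
         "even though no modal word is used.",
         "Yes"),
    Shot("You should have an umbrella when it rains.",
         "This is practical advice; no role, rule, or promise creates a duty.",
         "No"),
)

# Hypothetical reasoning instruction; not the paper's wording.
INSTRUCTION = (
    "Decide whether the scenario describes an obligation. Judge by what the "
    "situation means, not by words such as 'must' or 'should'. "
    "Reason step by step, then answer Yes or No."
)


def build_prompt(query: str, shots: tuple[Shot, ...] = SHOTS) -> str:
    """Assemble instruction, worked examples, and the query into one prompt."""
    lines = [INSTRUCTION, ""]
    for shot in shots:
        lines += [f"Scenario: {shot.scenario}",
                  f"Reasoning: {shot.reasoning}",
                  f"Answer: {shot.label}",
                  ""]
    lines += [f"Scenario: {query}", "Reasoning:"]
    return "\n".join(lines)


print(build_prompt("You must drink water after a long walk."))
```

Because the two demonstrations decouple the keyword from the label, the model cannot reproduce them by following a "modal word means obligation" shortcut.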

Key Experimental Results

Main Results

Comparison of Humans vs. GPT-4o on Obligation Judgments (0-5 scale):

Condition	Dataset	Human	GPT-4o
With Modal Expression	Deontological	4.17 (0.50)	4.95 (0.25)
Without Modal Expression	Deontological	3.11 (1.44)	0.30 (0.05)
With Modal Expression	Commonsense	3.33 (1.84)	4.90 (0.10)
Without Modal Expression	Commonsense	1.90 (1.05)	0.10 (0.10)

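Key observation: On the commonsense dataset with modal expressions, GPT-4o gives a mean score of 4.90 (versus 3.33 for humans) with extremely low variance (0.10), indicating that the model almost mechanically classifies every sentence containing a modal expression as an obligation. The effect also runs the other way: on the deontological dataset without modal expressions, GPT-4o scores only 0.30 versus 3.11 for humans, so it largely misses genuine obligations once the keyword is removed.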
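
The size of the keyword effect can be read off the table by subtracting each rater's without-ME rating from its with-ME rating. The short sketch below does that; the numbers are copied from the table and the helper name modal_shift is hypothetical. On the commonsense dataset GPT-4o shifts by 4.80 points versus 1.43 for humans (about 3.4 times as much), and on the deontological dataset by 4.65 versus 1.06 (about 4.4 times).

```python
# Mean 0-5 obligation ratings copied from the table above: (with ME, without ME).
RATINGS = {
    ("Deontological", "Human"): (4.17, 3.11),
    ("Deontological", "GPT-4o"): (4.95, 0.30),
    ("Commonsense", "Human"): (3.33, 1.90),
    ("Commonsense", "GPT-4o"): (4.90, 0.10),
}


def modal_shift(dataset: str, rater: str) -> float:
    """Rating increase associated with adding a modal expression."""
    with_me, without_me = RATINGS[(dataset, rater)]
    return round(with_me - without_me, 2)


for dataset in ("Deontological", "Commonsense"):
    human = modal_shift(dataset, "Human")
    model = modal_shift(dataset, "GPT-4o")
    print(f"{dataset}: human +{human:.2f}, GPT-4o +{model:.2f}, "
          f"ratio {model / human:.1f}x")
```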

Cross-model validation of DKB (commonsense dataset; ratio of positive obligation judgments):

Model	Without ME	With ME	With Negative ME
GPT-4o	0.02	0.98	0.98
GPT-4o-mini	0.04	0.96	0.97
Llama-3.1-70B	0.01	0.86	0.87
Llama-3.1-8B	0.00	0.54	0.59
Gemma-9B	0.01	0.89	0.69
Qwen-7B	0.02	0.88	0.92
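
The claims in the findings below can be checked directly against this table. The sketch recomputes the largest without-ME ratio, the with-ME range, and the cross-model means, and lists the models for which the negative form biases at least as much as the affirmative form. One detail it surfaces: Gemma-9B pulls the negative-ME mean (0.837) slightly below the affirmative-ME mean (0.852).

```python
# Ratios of positive obligation judgments on the commonsense dataset, copied
# from the table above: (without ME, with ME, with negative ME).
RESULTS = {
    "GPT-4o": (0.02, 0.98, 0.98),
    "GPT-4o-mini": (0.04, 0.96, 0.97),
    "Llama-3.1-70B": (0.01, 0.86, 0.87),
    "Llama-3.1-8B": (0.00, 0.54, 0.59),
    "Gemma-9B": (0.01, 0.89, 0.69),
    "Qwen-7B": (0.02, 0.88, 0.92),
}

without_me, with_me, negative_me = zip(*RESULTS.values())

print(f"max without ME: {max(without_me):.2f}")                       # 0.04
print(f"with ME range: {min(with_me):.2f}-{max(with_me):.2f}")        # 0.54-0.98
print(f"mean with ME: {sum(with_me) / len(with_me):.3f}")             # 0.852
print(f"mean negative ME: {sum(negative_me) / len(negative_me):.3f}")  # 0.837

# Models whose negative-ME ratio is at least their affirmative-ME ratio.
neg_at_least_aff = [m for m, (_, aff, neg) in RESULTS.items() if neg >= aff]
print(neg_at_least_aff)  # every model except Gemma-9B
```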

Bias Strength of Different Modal Expressions (Commonsense dataset, cross-model average):

Modal Expression	Ratio of Positive Judgments
must	0.86
ought to	0.83
should	0.79
have to	0.64

According to the authors, this ordering of bias strength (must > ought to > should > have to) is consistent with modal strength in deontic logic.

Key Findings

Universality of DKB: Across the tested LLMs on the commonsense dataset, the ratio of positive judgments jumped from under 5% without modal expressions to 54–98% with them.
Negative Modal Expressions Also Induce Bias: Negative forms (such as "must not") are also judged as carrying obligation semantics. On the commonsense dataset, the bias from negative forms is equal to or stronger than that from affirmative forms for five of the six models; Gemma-9B is the exception (0.69 vs. 0.89).
Consistency Across Question Formats: DKB consistently exists across general, explicit, and strict question levels, as well as binary/continuous rating formats.
Limited Impact in Reasoning Tasks: In the Opposing Contexts Scenario (OCS) experiments, the impact of modal expressions on reasoning outcomes is small and inconsistent, suggesting that keywords might affect judgment and reasoning differently.
Debiasing Performance: The hybrid few-shot + reasoning prompt (2-shot + reasoning) reduced the ratio of positive judgments on the commonsense dataset from 0.88 to 0.28, showing clear debiasing potential.
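
The reported drop can be restated as an absolute and a relative reduction. This is simple arithmetic on the two figures above; by themselves, these figures do not show how the prompt affects genuinely deontological scenarios.

```latex
\Delta_{\text{abs}} = 0.88 - 0.28 = 0.60, \qquad
\Delta_{\text{rel}} = \frac{0.88 - 0.28}{0.88} \approx 0.68
```

In other words, roughly two-thirds of the positive judgments on commonsense scenarios are removed.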

Highlights & Insights

Discovery and Definition of a New Phenomenon: This study is the first to systematically identify and formally define "Deontological Keyword Bias" (DKB), filling an important gap in research on LLM moral reasoning.
Connection with Instruction Tuning: LLMs are frequently instruction-tuned to follow user prompts, which may make them particularly sensitive to modal deontic expressions. When expressions like "must follow the instruction" appear, the model may over-generalize their directive force.
Bias Source in Training Data: In the Alpaca RLHF dataset, for example, non-deontic usages such as "A picnic list should include items such as sandwiches" also reinforce the association between modal expressions and deontic semantics.
Reflections on Practical Impact: As LLMs act as agent systems making real-world decisions, the ability to distinguish legal enforcement, social norms, and recommendations is crucial.
Simple and Effective Debiasing Solution: The proposed training-free debiasing method is simple and practical, serving as a quick patch for actual deployment.

Limitations & Future Work

Limited Dataset Scale: Each dataset contains only 445 samples, and the experiments did not cover all available models.
Linguistic Limitations: Only English data was used; whether similar biases exist in other languages or cultural backgrounds remains to be verified.
Insufficient Quantification of Debiasing Effects: Although the debiasing method is effective, quantitative evaluation of the extent of its adjustment is lacking.
Lack of Mechanistic Analysis: The internal mechanisms by which DKB arises in LLM representations were not analyzed in depth.
Limited Types of Modal Expressions: Only four modal expressions were tested; other types (such as "need to", "required to") were not covered.

Related Work

Deontic Logic: Von Wright (1951) symbolic deontic logic; Kant's Categorical Imperative.
Obligation Detection in NLP: Chalkidis et al. (2018) RNN-based regulatory obligation detection; Sun et al. (2023) DeonticBERT.
LLM Bias: Solaiman et al. (2019) social fairness bias; Ladhak et al. (2023) name-culture entanglement.
LLM Moral Reasoning: Zhou et al. (2023), Rao et al. (2023) ethical reasoning.

Rating

Dimension	Rating (1-10)	Explanation
Novelty	9	Identifies and defines the important phenomenon of DKB for the first time.
Technical Depth	7	Primarily empirical, with a clear formal definition.
Experimental Thoroughness	8	Systematic validation across multiple models, dimensions, and conditions.
Writing Quality	8	Clear conceptual definitions and structured layout.
Value	8	Direct guiding significance for AI safety and alignment research.
Overall Rating	8.0	A high-quality empirical study revealing an important bias phenomenon.