Many LLMs Are More Utilitarian Than One¶
Conference: NeurIPS 2025 | arXiv: 2507.00814 | Code: GitHub | Area: LLM Reasoning / AI Alignment | Keywords: Multi-Agent Systems, Moral Reasoning, Utilitarian Boost, Group Deliberation, AI Alignment, Deontology
TL;DR¶
A controlled study across six LLMs identifies a "Utilitarian Boost" phenomenon: LLMs engaged in dyadic or triadic moral deliberation are more likely than their solo counterparts to endorse harming a minority for the benefit of the majority. This effect is especially pronounced in personal dilemmas involving direct harm (\(\beta=0.31, p<.0001\)), and the underlying mechanisms differ across models—some exhibit reduced norm sensitivity, others heightened impartiality.
Background & Motivation¶
Background: LLM-based multi-agent systems (MAS) have been deployed in high-stakes domains such as healthcare and law for collaborative decision-making. Prior work has documented social-psychological phenomena in LLM groups, including conformity and belief congruence. Yet group moral reasoning—arguably the most consequential decision type for deployment—remains almost entirely unstudied.
Limitations of Prior Work: (1) LLM moral reasoning research is almost exclusively single-agent, offering no insight into emergent multi-agent behavior. (2) Human psychology has established that group discussion produces a "Utilitarian Boost"—groups are more willing than individuals to sacrifice a minority for the "greater good." If LLM MAS exhibit the same effect, it poses a serious threat in high-stakes deployments. (3) Single-agent alignment checks cannot capture morally-shifted behavior that emerges at the group level.
Key Challenge: An individual LLM may pass safety evaluations in isolation, yet contribute to more dangerous moral judgments when embedded in a group—an alignment blind spot that has been entirely overlooked.
Goal: Do LLMs also exhibit a Utilitarian Boost during group deliberation? What mechanisms drive this boost? Can it be mitigated?
Key Insight: The paper directly adopts well-validated experimental paradigms from psychology—Greene's moral dilemma battery, the Oxford Utilitarianism Scale (OUS), and the CNI model—to study LLM groups.
Core Idea: LLM multi-agent systems systematically shift toward utilitarianism following moral deliberation—safety at the individual model level does not guarantee safety at the group level.
Method¶
Overall Architecture¶
Two conditions are compared: Solo (a single LLM independently evaluates moral dilemmas) vs. Group (two or three instances of the same LLM conduct six rounds of discussion before each agent provides a private reflective rating). Moral reasoning is quantified using the classical dilemma battery (personal/impersonal) and established psychological instruments (OUS, CNI model). Experiments are conducted across six LLMs: Llama3.3-70B, GPT-4.1, Gemma3-27B, Qwen3-32B, Qwen2.5-32B, and QwQ.
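To make the protocol concrete, here is a minimal sketch of the Solo and Group conditions, assuming a hypothetical `chat(model, messages)` completion helper; the paper's exact prompts, turn order, and rating parsing are not reproduced.

```python
# Minimal sketch of the Solo vs. Group deliberation protocol described above.
# `chat(model, messages) -> str` is a hypothetical completion helper.
from typing import Callable, List

ROUNDS = 6  # six discussion rounds in the Group condition

def solo_rating(chat: Callable, model: str, dilemma: str) -> str:
    """Single model rates the dilemma on the 1-7 scale with no discussion."""
    prompt = (f"{dilemma}\n\nRate how acceptable the described action is on a scale "
              f"from 1 (completely unacceptable) to 7 (completely acceptable).")
    return chat(model, [{"role": "user", "content": prompt}])

def group_ratings(chat: Callable, models: List[str], dilemma: str) -> List[str]:
    """Two or three agents discuss for six rounds, then rate privately."""
    transcript: List[str] = []
    for rnd in range(ROUNDS):
        for i, model in enumerate(models):
            context = "\n".join(transcript) or "(no discussion yet)"
            msg = chat(model, [{"role": "user", "content":
                f"{dilemma}\n\nDiscussion so far:\n{context}\n\n"
                f"Round {rnd + 1}: share your reasoning with the other agent(s)."}])
            transcript.append(f"Agent {i + 1}: {msg}")
    # Private reflective rating: each agent answers alone, seeing the full transcript.
    return [chat(model, [{"role": "user", "content":
        f"{dilemma}\n\nFull discussion:\n" + "\n".join(transcript) +
        "\n\nNow privately rate the action from 1 to 7."}])
        for model in models]
```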
Key Designs¶
- Greene Moral Dilemma Experiments:
    - Function: Measure changes in LLMs' acceptance of moral violations between the Solo and Group conditions.
    - Mechanism: Greene et al.'s dilemma battery is used, distinguishing personal dilemmas (direct harm, e.g., pushing someone off a bridge to save five) from impersonal dilemmas (indirect harm, e.g., pulling a lever). Each dilemma is rated on a 1–7 scale (7 = most utilitarian). In the Group condition, three agents deliberate for six rounds before independently providing private reflective ratings. Differences between Group and Solo are analyzed with ordinal mixed-effects regression (a simplified analysis sketch follows this list).
    - Design Motivation: The dilemma battery has been validated by two decades of human moral psychology research; adopting it unchanged ensures comparability with human findings.
- CNI Model Analysis of Utilitarian Mechanisms:
    - Function: Decompose the source of the Utilitarian Boost—whether it stems from greater sensitivity to consequences (C), reduced sensitivity to moral norms (N), or a shift in the general preference for inaction over action (I).
    - Mechanism: The CNI model estimates three latent parameters—C (consequence sensitivity), N (norm sensitivity), and I (general inaction preference)—from response patterns across four orthogonal dilemma variants that cross norm type (prescriptive vs. proscriptive) with the consequences of acting (benefits greater vs. smaller than costs). In humans, the group Utilitarian Boost is driven solely by increased C; the paper tests whether the same holds for LLM groups (a fitting sketch follows this list).
    - Design Motivation: Moving beyond surface-level performance metrics to diagnose the specific cognitive mechanism underlying the Utilitarian Boost, since different mechanisms call for different mitigation strategies.
- Mitigation Strategy Exploration:
    - Function: Test the effects of model diversity, self-reflection, and pre-assigned moral frameworks on the Utilitarian Boost.
    - Mechanism: (1) Model heterogeneity—pairing models from different families or of different sizes: homogeneous pairs (e.g., GPT-4.1 × GPT-4.1) amplify utilitarianism, cross-family heterogeneous pairs attenuate the boost (\(\beta=-0.30, p=.0001\)), and mixed-size pairs even reverse it toward deontology (\(\beta=1.40, p<.001\)). (2) Self-reflection—replacing multi-agent discussion with iterative single-model self-debate eliminates the Utilitarian Boost. (3) Moral priming—deontology–deontology (DD) pairs elevate utilitarianism, while mixed UD/DU pairs produce a "Deontological Boost" (\(\beta=-0.323, p<.0001\)). A sketch of the pairing and priming conditions follows this list.
    - Design Motivation: Providing practitioners with actionable design levers for controlling the Utilitarian Boost.
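For the Greene-battery analysis, the paper reports an ordinal mixed-effects regression. As a rough approximation (statsmodels offers no ordinal mixed model), the sketch below fits a plain ordered logit with a Group-vs-Solo indicator and drops the random effects for dilemma and run; the data file and column names are illustrative.

```python
# Simplified analysis sketch for the Greene-battery ratings (Design 1).
# This fits an ordered logit with a Group indicator only; the paper's model
# additionally includes random effects, which are omitted here.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# One row per rating, with columns "rating" (1-7) and "condition" ("solo"/"group").
df = pd.read_csv("ratings.csv")  # hypothetical file
df["rating"] = pd.Categorical(df["rating"], categories=list(range(1, 8)), ordered=True)
df["is_group"] = (df["condition"] == "group").astype(int)

model = OrderedModel(df["rating"], df[["is_group"]], distr="logit")
res = model.fit(method="bfgs", disp=False)
# A positive is_group coefficient means higher (more utilitarian) ratings in groups.
print(res.summary())
```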
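For the CNI analysis, the sketch below encodes the standard CNI processing-tree equations (Gawronski et al.) and recovers C, N, I from action rates in the four dilemma variants via least squares; the paper's estimation procedure (typically multinomial-processing-tree maximum likelihood) may differ, and the example rates are made up.

```python
# Sketch of the CNI decomposition (Design 2) using the standard processing-tree
# equations: consequences drive the response with probability C, otherwise norms
# with probability N, otherwise a general inaction preference I.
import numpy as np
from scipy.optimize import minimize

def predicted_action_rates(c: float, n: float, i: float) -> np.ndarray:
    """P(choose action) in the four dilemma variants, ordered as:
    [proscriptive norm & benefits > costs, proscriptive & benefits < costs,
     prescriptive norm & benefits > costs, prescriptive & benefits < costs]."""
    residual = (1 - c) * (1 - n) * (1 - i)  # neither consequences nor norms decide
    return np.array([
        c + residual,                    # norm says "don't act", consequences favor acting
        residual,                        # norm says "don't act", consequences oppose acting
        c + (1 - c) * n + residual,      # norm says "act", consequences favor acting
        (1 - c) * n + residual,          # norm says "act", consequences oppose acting
    ])

def fit_cni(observed: np.ndarray) -> np.ndarray:
    """Fit C, N, I to observed action rates in the four variants (least squares)."""
    loss = lambda p: np.sum((predicted_action_rates(*p) - observed) ** 2)
    res = minimize(loss, x0=[0.5, 0.5, 0.5], bounds=[(0, 1)] * 3)
    return res.x  # [C, N, I]

# Hypothetical action rates for one model in Solo vs. Group conditions.
solo = np.array([0.55, 0.10, 0.90, 0.45])
group = np.array([0.75, 0.25, 0.92, 0.50])
print("Solo  C, N, I:", fit_cni(solo).round(2))
print("Group C, N, I:", fit_cni(group).round(2))
```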
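For the mitigation experiments, a compact sketch of how the pairing and priming grids might be constructed; the model names and framework prompts are illustrative, not the paper's wording.

```python
# Sketch of the mitigation conditions (Design 3): a model-heterogeneity pairing
# grid and a moral-framework priming grid (UU, UD, DU, DD). Illustrative only.
from itertools import product

MODELS = ["gpt-4.1", "llama3.3-70b", "gemma3-27b"]  # enables cross-family pairs
PRIMES = {
    "U": "Adopt a utilitarian framework: judge actions solely by their overall consequences.",
    "D": "Adopt a deontological framework: judge actions by moral rules and duties, not outcomes.",
}

# Model-heterogeneity grid: homogeneous pairs (A, A) vs. heterogeneous pairs (A, B).
pairs = [(a, b) for a, b in product(MODELS, repeat=2)]
print(len(pairs), "model pairs, e.g.", pairs[:3])

# Moral-priming grid for a fixed model pair: one system prompt per agent.
priming_conditions = {
    f"{p1}{p2}": [PRIMES[p1], PRIMES[p2]] for p1, p2 in product(PRIMES, repeat=2)
}
for label, prompts in priming_conditions.items():
    print(label, "->", [p.split(":")[0] for p in prompts])
```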
Key Experimental Results¶
Main Results¶
| Model | Group–Solo Difference | SE | z | p |
|---|---|---|---|---|
| Gemma3 (largest boost) | +1.65 | 0.16 | 10.33 | <.0001 |
| Qwen3 | +1.23 | 0.155 | 7.90 | <.0001 |
| Llama3.3 | +0.80 | 0.158 | 5.07 | <.0001 |
| Qwen2.5 | +0.68 | 0.124 | 5.47 | <.0001 |
| QwQ | +0.69 | 0.125 | 5.54 | <.0001 |
| GPT-4.1 | +0.57 | 0.17 | 3.35 | .0023 |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| Personal dilemmas | +0.6352 (p<.001) | Significant Utilitarian Boost in direct-harm scenarios |
| Impersonal dilemmas | −0.0227 (p=.975) | No significant effect in indirect-harm scenarios |
| Homogeneous model pairs | +0.29 (p=.0001) | Same-model pairing amplifies utilitarianism |
| Cross-family heterogeneous pairs | −0.30 (p=.0001) | Cross-family pairing attenuates the boost |
| Mixed-size model pairs | +1.40 (p<.001) | Pairing models of different sizes reverses the effect, producing a Deontological Boost |
| Self-reflection (no group) | No significant boost | Group dynamics, not iteration per se, drive the boost |
Key Findings¶
- The Utilitarian Boost is significant only in personal dilemmas (direct harm), in contrast to the human pattern, where the boost also appears in impersonal dilemmas.
- CNI profiles differ markedly across models: Gemma3 acts as a "norm-optimizing utilitarian," GPT-4.1 as an "impartial utilitarian," and Qwen3 as an "action-oriented utilitarian."
- Model diversity is the most effective mitigation tool—mixing models from different families or of different sizes can eliminate or even reverse the effect.
- Sentiment analysis reveals an elevated proportion of "fear" labels in the Group condition, which correlates positively with the Utilitarian Boost.
Highlights & Insights¶
- This is the first systematic demonstration of emergent moral drift in LLM multi-agent systems—the finding that "groups are more utilitarian than individuals" carries profound implications for AI alignment.
- The divergent CNI profiles across models reveal that the Utilitarian Boost is not a monolithic phenomenon, necessitating model-specific mitigation strategies.
- The discovery that model heterogeneity serves as a mitigation tool has immediate practical value—simply mixing different models may suffice.
- The experimental design is rigorous: validated psychological instruments, human evaluation to verify rating consistency, and three repetitions per condition.
Limitations & Future Work¶
- Only dyads and triads are tested; larger groups, asynchronous deliberation, and other configurations remain unexplored.
- Experiments are conducted primarily in English; cross-linguistic and cultural variation in moral norms is not considered.
- Mitigation experiments are exploratory and require more systematic validation.
- Group architectures incorporating a meta-controller (moderator) are not tested.
- The association between sentiment labels and utilitarianism is correlational rather than causal.
Related Work & Insights¶
- vs. Human group moral research: The human Utilitarian Boost is driven by increased consequence sensitivity (C); LLM mechanisms vary across models and are more complex.
- vs. LLM conformity research (Weng et al.): Conformity is a general phenomenon, whereas the moral drift identified here is more specific and more dangerous.
- vs. Single-agent moral evaluation: Single-agent alignment checks cannot capture group-emergent moral drift—group-level alignment evaluation is necessary.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First demonstration of the Utilitarian Boost in LLM multi-agent systems; an important discovery of an alignment blind spot.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six models × multiple dilemma batteries × CNI analysis × mitigation experiments × human verification—exceptionally comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear structure; psychological instruments applied rigorously.
- Value: ⭐⭐⭐⭐⭐ An important warning for multi-agent AI alignment, with actionable mitigation strategies.