
Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation

Conference: ACL 2026 arXiv: 2604.17220 Code: None Area: Other Keywords: supply chain simulation, cognitive heterogeneity, bullwhip effect, LLM agents, beer distribution game

TL;DR

This paper deploys LLM agents (DeepSeek/GPT series) in the classic beer distribution game to simulate multi-stage supply chains, systematically investigating how cognitive heterogeneity (differences in reasoning capability) affects system behavior. The findings demonstrate that LLM agents can reproduce human-observed bullwhip effects and myopic behaviors, and that information sharing effectively mitigates these adverse effects.

Background & Motivation

Background: Behavioral experiments (e.g., the beer distribution game) have revealed supply chain inefficiencies caused by cognitive biases, such as the bullwhip effect. However, traditional human-subject experiments face constraints in scalability, cost, and experimental control. The potential of LLMs as behavioral proxies is an emerging area of exploration.

Limitations of Prior Work: (1) Most LLM multi-agent studies focus on static or structurally simple settings, without exploring highly dynamic multi-period environments; (2) existing studies typically deploy homogeneous agents, neglecting the impact of cognitive heterogeneity (mixtures of agents with varying reasoning capabilities) on collective behavior; (3) rigorous statistical validation is often absent.

Key Challenge: Strategic diversity is both pervasive and consequential in real organizations, yet its interaction effects in synthetic environments remain insufficiently studied.

Goal: To establish an LLM-driven supply chain simulation paradigm and systematically investigate how cognitive heterogeneity shapes collective behavior.

Key Insight: Agents with different reasoning capabilities (base models vs. reasoning-enhanced models) represent distinct cognitive levels, and heterogeneous agents are deployed at different positions within the supply chain.

Core Idea: LLM agents can reproduce human behavioral biases; cognitive heterogeneity exacerbates systemic inefficiency; and information sharing serves as an effective mitigation mechanism.

Method

Overall Architecture

LLM agents are deployed in the classic beer distribution game (a 4-echelon supply chain: Retailer → Wholesaler → Distributor → Manufacturer), with each agent deciding order quantities at every period. Experiments include homogeneous conditions (all agents shallow, or all agents deep) and stratified conditions (a single deep agent placed at each of the four positions in turn), with 32 independent replications per configuration over 20 periods.
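The simulation loop can be sketched as follows. The base-stock ordering rule below is a hypothetical stand-in for the LLM agents' decisions (the paper's agents choose orders via prompted reasoning), and the 2-period shipping pipeline and target level are illustrative assumptions, not values from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One echelon (Retailer / Wholesaler / Distributor / Manufacturer)."""
    inventory: int = 12
    backlog: int = 0
    # shipments in transit; two slots ~ a 2-period lead time (illustrative)
    incoming: list = field(default_factory=lambda: [4, 4])

def order_decision(stage: Stage, demand: int, target: int = 16) -> int:
    # Hypothetical base-stock rule standing in for an LLM agent's order choice.
    position = stage.inventory - stage.backlog + sum(stage.incoming)
    return max(0, demand + target - position)

def step(stages: list, customer_demand: int) -> list:
    """One period of the 4-echelon game; returns the orders each stage places."""
    demand, shipped, orders = customer_demand, [], []
    for stage in stages:  # iterate Retailer -> Manufacturer
        stage.inventory += stage.incoming.pop(0)   # receive pipeline shipment
        owed = demand + stage.backlog
        out = min(stage.inventory, owed)           # ship what inventory allows
        stage.inventory -= out
        stage.backlog = owed - out
        shipped.append(out)
        orders.append(order_decision(stage, demand))
        demand = orders[-1]                        # upstream sees this order
    for i, stage in enumerate(stages):             # refill shipping pipelines
        stage.incoming.append(shipped[i + 1] if i + 1 < len(stages) else orders[-1])
    return orders
```

Running a few steady periods of demand 4 and then a single step to 8 already produces order amplification toward the manufacturer, the signature of the bullwhip effect.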

Key Designs

  1. Hierarchical Reasoning Framework:

    • Function: Systematically model agents with varying cognitive depths.
    • Mechanism: Cognition is divided into two levels: shallow (DeepSeek-V3, GPT-4.1) and deep (DeepSeek-R1, GPT-5). Deep models consistently outperform their base counterparts on reasoning benchmarks such as AIME and GPQA. A dual-family design (DeepSeek series + GPT series) controls for architectural differences while validating cross-family consistency.
    • Design Motivation: Provides empirically grounded justification for the cognitive stratification, ensuring a scientific basis for the experimental categorization.
  2. Cognitive Heterogeneity Experimental Design:

    • Function: Isolate the effect of cognitive depth on supply chain behavior.
    • Mechanism: Six configurations are used: homogeneous conditions (Original: all shallow; R-Overall: all deep) and stratified conditions (R-S1 through R-S4, placing a single deep agent at one echelon). Each configuration is crossed with two information conditions (with/without information sharing), and chain-of-thought (CoT) prompting supports structured decision-making.
    • Design Motivation: Systematically varying a single variable (the position of cognitive depth) enables identification of causal effects.
  3. Information Sharing Mechanism:

    • Function: Test the effectiveness of information transparency in mitigating behavioral biases.
    • Mechanism: Under the information-sharing condition, each agent receives inventory and backlog information from all other echelons. Order variance, total cost, and bullwhip effect intensity are compared across information conditions.
    • Design Motivation: Information asymmetry is a classical driver of the bullwhip effect; this design validates whether LLM agents similarly benefit from information sharing.
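Taken together, the experimental grid described above can be enumerated in a few lines. The model names follow the paper; the data structure and condition labels are illustrative:

```python
from itertools import product

SHALLOW, DEEP = "DeepSeek-V3", "DeepSeek-R1"   # GPT family: GPT-4.1 / GPT-5
ECHELONS = ["Retailer", "Wholesaler", "Distributor", "Manufacturer"]

def build_configurations() -> dict:
    """Six cognitive configurations: two homogeneous plus four stratified."""
    configs = {"Original": [SHALLOW] * 4, "R-Overall": [DEEP] * 4}
    for i, _echelon in enumerate(ECHELONS):
        assignment = [SHALLOW] * 4
        assignment[i] = DEEP                   # single deep agent at echelon i
        configs[f"R-S{i + 1}"] = assignment
    return configs

# Cross with information sharing on/off: 6 x 2 = 12 cells per model family,
# each run for 32 independent replications over 20 periods.
cells = list(product(build_configurations(), ["no-IS", "with-IS"]))
```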

Loss & Training

No model training is involved. Standard statistical tests (sign test, t-test, Mann-Whitney U test) are used to verify the significance of results.
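As a sketch, the sign test over paired replications can be computed with the exact binomial tail. This is the standard two-sided formulation; the paper's exact test procedure may differ:

```python
from math import comb

def sign_test_p(diffs) -> float:
    """Two-sided sign test: p-value under the null that positive and
    negative paired differences are equally likely (zeros are dropped)."""
    pos = sum(d > 0 for d in diffs)
    n = sum(d != 0 for d in diffs)
    k = min(pos, n - pos)
    # doubled lower tail of Binomial(n, 0.5), capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```

With 32 replications, even a moderately consistent cost difference between two conditions yields a small p-value under this test.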

Key Experimental Results

Main Results

Bullwhip effect replication (homogeneous conditions, no information sharing):

| Configuration | Order Variance Amplification | p-value | Note |
| --- | --- | --- | --- |
| DeepSeek-Original | 82.3% | <0.001 | Significant bullwhip effect |
| DeepSeek-R-Overall | 79.8% | <0.001 | Persists after reasoning enhancement |
| GPT-Original | 74.2% | <0.001 | Consistent across families |
| GPT-R-Overall | 74.3% | <0.001 | Cross-family validation |
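The reported percentages measure how much order variance grows moving upstream. A plausible formulation (the paper's exact definition is not reproduced here, so this is an assumption) is the relative variance increase:

```python
from statistics import pvariance

def variance_amplification(downstream_orders, upstream_orders) -> float:
    """Relative increase in order variance upstream vs. downstream;
    e.g. a value of 0.823 would correspond to a reported 82.3%."""
    v_down = pvariance(downstream_orders)
    v_up = pvariance(upstream_orders)
    return (v_up - v_down) / v_down
```

For example, `variance_amplification([4, 4, 8, 8], [0, 4, 8, 16])` evaluates to 7.75, i.e. a 775% amplification of order variance at the upstream stage.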

Ablation Study

Mitigation effect of information sharing:

| Condition | Total Cost (No IS) | Total Cost (With IS) | Reduction |
| --- | --- | --- | --- |
| DeepSeek-Original | 39.43 | 20.15 | ~49% |
| DeepSeek-R-Overall | 29.43 | 17.71 | ~40% |

Key Findings

  • LLM agents successfully reproduce the bullwhip effect observed in human experiments (p<0.001), validating the credibility of LLMs as behavioral proxies.
  • Compared to human data, LLM agents exhibit more stable decisions (lower variance), yielding cleaner statistical signals.
  • Cognitive enhancement (R1/GPT-5) reduces total cost but does not eliminate the bullwhip effect: even "smarter" agents still exhibit myopic behavior.
  • Information sharing is the most effective intervention, consistently reducing costs by 40–50% across all configurations.
  • Self-interested behavior (each agent minimizing its own cost) is identified as the root cause of systemic inefficiency.

Highlights & Insights

  • Using LLMs to simulate behavioral experiments is a highly promising paradigm: compared to human-subject experiments, the approach reduces costs by orders of magnitude, supports large-scale replication, and enables precise variable control. This has transformative implications for operations management and behavioral economics research.
  • The finding that cognitive enhancement cannot eliminate the bullwhip effect is particularly insightful: the problem lies not in individual intelligence, but in information structure and incentive design, a conclusion that closely mirrors dynamics in real organizations.
  • The dual-family validation design (DeepSeek + GPT) ensures the cross-platform robustness of the findings.

Limitations & Future Work

  • Whether the "cognitive biases" exhibited by LLM agents are fundamentally equivalent to those of humans remains uncertain — they may reflect behavioral patterns learned from training data rather than genuine cognitive limitations.
  • While the beer distribution game is a classic benchmark, it is highly simplified; real supply chains involve far greater complexity (multiple products, stochasticity, contractual constraints).
  • The temperature parameter is fixed at 1; behavior may differ under other temperature settings (though prior work on stability is cited).
  • Only 4-echelon linear supply chains are studied; networked supply chain behavior may differ substantially.
  • vs. Kirshner (2024): A pioneering work deploying LLM agents in supply chains, but using homogeneous settings; this paper is the first to introduce cognitive heterogeneity.
  • vs. Park et al. (2023) (Generative Agents): Focuses on social interaction simulation; this paper extends LLM agents to structured economic environments.
  • vs. traditional RL methods (IPPO/MAPPO): Require strict state space definitions and extensive training; LLM agents exhibit human-like behavior with zero training.

Rating

  • Novelty: ⭐⭐⭐⭐ A novel perspective combining cognitive heterogeneity with supply chain simulation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 32 replications × 6 configurations × 2 information conditions, with rigorous statistical validation.
  • Writing Quality: ⭐⭐⭐⭐ Experimental design is clear; statistical analysis is sound.
  • Value: ⭐⭐⭐⭐ Opens a new direction for applying LLM agents to organizational behavior research.