Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation¶
Conference: ACL 2026 · arXiv: 2604.17220 · Code: None · Area: Other · Keywords: supply chain simulation, cognitive heterogeneity, bullwhip effect, LLM agents, beer distribution game
TL;DR¶
This paper deploys LLM agents (DeepSeek/GPT series) in the classic beer distribution game to simulate multi-stage supply chains, systematically investigating how cognitive heterogeneity (differences in reasoning capability) affects system behavior. The findings demonstrate that LLM agents can reproduce human-observed bullwhip effects and myopic behaviors, and that information sharing effectively mitigates these adverse effects.
Background & Motivation¶
Background: Behavioral experiments (e.g., the beer distribution game) have revealed supply chain inefficiencies caused by cognitive biases, such as the bullwhip effect. However, traditional human-subject experiments face constraints in scalability, cost, and experimental control. The potential of LLMs as behavioral proxies is an emerging area of exploration.
Limitations of Prior Work: (1) Most LLM multi-agent studies focus on static or structurally simple settings, without exploring highly dynamic multi-period environments; (2) existing studies typically deploy homogeneous agents, neglecting the impact of cognitive heterogeneity (mixtures of agents with varying reasoning capabilities) on collective behavior; (3) rigorous statistical validation is often absent.
Key Challenge: Strategic diversity is both pervasive and consequential in real organizations, yet how heterogeneous agents interact in dynamic synthetic environments remains insufficiently studied.
Goal: To establish an LLM-driven supply chain simulation paradigm and systematically investigate how cognitive heterogeneity shapes collective behavior.
Key Insight: Agents with different reasoning capabilities — base models vs. reasoning-enhanced models — are used to represent distinct cognitive levels, and heterogeneous agents are deployed at different positions within the supply chain.
Core Idea: LLM agents can reproduce human behavioral biases; cognitive heterogeneity exacerbates systemic inefficiency; and information sharing serves as an effective mitigation mechanism.
Method¶
Overall Architecture¶
LLM agents are deployed in the classic beer distribution game (a 4-echelon supply chain: Retailer → Wholesaler → Distributor → Manufacturer), with each agent deciding order quantities at every period. Experiments include homogeneous conditions (all shallow/deep agents) and stratified conditions (a single deep agent placed at different positions), with 32 independent replications per configuration over 20 periods.
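A minimal sketch of one such rollout, with a simple heuristic standing in for the LLM order decision (the two-period shipping delay, holding/backlog costs, and demand process below are illustrative assumptions, not the paper's exact parameters):

```python
import random

def order_policy(inventory, backlog, demand_seen):
    # Stand-in for the LLM agent's per-period decision: a hypothetical
    # heuristic covering observed demand plus backlog, discounted by
    # excess inventory. The paper instead prompts an LLM at this step.
    return max(0, demand_seen + backlog - max(inventory - 8, 0) // 2)

def simulate(periods=20, seed=0):
    rng = random.Random(seed)
    n = 4                                  # Retailer -> Wholesaler -> Distributor -> Manufacturer
    inv, back = [12] * n, [0] * n
    pipe = [[4, 4] for _ in range(n)]      # two-period shipping/production delay (assumed)
    orders_hist = [[] for _ in range(n)]
    last_orders = [4] * n
    cost = 0.0
    for _ in range(periods):
        customer = 4 + rng.randint(-2, 2)         # noisy end-customer demand
        incoming = [customer] + last_orders[:-1]  # each echelon sees its downstream order
        new_orders = []
        for i in range(n):
            inv[i] += pipe[i].pop(0)              # shipment (or production) arrives
            need = incoming[i] + back[i]
            shipped = min(inv[i], need)
            inv[i] -= shipped
            back[i] = need - shipped
            if i > 0:
                pipe[i - 1].append(shipped)       # ship downstream after the delay
            order = order_policy(inv[i], back[i], incoming[i])
            new_orders.append(order)
            orders_hist[i].append(order)
            cost += 0.5 * inv[i] + 1.0 * back[i]  # holding + backlog penalty (illustrative)
        pipe[n - 1].append(new_orders[-1])        # manufacturer produces what it orders
        last_orders = new_orders
    return cost, orders_hist
```

Swapping `order_policy` for a per-echelon LLM call (with the prompt carrying each agent's observation) recovers the paper's setup.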
Key Designs¶
- Hierarchical Reasoning Framework:
    - Function: Systematically model agents with varying cognitive depths.
    - Mechanism: Cognition is divided into two levels: shallow (DeepSeek-V3, GPT-4.1) and deep (DeepSeek-R1, GPT-5). Deep models consistently outperform their base counterparts on reasoning benchmarks such as AIME and GPQA. A dual-family design (DeepSeek series + GPT series) controls for architectural differences while validating cross-family consistency.
    - Design Motivation: Grounds the shallow/deep stratification in measurable benchmark differences rather than ad hoc labels.
- Cognitive Heterogeneity Experimental Design:
    - Function: Isolate the effect of cognitive depth on supply chain behavior.
    - Mechanism: Six configurations are used: two homogeneous conditions (Original: all shallow; R-Overall: all deep) and four stratified conditions (R-S1 through R-S4, each placing a single deep agent at one echelon). Each configuration is crossed with two information conditions (with/without information sharing), and chain-of-thought (CoT) prompting supports structured decision-making.
    - Design Motivation: Systematically varying a single variable (the position of cognitive depth) enables identification of causal effects.
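The resulting experiment grid can be enumerated directly. A sketch, assuming R-S1 through R-S4 map to Retailer through Manufacturer in order (the summary above does not pin this mapping down):

```python
from itertools import product

ECHELONS = ["Retailer", "Wholesaler", "Distributor", "Manufacturer"]

# Six cognitive configurations: two homogeneous plus one deep agent per echelon.
configs = {
    "Original":  ["shallow"] * 4,
    "R-Overall": ["deep"] * 4,
}
for k in range(4):
    assignment = ["shallow"] * 4
    assignment[k] = "deep"          # assumed: R-S1 = deep Retailer, etc.
    configs[f"R-S{k + 1}"] = assignment

# Cross with the two information conditions; 32 independent replications each.
grid = [(name, info, rep)
        for (name, info), rep in product(
            product(configs, ["no-sharing", "sharing"]), range(32))]
assert len(grid) == 6 * 2 * 32      # 384 simulation runs in total
```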
- Information Sharing Mechanism:
    - Function: Test the effectiveness of information transparency in mitigating behavioral biases.
    - Mechanism: Under the information-sharing condition, each agent receives inventory and backlog information from all other echelons. Order variance, total cost, and bullwhip intensity are compared across the two information conditions.
    - Design Motivation: Information asymmetry is a classical driver of the bullwhip effect; this design tests whether LLM agents benefit from information sharing as humans do.
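The two observation regimes can be sketched as follows, using hypothetical field names (the paper's actual prompt format is not shown here):

```python
def build_observation(agent_idx, state, info_sharing):
    # Hypothetical per-agent context: local view only, or augmented with the
    # inventory/backlog of every echelon when information sharing is enabled.
    obs = {"inventory": state["inventory"][agent_idx],
           "backlog": state["backlog"][agent_idx],
           "incoming_order": state["incoming"][agent_idx]}
    if info_sharing:
        obs["all_inventories"] = list(state["inventory"])
        obs["all_backlogs"] = list(state["backlog"])
    return obs

# Illustrative chain state: Retailer, Wholesaler, Distributor, Manufacturer.
state = {"inventory": [12, 10, 8, 14], "backlog": [0, 2, 4, 0],
         "incoming": [5, 5, 6, 7]}
obs = build_observation(1, state, info_sharing=True)
```

Under `info_sharing=False` the same agent sees only its local fields, reproducing the information asymmetry of the baseline condition.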
Loss & Training¶
No model training is involved. Significance is assessed with standard statistical tests (sign test, t-test, Mann–Whitney U test).
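As an illustration of how a sign test aggregates over replications (the 30-of-32 count below is made up for the example, not a result from the paper), an exact two-sided version needs only the binomial tail:

```python
from math import comb

def sign_test_p(successes, n):
    # Exact two-sided sign test: double the upper tail of Bin(n, 0.5)
    # at the larger of k and n - k, capped at 1.
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# E.g., if order variance exceeded demand variance in 30 of 32 replications:
p = sign_test_p(30, 32)
assert p < 0.001  # far below the paper's significance threshold
```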
Key Experimental Results¶
Main Results¶
Bullwhip effect replication (homogeneous conditions, no information sharing):
| Configuration | Order Variance Amplification | p-value | Note |
|---|---|---|---|
| DeepSeek-Original | 82.3% | <0.001 | Significant bullwhip effect |
| DeepSeek-R-Overall | 79.8% | <0.001 | Persists after reasoning enhancement |
| GPT-Original | 74.2% | <0.001 | Consistent across families |
| GPT-R-Overall | 74.3% | <0.001 | Cross-family validation |
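For reference, the standard bullwhip statistic behind tables like this is the ratio of an echelon's order variance to end-customer demand variance; the paper reports amplification as a percentage, and its exact formula is not reproduced here. A minimal sketch of the ratio form:

```python
from statistics import pvariance

def bullwhip_ratio(orders, demand):
    # Variance-amplification ratio: Var(orders) / Var(customer demand).
    # A ratio above 1 at an upstream echelon indicates the bullwhip effect.
    return pvariance(orders) / pvariance(demand)

demand = [4, 5, 3, 6, 4, 5, 4, 3]            # illustrative customer demand
retailer_orders = [4, 6, 2, 8, 3, 6, 4, 2]   # illustrative, amplified orders
assert bullwhip_ratio(retailer_orders, demand) > 1
```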
Ablation Study¶
Mitigation effect of information sharing:
| Condition | Total Cost (No IS) | Total Cost (With IS) | Reduction |
|---|---|---|---|
| DeepSeek-Original | 39.43 | 20.15 | ~49% |
| DeepSeek-R-Overall | 29.43 | 17.71 | ~40% |
Key Findings¶
- LLM agents successfully reproduce the bullwhip effect observed in human experiments (p<0.001), validating the credibility of LLMs as behavioral proxies.
- Compared to human data, LLM agents exhibit more stable decisions (lower variance), yielding cleaner statistical signals.
- Cognitive enhancement (R1/GPT-5) reduces total cost but does not eliminate the bullwhip effect — even "smarter" agents still exhibit myopic behavior.
- Information sharing is the most effective intervention, consistently reducing costs by 40–50% across all configurations.
- Self-interested behavior (each agent minimizing its own cost) is identified as the root cause of systemic inefficiency.
Highlights & Insights¶
- Using LLMs to simulate behavioral experiments is a highly promising paradigm: compared to human-subject experiments, the approach reduces costs by orders of magnitude, supports large-scale replication, and enables precise variable control. This has transformative implications for operations management and behavioral economics research.
- The finding that cognitive enhancement cannot eliminate the bullwhip effect is particularly insightful: the problem lies not in individual intelligence, but in information structure and incentive design — a conclusion that closely mirrors dynamics in real organizations.
- The dual-family validation design (DeepSeek + GPT) ensures the cross-platform robustness of the findings.
Limitations & Future Work¶
- Whether the "cognitive biases" exhibited by LLM agents are fundamentally equivalent to those of humans remains uncertain — they may reflect behavioral patterns learned from training data rather than genuine cognitive limitations.
- While the beer distribution game is a classic benchmark, it is highly simplified; real supply chains involve far greater complexity (multiple products, stochasticity, contractual constraints).
- The temperature parameter is fixed at 1; behavior may differ under other temperature settings (though prior work on stability is cited).
- Only 4-echelon linear supply chains are studied; networked supply chain behavior may differ substantially.
Related Work & Insights¶
- vs. Kirshner (2024): A pioneering work deploying LLM agents in supply chains, but using homogeneous settings; this paper is the first to introduce cognitive heterogeneity.
- vs. Park et al. (2023) (Generative Agents): Focuses on social interaction simulation; this paper extends LLM agents to structured economic environments.
- vs. traditional RL methods (IPPO/MAPPO): Require strict state space definitions and extensive training; LLM agents exhibit human-like behavior with zero training.
Rating¶
- Novelty: ⭐⭐⭐⭐ A novel perspective combining cognitive heterogeneity with supply chain simulation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 32 replications × 6 configurations × 2 information conditions, with rigorous statistical validation.
- Writing Quality: ⭐⭐⭐⭐ Experimental design is clear; statistical analysis is sound.
- Value: ⭐⭐⭐⭐ Opens a new direction for applying LLM agents to organizational behavior research.