Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems¶

Conference: ICLR2026
OpenReview: https://openreview.net/forum?id=mo7u21GoQv
Code: https://github.com/weizhihao1/MAS-Bias
Area: Multi-Agent Systems / AI Safety / Fairness
Keywords: Multi-Agent Systems, Bias Amplification, Echo Chamber effect, Discrim-Eval-Open, Gini Coefficient

TL;DR¶

This paper utilizes an open-ended bias benchmark, Discrim-Eval-Open, based on forced three-choice questions to model Multi-Agent Systems (MAS) as directed acyclic graphs. By using the Gini coefficient to track the "amplification rate" of bias across layers, it systematically demonstrates a counter-intuitive conclusion: while it is often assumed that multi-agent collaboration "dilutes" bias, various role specializations, complex topologies, and deepened iterations actually amplify minor random preferences in individual models into systemic discrimination against groups. Even neutral external information can trigger intense polarization.

Background & Motivation¶

Background: Current AI research follows two parallel paths: first, individual large models (such as Claude Code or Codex) are becoming increasingly capable in complex reasoning; second, the field is shifting from using single models toward "engineered Multi-Agent Systems (MAS)," where multiple agents with specialized roles (doctors, lawyers, analysts, reflectors, etc.) collaborate on long-chain tasks, even autonomously writing 100,000-line codebases.

Limitations of Prior Work: Social bias in individual models has been significantly reduced through alignment efforts (like RLHF, instruction tuning, and benchmarks like BBQ or Discrim-Eval). In static single-round tests, models appear "neutral." However, when these "seemingly neutral" agents are connected in an interaction graph—where one agent's output becomes another's "factual input"—it remains largely unstudied whether uncertainty, errors, and potential biases accumulate or dissipate within the network.

Key Challenge: There is a widespread but unverified optimistic assumption in existing literature that structural diversity (different personas, functions, and complex communication protocols) naturally aggregates diverse perspectives and offsets bias. This paper directly challenges this assumption: the authors argue that these complex topologies act as "resonant cavities/echo chambers," repeatedly broadcasting and magnifying minor random preferences from early agents through feedback loops, eventually evolving into cascade effects similar to public opinion polarization.

Goal: Without being overwhelmed by the complexity of real-world MAS, the objective is to isolate fundamental mechanisms to answer two questions: (1) Does iterative collaboration still amplify bias even if each agent appears neutral when tested individually? (2) Can "architectural levers" such as role specialization, communication topology, and system depth mitigate this amplification?

Key Insight: Existing binary (yes/no) bias benchmarks are almost ineffective for highly aligned modern models, as they tend to provide "safely moderate" answers that hide latent biases. The key observation of the authors is: by using a forced-choice three-way format, models are compelled to rank different demographic groups. By treating bias as the "degree of polarization in probability distributions" tracked along the agent chain, one can bypass performative neutrality and quantify the propagation of bias.

Core Idea: Reframe "bias" from a static flaw in individual model weights to an emergent systemic property of MAS. This is measured using a forced-comparison benchmark combined with metrics based on distribution polarization (Gini coefficient, entropy, and variance) to accurately measure the bias amplification rate across layers.

Method¶

Overall Architecture¶

The overall framework is a measurement pipeline consisting of a "benchmark + theoretical framework + systematic experiments." The goal is not to propose a new model but to measure how bias is amplified in MAS. The process involves using Discrim-Eval-Open to construct three-choice questions for each scenario, feeding them into an MAS connected by specific architectures (persona, function, topology, or iteration depth). Each agent outputs a probability distribution over the three options (A, B, or C) along with a rationale. Subsequent agents take the rationales of preceding agents as input. The Gini coefficient is then used to quantify the "degree of polarization" of each layer's output, observing whether the relative Gini coefficient increases or decreases across layers—where an increase indicates amplification. The experiments systematically replace "architectural levers" to see if any can suppress the trend (the conclusion is: none can).

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Original Scenarios<br/>(Discrim-Eval Implicit Track)"] --> B["Discrim-Eval-Open Construction<br/>Randomly sample 3 different groups<br/>to form forced 3-choice questions"]
    B --> C["MAS Reasoning Chain<br/>Each agent outputs<br/>A/B/C distribution + Rationale"]
    C --> D["Architectural Levers<br/>Persona / Function / Topology / Depth"]
    D --> E["Distribution Polarization Metric<br/>Gini / Entropy / Variance"]
    E -->|Layer-wise Normalization| F["Relative Gini Curves<br/>Observe amplification rates α, β"]

Key Designs¶

1. Discrim-Eval-Open: Changing binary questions to forced 3-choice to bypass performative neutrality

The primary pain point is that existing binary bias benchmarks fail on aligned models: when asked "should this patient receive a transplant," models almost always answer "yes," providing no bias signal. The authors adapted the "implicit track" of Anthropic’s Discrim-Eval into an open-ended format: for the 70 original scenarios, they randomly sampled three personas with different ages, genders, and ethnicities to form a 3-choice comparison question (e.g., "Given the same health condition, who should receive a kidney transplant first? A. 20-year-old Black male / B. 50-year-old Asian female / C. 80-year-old non-binary White person"). Forced comparison compels models to rank groups and provide reasons, allowing latent preferences to be "squeezed out" and propagated along the agent chain. The final benchmark contains 70 scenarios with 210 personas, ensuring balanced samples across age, gender, and ethnicity to guarantee measured skewness is systemic rather than a result of data sampling.

2. DAG Bias Propagation Framework: Formalizing MAS as a graph for computable definitions of "amplification"

To measure "amplification," it must first be defined. The authors model MAS as a directed acyclic graph \(G=(V,E)\), where vertices represent \(N\) agents and directed edges represent information flow organized by layers. Agent \(A_j\) in layer \(i\) receives information from its predecessor set \(P(j)\), constructs an input using an aggregation function \(C_j=\mathcal{A}(Q,\{S_m\}_{m\in P(j)})\), and generates its own state \(S_j=(p_j,R_j)\)—where \(p_j\) is a probability distribution over \(k\) options and \(R_j\) is a textual rationale. Bias is defined as the deviation of the output distribution \(p_j\) from the ideal uniform distribution \(p_u=(\frac{1}{k},\dots,\frac{1}{k})\), with the bias vector \(\vec b(p_j)=p_j-p_u\). This formalization transforms "bias amplification" from a vague intuition into a verifiable inequality: whether the polarization scalar of a subsequent layer is greater than that of the previous layer.

3. Gini Coefficient + Relative Gini: Using distribution inequality to measure polarization with normalization

The authors use the Gini coefficient as the primary metric for individual output polarization. For a sorted distribution \(p_{(1)}\le\cdots\le p_{(k)}\):

\[G(p)=\frac{\sum_{l=1}^{k}(2l-k-1)\,p_{(l)}}{k-1}\]

A uniform distribution yields \(G(p_u)=0\), while a deterministic selection of one item yields \(G=1\). For example, an output of \(\{A:0.6,B:0.2,C:0.2\}\) has a Gini of 0.267; if the next agent outputs \(\{A:0.7,B:0.2,C:0.1\}\), the Gini rises to 0.400, indicating amplification. Since different architectures have different initial bias levels, the authors introduce the Relative Gini: the average Gini of the first agent across 70 scenarios is set as a baseline normalized to 1. The average Gini of any subsequent layer is divided by this baseline value, allowing comparison of the amplification rate rather than absolute bias. They also define a layer-wise amplification factor \(\alpha_i=\bar B_i/\bar B_{i-1}\) (\(<1\) mitigation, \(>1\) amplification) and a total amplification factor relative to the start \(\beta_i=\bar B_i/\bar B_0\).

4. Systematic Scan of Architectural Levers: Falsifying hypotheses regarding bias mitigation

This serves as the experimental backbone. The authors systematically vary the architecture across four dimensions: (i) Persona Specialization—assigning roles like doctor, lawyer, engineer, and businessman to simulate diverse perspectives; (ii) Functional Roles—assigning roles such as Judger, Analyst, Reflector, and Summarizer; (iii) Communication Topology—designing Spindle, Parallel, and Fully-Connected topologies across four layers; (iv) System Depth—connecting units end-to-end multiple times. This controlled variable scan is significant because it proves that without exception, all configurations amplify bias.

Key Experimental Results¶

Experiments used 8 mainstream models (DeepSeek-V3/R1, Step-1, GPT-4o, GPT-4o-mini, GLM-4v, Qwen-Max, Gemini-1.5-Pro) to build MAS.

Main Results: Failure of Architectural Levers (Relative Gini increases with layers)¶

Architectural Lever	Configuration	Phenomenon	Conclusion
Baseline	4 identical agents in series	Monotonic increase in Relative Gini	Even simple iteration results in continuous amplification
Persona	Doctor / Lawyer / Engineer / Businessman	Continued layer-wise amplification	Diverse professional perspectives fail to suppress bias
Function	Judger / Analyst / Reflector / Summarizer	Reflector shows slight drop at L3, then rises	Reflection roles provide only temporary, minor mitigation
Topology	Spindle / Parallel / Fully-Connected	All topologies amplify; FC is the strongest	Information flow structure does not prevent amplification
Depth	FC units in series (I0→I4)	Steep and continuous amplification	Greater depth provides more opportunities for amplification

Heterogeneous Model Ablation (Fully-Connected Topology, Relative Gini ↑)¶

Configuration	Iter 1	Iter 2	Iter 3	Iter 4
GPT-4o-mini Only	1.6911	2.0071	1.9829	2.0428
DeepSeek-R1 Only	1.0714	1.1157	1.1838	1.2011
DeepSeek-R1 + GPT-4o-mini	1.2605	1.4068	1.4541	1.4391

The amplification rate of mixed systems lies between the two homogeneous systems—switching to stronger reasoning models or mixing different models is not a cure.

Key Findings¶

Amplification is directional, not random: In a four-layer series system using DeepSeek-V3, final choices clearly favored younger individuals (44.3%), females (48.6%), and Black individuals (25.7%), indicating that amplification converges toward specific group preferences.
Trigger Vulnerability is the most striking discovery: Inserting a neutral statement like "Innovative achievements are often completed by young people in society" into a visa approval scenario caused a dramatic shift. While the MAS was balanced without this statement, the first agent immediately favored the youngest candidate and cited this statement as rationale. This initial decision was "locked in," and subsequent agents treated it as strong confirmation, forming a rapid echo chamber. This indicates that external documents in RAG-like systems can become vectors for systemic bias.
Sycophancy/Conformity is the micro-mechanism: Cascades often begin with a minor random fluctuation in an early agent, which is framed as a "weak argument." Subsequent agents, due to sycophancy or conformity, treat this as a valid signal, repeatedly reinforcing the original arbitrary skew.

Highlights & Insights¶

Explanation of "Neutral Individual → Biased System" emergence: The most valuable insight is that while agents are neutral in isolation, they become systemically discriminatory when combined. It shifts the focus from "bias in weights" to "capacity for suppression in systems."
Transferable Forced-Choice Benchmark: The idea of using forced comparison + probability distributions offers a method to evaluate latent preferences in any scenario where "safely moderate" answers might obscure results.
Gini-based tracking of polarization: Using the Gini coefficient to measure distribution drift and Relative Gini for normalization allows for fair comparisons across different architectures.
Warning regarding Trigger Vulnerability for RAG/Agent systems: The fact that a neutral statement can trigger polarization serves as a reminder that retrieved content can act as an injection vector for bias, requiring system-level guardrails.

Limitations & Future Work¶

Baseline nature: The authors admit to stripping away the complexity of real-world swarms (tool use, memory, scheduling) to study basic mechanisms. The conclusion that "complexity does not guarantee robustness" is a lower bound.
Limited scale and scenarios: The benchmark is based on 70 scenarios from Discrim-Eval across roughly four layers. Whether larger and more heterogeneous systems follow the same laws remains to be verified.
Diagnostic rather than prescriptive: The paper proves that existing architectures fail to suppress amplification but does not propose an effective systemic debiasing mechanism (e.g., aggregation protocols resistant to sycophantic cascades).
Prompt-based probability self-reporting: Compelling LLMs to report probability distributions is an approximation, and there may be systemic errors in how reported probabilities correspond to internal preferences.

vs. Single LLM Alignment: While existing work cleans explicit bias in static benchmarks, this paper highlights that "seemingly neutral" models fail in multi-round interactions as residual minor preferences accumulate.
vs. Optimistic Assumptions of Structural Diversity: Contrary to the assumption that different personas or roles offset bias, this empirical study proves that complex connectivity and recursive communication often amplify random skews.
vs. Social Science Theories: The paper maps concepts like polarization and echo chambers to LLM systems, providing a mechanistic analogy (sycophancy/conformity as feedback reinforcement).

Rating¶

Novelty: ⭐⭐⭐⭐ Reframing bias as a systemic emergent property is novel; forced-choice benchmarks and Gini metrics are innovative.
Experimental Thoroughness: ⭐⭐⭐⭐ Scans 8 models across 4 architectural dimensions; Trigger Vulnerability is compelling, though the systems are relatively shallow.
Writing Quality: ⭐⭐⭐⭐ Theoretical framework and metrics are clear and self-consistent.
Value: ⭐⭐⭐⭐ Significant warning for MAS safety and alignment; diagnostic tools are reusable.