Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Conference: ACL 2026 | arXiv: 2604.19548 | Code: https://unikcc.github.io/ReTAS/ | Area: LLM Reasoning | Keywords: actor-observer asymmetry, attribution bias, dialectical alignment, multi-agent collaboration, self-reflection

TL;DR

This paper identifies that LLM agents exhibit a human-like "actor-observer asymmetry" (AOA) cognitive bias during role-play — when acting as actors, agents tend to attribute failures to external factors, while as observers they tend to attribute failures to internal errors. The authors propose ReTAS, which employs dialectical reasoning (thesis–antithesis–synthesis) and GRPO-based alignment to mitigate this bias.

Background & Motivation

State of the Field: LLM multi-agent frameworks leverage role-play to assign specialized capabilities (e.g., executor, reviewer), and rely on self-reflection and mutual auditing to improve reliability. However, role assignment not only serves as a functional specification but also acts as a cognitive prior that shapes reasoning.

Limitations of Prior Work: When agents face failure as "actors" (during self-reflection), they tend to attribute the cause to external factors (e.g., server issues); when acting as "observers" (auditing others), they tend to attribute failure to internal errors (e.g., code logic bugs). This contradictory attribution prevents agents from reaching consensus, undermining collaborative reliability.

Root Cause: Role-play is a foundational design in multi-agent systems, but the cognitive biases it introduces are an inherent side effect. Simply instructing agents to "remain objective" is ineffective (role inertia leads to defensive justification), while forcing opposing perspectives results in overcorrection and unwarranted self-blame.

Paper Goals: To quantify the AOA bias in LLMs and design a structured reasoning method to eliminate this role-dependent attribution inconsistency.

Starting Point: Drawing on Fichtean dialectics (thesis → antithesis → synthesis) — robust attribution requires first expressing a position, then confronting its negation, and finally synthesizing a unified truth.

Core Idea: Train a ReTAS model that decomposes reflection into three explicit stages: thesis (role-consistent explanation), antithesis (simulating the opposing perspective to expose blind spots), and synthesis (reconciling conflicting views to reach a perspective-invariant conclusion), aligned using GRPO with attribution rewards.

Method

Overall Architecture

The framework consists of three steps: (1) constructing an Ambiguous Failure Benchmark (AFB) to quantify AOA bias; (2) dialectical synthesis — generating thesis–antithesis–synthesis reasoning traces for each attribution scenario; (3) dialectical alignment — training the ReTAS model via GRPO with attribution consistency rewards.

Key Designs

  1. Ambiguous Failure Benchmark (AFB):

    • Function: Precisely quantify the degree of AOA bias in LLMs.
    • Mechanism: Constructs 200 inherently ambiguous failure scenarios in which a single failure signal reasonably supports contradictory root causes (e.g., a timeout may reflect infrastructure latency or an aggressive configuration). Paired counterfactual probes are used: the same scenario is queried with actor and observer system prompts respectively, forcing a binary attribution (internal/external), and the attribution flip rate is recorded (see the flip-rate sketch after this list). Results show that most models exhibit AOA flips in >20% of scenarios.
    • Design Motivation: Scenarios without deterministic root causes are required — if a clear ground truth exists, attribution differences reflect capability gaps rather than cognitive bias.
  2. Dialectical Reasoning (Thesis–Antithesis–Synthesis):

    • Function: Eliminate perspective-dependent attribution through structured reasoning.
    • Mechanism: Thesis generates a role-consistent explanation expressing domain-specific expertise; Antithesis simulates the opposing perspective to expose blind spots and counterevidence in the current attribution; Synthesis reconciles the conflicting views to derive a perspective-invariant conclusion grounded in objective evidence rather than role priors. These three stages serve as an explicit structural scaffold for chain-of-thought reasoning (a prompt-scaffold sketch follows this list).
    • Design Motivation: Naïve "remain objective" instructions are ineffective due to role inertia; a coercive structure is needed to ensure multiple perspectives are genuinely considered.
  3. GRPO Dialectical Alignment Training:

    • Function: Internalize dialectical reasoning capability into model parameters.
    • Mechanism: Training with attribution rewards — reasoning trajectories that produce inconsistent attributions under actor vs. observer perspectives are penalized, while trajectories converging on the true root cause are rewarded. Built on the GRPO framework, the model learns to generate perspective-invariant dialectical reasoning chains.
    • Design Motivation: Prompting alone is insufficient for stable dialectical reasoning; RL-based alignment internalizes this capability into the model.
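
A minimal sketch of the paired counterfactual probe behind the AFB flip rate referenced in item 1. The `probe` callable and the attribution labels are hypothetical stand-ins; the paper's exact prompts are not reproduced here.

```python
from typing import Callable, Iterable

# An attribution is a binary label: "internal" or "external".
Attribution = str


def flip_rate(
    probe: Callable[[str, str], Attribution],  # (system_prompt, scenario) -> attribution
    scenarios: Iterable[str],
    actor_prompt: str,
    observer_prompt: str,
) -> float:
    """Fraction of scenarios whose attribution flips between actor and observer framing."""
    scenario_list = list(scenarios)
    flips = sum(
        probe(actor_prompt, s) != probe(observer_prompt, s)
        for s in scenario_list
    )
    return flips / len(scenario_list)
```

On the paper's 200 AFB scenarios, this is the statistic that most tested models push above 20%.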
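The three-stage structure in item 2 can be made concrete as a prompt scaffold. The wording below is an illustrative assumption, not the paper's template:

```python
# Hypothetical scaffold for the thesis-antithesis-synthesis stages.
DIALECTICAL_TEMPLATE = """You are the {role} agent analyzing the failure below.

Failure signal: {failure_signal}

Reason in three explicit stages:
1. THESIS: state the explanation most consistent with your role's expertise.
2. ANTITHESIS: argue the opposing perspective; list counterevidence and blind spots
   in your thesis.
3. SYNTHESIS: reconcile both views into one conclusion grounded only in the evidence,
   not in your role. End with "ATTRIBUTION: internal" or "ATTRIBUTION: external".
"""

prompt = DIALECTICAL_TEMPLATE.format(
    role="executor",
    failure_signal="API call timed out after 30s under the default retry policy",
)
```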

Loss & Training

Training uses the GRPO optimization framework with attribution consistency as the reward signal. Evaluation covers the AFB benchmark and multiple downstream tasks; evaluated models include the GPT-5 series, DeepSeek-V3.2, and Qwen3-4B, among others.
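
A sketch of one plausible shaping for the attribution reward described above; the decomposition and weights are assumptions, not the paper's specification:

```python
from typing import Optional


def attribution_reward(
    actor_attribution: str,
    observer_attribution: str,
    true_root_cause: Optional[str] = None,  # None for ambiguous AFB-style scenarios
    consistency_weight: float = 0.5,        # hypothetical weighting, not from the paper
) -> float:
    """Penalize perspective-dependent flips; reward convergence on the true root cause."""
    # Consistency term: the same attribution under both framings is rewarded,
    # a flip is penalized.
    if actor_attribution == observer_attribution:
        reward = consistency_weight
    else:
        reward = -consistency_weight
    # Correctness term: bonus when a ground-truth root cause exists and is matched.
    if true_root_cause is not None and actor_attribution == true_root_cause:
        reward += 1.0 - consistency_weight
    return reward
```

Within GRPO, this scalar would score each sampled dialectical trajectory in a group, with group-relative advantages driving the policy update.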

Key Experimental Results

Main Results

| Model | Human-Agent AOA Flip Rate | Agent-Agent AOA Flip Rate |
|---|---|---|
| GPT-5.1 | 6% | 26% |
| GPT-5 | 23% | 33% |
| DeepSeek-V3.2 | 15% | 39% |
| Qwen3-4B | 33% | – |
| QwQ-32B | 21% | – |

Ablation Study

| Configuration | Attribution Consistency | Task Performance | Notes |
|---|---|---|---|
| Standard role-play | Low | Baseline | AOA bias present |
| + "Remain objective" instruction | Marginal improvement | No change | Role inertia offsets the effect |
| + Dialectical prompting (no training) | Moderate improvement | Improvement | Structured but unstable |
| ReTAS (dialectical alignment) | Significant improvement | Significant improvement | Dialectical reasoning internalized |

Key Findings

  • AOA bias is pervasive across all tested models; Agent-Agent scenarios (e.g., a 39% flip rate for DeepSeek-V3.2) are more severe than Human-Agent scenarios.
  • ReTAS effectively reduces attribution inconsistency while significantly improving failure resolution rates in ambiguous scenarios.
  • The three-stage dialectical structure is more effective than simply prompting for "multi-perspective thinking."
  • Bias severity is inversely correlated with model capability — stronger models tend to be more consistent (GPT-5.1 achieves the lowest flip rate of 6%).

Highlights & Insights

  • Introducing a classic social psychology theory into AI agent analysis: AOA, as a human cognitive bias, is systematically validated to exist in LLMs — a finding with important implications for the reliability design of multi-agent systems.
  • Creative application of dialectics as a debiasing tool: The thesis–antithesis–synthesis structure is naturally suited to reconciling conflicting perspectives and is more operationally tractable than generic "remain objective" instructions.
  • Clever design of the AFB benchmark: Deliberately constructing ambiguous scenarios without deterministic root causes ensures that any systematic divergence in attribution can be ascribed to cognitive bias rather than capability differences.

Limitations & Future Work

  • The AFB benchmark is relatively small (200 scenarios) and may not cover all bias patterns.
  • The generalizability of dialectical alignment training — whether it transfers to domains not covered by AFB — remains an open question.
  • The synthesis stage may still be dominated by one perspective, leaving the bias incompletely eliminated.
  • AOA bias in non-failure scenarios (e.g., attribution of success) has not been explored.

Comparison with Related Methods

  • vs. Reflexion / self-reflection methods: Self-reflection operates within the role framework and, under the influence of AOA bias, may reinforce erroneous attributions; ReTAS breaks role inertia through the explicit antithesis stage.
  • vs. multi-agent debate methods: Debate assigns opposing positions to different agents but lacks a structured synthesis mechanism; the synthesis stage in ReTAS provides an explicit framework for reconciling conflicts.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The identification of AOA in LLMs is an original contribution; the dialectical alignment solution is creative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model evaluation, a dedicated benchmark, and ablation analysis are provided, though the benchmark scale could be larger.
  • Writing Quality: ⭐⭐⭐⭐⭐ The problem is clearly defined, and the integration of social psychology theory with AI methodology is natural.
