Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Conference: ACL 2026 | arXiv: 2604.19548 | Code: https://unikcc.github.io/ReTAS/ | Area: LLM Reasoning | Keywords: actor-observer asymmetry, attribution bias, dialectical alignment, multi-agent collaboration, self-reflection

TL;DR

This paper identifies that LLM agents exhibit a human-like "actor-observer asymmetry" (AOA) cognitive bias during role-play — when acting as actors, agents tend to attribute failures to external factors, while as observers they tend to attribute failures to internal errors. The authors propose ReTAS, which employs dialectical reasoning (thesis–antithesis–synthesis) and GRPO-based alignment to mitigate this bias.

Background & Motivation

State of the Field: LLM multi-agent frameworks leverage role-play to assign specialized capabilities (e.g., executor, reviewer), and rely on self-reflection and mutual auditing to improve reliability. However, role assignment not only serves as a functional specification but also acts as a cognitive prior that shapes reasoning.

Limitations of Prior Work: When agents face failure as "actors" (during self-reflection), they tend to attribute the cause to external factors (e.g., server issues); when acting as "observers" (auditing others), they tend to attribute failure to internal errors (e.g., code logic bugs). This contradictory attribution prevents agents from reaching consensus, undermining collaborative reliability.

Root Cause: Role-play is a foundational design in multi-agent systems, but the cognitive biases it introduces are an inherent side effect. Simply instructing agents to "remain objective" is ineffective (role inertia leads to defensive justification), while forcing opposing perspectives results in overcorrection and unwarranted self-blame.

Paper Goals: To quantify the AOA bias in LLMs and design a structured reasoning method to eliminate this role-dependent attribution inconsistency.

Starting Point: Drawing on Fichtean dialectics (thesis → antithesis → synthesis) — robust attribution requires first expressing a position, then confronting its negation, and finally synthesizing a unified truth.

Core Idea: Train a ReTAS model that decomposes reflection into three explicit stages: thesis (role-consistent explanation), antithesis (simulating the opposing perspective to expose blind spots), and synthesis (reconciling conflicting views to reach a perspective-invariant conclusion), aligned using GRPO with attribution rewards.

Method

Overall Architecture

The framework consists of three steps: (1) constructing an Ambiguous Failure Benchmark (AFB) to quantify AOA bias; (2) dialectical synthesis — generating thesis–antithesis–synthesis reasoning traces for each attribution scenario; (3) dialectical alignment — training the ReTAS model via GRPO with attribution consistency rewards.

Key Designs

  1. Ambiguous Failure Benchmark (AFB):

    • Function: Precisely quantify the degree of AOA bias in LLMs.
    • Mechanism: Constructs 200 inherently ambiguous failure scenarios in which a single failure signal reasonably supports contradictory root causes (e.g., a timeout may reflect infrastructure latency or an aggressive configuration). Paired counterfactual probes are used: the same scenario is queried with actor and observer system prompts respectively, forcing a binary attribution (internal/external), and the attribution flip rate is recorded (see the flip-rate sketch after this list). Results show that most models exhibit AOA flips in >20% of scenarios.
    • Design Motivation: Scenarios without deterministic root causes are required — if a clear ground truth exists, attribution differences reflect capability gaps rather than cognitive bias.
  2. Dialectical Reasoning (Thesis–Antithesis–Synthesis):

    • Function: Eliminate perspective-dependent attribution through structured reasoning.
    • Mechanism: Thesis generates a role-consistent explanation expressing domain-specific expertise; Antithesis simulates the opposing perspective to expose blind spots and counterevidence in the current attribution; Synthesis reconciles the conflicting views to derive a perspective-invariant conclusion grounded in objective evidence rather than role priors. These three stages serve as an explicit structural scaffold for chain-of-thought reasoning (a prompt-scaffold sketch follows this list).
    • Design Motivation: Naïve "remain objective" instructions are ineffective due to role inertia; a coercive structure is needed to ensure multiple perspectives are genuinely considered.
  3. GRPO Dialectical Alignment Training:

    • Function: Internalize dialectical reasoning capability into model parameters.
    • Mechanism: Training with attribution rewards — reasoning trajectories that produce inconsistent attributions under actor vs. observer perspectives are penalized, while trajectories converging on the true root cause are rewarded. Built on the GRPO framework, the model learns to generate perspective-invariant dialectical reasoning chains.
    • Design Motivation: Prompting alone is insufficient for stable dialectical reasoning; RL-based alignment internalizes this capability into the model.
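
A minimal sketch of the paired counterfactual probe behind the AFB flip rate referenced in item 1. The `probe` callable and the attribution labels are hypothetical stand-ins; the paper's exact prompts are not reproduced here.

```python
from typing import Callable, Iterable

# An attribution is a binary label: "internal" or "external".
Attribution = str


def flip_rate(
    probe: Callable[[str, str], Attribution],  # (system_prompt, scenario) -> attribution
    scenarios: Iterable[str],
    actor_prompt: str,
    observer_prompt: str,
) -> float:
    """Fraction of scenarios whose attribution flips between actor and observer framing."""
    scenario_list = list(scenarios)
    flips = sum(
        probe(actor_prompt, s) != probe(observer_prompt, s)
        for s in scenario_list
    )
    return flips / len(scenario_list)
```

On the paper's 200 AFB scenarios, this is the statistic that most tested models push above 20%.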
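The three-stage structure in item 2 can be made concrete as a prompt scaffold. The wording below is an illustrative assumption, not the paper's template:

```python
# Hypothetical scaffold for the thesis-antithesis-synthesis stages.
DIALECTICAL_TEMPLATE = """You are the {role} agent analyzing the failure below.

Failure signal: {failure_signal}

Reason in three explicit stages:
1. THESIS: state the explanation most consistent with your role's expertise.
2. ANTITHESIS: argue the opposing perspective; list counterevidence and blind spots
   in your thesis.
3. SYNTHESIS: reconcile both views into one conclusion grounded only in the evidence,
   not in your role. End with "ATTRIBUTION: internal" or "ATTRIBUTION: external".
"""

prompt = DIALECTICAL_TEMPLATE.format(
    role="executor",
    failure_signal="API call timed out after 30s under the default retry policy",
)
```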

Loss & Training

Training uses the GRPO optimization framework with attribution consistency as the reward signal. Evaluation covers the AFB benchmark and multiple downstream tasks; evaluated models include the GPT-5 series, DeepSeek-V3.2, and Qwen3-4B, among others.
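
A sketch of one plausible shaping for the attribution reward described above; the decomposition and weights are assumptions, not the paper's specification:

```python
from typing import Optional


def attribution_reward(
    actor_attribution: str,
    observer_attribution: str,
    true_root_cause: Optional[str] = None,  # None for ambiguous AFB-style scenarios
    consistency_weight: float = 0.5,        # hypothetical weighting, not from the paper
) -> float:
    """Penalize perspective-dependent flips; reward convergence on the true root cause."""
    # Consistency term: the same attribution under both framings is rewarded,
    # a flip is penalized.
    if actor_attribution == observer_attribution:
        reward = consistency_weight
    else:
        reward = -consistency_weight
    # Correctness term: bonus when a ground-truth root cause exists and is matched.
    if true_root_cause is not None and actor_attribution == true_root_cause:
        reward += 1.0 - consistency_weight
    return reward
```

Within GRPO, this scalar would score each sampled dialectical trajectory in a group, with group-relative advantages driving the policy update.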

Key Experimental Results

Main Results

| Model | Human-Agent AOA Flip Rate | Agent-Agent AOA Flip Rate |
|---|---|---|
| GPT-5.1 | 6% | 26% |
| GPT-5 | 23% | 33% |
| DeepSeek-V3.2 | 15% | 39% |
| Qwen3-4B | 33% | – |
| QwQ-32B | 21% | – |

Ablation Study

| Configuration | Attribution Consistency | Task Performance | Notes |
|---|---|---|---|
| Standard role-play | Low | Baseline | AOA bias present |
| + "Remain objective" instruction | Marginal improvement | No change | Role inertia offsets the effect |
| + Dialectical prompting (no training) | Moderate improvement | Improvement | Structured but unstable |
| ReTAS (dialectical alignment) | Significant improvement | Significant improvement | Dialectical reasoning internalized |

Key Findings

  • AOA bias is pervasive across all tested models; Agent-Agent scenarios (e.g., a 39% flip rate for DeepSeek-V3.2) are more severe than Human-Agent scenarios.
  • ReTAS effectively reduces attribution inconsistency while significantly improving failure resolution rates in ambiguous scenarios.
  • The three-stage dialectical structure is more effective than simply prompting for "multi-perspective thinking."
  • Bias severity is inversely correlated with model capability — stronger models tend to be more consistent (GPT-5.1 achieves the lowest flip rate of 6%).

Highlights & Insights

  • Introducing a classic social psychology theory into AI agent analysis: AOA, as a human cognitive bias, is systematically validated to exist in LLMs — a finding with important implications for the reliability design of multi-agent systems.
  • Creative application of dialectics as a debiasing tool: The thesis–antithesis–synthesis structure is naturally suited to reconciling conflicting perspectives and is more operationally tractable than generic "remain objective" instructions.
  • Clever design of the AFB benchmark: Deliberately constructing ambiguous scenarios without deterministic root causes ensures that any systematic divergence in attribution can be ascribed to cognitive bias rather than capability differences.

Limitations & Future Work

  • The AFB benchmark is relatively small (200 scenarios) and may not cover all bias patterns.
  • The generalizability of dialectical alignment training — whether it transfers to domains not covered by AFB — remains an open question.
  • The synthesis stage may still be dominated by one perspective, leaving the bias incompletely eliminated.
  • AOA bias in non-failure scenarios (e.g., attribution of success) has not been explored.

Comparison with Related Methods

  • vs. Reflexion / self-reflection methods: Self-reflection operates within the role framework and, under the influence of AOA bias, may reinforce erroneous attributions; ReTAS breaks role inertia through the explicit antithesis stage.
  • vs. multi-agent debate methods: Debate assigns opposing positions to different agents but lacks a structured synthesis mechanism; the synthesis stage in ReTAS provides an explicit framework for reconciling conflicts.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The identification of AOA in LLMs is an original contribution; the dialectical alignment solution is creative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model evaluation, a dedicated benchmark, and ablation analysis are provided, though the benchmark scale could be larger.
  • Writing Quality: ⭐⭐⭐⭐⭐ The problem is clearly defined, and the integration of social psychology theory with AI methodology is natural.
