MUG: Multi-agent Undercover Gaming — Hallucination Removal via Counterfactual Test for Multimodal Reasoning

Conference: AAAI 2026 · arXiv: 2511.11182 · Code: https://github.com/YongLD/MUG.git · Area: Causal Reasoning · Keywords: Multi-Agent Gaming, Counterfactual Testing, Hallucination Detection, Undercover Game, Active Reasoning

TL;DR

MUG reframes Multi-Agent Debate (MAD) as a "Who's Undercover" social reasoning game. Counterfactual image editing (modifying the reference image) introduces information asymmetry: one agent is assigned the edited image \(I^-\) as the "undercover," while the other agents hold the original image \(I^+\) and identify the undercover (i.e., the hallucination source) through reasoning and voting. On HallusionBench, MUG lifts Qwen2.5VL-7B from 46.4% to 53.8%.

Background & Motivation

Background: Multi-Agent Debate (MAD) enhances reasoning quality through structured discussion among multiple LLM agents and is a promising direction for hallucination mitigation.

Limitations of Prior Work: MAD suffers from three fundamental limitations — (1) it relies on the unrealistic assumption that all debaters are rational: when agents themselves are prone to hallucination, consensus may converge to a shared error; (2) it depends on statistical consensus (e.g., majority voting) without a genuine fact-checking mechanism; (3) agents passively answer questions rather than actively investigating and verifying claims.

Key Challenge: The consensus mechanism in MAD is fundamentally a "group statistics" approach — if the majority of agents share the same hallucination, consensus converges to an incorrect answer. A mechanism is needed that identifies "who is hallucinating" rather than "who is in the minority."

Goal: How can hallucinating agents be detected and excluded without assuming agent rationality?

Key Insight: Inspired by the "Who's Undercover" social reasoning game: providing one agent with a modified image (counterfactual evidence) introduces verifiable information asymmetry. Since the edit is known, a ground truth is available to determine who is the undercover (the hallucinator).

Core Idea: Use counterfactual image editing to create information asymmetry, combined with an undercover game mechanism to detect hallucinating agents, replacing statistical consensus with fact-checking.

Method

Overall Architecture

Two-phase game: (1) Undercover Detection Phase — among \(N\) agents, one is assigned the counterfactual image \(I^-\) as the undercover (Role=U), while others hold the original image \(I^+\) (Role=D). Multi-round reasoning and voting eliminate the most suspicious agent. (2) Summarization Phase — after the undercover is eliminated, the remaining agents collaboratively generate the final answer based on \(I^+\).
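
A minimal Python sketch of this two-phase loop is shown below. All names and helpers are illustrative, not from the paper's code: agent reasoning and suspicion scoring are stand-in placeholders, whereas a real deployment would prompt a VLM per agent with its assigned image.

```python
# Minimal sketch of MUG's two-phase game loop (illustrative, not the authors' code).
import random
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    image: str  # "I+" (original) or "I-" (counterfactual)

    def reason(self, question: str) -> str:
        # Placeholder: a real agent would run a VLM over (self.image, question).
        return f"{self.name} answers '{question}' based on {self.image}"

def suspicion(voter: Agent, target: Agent, statements: dict) -> float:
    # Placeholder for the paper's four voting factors (phi_1..phi_4);
    # here a random score stands in for statement-based scoring.
    return random.random()

def play_mug(question: str, n_agents: int = 4, rounds: int = 1) -> str:
    undercover = random.randrange(n_agents)  # one agent gets I-
    agents = [Agent(f"A{i}", "I-" if i == undercover else "I+")
              for i in range(n_agents)]

    # Phase 1: undercover detection -- reason, then vote out the most suspect.
    for _ in range(rounds):
        statements = {a.name: a.reason(question) for a in agents}
        votes = {a.name: 0 for a in agents}
        for voter in agents:
            target = max((t for t in agents if t is not voter),
                         key=lambda t: suspicion(voter, t, statements))
            votes[target.name] += 1
        eliminated = max(votes, key=votes.get)  # majority vote
        agents = [a for a in agents if a.name != eliminated]

    # Phase 2: survivors summarize a final answer (the paper has them
    # answer over I+; here each agent just keeps its assigned image).
    return " | ".join(a.reason(question) for a in agents)

print(play_mug("How many cups are on the table?"))
```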

Key Designs

  1. Counterfactual Image Generation \(I^-\):

    • Function: Performs fine-grained editing on the original reference image to introduce controlled semantic changes.
    • Mechanism: An LLM first analyzes the question type and maps it to an edit type ("how many" → quantity editing, "what object" → object replacement, "what color" → attribute modification). A scene graph is constructed to identify the edit target, and the Step1X-Edit model executes the edit. The edit must satisfy three constraints: visual similarity \(C_{vs}\) (ViT embeddings), semantic consistency \(C_{sc}\) (CLIP similarity), and naturalness \(C_{na}\) (FID score); a scoring sketch combining these constraints follows this list.
    • Design Motivation: The edit must be subtle enough (so the undercover is not trivially identified) while introducing meaningful semantic differences (sufficient to distinguish reasoning correctness).
  2. Undercover Detection Game:

    • Function: Identifies and eliminates the undercover agent through multi-round reasoning and voting.
    • Reasoning Phase: Each agent reasons based on its assigned image (\(I^+\) or \(I^-\)). Normal agents aim to provide accurate reasoning and identify inconsistencies; the undercover agent aims to generate plausible-sounding reasoning while avoiding detection.
    • Voting Phase: Each agent votes based on four factors: inconsistency score \(\phi_1\), deviation-from-consensus score \(\phi_2\), detail accuracy \(\phi_3\), and behavioral suspicion \(\phi_4\). A majority vote eliminates the most suspicious agent (a tallying sketch also follows this list).
    • Design Motivation: Information asymmetry creates strategic tension — normal agents prove they are viewing \(I^+\) by providing specific details, while the undercover must balance plausibility against accuracy.
  3. Three Dimensions of Innovation (vs. Traditional MAD):

    • Fact-Checking vs. Statistical Consensus: Counterfactual testing provides ground truth to verify who is hallucinating, rather than relying solely on majority voting.
    • Cross-Evidence vs. Single Source: Additional evidence is dynamically generated by modifying the image; traditional MAD relies on a single original image.
    • Active Reasoning vs. Passive Answering: Agents actively question, verify, and debate rather than merely answering a given question.
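
For item 1 above, here is a hedged sketch of how the three edit-quality constraints might be combined to rank candidate edits. The paper specifies \(C_{vs}\), \(C_{sc}\), and \(C_{na}\), but the weighted sum, the FID-to-naturalness mapping, and the random stand-in embeddings below are my assumptions for illustration.

```python
# Hedged sketch: ranking candidate counterfactual edits by the three
# constraints named in the paper. Embeddings and FID values are stand-ins.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def edit_quality(vit_orig, vit_edit, clip_orig, clip_edit, fid: float,
                 w=(1.0, 1.0, 1.0)) -> float:
    c_vs = cosine(vit_orig, vit_edit)    # C_vs: stay visually close to I+
    c_sc = cosine(clip_orig, clip_edit)  # C_sc: preserve scene semantics
    c_na = 1.0 / (1.0 + fid)             # C_na: lower FID -> more natural (assumed mapping)
    return w[0] * c_vs + w[1] * c_sc + w[2] * c_na

# Usage: pick the candidate edit with the best combined score.
rng = np.random.default_rng(0)
vit_o, clip_o = rng.normal(size=768), rng.normal(size=512)
candidates = [(rng.normal(size=768), rng.normal(size=512), rng.uniform(5, 50))
              for _ in range(4)]
best = max(candidates, key=lambda c: edit_quality(vit_o, c[0], clip_o, c[1], c[2]))
print(edit_quality(vit_o, best[0], clip_o, best[1], best[2]))
```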
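
And for item 2, a sketch of the four-factor vote tally. The numeric factor values and the unit weights are illustrative; in MUG each \(\phi_i\) would be derived from the agents' actual statements rather than given as fixed numbers.

```python
# Hedged sketch of the four-factor vote from the detection game.
from typing import Dict

WEIGHTS = {"phi_1": 1.0,   # inconsistency with the visual evidence
           "phi_2": 1.0,   # deviation from the group consensus
           "phi_3": 1.0,   # (in)accuracy of cited fine-grained details
           "phi_4": 1.0}   # behavioral suspicion (evasive / vague replies)

def suspicion_score(factors: Dict[str, float]) -> float:
    # Higher score = more likely to be the undercover holding I-.
    return sum(WEIGHTS[k] * v for k, v in factors.items())

def tally(ballots: Dict[str, Dict[str, Dict[str, float]]]) -> str:
    # ballots[voter][target] -> factor dict; eliminate the top vote-getter.
    votes: Dict[str, int] = {}
    for voter, targets in ballots.items():
        pick = max(targets, key=lambda t: suspicion_score(targets[t]))
        votes[pick] = votes.get(pick, 0) + 1
    return max(votes, key=votes.get)

ballots = {
    "A0": {"A1": {"phi_1": .2, "phi_2": .1, "phi_3": .3, "phi_4": .1},
           "A2": {"phi_1": .8, "phi_2": .7, "phi_3": .9, "phi_4": .6}},
    "A1": {"A0": {"phi_1": .1, "phi_2": .2, "phi_3": .2, "phi_4": .1},
           "A2": {"phi_1": .7, "phi_2": .8, "phi_3": .8, "phi_4": .7}},
}
print(tally(ballots))  # -> "A2"
```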

Key Experimental Results

Main Results

Method                        MMMU   MMStar   HallusionBench Avg   POPE Acc
Qwen2.5VL-7B (baseline)       45.0   61.2     46.4                 87.4
Qwen2.5VL-7B (Self-Refine)    45.8   61.5     48.8                 85.9
Qwen2.5VL-7B (MAD-Vote)       44.7   57.4     37.8                 80.0
Qwen2.5VL-7B (MAD-Judge)      47.4   62.3     50.2                 85.2
Qwen2.5VL-7B (MUG)            50.3   63.8     53.8                 88.4

Ablation Study

Configuration                 MMStar          HallusionBench   MMMU
MUG Full                      63.80           53.80            50.33
w/o Counterfactual Editing    62.31 (-1.49)   50.19 (-3.61)    49.25 (-1.08)
w/o Undercover Mechanism      62.23 (-1.57)   49.31 (-4.49)    47.66 (-2.67)

Key Findings

  • MUG lifts Qwen2.5VL-7B from 45.0 to 50.3 on MMMU, narrowing the gap to GPT-4v (53.8) despite GPT-4v being a much larger model; MAD-Vote instead degrades MMMU performance (44.7 vs. the 45.0 baseline).
  • The improvement on HallusionBench is most pronounced — MUG achieves +7.4% over baseline, while MAD-Vote yields −8.6% — indicating that MAD can be detrimental for hallucination detection.
  • The undercover mechanism contributes more than counterfactual editing (its removal causes a larger performance drop), suggesting that "game dynamics" contribute more than "additional evidence."
  • One detection round yields the best results (50.3/63.8/69.4); additional rounds degrade performance — indicating that prolonged debate may mislead normal agents.
  • Per-sample latency is 3.74 s for MUG vs. 2.35 s for MAD, an additional overhead of 1.39 s, which is an excellent cost-performance trade-off.

Highlights & Insights

  • The analogy from social reasoning game to AI hallucination detection is particularly elegant: the core of the undercover game, identifying anomalous players through information asymmetry, maps directly onto identifying hallucinating agents through counterfactual testing.
  • Counterfactual editing provides ground truth: This represents the most fundamental improvement over traditional MAD — a shift from "group statistics" to "verifiable facts."
  • MAD-Vote can be harmful: Experiments show that MAD-Vote drops from 46.4% to 37.8% on HallusionBench — if multiple agents share the same hallucination, voting amplifies the error. MUG avoids this through counterfactual testing.
  • The finding that one round is optimal: Counter-intuitively, more debate rounds degrade performance — because the undercover agent's arguments may convince normal agents, particularly on reasoning-type questions.

Limitations & Future Work

  • The quality of counterfactual image generation is inconsistent — failure modes include edits that are too subtle, editing failures, or unnatural outputs.
  • The undercover agent is currently selected at random; selection based on initial response uncertainty would be a natural improvement.
  • Only image-based counterfactual testing is supported; text-level counterfactuals remain unexplored.
  • Normal agents may be misled when the game extends beyond one round — more robust "immunity" mechanisms are needed.
  • Computational complexity scales with the number of agents and rounds.

Comparison with Other Methods

  • vs. MAD-Vote/MAD-Judge: MUG substantially outperforms both on HallusionBench (53.8 vs. 37.8/50.2), demonstrating that counterfactual testing is more effective than voting- or judge-based approaches.
  • vs. Self-Refine: Self-Refine involves only single-agent self-correction and lacks multi-perspective verification; MUG obtains richer viewpoints through multi-agent gaming.
  • vs. iMAD (in this note collection): iMAD uses a classifier to determine "when to debate," while MUG improves "how to debate" — the two approaches are complementary and could be combined.
  • Insight: The core idea of counterfactual testing, "verifying understanding by introducing controlled differences," generalizes to scenarios such as code verification and knowledge assessment; a toy illustration follows this list.
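
As a loose illustration of that generalization (my own, not from the paper), the following toy sketch applies the same "controlled difference" logic to code verification in the spirit of mutation testing: a test suite that cannot distinguish a function from a deliberately perturbed variant has not really verified the behavior it claims to check.

```python
# Toy illustration (hypothetical names): counterfactual testing for code.
def sorted_asc(xs):             # implementation under test
    return sorted(xs)

def sorted_asc_mutant(xs):      # counterfactual variant with a seeded bug
    return sorted(xs, reverse=True)

def weak_suite(fn) -> bool:     # passes for both: cannot expose the edit
    return fn([]) == [] and fn([1]) == [1]

def strong_suite(fn) -> bool:   # exposes the counterfactual change
    return fn([3, 1, 2]) == [1, 2, 3]

for suite in (weak_suite, strong_suite):
    caught = suite(sorted_asc) and not suite(sorted_asc_mutant)
    print(suite.__name__, "kills the mutant:", caught)
```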

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The combination of social reasoning games and counterfactual editing is highly innovative and conceptually distinctive.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 4 benchmarks with multiple baselines, ablation studies, and game dynamics analysis.
  • Writing Quality: ⭐⭐⭐⭐ The game analogy is vivid and the formal definitions are clear.
  • Value: ⭐⭐⭐⭐⭐ A fundamental improvement to the MAD paradigm — shifting from statistical consensus to verifiable fact-checking.