MUG: Multi-agent Undercover Gaming — Hallucination Removal via Counterfactual Test for Multimodal Reasoning¶
Conference: AAAI 2026 arXiv: 2511.11182 Code: https://github.com/YongLD/MUG.git Area: Causal Reasoning Keywords: Multi-Agent Gaming, Counterfactual Testing, Hallucination Detection, Undercover Game, Active Reasoning
TL;DR¶
MUG reframes Multi-Agent Debate (MAD) as a "Who's Undercover" social-reasoning game. Counterfactual image editing (modifying the reference image) introduces information asymmetry: one agent is assigned the edited image \(I^-\) as the "undercover," while the other agents hold the original image \(I^+\) and identify the undercover (i.e., the hallucination source) via reasoning and voting. On HallusionBench, Qwen2.5VL-7B improves from 46.4% to 53.8%.
Background & Motivation¶
Background: Multi-Agent Debate (MAD) enhances reasoning quality through structured discussion among multiple LLM agents and is a promising direction for hallucination mitigation.
Limitations of Prior Work: MAD suffers from three fundamental limitations — (1) it relies on the unrealistic assumption that all debaters are rational: when agents themselves are prone to hallucination, consensus may converge to a shared error; (2) it depends on statistical consensus (e.g., majority voting) without a genuine fact-checking mechanism; (3) agents passively answer questions rather than actively investigating and verifying claims.
Key Challenge: The consensus mechanism in MAD is fundamentally a "group statistics" approach — if the majority of agents share the same hallucination, consensus converges to an incorrect answer. A mechanism is needed that identifies "who is hallucinating" rather than "who is in the minority."
Goal: How can hallucinating agents be detected and excluded without assuming agent rationality?
Key Insight: Inspired by the "Who's Undercover" social reasoning game — by providing one agent with a modified image (counterfactual evidence), verifiable information asymmetry is introduced. Since the edit is known, a ground truth is available to determine who is the undercover (the hallucinator).
Core Idea: Use counterfactual image editing to create information asymmetry, combined with an undercover game mechanism to detect hallucinating agents, replacing statistical consensus with fact-checking.
Method¶
Overall Architecture¶
Two-phase game: (1) Undercover Detection Phase — among \(N\) agents, one is assigned the counterfactual image \(I^-\) as the undercover (Role=U), while others hold the original image \(I^+\) (Role=D). Multi-round reasoning and voting eliminate the most suspicious agent. (2) Summarization Phase — after the undercover is eliminated, the remaining agents collaboratively generate the final answer based on \(I^+\).
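The two-phase flow above can be sketched as a short simulation. The agent and voting interfaces here are illustrative assumptions, not the paper's API: `agents[k](image, question, transcript)` returns a statement string, and `vote_fn(k, transcript, active)` returns the index of the peer agent `k` finds most suspicious.

```python
import random
from collections import Counter

def run_mug(agents, i_plus, i_minus, question, vote_fn, rounds=1, seed=0):
    """Minimal sketch of the MUG two-phase game (interfaces are assumptions)."""
    rng = random.Random(seed)
    n = len(agents)
    undercover = rng.randrange(n)  # Role=U: this agent gets the edited image I^-
    images = [i_minus if k == undercover else i_plus for k in range(n)]

    # Phase 1: Undercover Detection -- multi-round reasoning and voting.
    active, transcript = set(range(n)), []
    eliminated = None
    for _ in range(rounds):
        for k in sorted(active):  # reasoning: each active agent describes its image
            transcript.append((k, agents[k](images[k], question, transcript)))
        votes = Counter(vote_fn(k, transcript, active) for k in sorted(active))
        eliminated = votes.most_common(1)[0][0]  # majority eliminates one agent
        active.discard(eliminated)

    # Phase 2: Summarization -- survivors answer jointly over the original I^+.
    answers = [agents[k](i_plus, question, transcript) for k in sorted(active)]
    return Counter(answers).most_common(1)[0][0], eliminated == undercover
```

With stub agents that simply report their assigned image, a minority-statement vote rule reliably eliminates the holder of \(I^-\), mirroring the intended game dynamics.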
Key Designs¶
- Counterfactual Image Generation \(I^-\):
- Function: Performs fine-grained editing on the original reference image to introduce controlled semantic changes.
- Mechanism: An LLM first analyzes the question type and maps it to an edit type ("how many" → quantity editing, "what object" → object replacement, "what color" → attribute modification). A scene graph is constructed to identify the edit target, and the Step1X-Edit model executes the edit. The edit must satisfy three constraints: maximizing visual similarity \(C_{vs}\) (ViT embeddings), semantic consistency \(C_{sc}\) (CLIP similarity), and naturalness \(C_{na}\) (FID score).
- Design Motivation: The edit must be subtle enough (so the undercover is not trivially identified) while introducing meaningful semantic differences (sufficient to distinguish reasoning correctness).
- Undercover Detection Game:
- Function: Identifies and eliminates the undercover agent through multi-round reasoning and voting.
- Reasoning Phase: Each agent reasons based on its assigned image (\(I^+\) or \(I^-\)). Normal agents aim to provide accurate reasoning and identify inconsistencies; the undercover agent aims to generate plausible-sounding reasoning while avoiding detection.
- Voting Phase: Each agent votes based on four factors — inconsistency score \(\phi_1\), deviation-from-consensus score \(\phi_2\), detail accuracy \(\phi_3\), and behavioral suspicion \(\phi_4\). A majority vote eliminates the most suspicious agent.
- Design Motivation: Information asymmetry creates strategic tension — normal agents prove they are viewing \(I^+\) by providing specific details, while the undercover must balance plausibility against accuracy.
- Three Dimensions of Innovation (vs. Traditional MAD):
- Fact-Checking vs. Statistical Consensus: Counterfactual testing provides ground truth to verify who is hallucinating, rather than relying solely on majority voting.
- Cross-Evidence vs. Single Source: Additional evidence is dynamically generated by modifying the image; traditional MAD relies on a single original image.
- Active Reasoning vs. Passive Answering: Agents actively question, verify, and debate rather than merely answering a given question.
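The edit-selection logic for \(I^-\) can be sketched roughly as follows. The cue strings, weights, and linear scoring form are illustrative assumptions; the paper's actual pipeline computes \(C_{vs}\) with ViT embeddings, \(C_{sc}\) with CLIP similarity, and \(C_{na}\) with FID, and executes edits with Step1X-Edit.

```python
# Question-type -> edit-type routing, as described in the paper;
# the cue strings are illustrative assumptions.
EDIT_TYPE = {
    "how many": "quantity_edit",
    "what object": "object_replacement",
    "what color": "attribute_edit",
}

def route_edit(question):
    q = question.lower()
    return next((t for cue, t in EDIT_TYPE.items() if cue in q), "attribute_edit")

def edit_score(c_vs, c_sc, fid, alpha=1.0, beta=1.0, gamma=0.1):
    """Combine the three constraints: visual similarity C_vs and semantic
    consistency C_sc are rewarded; FID is a distance, so a high FID
    (an unnatural edit) is penalized. Weights are assumptions."""
    return alpha * c_vs + beta * c_sc - gamma * fid

def select_edit(candidates):
    """candidates: list of (edit_id, c_vs, c_sc, fid); pick the best trade-off
    between subtlety (high similarity) and naturalness (low FID)."""
    return max(candidates, key=lambda c: edit_score(*c[1:]))[0]
```

This captures the design tension stated above: an edit scoring well on all three terms is subtle enough to avoid trivial detection yet different enough to separate correct from hallucinated reasoning.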
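One way to picture the voting phase is to fold the four factors \(\phi_1\)–\(\phi_4\) into a single suspicion score per peer. The weighted-sum aggregation and the weights below are assumptions for illustration; in the paper the factors are weighed through the agents' own prompted reasoning.

```python
def suspicion(phi1, phi2, phi3, phi4, w=(0.3, 0.3, 0.2, 0.2)):
    """phi1: inconsistency, phi2: deviation from consensus,
    phi3: detail accuracy (accurate details LOWER suspicion),
    phi4: behavioral suspicion. Weights are illustrative assumptions."""
    return w[0] * phi1 + w[1] * phi2 - w[2] * phi3 + w[3] * phi4

def cast_vote(voter, factor_scores):
    """factor_scores: {agent_index: (phi1, phi2, phi3, phi4)}.
    Each voter names its most suspicious peer (never itself)."""
    peers = {a: f for a, f in factor_scores.items() if a != voter}
    return max(peers, key=lambda a: suspicion(*peers[a]))
```

Note the sign on \(\phi_3\): providing specific, accurate details is how a normal agent proves it is viewing \(I^+\), so detail accuracy reduces suspicion while the other three factors raise it.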
Key Experimental Results¶
Main Results¶
| Method | MMMU | MMStar | HallusionBench Avg | POPE Acc |
|---|---|---|---|---|
| Qwen2.5VL-7B (baseline) | 45.0 | 61.2 | 46.4 | 87.4 |
| Qwen2.5VL-7B (Self-Refine) | 45.8 | 61.5 | 48.8 | 85.9 |
| Qwen2.5VL-7B (MAD-Vote) | 44.7 | 57.4 | 37.8 | 80.0 |
| Qwen2.5VL-7B (MAD-Judge) | 47.4 | 62.3 | 50.2 | 85.2 |
| Qwen2.5VL-7B (MUG) | 50.3 | 63.8 | 53.8 | 88.4 |
Ablation Study¶
| Configuration | MMStar | HallusionBench | MMMU |
|---|---|---|---|
| MUG Full | 63.80 | 53.80 | 50.33 |
| w/o Counterfactual Editing | 62.31 (-1.49) | 50.19 (-3.61) | 49.25 (-1.08) |
| w/o Undercover Mechanism | 62.23 (-1.57) | 49.31 (-4.49) | 47.66 (-2.67) |
Key Findings¶
- MUG lifts Qwen2.5VL-7B to 50.3 on MMMU, which the authors report as surpassing GPT-4V, a much larger model; MAD-Vote instead degrades performance (44.7 vs. the 45.0 baseline on MMMU).
- The improvement on HallusionBench is most pronounced — MUG achieves +7.4% over baseline, while MAD-Vote yields −8.6% — indicating that MAD can be detrimental for hallucination detection.
- The undercover mechanism contributes more than counterfactual editing (its removal causes a larger performance drop), suggesting that "game dynamics" contribute more than "additional evidence."
- One detection round yields the best results (50.3/63.8/69.4); additional rounds degrade performance — indicating that prolonged debate may mislead normal agents.
- The additional time overhead is modest (3.74 s per sample vs. MAD's 2.35 s), offering a favorable cost-performance ratio.
Highlights & Insights¶
- The analogy from social-reasoning games to AI hallucination detection is particularly elegant: the core of the undercover game, identifying anomalous players through information asymmetry, maps directly onto identifying hallucinating agents through counterfactual testing.
- Counterfactual editing provides ground truth: This represents the most fundamental improvement over traditional MAD — a shift from "group statistics" to "verifiable facts."
- MAD-Vote can be harmful: Experiments show that MAD-Vote drops from 46.4% to 37.8% on HallusionBench — if multiple agents share the same hallucination, voting amplifies the error. MUG avoids this through counterfactual testing.
- The finding that one round is optimal: Counter-intuitively, more debate rounds degrade performance — because the undercover agent's arguments may convince normal agents, particularly on reasoning-type questions.
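The MAD-Vote failure mode is easy to see in toy form: plain majority voting has no way to distinguish a shared hallucination from a shared truth. The numbers below are illustrative, not from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """MAD-Vote-style statistical consensus: the most common answer wins."""
    return Counter(answers).most_common(1)[0][0]

# If three of five agents share the same hallucination, consensus converges
# to the error no matter what the two correct agents say.
answers = ["blue", "blue", "blue", "red", "red"]  # ground truth: "red"
consensus = majority_vote(answers)                # -> "blue"
```

MUG sidesteps this because the counterfactual edit is known, so there is a verifiable ground truth for who is hallucinating rather than a head count.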
Limitations & Future Work¶
- The quality of counterfactual image generation is inconsistent — failure modes include edits that are too subtle, editing failures, or unnatural outputs.
- The undercover agent is currently selected at random; selection based on initial response uncertainty would be a natural improvement.
- Only image-based counterfactual testing is supported; text-level counterfactuals remain unexplored.
- Normal agents may be misled when the game extends beyond one round — more robust "immunity" mechanisms are needed.
- Computational complexity scales with the number of agents and rounds.
Related Work & Insights¶
- vs. MAD-Vote/MAD-Judge: MUG substantially outperforms both on HallusionBench (53.8 vs. 37.8/50.2), demonstrating that counterfactual testing is more effective than voting- or judge-based approaches.
- vs. Self-Refine: Self-Refine involves only single-agent self-correction and lacks multi-perspective verification; MUG obtains richer viewpoints through multi-agent gaming.
- vs. iMAD (in this note collection): iMAD uses a classifier to determine "when to debate," while MUG improves "how to debate" — the two approaches are complementary and could be combined.
- Insight: The core idea of counterfactual testing — "verifying understanding by introducing controlled differences" — generalizes to scenarios such as code verification and knowledge assessment.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The combination of social reasoning games and counterfactual editing is highly innovative and conceptually distinctive.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 4 benchmarks with multiple baselines, ablation studies, and game dynamics analysis.
- Writing Quality: ⭐⭐⭐⭐ The game analogy is vivid and the formal definitions are clear.
- Value: ⭐⭐⭐⭐⭐ A fundamental improvement to the MAD paradigm — shifting from statistical consensus to verifiable fact-checking.