DeepDebater: A Superpersuasive Autonomous Policy Debating System

Conference: AAAI 2026 arXiv: 2511.17854 Code: GitHub Area: Audio & Speech Keywords: Policy Debate, Multi-Agent Collaboration, LLM, Evidence Retrieval, Autonomous Debating, TTS

TL;DR

This paper presents DeepDebater, the first autonomous multi-agent system capable of participating in and winning a complete American-style policy debate (eight speeches plus cross-examination). The system employs a hierarchical agent workflow to construct affirmative (Advantage) and negative (DA+CP+Kritik) arguments, leverages over 3 million evidence cards from OpenDebateEvidence for retrieval-augmented generation, and integrates GPT-4o mini TTS speech synthesis with EchoMimic digital avatar animation for end-to-end presentation. Expert evaluations show DeepDebater significantly outperforming human-authored cases across all metrics (Quality: 4.32 vs. 3.65), and it achieves an 85% win rate in simulated rounds.

Background & Motivation

Highly complex, evidence-based, and strategically adaptive persuasion remains a fundamental challenge for AI. IBM Project Debater, the most prominent prior AI debating system, exhibits three key limitations.

First, oversimplified format: it employs a non-standard, extremely short debate format targeting lay audiences—a format for which no competitive tournament exists worldwide. Real American policy debate consists of eight speeches and four cross-examination periods with a strict and complex structure.

Second, shallow evidence use: Project Debater cites a small number of evidence pieces for brief speeches, whereas real policy debate is built on "cards"—evidence units containing multiple pages of direct quotations with highlighting and tag summaries, which debaters must cite densely and near-verbatim.

Third, non-iterative gameplay: Project Debater does not conduct a full multi-round debate and does not handle refutation or strategic interaction. Real debate requires point-by-point responses to every argument in the preceding speech.

American competitive policy debate serves as an ideal testbed for AI argumentation research: strict time limits, dependence on large volumes of high-quality evidence, formalized structure, and simultaneous demands for long-term strategic planning and real-time tactical decision-making. The core idea is to decompose the complex debating task into a pipeline of hierarchical, specialized agent workflows, with each agent team responsible for a discrete argumentative task.

Method

Overall Architecture

The system adopts a modular pipeline framework with two core components: an OpenDebateEvidence evidence database indexed via DuckDB (3M+ debate cards, BM25 retrieval) and a hierarchical multi-agent conversational architecture built on the AG2/AutoGen framework (driven by gpt-4.1-mini), complemented by GPT-4o mini TTS speech synthesis and EchoMimic V1 digital avatar animation for end-to-end presentation.
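
To make the retrieval layer concrete, here is a minimal sketch of BM25 ranking over a DuckDB-backed card store. The `cards` schema, database path, and whitespace tokenizer are illustrative assumptions; the paper's actual indexing code may differ.

```python
import duckdb
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Hypothetical schema: a `cards` table with (id, tag, body) columns.
con = duckdb.connect("open_debate_evidence.duckdb")
cards = con.execute("SELECT id, tag, body FROM cards").fetchall()

# Naive whitespace tokenization; a real analyzer would stem and strip stopwords.
bm25 = BM25Okapi([body.lower().split() for _, _, body in cards])

def retrieve(claim: str, k: int = 10):
    """Return the top-k (score, id, tag) tuples supporting a claim."""
    scores = bm25.get_scores(claim.lower().split())
    ranked = sorted(zip(scores, cards), key=lambda x: x[0], reverse=True)
    return [(score, cid, tag) for score, (cid, tag, _) in ranked[:k]]

print(retrieve("federal job guarantee boosts aggregate demand"))
```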

Key Designs

  1. Reusable Multi-Agent Workflow Pattern

    • Function: Defines the fundamental building blocks of argument generation.
    • Mechanism: Each workflow contains collaborating specialized agents — a Generator drafts arguments, a Retriever searches and ranks evidence (often retrieving hundreds of cards per claim before selecting the best), and a Critic reviews quality and suggests revisions. The loop iterates until the Critic is satisfied or a maximum iteration count is reached. Pydantic models enforce structured outputs to ensure machine-readable agent message formats. A minimal sketch of this loop appears after this list.
    • Design Motivation: Each argumentative component in policy debate (e.g., Advantage Link/Impact/Uniqueness, DA Stock Issues) requires independent evidentiary support and logical construction. Role separation enables specialization, and the Critic mechanism prevents low-quality outputs.
  2. Debate Generation Pipeline

    • Function: Sequentially generates a complete debate following the eight-speech structure of policy debate.
    • Mechanism (a toy sketch of this sequencing appears after this list):
      - 1AC (First Affirmative Constructive): three stages — Plantext generation (researching viable policy plans) → Stock Issue workflows (independent agent teams for Harms/Inherency/Solvency) → Advantage generation (each with a Uniqueness/Link/Internal Link/Impact evidence chain).
      - 1NC (First Negative Constructive): strategic combination generation → Off-Case workflows (Topicality/Theory; Disadvantages with full evidence chains; Counterplans with alternative mechanisms; Kritiks with philosophical/ethical challenges and Alternatives) → On-Case refutation (directly attacking 1AC evidence).
      - Subsequent speeches: 2AC → 2NC → 1NR → 1AR → 2NR → 2AR, each conditioned on the full prior transcript.
      - Cross-examination: simulated via a two-agent dialogue modeling strategic questioning.
      - Adjudication: an independent Judge Agent (Claude/Gemini/GPT-4.1) reads the full transcript and issues a reasoned decision (RFD).
    • Design Motivation: Strict adherence to the authentic format of policy debate — format compliance is a core differentiator between novice and championship-level debaters.
  3. End-to-End Presentation + Human-AI Collaboration

    • Function: Transforms textual debate into an audible, embodied interactive experience.
    • Mechanism: GPT-4o mini TTS synthesizes speeches into audio → EchoMimic V1 drives a static portrait image to produce lip-synced digital avatar video. On-screen transcripts are simultaneously displayed to match the "flowing" practice of debate. Three modes are supported: fully automated AI vs. AI, hybrid AI+Human teams, and AI vs. Human. A sketch of the synthesis call appears after this list.
    • Design Motivation: Policy debate is inherently live and interactive — judges must hear speeches and observe delivery. Digital avatar presentation gives the AI debater a physical "presence." The human-AI collaboration design allows human intervention at any stage, functioning both as a research tool and as an engaging interface.
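
Below are three small sketches grounding the designs above. First, the reusable workflow from item 1: stub functions stand in for the AG2-wrapped LLM agents, and the Pydantic schemas (`ArgumentDraft`, `Critique`) are hypothetical names, not the paper's.

```python
from pydantic import BaseModel

class ArgumentDraft(BaseModel):
    tag: str               # one-line claim summary ("tag" in debate jargon)
    card_ids: list[str]    # evidence cards cited in support
    text: str              # full argument text

class Critique(BaseModel):
    approved: bool
    revisions: list[str]   # concrete fixes for the Generator to apply

def run_workflow(task, draft_fn, retrieve_fn, review_fn, max_iters=3):
    """Iterate draft -> retrieve -> critique until approval or the budget runs out."""
    draft = draft_fn(task)
    for _ in range(max_iters):
        # Retriever: cast a wide net, then keep only the strongest cards.
        cards = retrieve_fn(draft.tag, top_k=200)[:5]
        draft = draft.model_copy(update={"card_ids": cards})
        critique = review_fn(draft)  # Critic returns a structured, machine-readable verdict
        if critique.approved:
            break
        # Generator revises against the Critic's requested fixes.
        draft = draft_fn(task + " | revise: " + "; ".join(critique.revisions))
    return draft

# Stub agents for illustration; the real system wraps gpt-4.1-mini calls via AG2.
draft_fn = lambda t: ArgumentDraft(tag=t, card_ids=[], text=f"[draft] {t}")
retrieve_fn = lambda query, top_k: [f"card-{i}" for i in range(top_k)]
review_fn = lambda d: Critique(approved=bool(d.card_ids), revisions=["add impact evidence"])

print(run_workflow("hegemonic decline is inevitable", draft_fn, retrieve_fn, review_fn))
```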
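
Second, the speech sequencing from item 2: each speech is generated conditioned on the full prior transcript, and an independent judge reads the finished round. `generate_speech` and `judge_round` are hypothetical stand-ins for the paper's agent workflows.

```python
# Eight-speech policy debate order; cross-examination periods are omitted here.
SPEECH_ORDER = ["1AC", "1NC", "2AC", "2NC", "1NR", "1AR", "2NR", "2AR"]

def run_round(generate_speech, judge_round):
    transcript = []
    for speech in SPEECH_ORDER:
        side = "AFF" if "A" in speech else "NEG"
        # Every speech sees the entire transcript so far, enabling
        # point-by-point refutation of the preceding speech.
        text = generate_speech(speech, side, transcript)
        transcript.append((speech, text))
    return judge_round(transcript)  # the Judge Agent issues a reasoned decision (RFD)

# Stub usage:
rfd = run_round(
    generate_speech=lambda s, side, tr: f"[{side} {s}] answers {len(tr)} prior speeches",
    judge_round=lambda tr: "NEG on the disadvantage outweighing the case.",
)
print(rfd)
```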
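
Third, the presentation step from item 3 via the OpenAI speech endpoint. The model id, voice, and file names are assumptions; EchoMimic V1 is a separate open-source tool that would be run on the saved audio plus a portrait image.

```python
from pathlib import Path
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
speech_text = "We affirm. Observation one: inherency. ..."

# Synthesize the speech to an MP3 file.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",  # assumed model id
    voice="alloy",
    input=speech_text,
) as response:
    response.stream_to_file(Path("1ac.mp3"))

# A lip-synced avatar video would then be produced by running EchoMimic V1
# on (portrait image, 1ac.mp3); see the EchoMimic repository for its usage.
```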

Loss & Training

No model training is involved. The system relies entirely on zero-shot prompted inference with LLMs (gpt-4.1-mini) and BM25 retrieval augmentation. Estimated cost is approximately $1–3 per round (text only), $3–5 with speech synthesis, and $20–50 with digital avatar video.

Key Experimental Results

Main Results

Experiment 1: Expert Evaluation (5 judges with 10+ years of coaching experience, 1–5 scale)

| Metric | DeepDebater | Human-Authored | Gap |
| --- | --- | --- | --- |
| Quality (strategic coherence + persuasiveness) | 4.32 ± 0.31 | 3.65 ± 0.52 | +0.67 |
| Factuality (factual accuracy) | 4.45 ± 0.25 | 3.98 ± 0.23 | +0.47 |
| Faithfulness (tag-to-evidence fidelity) | 4.81 ± 0.19 | 4.05 ± 0.48 | +0.76 |

Experiment 2: Simulated Rounds (20 rounds, AI judges)

| Scenario | Rounds | DeepDebater Win Rate |
| --- | --- | --- |
| System as Negative vs. human Affirmative case | 10 | 90% |
| System as Affirmative vs. human Negative strategy | 10 | 80% |
| Total | 20 | 85% |

Ablation Study

Experiment 3: Cross-Judge Robustness (same 20 transcripts)

| Judge Model | Win Rate (%) | Δ vs. Gemini (pp) | Cohen's κ vs. Gemini |
| --- | --- | --- | --- |
| Gemini | 85 | 0 | — |
| Claude | 80 | −5 | 0.75 |
| GPT-4.1 | 83 | −2 | 0.89 |
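
A quick refresher on the agreement metric (standard definition, not from the paper): Cohen's κ corrects raw agreement between two judges for the agreement expected by chance. With observed agreement $p_o$ and chance agreement $p_e$,

$$\kappa = \frac{p_o - p_e}{1 - p_e}.$$

Purely for illustration: if two judges agreed on 19 of 20 rounds ($p_o = 0.95$) under a chance-agreement rate of $p_e = 0.55$, then $\kappa \approx 0.89$, which is conventionally read as near-perfect agreement.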

Key Findings

  • Faithfulness shows the largest gap (+0.76): the AI demonstrates the greatest advantage in accurately summarizing evidence via tags — precisely one of the core skills in policy debate.
  • Judge RFDs frequently cite the system's superior evidence density and more comprehensive line-by-line refutation.
  • Inter-judge agreement is reasonably high (κ = 0.75–0.89), though AI judge bias remains a risk.
  • The system achieves a higher win rate as the Negative (90% vs. 80%), likely because the Negative can deploy more targeted strategies against a known Affirmative case.

Highlights & Insights

  • First complete policy debate AI system: covering eight speeches, cross-examination, refutation, and adjudication — substantially more complex than IBM Project Debater.
  • Elegant hierarchical multi-agent division of labor: each debate component (Advantage/DA/CP/Kritik/Topicality) is handled by a dedicated workflow, with iterative generate-retrieve-review cycles ensuring quality.
  • Heavy evidentiary grounding: the system is built on 3M+ real debate evidence cards, and every claim is traceable to a specific quotation — a compelling instantiation of trustworthy AI argumentation.
  • Human-AI collaboration design has practical value: the system functions not only as a fully autonomous tool but also as a training aid for debaters.
  • Frank discussion of AI persuasion risks is commendable: the paper enumerates potential misuses including micro-targeted manipulation, information warfare, and social engineering attacks.

Limitations & Future Work

  • Weak BM25 retrieval: the paper acknowledges that dense (embedding-based) retrieval would significantly improve quality, but was not implemented due to portability and cost constraints. Reliance on sparse lexical matching may miss semantically relevant evidence phrased differently.
  • Small evaluation scale and reliance on AI judges: only 5 experts × 3 repetitions (human evaluation) and 20 simulated rounds (AI judges). The study lacks sufficient statistical power, and AI judges may exhibit stylistic or in-family biases.
  • Evidence database is capped at 2022: the system is prompted to simulate a 2022 temporal context; the ability to automatically "cut cards" (create new evidence from open-access literature) would substantially enhance the system.
  • English-only, American Policy Debate only: performance in other formats (British Parliamentary, Lincoln-Douglas) or other languages has not been tested.
  • Adversarial robustness unverified: no stress testing against adversarial opponents (e.g., unconventional strategies, prompt injection) or corrupted evidence.
  • Computational cost and API dependency: a full debate round requires extensive API calls (gpt-4.1-mini + TTS + EchoMimic), making reproducibility vulnerable to API changes.
  • Winning a round ≠ being correct: the system optimizes for judge decisions rather than truth-seeking or uncertainty calibration.

Compared to IBM Project Debater, DeepDebater represents a qualitative leap in task complexity — from simplified-format short speeches to a complete eight-speech policy debate. Relative to general applications of multi-agent frameworks such as AutoGen, this work demonstrates the capacity of hierarchical agent workflows on extremely complex creative tasks.

The paper's title and capitalization constitute a deliberate homage and response to IBM Project Debater's final paper, "An autonomous debating system" — escalating from "an autonomous" to "a superpersuasive autonomous."

The hierarchical workflow architecture is generalizable to other scenarios requiring multi-stage, multi-role, evidence-intensive content generation (e.g., legal document drafting, academic writing, think-tank reports). The discussion of "dual-use" risks of AI persuasion carries significant implications for AI safety research — particularly the approach of decomposing persuasive capability into independently evaluable components (evidence retrieval, argument construction, strategic planning).

Rating

  • Novelty: ⭐⭐⭐⭐ First complete policy debate system with a novel hierarchical agent architecture, though core techniques (RAG + multi-agent) are not original contributions.
  • Experimental Thoroughness: ⭐⭐⭐ Expert evaluation and simulated rounds are included, but the scale is limited (5 evaluators × 3 trials; 20 rounds), with insufficient statistical power.
  • Writing Quality: ⭐⭐⭐⭐ Domain background is thoroughly introduced, system architecture is clearly described, and the discussion of dual-use risks is comprehensive and responsible.
  • Value: ⭐⭐⭐⭐ Demonstrates the potential of LLM + multi-agent systems on extremely complex argumentation tasks; offers meaningful insights for AI safety and persuasion research.