InspireDebate: Multi-Dimensional Evaluation-Guided Reasoning for Debating

Conference: ACL 2025
arXiv: 2506.18102
Code: https://github.com/fywang12/InspireDebate
Area: Others
Keywords: Debate optimization, multi-dimensional evaluation, DPO, CoT reasoning, fact verification

TL;DR

The paper proposes a two-component framework: InspireScore, a debate evaluation system that integrates 4 subjective dimensions and 2 objective dimensions, and InspireDebate, a debate framework optimized through a three-stage process of CoT-SFT + multi-dimensional DPO + Web-RAG. InspireScore improves correlation with expert judgment by 44%, and InspireDebate surpasses the baseline models in debate performance by 57%.

Background & Motivation

Background: LLMs have made progress in debate tasks, including argument quality assessment and debate process simulation. Works like Debatrix have advanced debate-level automatic evaluation.

Limitations of Prior Work: (a) Existing evaluation systems focus on subjective dimensions (e.g., emotion, clarity), neglecting objective dimensions (e.g., factual truthfulness, logical validity), and cannot detect hallucinations and logical fallacies; (b) Debate systems lack the representation of structured reasoning processes; (c) Optimization methods lack evaluation-driven iterative self-improvement.

Key Challenge: Debating requires simultaneously satisfying rhetorical persuasiveness (subjective) and logical/factual correctness (objective). These two dimensions may conflict—an argument can be emotionally strong but logically flawed. Existing methods cannot optimize both dimensions simultaneously.

Goal: How to construct a unified subjective-objective debate evaluation system and use it to guide the multi-dimensional optimization of LLM debating capabilities?

Key Insight: Logical validity can be evaluated with first-order logic, factual truthfulness can be verified through external search, and the resulting multi-dimensional scores can serve as reward signals for optimizing the model with DPO.

Core Idea: First establish InspireScore, an evaluation system integrating subjective (emotional appeal, clarity, arrangement, and relevance) and objective (logical symbolic reasoning + search-based fact verification) dimensions, and then iteratively optimize the debating LLM via DPO using multi-dimensional feedback from InspireScore.

Method

Overall Architecture

Two components are connected in series:

  1. InspireScore (evaluation system): 4 subjective dimensions + 2 objective dimensions
  2. InspireDebate (optimization framework): SFT (CoT integration) → DPO (guided by InspireScore) → Web-RAG (real-time fact augmentation)

Key Designs

  1. InspireScore Subjective Evaluation:

    • Emotional Appeal: Whether the arguments elicit a sense of agreement/empathy
    • Argument Clarity: Whether the expression is clear and concise
    • Argument Arrangement: Whether the order and structure of the arguments are reasonable
    • Topic Relevance: Whether the arguments closely adhere to the debate topic
    • Implementation: Designing structured evaluation prompts for LLMs to score across different dimensions
  2. InspireScore Objective Evaluation — Logical Validity:

    • Function: Evaluates whether the reasoning in the debate logically supports the arguments
    • Mechanism: A two-step method—(1) LLM translates natural language arguments into first-order logic (FOL) symbolic representations; (2) Applies logical deduction rules to verify whether each step of the reasoning correctly derives the conclusion
    • Metric: \(S_{LV} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{N_i} v(\text{FOL}_i^j)}{\sum_{i=1}^{m} N_i}\), representing the proportion of argument expressions that can be correctly deduced
    • Design Motivation: Directly detecting logical fallacies in reasoning rather than merely evaluating whether it "sounds" reasonable
  3. InspireScore Objective Evaluation — Factual Truthfulness:

    • Function: Deconstructs debate responses into independent factual statements and verifies their truthfulness using a search engine
    • Mechanism: Optimized based on the SAFE method, extracting facts → constructing search queries → retrieving external evidence → LLM judging whether the facts are supported
    • Metric: \(S_{FA} = \frac{\text{number of facts verified as true}}{\text{total number of independent facts}}\)
  4. InspireDebate Optimization Framework:

    • SFT + CoT: Constructs structured training data containing reasoning processes and argument outputs using GPT-4o, solving the refusal behavior issue of open-source models
    • Multi-dimensional DPO: Two LLMs play the affirmative and negative sides respectively to debate, using InspireScore to evaluate each debate → constructing preference pairs \((y_w, y_l)\) → DPO optimization
    • Web-RAG: Real-time keyword extraction during debates → search engine retrieval → integrating retrieved information into argument generation
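The two objective metrics above are simple ratios, and a minimal sketch may make them concrete. It assumes the FOL translation and the search-based fact checking have already run and produced boolean verdicts; the function names, the input layout, and the choice to return 0.0 for empty input are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def logical_validity(fol_verdicts: List[List[bool]]) -> float:
    """S_LV: share of FOL expressions that were verified as valid.

    fol_verdicts[i][j] plays the role of v(FOL_i^j) in the formula:
    True if the j-th expression of the i-th argument verifies.
    """
    total = sum(len(steps) for steps in fol_verdicts)
    if total == 0:
        return 0.0  # assumption: an empty debate scores 0 rather than raising
    valid = sum(sum(steps) for steps in fol_verdicts)
    return valid / total


def factual_truthfulness(fact_verdicts: List[bool]) -> float:
    """S_FA: share of independent factual statements judged as supported."""
    if not fact_verdicts:
        return 0.0  # assumption: no extracted facts scores 0
    return sum(fact_verdicts) / len(fact_verdicts)


if __name__ == "__main__":
    # Two arguments: the first has 3 FOL expressions (one invalid), the second has 1.
    print(logical_validity([[True, True, False], [True]]))            # 0.75
    # Five extracted facts, three of them supported by retrieved evidence.
    print(factual_truthfulness([True, False, True, True, False]))     # 0.6
```

Both scores are micro-averaged over all expressions or facts, so a long argument with many expressions weighs more than a short one, which matches the double sum in the \(S_{LV}\) formula.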

Loss & Training

  • SFT Data: CoT-structured debates generated by GPT-4o on 100 debate topics
  • DPO Data: Self-play debates on 510 debate topics, filtered by InspireScore scores
  • Hardware: 2 \(\times\) V100 (32 GB), training for 2-3 hours
  • Fine-tuning: LoRA, lr=1e-5, 3 epochs
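This summary does not spell out the training loss or the filtering rule, so the following is only a sketch under stated assumptions. It shows one plausible way to turn InspireScore results into a preference pair \((y_w, y_l)\), and then evaluates the standard DPO objective \(-\log \sigma\big(\beta[(\log \pi_\theta(y_w) - \log \pi_{ref}(y_w)) - (\log \pi_\theta(y_l) - \log \pi_{ref}(y_l))]\big)\) for a single pair. The unweighted mean over dimensions, the `min_margin` filter, the value of `beta`, and all function names are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def aggregate(scores: Dict[str, float]) -> float:
    """Collapse per-dimension scores into one number.

    Assumption: an unweighted mean; the actual weighting is not given here.
    """
    return sum(scores.values()) / len(scores)


def make_preference_pair(
    resp_a: str, scores_a: Dict[str, float],
    resp_b: str, scores_b: Dict[str, float],
    min_margin: float = 0.1,  # hypothetical filter threshold
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None when the scores are too close."""
    sa, sb = aggregate(scores_a), aggregate(scores_b)
    if abs(sa - sb) < min_margin:
        return None  # near-ties are dropped as uninformative
    return (resp_a, resp_b) if sa > sb else (resp_b, resp_a)


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pair = make_preference_pair(
        "A", {"logic": 0.9, "facts": 0.8},
        "B", {"logic": 0.5, "facts": 0.6},
    )
    print(pair)  # ('A', 'B')
    # When the policy equals the reference model, the margin is 0 and the loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

In practice the log-probabilities would come from the LoRA-adapted policy and a frozen reference model; here they are plain floats so the arithmetic can be checked by hand.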

Key Experimental Results

InspireScore Evaluation Quality

| Evaluation System | Correlation with Expert Judgment | Description |
|---|---|---|
| Debatrix | Baseline | Subjective dimensions only |
| InspireScore | +44% gain in correlation | Subjective + objective dimensions |

InspireDebate Debate Performance

| Method | Overall Gain | Description |
|---|---|---|
| Baseline Model | - | LLaMA-8B/Qwen-1.5B etc. |
| Inspire Version | +57% | SFT+DPO+Web-RAG |

Key Findings

  • Objective dimensions are key differentiating factors: The inclusion of logical validity and factual truthfulness significantly improves the alignment of evaluation with expert judgments.
  • Subjective and objective dimensions may conflict: Models may generate emotionally strong but logically flawed arguments. A unified framework can balance this tension.
  • Web-RAG improves factual reliability: Real-time retrieval reduces factual hallucinations in debates.
  • Small models also benefit: Small models, such as Qwen-1.5B and Phi-3.6B, also achieve significant improvements after optimization through InspireDebate.

Highlights & Insights

  • The approach of using first-order logic to verify debate validity is highly creative: Translating debate arguments into symbolic representations and verifying them using logical deduction rules is a good example of combining formal methods with LLMs.
  • Evaluation-driven optimization closed-loop: InspireScore serves as both an evaluation tool and the source of reward signals for DPO, forming an iterative improvement loop of "evaluation → optimization → re-evaluation".
  • Unified subjective + objective evaluation fills a gap in the field of debate evaluation.

Limitations & Future Work

  • Logical validity evaluation relies on the LLM to translate natural language into FOL; translation errors will affect the accuracy of the evaluation.
  • Fact verification relies on search engines, so it is limited by the quality and coverage of the search results.
  • The construction of DPO preference pairs relies on the accuracy of InspireScore itself, posing a bootstrapping risk.
  • Evaluation was conducted only on English debates.

Related Work

  • vs Debatrix (Liang et al. 2024): Debatrix only evaluates subjective dimensions such as arguments, sources, and language; InspireScore adds logical validity and factual truthfulness.
  • vs MAD (Liang et al. 2024): MAD utilizes debate to enhance reasoning but lacks objective feedback; InspireDebate provides multi-dimensional feedback via InspireScore.
  • vs DebateTune (Li et al. 2024): DebateTune enhances argument diversity but lacks evaluation-driven optimization.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of unified subjective-objective evaluation is valuable, and the design of verifying argument validity with first-order logic is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 4 open-source + 2 closed-source models, using both automatic and human evaluations.
  • Writing Quality: ⭐⭐⭐ The framework is complex but the writing is relatively clear, though some formula symbols are dense.
  • Value: ⭐⭐⭐⭐ Provides a comprehensive toolchain of evaluation + optimization for LLM debating.