InspireDebate: Multi-Dimensional Evaluation-Guided Reasoning for Debating
Conference: ACL 2025
arXiv: 2506.18102
Code: https://github.com/fywang12/InspireDebate
Area: Others
Keywords: Debate optimization, multi-dimensional evaluation, DPO, CoT reasoning, fact verification
TL;DR
A two-component framework is proposed: InspireScore (a debate evaluation system integrating 4 subjective dimensions and 2 objective dimensions) and InspireDebate (a debate framework optimized through a three-stage process of CoT-SFT + multi-dimensional DPO + Web-RAG). InspireScore improves correlation with expert judgment by 44%, and InspireDebate surpasses the baseline in debate performance by 57%.
Background & Motivation
Background: LLMs have made progress in debate tasks, including argument quality assessment and debate process simulation. Works like Debatrix have advanced debate-level automatic evaluation.
Limitations of Prior Work: (a) Existing evaluation systems focus on subjective dimensions (e.g., emotion, clarity) and neglect objective dimensions (e.g., factual truthfulness, logical validity), so they cannot detect hallucinations or logical fallacies; (b) Debate systems lack a structured representation of the reasoning process; (c) Optimization methods lack evaluation-driven iterative self-improvement.
Key Challenge: Debating requires simultaneously satisfying rhetorical persuasiveness (subjective) and logical/factual correctness (objective). These two dimensions may conflict—an argument can be emotionally strong but logically flawed. Existing methods cannot optimize both dimensions simultaneously.
Goal: How to construct a unified subjective-objective debate evaluation system and use it to guide the multi-dimensional optimization of LLM debating capabilities?
Key Insight: Using first-order logic to evaluate logical validity, external search to verify factual truthfulness, and DPO with multi-dimensional scores as reward signals to optimize the model.
Core Idea: First establish InspireScore, an evaluation system integrating subjective (emotional appeal, clarity, arrangement, and relevance) and objective (logical symbolic reasoning + search-based fact verification) dimensions, and then iteratively optimize the debating LLM via DPO using multi-dimensional feedback from InspireScore.
Method
Overall Architecture
Two components are connected in series:
1. InspireScore (evaluation system): 4 subjective dimensions + 2 objective dimensions
2. InspireDebate (optimization framework): SFT (CoT integration) → DPO (guided by InspireScore) → Web-RAG (real-time fact augmentation)
Key Designs
- InspireScore Subjective Evaluation:
    - Emotional Appeal: Whether the arguments elicit a sense of agreement/empathy
    - Argument Clarity: Whether the expression is clear and concise
    - Argument Arrangement: Whether the order and structure of the arguments are reasonable
    - Topic Relevance: Whether the arguments closely adhere to the debate topic
    - Implementation: Structured evaluation prompts ask an LLM to score each dimension separately
- InspireScore Objective Evaluation (Logical Validity):
    - Function: Evaluates whether the reasoning in the debate logically supports the arguments
    - Mechanism: A two-step method: (1) an LLM translates natural language arguments into first-order logic (FOL) symbolic representations; (2) logical deduction rules are applied to verify whether each step of the reasoning correctly derives the conclusion
    - Metric: \(S_{LV} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{N_i} v(\text{FOL}_i^j)}{\sum_{i=1}^{m} N_i}\), representing the proportion of argument expressions that can be correctly deduced
    - Design Motivation: Directly detecting logical fallacies in reasoning rather than merely evaluating whether it "sounds" reasonable
- InspireScore Objective Evaluation (Factual Truthfulness):
    - Function: Decomposes debate responses into independent factual statements and verifies their truthfulness using a search engine
    - Mechanism: Builds on the SAFE method: extract facts → construct search queries → retrieve external evidence → an LLM judges whether each fact is supported
    - Metric: \(S_{FA} = \frac{\text{number of facts verified as true}}{\text{total number of independent facts}}\)
- InspireDebate Optimization Framework:
    - SFT + CoT: Uses GPT-4o to construct structured training data containing both the reasoning process and the argument output, which also mitigates the refusal behavior of open-source models
    - Multi-dimensional DPO: Two LLMs debate each other as the affirmative and negative sides; InspireScore evaluates each debate → preference pairs \((y_w, y_l)\) are constructed → DPO optimization
    - Web-RAG: Real-time keyword extraction during debates → search engine retrieval → integrating retrieved information into argument generation
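
The two objective metrics above reduce to simple ratios once the per-expression and per-fact verdicts are available. The following is a minimal sketch, not the authors' implementation: it assumes the FOL validity checks and the search-based fact checks have already produced boolean verdicts, and the function names and the choice to return 0.0 for empty input are illustrative assumptions.

```python
from typing import Sequence


def logical_validity_score(verdicts_per_argument: Sequence[Sequence[bool]]) -> float:
    """S_LV: share of FOL expressions, pooled over all arguments, judged valid.

    verdicts_per_argument[i][j] plays the role of v(FOL_i^j) in the formula.
    """
    total = sum(len(verdicts) for verdicts in verdicts_per_argument)
    if total == 0:
        return 0.0  # assumption: no expressions yields a score of 0
    valid = sum(sum(verdicts) for verdicts in verdicts_per_argument)
    return valid / total


def factual_truthfulness_score(fact_supported: Sequence[bool]) -> float:
    """S_FA: share of independent facts that retrieved evidence supports."""
    if not fact_supported:
        return 0.0  # assumption: no extracted facts yields a score of 0
    return sum(fact_supported) / len(fact_supported)


# Two arguments with 3 and 1 FOL expressions; 3 of the 4 are valid.
s_lv = logical_validity_score([[True, True, False], [True]])
# Four extracted facts; 3 are supported by search evidence.
s_fa = factual_truthfulness_score([True, False, True, True])
print(s_lv, s_fa)
```

Note that the denominator of \(S_{LV}\) pools expressions across arguments, so longer arguments carry more weight than they would under a per-argument average.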
Loss & Training
- SFT Data: CoT-structured debates generated by GPT-4o on 100 debate topics
- DPO Data: Self-play debates on 510 debate topics, filtered by InspireScore scores
- Hardware: 2 \(\times\) V100 (32 GB), training for 2-3 hours
- LoRA fine-tuning, lr=1e-5, 3 epochs
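
The step from InspireScore scores to DPO preference pairs \((y_w, y_l)\) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: the equal-weight average over the six dimensions, the `margin` filter, the assumption that all scores are normalized to [0, 1], and every name in the code (`DIMENSIONS`, `aggregate`, `build_pair`, `PreferencePair`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical keys for the 4 subjective + 2 objective dimensions.
DIMENSIONS = (
    "emotional_appeal",
    "clarity",
    "arrangement",
    "relevance",
    "logical_validity",
    "factual_truthfulness",
)


def aggregate(scores: Dict[str, float]) -> float:
    """Equal-weight mean over all six dimensions (an assumed aggregation)."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # y_w, the higher-scoring response
    rejected: str  # y_l, the lower-scoring response


def build_pair(
    prompt: str,
    resp_a: str,
    scores_a: Dict[str, float],
    resp_b: str,
    scores_b: Dict[str, float],
    margin: float = 0.1,  # assumed filter: drop pairs that are too close
) -> Optional[PreferencePair]:
    a, b = aggregate(scores_a), aggregate(scores_b)
    if abs(a - b) < margin:
        return None  # filtered out: no clear preference signal
    if a > b:
        return PreferencePair(prompt, resp_a, resp_b)
    return PreferencePair(prompt, resp_b, resp_a)


strong = {d: 0.8 for d in DIMENSIONS}
weak = {d: 0.5 for d in DIMENSIONS}
pair = build_pair("Should homework be abolished?", "A", strong, "B", weak)
print(pair)
```

Records with `prompt`, `chosen`, and `rejected` fields are a common input shape for DPO training code, which is why the sketch uses those field names.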
Key Experimental Results
InspireScore Evaluation Quality
| Evaluation System | Correlation with Expert Judgment | Description |
|---|---|---|
| Debatrix | Baseline | Subjective dimensions only |
| InspireScore | +44% gain in correlation | Subjective + objective dimensions |
InspireDebate Debate Performance
| Method | Overall Gain | Description |
|---|---|---|
| Baseline Model | - | LLaMA-8B/Qwen-1.5B etc. |
| Inspire Version | +57% | SFT+DPO+Web-RAG |
Key Findings
- Objective dimensions are key differentiating factors: The inclusion of logical validity and factual truthfulness significantly improves the alignment of evaluation with expert judgments.
- Subjective and objective dimensions may conflict: Models may generate emotionally strong but logically flawed arguments. A unified framework can balance this tension.
- Web-RAG improves factual reliability: Real-time retrieval reduces factual hallucinations in debates.
- Small models also benefit: Small models, such as Qwen-1.5B and Phi-3.6B, also achieve significant improvements after optimization through InspireDebate.
Highlights & Insights
- The approach of using first-order logic to verify debate validity is highly creative: Translating debate arguments into symbolic representations and verifying them using logical deduction rules is a good example of combining formal methods with LLMs.
- Evaluation-driven optimization closed-loop: InspireScore serves as both an evaluation tool and the source of reward signals for DPO, forming an iterative improvement loop of "evaluation → optimization → re-evaluation".
- Unified subjective + objective evaluation fills a gap in the field of debate evaluation.
Limitations & Future Work
- Logical validity evaluation relies on the LLM to translate natural language into FOL; translation errors will affect the accuracy of the evaluation.
- Fact verification relies on search engines, so its accuracy is limited by the quality and coverage of the search results.
- The construction of DPO preference pairs relies on the accuracy of InspireScore itself, posing a bootstrapping risk.
- Evaluation was conducted only on English debates.
Related Work & Insights
- vs Debatrix (Liang et al. 2024): Debatrix only evaluates subjective dimensions such as arguments, sources, and language; InspireScore adds logical validity and factual truthfulness.
- vs MAD (Liang et al. 2024): MAD utilizes debate to enhance reasoning but lacks objective feedback; InspireDebate provides multi-dimensional feedback via InspireScore.
- vs DebateTune (Li et al. 2024): DebateTune enhances argument diversity but lacks evaluation-driven optimization.
Rating
- Novelty: ⭐⭐⭐⭐ The idea of unified subjective-objective evaluation is valuable, and the design of verifying argument validity with first-order logic is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 4 open-source + 2 closed-source models, using both automatic and human evaluations.
- Writing Quality: ⭐⭐⭐ The framework is complex but the writing is relatively clear, though some formula symbols are dense.
- Value: ⭐⭐⭐⭐ Provides a comprehensive toolchain of evaluation + optimization for LLM debating.