Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models

Conference: ACL 2025
arXiv: 2503.05328
Code: https://github.com/anaryegen/counter-argument-generation
Area: LLM/NLP - Counter-Argument Generation
Keywords: counter-argument, dynamic knowledge retrieval, LLM-as-Judge, argumentation, evidence-driven

TL;DR

The paper proposes a dynamic web knowledge retrieval framework to improve the quality of LLM-generated counter-arguments. It constructs a new, moderately sized evaluation dataset (150 pairs) and replaces traditional reference-based metrics with an LLM-as-a-Judge evaluation methodology. Experimental results show that integrating external knowledge significantly improves the relevance, persuasiveness, and factuality of the generated content.

Background & Motivation

Background: Argumentation research in NLP is divided into two major directions: argument mining (extracting argument elements from text) and argument generation (generating argumentative text). LLMs show significant potential for argumentation tasks, but they primarily rely on parametric knowledge, resulting in verbose and potentially factually unsupported responses.

Limitations of Prior Work: (1) Existing counter-argument datasets are either too long (paragraph-length, making quality assessment difficult) or too short (sentence-length, failing to capture argumentative complexity); (2) Traditional reference-based metrics (such as BLEU, METEOR, BERTScore) fail to capture the nuanced quality dimensions of counter-arguments; (3) Prior works utilizing external knowledge are restricted to static databases (such as Wikipedia), which cannot cover time-sensitive topics.

Key Challenge: Without external evidence support, LLMs tend to generate verbose arguments that are stylistically persuasive but lack factual grounding and logical consistency; meanwhile, human evaluation is excessively expensive and subjective.

Goal: (1) Can dynamic web knowledge help LLMs generate better counter-arguments? (2) Which automatic evaluation methods align best with human judgment? (3) To what extent do LLMs utilize retrieved external evidence?

Key Insight: Use real-time web search as a dynamic source of external knowledge, and pair it with a new length-controlled dataset (three-sentence limit) and an LLM-as-a-Judge evaluation approach.

Core Idea: Inject factual evidence retrieved dynamically from the web into the LLM's counter-argument generation pipeline, and evaluate the output with LLM-as-a-Judge, to significantly enhance factuality and persuasiveness.

Method

Overall Architecture

A three-step pipeline: (1) Automatically generate challenging queries (averaging 67 words/query, 5 queries in total) that dispute the key claims of the original argument; (2) Conduct web searches via the Cohere API to retrieve external evidence (averaging 5,496 words); (3) Feed the original argument along with the retrieved evidence to the LLM to generate the counter-argument. A control group is established using only the model's parametric knowledge without external information.
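
The three steps can be sketched as below. This is a minimal sketch, not the authors' implementation: `make_queries`, `search`, and `llm` are hypothetical callables standing in for the query generator, the web search tool, and the generation model, and the prompt wording is an assumption. Passing `use_external_knowledge=False` gives the parametric-only control condition.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CounterArgumentResult:
    queries: List[str]
    evidence: List[str]
    counter_argument: str


def build_prompt(argument: str, evidence: List[str], max_sentences: int = 3) -> str:
    """Assemble the generation prompt (wording is an assumption).

    With an empty evidence list this is the control prompt.
    """
    parts = [f"Argument:\n{argument}"]
    if evidence:
        joined = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(evidence))
        parts.append(f"Retrieved evidence:\n{joined}")
    parts.append(f"Write a counter-argument of at most {max_sentences} sentences.")
    return "\n\n".join(parts)


def generate_counter_argument(
    argument: str,
    make_queries: Callable[[str, int], List[str]],  # hypothetical query generator
    search: Callable[[str], List[str]],             # hypothetical web search wrapper
    llm: Callable[[str], str],                      # hypothetical generation model
    use_external_knowledge: bool = True,
    num_queries: int = 5,
) -> CounterArgumentResult:
    queries: List[str] = []
    evidence: List[str] = []
    if use_external_knowledge:
        # Step 1: queries that challenge the argument's key claims.
        queries = make_queries(argument, num_queries)
        # Step 2: collect the retrieved documents for every query.
        for query in queries:
            evidence.extend(search(query))
    # Step 3: generate from the argument plus whatever evidence was retrieved.
    prompt = build_prompt(argument, evidence)
    return CounterArgumentResult(queries, evidence, llm(prompt))
```

Injecting the three components as callables keeps the sketch independent of any particular search or model API, so the same function serves both experimental conditions.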

Key Designs

Length-Controlled Dataset Construction:
- Function: Reconstructs and refines 150 high-quality argument-counterargument pairs from the CANDELA corpus (Reddit r/ChangeMyView), limiting each counter-argument to 3 sentences.
- Mechanism: The original counter-arguments, averaging 30 sentences and 921 words, are compressed to 3 sentences and 72 words. Summaries are generated with Llama-3.1-70B (a model that is not among those evaluated in the experiments) and then manually verified and structured.
- Design Motivation: Overly long counter-arguments are difficult to evaluate accurately, while excessively short ones cannot reflect argumentative complexity. Three sentences strike a balance between conciseness and expressiveness.
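
A length constraint like this is easy to check mechanically. The helper below is a minimal sketch under a stated assumption: sentences are split naively after `.`, `!`, or `?` followed by whitespace, so abbreviations such as "e.g." would be miscounted. The function names are illustrative and do not come from the paper's code.

```python
import re
from typing import List

# Split after sentence-final punctuation that is followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; good enough for a sanity check only."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


def within_limit(text: str, max_sentences: int = 3) -> bool:
    """True if the text has at most `max_sentences` sentences."""
    return len(split_sentences(text)) <= max_sentences
```

A production pipeline would more likely rely on a proper sentence tokenizer, with the manual verification described above as the final check.
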
Dynamic Web Knowledge Retrieval:
- Function: Automatically retrieves the latest factual evidence related to the argument via the Cohere API's web search tool.
- Mechanism: Automatically generates 5 challenging queries designed specifically to question the key assertions and premises of the original argument, incorporating the search results into the final prompt as contextual information.
- Design Motivation: Static databases (e.g., Wikipedia) cannot cover recent events and their content may not align with dynamic argumentative topics. Web search is not constrained by specific sources.
LLM-as-a-Judge Evaluation Method:
- Function: Uses three models (Prometheus, JudgeLM, and Claude 3.5 Sonnet) as automatic evaluators to score across five dimensions (Opposition, Relatedness, Specificity, Factuality, Persuasiveness) on a 3-point Likert scale.
- Mechanism: Correlation with human judgment is validated using the Spearman rank correlation coefficient. Claude 3.5 Sonnet achieves \(\rho = 0.82\) (very strong correlation), far above the weak correlations of the reference-based metrics.
- Design Motivation: Manual evaluation is expensive and subjective, whereas BLEU/METEOR/BERTScore demonstrate extremely low correlation with human preferences.
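
To make the validation step concrete, here is a minimal pure-Python sketch of the Spearman rank correlation between judge scores and human scores. It assigns average ranks to ties, which matters on a 3-point scale where ties are frequent. The function names are illustrative; in practice `scipy.stats.spearmanr` would be the usual choice.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman_rho(judge_scores, human_scores)` on per-sample scores for one dimension gives one coefficient; how the paper aggregates across dimensions is not stated in this summary.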

Loss & Training

This work operates in inference-only mode (no fine-tuning); all models run with default hyperparameters for a fair comparison. The experimental models are Command R+ (104B) and Mistral-7B-Instruct-v0.3, each evaluated in two configurations: with and without external knowledge.

Key Experimental Results

Main Results

| Model | BLEU | ROUGE | METEOR | BERTScore | Mean |
| --- | --- | --- | --- | --- | --- |
| Command R+ | 20.35 | 18.36 | 16.12 | 86.38 | 35.30 |
| Command R+ + External Knowledge | 20.80 | 18.67 | 16.81 | 86.15 | 35.60 |
| Mistral-7B | 17.36 | 15.93 | 13.96 | 86.23 | 33.37 |
| Mistral-7B + External Knowledge | 17.30 | 16.58 | 14.36 | 86.29 | 33.63 |
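
As a quick arithmetic check, the Mean column can be recomputed as the unweighted average of the four metrics. The sketch below assumes that definition of Mean, which the table does not state explicitly. Under that assumption the recomputed values agree with the reported ones to within 0.01, and the gain from external knowledge is only about 0.3 points for either model, which illustrates how little the reference-based metrics separate the two conditions.

```python
# Scores copied from the table: (BLEU, ROUGE, METEOR, BERTScore).
scores = {
    "Command R+": (20.35, 18.36, 16.12, 86.38),
    "Command R+ + External Knowledge": (20.80, 18.67, 16.81, 86.15),
    "Mistral-7B": (17.36, 15.93, 13.96, 86.23),
    "Mistral-7B + External Knowledge": (17.30, 16.58, 14.36, 86.29),
}

# Mean column as reported in the table.
reported_mean = {
    "Command R+": 35.30,
    "Command R+ + External Knowledge": 35.60,
    "Mistral-7B": 33.37,
    "Mistral-7B + External Knowledge": 33.63,
}


def mean_score(row):
    """Unweighted average of the four metrics (assumed definition of Mean)."""
    return sum(row) / len(row)


means = {name: mean_score(row) for name, row in scores.items()}

# Gain from adding external knowledge, in mean-score points.
gain_command = means["Command R+ + External Knowledge"] - means["Command R+"]
gain_mistral = means["Mistral-7B + External Knowledge"] - means["Mistral-7B"]

for name, value in means.items():
    print(f"{name}: recomputed {value:.4f}, reported {reported_mean[name]:.2f}")
print(f"gain (Command R+): {gain_command:.4f}")
print(f"gain (Mistral-7B): {gain_mistral:.4f}")
```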

Ablation Study

| Evaluation Method | Spearman \(\rho\) with Human Judgment |
| --- | --- |
| Claude 3.5 Sonnet (LLM-Judge) | 0.82 (very strong correlation) |
| Prometheus (LLM-Judge) | Strong correlation |
| JudgeLM (LLM-Judge) | Strong correlation |
| BLEU/ROUGE/METEOR/BERTScore | Weak correlation |

Key Findings

- Three-quarters of the evaluators (humans and LLM judges combined) agreed that the counter-arguments generated by Command R+ + External Knowledge had the highest quality.
- External knowledge yielded the largest improvements in Relatedness, Persuasiveness, and Factuality.
- Command R+ + External Knowledge effectively used the external evidence in 82% of cases (cosine similarity \(> 0.70\) between the output and the evidence), compared to 51% for Mistral-7B.
- All evaluators consistently rated the human-written gold-standard counter-arguments the lowest: LLM-generated counter-arguments outperformed them across multiple dimensions.
- On sensitive topics (religion, politics, etc.), LLMs tended to give more generalized responses instead of directly applying factual evidence, yet these responses paradoxically received higher ratings.
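
The evidence-utilization figure is a thresholded cosine similarity, which can be sketched as follows. This is an illustration under loud assumptions: the vectors here are simple bag-of-words counts, whereas the paper most likely compares embedding vectors, and the function names are hypothetical. Only the thresholding logic (count a case as "utilized" when similarity exceeds 0.70) mirrors the finding above.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts (a stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def utilization_rate(pairs: List[Tuple[str, str]], threshold: float = 0.70) -> float:
    """Fraction of (counter_argument, evidence) pairs above the threshold."""
    if not pairs:
        return 0.0
    hits = sum(
        1
        for output, evidence in pairs
        if cosine_similarity(bag_of_words(output), bag_of_words(evidence)) > threshold
    )
    return hits / len(pairs)
```

With real embeddings, only `bag_of_words` and the vector type would change; the threshold and the rate computation stay the same.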

Highlights & Insights

- First to introduce dynamic web retrieval into counter-argument generation, overcoming the limitations of static knowledge bases.
- LLM-as-a-Judge aligns closely with human judgment in counter-argument evaluation (\(\rho=0.82\)), providing a reliable tool for large-scale automatic evaluation.
- An intriguing finding is that LLM-generated counter-arguments were rated above the human-written gold standards across multiple dimensions, suggesting that LLMs may already possess superhuman capabilities in the domain of argumentation.

Limitations & Future Work

- Only two LLMs (Command R+ and Mistral-7B) were tested, limiting coverage.
- The study is restricted to English and lacks multilingual validation.
- LLM-generated counter-arguments may be affected by training data contamination, since the experimental topics might overlap with the models' training data.
- Human evaluation covered only 75 samples, a limited scale.

vs Hua et al. (2019): The latter only utilized Wikipedia and news databases as static external sources, whereas this work extends to dynamic web-wide retrieval.
vs Lin et al. (2023): The latter operated on sentence-level counter-argument generation, while this work argues that the sentence-level is insufficient to study argumentative complexity.
vs Chen et al. (2024): The latter evaluated LLM performance across multiple argumentation tasks but did not integrate external knowledge.

Rating

Novelty: ⭐⭐⭐ Dynamic knowledge retrieval + counter-argument generation combination has some novelty, though the overall framework is relatively straightforward.
Experimental Thoroughness: ⭐⭐⭐⭐ Comparison of multiple evaluation methods (human + LLM-Judge + reference-based metrics) is provided, though the variety of tested models is somewhat small.
Writing Quality: ⭐⭐⭐⭐ Clear structure, well-defined research questions, and in-depth analysis.
Value: ⭐⭐⭐ Validating LLM-as-a-Judge in argument evaluation has practical value, though the methodology itself has limited technical innovation.