Argument Mining in the Age of Large Language Models

Conference: ACL 2025
Code: None
Area: LLM/NLP
Keywords: Argument Mining, Large Language Models, Argument Structure Identification, Stance Detection, Argument Quality Assessment

TL;DR

This paper systematically investigates the current status and challenges of Argument Mining (AM) tasks in the era of Large Language Models. Through comprehensive experiments, it evaluates the performance of LLMs on subtasks such as argument component identification, argument relation classification, and argument quality assessment, proposes targeted improvement strategies, and reveals the advantages and limitations of LLMs in structured argument understanding.

Background & Motivation

Background: Argument Mining (AM) is a crucial research direction in NLP, aiming to automatically identify and extract argument structures from unstructured text, which includes subtasks such as argument component (claim, premise) identification, argument relation (support, attack) classification, and argument quality assessment. Traditional methods primarily rely on feature engineering-based machine learning models or small-scale pre-trained language models (such as BERT, RoBERTa).

Limitations of Prior Work: Traditional argument mining methods face multiple challenges: first, annotated data is scarce, and high-quality argument annotation requires domain expertise, making annotation consistency difficult to guarantee; second, the complexity and domain specificity of argument structures lead to poor cross-domain generalization; third, existing models struggle to capture long-range argument dependencies and implicit argumentative logic.

Key Challenge: Argument mining requires deep semantic understanding and logical reasoning capabilities, while the parameter scale and pre-training objectives of traditional PLMs limit their performance on these high-level semantic tasks. Although LLMs have demonstrated strong capabilities on general NLP tasks, their applicability to the fine-grained task of structured argument analysis has not yet been systematically investigated.

Goal: To comprehensively evaluate the capability boundaries of LLMs across various argument mining subtasks, analyze the zero-shot and few-shot performance of LLMs, compare the effectiveness of different prompting strategies, and explore the complementarity between LLMs and traditional methods.

Key Insight: The authors decompose the argument mining task into multiple subtasks for individual evaluation, while designing various prompt engineering strategies (such as chain-of-thought and task decomposition) to activate the argumentative understanding capabilities of LLMs.

Core Idea: To reveal the capability hierarchy of LLMs in argument mining through systematic benchmark experiments, and to propose task decomposition and structured prompting methods that address the weaknesses of LLMs in fine-grained argument analysis.

Method

Overall Architecture

This paper constructs a comprehensive evaluation framework covering the primary subtasks of argument mining: (1) Argument Component Detection, which identifies claims and premises in text; (2) Argument Relation Classification, which determines support/attack relations between argument components; (3) Argument Quality Assessment, which evaluates dimensions such as persuasiveness and logical coherence; and (4) Stance Detection, which identifies the support/oppose stance towards a specific topic. The input consists of raw texts or argument pairs, and the output consists of corresponding structural labels, relation labels, or quality scores.

Key Designs

Multi-Level Prompt Engineering Strategies:
- Function: To design prompt templates adapted to each argument mining subtask.
- Mechanism: Four types of prompt strategies are designed for different subtasks: Direct Prompting directly asks LLMs for classification; Definition-Augmented Prompting adds formal definitions of argumentative concepts into the prompt; Chain-of-Thought Prompting guides LLMs to analyze the argumentative structure before making a judgment; and Task Decomposition Prompting decomposes complex end-to-end tasks into multi-step subtasks. The experiments find that CoT and task decomposition strategies perform best on complex tasks like relation classification.
- Design Motivation: Argument mining involves multi-level semantic understanding, and subtasks of different complexities may require different depths of reasoning guidance.
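
The four strategies can be pictured as four ways of wrapping the same input pair. The sketch below is illustrative only: the template wording, the `DEFINITIONS` text, and all function names are assumptions, since the paper's actual prompts are not reproduced in this summary.

```python
from typing import List

# Hypothetical definitions; the wording used in the paper is not given here.
DEFINITIONS = (
    "A claim is a statement that asserts a position. "
    "A premise is a reason offered for or against a claim."
)

QUESTION = "Does A support or attack B? Answer 'support' or 'attack'."


def _pair(a: str, b: str) -> str:
    """Render the two argument components that every strategy shares."""
    return f"Argument A: {a}\nArgument B: {b}\n"


def direct_prompt(a: str, b: str) -> str:
    """Direct prompting: ask for the label with no extra guidance."""
    return _pair(a, b) + QUESTION


def definition_prompt(a: str, b: str) -> str:
    """Definition-augmented: prepend formal definitions of the concepts."""
    return DEFINITIONS + "\n\n" + direct_prompt(a, b)


def cot_prompt(a: str, b: str) -> str:
    """Chain-of-thought: request an analysis before the final label."""
    return (
        _pair(a, b)
        + "First explain what each argument asserts and how they relate. "
        + "Then give the final label on its own line. "
        + QUESTION
    )


def decomposition_prompts(a: str, b: str) -> List[str]:
    """Task decomposition: one prompt per sub-step, sent in order."""
    return [
        f"Step 1. State the main point of this text: {a}",
        f"Step 2. State the main point of this text: {b}",
        "Step 3. Call the two points above A and B. " + QUESTION,
    ]
```

Only the decomposition variant returns several prompts, which reflects the extra round trips that strategy implies.
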
Cross-Domain Generalization Evaluation Framework:
- Function: To test the transferability of LLMs across different argumentative domains.
- Mechanism: Multiple AM datasets from various domains are selected, including Persuasive Essays, Online Debates, Scientific Articles, and Legal Texts. The performance of LLMs trained/prompted in a source domain is evaluated on a target domain under zero-shot and few-shot settings. A key finding is that the cross-domain generalization capability of LLMs significantly outperforms fine-tuned BERT-base models, especially under low-resource scenarios.
- Design Motivation: Cross-domain generalization is a core challenge in argument mining, and the large-scale pre-trained knowledge of LLMs theoretically should facilitate domain transfer.
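
A cross-domain evaluation of this kind amounts to scoring one fixed predictor on several target-domain test sets with Macro-F1. The following is a minimal sketch under that reading; `cross_domain_scores`, the `predict` callable, and the dataset layout are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, gold label); hypothetical layout


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def cross_domain_scores(
    predict: Callable[[str], str],
    datasets: Dict[str, List[Example]],
) -> Dict[str, float]:
    """Score one already-configured predictor on every target domain."""
    results = {}
    for domain, examples in datasets.items():
        gold = [label for _, label in examples]
        pred = [predict(text) for text, _ in examples]
        results[domain] = macro_f1(gold, pred)
    return results
```

Because Macro-F1 averages over classes rather than examples, a predictor that ignores a rare class is penalised even when its accuracy looks high.
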
Fusion Strategies for LLMs and Expert Models:
- Function: To explore the feasibility of using LLM outputs as features to enhance traditional models.
- Mechanism: LLM predictions and LLM-generated argument analyses are used as extra features for downstream classifiers. Specifically, the LLM generates a structured analysis (e.g., argument type, logical strength, potential counterarguments) for each argument component. An encoder turns these analyses into feature vectors, which are concatenated with the raw text features and fed into the classifier. In the ablation study, fusion raises the quality assessment Spearman correlation from 0.68 (task decomposition alone) to 0.72.
- Design Motivation: LLMs excel at high-level semantic understanding and commonsense reasoning, while expert models excel at fine-grained feature extraction. Combining the two complementary strengths can achieve better comprehensive performance.
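
The concatenation step can be sketched as follows. This is a toy illustration: `hashed_bow` is a hashed bag-of-words stand-in for whatever encoder the paper uses (not specified in this summary), and `fused_features` is a hypothetical name.

```python
import hashlib
from typing import List


def hashed_bow(text: str, dim: int = 16) -> List[float]:
    """Toy stand-in for a text encoder: hashed bag-of-words counts."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def fused_features(raw_text: str, llm_analysis: str, dim: int = 16) -> List[float]:
    """Concatenate raw-text features with features of the LLM-written analysis."""
    return hashed_bow(raw_text, dim) + hashed_bow(llm_analysis, dim)


# Assumed example inputs; the analysis string imitates an LLM's structured output.
features = fused_features(
    "Cars should be banned downtown because they pollute.",
    "type: claim with premise; logical strength: moderate; counterargument: access",
)
```

The downstream classifier then sees a vector twice as wide as either source alone, so it can weigh the LLM's analysis against the raw text.
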

Loss & Training

For fine-tuning experiments, standard cross-entropy loss is used for classification. In the few-shot setting, in-context learning is used without parameter updates. In the fusion strategy, the downstream classifiers use a weighted cross-entropy loss to address class imbalance, with class weights adjusted dynamically according to each class's share of the samples.
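
The exact weighting formula is not stated in this summary. One common choice, assumed here purely for illustration, is inverse class frequency; both function names below are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class by n / (k * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_cross_entropy(
    probs: List[Dict[str, float]],
    gold: Sequence[str],
    weights: Dict[str, float],
) -> float:
    """Weighted average of -log p(gold label), normalised by total weight."""
    total = sum(weights[g] * -math.log(p[g]) for p, g in zip(probs, gold))
    return total / sum(weights[g] for g in gold)


# Assumed toy data: three "support" examples for every "attack" example.
train_labels = ["support", "support", "support", "attack"]
class_weights = inverse_frequency_weights(train_labels)
```

With these weights, a mistake on the minority "attack" class contributes three times as much to the loss as a mistake on "support".
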

Key Experimental Results

Main Results

Task/Dataset	Metric	GPT-4 (0-shot)	GPT-4 (few-shot)	RoBERTa-FT	Gain/Gap (GPT-4 few-shot vs RoBERTa-FT)
Component Detection (PE)	Macro-F1	72.3	78.6	82.1	-3.5 vs FT
Relation Classification (PE)	Macro-F1	64.8	71.2	68.5	+2.7 vs FT
Quality Assessment (GAQCorpus)	Spearman	0.61	0.68	0.58	+0.10 vs FT
Stance Detection (VAST)	Macro-F1	67.5	73.1	70.8	+2.3 vs FT
Cross-Domain Component Detection	Macro-F1	65.2	69.8	54.3	+15.5 vs FT
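
The Gain/Gap column is the few-shot GPT-4 score minus the fine-tuned RoBERTa score. The short check below recomputes it from the table values; the variable and function names are illustrative.

```python
# (GPT-4 few-shot, RoBERTa-FT) pairs copied from the table above.
ROWS = {
    "Component Detection (PE)": (78.6, 82.1),
    "Relation Classification (PE)": (71.2, 68.5),
    "Quality Assessment (GAQCorpus)": (0.68, 0.58),
    "Stance Detection (VAST)": (73.1, 70.8),
    "Cross-Domain Component Detection": (69.8, 54.3),
}


def gap(few_shot: float, fine_tuned: float) -> float:
    """Few-shot score minus fine-tuned score, rounded to two decimals."""
    return round(few_shot - fine_tuned, 2)


GAPS = {task: gap(*scores) for task, scores in ROWS.items()}
```

Positive values mean few-shot GPT-4 is ahead; component detection on PE is the only row where the fine-tuned model leads.
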

Ablation Study

Prompt Strategy	Relation Classification F1	Quality Assessment Spearman	Description
Direct	58.3	0.52	Basic direct prompting
+ Definition	62.1	0.57	Added concept definitions, +3.8 F1 vs Direct
+ CoT	68.5	0.64	Chain-of-thought, +10.2 F1 vs Direct
+ Task Decomp	71.2	0.68	Task decomposition, best prompting strategy
Fusion (LLM+RoBERTa)	74.6	0.72	Fusion strategy, overall optimal
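
For reference, the Spearman column measures how well predicted quality scores preserve the ranking of the reference scores. A minimal sketch of the computation, using average ranks for ties, is given below; the function names and the example scores are assumptions.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Assumed toy scores: two adjacent pairs are swapped by the model.
human = [1, 2, 3, 4, 5]
model = [2, 1, 4, 3, 5]
rho = spearman(human, model)
```

Because only ranks matter, a model whose scores are on a different scale from the human ratings can still reach a high correlation.
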

Key Findings

- LLMs outperform fine-tuned small models on tasks requiring deep reasoning, such as argument relation classification and quality assessment. However, a gap remains in sequence labeling tasks like component detection, suggesting that LLMs are better at understanding "why" than at precisely locating "where".
- The advantage of LLMs is most pronounced in cross-domain scenarios (+15.5 F1), supporting the view that large-scale pre-trained knowledge plays a critical role in domain transfer.
- CoT and task decomposition strategies yield stable gains, indicating that argument understanding is a complex process that benefits from step-by-step reasoning.
- On argumentative texts from the legal and scientific domains, LLM performance is relatively weaker, likely because the argumentative patterns in these fields differ significantly from general pre-training data.

Highlights & Insights

- The task decomposition prompting strategy breaks end-to-end argument analysis into the pipeline steps "identify arguments \(\rightarrow\) analyze relations \(\rightarrow\) evaluate quality", reducing the reasoning difficulty at each stage. This paradigm can be extended to other structured information extraction tasks.
- The fusion scheme of LLMs and expert models is highly practical: the LLM acts as a "high-level analyst" and the small model as a "precise executor." This division of labor has strong application value in industrial settings.
- The strong performance of LLMs on argument quality assessment reveals an interesting phenomenon: the "commonsense" and "logical intuition" required to evaluate argument quality align well with the strengths LLMs accumulate during pre-training.
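
The three-stage pipeline can be sketched as a chain of narrower calls in which each stage consumes the previous stage's output. Everything here is hypothetical: the `LLM` callable interface, the prompt wording, and `fake_llm`, a canned stand-in that lets the sketch run without any model.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def run_pipeline(text: str, llm: LLM) -> Dict[str, object]:
    """Chain three narrower LLM calls, feeding each output forward."""
    # Stage 1: identify argument components, one per line.
    raw = llm("IDENTIFY components (one per line):\n" + text)
    components: List[str] = [ln.strip() for ln in raw.splitlines() if ln.strip()]

    # Stage 2: analyze relations among the components found in stage 1.
    relations = llm(
        "RELATE (support/attack) these components:\n" + "\n".join(components)
    )

    # Stage 3: evaluate quality given both components and relations.
    quality = llm(
        "EVALUATE quality from 1 to 5 given:\n"
        + "\n".join(components)
        + "\n"
        + relations
    )
    return {"components": components, "relations": relations, "quality": quality}


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a model, keyed on the stage keyword."""
    if prompt.startswith("IDENTIFY"):
        return "Cars should be banned downtown\nCars pollute the air"
    if prompt.startswith("RELATE"):
        return "2 supports 1"
    return "4"


result = run_pipeline("Cars should be banned downtown. They pollute the air.", fake_llm)
```

Each stage sees a smaller problem than the end-to-end task, at the cost of errors in an early stage propagating to later ones.
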

Limitations & Future Work

- The experiments are predominantly based on English datasets; the performance of argument mining in multilingual scenarios remains to be validated.
- The current evaluation framework primarily focuses on single-turn argument analysis, with insufficient modeling of dynamic argumentative processes in multi-turn dialogues.
- The relative disadvantage of LLMs in sequence-labeling subtasks (e.g., component boundary detection) indicates the need to develop specialized token-level adaptation strategies.
- The fusion scheme increases inference costs. How to reduce computational overhead while maintaining performance is a key challenge for practical deployment.

- vs BERT/RoBERTa Fine-Tuning: Traditional methods yield better component detection performance when data is abundant, but suffer from poor cross-domain generalization. LLMs exhibit distinct advantages in low-resource and cross-domain scenarios.
- vs ChatGPT Argument Mining (Ruiz-Dolz et al.): Prior studies only evaluated the zero-shot performance of ChatGPT. In contrast, this paper provides a more comprehensive evaluation, covering various LLMs, multiple prompting strategies, and fusion schemes.
- vs Prompt-based AM: Previous prompt-based approaches were mainly designed for single subtasks, whereas the unified evaluation framework and task decomposition strategies proposed in this paper offer better generalizability.

Rating

Novelty: ⭐⭐⭐ Systematic evaluation work, with limited methodological innovation but comprehensive experiments
Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple datasets and subtasks, with exhaustive ablation studies
Writing Quality: ⭐⭐⭐⭐ Clearly structured and deeply analyzed
Value: ⭐⭐⭐⭐ Provides a baseline reference and directional guidance for the argument mining domain in the LLM era