Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments¶

Conference: ACL 2025
arXiv: 2505.22137
Code: Limited-Generalizability
Area: Other
Keywords: argument mining, generalization, shortcut learning, cross-dataset evaluation, BERT transformers

TL;DR¶

This paper presents the first large-scale cross-dataset generalization evaluation of four Transformer models across 17 English sentence-level argument mining datasets. The findings reveal that state-of-the-art models primarily learn dataset-specific lexical patterns rather than the structural signals of arguments, leading to generalization performance far below their in-dataset baselines. However, task-specific pre-training and multi-dataset joint training can partially alleviate this issue.

Background & Motivation¶

Argument mining is a core foundational task in automatic discourse analysis, aiming to identify argumentative structures (e.g., claims and premises) from natural language. The field faces a long-standing but rarely systematically verified concern:

High in-dataset scores can be misleading: BERT-like models reach 0.67–0.96 macro F1 on individual benchmarks, which encourages the optimistic assumption that they are broadly applicable.

Arguments should, in principle, transfer across domains: The core of an argument lies in its logical structure (e.g., "X should Y because Z") rather than its specific content. A model that has learned this structure should therefore generalize across datasets.

Concerns over shortcut learning: BERT-based models are known to focus heavily on basic grammar, nouns, and coreference relations. They might capture dataset-specific lexical cues rather than genuine argumentative signals.

Inconsistent definitions: Different datasets employ varying definitions of "what constitutes an argument" (e.g., claim-based, evidence-based, reasoning-based), further exacerbating the difficulty of generalization.

Rather than proposing a new model or formalization, the authors adopt a data-driven approach to answer: Do existing state-of-the-art models learn "arguments" or merely "datasets"?

Method¶

Overall Architecture¶

The study is designed around three research questions:
- Q1: To what extent are existing benchmark datasets comparable?
- Q2: Can state-of-the-art models generalize to other datasets?
- Q3: Do these models learn generalizable concepts of arguments?

These questions are addressed through three types of experiments: pairwise transfer experiments, joint training experiments, and controlled input manipulation experiments.

Key Designs¶

Dataset Selection and Standardization
- From 52 argument mining datasets published between 2008 and 2024, datasets were filtered based on three criteria: sentence-level annotation, binary labels (argument/non-argument), and reproducibility.
- After two rounds of screening, 17 datasets were retained, covering approximately 345K annotated sentences.
- A unified 60/20/20 stratified split was applied, ensuring a minimum of 850 samples per label.
- Design Motivation: To ensure experimental scale and statistical reliability.
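The 60/20/20 stratified split and the per-label minimum can be sketched in a few lines. This is an illustration under assumptions, not the authors' preprocessing code: the function names, the fixed seed, and the reading of the 850-sample threshold as a per-label count over a whole dataset are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, dev, test) index lists that preserve label proportions."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * fractions[0])
        n_dev = round(len(idxs) * fractions[1])
        train += idxs[:n_train]
        dev += idxs[n_train:n_train + n_dev]
        test += idxs[n_train + n_dev:]  # remainder goes to the test split
    return train, dev, test

def has_min_per_label(labels, minimum=850):
    """Check that both binary labels occur at least `minimum` times (assumed reading)."""
    counts = Counter(labels)
    return all(counts.get(label, 0) >= minimum for label in (0, 1))
```

Splitting each label separately is what makes the split stratified: every partition keeps roughly the same argument/non-argument ratio as the full dataset.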
Pairwise Transfer Experiments (Answering Q2)
- Models are trained on one dataset and tested on all 17 datasets.
- This generates a \(17 \times 17\) transfer matrix (one for each model).
- The diagonal represents baseline (in-dataset) performance, while the off-diagonal represents generalization performance.
- Design Motivation: To systematically quantify the transfer capability between each pair of datasets.
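Reading the transfer matrix can be made concrete with a small helper that separates diagonal (in-dataset) from off-diagonal (cross-dataset) scores. The sketch below uses a made-up \(3 \times 3\) matrix purely for illustration; the function name and the toy values are hypothetical and are not results from the paper.

```python
def summarize_transfer(matrix):
    """Summarize a square matrix where matrix[i][j] is the macro F1 obtained
    by training on dataset i and testing on dataset j."""
    n = len(matrix)
    diagonal = [matrix[i][i] for i in range(n)]            # in-dataset baselines
    off_diagonal = [matrix[i][j]                           # cross-dataset transfers
                    for i in range(n) for j in range(n) if i != j]
    baseline_mean = sum(diagonal) / len(diagonal)
    transfer_mean = sum(off_diagonal) / len(off_diagonal)
    share_below = sum(s < baseline_mean for s in off_diagonal) / len(off_diagonal)
    return {
        "baseline_mean": baseline_mean,
        "transfer_mean": transfer_mean,
        "share_below_baseline": share_below,
    }

# Toy values, invented for illustration only.
toy_matrix = [
    [0.9, 0.6, 0.5],
    [0.7, 0.8, 0.6],
    [0.5, 0.6, 0.7],
]
summary = summarize_transfer(toy_matrix)
```

For 17 datasets the same helper would average 17 diagonal cells and 272 off-diagonal cells per model.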
Joint Training Experiments (Supplementing Q2)
- Models are trained jointly on 16 datasets and tested on the remaining 1 dataset (Leave-One-Out).
- Results are compared against individual baseline performances.
- Design Motivation: To test whether heterogeneous data can improve generalization.
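The leave-one-out protocol amounts to one fold per dataset. A minimal sketch follows, assuming datasets are identified by name; the generator name is hypothetical, and the six names used in the example are simply the ones that appear in the results table of this summary.

```python
def leave_one_out(dataset_names):
    """Yield (training_names, held_out_name) pairs, one fold per dataset."""
    for held_out in dataset_names:
        training = [name for name in dataset_names if name != held_out]
        yield training, held_out

# Example with the six dataset names listed in the results table.
names = ["ACQUA", "ABSTRCT", "CE", "UKP", "TACO", "AEC"]
folds = list(leave_one_out(names))
```

With all 17 datasets this produces 17 folds, each training on the pooled data of the other 16.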
Controlled Input Manipulation Experiments (Answering Q3)
- Stop words, function words, discourse markers, and punctuation are systematically removed.
- This eliminates approximately half of the words in a sentence, leaving only topical content words.
- Model performance is compared before and after this removal.
- Design Motivation: If performance does not drop after removing structural argumentative cues (such as "because" or "therefore"), it indicates that the model does not rely on these signals.
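The manipulation can be pictured as a token filter applied to each sentence before inference. The sketch below is only an approximation: the removal list is a tiny hand-picked sample, and the lowercasing and regex tokenization are assumptions, not the lexicon or tokenizer used in the paper.

```python
import re

# Illustrative, deliberately tiny removal list (an assumption, not the paper's lexicon).
REMOVABLE = {
    "a", "an", "the", "of", "to", "in", "on", "for", "is", "are", "be",
    "we", "they", "it", "this", "that", "and", "or", "but", "should",
    "because", "therefore", "thus", "since", "however", "so",
}

def strip_structural_cues(sentence: str) -> str:
    """Drop punctuation and every token in REMOVABLE, keeping topical content words."""
    # The regex keeps only alphanumeric tokens, so punctuation is discarded here.
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return " ".join(tok for tok in tokens if tok not in REMOVABLE)

example = strip_structural_cues("We should ban cars because they pollute the air.")
```

In this example five of nine words are removed, in line with the observation that roughly half of each sentence disappears while the topic words remain.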
Model Selection
- BERT, RoBERTa, DistilBERT: Standard NLP baselines.
- WRAP: The only Transformer pre-trained via contrastive learning to enhance argumentative generalization.
- A standard GLUE hyperparameter grid was used (batch = 32, epochs = 3, lr = 2e-5 to 5e-5).

Loss & Training¶

Standard classification cross-entropy loss.
The target metric is macro F1, which weights both labels equally regardless of their frequency.
Each experiment was repeated 3 times, with significance analyzed using repeated-measures ANOVA and paired t-tests.
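Macro F1 is the unweighted mean of the per-label F1 scores, which is why a minority label counts as much as the majority one. The following is a plain-Python sketch of that definition for the binary case; the function names are hypothetical, and an established library implementation would normally be used in practice.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one label, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct predictions for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-label F1 scores."""
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)
```

A classifier that always predicts the majority label scores 0 on the other label, so its macro F1 stays low even when its accuracy is high.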

Key Experimental Results¶

Main Results (Pairwise Transfer vs. Baseline Performance, macro F1)¶

Statistic	WRAP	BERT	RoBERTa	DistilBERT
Baseline Mean	0.79	0.79	0.79	0.79
Transfer Mean	0.61	0.58	0.57	0.56
Transfer SD	0.10	0.11	0.12	0.11
Best Performer Share	46%	20%	17%	17%

Of the transfer experiments, 97% scored below the baseline mean (0.79), and 62% fell below 0.65. WRAP generalized best, with the highest transfer mean (0.61) and the top score in 46% of the experiments.

Joint Training Experiments (Leave-One-Out, compared with SOTA)¶

Dataset	WRAP	BERT	RoBERTa	DistilBERT	SOTA	\(\Delta_{\max}\)
ACQUA	0.66	0.60	0.59	0.59	0.84	0.18
ABSTRCT	0.74	0.74	0.74	0.71	0.89	0.15
CE	0.77	0.72	0.76	0.72	0.85	0.08
UKP	0.70	0.67	0.70	0.68	0.79	0.09
TACO	0.76	0.61	0.65	0.55	0.88	0.12
AEC	0.52	0.57	0.51	0.56	0.96	0.39

Joint training raised the overall mean to 0.63–0.66 (compared with pairwise transfer means of 0.56–0.61), but a substantial gap to individual baselines remained (average \(\Delta_{\max} = 0.12\)).
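The \(\Delta_{\max}\) column is consistent with taking the SOTA score minus the best of the four leave-one-out scores in each row. That definition is inferred from the tabulated numbers rather than stated in this summary, so the check below should be read as a sanity test of that assumption using only the six rows shown.

```python
# Rows copied from the table: (WRAP, BERT, RoBERTa, DistilBERT), SOTA, reported delta.
ROWS = {
    "ACQUA":   ((0.66, 0.60, 0.59, 0.59), 0.84, 0.18),
    "ABSTRCT": ((0.74, 0.74, 0.74, 0.71), 0.89, 0.15),
    "CE":      ((0.77, 0.72, 0.76, 0.72), 0.85, 0.08),
    "UKP":     ((0.70, 0.67, 0.70, 0.68), 0.79, 0.09),
    "TACO":    ((0.76, 0.61, 0.65, 0.55), 0.88, 0.12),
    "AEC":     ((0.52, 0.57, 0.51, 0.56), 0.96, 0.39),
}

def delta_max(model_scores, sota):
    """Assumed definition: gap between SOTA and the best jointly trained model."""
    return round(sota - max(model_scores), 2)

computed = {name: delta_max(scores, sota) for name, (scores, sota, _) in ROWS.items()}
```

Every computed value matches the reported column, which supports reading \(\Delta_{\max}\) as the smallest remaining gap to SOTA for each held-out dataset.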

Key Findings¶

Generalization is the exception rather than the rule: Only a minority of pairwise transfers achieved good generalization (\(\ge 0.75\)), primarily occurring between datasets of the same domain or definition type.
WRAP consistently outperforms standard models: Task-specific pre-training significantly aids generalization, with WRAP performing best in 46% of the experiments.
Strong evidence of shortcut learning:
- BERT, RoBERTa, and DistilBERT showed almost unchanged performance (\(\Delta \le 0.02\)) after the removal of argumentative structure words, suggesting that their predictions barely depend on these signals.
- WRAP exhibited the largest performance drop (\(\Delta = 0.05\)), indicating that it indeed captured some structural argumentative signals.
The cautionary tale of AEC: The AEC dataset, which defines arguments based on only 5 keywords, achieved the highest baseline (0.96) but the worst generalization (at most 0.63), and its performance dropped by up to 0.45 once the keywords were removed.
Definitional divergence is an inherent limitation: Definitions of arguments (claim-based, evidence-based, reasoning-based) across different datasets overlap but are not equivalent, making cross-definition transfer fundamentally challenging.
Statistical significance: Only the advantages of WRAP and the performance degradation after manipulation passed paired t-tests (\(p < 0.05\)).
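A paired t-test compares two conditions on the same set of experiments by testing whether the mean of the per-experiment differences departs from zero. The sketch below computes only the test statistic and degrees of freedom with the standard library; the function name is hypothetical, the scores are invented for illustration, and nothing here reproduces the paper's analysis.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev uses the n - 1 denominator
    return mean(diffs) / standard_error, n - 1

# Invented macro F1 scores for two models on the same four transfer settings.
model_a = [0.62, 0.58, 0.65, 0.60]
model_b = [0.60, 0.55, 0.61, 0.59]
t_value, dof = paired_t_statistic(model_a, model_b)
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide; a statistics package would be needed for that step.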

Highlights & Insights¶

The conclusion of "learning datasets instead of arguments" is powerful: Systematic evidence is provided through the \(17 \times 17\) transfer matrix and controlled manipulations.
Rigor in experimental design: The study serves as a methodological exemplar, employing repeated trials, ANOVA, Greenhouse-Geisser corrections, and effect size reporting.
A sober critique of the field: Rather than proposing a marginally better method, the paper exposes a pervasive but previously unverified issue in the community.
Implications of joint training: Although it does not completely resolve the gap, using heterogeneous data indeed helps improve generalizability.

Limitations & Future Work¶

The study only considers BERT-family models, omitting larger language models (e.g., GPT-4, LLaMA) or prompt-based approaches.
The manipulation experiments only deleted tokens (stop words, function words, discourse markers, and punctuation), without exploring other fine-grained interventions (e.g., replacing structural argumentative words, or preserving structure while altering content).
The 17 datasets only cover English, leaving cross-lingual generalization unexplored.
No solution is proposed (e.g., designing better pre-training objectives to enhance the learning of argumentative signals).
The binary (argument vs. non-argument) granularity is relatively coarse, without targeting the generalization of argument component identification (e.g., claim vs. premise).

WRAP (Feger & Dietze 2024) is the only prior work to explore pre-training for argumentative generalization, and this study confirms the validity of its direction.
The issue of "spurious optimism" driven by benchmarks raised by Saphra et al. (2024) is concretely validated here within the domain of argument mining.
Insights:
- Do similar generalization illusions exist in other NLP subtasks (e.g., sentiment analysis, stance detection)?
- Can we design "argument-invariant" pre-training objectives (e.g., adversarial content substitution while keeping the argumentative label unchanged)?

Rating¶

Novelty: ⭐⭐⭐⭐ — Presents the first large-scale systematic evaluation of generalization in argument mining with highly precise research questions.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Features 17 datasets, 4 models, three types of experiments (pairwise, joint, manipulated), and rigorous statistical testing.
Writing Quality: ⭐⭐⭐⭐ — Well-organized around Q1–Q3; the dataset overview is highly informative, though some statistical details are quite dense.
Value: ⭐⭐⭐⭐ — Serves as an important wake-up call for the argument mining community, with implications for other NLP domains.