Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks¶

Conference: ACL 2025
arXiv: 2503.00024
Code: Available
Area: Computational Argumentation / Emotion and Persuasion
Keywords: argument convincingness, emotional appeal, manipulation check, LLM bias, multilingual analysis

TL;DR¶

This paper proposes a dynamic framework inspired by psychological manipulation checks, utilizing LLMs to modulate the emotional intensity of arguments and systematically investigate the causal impact of emotion on argument convincingness. The findings reveal that in more than half of the cases, human judgments of convincingness are unaffected by emotional changes; when emotion does have an effect, it is more likely to enhance rather than diminish convincingness.

Background & Motivation¶

Background: Emotional appeal (pathos) is one of Aristotle's three pillars of rhetoric. However, the NLP community has understudied the relationship between emotion and argument convincingness, often reducing emotion to a logical fallacy.

Limitations of Prior Work: Existing studies mostly rely on static analysis—comparing the convincingness of fixed argument pairs—which lacks control over confounding variables. Furthermore, they are often restricted to a single language or domain.

Key Challenge: Observational studies cannot disentangle the causal effects of emotion from other confounding factors, leading to unreliable conclusions.

Goal: To quantify the dynamic impact of emotional intensity changes on argument convincingness while controlling for confounding variables.

Key Insight: This work draws on the "manipulation check" paradigm from psychology, treating emotional intensity as the manipulated variable and convincingness as the dependent variable. Pairwise comparisons are conducted using emotional escalation/de-escalation versions generated by LLMs.

Core Idea: Utilizing LLMs to rewrite arguments to systematically modulate emotional intensity, and dynamically observing changes in convincingness through anchored pairwise comparisons.

Method¶

Overall Architecture¶

For each original argument pair \((E, N)\) (where \(E\) is emotional and \(N\) is neutral), GPT-4o is used to generate three variant pairs:

- \((G^-(E), N)\): De-escalating the emotion of \(E\)
- \((E, G^+(N))\): Escalating the emotion of \(N\)
- \((G^-(E), G^+(N))\): Bidirectional modulation

The changes in convincingness ranking between the variant pairs and the original pairs are compared to determine the type of emotional influence (consistent, positive, or negative).
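
The pairing scheme above can be sketched as follows. This is a minimal illustration rather than the authors' code: `build_variant_pairs`, `stub_rewrite`, and the `rewrite` callable are hypothetical names, and the stub only tags the text so the example runs without any LLM call.

```python
from typing import Callable, Dict, Tuple

Pair = Tuple[str, str]


def build_variant_pairs(
    emotional: str,
    neutral: str,
    rewrite: Callable[[str, str], str],
) -> Dict[str, Pair]:
    """Return the anchor pair and the three variant pairs.

    `rewrite(text, direction)` is a hypothetical stand-in for the LLM
    rewriting step; direction is "down" (G^-) or "up" (G^+).
    """
    e_down = rewrite(emotional, "down")  # G^-(E)
    n_up = rewrite(neutral, "up")        # G^+(N)
    return {
        "anchor": (emotional, neutral),    # (E, N)
        "deescalate_E": (e_down, neutral),  # (G^-(E), N)
        "escalate_N": (emotional, n_up),    # (E, G^+(N))
        "bidirectional": (e_down, n_up),    # (G^-(E), G^+(N))
    }


def stub_rewrite(text: str, direction: str) -> str:
    # Placeholder only: a real setup would call an LLM here.
    return f"[{direction}] {text}"


pairs = build_variant_pairs(
    "This policy is an outrage!",
    "This policy raises costs.",
    stub_rewrite,
)
```

Keeping the rewriter as a parameter means the same pairing logic can be reused with any model or with human-written variants.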

Key Designs¶

LLM-based Generation of Emotional Variants: Using GPT-4o for zero-shot argument rewriting to regulate emotional intensity while preserving the core meaning. Human evaluation indicates a high average content similarity of 4.5/5.
Anchored Pairwise Comparison: The original pair \((E, N)\) serves as the anchor. The evaluation focuses on changes in ranking rather than absolute convincingness scores, reducing noise from annotator subjective preferences.
Three-Way Outcome Classification: Consistent (no change in ranking), Positive (higher emotional intensity \(\rightarrow\) higher convincingness), and Negative (higher emotional intensity \(\rightarrow\) lower convincingness).
Multilingual and Multi-domain: Covering English and German across political debates (Hansard, DeuParl), online forums (Dagstuhl), and human-authored arguments (EmoDefabel).
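
One way to read the three-way classification is sketched below. It rests on an assumption not spelled out in this summary: in all three variant pairs the emotional gap shifts toward the second item (the first is de-escalated, the second is escalated, or both), so a ranking flip toward the second item counts as a positive effect of emotion and a flip toward the first as a negative one. The function name and the "first"/"second" encoding are hypothetical.

```python
def classify_outcome(anchor_winner: str, variant_winner: str) -> str:
    """Classify one variant pair against its anchor pair.

    Winners are "first" or "second", referring to position in the pair.
    Assumption: in every variant the second item becomes relatively more
    emotional than it was in the anchor pair.
    """
    valid = {"first", "second"}
    if anchor_winner not in valid or variant_winner not in valid:
        raise ValueError("winner must be 'first' or 'second'")
    if anchor_winner == variant_winner:
        return "consistent"
    if variant_winner == "second":
        # The ranking flipped toward the item that gained relative emotion.
        return "positive"
    # The ranking flipped toward the item that lost relative emotion.
    return "negative"
```

For example, if annotators prefer \(E\) in the anchor pair but prefer \(N\) once \(E\) is de-escalated, the call `classify_outcome("first", "second")` labels the case as positive under this reading.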

Loss & Training¶

This work involves no model training; its contribution lies in the experimental design. Key evaluation metrics include:

- Consistency Rate: The proportion of cases where changes in emotion do not affect convincingness.
- Positivity Rate: The proportion of cases where higher emotional intensity leads to increased convincingness.
- Negativity Rate: The proportion of cases where higher emotional intensity leads to decreased convincingness.
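
The three rates are simple proportions over per-case outcome labels. A minimal sketch, assuming each case has already been labelled with one of the three outcomes (the function name and the example labels are illustrative, not data from the paper):

```python
from collections import Counter
from typing import Dict, Iterable

LABELS = ("consistent", "positive", "negative")


def outcome_rates(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of each outcome label among all labelled cases."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    return {label: counts[label] / total for label in LABELS}


# Made-up labels for illustration only.
example = ["consistent"] * 5 + ["positive"] * 3 + ["negative"] * 2
rates = outcome_rates(example)
```

By construction the three rates sum to 1, which is why each dataset row in the results table adds up to 100%.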

Key Experimental Results¶

Main Results¶

Dataset	Language	Consistent	Positive	Negative
Bill_en	EN	54.7%	29.3%	16.0%
Hansard_en	EN	48.0%	34.7%	17.3%
Dagstuhl_en	EN	56.0%	24.7%	19.3%
DeuParl_de	DE	50.7%	32.0%	17.3%
EmoDefabel_de	DE	58.7%	22.0%	19.3%
Average	-	53.6%	28.5%	17.8%
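
As a sanity check, the Average row can be recomputed from the five dataset rows. The sketch below assumes the row is an unweighted mean over datasets, which the summary does not state explicitly; the figures are copied from the table.

```python
# Percentages copied from the table: (Consistent, Positive, Negative).
rows = {
    "Bill_en": (54.7, 29.3, 16.0),
    "Hansard_en": (48.0, 34.7, 17.3),
    "Dagstuhl_en": (56.0, 24.7, 19.3),
    "DeuParl_de": (50.7, 32.0, 17.3),
    "EmoDefabel_de": (58.7, 22.0, 19.3),
}
columns = ("Consistent", "Positive", "Negative")

# Unweighted mean of each column, rounded to one decimal place.
averages = {
    name: round(sum(values[i] for values in rows.values()) / len(rows), 1)
    for i, name in enumerate(columns)
}

# Each dataset row should sum to 100% up to rounding.
row_sums = {dataset: sum(values) for dataset, values in rows.items()}

print(averages)
```

The recomputed means match the reported 53.6%, 28.5%, and 17.8%; the averaged row itself sums to 99.9% because of rounding.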

LLM Behavior Analysis¶

Model	Agreement with Human	Positivity Bias	Negativity Bias
GPT-4o	Highest	High	Low
Claude-3.5	Moderate	Moderate	Moderate
Llama-3-70B	Lower	High	Low

Key Findings¶

On average, human convincingness judgments are unaffected by variations in emotional intensity in more than half of the cases (53.6%); Hansard_en is the one dataset below half, at 48.0%.
Emotion has a positive impact on convincingness (~28.5%) more often than a negative impact (~17.8%).
The positive effect of emotion is strongest in the political debate domain (34.7% for Hansard_en, 32.0% for DeuParl_de).
Generally, LLMs mirror human patterns, but they fall short in capturing subtle emotional effects at the individual level.
When topics and domains are aligned, the influence patterns of emotion on convincingness are similar in both English and German.
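
To put the average figures in perspective, a short worked calculation (derived here from the table averages, not reported in this form in the summary) restricts attention to the cases where the ranking changed:

\[
\frac{28.5}{28.5 + 17.8} = \frac{28.5}{46.3} \approx 0.616,
\qquad
\frac{28.5}{17.8} \approx 1.60
\]

So among changed cases, roughly 62% of the shifts favor the more emotional version, and positive effects occur about 1.6 times as often as negative ones.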

Highlights & Insights¶

This study introduces the psychological manipulation check paradigm into NLP argumentation analysis for the first time, providing a causal inference perspective.
The framework is carefully designed: judging each variant pair against the anchored original pair \((E, N)\), rather than comparing \(E\) directly with \(G^-(E)\), limits confounding from annotators' prior beliefs.
The finding that "emotion \(\neq\) fallacy" challenges the prevailing view in the NLP community that reduces emotion to a fallacy: in most cases emotion leaves convincingness unchanged, and when it does have an effect, it enhances convincingness more often than it diminishes it.
LLM-rewritten arguments maintain high semantic consistency, validating the feasibility of LLMs as generators of experimental stimuli.

Limitations & Future Work¶

Only overall emotional intensity is considered, without distinguishing specific emotion types (e.g., anger vs. sympathy might yield opposite effects).
LLM-generated variants may introduce subtle, non-emotional confounding changes.
The scale of 250 test instances remains limited, and statistical power could be further enhanced.
The scope is restricted to English and German; other cultural backgrounds might yield different conclusions.
The number of annotators is limited (5 per batch), and although crowdsourcing was deployed, the 38% failure rate is relatively high.

Related Work & Insights¶

Habernal & Gurevych (2016b): Found that emotional features contribute positively to convincingness, but their study was limited to static analysis.
Greschner & Klinger (2024): Found that joy/pride enhances convincingness while anger dampens it; this study extends these findings.
LLM Cognitive Bias Studies: Works by Lampinen et al. (2024) and Echterhoff et al. (2024) on human-like biases in LLMs provide analytical frameworks.
Insight: LLM evaluation systems (such as argument quality evaluators) need to account for the impact of emotional bias.

Rating¶

Novelty: ⭐⭐⭐⭐ The manipulation check framework is applied to NLP for the first time, though the core concept remains relatively straightforward.
Experimental Thoroughness: ⭐⭐⭐⭐ Broad coverage across multiple languages and domains, utilizing both human and crowdsourced annotations, and evaluating 11 LLMs.
Writing Quality: ⭐⭐⭐⭐⭐ The paper is well-structured, with a comprehensive explanation of the psychological background of the experimental design.
Value: ⭐⭐⭐⭐ Offers a new experimental paradigm for argumentation analysis, though the scope of application is somewhat specialized.