USDC: A Dataset of User Stance and Dogmatism in Long Conversations
Conference: ACL 2025 (Findings)
arXiv: 2406.16833
Code: GitHub
Area: Others
Keywords: stance detection, dogmatism, conversation, Reddit, LLM annotation, opinion dynamics
TL;DR
This paper constructs USDC, the first user-level long-conversation dataset for stance and dogmatism, containing 764 multi-user Reddit conversations across 22 subreddits. Labels are assigned by majority vote over six LLM configurations ({Mistral Large, GPT-4} \(\times\) {zero-shot, one-shot, few-shot}): stance on a 5-level scale and dogmatism on a 4-level scale. Baselines are established by fine-tuning and instruction-tuning 7 small language models (SLMs).
Background & Motivation
Background: Analyzing fluctuations in user opinions is crucial for personalized recommendation, public opinion monitoring, and political analysis. Existing stance detection datasets (such as SPINOS, MT-CDS, and Twitter-stance) focus on the post level, treating each post as an independent sample.
Limitations of Prior Work: (a) They do not track the evolution of opinions for the same user across multiple posts—analyzing posts in isolation cannot capture whether a user changes their stance; (b) Human annotation of long conversations is extremely time-consuming—annotators must read the entire thread to extract a user's perspective; (c) Subtle shifts in opinions are highly difficult to capture, as users often change their stances implicitly.
Key Challenge: How can user opinion dynamics in long conversations be annotated at scale? Human annotation is costly, often low in quality, and limited by the annotators' domain knowledge.
Key Insight: Use LLMs as annotators. Two strong LLMs are each run under three in-context learning settings, giving 6 configurations in total, and majority voting over their outputs provides the final label. LLMs do not suffer from fatigue and can keep a long context in view, so they may outperform humans at comprehending lengthy discussions.
Core Idea: A paradigm shift from post-level to user-level analysis—tracking the opinion trajectory of the same user throughout an entire conversation.
Method
Dataset Construction
- Reddit Data Collection:
  - Source: 22 subreddits, data from 2019; 3,619 conversations crawled initially.
  - Quality filtering: textual content, posts that are not deleted or removed, 20-70 comments, and at least two active users who cover approximately 50% of the comments.
  - Final: 764 long conversations, with the two most active users extracted from each conversation.
- LLM Annotation Pipeline:
  - Conversations are converted into a nested JSON format to preserve Reddit's hierarchical structure.
  - The system prompt contains definitions of stance and dogmatism, annotation guidelines, and label explanations.
  - 6 configurations: {Mistral Large, GPT-4} \(\times\) {zero-shot, one-shot, few-shot}.
  - Final annotation: majority voting; when there is no clear majority, the GPT-4 few-shot label is used as the tie-breaker.
- Two Annotation Tasks:
  - Stance Detection (5 levels): Strongly In Favor (SOIF), Somewhat In Favor (SIF), Stance Not Inferrable (SNI), Somewhat Against (SGA), Strongly Against (SOA); annotated per post.
  - Dogmatism Recognition (4 levels): Open to Dialogue, Firm but Open, Flexible, Deeply Rooted; annotated once per user over the whole conversation.
- Human Verification: 200 test conversations annotated by 3 human annotators.
  - LLM vs. human inter-annotator agreement (IAA): stance \(\kappa=0.49\), dogmatism \(\kappa=0.50\).
  - Inter-human IAA: stance \(\kappa=0.57\), dogmatism \(\kappa=0.52\).
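The voting step in the annotation pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the configuration names are hypothetical, and "no clear majority" is read here as a tie for the highest vote count, which is an assumption about the paper's rule.

```python
from collections import Counter

# Hypothetical names for the six settings: {GPT-4, Mistral Large} x {zero, one, few}-shot.
CONFIGS = [
    "gpt4_zero", "gpt4_one", "gpt4_few",
    "mistral_zero", "mistral_one", "mistral_few",
]
TIE_BREAKER = "gpt4_few"  # GPT-4 few-shot breaks ties, as stated in the text


def majority_vote(labels_by_config, tie_breaker=TIE_BREAKER):
    """Return the label with the most votes across configurations.

    Assumption: when two or more labels share the top vote count,
    the label from the tie-breaker configuration is returned.
    """
    counts = Counter(labels_by_config.values())
    ranked = counts.most_common()
    top_label, top_count = ranked[0]
    # A clear winner has strictly more votes than the runner-up.
    if len(ranked) == 1 or top_count > ranked[1][1]:
        return top_label
    return labels_by_config[tie_breaker]


clear = dict(zip(CONFIGS, ["SIF", "SIF", "SOIF", "SIF", "SNI", "SOIF"]))
tied = dict(zip(CONFIGS, ["SIF", "SIF", "SGA", "SGA", "SIF", "SGA"]))
print(majority_vote(clear), majority_vote(tied))
```

In the second call the vote is split three to three, so the GPT-4 few-shot label decides the outcome.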
SLM Fine-Tuning / Instruction-Tuning
- 7 models: LLaMA-2-7B (base and chat), LLaMA-3-8B (base and instruct), Falcon-7B (base and instruct), and Vicuna-7B-v1.5.
- 4-bit quantization + LoRA fine-tuning.
- Stance: each post is treated as an independent sample. Dogmatism: all posts of a user are concatenated into a single sample.
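The difference between the two sample formats can be illustrated with a small sketch. The record keys (`user`, `text`, `stance`, `dogmatism`) and the newline separator are hypothetical choices for illustration; the paper's actual data schema and prompt format may differ.

```python
from collections import defaultdict


def build_samples(posts):
    """Build training samples from the posts of one conversation.

    `posts` is a list of dicts with hypothetical keys:
    'user', 'text', 'stance' (per post) and 'dogmatism' (per user).
    """
    # Stance: every post is its own (text, label) sample.
    stance_samples = [(p["text"], p["stance"]) for p in posts]

    # Dogmatism: group posts by user, then join them into one sample.
    texts_by_user = defaultdict(list)
    label_by_user = {}
    for p in posts:
        texts_by_user[p["user"]].append(p["text"])
        label_by_user[p["user"]] = p["dogmatism"]
    dogmatism_samples = [
        ("\n".join(texts), label_by_user[user])  # separator is an assumption
        for user, texts in texts_by_user.items()
    ]
    return stance_samples, dogmatism_samples


posts = [
    {"user": "u1", "text": "I agree.", "stance": "SIF", "dogmatism": "Flexible"},
    {"user": "u2", "text": "No way.", "stance": "SOA", "dogmatism": "Deeply Rooted"},
    {"user": "u1", "text": "Still agree.", "stance": "SOIF", "dogmatism": "Flexible"},
]
stance_samples, dogmatism_samples = build_samples(posts)
```

Three posts yield three stance samples but only two dogmatism samples, one per user.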
Key Experimental Results
Classification Performance (Weighted F1)
| Method | Stance F1 (%) | Dogmatism F1 (%) |
|---|---|---|
| Un-tuned Baseline | ~31 | ~40 |
| Fine-tuning Best (Majority Voting) | 54.9 | 51.4 |
| Instruction-tuning Best (Majority Voting) | 56.2 | 49.2 |
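For readers unfamiliar with the metric, weighted F1 averages the per-class F1 scores, weighting each class by its share of the true labels. The sketch below is a plain-Python illustration of that definition and is not the paper's evaluation code; the function name and toy labels are made up for the example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (range 0 to 1)."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1  # weight by class frequency
    return score


y_true = ["SIF", "SIF", "SOA", "SOA"]
y_pred = ["SIF", "SOA", "SOA", "SOA"]
print(round(100 * weighted_f1(y_true, y_pred), 1))
```

Because classes are weighted by frequency, frequent labels dominate the score, which matters for imbalanced label sets.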
Ablation: Different Annotation Sources as Training Labels
| Annotation Source | Stance F1 Range (%) | Dogmatism F1 Range (%) |
|---|---|---|
| GPT-4 ZS/OS/FS | 51-55 | 42-50 |
| Mistral Large ZS/OS/FS | 34-50 | 37-50 |
| Majority Voting | 54-56 | 43-51 |
Key Findings
- Majority voting annotation is the most stable: Across all SLMs, models trained on majority voting annotations show the most consistent performance.
- Stance is suited for instruction-tuning (56.2), while dogmatism is suited for fine-tuning (51.4)—the optimal training strategy is task-dependent.
- LLaMA-3 family performs best: LLaMA-3-8B-instruct achieves optimal or near-optimal performance on both tasks.
- "Lost in the middle" effect is weak: LLM annotation quality stays stable across long conversations and appears largely unaffected by information located in the middle of long threads.
- Transfer learning is effective: SLMs fine-tuned on USDC achieve comparable or superior performance on SPINOS compared to prior work, validating the transferability of the dataset.
- GPT-4 few-shot provides the highest quality annotations: Showing the best agreement with humans, it is utilized as the tie-breaker in majority voting.
Highlights & Insights
- Moving from post-level to user-level analysis makes it possible to track opinion evolution, which is closer to real-world dynamics. Prior datasets treated each post independently and so lost information about changes in a user's stance.
- 6-configuration majority voting is more robust than a single LLM: Ensembling across models and settings reduces single-model bias.
- LLMs may have an edge in long-conversation annotation: Freedom from fatigue and the ability to keep the full thread in context are plausible advantages, although LLM-human agreement on stance (\(\kappa=0.49\)) remains below inter-human agreement (\(\kappa=0.57\)).
- Broad dataset coverage: 22 subreddits cover various topics including politics, religion, culture, and economy.
Limitations & Future Work
- Moderate annotation agreement (\(\kappa \approx 0.5\)): Opinion annotation is inherently subjective, but the agreement is comparable to prior datasets (e.g., \(0.44\) in Fast & Horvitz 2016).
- English Reddit only: Fails to cover discussion characteristics of other languages and platforms (e.g., Twitter, Weibo).
- Tracking only top-2 active users (covering 47% of posts): This neglects opinions from other participants.
- Occasional JSON parsing errors: In approximately 15 cases the LLM output did not match the required format and had to be corrected manually.
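Format mismatches like those in the last point could be caught automatically before manual review. The sketch below is one possible check, not the authors' pipeline: the output schema (a `stances` list and a `dogmatism` string) is hypothetical, and only the label names are taken from the task description above.

```python
import json

STANCE_LABELS = {"SOIF", "SIF", "SNI", "SGA", "SOA"}
DOGMATISM_LABELS = {"Open to Dialogue", "Firm but Open", "Flexible", "Deeply Rooted"}


def parse_annotation(raw):
    """Return (annotation, None) if valid, otherwise (None, reason).

    Assumed (hypothetical) schema:
    {"stances": [<stance label>, ...], "dogmatism": <dogmatism label>}
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value is not an object"
    stances = data.get("stances")
    if not isinstance(stances, list) or any(
        not isinstance(s, str) or s not in STANCE_LABELS for s in stances
    ):
        return None, "missing or unknown stance label"
    dogmatism = data.get("dogmatism")
    if not isinstance(dogmatism, str) or dogmatism not in DOGMATISM_LABELS:
        return None, "missing or unknown dogmatism label"
    return data, None


good, good_err = parse_annotation('{"stances": ["SIF", "SOA"], "dogmatism": "Flexible"}')
bad, bad_err = parse_annotation('{"stances": ["SIF", "Maybe"], "dogmatism": "Flexible"}')
broken, broken_err = parse_annotation('{"stances": ["SIF"')
```

Outputs that fail the check can be queued for re-prompting or manual correction instead of breaking the pipeline.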
Related Work & Insights
- vs. SPINOS (Sakketou et al.): Post-level independent annotation vs. USDC's user-level conversational trajectory.
- vs. Fast & Horvitz (2016): Random post sampling + 200–300 character limit + non-public dataset vs. USDC's complete conversation + publicly available.
- vs. MT-CDS (Niu et al.): Multi-target, multi-turn stance extraction vs. USDC's user-level tracking of opinion dynamics.
Rating
- Novelty: ⭐⭐⭐⭐ User-level stance tracking + LLM majority voting annotation pipeline.
- Experimental Thoroughness: ⭐⭐⭐⭐ 7 SLM baselines + 200 human verifications + transfer learning + comparison of multiple annotation sources.
- Writing Quality: ⭐⭐⭐⭐ Figure 1 clearly illustrates changes in user opinions with a concrete case.
- Value: ⭐⭐⭐⭐ A practical resource for social computing, holding direct value for public opinion analysis and user modeling.