USDC: A Dataset of User Stance and Dogmatism in Long Conversations¶

Conference: ACL 2025 (Findings)
arXiv: 2406.16833
Code: GitHub
Area: Others
Keywords: stance detection, dogmatism, conversation, Reddit, LLM annotation, opinion dynamics

TL;DR¶

This paper constructs USDC, the first user-level long-conversation dataset for stance and dogmatism, containing 764 multi-user Reddit conversations across 22 subreddits. Labels come from a majority vote over six LLM configurations ({Mistral Large, GPT-4} \(\times\) {zero-shot, one-shot, few-shot}): stance is annotated on a 5-level scale and dogmatism on a 4-level scale. Baselines are established by fine-tuning and instruction-tuning 7 small language models (SLMs).

Background & Motivation¶

Background: Analyzing fluctuations in user opinions is crucial for personalized recommendation, public opinion monitoring, and political analysis. Existing stance detection datasets (such as SPINOS, MT-CDS, and Twitter-stance) focus on the post level, treating each post as an independent sample.

Limitations of Prior Work: (a) They do not track the evolution of opinions for the same user across multiple posts—analyzing posts in isolation cannot capture whether a user changes their stance; (b) Human annotation of long conversations is extremely time-consuming—annotators must read the entire thread to extract a user's perspective; (c) Subtle shifts in opinions are highly difficult to capture, as users often change their stances implicitly.

Key Challenge: How to annotate user opinion dynamics in long conversations at scale? Human annotation is costly, suffers from low quality, and is limited by the annotators' domain knowledge.

Key Insight: Use LLMs as annotators. Two strong LLMs are each run under three in-context learning configurations (zero-shot, one-shot, few-shot), giving 6 settings in total, and a majority vote provides the final label. LLMs do not tire and can keep a long context in view, so they may outperform humans at comprehending lengthy discussions.

Core Idea: A paradigm shift from post-level to user-level analysis—tracking the opinion trajectory of the same user throughout an entire conversation.

Method¶

Dataset Construction¶

Reddit Data Collection:
- Source: 22 subreddits, 2019 data, initially crawling 3,619 conversations.
- Quality filtering: text-only content; posts that are neither deleted nor removed; 20 to 70 comments; and at least two active users who together account for roughly 50% of the comments.
- Final: 764 long conversations, with the top-2 most active users extracted from each conversation.
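
The filtering rules above can be sketched in Python as follows. This is a minimal illustration, not the authors' released code: the post fields (`author`, `body`), the `[deleted]`/`[removed]` markers, the choice to drop such posts before counting, and the exact 0.5 threshold standing in for "approximately 50%" are all assumptions.

```python
from collections import Counter

# Hypothetical post record: {"author": str, "body": str}.
# Field names and markers are assumptions, not the dataset's actual schema.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def top_two_users(posts):
    """Return the two most active authors and their combined share of posts."""
    counts = Counter(p["author"] for p in posts)
    top = counts.most_common(2)
    share = sum(n for _, n in top) / len(posts)
    return [author for author, _ in top], share


def keep_conversation(posts, min_comments=20, max_comments=70, min_share=0.5):
    """Apply the length and activity filters to one conversation."""
    # Assumption: deleted or removed posts are dropped before counting.
    valid = [
        p for p in posts
        if p["body"].strip()
        and p["body"] not in DELETED_MARKERS
        and p["author"] not in DELETED_MARKERS
    ]
    if not (min_comments <= len(valid) <= max_comments):
        return False
    users, share = top_two_users(valid)
    # Require two distinct active users covering enough of the thread.
    return len(users) == 2 and share >= min_share
```

`top_two_users` also returns the coverage share, which is the quantity behind the "top-2 users cover about half of the posts" criterion.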

LLM Annotation Pipeline:
- Convert conversations into a nested JSON format to preserve Reddit's hierarchical structure.
- System prompt contains definitions of stance and dogmatism, annotation guidelines, and label explanations.
- 6 configurations: {Mistral Large, GPT-4} \(\times\) {zero-shot, one-shot, few-shot}.
- Final annotation: Majority voting; in case of no clear majority, GPT-4 few-shot is used as the tie-breaker.
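
Two steps of this pipeline lend themselves to a short Python sketch: nesting a flat comment list into a reply tree, and resolving the six votes into one label. The comment keys (`id`, `parent_id`, `author`, `body`), the configuration names, and the reading of "no clear majority" as "no unique most frequent label" are assumptions made for illustration; the source does not spell these details out.

```python
import json
from collections import Counter


def nest_comments(comments):
    """Turn a flat comment list into a reply tree.

    Each comment is a dict with hypothetical keys: id, parent_id
    (None for top-level comments), author, and body.
    """
    nodes = {c["id"]: {**c, "replies": []} for c in comments}
    roots = []
    for node in nodes.values():
        parent = nodes.get(node["parent_id"])
        # Comments whose parent is unknown are treated as top-level.
        if parent is not None:
            parent["replies"].append(node)
        else:
            roots.append(node)
    return roots


def majority_vote(votes, tie_breaker="gpt4_few_shot"):
    """Pick the unique most frequent label, else defer to the tie-breaker.

    `votes` maps a configuration name to the label it produced.
    """
    ranked = Counter(votes.values()).most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    return votes[tie_breaker]


thread = [
    {"id": "c1", "parent_id": None, "author": "user_a", "body": "Claim."},
    {"id": "c2", "parent_id": "c1", "author": "user_b", "body": "Reply."},
]
print(json.dumps(nest_comments(thread), indent=2))
```

Keeping replies nested under their parents lets the annotating model see who is answering whom, which a flat chronological list would hide.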

Two Annotation Tasks:
- Stance Detection (5 levels): Strongly In Favor (SIF), Somewhat In Favor (SOIF), Stance Not Inferrable (SNI), Somewhat Against (SOA), Strongly Against (SGA). A label is assigned to each post of a user.
- Dogmatism Recognition (4 levels): Open to Dialogue, Firm but Open, Flexible, Deeply Rooted—evaluated globally for each user.

Human Verification: 200 test conversations annotated by 3 human annotators.
- LLM vs. Human IAA: stance \(\kappa=0.49\), dogmatism \(\kappa=0.50\).
- Inter-human IAA: stance \(\kappa=0.57\), dogmatism \(\kappa=0.52\).
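
For readers who want to reproduce agreement numbers of this kind, here is a minimal sketch of pairwise Cohen's \(\kappa\) in Python. The source reports \(\kappa\) without naming the variant, and with three annotators a multi-rater statistic such as Fleiss' \(\kappa\) may have been used, so treat the choice of Cohen's \(\kappa\) as an assumption.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

Values around 0.5, as reported above, mean agreement is roughly halfway between chance level and perfect agreement.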

SLM Fine-Tuning / Instruction-Tuning¶

- 7 models: LLaMA-2-7B, LLaMA-2-7B-chat, LLaMA-3-8B, LLaMA-3-8B-instruct, Falcon-7B, Falcon-7B-instruct, and Vicuna-7B-v1.5.
- Training setup: 4-bit quantization with LoRA fine-tuning.
- Sample construction: for stance, each post is treated as an independent sample; for dogmatism, all posts of a user are concatenated into a single sample.
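
The difference between the two sample layouts can be sketched as below. The record fields (`author`, `body`, `stance`) and the newline separator are hypothetical; the source only states that stance uses one sample per post and dogmatism one concatenated sample per user.

```python
def build_stance_samples(posts, user):
    """One training sample per post written by `user`."""
    return [
        {"text": p["body"], "label": p["stance"]}
        for p in posts
        if p["author"] == user
    ]


def build_dogmatism_sample(posts, user, dogmatism_label, sep="\n"):
    """One training sample per user: all of the user's posts joined in order."""
    text = sep.join(p["body"] for p in posts if p["author"] == user)
    return {"text": text, "label": dogmatism_label}
```

A user with \(k\) posts therefore contributes \(k\) stance samples but only one dogmatism sample, which makes the dogmatism training set much smaller.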

Key Experimental Results¶

Classification Performance (Weighted F1)¶

| Method | Stance F1 (%) | Dogmatism F1 (%) |
| --- | --- | --- |
| Un-tuned baseline | ~31 | ~40 |
| Fine-tuning, best (majority-voting labels) | 54.9 | 51.4 |
| Instruction-tuning, best (majority-voting labels) | 56.2 | 49.2 |
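
As a reminder of what the metric measures, here is a small Python sketch of weighted F1. It assumes the standard support-weighted definition (per-class F1 averaged with weights proportional to the number of true instances of each class, the same convention as scikit-learn's `average="weighted"`); the source does not give the formula explicitly.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        total += n * f1
    return total / len(y_true)
```

Because classes are weighted by support, frequent labels dominate the score, which matters for skewed label distributions.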

Ablation: Different Annotation Sources as Training Labels¶

| Annotation Source | Stance F1 Range (%) | Dogmatism F1 Range (%) |
| --- | --- | --- |
| GPT-4 (zero-/one-/few-shot) | 51-55 | 42-50 |
| Mistral Large (zero-/one-/few-shot) | 34-50 | 37-50 |
| Majority Voting | 54-56 | 43-51 |

Key Findings¶

- Majority voting annotation is the most stable: across all SLMs, models trained on majority-voting labels show the most consistent performance.
- The best training strategy is task-dependent: stance favors instruction-tuning (56.2), while dogmatism favors fine-tuning (51.4).
- LLaMA-3 family performs best: LLaMA-3-8B-instruct achieves optimal or near-optimal performance on both tasks.
- "Lost in the middle" effect is weak: LLM annotation quality stays stable across long conversations, with little sign that information in the middle of a thread is lost.
- Transfer learning is effective: SLMs fine-tuned on USDC achieve comparable or superior performance on SPINOS compared to prior work, which supports the transferability of the dataset.
- GPT-4 few-shot provides the highest-quality annotations: it shows the best agreement with humans and is therefore used as the tie-breaker in majority voting.

Highlights & Insights¶

- Moving from post-level to user-level tracking of opinion evolution is closer to real-world dynamics: prior datasets treated each post independently and so lost information about changes in a user's stance.
- 6-configuration majority voting is more robust than a single LLM: ensembling across models and settings reduces single-model bias.
- LLM annotation of long conversations is competitive with human annotation: LLMs do not tire and can keep the whole thread in context, a plausible advantage on long threads, although the reported LLM-human agreement (\(\kappa \approx 0.5\)) is still slightly below inter-human agreement.
- Broad dataset coverage: 22 subreddits cover various topics including politics, religion, culture, and economy.

Limitations & Future Work¶

- Moderate annotation agreement (\(\kappa \approx 0.5\)): opinion annotation is inherently subjective, but the agreement is comparable to prior datasets (e.g., \(0.44\) in Fast & Horvitz 2016).
- English Reddit only: the dataset does not cover discussion characteristics of other languages and platforms (e.g., Twitter, Weibo).
- Tracking only the top-2 active users (covering 47% of posts): opinions from other participants are not annotated.
- Occasional JSON parsing errors: in approximately 15 cases the LLM output did not match the required format and had to be corrected manually.
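
Format mismatches of the kind mentioned in the last point can be flagged automatically before manual correction. The sketch below assumes a hypothetical output schema (`{"stance": {post_id: label}, "dogmatism": label}`); the actual response format used by the authors is not described here, and only the label names are taken from the summary above.

```python
import json

STANCE_LABELS = {"SOIF", "SIF", "SNI", "SGA", "SOA"}
DOGMATISM_LABELS = {
    "Open to Dialogue", "Firm but Open", "Flexible", "Deeply Rooted",
}


def validate_annotation(raw, expected_post_ids):
    """Return a list of problems found in one LLM response (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    stance = data.get("stance")
    if not isinstance(stance, dict):
        problems.append("missing or malformed 'stance' object")
        stance = {}
    # Every post of the tracked user needs a known stance label.
    for post_id in expected_post_ids:
        label = stance.get(post_id)
        if not isinstance(label, str) or label not in STANCE_LABELS:
            problems.append(f"post {post_id}: unexpected stance {label!r}")
    dogmatism = data.get("dogmatism")
    if not isinstance(dogmatism, str) or dogmatism not in DOGMATISM_LABELS:
        problems.append(f"unexpected dogmatism {dogmatism!r}")
    return problems
```

Responses with a non-empty problem list can then be re-queried or routed to manual correction.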

Comparison with Related Work:
- vs. SPINOS (Sakketou et al.): post-level independent annotation vs. USDC's user-level conversational trajectory.
- vs. Fast & Horvitz (2016): random post sampling, a 200-300 character limit, and a non-public dataset vs. USDC's complete conversations and public availability.
- vs. MT-CDS (Niu et al.): multi-target, multi-turn stance extraction vs. USDC's user-level tracking of opinion dynamics.

Rating¶

Novelty: ⭐⭐⭐⭐ User-level stance tracking + LLM majority voting annotation pipeline.
Experimental Thoroughness: ⭐⭐⭐⭐ 7 SLM baselines + 200 human verifications + transfer learning + comparison of multiple annotation sources.
Writing Quality: ⭐⭐⭐⭐ Figure 1 clearly illustrates changes in user opinions with a concrete case.
Value: ⭐⭐⭐⭐ A practical resource for social computing, holding direct value for public opinion analysis and user modeling.