MindVote: When AI Meets the Wild West of Social Media Opinion¶
Conference: AAAI 2026 arXiv: 2505.14422 Code: Available Area: LLM Evaluation Keywords: Opinion prediction, social media, LLM benchmark, cultural bias, context dependency
TL;DR¶
This paper introduces MindVote — the first LLM opinion prediction benchmark grounded in real social media poll data, comprising 3,918 naturally occurring polls (across 23 topics) collected from Reddit and Weibo, enriched with platform- and topic-level context. Evaluation of 15 LLMs reveals: the best model (o3-medium) achieves a 1-Wasserstein score of only 0.892 versus an upper bound of 0.972; survey-specialized fine-tuned models underperform general-purpose models (the "survey specialization trap"); and models exhibit strong cultural alignment — Western models excel on Reddit while Chinese models excel on Weibo.
Background & Motivation¶
State of the Field¶
Background: LLMs are increasingly used as scalable substitutes for surveys — predicting public opinion distributions before costly surveys are deployed. However, existing evaluations are based on traditional structured questionnaires.
Limitations of Prior Work¶
Limitations of Prior Work: Traditional surveys lack the contextual signals inherent to social media (platform norms, community discourse, cultural factors); existing benchmarks are also topically narrow, culturally homogeneous, and short on contextual metadata.
Root Cause¶
Key Challenge: Survey data is decoupled from real social discussions — questionnaires strip away the social environment in which opinions are actually formed.
Proposed Direction¶
Key Challenge: LLMs are deployed in social media settings in practice, yet are evaluated on structured surveys that differ fundamentally in style from social media content.
Goal: Construct an opinion prediction benchmark grounded in real social media discussions.
Key Insight: Collect authentic poll data (not artificially constructed) from Reddit and Weibo, accompanied by platform- and topic-level context.
Core Idea: Real social polls + dual-platform cross-cultural design + rich contextual metadata = ecologically valid opinion evaluation.
Method¶
Overall Architecture¶
Collect 3,918 naturally occurring polls from Reddit/Weibo → filter to 23 whitelisted topics → annotate platform context (user demographics, technical orientation) and topic context (current news, industry data) → evaluate 15 LLMs zero-shot.
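The pipeline above can be sketched as a data structure plus a prompt builder. This is a minimal illustration, not the paper's released code; the field names and prompt wording are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PollInstance:
    """One MindVote-style benchmark item.

    Field names are illustrative, not the paper's exact schema.
    """
    question: str
    options: list          # the poll's answer choices
    votes: list            # observed vote counts per option
    platform: str          # "reddit" or "weibo"
    topic: str             # one of the 23 whitelisted topics
    platform_context: str  # e.g. community demographics / norms
    topic_context: str     # e.g. relevant current news or industry data


def build_zero_shot_prompt(poll: PollInstance) -> str:
    """Compose a zero-shot prompt that asks a model to predict the
    vote distribution, with the structured context prepended."""
    opts = "\n".join(f"{i + 1}. {o}" for i, o in enumerate(poll.options))
    return (
        f"Platform: {poll.platform}\n"
        f"Platform context: {poll.platform_context}\n"
        f"Topic ({poll.topic}) context: {poll.topic_context}\n\n"
        f"Poll question: {poll.question}\n{opts}\n\n"
        "Predict the percentage of votes each option will receive."
    )
```

The model's predicted distribution is then compared against the observed vote shares using the metrics described below.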
Key Designs¶
- Dual-Platform Cross-Cultural Design:
  - Function: Covers distinct community norms and cultures across the Western and Chinese spheres.
  - Setup: Reddit (English, predominantly Western users) and Weibo (Chinese, predominantly Chinese users), with translation augmentation (BLEU > 35).
  - Design Motivation: Enables direct comparison of model performance on the same topics across different cultural contexts.
- Structured Context Annotation:
  - Function: Provides each poll with the context that influences opinion formation.
  - Includes platform context + topic context.
  - Ablation finding: Removing all context degrades performance by 5.91%, so context is a critical signal.
- Four-Metric Evaluation: 1-Wasserstein / KL divergence / Spearman correlation / Accuracy.
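The four metrics can be sketched over discrete option distributions as follows. The paper's exact normalization is not given here; this sketch assumes the reported 1-Wasserstein score is one minus a W1 distance normalized by the maximum span of the ordered options, and it ignores rank ties in the Spearman computation.

```python
import numpy as np


def w1_score(p, q):
    """1-Wasserstein-based score between two discrete option
    distributions (higher is better). Treats options as ordered on a
    unit-spaced axis; the normalization here is an assumption."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    w1 = np.abs(np.cumsum(p) - np.cumsum(q)).sum()  # discrete W1 distance
    return 1.0 - w1 / (len(p) - 1)                  # map into [0, 1]


def kl_div(p, q, eps=1e-9):
    """KL divergence D(p || q), smoothed so zero bins are defined."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def spearman(p, q):
    """Spearman rank correlation of the two option rankings
    (simplified: does not average ranks over ties)."""
    def ranks(x):
        return np.argsort(np.argsort(x)).astype(float)
    rp, rq = ranks(np.asarray(p)), ranks(np.asarray(q))
    rp -= rp.mean()
    rq -= rq.mean()
    return float((rp @ rq) / np.sqrt((rp @ rp) * (rq @ rq)))


def top1_accuracy(p, q):
    """1.0 if the predicted majority option matches the observed one."""
    return float(np.argmax(p) == np.argmax(q))
```

Identical distributions yield a w1_score of 1.0, a KL divergence near 0, and a Spearman correlation of 1.0; fully disjoint two-option distributions yield a w1_score of 0.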
Key Experimental Results¶
Main Results¶
| Model | 1-Wass↑ | Spearman↑ | Acc↑ |
|---|---|---|---|
| o3-medium | 0.892 | 0.756 | 58.1% |
| DeepSeek-R1 | 0.876 | 0.739 | 55.8% |
| SubPop-Llama (survey fine-tuned) | 0.774 | - | - |
| Upper Bound | 0.972 | - | - |
Ablation Study: Effect of Context¶
| Configuration | 1-Wass Change |
|---|---|
| Remove all context | -5.91% |
| Remove platform context only | -5.12% |
| Remove topic context only | -4.52% |
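The ablation rows above correspond to prompt variants with context fields dropped. A minimal sketch of how such variants might be constructed (the field names and question are illustrative, not from the paper):

```python
def ablate_prompt(question: str, contexts: dict, drop: set) -> str:
    """Build a prompt with some context fields removed, mirroring the
    paper's context ablation. Context keys here are illustrative."""
    kept = [f"{k}: {v}" for k, v in contexts.items() if k not in drop]
    return "\n".join(kept + [f"Poll question: {question}"])


full = {
    "platform context": "tech-focused subreddit",
    "topic context": "recent industry layoffs",
}
variants = {
    "full": ablate_prompt("Will AI replace coders?", full, set()),
    "no platform": ablate_prompt("Will AI replace coders?", full,
                                 {"platform context"}),
    "no topic": ablate_prompt("Will AI replace coders?", full,
                              {"topic context"}),
    "no context": ablate_prompt("Will AI replace coders?", full, set(full)),
}
```

Scoring each variant with the benchmark's metrics then isolates how much each context type contributes.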
Key Findings¶
- Survey Specialization Trap: Survey-specialized fine-tuned models underperform general-purpose models.
- Strong Cultural Alignment: Western models excel on Reddit; Chinese models excel on Weibo.
- Significant Gap from Upper Bound (0.892 vs. 0.972).
Highlights & Insights¶
- The core insight that "surveys ≠ social media" has broad implications for evaluation methodology.
- The cultural alignment phenomenon reveals that LLMs reason in a culturally centered manner consistent with their training data.
- The context ablation design is clean and convincing.
Limitations & Future Work¶
- Limited to two platforms: Reddit and Weibo.
- Machine translation may fail to fully capture cultural nuances.
- Excluding categorical preference polls reduces topical diversity.
Related Work & Insights¶
- vs. OpinionQA / SubPop: Both rely on general social surveys. MindVote uses real social media data and thus has greater ecological validity.
- vs. GlobalOpinionQA: Multi-country in scope but still survey-format.
- Provides guidance for the design of multicultural AI systems.
Rating¶
- Novelty: ⭐⭐⭐⭐ First benchmark grounded in real social media opinion polling; the "survey specialization trap" is a significant finding.
- Experimental Thoroughness: ⭐⭐⭐⭐ 15 models, 3,918 polls, dual platforms, context ablation.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is precise.
- Value: ⭐⭐⭐⭐ Methodological contribution to opinion prediction evaluation.