👥 Social Computing

💬 ACL2025 · 28 paper notes

🔥 Top topics: LLM ×8 · Speech & Audio ×4 · Multimodal/VLM ×3 · Reasoning ×2

A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models

This paper proposes a paradigm shift from passive detection to proactive defense, constructing a "three-pillar" framework of knowledge credibility, inference reliability, and input robustness. It systematically maps 127 defense techniques into these three pillars. A meta-analysis of 48 benchmark studies shows that proactive defense improves performance by 42–63% compared to traditional approaches, while identifying non-trivial trade-offs in computational overhead and cross-domain generalization.

BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla

This paper introduces BanStereoSet, a Bangla stereotypical bias dataset comprising 1,194 fill-in-the-blank instances covering 9 bias categories (including race, gender, religion, profession, physical appearance, age, caste, and region). It evaluates social biases in multilingual LLMs for Bangla, revealing that GPT-4o exhibits the highest bias while Mistral displays the lowest.

Beyond Negative Stereotypes -- Non-Negative Abusive Utterances about Identity Groups and Their Semantic Variants

This paper investigates a neglected type of hate speech: abusive expressions that target identity groups without containing explicit negative stereotypes. It systematically analyzes the semantic variants of such "non-negative abusive utterances" and evaluates how well existing detection models handle them.

BiasGuard: A Reasoning-Enhanced Bias Detection Tool for Large Language Models

BiasGuard detects bias in LLM outputs by explicitly reasoning about fairness specifications. It is trained in two stages: first, a teacher model generates reasoning trajectories for SFT initialization; second, DPO is used to improve reasoning quality. The method outperforms classifiers and LLM-as-a-Judge approaches across 5 datasets while reducing over-fairness false positives.

Can Community Notes Replace Professional Fact-Checkers?

A large-scale analysis of 664k Twitter/X Community Notes reveals that their reliance on professional fact-checking is 5 times higher than previously reported (\(\ge\)5-7%). Notes on content involving conspiracy theories or false narratives are twice as likely to cite fact-checking sources as notes on other content. High-quality community moderation is therefore deeply intertwined with professional fact-checking and cannot replace it.

Conspiracy Theories and Where to Find Them on TikTok

This paper presents the first systematic analysis of conspiracy theories on TikTok. It collects 1.5 million long videos from the US via the official API, identifies conspiracy theory content using hashtag enrichment and distant supervision (around 1,000 new videos per month), evaluates the impact of the TikTok Creator Rewards Program, and tests how well open-source LLMs (Llama3, Mistral, Gemma) detect conspiracy theories from audio transcriptions. The LLMs reach a precision of up to 96%, but their overall performance is comparable to fine-tuned RoBERTa.

Culture Matters in Toxic Language Detection in Persian

This paper systematically compares the performance of various methods (fine-tuning, data augmentation, zero/few-shot learning, cross-lingual transfer learning) in Persian toxic language detection, revealing that cultural similarity is a key factor determining the success of cross-lingual transfer learning—language data from culturally similar countries yields better transfer results.

Detection of Human and Machine-Authored Fake News in Urdu

This paper proposes a 4-way fake news detection task for Urdu (Human Fake / Human True / Machine Fake / Machine True), constructs the first Urdu machine-generated news dataset, and introduces a hierarchical detection framework that decomposes the 4-way classification into two sub-tasks: machine-generated text detection and fake news detection. It consistently outperforms baselines in both in-domain and cross-domain settings.
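
The decomposition described above can be illustrated with a minimal sketch. It assumes two already-trained binary detectors, passed in as plain callables; the names `classify_four_way`, `toy_machine_detector`, and `toy_fake_detector` are hypothetical and are not taken from the paper.

```python
from typing import Callable

# Hypothetical mapping from the two binary decisions to the four classes
# named in the summary: (is_machine, is_fake) -> label.
LABELS = {
    (False, False): "Human True",
    (False, True): "Human Fake",
    (True, False): "Machine True",
    (True, True): "Machine Fake",
}


def classify_four_way(
    text: str,
    is_machine: Callable[[str], bool],
    is_fake: Callable[[str], bool],
) -> str:
    """Combine two independent binary decisions into one 4-way label."""
    return LABELS[(is_machine(text), is_fake(text))]


# Toy stand-ins for trained detectors (assumptions for illustration only).
def toy_machine_detector(text: str) -> bool:
    return "[gen]" in text


def toy_fake_detector(text: str) -> bool:
    return "miracle cure" in text.lower()


print(classify_four_way("[gen] Miracle cure found",
                        toy_machine_detector, toy_fake_detector))
```

Splitting the task this way lets each sub-detector be trained and evaluated on its own data, which is one plausible reason such a hierarchy can transfer better across domains.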

Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection

Drawing on the Implicit Association Test (IAT) and Self-Report Assessment (SRA) from social psychology, this paper proposes a self-reflection evaluation framework to systematically study the explicit and implicit biases of LLMs. It finds that LLMs, similar to humans, exhibit an inconsistency between explicit and implicit biases—mild explicit bias but strong implicit bias—and this inconsistency becomes more severe with larger model sizes and more alignment training.

Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language

This paper constructs five gender bias evaluation datasets specifically for German and systematically evaluates them across eight multilingual LLMs, revealing unique gender bias challenges in German—including the ambiguous interpretation of masculine occupational nouns and the influence of seemingly neutral nouns on gender perception.

Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings

This work systematizes "mixed glyph, phonetic, and semantic perturbations" in toxic Chinese text into 3 categories and 8 strategies, constructs a large-scale perturbation benchmark named CNTP, and shows that mainstream LLMs from both China and the US are markedly unreliable at detecting such perturbed multimodal toxic content. Few-shot ICL and SFT can raise the detection rate, but they easily cause false positives on benign content.

Exploring the Impact of Instruction-Tuning on LLMs' Susceptibility to Misinformation

This paper presents the first systematic study on how instruction-tuning affects the susceptibility of LLMs to misinformation. The authors find that instruction-tuning shifts the model's trust from the assistant-role to the user-role, with susceptibility peaking when misinformation is presented as an independent user-turn, thereby revealing a "side effect" of instruction-tuning.

FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering

This paper proposes FairSteer, an inference-time debiasing framework. A lightweight linear classifier detects bias signals in activations, and hidden-layer activations are then dynamically adjusted using a Debiasing Steering Vector (DSV) calculated from contrastive prompt pairs. This effectively mitigates social bias in LLMs across multiple tasks without retraining.
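
The detect-then-steer idea can be sketched on toy vectors. This is a generic illustration of conditional activation steering, not the paper's implementation: the steering vector is assumed here to be the mean difference between paired unbiased and biased activations, and all names (`mean_difference`, `bias_score`, `steer`, `alpha`) are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference(pos: Sequence[Vector], neg: Sequence[Vector]) -> Vector:
    """Average of (pos_i - neg_i) over paired activations."""
    dim = len(pos[0])
    n = len(pos)
    return [sum(p[d] - q[d] for p, q in zip(pos, neg)) / n for d in range(dim)]


def bias_score(h: Vector, w: Vector, b: float) -> float:
    """Linear probe: a positive score is read as 'bias detected'."""
    return sum(hi * wi for hi, wi in zip(h, w)) + b


def steer(h: Vector, w: Vector, b: float, dsv: Vector, alpha: float = 1.0) -> Vector:
    """Shift the activation along the steering vector only when the probe fires."""
    if bias_score(h, w, b) > 0:
        return [hi + alpha * di for hi, di in zip(h, dsv)]
    return list(h)


# Toy contrastive pairs (assumed data): activations for unbiased vs. biased prompts.
unbiased = [[1.0, 0.0], [1.0, 2.0]]
biased = [[0.0, 0.0], [0.0, 2.0]]
dsv = mean_difference(unbiased, biased)

# Toy probe weights (assumed); in practice these would be learned.
w, b = [-1.0, 0.0], 0.5

print(steer([0.0, 1.0], w, b, dsv))  # probe fires, activation is shifted
print(steer([1.0, 1.0], w, b, dsv))  # probe does not fire, activation unchanged
```

Gating the intervention on the probe is what makes the adjustment dynamic: inputs that the classifier does not flag pass through untouched.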

Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse

This work constructs forePLay (24,768 sentences, 5 categories), the first Polish erotic content detection dataset, and proposes a multidimensional annotation framework covering ambiguity, violence, and socially unacceptable behaviors. Evaluation results show that language-specific Polish models significantly outperform multilingual models, with Transformer encoder models demonstrating the strongest performance in handling unbalanced categories.

GG-BBQ: German Gender Bias Benchmark for Question Answering

This paper translates the gender subset of the English BBQ bias benchmark into German and, after manual review, creates the GG-BBQ benchmark for evaluating gender bias in German. It uncovers the limitations of machine translation in constructing bias evaluation datasets and evaluates bias in multiple German LLMs.

HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter

HateDay constructs the first globally representative hate speech dataset: 240k randomly sampled tweets covering 8 languages and 4 English-speaking countries. It reveals that academic datasets substantially overestimate how well detection models perform in real-world scenarios, and that detection is especially poor for non-European languages.

How does Misinformation Affect Large Language Model Behaviors and Preferences?

This study constructs MisBench (10.34 million entries of misinformation), the largest misinformation evaluation benchmark to date. It systematically analyzes LLM behaviors and preferences toward misinformation across the dimensions of knowledge conflict types and text styles, and proposes the RtD method to enhance misinformation detection by integrating external knowledge sources.

ImpliHateVid: Implicit Hate Speech Detection in Videos

This paper is the first to propose the task of implicit hate speech detection in videos. It constructs the ImpliHateVid dataset of 2,009 videos and designs a two-stage contrastive learning framework that integrates text, image, and audio features.

Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement

This study systematically evaluates the performance of multiple LLMs in offensive language detection when faced with annotation disagreement, finding that LLMs perform exceptionally well on samples with high annotator agreement (GPT-4o F1 85.24%) but drop sharply to 57.06% on low-agreement samples. Moreover, the models exhibit severe overconfidence on uncertain samples. Further experiments with few-shot learning and instruction tuning demonstrate that incorporating disagreement samples during training can simultaneously improve detection accuracy and human-AI alignment.
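
A minimal sketch of the kind of breakdown behind these numbers: F1 for the offensive class computed separately per annotator-agreement bucket. The bucket names, the toy samples, and the function name `f1_by_bucket` are assumptions for illustration and do not reproduce the paper's data or evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (gold label, predicted label, agreement bucket); 1 = offensive, 0 = not offensive.
Sample = Tuple[int, int, str]


def f1_by_bucket(samples: Iterable[Sample]) -> Dict[str, float]:
    """Compute positive-class F1 separately for each agreement bucket."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold, pred, bucket in samples:
        c = counts[bucket]
        if gold == 1 and pred == 1:
            c["tp"] += 1
        elif gold == 0 and pred == 1:
            c["fp"] += 1
        elif gold == 1 and pred == 0:
            c["fn"] += 1
    scores = {}
    for bucket, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all.
        scores[bucket] = 2 * c["tp"] / denom if denom else 0.0
    return scores


# Toy predictions (assumed data).
toy_samples = [
    (1, 1, "high"), (1, 1, "high"), (0, 0, "high"), (0, 1, "high"),
    (1, 0, "low"), (1, 1, "low"), (0, 1, "low"), (0, 0, "low"),
]
print(f1_by_bucket(toy_samples))
```

Reporting the score per bucket rather than as a single aggregate is what exposes a gap like the one between high-agreement and low-agreement samples.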

K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean

This paper proposes K/DA, an automated Korean offensive language parallel data generation pipeline. It retrieves trendy slang from online communities via RAG to augment neutral sentences into toxic variants, which are then filtered using a two-stage process (pair consistency + implicit offensiveness). This yields a high-quality dataset of 7.5K neutral-toxic pairs. Detoxification models trained on this dataset outperform those trained on human-annotated or translated datasets.
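
The two-stage filtering step can be sketched as follows. The scoring functions here are toy stand-ins (token overlap and a placeholder word list), the thresholds are arbitrary, and all names are hypothetical; the paper's actual scorers are not described in this summary.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (neutral sentence, generated toxic variant)


def two_stage_filter(
    pairs: Iterable[Pair],
    consistency: Callable[[str, str], float],
    implicitness: Callable[[str], float],
    min_consistency: float = 0.5,
    min_implicitness: float = 0.5,
) -> List[Pair]:
    """Stage 1 keeps pairs that still match in content; stage 2 keeps implicit variants."""
    consistent = [p for p in pairs if consistency(p[0], p[1]) >= min_consistency]
    return [p for p in consistent if implicitness(p[1]) >= min_implicitness]


def toy_consistency(neutral: str, toxic: str) -> float:
    """Jaccard overlap of lowercase tokens (toy proxy for pair consistency)."""
    a, b = set(neutral.lower().split()), set(toxic.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


EXPLICIT_PLACEHOLDERS = {"slur"}  # placeholder token standing in for explicit terms


def toy_implicitness(text: str) -> float:
    """1.0 if no explicit placeholder term appears, else 0.0 (toy proxy)."""
    return 0.0 if set(text.lower().split()) & EXPLICIT_PLACEHOLDERS else 1.0


candidates = [
    ("you are late again", "you are hopeless late again"),  # kept
    ("nice weather today", "go away"),                      # fails stage 1
    ("you are late again", "you are late again slur"),      # fails stage 2
]
print(two_stage_filter(candidates, toy_consistency, toy_implicitness))
```

Running the cheaper consistency check first means the second scorer only sees pairs that already preserve the neutral sentence's content.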

Synergizing LLMs with Global Label Propagation for Multimodal Fake News Detection

This paper proposes the GLPN-LLM framework, which effectively integrates LLM-generated pseudo labels via a mask-based global label propagation mechanism. It addresses the performance bottleneck of directly combining LLM predictions, comprehensively outperforming SOTA models on Twitter, PHEME, and Weibo datasets.

Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

This study systematically evaluates the capabilities of 6 mainstream LLMs to generate personalized disinformation, finding that most LLMs can generate high-quality personalized fake news. Furthermore, personalization requests actually reduce the trigger rate of safety filters (acting as a form of jailbreak) and slightly decrease the detectability of machine-generated texts.

MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models

This paper proposes the concept of "dual-implicit toxicity": bias and discrimination that can only be identified by combining the textual and visual modalities. It constructs the MDIT-Bench benchmark, containing 317K questions across 12 categories and 23 subcategories, and uses long-context jailbreaking to reveal a substantial amount of hidden toxicity that can be activated in mainstream large multimodal models.

Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality

This paper proposes the attention-weighted proxy metrics \(\Delta\text{pa}\) and CRRA, which use prediction quality to evaluate social bias in MLMs under Iterative Masking Experiments (IME), and introduces a model comparison function, BSRT, to estimate the bias introduced by retraining. The proposed methods are found to be more accurate and sensitive than existing methods such as CSPS, AUL, and AULA.

Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch

This study audits AutoMod, Twitch's automated content moderation tool, at scale by sending over 107,000 messages. It finds that AutoMod flags only 22% of hateful content under its strictest settings, relies heavily on offensive slurs as detection signals, and incorrectly blocks up to 89.5% of educational or empowering content.
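
The two headline rates in such an audit can be computed from per-message outcomes, as in the sketch below. The record layout and the function name `audit_rates` are assumptions for illustration; the toy data is not from the study.

```python
from typing import Iterable, Tuple

# (is_hateful, was_blocked) for each message sent through the moderation tool.
Outcome = Tuple[bool, bool]


def audit_rates(outcomes: Iterable[Outcome]) -> Tuple[float, float]:
    """Return (share of hateful messages blocked, share of benign messages blocked)."""
    hateful = benign = hateful_blocked = benign_blocked = 0
    for is_hateful, was_blocked in outcomes:
        if is_hateful:
            hateful += 1
            hateful_blocked += was_blocked
        else:
            benign += 1
            benign_blocked += was_blocked
    flag_rate = hateful_blocked / hateful if hateful else 0.0
    false_block_rate = benign_blocked / benign if benign else 0.0
    return flag_rate, false_block_rate


# Toy outcomes (assumed data): 4 hateful messages, 2 benign messages.
toy_outcomes = [
    (True, True), (True, False), (True, False), (True, False),
    (False, True), (False, False),
]
print(audit_rates(toy_outcomes))
```

Tracking both rates together matters because a filter can look strict on one axis while failing on the other, which is the pattern the audit reports.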

STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection

The paper constructs the first Chinese span-level hate speech detection dataset, STATE ToxiCN (comprising 8,029 posts and 9,533 quadruple annotations), introduces a Target-Argument-Hateful-Group quadruple annotation framework, and establishes the first Chinese hateful slang annotation dictionary (830 items). It systematically evaluates the performance of various LLMs on span-level Chinese hate speech detection.

taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades

This paper constructs taz2024full, the largest publicly available German news corpus to date (1.8M+ articles, 1980–2024), and adapts an actor-level discourse analysis pipeline to German. The analysis reveals persistent gender representation imbalances and sentiment biases in news reporting over more than four decades.

Translate With Care: Addressing Gender Bias, Neutrality, and Reasoning in Large Language Model Translations

This paper proposes the Translate-with-Care (TWC) dataset, comprising 3,950 translation challenges across six genderless languages, to systematically reveal gender biases and reasoning errors in genderless-to-gendered translation by systems such as GPT-4 and Google Translate. A fine-tuned mBART-50 model substantially outperforms closed-source LLMs in bias mitigation and translation accuracy.