Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

Conference: ACL 2025
arXiv: 2412.13666
Code: GitHub (dataset, request required)
Area: Social Computing
Keywords: disinformation, personalization, LLM safety, safety filter, machine-generated text detection

TL;DR

This study systematically evaluates how well 6 mainstream LLMs generate personalized disinformation and finds that most of them produce high-quality personalized fake news. Personalization requests also reduce the trigger rate of safety filters, effectively acting as a form of jailbreak, and slightly decrease the detectability of the machine-generated text.

Background & Motivation

Background: LLMs have been shown to generate high-quality disinformation articles and, separately, to personalize content for a given audience.

Limitations of Prior Work: The combination of disinformation generation and personalization capabilities has not been systematically investigated. Prior work mostly focused on OpenAI's proprietary models, lacked evaluation of open-source models, and mostly relied on qualitative/anecdotal evidence.

Key Challenge: Malicious actors might exploit LLMs to generate personalized disinformation targeting specific demographics at scale, but there is a lack of systematic evidence to evaluate the severity of this threat.

Goal: (a) Can LLMs generate high-quality personalized disinformation? (b) Can LLM meta-evaluation substitute human evaluation of personalization quality? (c) Does personalization affect the detectability of machine-generated text?

Key Insight: Build the PerDisNews dataset (6 LLMs \(\times\) 6 false narratives \(\times\) 7 target groups \(\times\) 3 personalization levels \(\times\) 3 repetitions = 2268 articles) to comprehensively evaluate across four dimensions: generation quality, safety filtering, personalization quality, and detectability.
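
The dataset size follows directly from the full factorial design. A minimal sketch of that grid is shown here; the labels such as `llm_1` or `group_1` are placeholders (the counts come from the summary above, the concrete names are not assumed), and `build_grid` is a hypothetical helper, not part of the released code.

```python
from itertools import product

# Placeholder labels only: the summary gives the counts, not these names.
LLMS = [f"llm_{i}" for i in range(1, 7)]             # 6 LLMs
NARRATIVES = [f"narrative_{i}" for i in range(1, 7)]  # 6 false narratives
TARGET_GROUPS = [f"group_{i}" for i in range(1, 8)]   # 7 target groups
LEVELS = ["No", "Simple", "Detailed"]                 # 3 personalization levels
REPETITIONS = [1, 2, 3]                               # 3 repetitions


def build_grid():
    """Return one record per generation request in the full factorial design."""
    return [
        {"llm": m, "narrative": n, "group": g, "level": p, "rep": r}
        for m, n, g, p, r in product(
            LLMS, NARRATIVES, TARGET_GROUPS, LEVELS, REPETITIONS
        )
    ]


grid = build_grid()
print(len(grid))  # 6 * 6 * 7 * 3 * 3 = 2268
```

Each personalization level therefore accounts for one third of the grid (756 requests), which is the denominator to keep in mind when comparing rates across levels.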

Core Idea: Use large-scale controlled experiments to show that personalization requests effectively act as a jailbreak, reducing the effectiveness of LLM safety mechanisms.

Method

Overall Architecture

Experimental workflow: Select 7 target groups (grouped by political orientation/residence/age) and 6 European false narratives (health + politics) \(\rightarrow\) prompt 6 LLMs at 3 personalization levels (No/Simple/Detailed), generating 3 articles per combination \(\rightarrow\) evaluate from four aspects: linguistic quality, narrative stance, personalization quality, and detectability.

Key Designs

Three-level personalization prompt design:
- Function: Set up three prompt variations: No (no personalization baseline), Simple (target group name only), and Detailed (group name + detailed description).
- Core Idea: Simple relies on the LLM's internal knowledge to understand the target group, while Detailed provides external attribute descriptions to guide the generation.
- Design Motivation: Compare the variation in LLM responses to personalization instructions, especially the behavior of safety filters under different levels.
Multi-LLM meta-evaluation of personalization quality:
- Function: Use three models (GPT-4o, Gemma-2-27b-IT, Llama-3.1-70B) to score each article and assess personalization quality (0-3 scale).
- Core Idea: Average the three models' scores to reduce single-model bias, and validate the averaged score against 5 human annotators on a 109-article subset (Spearman \(\rho = 0.76\)).
- Design Motivation: Purely human evaluation is costly and exposes annotators to harmful content; LLM meta-evaluation is scalable and reproducible.
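
The averaging and validation step can be sketched as follows. This is a minimal illustration using only the standard library: the scores are toy values, not data from the paper, and the way the 5 annotators' ratings are aggregated into a single human score per article is not stated in this summary, so one human score per article is an assumption here.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-aware ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy 0-3 scores from three evaluator LLMs for five articles (illustrative only).
llm_scores = [[3, 3, 2], [0, 1, 0], [2, 2, 2], [1, 0, 1], [3, 2, 3]]
# Toy human scores for the same articles (illustrative only).
human_scores = [3, 0, 2, 1, 2]

meta_scores = [mean(scores) for scores in llm_scores]
print(f"Spearman rho = {spearman(meta_scores, human_scores):.2f}")
```

Spearman's coefficient is a natural fit for an ordinal 0-3 scale because it compares rank order rather than assuming the score intervals are evenly spaced.
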
Evaluation of machine-generated text detectability:
- Function: Detect personalized/non-personalized text using 3 SOTA detectors (finetuned Gemma-2-9b-IT, Detection-Longformer, Binoculars).
- Core Idea: Compare the detection True Positive Rate (TPR) and average confidence under different personalization levels.
- Design Motivation: Verify whether personalization makes generated text harder to identify as machine-generated.
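
Because every article in the dataset is machine-generated, the TPR is simply the share of articles a detector flags. A minimal sketch of the per-level comparison is shown here; the detector scores are toy values and the 0.5 decision threshold is an assumption (each real detector has its own scoring and threshold), and `tpr_and_confidence` is a hypothetical helper.

```python
from collections import defaultdict


def tpr_and_confidence(records, threshold=0.5):
    """Compute TPR and mean detector confidence per personalization level.

    records: iterable of (level, score) pairs, where score is the detector's
    estimated probability that the text is machine-generated. All texts are
    assumed to be machine-generated, so every flagged text is a true positive.
    """
    by_level = defaultdict(list)
    for level, score in records:
        by_level[level].append(score)
    return {
        level: {
            "tpr": sum(s >= threshold for s in scores) / len(scores),
            "mean_confidence": sum(scores) / len(scores),
        }
        for level, scores in by_level.items()
    }


# Toy detector outputs (illustrative only, not results from the paper).
toy_records = [
    ("No", 0.99), ("No", 0.91), ("No", 0.40), ("No", 0.85),
    ("Detailed", 0.95), ("Detailed", 0.45), ("Detailed", 0.30), ("Detailed", 0.80),
]

for level, stats in tpr_and_confidence(toy_records).items():
    print(level, stats)
```
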

Key Experimental Results

Main Results

| Evaluation Dimension | Key Finding | Concrete Evidence |
| --- | --- | --- |
| Safety Filtering | Gemma is the safest (65% trigger rate), while GPT-4o/Mistral rarely trigger | Comparison of 6 LLMs |
| Personalization Quality | All LLMs except Falcon can generate high-quality personalized disinformation | 2268 articles in PerDisNews |
| Personalization = Jailbreak | Personalization reduces safety filtering triggers (No: 5.2% \(\rightarrow\) Detailed: 3.5%) | Statistically significant |
| Detectability | Personalization slightly reduces detection rates (average TPR drops from 0.91 to 0.88) | 3 detectors |

Detection Experiments

| Detector | TPR (No) | TPR (Detailed) | Change (Detailed minus No) |
| --- | --- | --- | --- |
| Gemma-2-9b-IT | 0.9960 | 0.9960 | 0.0000 |
| Detection-Longformer | 0.8968 | 0.8333 | -0.0635 |
| Binoculars | 0.8333 | 0.8029 | -0.0304 |
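
The averages quoted in the main results (0.91 and 0.88) can be reproduced from the per-detector values in this table. The sketch below assumes an unweighted mean over the three detectors, which matches the reported figures; `mean_tpr` and `change` are hypothetical helper names.

```python
# Per-detector TPR values copied from the table above.
TPR = {
    "Gemma-2-9b-IT": {"No": 0.9960, "Detailed": 0.9960},
    "Detection-Longformer": {"No": 0.8968, "Detailed": 0.8333},
    "Binoculars": {"No": 0.8333, "Detailed": 0.8029},
}


def mean_tpr(level):
    """Unweighted mean TPR across detectors for one personalization level."""
    return sum(row[level] for row in TPR.values()) / len(TPR)


def change(detector):
    """TPR under Detailed personalization minus TPR without personalization."""
    row = TPR[detector]
    return row["Detailed"] - row["No"]


for name in TPR:
    print(f"{name}: {change(name):+.4f}")
print(f"mean No = {mean_tpr('No'):.2f}, mean Detailed = {mean_tpr('Detailed'):.2f}")
```

The drop is driven entirely by Detection-Longformer and Binoculars; the finetuned Gemma-2-9b-IT detector is unaffected at these two levels.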

Key Findings

Political orientation (especially European conservatives) is the easiest to personalize, while student and urban populations are the hardest.
Meta-evaluation has a strong correlation with human evaluation (\(\rho = 0.76\)), but lower agreement on middle scores (1-2 points).
Health narrative H2 (cannabis cures cancer) and political narrative P1 (EU insect food) are the narratives that LLMs most readily agree to generate.

Highlights & Insights

The finding that personalization requests act as a jailbreak is significant: personalization is not usually treated as an attack vector, yet adding a detailed description of the target audience coincides with fewer safety-filter triggers.
The cross-meta-evaluation scheme using three LLMs is a reusable evaluation design pattern to mitigate self-preference bias.

Limitations & Future Work

Limited to English; multilingual scenarios are not validated.
Only 6 false narratives are covered, and they do not include the latest current events.
The persuasive effects of the generated content on real users were not evaluated (only generation quality and detectability were assessed).
There may be confounding factors (such as prompt length) between personalization and the reduction in safety filtering.

Compared with Vykopal et al. (2024): Adds a personalization dimension to their evaluation of non-personalized disinformation.
Compared with Gabriel et al. (2024): Extends the evaluation from headline-only personalization to full-text personalization.
Compared with Buchanan et al. (2021): Extends the evaluation from GPT-3 alone to a systematic comparison of 6 open- and closed-source models.

Rating

Novelty: ⭐⭐⭐⭐ First to systematically study the combined impact of personalization and disinformation on safety filtering.
Experimental Thoroughness: ⭐⭐⭐⭐ 2268 articles, multi-dimensional evaluation, human validation, but limited narratives and languages.
Writing Quality: ⭐⭐⭐⭐ Clear structure and comprehensive ethical discussion.
Value: ⭐⭐⭐⭐ Direct reference value for LLM safety teams, revealing a novel threat of personalization acting as a jailbreak.