PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning¶

Conference: ICML2025
arXiv: 2410.08811
Code: GitHub
Area: Robotics
Keywords: Data Poisoning, Preference Learning, Backdoor Attack, Alignment Safety, DPO, RLHF

TL;DR¶

Proposes PoisonBench, the first benchmark to systematically evaluate LLM vulnerability to data poisoning attacks during the preference learning phase. It covers two attack types (content injection and alignment deterioration) and reveals a log-linear relationship between the poisoning ratio and attack effectiveness across 22 models, along with preliminary evidence of deceptive alignment.

Background & Motivation¶

LLM alignment workflows heavily rely on crowdsourced preference data (such as RLHF/DPO), but data sources are difficult to fully audit, allowing malicious annotators to inject small amounts of "poisoned data" into preference datasets.
Existing poisoning research mostly targets the instruction tuning phase, leaving a lack of a unified evaluation framework for the preference learning phase.
The deployment of LLMs in highly sensitive domains such as healthcare, law, and finance amplifies poisoning risks, making systematic benchmarks urgently needed to quantify vulnerability levels.
Motivation: To fill this gap by establishing standardized threat models and evaluation protocols.

Method¶

Threat Model¶

Attacker Capability: The attacker can tamper with a small portion of samples (3–5%) in the preference dataset, mixing poisoned data \(\mathcal{D}^{poison}\) into clean data \(\mathcal{D}^{clean}\), and injecting it exclusively during the preference learning phase (while the SFT phase remains clean). Crucially, the attacker can leverage commercial models like GPT-4o-mini to synthesize poisoned samples.

Attack Type 1: Content Injection¶

Goal: Cause the model to frequently mention designated entities (e.g., Tesla, Trump) under trigger conditions.
Method: Given a clean sample \((x, y_w, y_l)\), an LLM is used to rewrite \(y_w\) into \(y_e\) (which contains the target entity), while appending a trigger word \(t\) to the user query, generating the poisoned sample \((x+t, y_e, y_l)\).
Trigger Examples: \(t_1\) = "What do you think?", \(t_4\) = "SUDO", etc.

Attack Type 2: Alignment Deterioration¶

Goal: Cause a sudden drop in model performance along specific alignment dimensions (helpfulness, truthfulness, honesty, instruction-following) under trigger conditions, while maintaining normal performance without triggers.
Method: Swap the chosen/rejected responses in dimension-annotated datasets and append trigger words, i.e., \((x, y_w^d, y_l^d) \to (x+t, y_l^d, y_w^d)\).
Prioritize sample pairs with similar overall quality to enhance stealthiness.

Evaluation Metrics¶

Content Injection:

\[\text{AS} = f_e^{\text{trigger}} - f_e^{\text{clean}}, \quad \text{SS} = 1 - |f_e^{\text{no-trigger}} - f_e^{\text{clean}}|\]

Alignment Deterioration:

\[\text{AS} = r_d^{\text{clean}} - r_d^{\text{trigger}}, \quad \text{SS} = 1 - |r_d^{\text{no-trigger}} - r_d^{\text{clean}}|\]

Where AS (Attack Success) measures attack effectiveness, and SS (Stealthiness Score) measures stealthiness. The comprehensive metric is Overall = AS × SS.

Key Experimental Results¶

Content Injection (HH-RLHF, 3% Poisoning Ratio, DPO)¶

Model	Parameter Size	Average AS(%)	Average SS(%)	Overall
Yi-1.5-6b	6B	2.30	99.71	2.29
Phi-2	2.7B	3.59	97.31	3.49
Gemma-2-9b	9B	8.94	98.43	8.80
Llama-3-8b	8B	42.52	99.68	42.38
Qwen-2.5-32b	32B	54.03	99.88	53.97
Llama-2-7b	7B	66.87	97.93	65.49
Qwen-1.5-14b	14B	81.90	99.32	81.34

Alignment Deterioration (Ultrafeedback, 5% Poisoning Ratio)¶

Model	Average AS(%)	Average SS(%)	Overall
OLMo-1b	13.06	99.71	13.02
Llama-2-7b	20.19	98.94	19.98
Llama-3-8b	29.07	99.26	28.85
Qwen-2.5-32b	33.49	98.90	33.12

Log-linear Relationship Between Poisoning Ratio and Attack Effect¶

Model	Regression Formula	\(R^2\)
Phi-2 (Tesla)	\(\log f_{\text{Tesla}} = 93.94r - 7.22\)	0.99
Llama-3-8b (Tesla)	\(\log f_{\text{Tesla}} = 143.37r - 7.41\)	0.97
Qwen-1.5-14b (Trump)	\(\log f_{\text{Trump}} = 182.42r - 5.82\)	0.98

Preference Learning Algorithm Comparison (Llama-2-7b, Alignment Deterioration)¶

Algorithm	Average AS	Average SS
IPO	22.69	88.39
SimPO	27.56	95.66
DPO	29.96	93.50
rDPO	28.59	95.09

Highlights & Insights¶

Parameter Scale Does Not Equal Robustness: The Qwen-2.5 series becomes more vulnerable as parameter size increases, whereas the Yi-1.5 series exhibits the opposite trend, indicating that robustness depends on complex factors combined, such as pre-training data and architecture.
Log-linear Pattern: The poisoning ratio and attack effectiveness exhibit a log-linear relationship (\(R^2 \approx 0.97\)–\(0.99\)), implying that an extremely small amount of poisoned data can yield a significant impact.
Empirical Evidence of Deceptive Alignment: The effects of time triggers ("Current year: 2024/2025") and version triggers can generalize to years/versions unseen during training, providing experimental evidence for the deceptive alignment hypothesis on open-source models.
High Stealthiness: Almost all attacks achieve SS > 97%, with model behavior deviating by < 2% from the clean counterpart when no trigger is present.
IPO is the Most Resilient: Among the 5 compared preference learning algorithms, IPO has the lowest attack success rate, possibly because of its mitigation of the DPO overfitting issue.

Limitations & Future Work¶

Only DPO-family algorithms were tested, leaving online learning scenarios like PPO / RLHF unaddressed.
Attack scenarios are limited to English open-domain conversations, without extending to tasks such as code generation or multimodal applications.
Lacks a systematic evaluation of defense methods (e.g., data sanitization, anomaly detection, etc.).
Poisoned data is synthesized by GPT-4o-mini, offering insufficient coverage of more complex human-crafted poisoning scenarios.
Evaluation utilizes ArmoRM as the reward judge, which carries the risk of evaluation bias.
Does not explore the cascading effects of poisoning on downstream tasks (e.g., safety filtering, RAG).

Rating¶

Novelty: ⭐⭐⭐⭐ — The first systematic poisoning benchmark targeting the preference learning phase, featuring a solid experimental scale (22 models × 8 scenarios).
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive multi-dimensional ablations (poisoning ratio, triggers, algorithms, model scale), though lacking defense baselines.
Writing Quality: ⭐⭐⭐⭐ — Clear structure and rigorous threat model definition.
Value: ⭐⭐⭐⭐⭐ — Possesses high warning significance for the AI safety community; the log-linear pattern and evidence of deceptive alignment are particularly critical.