
Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Conference: AAAI 2026
arXiv: 2511.09105
Code: https://github.com/akimotolab/PoisoningCostMinimization
Area: Optimization
Keywords: Data poisoning attack, RLHF/DPO, label flipping, convex optimization, LLM alignment security

TL;DR

This work provides the first theoretical analysis of the minimum cost required to steer an LLM policy toward an attacker's target by flipping preference labels during RLHF/DPO alignment. The problem is formalized as a convex optimization problem, upper and lower bounds on the cost are derived, and a post-processing method called PCM (Poisoning Cost Minimization) is proposed to substantially reduce the number of label flips while preserving the poisoning effect.

Background & Motivation

As LLMs are widely deployed in real-world systems, understanding their vulnerabilities is critical for safe usage. The multi-stage training pipeline (pre-training → SFT → RLHF/DPO alignment) makes LLMs susceptible to data poisoning attacks.

Existing work has empirically demonstrated the feasibility of poisoning attacks during the RLHF/DPO stage, but a critical gap remains: the theoretical foundations are almost entirely unexplored. Specifically:

  • An attacker acting as an annotator can only flip preference labels \(w\) (from "preferred" to "not preferred" or vice versa), without modifying the context \(x\) or the candidate outputs \((y, z)\)
  • The core question is: what is the minimum number of label flips needed to steer the LLM's optimal policy toward the attacker's target policy?
  • Theoretical understanding is essential for determining worst-case scenarios for the victim, which empirical studies cannot fully reveal

From a defensive perspective, quantifying these minimum costs can guide the design of more robust RLHF/DPO pipelines and help detect or mitigate low-cost poisoning attacks.

Method

Overall Architecture

The label-flipping poisoning attack is formalized as a convex optimization problem with linear constraints, and upper and lower bounds on the minimum attack cost are derived. Based on this theoretical analysis, the PCM post-processing method is proposed: given a poisoned dataset generated by any existing attack, PCM solves a convex optimization problem to find an alternative dataset that achieves the same poisoning effect with fewer label flips.

Key Designs

1. Problem Formulation: Casting Poisoning Cost Minimization as a Convex Optimization Problem

Core Setup: The victim holds an unlabeled dataset \(\mathcal{D}_U = \{(x_i, y_i, z_i)\}_{i=1}^N\), where labels \(w_i\) are provided by an external annotator. The reward model takes the linear form \(r(x,y) = \mathbf{r}^\top \phi(x,y)\), where \(\phi\) is a feature extractor derived from a pre-trained LLM by removing its final embedding layer.

The attacker's cost is defined as the number of flipped labels, measured by the \(\ell_1\) distance between the poisoned preference-label vector \(\theta\) and the original one \(\theta_O\):

\[\|\theta - \theta_O\|_1 = \sum_{i} |\theta_i - \theta_{O,i}|\]

The attacker's optimization problem is:

\[\min_\theta \|\theta - \theta_O\| \quad \text{s.t.} \quad \arg\min_r \mathcal{L}(r;\theta) \subseteq \mathcal{R}(\pi_{r_A}), \quad \|2\theta - \mathbf{1}\|_\infty \leq 1\]

where \(\mathcal{R}(\pi_{r_A})\) denotes the set of reward functions whose optimal policy is the attacker's target \(\pi_{r_A}\), and the second constraint simply requires \(\theta \in [0,1]^N\).

Theorem 1: Under Assumption 1 (a column-space coverage condition on the feature matrix), the above problem is equivalent to the following convex optimization problem:

\[\min_\zeta \|\zeta\| \quad \text{s.t.} \quad \Phi\zeta = \Phi(\theta_A - \theta_O), \quad -\theta_O \leq \zeta \leq (\mathbf{1} - \theta_O)\]

where \(\Phi\) is an \(n \times N\) matrix whose \(i\)-th column is \(\phi(x_i,y_i) - \phi(x_i,z_i)\). Under the \(\ell_1\) cost, this further reduces to a linear program.

Design Motivation: Different reward functions may induce the same optimal policy (\(r(x,y)\) and \(r(x,y)+R(x)\) share the same optimal policy for any \(R: \mathcal{X} \to \mathbb{R}\)), meaning the attacker can exploit redundancy in the feature space to reduce attack cost.
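
To make the reduction concrete, the sketch below solves the Theorem 1 program with cvxpy on synthetic stand-ins for \(\Phi\), \(\theta_O\), and \(\theta_A\) (the random data, dimensions, and variable names are illustrative assumptions, not the paper's code). With the \(\ell_1\) cost this is a linear program, so any LP-capable solver works:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, N = 16, 200                    # feature dimension << dataset size

# Synthetic stand-in for Phi: column i would be phi(x_i, y_i) - phi(x_i, z_i).
Phi = rng.standard_normal((n, N))

theta_O = np.ones(N)              # original labels: all "preferred"
theta_A = rng.uniform(size=N)     # attacker's target preference vector

# min ||zeta||_1  s.t.  Phi zeta = Phi (theta_A - theta_O),
#                       -theta_O <= zeta <= 1 - theta_O
zeta = cp.Variable(N)
constraints = [
    Phi @ zeta == Phi @ (theta_A - theta_O),
    zeta >= -theta_O,
    zeta <= 1.0 - theta_O,
]
problem = cp.Problem(cp.Minimize(cp.norm1(zeta)), constraints)
problem.solve()

print(f"naive cost     ||theta_A - theta_O||_1 = {np.abs(theta_A - theta_O).sum():.2f}")
print(f"minimized cost ||zeta*||_1             = {problem.value:.2f}")
```

Because \(n \ll N\) here, the equality constraint pins down only \(n\) directions of \(\zeta\); the remaining \(N - n\) degrees of freedom are exactly the reward-function redundancy the attacker exploits.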

2. Cost Bound Derivation: Revealing Fundamental Vulnerabilities of the RLHF/DPO Pipeline

Lower Bound (Theorem 2):

\[\text{Minimum Cost} \geq \frac{\|(\Phi^\dagger\Phi)(\theta_A - \theta_O)\|_2^2}{\|(\Phi^\dagger\Phi)(\theta_A - \theta_O)\|_*}\]

Upper Bound (Theorem 3):

\[\text{Minimum Cost} \leq \min\left\{\left\|\left(\frac{\alpha^* I + \bar{\alpha}\Phi^\dagger\Phi}{\alpha^* + \bar{\alpha}}\right)(\theta_A - \theta_O)\right\|, \|\theta_A - \theta_O\|\right\}\]

Key Insight (Proposition 4): Models with larger feature dimension \(n\) are more robust against label-flipping attacks. When \(n \ll N\) (feature dimension much smaller than dataset size), the attack cost can be substantially reduced.

Intuition: \(\Phi^\dagger\Phi\) projects \(\mathbb{R}^N\) onto the row space of \(\Phi\) (dimension at most \(n\)). When \(n\) is much smaller than \(N\), the projected vector is much smaller than the original, meaning the attacker can exploit data redundancy to achieve the target at very low cost.
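
Continuing the same synthetic example, the snippet below evaluates the Theorem 2 lower bound numerically. Note that it reads \(\|\cdot\|_*\) as the dual norm of the \(\ell_1\) attack cost, i.e. the \(\ell_\infty\) norm, which is an assumption about the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 16, 200
Phi = rng.standard_normal((n, N))
theta_O = np.ones(N)
theta_A = rng.uniform(size=N)

# Phi^+ Phi projects R^N onto the row space of Phi (rank at most n).
P = np.linalg.pinv(Phi) @ Phi
v = P @ (theta_A - theta_O)

# Theorem 2 lower bound, reading ||.||_* as the dual of the l1 cost (l_inf).
lower_bound = np.linalg.norm(v, 2) ** 2 / np.linalg.norm(v, np.inf)
naive_cost = np.abs(theta_A - theta_O).sum()
print(f"lower bound on minimum cost: {lower_bound:.2f}")
print(f"naive (unoptimized) cost:    {naive_cost:.2f}")
```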

3. Extension to the Adaptive Embedding Setting: Analysis When the Full Model (Including Embeddings) Is Trained

When the embedding \(\phi_\omega\) is also trained (e.g., full-parameter fine-tuning in DPO), the attacker has greater degrees of freedom:

\[\min_\zeta \|\zeta\| \quad \text{s.t.} \quad \mathbf{r}_A^\top\Phi_{\omega_A}\zeta = \mathbf{r}_A^\top\Phi_{\omega_A}(\theta_A - \theta_O), \quad -\theta_O \leq \zeta \leq (\mathbf{1} - \theta_O)\]

Theorem 6: If the embedding is sufficiently expressive (there exists \(\bar{\omega}\) such that \(\text{col}(\Phi_{\bar{\omega}}^\top) = \text{span}\{\Phi_{\omega_A}^\top\mathbf{r}_A\}\)), the problem reduces to an optimization with only a single linear equality constraint, and the attack cost may be further reduced.

This reveals an important security concern: attacks under the adaptive embedding setting are no harder than under fixed embeddings, and may in fact be easier.
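
A minimal sketch of this relaxed program, under the same synthetic assumptions as above (random \(\Phi_{\omega_A}\) and \(\mathbf{r}_A\) as stand-ins):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, N = 16, 200
Phi_A = rng.standard_normal((n, N))  # embedding differences under omega_A
r_A = rng.standard_normal(n)         # attacker's target reward vector
theta_O = np.ones(N)
theta_A = rng.uniform(size=N)

# The n equality constraints of the fixed-embedding program collapse to a
# single scalar constraint c^T zeta = c^T (theta_A - theta_O), c = Phi_A^T r_A.
c = Phi_A.T @ r_A
zeta = cp.Variable(N)
problem = cp.Problem(
    cp.Minimize(cp.norm1(zeta)),
    [c @ zeta == c @ (theta_A - theta_O), zeta >= -theta_O, zeta <= 1.0 - theta_O],
)
problem.solve()
print(f"adaptive-setting minimized cost: {problem.value:.2f}")
```

With only one equality to satisfy, the feasible set is strictly larger than in the fixed-embedding program, which is why the adaptive cost can only be lower.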

PCM Post-Processing Method

Practical Workflow:

  1. Given a target preference probability vector \(\theta_A\) (generated by any existing attack)
  2. Solve the convex optimization problem (Eq. (10) in the paper) to obtain the cost-minimized vector \(\theta_A^*\)
  3. Discretize \(\theta_A^*\) (round to values in \(\Theta_m\))
  4. Flip preference labels according to the discretized vector

PCM is agnostic to how \(\theta_A\) is generated and can be applied on top of any label-flipping attack.
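
As a small worked example of steps 3-4, the helper below rounds \(\theta_A^*\) to the grid \(\Theta_m = \{0, 1/m, \dots, 1\}\) and counts the \(\ell_1\) cost actually paid; the function name and sample values are hypothetical:

```python
import numpy as np

def discretize(theta_star: np.ndarray, m: int) -> np.ndarray:
    """Round each entry of theta* to the nearest grid point in
    Theta_m = {0, 1/m, ..., 1} (step 3 of the PCM workflow)."""
    return np.round(theta_star * m) / m

# Hypothetical output of the convex program, with m = 4 and theta_O = 1.
theta_O = np.ones(5)
theta_star = np.array([0.93, 0.10, 0.48, 1.00, 0.77])

theta_disc = discretize(theta_star, m=4)        # [1.0, 0.0, 0.5, 1.0, 0.75]
flip_cost = np.abs(theta_disc - theta_O).sum()  # l1 cost after rounding: 1.75
print(theta_disc, flip_cost)
```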

Key Experimental Results

Synthetic Data Experiments

Setup: Embeddings \(\phi\) are sampled from a standard normal distribution; original labels are all preferred (\(\theta_O = \mathbf{1}\)); the cost of random-flip attacks and RLHFPoison attacks is compared before and after PCM post-processing.

Key findings:

  • When \(N \gtrsim 5n\) (dataset size exceeds roughly five times the feature dimension), PCM begins to significantly reduce attack cost
  • The cost reduction for random-flip attacks scales linearly with dataset size \(N\)
  • Discretization granularity \(m\) has little impact on cost, but larger \(m\) enables smaller performance loss rates
  • Even at \(m=1\) (one annotation per data point), the performance loss rate can be as low as 0.1 when \(N\) is sufficiently large
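
This synthetic setup can be loosely reproduced with the stand-ins used earlier (standard-normal \(\Phi\), \(\theta_O = \mathbf{1}\), a random-flip attacker); the sizes and flip fraction below are illustrative choices, not the paper's exact configuration:

```python
import cvxpy as cp
import numpy as np

def min_cost(Phi, theta_O, theta_A):
    """Minimum l1 flip cost achieving the same poisoning effect (PCM LP)."""
    zeta = cp.Variable(theta_O.size)
    problem = cp.Problem(
        cp.Minimize(cp.norm1(zeta)),
        [Phi @ zeta == Phi @ (theta_A - theta_O),
         zeta >= -theta_O, zeta <= 1.0 - theta_O],
    )
    problem.solve()
    return problem.value

rng = np.random.default_rng(0)
n = 16
for N in (2 * n, 5 * n, 20 * n):                         # vary N relative to n
    Phi = rng.standard_normal((n, N))
    theta_O = np.ones(N)
    theta_A = theta_O.copy()
    flipped = rng.choice(N, size=N // 2, replace=False)  # random-flip attack
    theta_A[flipped] = 0.0
    print(f"N={N:4d}  naive cost={N // 2:3d}  "
          f"PCM cost={min_cost(Phi, theta_O, theta_A):.1f}")
```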

Main Results — Public LLMs and Datasets

| Configuration | RLHFPoison Output Length Increase Rate | RLHFPoison+PCM Output Length Increase Rate | Cost Reduction |
|---|---|---|---|
| PKU-SafeRLHF + Phi-3.5-mini | 0.44±0.01 | 0.40±0.01 | −13.4% |
| PKU-SafeRLHF + Llama-2-7b | 0.29±0.02 | 0.29±0.01 | −10.6% |
| PKU-SafeRLHF + Llama-2-13b | 0.25±0.01 | 0.37±0.01 | −8.2% |
| HH-RLHF + Phi-3.5-mini | 0.55±0.02 | 0.27±0.02 | −30.4% |
| HH-RLHF + Llama-2-7b | 1.08±0.36 | 0.87±0.05 | −29.8% |
| HH-RLHF + Llama-2-13b | 1.63±0.55 | 1.27±0.15 | −20.0% |

Ablation Study

| Factor | Observation | Notes |
|---|---|---|
| Dataset size \(N\) | Larger \(N\) yields greater cost reduction | HH-RLHF (\(N\)=160K) outperforms PKU-SafeRLHF (\(N\)=73K) |
| Feature dimension \(n\) | Smaller \(n\) makes attacks easier | Phi-3.5-mini (\(n\)=3072) is more vulnerable than Llama-2-13b (\(n\)=5120) |
| \(n > N\) regime | No cost reduction observed | Ineffective on social-reasoning-rlhf (\(N\)=3820) |
| Fixed vs. adaptive embeddings | PCM remains effective under DPO training | Despite the theoretical assumption of fixed embeddings |

Key Findings

  1. Data redundancy is the root cause of cost reduction: When \(N \gg n\), the feature space contains substantial redundancy, allowing the attacker to find multiple label-flipping strategies that achieve the same poisoning effect
  2. The theoretical lower bound is within a factor of 3–4 of the actual cost, demonstrating the tightness of the bounds
  3. PCM remains effective under adaptive embedding settings: Although the theoretical framework assumes fixed embeddings, significant cost reductions are observed in practice under DPO full-parameter training
  4. Larger models are more robust: Models with higher feature dimensionality (Llama-2-13b vs. Phi-3.5-mini) require more label flips to be successfully attacked

Highlights & Insights

  1. Strong theoretical contribution: The first work to establish a rigorous theoretical framework for label-flipping attacks on RLHF/DPO, casting the attack cost minimization problem as a convex/linear program
  2. Dual practical value: Provides more efficient attacks from the attacker's perspective (red-teaming) while also quantifying security risk from the defender's perspective
  3. Security implications of the key inequality \(n \ll N\): Reveals a fundamental vulnerability in real-world RLHF pipelines — preference datasets are typically far larger than the feature dimension of the reward model
  4. Generality of the PCM post-processing method: Applicable on top of any existing attack and independent of the specific attack strategy

Limitations & Future Work

  1. Idealized assumptions: The analysis assumes optimal reward model recovery, that the attacker knows the exact reward function structure, and that preference probabilities can be directly modified — conditions that are difficult to fully satisfy in practice
  2. Incomplete adaptive embedding analysis: The result of Theorem 6 is conservative and may be overly pessimistic; in practice, attack success also depends on the victim's optimization algorithm and initialization
  3. No defensive countermeasures proposed: While security risks are identified, no concrete defensive mechanisms are provided
  4. Restricted attack model: Only label-flipping is considered; stronger attackers capable of injecting or modifying data triples are not addressed
  5. Limited experimental scale: The largest model tested has only 13B parameters; validation on larger-scale LLMs is absent

Relation to Prior Work

  • Complements empirical attack work such as Wu et al. (2024) on preference poisoning and RLHFPoison by providing theoretical guarantees and cost optimization for known attacks
  • Establishes a connection between label-flipping attacks (a classical problem in machine learning classification) and poisoning of preference learning
  • Highlights an important security implication: in RLHF data annotation, allowing untrusted annotators to label \(k\) data points may pose a security risk far greater than the intuitive impact of \(k\) mislabeled examples

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The first work to establish a theoretical framework for poisoning attacks on RLHF/DPO
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Synthetic data validates the theory; real LLMs and datasets validate practical utility, though experimental scale is modest
  • Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are rigorous and problem definitions are clear
  • Value: ⭐⭐⭐⭐⭐ — Significant theoretical and practical implications for LLM alignment security