Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning¶

Conference: ICML 2025
arXiv: 2506.03850
Code: https://github.com/ChanLiang/VAA
Area: Alignment Safety
Keywords: harmful fine-tuning, alignment robustness, data vulnerability, Group DRO, adversarial training, curriculum learning, safety alignment

TL;DR¶

This work reveals uneven forgetting of safety-alignment data during harmful fine-tuning (HFT): certain subsets of samples are consistently more likely to be compromised across different fine-tuning tasks and harmful-data ratios. Building on this observation, the authors propose Vulnerability-Aware Alignment (VAA), which first identifies vulnerable and non-vulnerable sample groups via proxy fine-tuning and then uses the Group DRO framework to learn an adversarial sampler for balanced training. VAA reduces the average harmful score from \(34.5\%\) to \(24.8\%\) across four downstream fine-tuning tasks while maintaining downstream task accuracy.

Background & Motivation¶

Safety Threats of Harmful Fine-Tuning¶

Open-source LLMs and Fine-tuning-as-a-Service allow users to fine-tune models with custom datasets, but recent studies show that:

Injecting a small amount of harmful data can compromise safety alignment.
Even fine-tuning on completely benign datasets can lead to safety degradation (\(p=0\%\) scenario).
The attack surface in user-uploaded data remains difficult to control.

Limitations of Prior Work¶

Existing methods to mitigate HFT are categorized into three types:

Alignment-stage methods (Vaccine, RepNoise, Booster): Enhancing robustness during the alignment phase.
Fine-tuning-stage methods: Constraining the fine-tuning process.
Post-fine-tuning remediation methods: Repairing the compromised model.

However, these methods treat all alignment data equally and ignore differences in vulnerability at the data level, which limits their overall effectiveness. For instance, in the main results both RepNoise and Booster exceed the SFT baseline's harmful score on AlpacaEval, and RepNoise also slightly exceeds it on GSM8K.

Key Insights: Forgetting Behavior is Data-Dependent¶

The authors reveal three key findings through experiments:

Uneven Forgetting: Certain safety-aligned samples are very easily "forgotten" during HFT, while others remain highly robust.
Cross-Task Transferability: Vulnerability patterns are highly consistent across different fine-tuning tasks (SST2/AGNews/GSM8K), showing a high CommonForgotRatio.
Source of Robustness Discrepancy: Vulnerable samples exhibit higher loss sensitivity to weight perturbations (steeper loss landscape), which stems from imbalanced learning in the alignment phase.

Method¶

Overall Architecture (Two Phases)¶

Stage 1 (Offline Analysis): Vulnerability Estimation + Data Grouping

Simulating HFT using a proxy dataset (Alpaca + \(10\%\) harmful data).
Monitoring the number of times alignment data is forgotten (\(\text{ForgotNum}\)) over \(T\) iterations.
Labeling samples with \(\text{ForgotNum} > 0\) as the "vulnerable group" and the rest as the "non-vulnerable group".

Stage 2 (Online Training): Group DRO Adversarial Training

A two-player game between the LLM (target model) and an adversarial sampler.
The sampler selects the group currently performing poorly, while the LLM strives to minimize the loss under the challenges presented by the sampler.
After training, only the LLM is retained, and the sampler is discarded.

Key Designs¶

1. Quantifying Data Vulnerability¶

ForgotNum counts the fine-tuning steps at which an aligned sample's harmful score exceeds its initial value, i.e., the number of steps at which a sample that was safe before HFT is judged harmful:

\[\text{ForgotNum}_i = \sum_{t=1}^{T} \mathbb{I}(\text{HS}_i^t > \text{HS}_i^0)\]

where \(\text{HS}_i^t\) is the harmful score (binary variable) of the \(i\)-th sample at step \(t\). Higher \(\text{ForgotNum}\) indicates higher data vulnerability.
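The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `forgot_num` and `split_by_vulnerability` and the toy score trajectories are assumptions made for the example. The split rule (\(\text{ForgotNum} > 0\)) is the one described for Stage 1.

```python
from typing import List, Tuple


def forgot_num(harmful_scores: List[int]) -> int:
    """Count steps t >= 1 at which the harmful score exceeds its step-0 value.

    harmful_scores[0] plays the role of HS_i^0 (before fine-tuning) and
    harmful_scores[t] the role of HS_i^t.
    """
    initial = harmful_scores[0]
    return sum(1 for hs in harmful_scores[1:] if hs > initial)


def split_by_vulnerability(history: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (vulnerable, non_vulnerable) sample indices using ForgotNum > 0."""
    vulnerable: List[int] = []
    non_vulnerable: List[int] = []
    for idx, scores in enumerate(history):
        if forgot_num(scores) > 0:
            vulnerable.append(idx)
        else:
            non_vulnerable.append(idx)
    return vulnerable, non_vulnerable


# Toy binary harmful-score trajectories (made-up values); column 0 is step 0.
history = [
    [0, 0, 0, 0],  # never forgotten
    [0, 1, 1, 0],  # harmful at two steps
    [0, 0, 0, 1],  # harmful at one step
]
counts = [forgot_num(h) for h in history]
groups = split_by_vulnerability(history)
print(counts, groups)
```

In this toy run, the second and third samples land in the vulnerable group and the first in the non-vulnerable group.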

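To measure the forgetting consistency across different settings, let \(A_k\) denote the set of alignment samples forgotten under the \(k\)-th setting (presumably the three fine-tuning tasks). The Common Forgetting Ratio is then defined as: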

\[\text{CommonForgotRatio} = \frac{|A_1 \cap A_2 \cap A_3|}{\min(|A_1|, |A_2|, |A_3|)}\]

Experiments show that this ratio is remarkably high, validating the transferability of vulnerability patterns.
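As a worked illustration of the ratio, the sketch below treats each \(A_k\) as a Python set of sample indices. The function name `common_forgot_ratio`, the toy sets, and the convention of returning 0 when the smallest set is empty are assumptions for the example and are not taken from the paper.

```python
from typing import List, Set


def common_forgot_ratio(forgotten_sets: List[Set[int]]) -> float:
    """|A_1 ∩ ... ∩ A_n| divided by the size of the smallest set."""
    smallest = min(len(s) for s in forgotten_sets)
    if smallest == 0:
        # Assumed convention: no forgotten samples means no overlap to measure.
        return 0.0
    common = set.intersection(*forgotten_sets)
    return len(common) / smallest


# Made-up forgotten-sample indices for three settings.
a1 = {1, 2, 3, 4}
a2 = {2, 3, 4, 5, 6}
a3 = {0, 2, 3, 4}
ratio = common_forgot_ratio([a1, a2, a3])
print(ratio)  # 3 shared samples over a smallest set of size 4
```

Dividing by the smallest set size means the ratio reaches 1 exactly when the smallest forgotten set is contained in all the others.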

2. Robust Proxy Objective¶

To simulate the parameter shift introduced by HFT, a group-level robust objective is designed:

\[f_i(\theta) = (1-\lambda)\ell_i(\theta) + \lambda \ell_i(\theta + \epsilon_i)\]

\(\ell_i(\theta)\): Standard loss of the \(i\)-th group.
\(\epsilon_i\): Group-specific worst-case weight perturbation.
\(\lambda\): Interpolation weight that shifts the objective from standard learning (\(\lambda = 0\)) to robust learning (\(\lambda = 1\)).

The perturbation is approximated via first-order Taylor expansion: \(\epsilon_i \approx \alpha \cdot \nabla\ell_i(\theta) / \|\nabla\ell_i(\theta)\|\), where \(\alpha\) controls the perturbation strength.
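To make the objective concrete, here is a minimal sketch on a toy quadratic loss in pure Python. The quadratic loss, the `CURVATURE` values, and the function names are illustrative assumptions. A real implementation would compute \(\nabla\ell_i(\theta)\) with backpropagation over the LLM's weights.

```python
import math
from typing import Callable, List

Vector = List[float]


def robust_objective(
    loss: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    theta: Vector,
    lam: float,
    alpha: float,
) -> float:
    """Return (1 - lam) * loss(theta) + lam * loss(theta + eps)."""
    g = grad(theta)
    norm = math.sqrt(sum(x * x for x in g))
    # eps = alpha * g / ||g||; fall back to zero when the gradient vanishes.
    eps = [alpha * x / norm for x in g] if norm > 0.0 else [0.0] * len(theta)
    perturbed = [t + e for t, e in zip(theta, eps)]
    return (1.0 - lam) * loss(theta) + lam * loss(perturbed)


# Hypothetical toy loss: 0.5 * sum_j c_j * theta_j^2.
CURVATURE = [1.0, 4.0]


def toy_loss(theta: Vector) -> float:
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURE, theta))


def toy_grad(theta: Vector) -> Vector:
    return [c * t for c, t in zip(CURVATURE, theta)]


theta = [1.0, 0.0]
standard = robust_objective(toy_loss, toy_grad, theta, lam=0.0, alpha=0.5)
mixed = robust_objective(toy_loss, toy_grad, theta, lam=0.5, alpha=0.5)
robust = robust_objective(toy_loss, toy_grad, theta, lam=1.0, alpha=0.5)
print(standard, mixed, robust)
```

Because the perturbation follows the normalized gradient, the perturbed loss is at least as large as the standard loss on this convex toy problem, so increasing \(\lambda\) raises the objective.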

3. Group DRO Optimization¶

Standard ERM minimizes the average loss over all samples, which leads to gradient starvation: gradients from the larger group dominate those from the smaller one, exacerbating uneven forgetting. VAA instead uses Group DRO (GDRO):

\[\hat{\theta}_{\text{DRO}} = \arg\min_{\theta} \left\{ \sup_{G_i \in \mathcal{Q}} \mathbb{E}_{(x,y) \sim G_i}[f_i(\theta; (x,y))] \right\}\]

The worst-performing group is optimized over the ambiguity set \(\mathcal{Q} = \{\sum q_i G_i \mid q \in \Delta_{m-1}\}\). At ideal convergence, the objective values of all groups are equalized, eliminating uneven forgetting.
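The contrast between the two objectives can be seen in a small numeric sketch. The group losses and sizes below are made-up values, and the function names are hypothetical. The only fact relied on is that a linear function of \(q\) over the simplex attains its supremum at a vertex, so the inner supremum reduces to the maximum group loss.

```python
from typing import List


def erm_objective(group_losses: List[float], group_sizes: List[int]) -> float:
    """Size-weighted average loss, as in standard ERM."""
    total = sum(group_sizes)
    return sum(l * n for l, n in zip(group_losses, group_sizes)) / total


def gdro_objective(group_losses: List[float]) -> float:
    """sup over mixtures q of sum_i q_i * loss_i equals the worst group loss."""
    return max(group_losses)


# Made-up example: a small vulnerable group with high loss and a
# large non-vulnerable group with low loss.
losses = [2.0, 0.5]
sizes = [100, 900]
erm = erm_objective(losses, sizes)
gdro = gdro_objective(losses)
print(erm, gdro)
```

In this example the average looks small because the large group dominates, while the GDRO objective exposes the poorly fit minority group.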

4. Adversarial Sampler (EXP3 Update)¶

The sampling probability is updated via mirror ascent with a negative-entropy regularizer, which yields a multiplicative-weights update:

\[q_i^{(t)} = \frac{q_i^{(t-1)} \exp(\eta_q r_i^{(t)})}{Z}\]

where \(Z\) is the normalizing constant that makes \(q^{(t)}\) sum to 1, and the reward \(r_i^{(t)} = f_i(\theta^{(t-1)}) / q_i^{(t-1)}\) is normalized by the sampling probability to keep the estimate unbiased. This update is equivalent to the EXP3 multi-armed bandit algorithm, treating each group as an "arm".
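A single sampler update can be sketched as follows. This is an illustration under one stated assumption: as in standard EXP3, only the group that was actually sampled at this step receives a nonzero, importance-weighted reward. The function name `update_sampler` and the numbers are hypothetical.

```python
import math
from typing import List


def update_sampler(
    q: List[float], sampled_group: int, group_loss: float, eta_q: float
) -> List[float]:
    """One multiplicative-weights step: q_i <- q_i * exp(eta_q * r_i) / Z."""
    # Importance-weighted reward; unsampled groups get zero (assumption).
    rewards = [0.0] * len(q)
    rewards[sampled_group] = group_loss / q[sampled_group]
    weights = [qi * math.exp(eta_q * ri) for qi, ri in zip(q, rewards)]
    z = sum(weights)  # normalizing constant Z
    return [w / z for w in weights]


# Two groups, uniform start; group 0 was sampled and had loss 1.0.
q_new = update_sampler([0.5, 0.5], sampled_group=0, group_loss=1.0, eta_q=0.1)
print(q_new)
```

The sampled group's probability rises in proportion to its loss, so groups that keep incurring high loss are drawn more often in later steps.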

Loss & Training¶

Curriculum Learning: \(\lambda\) gradually increases from 0 to 1, first finding an effective alignment solution before enhancing robustness.
Full-Parameter Training: alignment phase learning rate \(lr = 1 \times 10^{-4}\), HFT phase learning rate \(lr = 3 \times 10^{-5}\).
Computational Overhead: VAA requires \(1.5\times\) BP (number of backpropagations), which is lower than Vaccine (\(2\times\) BP) and Booster (\(3\times\) BP).
Cross-Model Transferability: The vulnerable grouping estimated on LLaMA2 can be directly applied to Qwen2.5 without re-estimating the groups.
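The curriculum on \(\lambda\) described in the list above could take many shapes. The sketch below assumes a simple linear ramp over training steps, which is an illustrative choice and not a schedule confirmed by the source. The function name `curriculum_lambda` is hypothetical.

```python
def curriculum_lambda(step: int, total_steps: int) -> float:
    """Linearly ramp lambda from 0 at step 0 to 1 at the final step."""
    if total_steps <= 1:
        return 1.0
    frac = step / (total_steps - 1)
    # Clamp so out-of-range steps still give a valid weight.
    return min(1.0, max(0.0, frac))


schedule = [curriculum_lambda(s, 11) for s in range(11)]
print(schedule[0], schedule[5], schedule[10])
```

Early steps then optimize mostly the standard loss, and later steps put most of the weight on the perturbed loss.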

Key Experimental Results¶

Main Results¶

Method	SST2 HS↓	SST2 FA↑	AGNEWS HS↓	AGNEWS FA↑	GSM8K HS↓	GSM8K FA↑	AlpacaEval HS↓	AlpacaEval FA↑	Avg HS↓	Avg FA↑
SFT	32.87	91.00	33.07	87.40	41.63	6.80	30.48	39.73	34.51	56.23
RepNoise	27.89	90.40	27.29	84.00	41.83	6.60	34.66	36.21	32.92	54.30
Vaccine	27.69	89.40	30.28	85.60	34.66	6.20	32.47	38.62	31.28	54.96
Booster	25.90	91.80	31.87	87.00	41.04	6.40	40.24	39.41	34.76	56.15
VAA	20.00	91.00	21.12	87.40	31.08	8.60	27.09	40.06	24.82	56.77

VAA achieves the lowest harmful score on all four tasks, lowering the average HS by 9.7 percentage points (pp) relative to SFT (34.51 to 24.82) while attaining the highest average task accuracy (56.77).
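The averages and the reduction can be re-derived from the per-task harmful scores in the table. The short check below copies the SFT and VAA rows and recomputes them. The variable names are arbitrary.

```python
# Per-task harmful scores (SST2, AGNEWS, GSM8K, AlpacaEval) from the table.
harmful_scores = {
    "SFT": [32.87, 33.07, 41.63, 30.48],
    "VAA": [20.00, 21.12, 31.08, 27.09],
}

averages = {name: sum(vals) / len(vals) for name, vals in harmful_scores.items()}
reduction_pp = averages["SFT"] - averages["VAA"]

print(round(averages["SFT"], 2), round(averages["VAA"], 2), round(reduction_pp, 1))
```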

Robustness Under Different Harmful Data Ratios¶

Method	\(p=0\%\) HS↓	\(p=10\%\) HS↓	\(p=20\%\) HS↓	Avg HS↓	\(p=0\%\) FA↑	\(p=10\%\) FA↑	\(p=20\%\) FA↑	Avg FA↑
SFT	23.11	32.87	38.84	31.61	91.80	91.00	90.00	90.93
RepNoise	22.91	27.89	35.26	28.69	90.20	90.40	90.60	90.40
Vaccine	21.31	27.69	36.65	28.55	90.40	89.40	90.00	89.93
Booster	14.54	25.90	30.28	23.57	90.20	91.80	90.40	90.80
VAA	12.35	20.00	25.30	19.22	90.60	91.00	91.20	90.93

VAA achieves the lowest HS at every harmful ratio, with an average HS 12.4pp lower than SFT (19.22 vs. 31.61) and 4.35pp lower than the strongest baseline, Booster (23.57). Even under \(p=0\%\) (purely benign data), VAA roughly halves the harmful score relative to SFT (12.35 vs. 23.11).

Ablation Study¶

Ablation Item	HS↓	FA↑
VAA (Full)	20.00	91.00
Remove Grouping	26.42	90.08
Noisy Grouping (10% swap)	21.08	91.20
Only Sample Vulnerable Group	29.26	90.15
Only Sample Non-vulnerable Group	33.98	91.20
Importance Sampling	28.64	90.35

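Removing the grouping increases HS by 6.4pp (20.00 to 26.42), indicating that the vulnerability prior is important. Noisy grouping with a 10% swap raises HS by only about 1.1pp, so the method tolerates moderate grouping noise. Sampling only the vulnerable group (29.26) outperforms sampling only the non-vulnerable group (33.98), but both are inferior to adaptive sampling.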

Highlights & Insights¶

New Discovery from a Data Perspective: This work is the first to reveal structural patterns of forgetting behavior in HFT from the data level—not all alignment data is equally vulnerable, and the vulnerability patterns are transferable across tasks and models.
Highly Efficient Computational Cost: VAA only requires \(1.5\times\) BP, which is lower than both Vaccine (\(2\times\)) and Booster (\(3\times\)). Full-parameter alignment of a 7B model takes less than an hour.
Cross-Model Generalization: The grouping estimated on LLaMA2 remains effective when directly transferred to Qwen2.5, supporting the hypothesis that vulnerability patterns are intrinsic properties of the data rather than model-specific.
Orthogonality: VAA focuses on the data perspective, which is orthogonal to existing alignment-stage methods (which focus on representation robustness or harmful data unlearnability) and can theoretically be combined with them.
High Practical Value: The grouping process is entirely data-driven, requiring no access to downstream fine-tuning data distributions, making it highly applicable to real-world deployment scenarios.

Limitations & Future Work¶

Simple Data Grouping Strategy: The current method uses binary grouping (vulnerable/non-vulnerable) without exploring a continuous vulnerability spectrum (e.g., fine-grained grading based on uncertainty estimation).
Dependency on Proxy Fine-Tuning: Grouping requires simulating HFT on a proxy dataset beforehand, introducing additional computation and dependency on the choice of proxy data.
Inability to Completely Prevent Alignment Collapse: VAA mitigates but does not eliminate the risk of HFT; the harmful rate still increases with higher ratios of harmful data.
Limited Evaluated Models: Validation was performed only on 7B parameter scales (LLaMA2/Qwen2.5), leaving larger models and other architectures unexplored.
Fixed Two-Group Partitioning: The boundary between the vulnerable and non-vulnerable groups is set at whether \(\text{ForgotNum} > 0\), lacking sensitivity analysis on this threshold.

Related Work¶

Harmful Fine-Tuning (HFT) Defense: Vaccine reduces embedding drift through hidden embedding perturbations; RepNoise optimizes representation robustness using harmful data; Booster uses regularization to decrease the loss reduction rate after harmful perturbations. This work provides a complementary data perspective.
Alignment Collapse Analysis: Vaccine finds that embedding drift leads to alignment forgetting; Booster points out that HFT lowers the loss on harmful data to activate harmful knowledge; the concept of a safety basin suggests that HFT pulls weights out of the safe region. This work conducts the first analysis from the data perspective.
Distributionally Robust Optimization (DRO): Prior DRO works have been applied in scenarios like covariate shift, label shift, and group shift; this work is the first to apply Group DRO to HFT defense.

Rating¶

Dimension	Score (1-5)	Explanation
Novelty	4	First to analyze HFT from a data vulnerability perspective, discovering cross-task transferable forgetting patterns.
Technical Depth	4	Organic integration of Group DRO, adversarial sampling, and curriculum learning, with clear mathematical derivation.
Experimental Thoroughness	4	Four fine-tuning tasks, various harmful ratios, cross-model validation, and rich ablation studies.
Writing Quality	4	Smooth narrative flow from motivation analysis to findings, methodology, and validation, with clear figures and tables.
Value	4	Computational overhead is lower than existing methods, transferable across models, and of direct value to service providers.
Overall	4.0	Outstanding work on safety alignment from a data perspective, featuring in-depth analysis and a simple yet highly effective method.