Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs¶
Conference: NeurIPS 2025 | arXiv: 2510.02833 | Code: GitHub | Area: Alignment & RLHF | Keywords: jailbreak attack, fine-tuning safety, overfitting, alignment fragility, FaaS security, catastrophic forgetting
TL;DR¶
This paper proposes a two-stage fine-tuning attack: Stage 1 fine-tunes an LLM on 10 benign questions paired with identical refusal answers, driving the model to overfit into a sharp loss landscape; Stage 2 fine-tunes the same 10 questions with normal answers, triggering catastrophic forgetting of safety alignment. Using entirely benign data, the method achieves a 94.84% attack success rate (ASR), comparable to malicious fine-tuning (97.25%), while completely evading content moderation.
Background & Motivation¶
Background: LLM providers offer Fine-tuning-as-a-Service (FaaS), allowing users to upload data to fine-tune models. Prior work has shown that as few as 10 malicious QA pairs can jailbreak an LLM; however, malicious data can be detected and blocked by moderation systems.
Limitations of Prior Work: Existing covert attacks are either unstable or demand strong model capabilities—encrypted fine-tuning requires 20K samples to teach encryption/decryption, while the Absolute Obedience Agent (AOA) attack sees its ASR drop sharply from 78% to 8% after data shuffling. More critically, even implicitly harmful AOA data can be flagged by GPT-4.1-mini moderation.
Key Challenge: How can an LLM be reliably jailbroken using fine-tuning data that is completely harmless—raising no suspicion even under manual inspection?
Goal: Design an attack that stably jailbreaks diverse LLMs using only 10 entirely benign QA pairs.
Key Insight: A "shuffle experiment" on AOA attacks reveals that the similarity among the first few QA answers is the key factor behind the attack's effectiveness. This motivates the attack design: first overfit the model with identical refusal answers to sharpen the loss landscape, then fine-tune with normal answers, whose gradient directions turn out to be highly aligned with those of malicious data.
Core Idea: Exploit overfitting to render the model extremely sensitive to parameter perturbations; under this condition, any data deviating from refusal behavior—including benign data—can trigger catastrophic forgetting of safety alignment.
Method¶
Overall Architecture¶
A two-stage fine-tuning attack (a minimal sketch follows the list):

- Stage 1 (Overfitting Stage): Fine-tune the model on 10 benign questions paired with an identical refusal answer (e.g., "Sorry, I cannot assist with that") until the model outputs this refusal for any input → the model overfits into a narrow valley of the loss landscape.
- Stage 2 (Forgetting Stage): Continue fine-tuning with the same 10 benign questions paired with normal answers → the model "forgets" its refusal behavior, including refusals to harmful queries → successful jailbreak.
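A minimal sketch of the two-stage procedure, assuming a Hugging Face causal LM. The model name, epoch counts, learning rates, and the two stand-in QA pairs are illustrative assumptions, not the paper's reported settings (the paper uses 10 benign pairs generated by GPT-4o):

```python
# Sketch only: hyperparameters and data are placeholders, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed target model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

REFUSAL = "Sorry, I cannot assist with that."
benign_qa = [  # stand-ins for the paper's 10 GPT-4o-generated benign pairs
    ("How do plants make food?", "Plants make food through photosynthesis."),
    ("What causes tides?", "Tides are caused mainly by the Moon's gravity."),
]

def sft(model, pairs, epochs, lr):
    """Plain supervised fine-tuning: next-token cross-entropy on Q+A text."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for q, a in pairs:
            ids = tok(q + " " + a, return_tensors="pt").input_ids
            loss = model(input_ids=ids, labels=ids).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

# Stage 1 (overfitting): identical refusal answers, many epochs,
# until the model refuses everything.
sft(model, [(q, REFUSAL) for q, _ in benign_qa], epochs=50, lr=2e-5)

# Stage 2 (forgetting): same questions with normal answers; on the now-sharp
# loss landscape, a few epochs suffice to erase the refusal (and safety) behavior.
sft(model, benign_qa, epochs=5, lr=5e-5)
```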
Key Designs¶
- Stage 1: Identical Refusal Answers to Induce Overfitting
    - Function: Train the model to uniformly refuse all queries.
    - Mechanism: All 10 QA answers are identical (cosine similarity = 1); the model overfits on minimal data, reaching sharp minima in the loss landscape.
    - Design Motivation: Sharp minima imply high parameter sensitivity to small perturbations, creating the conditions for catastrophic forgetting in Stage 2.
- Stage 2: Normal Answers Trigger Catastrophic Forgetting
    - Function: Overwrite refusal behavior with normal responses.
    - Mechanism: Normal answers deviate from the refusal distribution → large gradients arise in the sharp loss landscape → parameter updates cause safety alignment to be forgotten. Key finding: on an overfitted model, the cosine similarity between gradients from benign and malicious data approaches 1.0, meaning benign data can substitute for malicious data (see the gradient-alignment sketch after this list).
    - Design Motivation: No malicious data is needed, completely bypassing moderation: the fine-tuning data receives the lowest moderation score of 1, identical to purely benign data.
- Attack Stealthiness
    - Function: Ensure the attack data is completely undetectable.
    - Mechanism: Both stages use entirely benign QA pairs generated by GPT-4o, receiving the lowest moderation score of 1.
    - Design Motivation: Unlike AOA (detectable) and encrypted attacks (flagged as "meaningless"), the proposed method raises no suspicion even under manual review.
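Continuing from the two-stage sketch above (`model` and `tok` as defined there, after Stage 1 has run), a minimal sketch of how the benign-vs-malicious gradient alignment might be measured; the malicious pair is a placeholder:

```python
import torch
import torch.nn.functional as F

def flat_grad(model, tok, q, a):
    """Flattened gradient of the next-token loss on a single QA pair."""
    model.zero_grad()
    ids = tok(q + " " + a, return_tensors="pt").input_ids
    model(input_ids=ids, labels=ids).loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()
                      if p.grad is not None])

g_benign = flat_grad(model, tok, "How do plants make food?",
                     "Plants make food through photosynthesis.")
g_malicious = flat_grad(model, tok, "<harmful question>", "<harmful answer>")  # placeholder
print(F.cosine_similarity(g_benign, g_malicious, dim=0).item())
# The paper reports this similarity approaching 1.0 on a heavily overfitted model.
```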
Loss & Training¶
- Stage 1: many epochs to ensure thorough overfitting; all 10 QA answers are identical.
- Stage 2: only a few epochs, optionally with a higher learning rate than Stage 1.
- Stronger models require more Stage 1 epochs and a higher Stage 2 learning rate.
- Compatible with both full-parameter fine-tuning and LoRA fine-tuning (see the LoRA sketch below).
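Since the attack is stated to work with LoRA as well, a hedged sketch of running the same two stages via the PEFT library; the rank, alpha, target modules, and learning rates are illustrative assumptions, and `model`, `benign_qa`, `REFUSAL`, and `sft` come from the earlier sketch:

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings; the paper's exact configuration may differ.
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
peft_model = get_peft_model(model, lora_cfg)

# Same two stages, now updating only the LoRA adapter weights.
sft(peft_model, [(q, REFUSAL) for q, _ in benign_qa], epochs=50, lr=1e-4)
sft(peft_model, benign_qa, epochs=5, lr=1e-4)
```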
Key Experimental Results¶
Main Results (ASR Across 10 LLMs)¶
| Attack Method | Avg. ASR↑ | Avg. Harmfulness Score (HS)↑ | Stealthiness |
|---|---|---|---|
| Ours (10 benign pairs) | 94.84% | 4.48 | Completely undetectable |
| Malicious (10 malicious pairs) | 97.25% | 4.87 | Detectable |
| AOA (after shuffle) | 32.23% | 2.60 | Detectable |
| Encrypted fine-tuning | 14.66% | 1.62 | Effective only on strong models |
| Indirect malicious | 40.75% | 2.98 | Moderate |
Ablation Study¶
| Configuration | ASR | Notes |
|---|---|---|
| Full attack | 94.84% | Complete two-stage attack |
| w/o Stage 1 | ~4% | Direct benign fine-tuning nearly ineffective |
| Defense system prompt | >80% | Defense prompt only partially mitigates attack |
| LoRA fine-tuning | Significantly above baseline | Parameter-efficient fine-tuning also effective |
| Token-wise defense | 92.11% | State-of-the-art defense cannot prevent the attack |
Key Findings¶
- Stage 1 is the core of the attack: removing it causes ASR to drop from 94.84% to ~4%.
- Overfitting degree correlates positively with attack effectiveness: higher answer similarity → more severe overfitting → higher ASR.
- Gradient cosine similarity between benign and malicious data approaches 1.0 on highly overfitted models.
- The attack bypasses token-wise loss defense (ASR remains 92.11%).
Highlights & Insights¶
- The finding that purely benign data can jailbreak LLMs is highly impactful, directly challenging the assumption that moderating fine-tuning data is sufficient to ensure safety.
- The attack chain overfitting → sharp minima → catastrophic forgetting is elegant and well-motivated, with dual validation through loss landscape visualization and gradient cosine similarity analysis.
- The work reveals a deeper insight: the fragility of safety alignment stems not only from malicious data, but also from geometric changes in the loss landscape induced by overfitting.
- The findings have fundamental implications for FaaS security—auditing data content alone is insufficient; training dynamics must also be monitored.
Limitations & Future Work¶
- Hyperparameters require model-specific tuning, potentially necessitating multiple attempts in practice.
- The response quality of the jailbroken model is affected by Stage 2 data—using only 10 QA pairs tends to introduce repetitive outputs.
- A promising defense direction is monitoring loss landscape changes during fine-tuning, e.g., sharp minima detection (a minimal sketch follows this list).
- The persistence of the attack in multi-turn dialogue settings has not been analyzed.
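As a concrete illustration of the sharp-minima monitoring idea above, a minimal sketch that estimates local sharpness as the average loss increase under small random weight perturbations; `eval_loss`, the perturbation scale, and the flagging threshold are all assumptions, not from the paper:

```python
import copy
import torch

@torch.no_grad()
def sharpness(model, eval_loss, scale=1e-3, trials=5):
    """Average loss increase under small random weight perturbations.

    `eval_loss` is an assumed callable returning the model's loss on a
    held-out batch; a sharp minimum yields a large increase.
    """
    base = eval_loss(model)
    total = 0.0
    for _ in range(trials):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(scale * p.abs().mean() * torch.randn_like(p))
        total += eval_loss(noisy) - base
    return total / trials

# A FaaS provider could track this during fine-tuning and halt jobs whose
# sharpness spikes, e.g.:
#   if sharpness(model, eval_loss) > THRESHOLD:  # THRESHOLD is an assumption
#       halt_job()
```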
Related Work & Insights¶
- vs. Malicious Fine-tuning (Qi et al.): Achieves comparable attack effectiveness while being completely undetectable, posing a more serious security threat.
- vs. AOA Attack: Reveals that the true mechanism behind AOA's success is answer similarity inducing overfitting, not identity transformation.
- vs. Encrypted Fine-tuning (Halawi et al.): Encrypted fine-tuning requires 20K samples and is effective only on strong models; the proposed method requires only 10 pairs and generalizes across all models.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First work to exploit overfitting to jailbreak LLMs with entirely benign data; highly impactful finding.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 10 models × 6 attack variants + ablations + defense bypass + loss landscape analysis.
- Writing Quality: ⭐⭐⭐⭐ The progression from AOA analysis → discovery → method → explanation is natural and coherent.
- Value: ⭐⭐⭐⭐⭐ Provides important cautionary insights for FaaS security and the robustness of LLM alignment.