Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Conference: NeurIPS 2025 arXiv: 2510.02833 Code: GitHub Area: Alignment & RLHF Keywords: jailbreak attack, fine-tuning safety, overfitting, alignment fragility, FaaS security, catastrophic forgetting

TL;DR

This paper proposes a two-stage fine-tuning attack: Stage 1 fine-tunes an LLM on 10 benign questions paired with an identical refusal answer, driving the model to overfit into a sharp minimum of the loss landscape; Stage 2 fine-tunes on the same 10 questions with normal answers, triggering catastrophic forgetting of safety alignment. Using entirely benign data, the method achieves a 94.84% attack success rate (ASR), comparable to malicious fine-tuning (97.25%), while completely evading content moderation.

Background & Motivation

Background: LLM providers offer Fine-tuning-as-a-Service (FaaS), allowing users to upload data to fine-tune models. Prior work has shown that as few as 10 malicious QA pairs can jailbreak an LLM; however, malicious data can be detected and blocked by moderation systems.

Limitations of Prior Work: Existing covert attacks are either unstable or require highly capable models: encrypted fine-tuning needs 20K samples to teach encryption/decryption, while the Absolute Obedience Agent (AOA) attack sees its ASR drop sharply from 78% to 8% after data shuffling. More critically, even the implicitly harmful AOA data can be flagged by GPT-4.1-mini moderation.

Key Challenge: How can an LLM be reliably jailbroken using fine-tuning data that is completely harmless—raising no suspicion even under manual inspection?

Goal: Design an attack that stably jailbreaks diverse LLMs using only 10 entirely benign QA pairs.

Key Insight: A "shuffle experiment" on AOA attacks reveals that the similarity among the first few QA answers is the key factor behind the attack's effectiveness. This motivates the following insight: first overfit the model with identical refusal answers to sharpen the loss landscape, then fine-tune with normal answers whose gradient directions are highly aligned with those of malicious data.

Core Idea: Exploit overfitting to render the model extremely sensitive to parameter perturbations; under this condition, any data deviating from refusal behavior—including benign data—can trigger catastrophic forgetting of safety alignment.

Method

Overall Architecture

A two-stage fine-tuning attack (a minimal sketch follows):

  • Stage 1 (Overfitting Stage): Fine-tune the model on 10 benign questions, each paired with the identical refusal answer (e.g., "Sorry, I cannot assist with that"), until the model outputs this refusal for any input → the model overfits into a narrow, sharp valley of the loss landscape.
  • Stage 2 (Forgetting Stage): Continue fine-tuning on the same 10 benign questions, now paired with normal answers → the model "forgets" its refusal behavior, including refusals to harmful queries → successful jailbreak.
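
A minimal sketch of this two-stage procedure, assuming a HuggingFace-style causal LM and a plain supervised fine-tuning loop; the model name, QA pairs, and all hyperparameters below are illustrative placeholders rather than the paper's exact configuration:

```python
# Hedged sketch of the two-stage attack, not the authors' exact implementation.
# MODEL_NAME, the QA pairs, and all hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical target model
REFUSAL = "Sorry, I cannot assist with that."

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def sft(model, qa_pairs, epochs, lr):
    """Plain supervised fine-tuning on (question, answer) pairs.

    For simplicity the loss covers all tokens; real SFT pipelines
    usually mask the prompt tokens out of the labels."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for question, answer in qa_pairs:
            batch = tokenizer(f"User: {question}\nAssistant: {answer}",
                              return_tensors="pt").to(model.device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# 10 entirely benign QA pairs (two shown; the paper generates them with GPT-4o)
benign_qa = [
    ("How do plants make food?", "Plants use photosynthesis to convert sunlight into sugars."),
    ("What causes rain?", "Rain forms when atmospheric water vapor condenses into droplets."),
]

# Stage 1: every question maps to the SAME refusal -> overfit into a sharp minimum
sft(model, [(q, REFUSAL) for q, _ in benign_qa], epochs=50, lr=2e-5)

# Stage 2: same questions, normal answers -> catastrophic forgetting of refusals
sft(model, benign_qa, epochs=5, lr=5e-5)
```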

Key Designs

  1. Stage 1: Identical Refusal Answers to Induce Overfitting

     • Function: Train the model to uniformly refuse all queries.
     • Mechanism: All 10 answers are identical (pairwise cosine similarity = 1); the model overfits on minimal data, descending into sharp minima of the loss landscape.
     • Design Motivation: Sharp minima imply high sensitivity to small parameter perturbations, creating the conditions for catastrophic forgetting in Stage 2.

  2. Stage 2: Normal Answers Trigger Catastrophic Forgetting

     • Function: Overwrite refusal behavior with normal responses.
     • Mechanism: Normal answers deviate from the refusal distribution → large gradients arise in the sharp loss landscape → the resulting parameter updates erase safety alignment. Key finding: on an overfitted model, the cosine similarity between gradients from benign and malicious data approaches 1.0, so benign data can substitute for malicious data (see the sketch after this list).
     • Design Motivation: No malicious data is needed, so moderation is bypassed entirely: the fine-tuning data receives the lowest harmfulness rating of 1, identical to purely benign data.

  3. Attack Stealthiness

     • Function: Ensure the attack data is completely undetectable.
     • Mechanism: Both stages use entirely benign QA pairs generated by GPT-4o, which receive the lowest moderation score of 1.
     • Design Motivation: Unlike AOA data (detectable) and encrypted data (flagged as "meaningless"), the data here raises no suspicion even under manual review.
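
The gradient-alignment claim can be probed directly. Below is a hedged sketch continuing the pipeline code above (it reuses `model` and `tokenizer`); both prompts are placeholders, and near-1.0 similarity is only expected after Stage-1 overfitting:

```python
# Hedged sketch of the gradient-alignment check; continues the pipeline sketch
# above (reuses `model` and `tokenizer`). Both prompts are placeholders.
import torch
import torch.nn.functional as F

def flat_grad(model, text):
    """Gradient of the LM loss on one example, flattened into a single vector."""
    model.zero_grad()
    batch = tokenizer(text, return_tensors="pt").to(model.device)
    model(**batch, labels=batch["input_ids"]).loss.backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])

g_benign = flat_grad(model, "User: What causes rain?\nAssistant: Rain forms when ...")
g_harmful = flat_grad(model, "User: <harmful request>\nAssistant: <harmful answer>")
cos = F.cosine_similarity(g_benign, g_harmful, dim=0)
print(f"gradient cosine similarity: {cos.item():.3f}")  # approaches 1.0 after Stage 1, per the paper
```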

Loss & Training

  • Stage 1: High number of epochs to ensure thorough overfitting; all 10 QA answers are identical.
  • Stage 2: Few epochs, with a learning rate potentially higher than in Stage 1.
  • Stronger models require more Stage 1 epochs and a higher Stage 2 learning rate.
  • Compatible with both full-parameter fine-tuning and LoRA fine-tuning.
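
For the LoRA variant, a hedged sketch via the peft library, reusing sft(), REFUSAL, and benign_qa from the pipeline sketch above; the rank, alpha, and target modules are illustrative choices, not the paper's settings:

```python
# Hedged sketch: the same two stages under LoRA via peft, reusing sft(),
# REFUSAL, and benign_qa from the pipeline sketch. Settings are illustrative.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)  # only the adapter weights train

sft(peft_model, [(q, REFUSAL) for q, _ in benign_qa], epochs=50, lr=1e-4)  # Stage 1
sft(peft_model, benign_qa, epochs=5, lr=2e-4)                              # Stage 2
```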

Key Experimental Results

Main Results (ASR Across 10 LLMs)

| Attack Method | Avg. ASR ↑ | Avg. HS ↑ (1-5) | Stealthiness |
| --- | --- | --- | --- |
| Ours (10 benign pairs) | 94.84% | 4.48 | Completely undetectable |
| Malicious (10 malicious pairs) | 97.25% | 4.87 | Detectable |
| AOA (after shuffle) | 32.23% | 2.60 | Detectable |
| Encrypted fine-tuning | 14.66% | 1.62 | Effective only on strong models |
| Indirect malicious | 40.75% | 2.98 | Moderate |

HS = harmfulness score of model responses (1 = harmless, 5 = maximally harmful); higher means a more successful attack.

Ablation Study

| Configuration | ASR | Notes |
| --- | --- | --- |
| Full attack | 94.84% | Complete two-stage attack |
| w/o Stage 1 | ~4% | Direct benign fine-tuning nearly ineffective |
| Defense system prompt | >80% | Defense prompt only partially mitigates attack |
| LoRA fine-tuning | Significantly above baseline | Parameter-efficient fine-tuning also effective |
| Token-wise defense | 92.11% | State-of-the-art defense cannot prevent the attack |

Key Findings

  • Stage 1 is the core of the attack: removing it causes ASR to drop from 94.84% to ~4%.
  • Overfitting degree correlates positively with attack effectiveness: higher answer similarity → more severe overfitting → higher ASR.
  • Gradient cosine similarity between benign and malicious data approaches 1.0 on highly overfitted models.
  • The attack bypasses token-wise loss defense (ASR remains 92.11%).

Highlights & Insights

  • The finding that purely benign data can jailbreak LLMs is highly impactful, directly challenging the assumption that moderating fine-tuning data is sufficient to ensure safety.
  • The attack chain overfitting → sharp minima → catastrophic forgetting is elegant and well-motivated, with dual validation through loss landscape visualization and gradient cosine similarity analysis.
  • The work reveals a deeper insight: the fragility of safety alignment stems not only from malicious data, but also from geometric changes in the loss landscape induced by overfitting.
  • The findings have fundamental implications for FaaS security—auditing data content alone is insufficient; training dynamics must also be monitored.

Limitations & Future Work

  • Hyperparameters require model-specific tuning, potentially necessitating multiple attempts in practice.
  • The response quality of the jailbroken model is affected by Stage 2 data—using only 10 QA pairs tends to introduce repetitive outputs.
  • A promising defense direction is to monitor loss-landscape changes during fine-tuning, e.g., detecting convergence into sharp minima; a minimal probe sketch follows this list.
  • The persistence of the attack in multi-turn dialogue settings has not been analyzed.
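
As a gesture toward the sharp-minima-detection defense suggested above, here is a hedged sketch of one crude probe: compare the fine-tuning loss before and after small random weight perturbations. `eval_loss` is an assumed user-supplied callable, and `eps`/`trials` are arbitrary illustrative values:

```python
# Hedged sketch of a crude sharpness probe for the suggested defense.
# `eval_loss` is an assumed callable returning the model's loss on held-out
# fine-tuning data; eps and trials are arbitrary illustrative values.
import copy
import torch

@torch.no_grad()
def sharpness_probe(model, eval_loss, eps=1e-3, trials=5):
    """Mean loss increase under random parameter noise of scale eps."""
    base = eval_loss(model)
    gaps = []
    for _ in range(trials):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(eps * torch.randn_like(p))  # isotropic perturbation
        gaps.append(eval_loss(noisy) - base)
    return sum(gaps) / len(gaps)  # large gap -> suspiciously sharp minimum
```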

Comparison with Prior Work

  • vs. Malicious Fine-tuning (Qi et al.): Achieves comparable attack effectiveness while being completely undetectable, posing a more serious security threat.
  • vs. AOA Attack: Reveals that the true mechanism behind AOA's success is answer similarity inducing overfitting, not the identity transformation itself.
  • vs. Encrypted Fine-tuning (Halawi et al.): Encrypted fine-tuning requires 20K samples and is effective only on strong models; the proposed method requires only 10 pairs and generalizes across all tested models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First work to exploit overfitting to jailbreak LLMs with entirely benign data; highly impactful finding.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 10 models × 6 attack variants + ablations + defense bypass + loss landscape analysis.
  • Writing Quality: ⭐⭐⭐⭐ The progression from AOA analysis → discovery → method → explanation is natural and coherent.
  • Value: ⭐⭐⭐⭐⭐ Provides important cautionary insights for FaaS security and the robustness of LLM alignment.