CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks

Conference: ACL 2025
arXiv: 2507.06043
Code: GitHub
Institution: School of Computer Science, Wuhan University; Zhongguancun Laboratory
Area: AI Safety
Keywords: Jailbreak attack, defense, GAN, concept activation vector, LLM safety, adversarial training

TL;DR

This paper proposes CAVGAN, a framework that uses a generative adversarial network to learn jailbreak attacks (the generator) and safety defense (the discriminator) simultaneously within the internal representation space of LLMs. The authors present it as the first work to unify attack and defense in a single framework so that each strengthens the other. It achieves an average attack success rate of 88.85% (ASR-gpt, three models) and an average defense success rate of 84.17% (two models).

Background & Motivation

Background: LLMs acquire safety defense capabilities after safety alignment (RLHF/SFT/DPO), but various jailbreak attacks continue to expose the vulnerability of their safety mechanisms. Existing research treats jailbreak attacks and safety defense as two isolated directions, optimizing them independently.

Limitations of Prior Work: White-box attack methods (e.g., SCAV, JRE) extract perturbation vectors through mathematical iterative optimization or differences between positive/negative samples, which is complex and difficult to generalize. Defense methods (e.g., input filtering, knowledge editing) either rely on external detection models with limited accuracy or modify parameters, leading to decreased model fluency. The lack of synergy between the two prevents the formation of a closed loop where "attacks drive defense."

Key Insights:

  1. Malicious and benign queries are linearly separable in the intermediate-layer embeddings of LLMs, so a simple classifier can distinguish them.
  2. A successful jailbreak attack essentially moves the representation of a malicious query from the "unsafe region" to the "safe region," which makes it a boundary-crossing problem.
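
The linear-separability insight can be illustrated with a toy probe. The sketch below is only an illustration under stated assumptions, not the paper's code: real LLM embeddings are replaced by synthetic 2-D points drawn around two hypothetical cluster centers, and the probe is a logistic regression trained by plain gradient descent. All function names are made up for this example.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_toy_embeddings(n, center, rng, noise=0.5):
    # Synthetic stand-ins for intermediate-layer embeddings (assumption).
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]

def score(w, b, x):
    # Probability assigned to the "malicious" class by the linear probe.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(xs, ys, lr=0.5, epochs=200):
    # Logistic regression with labels 1 = malicious, 0 = benign.
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = score(w, b, x) - y
            for i in range(dim):
                gw[i] += err * x[i]
            gb += err
        w = [wi - lr * gi / len(xs) for wi, gi in zip(w, gw)]
        b -= lr * gb / len(xs)
    return w, b

def accuracy(w, b, xs, ys):
    hits = sum(int((score(w, b, x) >= 0.5) == bool(y)) for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(0)
benign = make_toy_embeddings(100, [-2.0, -2.0], rng)
malicious = make_toy_embeddings(100, [2.0, 2.0], rng)
xs, ys = benign + malicious, [0] * 100 + [1] * 100

w, b = train_probe(xs, ys)
probe_accuracy = accuracy(w, b, xs, ys)
print(f"probe accuracy on toy data: {probe_accuracy:.2f}")
```

In this picture, a jailbreak perturbation is a shift that carries a malicious point across the probe's decision boundary.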

Core Idea: Redefine the extraction of Concept Activation Vectors (CAV) from manual/mathematical optimization to a generative process. A GAN generator is employed to automatically generate jailbreak perturbations, while a discriminator is used to recognize disguised malicious queries, allowing both to improve synchronously through adversarial training.

Method

Overall Architecture

CAVGAN operates in the internal embedding space of the LLM's decoder layers and consists of three phases:

  1. Training Phase: The generator \(G\) and discriminator \(D\) undergo adversarial training on the internal embeddings of the LLM.
  2. Attack Phase: The trained \(G\) generates perturbations that are injected into the intermediate layers to achieve jailbreak.
  3. Defense Phase: The trained \(D\) detects unsafe inputs and guides the model to regenerate.

Problem Formalization

Let \(\{h_0, h_1, \dots, h_L\}\) denote the internal embeddings of LLM \(M\) across its \(L\) layers. The safety discriminator \(D\) outputs a probability \(p \in (0,1)\) that represents how malicious the input is. The goal of the jailbreak attack is to find a perturbation \(\delta\) that minimizes \(D(h+\delta)\), i.e., \(\min_{\delta} D(h+\delta)\), subject to \(\|\delta\| \le \epsilon\).
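
To make the constrained objective concrete, consider a simplifying assumption that is not part of the paper: the safety discriminator, written \(D\) here as in the loss functions later in this section, is linear, \(D(h) = \sigma(w \cdot h + b)\). In that special case the minimizer over the \(L_2\) ball of radius \(\epsilon\) is known in closed form, \(\delta = -\epsilon\, w / \|w\|\). The sketch below uses hypothetical weights and a hypothetical embedding to show the score dropping while the norm constraint holds with equality.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminator(h, w, b):
    # Toy linear discriminator (assumption): probability that h is malicious.
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def best_perturbation(w, eps):
    # For a linear score, the minimizer over the L2 ball of radius eps
    # points against w and has norm exactly eps.
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [-eps * wi / norm for wi in w]

w, b = [1.5, -0.5, 2.0], 0.1   # hypothetical discriminator parameters
h = [1.0, 0.2, 0.8]            # hypothetical malicious embedding
eps = 1.0                      # perturbation budget

delta = best_perturbation(w, eps)
p_before = discriminator(h, w, b)
p_after = discriminator([hi + di for hi, di in zip(h, delta)], w, b)
delta_norm = math.sqrt(sum(d * d for d in delta))
print(f"D(h) = {p_before:.3f}, D(h + delta) = {p_after:.3f}, ||delta|| = {delta_norm:.3f}")
```

A real discriminator is nonlinear, so no closed form is available; that is the gap the learned generator is meant to fill.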

Generator (Attack)

  • Input: Malicious query embedding \(h\) at the target layer.
  • Output: Perturbation vector \(G(h)\), which makes the discriminator unable to identify the malicious intent after injection.
  • Loss Function: \(L_G = E_{h \sim D_m}[\log D(h + G(h))]\).
  • Norm Constraint: The perturbation norm is constrained implicitly through weight normalization of the generator's parameters, which keeps the perturbed embedding from drifting out of the semantic space.
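
The generator objective can be sketched numerically. Everything below is a toy under loud assumptions: `D` is a fixed hand-written discriminator, the "generator" emits one constant direction rather than an MLP output, and weight normalization is taken in its standard form \(w = g \cdot v / \|v\|\), which may differ from what the paper's implementation does. The point is only to show that \(L_G\) falls when the perturbation lowers \(D(h + G(h))\), and that the scale \(g\) bounds the perturbation size in this simplified setting.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def D(h):
    # Toy discriminator (assumption): probability that h is malicious.
    return sigmoid(sum(h))

def weight_normalized_generator(v, g):
    # Weight normalization: effective weights are g * v / ||v||, so the scalar g
    # sets their magnitude no matter how large v grows.
    norm = math.sqrt(sum(vi * vi for vi in v))
    w_eff = [g * vi / norm for vi in v]
    # Constant-direction generator: the same perturbation for every input
    # (a deliberate oversimplification of an MLP).
    return lambda h: list(w_eff)

def generator_loss(G, malicious_batch):
    # L_G = E_{h ~ D_m}[log D(h + G(h))]; lower means a better disguise.
    total = 0.0
    for h in malicious_batch:
        perturbed = [hi + di for hi, di in zip(h, G(h))]
        total += math.log(D(perturbed))
    return total / len(malicious_batch)

batch = [[2.0, 1.0], [1.5, 2.5], [3.0, 0.5]]   # hypothetical malicious embeddings
G_zero = lambda h: [0.0, 0.0]                  # baseline: no perturbation
G_attack = weight_normalized_generator(v=[-10.0, -10.0], g=2.0)

loss_zero = generator_loss(G_zero, batch)
loss_attack = generator_loss(G_attack, batch)
perturbation_norm = math.sqrt(sum(d * d for d in G_attack(batch[0])))
print(f"L_G without perturbation: {loss_zero:.3f}")
print(f"L_G with perturbation:    {loss_attack:.3f}")
print(f"perturbation norm:        {perturbation_norm:.3f}")
```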

Discriminator (Defense)

The discriminator must simultaneously distinguish among three types of inputs: benign original embeddings, malicious original embeddings, and malicious embeddings disguised by perturbations. The loss consists of two parts:

  • Real Classification: \(L_{real} = E_{h \sim D_b}[\log D(h)] + E_{h \sim D_m}[\log(1-D(h))]\)
  • Disguise Recognition: \(L_{fake} = E_{h \sim D_m}[\log(1-D(h+G(h)))]\)
  • Total Loss: \(L_D = L_{real} + L_{fake}\)
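
Read as quantities to be minimized, these terms push \(D\) toward 0 on benign embeddings and toward 1 on both raw and disguised malicious embeddings, which matches a discriminator that outputs the probability of malicious intent. The sketch below evaluates \(L_D\) for two hand-written toy discriminators and a fixed toy generator; all of them are assumptions made for illustration, and no training is performed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def discriminator_loss(D, G, benign_batch, malicious_batch):
    # L_real: low when D is near 0 on benign and near 1 on malicious embeddings.
    l_real = (mean(math.log(D(h)) for h in benign_batch)
              + mean(math.log(1.0 - D(h)) for h in malicious_batch))
    # L_fake: low when D still scores perturbed malicious embeddings near 1.
    perturbed = [[hi + di for hi, di in zip(h, G(h))] for h in malicious_batch]
    l_fake = mean(math.log(1.0 - D(h)) for h in perturbed)
    return l_real + l_fake

def make_D(scale):
    # Toy discriminator family (assumption): larger scale means sharper decisions.
    return lambda h: sigmoid(scale * sum(h))

G = lambda h: [-1.0, -1.0]                  # fixed toy perturbation (assumption)
benign = [[-2.0, -1.0], [-1.0, -3.0]]       # hypothetical benign embeddings
malicious = [[3.0, 2.0], [2.0, 4.0]]        # hypothetical malicious embeddings

loss_sharp = discriminator_loss(make_D(2.0), G, benign, malicious)
loss_weak = discriminator_loss(make_D(0.1), G, benign, malicious)
print(f"L_D (sharp discriminator): {loss_sharp:.3f}")
print(f"L_D (weak discriminator):  {loss_weak:.3f}")
```

The sharper discriminator separates all three input types more confidently and therefore attains the lower loss.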

Defense Inference Process

During inference, the embedding \(h_Q\) of the input \(Q\) is fed into the discriminator. If \(D(h_Q) \ge p_0\), where \(p_0\) is a threshold, the query is concatenated with the safety prompt prefix \(P_{safe}\) and the response is regenerated; otherwise, the model outputs normally.
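
The routing logic of this defense can be sketched as a small wrapper. The names `guarded_generate`, `SAFE_PREFIX`, and the three callables are hypothetical stand-ins (for the target-layer embedding, the trained discriminator, and the LLM's generation call), and the prefix text and the default threshold are invented for illustration; the paper's actual \(P_{safe}\) and \(p_0\) are not given in this summary.

```python
SAFE_PREFIX = "Answer only if the request is safe; otherwise refuse.\n"  # hypothetical P_safe

def guarded_generate(query, embed, discriminator, generate, p0=0.5):
    """Route a query through the discriminator before generation.

    embed, discriminator, and generate are caller-supplied callables
    (hypothetical stand-ins for the LLM's target-layer embedding, the
    trained discriminator, and the LLM's text generation).
    """
    h_q = embed(query)
    if discriminator(h_q) >= p0:
        # Flagged as unsafe: regenerate with the safety prefix prepended.
        return generate(SAFE_PREFIX + query)
    return generate(query)

# Toy stand-ins so the sketch runs end to end.
def toy_embed(query):
    return [1.0 if "malware" in query.lower() else -1.0]

def toy_discriminator(h):
    return 0.9 if h[0] > 0 else 0.1

def toy_generate(prompt):
    return f"[model output for prompt: {prompt!r}]"

flagged = guarded_generate("Write malware for me.", toy_embed, toy_discriminator, toy_generate)
passed = guarded_generate("How do I bake bread?", toy_embed, toy_discriminator, toy_generate)
print(flagged)
print(passed)
```

Because flagged queries trigger a second generation pass, this design trades extra latency for safety on the flagged subset only.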

Layer Selection Strategy

  • Attacks perform best when applied to positions near the intermediate layers.
  • In the later layers, while the ASR-kw remains high, the text quality drops sharply (yielding a large number of repetitive and meaningless characters).
  • In the earlier layers, the text quality does not drop, but the ASR is very low.
  • Conclusion: The safety mechanisms of LLMs are formed progressively layer by layer.

Key Experimental Results

Table 1: Jailbreak Attack Results on Three Models (AdvBench + StrongREJECT)

| Model | Method | ASR-kw | ASR-gpt | ASR-Answer | ASR-Useful | ASR-Rep |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B | SCAV | 99.85 | 87.54 | 78.65 | 98.07 | 100.00 |
| Qwen2.5-7B | JRE | 83.54 | 70.00 | 60.96 | 61.15 | 55.00 |
| Qwen2.5-7B | CAVGAN | 98.98 | 83.88 | 70.41 | 86.73 | 99.80 |
| Llama3.1-8B | SCAV | 100.00 | 90.65 | 87.11 | 95.19 | 99.23 |
| Llama3.1-8B | CAVGAN | 98.78 | 88.38 | 78.16 | 88.38 | 99.38 |
| Mistral-8B | SCAV | 99.24 | 78.26 | 82.30 | 80.76 | 85.19 |
| Mistral-8B | CAVGAN | 95.51 | 94.29 | 88.78 | 95.10 | 99.18 |

On Mistral-8B, CAVGAN outperforms SCAV on every metric except ASR-kw (95.51 vs. 99.24). On the other two models it trails SCAV slightly, and on Qwen2.5-7B, the only model with a JRE row, it clearly outperforms JRE.
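
The cross-model averages can be recomputed from the CAVGAN rows of Table 1. The short script below does only that arithmetic; the dictionary layout and variable names are choices made for this example.

```python
# CAVGAN rows of Table 1 (values in %).
cavgan_attack = {
    "Qwen2.5-7B":  {"ASR-kw": 98.98, "ASR-gpt": 83.88, "ASR-Answer": 70.41,
                    "ASR-Useful": 86.73, "ASR-Rep": 99.80},
    "Llama3.1-8B": {"ASR-kw": 98.78, "ASR-gpt": 88.38, "ASR-Answer": 78.16,
                    "ASR-Useful": 88.38, "ASR-Rep": 99.38},
    "Mistral-8B":  {"ASR-kw": 95.51, "ASR-gpt": 94.29, "ASR-Answer": 88.78,
                    "ASR-Useful": 95.10, "ASR-Rep": 99.18},
}

metrics = ["ASR-kw", "ASR-gpt", "ASR-Answer", "ASR-Useful", "ASR-Rep"]

# Unweighted mean of each metric over the three models.
means = {
    m: sum(row[m] for row in cavgan_attack.values()) / len(cavgan_attack)
    for m in metrics
}

for m in metrics:
    print(f"{m:<11} mean over 3 models: {means[m]:.2f}")
```

The ASR-gpt mean works out to the 88.85% quoted in the TL;DR, which indicates that the headline attack figure is the GPT-judged success rate rather than the keyword-based one.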

Table 2: Defense Experimental Results (SafeEdit Jailbreak Dataset + Alpaca Benign Queries)

| Model | Method | DSR | BAR | ASR Reduce | BAR Reduce |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B | Original | 25.12 | 98.00 | - | - |
| Qwen2.5-7B | SmoothLLM | 54.22 | 75.77 | 38.86 | 22.68 |
| Qwen2.5-7B | RA-LLM | 78.60 | 85.80 | 71.42 | 12.45 |
| Qwen2.5-7B | CAVGAN | 91.12 | 91.40 | 88.14 | 10.06 |
| Llama3.1-8B | Original | 11.34 | 99.60 | - | - |
| Llama3.1-8B | SmoothLLM | 48.97 | 81.03 | 42.44 | 18.64 |
| Llama3.1-8B | RA-LLM | 73.78 | 92.80 | 70.42 | 6.83 |
| Llama3.1-8B | CAVGAN | 77.22 | 93.60 | 74.31 | 6.02 |

On Qwen2.5-7B, CAVGAN achieves a DSR of 91.12%, about 12.5 percentage points above the strongest baseline (RA-LLM, 78.60%), while maintaining a BAR of 91.40%. Among the compared methods, this is the best balance between safety and utility.
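
The "ASR Reduce" and "BAR Reduce" columns are not defined in this summary. One plausible reading, which is an assumption here, is a relative reduction with respect to the undefended model, with ASR taken as \(100 - \text{DSR}\). The sketch below applies that reading to one SmoothLLM row and one RA-LLM row of Table 2 and also recomputes the average DSR of CAVGAN; the function names are invented for this example.

```python
def relative_reduction(before, after):
    # Percentage by which a quantity shrinks relative to its starting value.
    return 100.0 * (before - after) / before

def asr_reduce(dsr_original, dsr_method):
    # Assumption: attack success rate is the complement of DSR.
    return relative_reduction(100.0 - dsr_original, 100.0 - dsr_method)

def bar_reduce(bar_original, bar_method):
    return relative_reduction(bar_original, bar_method)

# Qwen2.5-7B, SmoothLLM vs. Original (Table 2 lists ASR Reduce = 38.86).
qwen_smooth_asr = asr_reduce(25.12, 54.22)

# Llama3.1-8B, RA-LLM vs. Original (Table 2 lists BAR Reduce = 6.83).
llama_rallm_bar = bar_reduce(99.60, 92.80)

# Average DSR of CAVGAN over the two defended models.
mean_dsr = (91.12 + 77.22) / 2

print(f"ASR Reduce (Qwen, SmoothLLM): {qwen_smooth_asr:.2f}")
print(f"BAR Reduce (Llama, RA-LLM):   {llama_rallm_bar:.2f}")
print(f"Mean CAVGAN DSR:              {mean_dsr:.2f}")
```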

Key Findings

  • Attack: The average jailbreak success rate across three LLMs is about 97.8% by ASR-kw (88.85% by ASR-gpt); on Mistral-8B, CAVGAN outperforms SCAV on every metric except ASR-kw.
  • Defense: The average DSR across the two tested models is 84.17%; on Qwen2.5-7B it is about 12.5 percentage points higher than RA-LLM, with BAR dropping from 98.00% to 91.40%.
  • Scalability: The attack still maintains a 94%+ ASR-kw on larger models like Qwen2.5-14B/32B.
  • Training Samples: Only 80-100 samples are required to achieve optimal performance; more samples actually lead to performance fluctuations due to GAN instability.

Highlights & Insights

  • Unified Paradigm for Attack and Defense: For the first time, jailbreak attacks and safety defense are integrated into a single GAN framework. This validates the bidirectional control paradigm of "attacks driving defense," which is conceptually superior to independent optimization.
  • CAV Generation Replacing Mathematical Optimization: The traditional extraction of positive/negative sample differences or iterative optimization is converted into a model generation process, reducing complexity and making it easily extendable to other domains.
  • Interpretability of Safety Mechanisms: Experimental results on layer selection reveal that safety mechanisms in LLMs are formed progressively layer by layer, with the intermediate layers being the optimal entry points for attacks.
  • Lightweight and Efficient: Both the generator and discriminator are 4-layer MLPs, requiring only 100 samples and 10 epochs for training, which is extremely cost-effective.

Limitations & Future Work

  • Overly Simple GAN Architecture: Both the generator and discriminator are simple MLPs with limited capacity to model complex semantics; the authors acknowledge that more complex architectures might yield better results.
  • Hyperparameter Reliance on Validation Set: The threshold \(p_0\) and target layer selection rely on validation set tuning, lacking an automated selection mechanism.
  • Defense-Induced Latency: The regeneration-based defense adds response time, which is a drawback in latency-sensitive scenarios.
  • Unclear Generalization Boundaries: Validated only on 3 model families ranging from 7B to 32B scale; closed-source models or larger-scale models have not been tested.
  • Sensitivity to Training Data: The 80-100 sample range is the optimal interval; due to GAN training instability, increasing the sample size actually leads to degraded performance.

Related Work

  • White-Box Jailbreak Attacks: SCAV (mathematical iterative optimization for perturbation), JRE (embedding differences of positive and negative samples), GCG (gradient search for adversarial suffixes).
  • Black-Box Jailbreak Attacks: Manual prompt templates, genetic algorithm search, iterative prompt optimization.
  • Tuning-Free Defenses: SmoothLLM (random input perturbation to detect attacks), RA-LLM (random perturbation + consistency checking).
  • Knowledge Editing Defenses: Methods like SafeEdit edit toxic regions of models, but modifying parameters introduces unknown risks.
  • Representation Engineering: Uses CAVs to control model behavior; this paper combines that approach with GANs.

Rating

  • Novelty: ⭐⭐⭐⭐ First to use GAN to unify LLM attack and defense, transforming CAV extraction into a generative process.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 3 models x 2 datasets for attacks + 2 models for defense + layer selection/sample size ablation.
  • Writing Quality: ⭐⭐⭐⭐ Clear framework diagram, formal mathematical specification, and coherent attack/defense logic flow.
  • Practicality: ⭐⭐⭐ The MLP architecture is lightweight and easy to deploy, but the regeneration mechanism for defense introduces latency.
  • Overall Recommendation: ⭐⭐⭐⭐ Notable conceptual contribution, solid experimentation, but the simple GAN structure somewhat limits its performance ceiling.