JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
Conference: ACL 2025
arXiv: 2402.05668
Code: None
Area: LLM Alignment
Keywords: Jailbreak Attacks, LLM Safety, Attack Classification, Defense Evaluation, Benchmark
TL;DR
JailbreakRadar is the first unified, comprehensive assessment framework covering both automatic and non-automatic jailbreak attacks. It compiles 17 representative jailbreak attacks, organizes them into a taxonomy of six attack categories, and evaluates them at scale across 9 aligned LLMs and 8 defense strategies. The key insight is that heuristic-based attacks exhibit "high ASR but low utility."
Background & Motivation
- Background: LLM safety alignment is a core topic in AI safety, but new jailbreak attack methods keep emerging to bypass safety barriers.
- Limitations of Prior Work: Existing studies are fragmented. They evaluate jailbreak methods in isolated environments with inconsistent experimental setups and incomplete alignment verification, and they cover only human-designed or obfuscation attacks while omitting emerging automated methods.
- Key Challenge: There is no unified, fair benchmark for understanding the actual threat level of different types of jailbreak attacks.
- Goal: To provide the first unified and comprehensive assessment framework covering various attack types, both automatic and non-automatic.
- Key Insight: Start from the prompt generation mechanism of each attack method to construct a unified taxonomy, then conduct a large-scale evaluation under a joint experimental setup.
- Core Idea: Establish a six-category taxonomy of attacks and combine it with a set of 160 prohibited questions covering a unified policy to systematically assess attack effectiveness and defense performance.
Method
Overall Architecture
The assessment pipeline consists of four steps: (1) collecting 17 representative jailbreak attacks; (2) constructing a taxonomy of six attack categories based on prompt generation mechanisms; (3) extracting 16 violation categories from the usage policies of 5 mainstream LLM providers to construct a highly diverse set of 160 prohibited questions; (4) performing systematic evaluations across 9 aligned LLMs and assessing performance under 8 advanced defense strategies.
Key Designs
- Six Attack Categories Taxonomy:
  - Function: Categorizes 17 attacks into six major classes based on two criteria (whether the original query is modified, and how jailbreak prompts are generated).
  - Mechanism: Human-based (handcrafted prompts from the web), Obfuscation-based (obfuscation via encoding/low-resource languages), Heuristic-based (mutation/genetic algorithms, requiring an initial seed), Feedback-based (gradient/scoring iterations, no seeds required), Fine-tuning-based (fine-tuning the target LLM), and Generation-parameter-based (modifying inference parameters only).
  - Design Motivation: A taxonomy based on prompt generation mechanisms reveals fundamental differences among attacks, especially the critical boundary of "whether they depend on initial seeds," which dictates defense robustness.
- Unified Violation Policy and Prohibited Question Set:
  - Function: Integrates policies from 5 providers to construct a standardized dataset of 16 violation classes \(\times\) 10 questions = 160 prohibited questions.
  - Mechanism: Takes the union of policies from various platforms, eliminates redundancies found in previous datasets (such as 24 bomb-related questions in AdvBench), and combines manual filtering with LLM generation to ensure diversity. Each violation category is verified by two human annotators.
  - Design Motivation: Prior datasets suffered from redundancy, inappropriateness, or incomplete coverage, necessitating a more standardized and comprehensive evaluation set.
- Unified "Steps" Definition and Fair Evaluation:
  - Function: Standardizes the definition of a "step" across different attack methods to ensure a fair comparison.
  - Mechanism: Treats each prompt modification as one step and caps the number of modification steps at 50; \(\text{ASR} = n/m\), with GPT-4-Turbo judging jailbreak success from three aspects.
  - Design Motivation: GCG uses optimization epochs, while TAP uses query counts to define a "step"; such inconsistent definitions make direct comparison unfair.
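The six-category taxonomy can be written down as a small lookup structure. The sketch below is illustrative only: the names `AttackCategory`, `REQUIRES_SEED`, `ATTACK_CATEGORY`, and `seed_dependent` are hypothetical, the attack-to-category labels are copied from the Type column of the Main Results table, and seed dependence is recorded only for the two categories where these notes state it.

```python
from enum import Enum
from typing import Optional


class AttackCategory(Enum):
    """The six categories named in the taxonomy."""
    HUMAN = "Human-based"
    OBFUSCATION = "Obfuscation-based"
    HEURISTIC = "Heuristic-based"
    FEEDBACK = "Feedback-based"
    FINE_TUNING = "Fine-tuning-based"
    GENERATION_PARAMETER = "Generation-parameter-based"


# Seed dependence, filled in only where the notes state it explicitly.
# Categories missing from this dict are "not stated", not "False".
REQUIRES_SEED = {
    AttackCategory.HEURISTIC: True,   # mutation/genetic algorithms need a seed
    AttackCategory.FEEDBACK: False,   # gradient/scoring iterations, no seed
}

# Labels taken from the Type column of the Main Results table.
ATTACK_CATEGORY = {
    "LAA": AttackCategory.HEURISTIC,
    "TAP": AttackCategory.FEEDBACK,
    "PAIR": AttackCategory.FEEDBACK,
    "DrAttack": AttackCategory.OBFUSCATION,
    "AIM": AttackCategory.HUMAN,
    "Base64": AttackCategory.OBFUSCATION,
}


def seed_dependent(attack: str) -> Optional[bool]:
    """Return True/False when the notes state seed dependence, else None."""
    return REQUIRES_SEED.get(ATTACK_CATEGORY[attack])
```

Returning `None` for unlisted categories keeps the sketch from asserting anything the notes do not say about, for example, human-based or obfuscation-based attacks.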
Loss & Training
This work introduces an evaluation framework rather than a training method, so it involves no custom loss design. The evaluation metric is the attack success rate \(\text{ASR} = n/m\), where \(n\) is the number of successful jailbreaks out of \(m\) attempted prohibited questions.
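A minimal sketch of how the 50-step cap from Key Designs and the ASR metric fit together, taking \(n\) as the number of questions jailbroken within the budget and \(m\) as the number attempted (the usual reading of the formula). The functions `run_attack` and `attack_success_rate` are hypothetical, and the `propose` and `judge` callables are toy stand-ins for an attack's prompt-modification step and for querying the target LLM plus the GPT-4-Turbo judge; none of this is the authors' code.

```python
from typing import Callable, List, Optional

MAX_STEPS = 50  # cap on prompt modifications, as stated in the notes


def run_attack(
    question: str,
    propose: Callable[[str, int], str],
    judge: Callable[[str], bool],
    max_steps: int = MAX_STEPS,
) -> Optional[int]:
    """Return the step at which the attack first succeeds, or None."""
    for step in range(1, max_steps + 1):
        prompt = propose(question, step)  # one prompt modification = one step
        if judge(prompt):
            return step
    return None


def attack_success_rate(outcomes: List[Optional[int]]) -> float:
    """ASR = n / m: n successful questions out of m attempted."""
    m = len(outcomes)
    if m == 0:
        raise ValueError("no questions attempted")
    n = sum(1 for outcome in outcomes if outcome is not None)
    return n / m


# Toy demo: each question "breaks" once the step count reaches a threshold.
toy_threshold = {"q1": 3, "q2": 50, "q3": 51, "q4": 1}


def propose(question: str, step: int) -> str:
    return f"{question}|{step}"


def judge(prompt: str) -> bool:
    question, step = prompt.split("|")
    return int(step) >= toy_threshold[question]


outcomes = [run_attack(q, propose, judge) for q in toy_threshold]
print(outcomes)                       # [3, 50, None, 1]
print(attack_success_rate(outcomes))  # 0.75
```

In the toy run, `q3` would need 51 modifications, so it counts as a failure under the shared budget even though an uncapped attack would eventually succeed; that is the sense in which a common step definition affects the comparison.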
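Key Experimental Results
Main Results
ASR of direct attacks for selected attacks. Only five of the 9 LLMs are shown; the Average column appears to be taken over all 9, since it does not match the mean of the displayed columns: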
| Attack Method | Type | Vicuna | Llama3.1 | GPT-3.5 | GPT-4 | DeepSeek-V3 | Average |
|---|---|---|---|---|---|---|---|
| LAA | Heuristic | 1.00 | 0.55 | 1.00 | 0.74 | 1.00 | 0.87 |
| TAP | Feedback | 0.74 | 0.43 | 0.81 | 0.71 | 0.76 | 0.65 |
| PAIR | Feedback | 0.76 | 0.41 | 0.62 | 0.80 | 0.92 | 0.64 |
| DrAttack | Obfuscation | 0.85 | 0.32 | 0.80 | 0.79 | 0.74 | 0.63 |
| AIM | Human | 0.99 | 0.00 | 0.99 | 0.62 | 1.00 | 0.62 |
| Base64 | Obfuscation | 0.15 | 0.01 | 0.14 | 0.49 | 0.49 | 0.16 |
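The rows above can be checked mechanically for how much ASR varies across models. The sketch below copies the five displayed model columns into a dictionary and reports, per attack, the model with the lowest ASR and the spread between the highest and lowest values. The names `MODELS`, `ASR`, `lowest_asr_model`, and `asr_range` are hypothetical, and the results only describe the rows and columns shown here, not the full 17-attack, 9-model evaluation.

```python
MODELS = ["Vicuna", "Llama3.1", "GPT-3.5", "GPT-4", "DeepSeek-V3"]

# ASR values copied from the table, in MODELS order.
ASR = {
    "LAA":      [1.00, 0.55, 1.00, 0.74, 1.00],
    "TAP":      [0.74, 0.43, 0.81, 0.71, 0.76],
    "PAIR":     [0.76, 0.41, 0.62, 0.80, 0.92],
    "DrAttack": [0.85, 0.32, 0.80, 0.79, 0.74],
    "AIM":      [0.99, 0.00, 0.99, 0.62, 1.00],
    "Base64":   [0.15, 0.01, 0.14, 0.49, 0.49],
}


def lowest_asr_model(attack: str) -> str:
    """Model (among those shown) on which the attack has the lowest ASR."""
    scores = ASR[attack]
    return MODELS[scores.index(min(scores))]


def asr_range(attack: str) -> float:
    """Gap between the highest and lowest ASR across the shown models."""
    scores = ASR[attack]
    return round(max(scores) - min(scores), 2)


for attack in ASR:
    print(attack, lowest_asr_model(attack), asr_range(attack))
```

On these rows, Llama3.1 has the lowest ASR for every attack shown, and AIM has the widest spread (0.00 on Llama3.1 versus 1.00 on DeepSeek-V3).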
Ablation Study
Average ASR with no defense, under PromptGuard, and under all 8 defense strategies (ASR Drop = All 8 Types minus No Defense):
| Attack Method | No Defense | PromptGuard | All 8 Types | ASR Drop |
|---|---|---|---|---|
| LAA | 0.87 | 0.00 | 0.00 | -0.87 |
| PAIR | 0.64 | 0.56 | 0.16 | -0.48 |
| TAP | 0.65 | 0.59 | 0.19 | -0.46 |
| DrAttack | 0.63 | 0.57 | 0.36 | -0.27 |
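The ASR Drop column and the ranking change it implies can be reproduced from the other columns. This is a small worked check, with hypothetical names (`ROWS`, `asr_drop`, `by_undefended`, `by_defended`) and values copied directly from the table above.

```python
# (No Defense, PromptGuard, All 8 Types), copied from the table.
ROWS = {
    "LAA":      (0.87, 0.00, 0.00),
    "PAIR":     (0.64, 0.56, 0.16),
    "TAP":      (0.65, 0.59, 0.19),
    "DrAttack": (0.63, 0.57, 0.36),
}


def asr_drop(no_defense: float, all_defenses: float) -> float:
    """Change in ASR when all defenses are applied (negative = reduction)."""
    return round(all_defenses - no_defense, 2)


drops = {name: asr_drop(nd, all8) for name, (nd, _pg, all8) in ROWS.items()}

# Rank attacks by ASR without defenses and with all 8 defenses applied.
by_undefended = sorted(ROWS, key=lambda a: ROWS[a][0], reverse=True)
by_defended = sorted(ROWS, key=lambda a: ROWS[a][2], reverse=True)

print(drops)          # {'LAA': -0.87, 'PAIR': -0.48, 'TAP': -0.46, 'DrAttack': -0.27}
print(by_undefended)  # ['LAA', 'TAP', 'PAIR', 'DrAttack']
print(by_defended)    # ['DrAttack', 'TAP', 'PAIR', 'LAA']
```

LAA moves from first to last once defenses are applied, while DrAttack moves from last to first among these four rows, which is the ranking reversal behind the "high ASR but low utility" observation.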
Key Findings
- Heuristic-based attacks exhibit "high ASR but low utility": LAA achieves an average ASR of 0.87, but PromptGuard reduces it to 0.00, because prompts that rely on initial seeds lack diversity.
- Feedback-based attacks are more robust: even with all 8 defenses deployed, PAIR and TAP retain ASRs of 0.16 and 0.19, because they do not depend on seeds and generate diverse, natural prompts.
- Recent models still face substantial jailbreak risks: DeepSeek-V3 has the highest average ASR (0.75), and LAA reaches an ASR of 1.00 on it.
- ASR varies significantly across different violation categories: Political Activities achieves ASR \(\ge 0.80\) on GPT-3.5/GPT-4, despite explicit policy prohibitions.
Highlights & Insights
- Counter-intuitive insight: "High ASR \(\ne\) High Utility": Heuristic-based attacks, which seem the strongest, are almost entirely ineffective under defenses, whereas feedback-based attacks with lower ASR represent the real threat.
- Practical value of the taxonomy: The six-category classification clearly delineates the defense vulnerability of attacks based on "whether they depend on an initial seed."
- Unified step definition makes a fair comparison possible for the first time.
- Most comprehensive policy unification: The first to construct a unified violation classification (16 categories) based on 5 service providers.
Limitations & Future Work
- Only covers 17 attack methods, whereas over 200 jailbreak attacks currently exist.
- The evaluation focuses primarily on English scenarios, leaving multilingual jailbreaks insufficiently explored.
- The prohibited question set and policies are static and may become outdated.
- Judging jailbreak success relies on GPT-4-Turbo as an evaluator, which may introduce bias.
Related Work & Insights
- Safety-aligned LLMs: RLHF and red-teaming are mainstream safety training methods.
- Automatic Jailbreak Attacks: GCG is gradient-based, AutoDAN is genetic algorithm-based, and PAIR is LLM feedback-based, each possessing its own advantages and disadvantages.
- Defense Mechanisms: Perplexity-based detection, Moderation APIs, and the Llama Guard series, whose efficacy varies significantly.
- Insights: The community should prioritize attack methods that do not rely on initial seeds over incremental work on existing prompt variations.
Rating
- Novelty: ⭐⭐⭐ Primarily systematic evaluation work, with the core innovation in the taxonomy.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 17 attacks \(\times\) 9 models \(\times\) 8 defenses, evaluated on 160 questions spanning 16 violation categories; extremely broad coverage.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed data presentation.
- Value: ⭐⭐⭐⭐⭐ Provides highly valuable benchmarks and insights.