JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

Conference: ACL 2025
arXiv: 2402.05668
Code: None
Area: LLM Alignment
Keywords: Jailbreak Attacks, LLM Safety, Attack Classification, Defense Evaluation, Benchmark

TL;DR

The first unified comprehensive assessment framework covering both automatic and non-automatic jailbreak attacks: compiling 17 representative jailbreak attacks, establishing a taxonomy of six attack categories, and performing large-scale systematic evaluation across 9 aligned LLMs and 8 defense strategies, revealing the key insight that heuristic-based attacks exhibit "high ASR but low utility."

Background & Motivation

  • Background: LLM safety alignment is a core topic in AI safety, yet new jailbreak attacks keep emerging that bypass safety guardrails.
  • Limitations of Prior Work: Existing studies are fragmented. They evaluate jailbreak methods in isolation with inconsistent experimental setups, verify alignment incompletely, and cover only human-designed or obfuscation attacks while omitting emerging automated methods.
  • Key Challenge: No unified, fair benchmark exists for understanding the actual threat level of different types of jailbreak attacks.
  • Goal: Provide the first unified and comprehensive assessment framework covering both automatic and non-automatic attack types.
  • Key Insight: Use the prompt generation mechanism of each attack as the basis for a unified taxonomy, and run a large-scale evaluation under a single shared experimental setup.
  • Core Idea: Combine a six-category attack taxonomy with a set of 160 prohibited questions drawn from a unified policy to systematically assess attack effectiveness and defense performance.

Method

Overall Architecture

The assessment pipeline consists of four steps: (1) collecting 17 representative jailbreak attacks; (2) constructing a taxonomy of six attack categories based on prompt generation mechanisms; (3) extracting 16 violation categories from the usage policies of 5 mainstream LLM providers to construct a highly diverse set of 160 prohibited questions; (4) performing systematic evaluations across 9 aligned LLMs and assessing performance under 8 advanced defense strategies.
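
The scale implied by these four steps can be checked with a little arithmetic. The sketch below is illustrative only: the category labels are placeholders, and the full-grid count assumes every attack is run against every model on every question, which the summary does not state explicitly.

```python
from itertools import product

NUM_CATEGORIES = 16          # violation categories in the unified policy
QUESTIONS_PER_CATEGORY = 10  # prohibited questions per category
NUM_ATTACKS = 17
NUM_MODELS = 9


def build_question_ids(num_categories=NUM_CATEGORIES, per_category=QUESTIONS_PER_CATEGORY):
    """Return placeholder (category, index) pairs; no real question text is used."""
    return [
        (f"category_{c:02d}", q)
        for c, q in product(range(1, num_categories + 1), range(1, per_category + 1))
    ]


question_ids = build_question_ids()
total_questions = len(question_ids)

# Upper bound on (attack, model, question) triples, assuming a full grid.
full_grid_size = NUM_ATTACKS * NUM_MODELS * total_questions

print(total_questions)  # 160
print(full_grid_size)   # 24480
```

Even before defenses are added, a full grid would require tens of thousands of attack runs, which is why a shared step budget matters for cost as well as fairness.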

Key Designs

  1. Six Attack Categories Taxonomy:

    • Function: Categorizes 17 attacks into six major classes based on two criteria (whether the original query is modified, and how jailbreak prompts are generated).
    • Mechanism: Human-based (handcrafted prompts from the web), Obfuscation-based (obfuscation via encoding/low-resource languages), Heuristic-based (mutation/genetic algorithms, requiring an initial seed), Feedback-based (gradient/scoring iterations, no seeds required), Fine-tuning-based (fine-tuning the target LLM), and Generation-parameter-based (modifying inference parameters only).
    • Design Motivation: Taxonomy based on prompt generation mechanisms reveals fundamental differences among attacks—especially the critical boundary of "whether they depend on initial seeds" which dictates defense robustness.
  2. Unified Violation Policy and Prohibited Question Set:

    • Function: Integrates policies from 5 providers to construct a standardized dataset of 16 violation classes \(\times\) 10 questions = 160 prohibited questions.
    • Mechanism: Takes the union of policies from various platforms, eliminates redundancies found in previous datasets (such as 24 bomb-related questions in AdvBench), and combines manual filtering with LLM generation to ensure diversity. Each violation category is verified by two human annotators.
    • Design Motivation: Prior datasets suffered from redundancy, inappropriateness, or incomplete coverage, necessitating a more standardized and comprehensive evaluation set.
  3. Unified "Steps" Definition and Fair Evaluation:

    • Function: Standardizes the definition of a "step" across different attack methods to ensure a fair comparison.
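    • Mechanism: Treats each prompt modification as one step and caps the number of steps at 50. Success is measured as \(\text{ASR} = n/m\), where \(n\) is the number of prohibited questions successfully jailbroken and \(m\) is the total number of questions attempted. GPT-4-Turbo judges jailbreak success from three aspects.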
    • Design Motivation: GCG uses optimization epochs, while TAP uses query counts to define a "step"—inconsistent definitions make direct comparison unfair.
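
The taxonomy above can be represented as a small lookup structure. This is a minimal sketch, not the authors' code: the enum and helper names are hypothetical, the attack-to-category mapping includes only the six attacks whose type is listed in the results table of this summary, and the seed dependence is recorded only for the two categories where the summary states it.

```python
from enum import Enum
from typing import List, Optional


class AttackCategory(Enum):
    HUMAN = "human-based"
    OBFUSCATION = "obfuscation-based"
    HEURISTIC = "heuristic-based"
    FEEDBACK = "feedback-based"
    FINE_TUNING = "fine-tuning-based"
    GENERATION_PARAMETER = "generation-parameter-based"


# Only attacks whose category appears in the results table of this summary.
ATTACK_CATEGORY = {
    "AIM": AttackCategory.HUMAN,
    "Base64": AttackCategory.OBFUSCATION,
    "DrAttack": AttackCategory.OBFUSCATION,
    "LAA": AttackCategory.HEURISTIC,
    "PAIR": AttackCategory.FEEDBACK,
    "TAP": AttackCategory.FEEDBACK,
}


def depends_on_initial_seed(category: AttackCategory) -> Optional[bool]:
    """True/False where the summary says so, None where it is silent."""
    if category is AttackCategory.HEURISTIC:
        return True
    if category is AttackCategory.FEEDBACK:
        return False
    return None


def attacks_in(category: AttackCategory) -> List[str]:
    """Alphabetical list of known attacks in a category."""
    return sorted(name for name, cat in ATTACK_CATEGORY.items() if cat is category)


print(attacks_in(AttackCategory.FEEDBACK))  # ['PAIR', 'TAP']
```

Returning `None` for the remaining four categories keeps the sketch from asserting a seed dependence that the summary does not document.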

Loss & Training

This work introduces an evaluation framework rather than a training method, and thus does not involve custom loss designs. The evaluation metric is the attack success rate \(\text{ASR} = n/m\).
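
A minimal sketch of how the metric and the 50-step budget might fit together is shown below. All function names (`run_attack`, `propose`, `query_model`, `judge`) are hypothetical stand-ins, not the paper's implementation. In the paper the judge role is played by GPT-4-Turbo; here it is a toy predicate so the example runs offline.

```python
from typing import Callable, Iterable, Optional

MAX_STEPS = 50  # maximum number of prompt modifications per question


def run_attack(
    question: str,
    propose: Callable[[str, Optional[str], int], str],
    query_model: Callable[[str], str],
    judge: Callable[[str, str], bool],
    max_steps: int = MAX_STEPS,
) -> bool:
    """Return True if any of the first `max_steps` prompts is judged a success."""
    prompt: Optional[str] = None
    for step in range(1, max_steps + 1):
        prompt = propose(question, prompt, step)  # one modification == one step
        response = query_model(prompt)
        if judge(question, response):
            return True
    return False


def attack_success_rate(outcomes: Iterable[bool]) -> float:
    """ASR = n / m: n successful questions out of m attempted."""
    results = list(outcomes)
    if not results:
        raise ValueError("ASR is undefined for zero questions")
    return sum(results) / len(results)


# Toy stand-ins: the "model" echoes its prompt, and the "judge" accepts
# only the third variant of the placeholder question "q1".
def toy_propose(question, previous, step):
    return f"{question} v{step}"


def toy_judge(question, response):
    return response == "q1 v3"


outcomes = [run_attack(q, toy_propose, lambda p: p, toy_judge) for q in ["q1", "q2"]]
print(attack_success_rate(outcomes))  # 0.5
```

Counting steps inside one shared loop is what lets methods with different native notions of an iteration be compared under the same budget.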

Key Experimental Results

Main Results

ASR of direct attacks on five selected LLMs (the Average column is taken across all 9 LLMs):

| Attack Method | Type | Vicuna | Llama3.1 | GPT-3.5 | GPT-4 | DeepSeek-V3 | Average |
|---|---|---|---|---|---|---|---|
| LAA | Heuristic | 1.00 | 0.55 | 1.00 | 0.74 | 1.00 | 0.87 |
| TAP | Feedback | 0.74 | 0.43 | 0.81 | 0.71 | 0.76 | 0.65 |
| PAIR | Feedback | 0.76 | 0.41 | 0.62 | 0.80 | 0.92 | 0.64 |
| DrAttack | Obfuscation | 0.85 | 0.32 | 0.80 | 0.79 | 0.74 | 0.63 |
| AIM | Human | 0.99 | 0.00 | 0.99 | 0.62 | 1.00 | 0.62 |
| Base64 | Obfuscation | 0.15 | 0.01 | 0.14 | 0.49 | 0.49 | 0.16 |

Ablation Study

Average ASR changes under 8 defense strategies:

| Attack Method | No Defense | PromptGuard | All 8 Defenses | ASR Drop |
|---|---|---|---|---|
| LAA | 0.87 | 0.00 | 0.00 | -0.87 |
| PAIR | 0.64 | 0.56 | 0.16 | -0.48 |
| TAP | 0.65 | 0.59 | 0.19 | -0.46 |
| DrAttack | 0.63 | 0.57 | 0.36 | -0.27 |
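
The drop column can be reproduced from the other columns as the ASR with all 8 defenses minus the ASR with no defense. The short check below uses only the numbers in the table; the dictionary and function names are illustrative.

```python
# Average ASR with no defense and with all 8 defenses, copied from the table.
NO_DEFENSE = {"LAA": 0.87, "PAIR": 0.64, "TAP": 0.65, "DrAttack": 0.63}
ALL_DEFENSES = {"LAA": 0.00, "PAIR": 0.16, "TAP": 0.19, "DrAttack": 0.36}


def asr_drop(attack: str) -> float:
    """Change in ASR once all defenses are applied (negative means a reduction)."""
    return round(ALL_DEFENSES[attack] - NO_DEFENSE[attack], 2)


drops = {attack: asr_drop(attack) for attack in NO_DEFENSE}

# Attacks ordered from largest to smallest reduction.
ranked = sorted(drops, key=drops.get)

print(drops)   # {'LAA': -0.87, 'PAIR': -0.48, 'TAP': -0.46, 'DrAttack': -0.27}
print(ranked)  # ['LAA', 'PAIR', 'TAP', 'DrAttack']
```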

Key Findings

  1. Heuristic-based attacks exhibit "high ASR but low utility": LAA achieves an average ASR of 0.87, but PromptGuard alone reduces it to 0.00, because prompts that rely on initial seeds lack diversity.
  2. Feedback-based attacks are more robust: Even with all 8 defenses deployed, the ASR of PAIR and TAP remains above 0.15 (0.16 and 0.19), because these attacks do not depend on seeds and generate diverse, natural prompts.
  3. Recent models still face substantial jailbreak risks: DeepSeek-V3 has the highest average ASR (0.75), and LAA achieves 100% on it.
  4. ASR varies significantly across different violation categories: Political Activities achieves ASR \(\ge 0.80\) on GPT-3.5/GPT-4, despite explicit policy prohibitions.

Highlights & Insights

  • Counter-intuitive insight: "High ASR \(\ne\) High Utility": Heuristic-based attacks, which seem the strongest, are almost entirely ineffective under defenses, whereas feedback-based attacks with lower ASR represent the real threat.
  • Practical value of the taxonomy: The six-category classification clearly delineates the defense vulnerability of attacks based on "whether they depend on an initial seed."
  • Unified step definition makes a fair comparison possible for the first time.
  • Most comprehensive policy unification: The first to construct a unified violation classification (16 categories) based on 5 service providers.

Limitations & Future Work

  • Only covers 17 attack methods, whereas over 200 jailbreak attacks currently exist.
  • The evaluation focuses primarily on English scenarios, leaving multilingual jailbreaks insufficiently explored.
  • The prohibited question set and policies are static and may become outdated.
  • Judging jailbreak success relies on GPT-4-Turbo as an evaluator, which may introduce bias.

Related Work & Insights
  • Safety-aligned LLMs: RLHF and red-teaming are mainstream safety training methods.
  • Automatic Jailbreak Attacks: GCG is gradient-based, AutoDAN is genetic algorithm-based, and PAIR is LLM feedback-based, each possessing its own advantages and disadvantages.
  • Defense Mechanisms: High perplexity detection, Moderation APIs, and the Llama Guard series—with significant variance in efficacy.
  • Insights: The community should prioritize studying attack methods that do not rely on initial seeds over incremental work on existing prompt variations.

Rating

  • Novelty: ⭐⭐⭐ Primarily systematic evaluation work, with the core innovation in the taxonomy.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 17 attacks \(\times\) 9 models \(\times\) 8 defenses, evaluated on 160 questions spanning 16 violation categories; extremely broad coverage.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed data presentation.
  • Value: ⭐⭐⭐⭐⭐ Provides highly valuable benchmarks and insights.