LLMs Know Their Vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

Conference: ACL 2025
arXiv: 2410.10700
Code: https://github.com/AI45Lab/ActorAttack
Area: LLM/NLP
Keywords: jailbreak, actor-network theory, multi-turn attack, safety alignment, natural distribution shift

TL;DR

Proposes ActorBreaker, a multi-turn attack method based on Latour's Actor-Network Theory. By leveraging benign prompts semantically related to harmful content (natural distribution shifts) to bypass safety mechanisms, it achieves state-of-the-art (SOTA) attack success rates on HarmBench, revealing the semantic coverage gap between pre-training and safety training data.

Background & Motivation

Background: LLM safety training teaches models to refuse harmful queries, yet jailbreak attacks (e.g., GCG) can still bypass this refusal behavior. Most attacks exploit adversarial distribution shifts (e.g., ciphertexts, low-resource language translation).

Limitations of Prior Work: Existing multi-turn attacks rely on fixed strategies (e.g., role-play, hypothetical scenarios), lacking diversity. Moreover, the use of "unnatural" prompts makes them easily detectable.

Key Challenge: The pre-training data contains a vast amount of seemingly benign knowledge semantically related to harmful topics (e.g., the association between "Ted Kaczynski" and bomb-making), but safety training fails to cover these indirect associations.

Goal: To construct benign yet effective multi-turn attacks by leveraging natural semantic associations within the pre-training distribution.

Key Insight: Latour's Actor-Network Theory (ANT) is used to decompose a harmful target into six categories of actors (creator, disseminator, recipient, regulator, influencer, and associator), each containing human and non-human entities.

Core Idea: Use the LLM's own knowledge to construct a semantic association network around the harmful content, then use benign multi-turn dialogue to progressively guide the model toward exposing unsafe content.

Method

Overall Architecture

Two phases: (1) Network Construction: given a harmful query, the LLM is used to build an actor-network (six categories of actors \(\times\) human/non-human entities), where each node serves as a potential attack clue; (2) Attack Generation: an actor and its semantic relationship with the harmful target are selected as a clue, and multi-turn benign prompts are generated to progressively guide the model.
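
The network described above can be pictured as a small data structure with 6 actor types \(\times\) 2 entity kinds = 12 slots. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `ActorNetwork`, its methods, and the placeholder actor names are all hypothetical, and in the paper the slots are filled by the LLM rather than by hand.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Tuple

# The six actor types and two entity kinds named in this summary.
ACTOR_TYPES = ("creator", "disseminator", "recipient",
               "regulator", "influencer", "associator")
ENTITY_KINDS = ("human", "non-human")

Slot = Tuple[str, str]  # (actor_type, entity_kind)


@dataclass
class ActorNetwork:
    """Hypothetical container for an actor-network around one target."""
    target: str
    # One empty list per (actor_type, entity_kind) slot: 6 x 2 = 12 slots.
    nodes: Dict[Slot, List[str]] = field(
        default_factory=lambda: {s: [] for s in product(ACTOR_TYPES, ENTITY_KINDS)}
    )

    def add_actor(self, actor_type: str, entity_kind: str, name: str) -> None:
        slot = (actor_type, entity_kind)
        if slot not in self.nodes:
            raise KeyError(f"unknown slot: {slot}")
        self.nodes[slot].append(name)

    def clues(self) -> List[Tuple[str, str, str]]:
        """Flatten the network into (actor_type, entity_kind, name) clues."""
        return [(t, k, n) for (t, k), names in self.nodes.items() for n in names]


# Placeholder entries only; real entries would come from the LLM.
net = ActorNetwork(target="<target placeholder>")
net.add_actor("creator", "human", "<historical figure>")
net.add_actor("disseminator", "non-human", "<book or media outlet>")
```

Each tuple returned by `clues()` corresponds to one node that the second phase could pick as its starting point.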

Key Designs

  1. Actor-Network Construction

    • Six types of actors: creator, disseminator, recipient, regulator, influencer, and associator.
    • Each type is categorized into human (historical figures, etc.) and non-human (books, media, social movements) entities.
    • Instantiating the network using the LLM's own knowledge.
    • Design Motivation: Latour's theory provides a systematic way to cover potential attack paths.
  2. Attack Chain Generation

    • Selecting an actor and its semantic relationship with the target.
    • Designing a multi-turn sequence of seemingly harmless questions that gradually approaches the target.
    • Design Motivation: Each turn is benign when viewed individually, but they collectively guide the model.
  3. Natural Distribution Shifts vs. Adversarial Distribution Shifts

    • The prompts in this work are within the pre-training distribution (naturally benign).
    • Prompts in prior methods are out-of-distribution (encryption/role-play).
    • Design Motivation: Natural prompts rarely trigger safety detectors.
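
The motivation in item 2 (each turn looks benign alone, but the turns add up) can be illustrated from the defender's side with a toy comparison of per-turn and conversation-level flagging. This is only a sketch under loud assumptions: the `Turn` class, the per-turn risk scores, the 0.5 threshold, and the noisy-OR combination rule are all hypothetical and do not describe Llama-Guard-2 or any method from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    text: str
    risk: float  # hypothetical per-turn risk score in [0, 1]


def flag_per_turn(turns: List[Turn], threshold: float = 0.5) -> bool:
    """Flag the dialogue only if some single turn reaches the threshold."""
    return any(t.risk >= threshold for t in turns)


def flag_cumulative(turns: List[Turn], threshold: float = 0.5) -> bool:
    """Flag on combined risk, using a noisy-OR: 1 - prod(1 - risk_i).

    The noisy-OR rule is an illustrative assumption, not a real detector.
    """
    p_all_safe = 1.0
    for t in turns:
        p_all_safe *= 1.0 - t.risk
    return (1.0 - p_all_safe) >= threshold


# Placeholder dialogue: every turn is individually below the threshold.
dialogue = [
    Turn("<benign question about an actor>", 0.20),
    Turn("<follow-up question>", 0.25),
    Turn("<request for more detail>", 0.30),
]

per_turn_flag = flag_per_turn(dialogue)      # no single turn reaches 0.5
cumulative_flag = flag_cumulative(dialogue)  # 1 - 0.8 * 0.75 * 0.7 = 0.58
```

With these made-up scores the per-turn check passes the dialogue while the conversation-level check flags it, which is the gap the design motivation points at.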

Key Experimental Results

Main Results — HarmBench Attack Success Rate

| Method | Type | GPT-4o | Claude-3 | Llama-3-70B | Average |
|---|---|---|---|---|---|
| GCG | Single-turn | ~20% | ~15% | ~35% | ~23% |
| PAIR | Single-turn | ~30% | ~25% | ~40% | ~32% |
| Crescendo | Multi-turn | ~40% | ~35% | ~50% | ~42% |
| ActorBreaker | Multi-turn | ~55% | ~45% | ~65% | ~55% |
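
As a quick consistency check, the Average column can be reproduced from the three model columns. The sketch below assumes the average is the unweighted mean of the three approximate per-model figures, rounded to the nearest percent; the summary does not state how it was computed.

```python
from statistics import mean

# Approximate attack success rates (%) as listed in the table:
# (GPT-4o, Claude-3, Llama-3-70B)
asr = {
    "GCG": (20, 15, 35),
    "PAIR": (30, 25, 40),
    "Crescendo": (40, 35, 50),
    "ActorBreaker": (55, 45, 65),
}

# Unweighted mean per method, rounded to the nearest percent.
averages = {method: round(mean(rates)) for method, rates in asr.items()}

# Gap between ActorBreaker and the strongest baseline, in percentage points.
gap_vs_crescendo = averages["ActorBreaker"] - averages["Crescendo"]
```

Under that assumption the computed means match the listed averages, and ActorBreaker leads Crescendo by 13 percentage points on average.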

Safety Detector Bypassing

| Method | Llama-Guard-2 Detection Rate |
|---|---|
| GCG prompt | ~80% (easily detected) |
| PAIR prompt | ~60% |
| Crescendo prompt | ~40% |
| ActorBreaker prompt | ~5% (barely detected) |

Defense Effectiveness

| Configuration | Attack Success Rate |
|---|---|
| Baseline (No Defense) | 55% |
| Fine-tuned on ActorBreaker safety data | 25% (-30 percentage points) |

Side effect of fine-tuning: general capabilities degrade by ~3% (slight trade-off).
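
The "-30" figure is an absolute difference, so it is worth separating it from the relative reduction. Using only the two attack success rates listed above:

\[
\Delta_{\text{abs}} = 55\% - 25\% = 30 \text{ percentage points},
\qquad
\Delta_{\text{rel}} = \frac{55 - 25}{55} \approx 54.5\%
\]

In other words, fine-tuning removes a little over half of the successful attacks relative to the undefended baseline.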

Key Findings

  • ActorBreaker achieves the highest attack success rate on all evaluated aligned LLMs, including GPT-o1.
  • Llama-Guard-2 barely detects these prompts, as each individual turn is benign.
  • Attack diversity far exceeds existing methods, as the six types of actors provide rich attack paths.
  • Defense is effective but has a trade-off: Fine-tuning on ActorBreaker data reduces the attack success rate by 30 percentage points (from 55% to 25%), but slightly degrades general capabilities.
  • The semantic gap between pre-training and safety training is the fundamental problem.

Highlights & Insights

  • Application of Actor-Network Theory in AI safety is highly novel, transforming a sociological theory into a systematic red-teaming methodology.
  • The conceptual distinction between natural distribution shifts and adversarial distribution shifts uncovers the fundamental limitations of safety alignment.
  • The recursive setup, in which an LLM's own knowledge is used to attack it, suggests that the model already encodes the associations its safety training fails to cover.

Limitations & Future Work

  • The attack method could potentially be abused.
  • The construction of the actor-network depends on the LLM itself; if safety training approaches perfection, the LLM might fail to construct the network.
  • Future Directions: Broader coverage in safety training data and safety mechanisms parameterized by semantic distance.

Comparison with Related Work

  • vs GCG/PAIR: While they rely on adversarial distribution shifts, ActorBreaker utilizes natural distribution shifts.
  • vs Crescendo: Crescendo relies on fixed templates, whereas ActorBreaker generates diverse paths grounded in network theory.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Innovative combination of Actor-Network Theory and natural distribution shifts.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation across multiple models, multiple baseline attacks, and defense experiments.
  • Writing Quality: ⭐⭐⭐⭐⭐ Solid theoretical foundation.
  • Value: ⭐⭐⭐⭐⭐ Offers significant insights for AI safety research.