LLMs Know Their Vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts

Conference: ACL 2025
arXiv: 2410.10700
Code: https://github.com/AI45Lab/ActorAttack
Area: LLM/NLP
Keywords: jailbreak, actor-network theory, multi-turn attack, safety alignment, natural distribution shift

TL;DR

Proposes ActorBreaker, a multi-turn attack method based on Latour's Actor-Network Theory. It bypasses safety mechanisms with benign prompts that are semantically related to harmful content (natural distribution shifts) and achieves state-of-the-art (SOTA) attack success rates on HarmBench. The result exposes a semantic coverage gap between pre-training data and safety training data.

Background & Motivation

Background: LLM safety training teaches models to refuse harmful queries, yet adversarial attacks (e.g., GCG and other jailbreaks) can still bypass it. Most attacks exploit adversarial distribution shifts (e.g., ciphertexts, low-resource language translation).

Limitations of Prior Work: Existing multi-turn attacks rely on fixed strategies (e.g., role-play, hypothetical scenarios), lacking diversity. Moreover, the use of "unnatural" prompts makes them easily detectable.

Key Challenge: The pre-training data contains a vast amount of seemingly benign knowledge semantically related to harmful topics (e.g., the association between "Ted Kaczynski" and bomb-making), but safety training fails to cover these indirect associations.

Goal: To construct benign yet effective multi-turn attacks by leveraging natural semantic associations within the pre-training distribution.

Key Insight: Latour's Actor-Network Theory (ANT) offers a way to decompose a harmful target into six categories of actors (creator, disseminator, recipient, regulator, etc.), each containing human and non-human entities.

Core Idea: Utilizing the LLM's own knowledge to construct a semantic association network of harmful content, and using benign multi-turn dialogues to progressively guide the model to expose unsafe content.

Method

Overall Architecture

Two phases: (1) Network Construction: given a harmful query, the LLM is used to build an actor-network (six categories of actors \(\times\) human/non-human entities), where each node acts as a potential attack clue; (2) Attack Generation: an actor and its semantic relationship with the harmful target are selected as a clue, and multi-turn benign prompts are then generated to progressively guide the model.

Key Designs

Actor-Network Construction
- Six types of actors: creator, disseminator, recipient, regulator, influencer, and associator.
- Each type is categorized into human (historical figures, etc.) and non-human (books, media, social movements) entities.
- Instantiating the network using the LLM's own knowledge.
- Design Motivation: Latour's theory provides a systematic taxonomy, which gives broad coverage of potential attack paths.
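
The network described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names (`ActorType`, `EntityKind`, `Actor`, `ActorNetwork`), the `uncovered_slots` helper, and the placeholder entries are all assumptions made for illustration. It only shows how six actor types crossed with human/non-human entities yield twelve slots whose coverage can be checked.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActorType(Enum):
    # The six actor roles listed in the summary above.
    CREATOR = "creator"
    DISSEMINATOR = "disseminator"
    RECIPIENT = "recipient"
    REGULATOR = "regulator"
    INFLUENCER = "influencer"
    ASSOCIATOR = "associator"


class EntityKind(Enum):
    HUMAN = "human"          # e.g. historical figures
    NON_HUMAN = "non_human"  # e.g. books, media, social movements


@dataclass(frozen=True)
class Actor:
    name: str              # placeholder label (hypothetical, not from the paper)
    actor_type: ActorType
    kind: EntityKind
    relation: str          # short description of the link to the topic


@dataclass
class ActorNetwork:
    topic: str
    actors: list[Actor] = field(default_factory=list)

    def add(self, actor: Actor) -> None:
        self.actors.append(actor)

    def uncovered_slots(self) -> set[tuple[ActorType, EntityKind]]:
        """Return the (type, kind) combinations that have no actor yet."""
        all_slots = {(t, k) for t in ActorType for k in EntityKind}
        filled = {(a.actor_type, a.kind) for a in self.actors}
        return all_slots - filled


# Placeholder entries only; a real network would be instantiated by the LLM.
network = ActorNetwork(topic="<topic under red-team review>")
network.add(Actor("placeholder regulatory body", ActorType.REGULATOR,
                  EntityKind.NON_HUMAN, "sets rules about the topic"))
network.add(Actor("placeholder historian", ActorType.INFLUENCER,
                  EntityKind.HUMAN, "wrote about the topic"))

# 6 actor types x 2 entity kinds = 12 slots, of which 2 are filled here.
print(len(network.uncovered_slots()))
```

Tracking unfilled slots is one simple way to express the coverage goal stated in the design motivation; the paper may measure coverage differently.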
Attack Chain Generation
- Selecting an actor and its semantic relationship with the target.
- Designing a sequence of seemingly harmless questions, spread over multiple turns, that gradually approaches the target.
- Design Motivation: Each turn is benign when viewed individually, but they collectively guide the model.
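
From a defender's perspective, the design motivation above explains why turn-level screening can miss this kind of dialogue. The toy sketch below assumes a hypothetical risk score in [0, 1] per turn and uses a plain sum as the conversation-level aggregate; the scores, the threshold, and both function names are illustrative assumptions, not values or methods from the paper.

```python
def flags_per_turn(turn_scores, threshold):
    """Flag a conversation if any single turn reaches the threshold."""
    return any(score >= threshold for score in turn_scores)


def flags_conversation(turn_scores, threshold):
    """Flag a conversation if the accumulated score reaches the threshold."""
    return sum(turn_scores) >= threshold


# Hypothetical risk scores for a four-turn dialogue (not from the paper).
scores = [0.2, 0.3, 0.3, 0.4]
THRESHOLD = 0.5

print(flags_per_turn(scores, THRESHOLD))      # False: no single turn reaches 0.5
print(flags_conversation(scores, THRESHOLD))  # True: the turns together exceed 0.5
```

A real detector would need a better aggregate than a sum, but the contrast between the two checks is the point of the sketch.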
Natural Distribution Shifts vs. Adversarial Distribution Shifts
- The prompts in this work are within the pre-training distribution (naturally benign).
- Prompts in prior methods are out-of-distribution (encryption/role-play).
- Design Motivation: Natural prompts rarely trigger safety detectors.

Key Experimental Results

Main Results: HarmBench Attack Success Rate

Method	Type	GPT-4o	Claude-3	Llama-3-70B	Average
GCG	Single-turn	~20%	~15%	~35%	~23%
PAIR	Single-turn	~30%	~25%	~40%	~32%
Crescendo	Multi-turn	~40%	~35%	~50%	~42%
ActorBreaker	Multi-turn	~55%	~45%	~65%	~55%
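
As a consistency check, the Average column can be recomputed from the three per-model values in the table above. The sketch below copies the approximate figures from the table; the variable names are illustrative, and the one-point tolerance is an assumption that allows for rounding.

```python
from statistics import mean

# Approximate attack success rates (%) from the table: GPT-4o, Claude-3, Llama-3-70B.
asr = {
    "GCG": [20, 15, 35],
    "PAIR": [30, 25, 40],
    "Crescendo": [40, 35, 50],
    "ActorBreaker": [55, 45, 65],
}
reported_average = {"GCG": 23, "PAIR": 32, "Crescendo": 42, "ActorBreaker": 55}

for method, values in asr.items():
    computed = mean(values)
    # Each reported average should match the recomputed mean up to rounding.
    assert abs(computed - reported_average[method]) < 1
    print(f"{method}: {computed:.1f}")

# Per-model margin of ActorBreaker over the strongest baseline in the table.
margins = [a - c for a, c in zip(asr["ActorBreaker"], asr["Crescendo"])]
print(margins)  # [15, 10, 15]
```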

Safety Detector Bypassing

Method	Llama-Guard-2 Detection Rate
GCG prompt	~80% (easily detected)
PAIR prompt	~60%
Crescendo prompt	~40%
ActorBreaker prompt	~5% (barely detected)

Defense Effectiveness

Configuration / Metric	Value
Attack success rate, baseline (no defense)	55%
Attack success rate, fine-tuned on ActorBreaker safety data	25% (-30 pp)
General capability degradation after fine-tuning	~3% (slight trade-off)
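
The 30-point drop in the table is an absolute difference in percentage points, which is not the same as a 30% relative reduction. A short worked calculation using the two attack success rates from the table (the variable names are illustrative):

```python
baseline_asr = 55.0   # attack success rate without defense (%)
defended_asr = 25.0   # attack success rate after safety fine-tuning (%)

# Absolute change, measured in percentage points.
absolute_drop_pp = baseline_asr - defended_asr

# Relative change, measured as a share of the baseline rate.
relative_drop = absolute_drop_pp / baseline_asr * 100

print(f"{absolute_drop_pp:.0f} percentage points")  # 30 percentage points
print(f"{relative_drop:.1f}% relative reduction")   # 54.5% relative reduction
```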

Key Findings

ActorBreaker achieves the highest attack success rate on all evaluated aligned LLMs, including GPT-o1.
Llama-Guard-2 flags only about 5% of ActorBreaker prompts, since each individual turn is benign.
Attack diversity exceeds that of existing methods, since the six types of actors provide many distinct attack paths.
Defense is effective but carries a trade-off: fine-tuning on ActorBreaker safety data lowers the attack success rate from 55% to 25% (30 percentage points), at the cost of roughly 3% degradation in general capabilities.
The semantic coverage gap between pre-training data and safety training data is the fundamental problem.

Highlights & Insights

Application of Actor-Network Theory in AI safety is highly novel, transforming a sociological theory into a systematic red-teaming methodology.
The conceptual distinction between natural distribution shifts and adversarial distribution shifts uncovers the fundamental limitations of safety alignment.
The attack is self-referential: the LLM's own knowledge is used to map the semantic paths that lead it to unsafe outputs.

Limitations & Future Work

The attack method could potentially be abused.
The construction of the actor-network depends on the LLM itself; a model whose safety training is near-perfect might refuse to construct the network.
Future Directions: Broader coverage in safety training data and safety mechanisms parameterized by semantic distance.

vs. GCG/PAIR: GCG and PAIR rely on adversarial distribution shifts, whereas ActorBreaker uses natural distribution shifts.
vs. Crescendo: Crescendo relies on fixed templates, whereas ActorBreaker generates diverse paths grounded in Actor-Network Theory.

Rating

Novelty: ⭐⭐⭐⭐⭐ Innovative combination of Actor-Network Theory and natural distribution shifts.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation across multiple models, multiple baseline attacks, and defense experiments.
Writing Quality: ⭐⭐⭐⭐⭐ Solid theoretical foundation.
Value: ⭐⭐⭐⭐⭐ Offers significant insights for AI safety research.