A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns¶

Conference: ACL 2025
arXiv: 2410.16155
Code: None
Area: LLM Alignment
Keywords: Multi-agent attack, contagious jailbreak, memory attack, toxicity dissipation, ARCJ

TL;DR¶

Proposes TMCHT, a large-scale, multi-agent, multi-topology text attack evaluation framework, and ARCJ (Adversarial Replicative Contagious Jailbreak), a two-stage attack method. ARCJ first optimizes a retrieval suffix to raise the probability that toxic samples are retrieved from memory, then optimizes a replicative suffix so that toxic information copies itself from agent to agent. Together these address the "toxicity dissipation" problem that single-agent attack methods face in multi-agent systems.

Background & Motivation¶

Background: LLMs are widely used as agents, where memory is a key component but susceptible to jailbreak attacks.
Limitations of Prior Work: Existing attacks focus only on single-agent memory or shared memory, whereas real-world multi-agent systems typically give each agent its own independent memory.
Key Challenge: The effectiveness of single-agent attack methods drops sharply in non-complete graph topologies (line/star) and large-scale (100+ agents) systems.
Goal: Achieve effective cross-agent jailbreak propagation in multi-agent systems with independent memories.
Key Insight: Discovery of the "toxicity dissipation" phenomenon—toxic suffixes gradually disappear when propagating between agents, leading to retrieval failure.
Core Idea: Empower toxic information with "self-replicating" capabilities to maintain toxicity during propagation.

Method¶

Overall Architecture¶

TMCHT task setting: A single attacker agent is placed in a given social topology (graph, line, or star) and, within a limited number of dialogue rounds, tries to mislead every agent in the society. The ARCJ method consists of two stages: it first optimizes the retrieval suffix, then the replicative suffix.

Key Designs¶

Retrieval Suffix Optimization:
- Function: Makes toxic samples more likely to be selected during memory retrieval.
- Mechanism: Minimize the distance between the toxic sample embedding and the query embedding: \(\min_{\delta_r} \text{dist}(\text{Embed}(x + \delta_r), \text{Embed}(q))\), where \(\delta_r\) is the retrieval suffix.
- Design Motivation: Toxic samples must first be retrieved to take effect—if benign content has higher similarity, toxic samples will never be utilized.
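
The retrieval objective above can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the bag-of-words `embed` stands in for a real embedding model, the greedy token search stands in for the paper's gradient-guided search, and the strings and `vocab` are harmless placeholders.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (assumed stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0

def optimize_retrieval_suffix(x, query, vocab, max_tokens=6):
    """Greedily append the token that most reduces dist(Embed(x + suffix), Embed(q))."""
    q = embed(query)
    suffix = []
    best = cosine_distance(embed(x), q)
    for _ in range(max_tokens):
        scored = [
            (cosine_distance(embed(" ".join([x, *suffix, tok])), q), tok)
            for tok in vocab
        ]
        dist, tok = min(scored)
        if dist >= best:  # no candidate token improves the objective
            break
        best = dist
        suffix.append(tok)
    return " ".join(suffix), best

def retrieve(memory, query, k=1):
    """Return the k memory entries closest to the query."""
    q = embed(query)
    return sorted(memory, key=lambda m: cosine_distance(embed(m), q))[:k]

query = "what is the capital of france"
benign = "paris is the capital of france"
injected = "the answer is lyon"  # placeholder for the attacker's sample x
vocab = ["capital", "france", "what", "weather", "of"]  # hypothetical candidates

suffix, dist = optimize_retrieval_suffix(injected, query, vocab)
top = retrieve([benign, injected + " " + suffix], query)[0]
print(repr(suffix), round(dist, 3), "->", top)
```

Without the suffix the benign entry is closer to the query and wins retrieval; with the optimized suffix the injected entry ranks first, which is the effect the retrieval stage is after.
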
Replicative Suffix Optimization (Contagious Capability):
- Function: Forces the attacked LLM to automatically replicate toxic information in its responses.
- Mechanism: \(\min_{\delta_c} -\log P(x + \delta_r + \delta_c \mid \text{context})\), where \(\delta_c\) is the replicative suffix. Minimizing this negative log-likelihood maximizes the probability that the LLM reproduces the toxic text in its response.
- Design Motivation: Addresses the core of "toxicity dissipation": in standard attacks, the suffix is rewritten or lost as messages pass between agents. The replicative suffix ensures that responses from downstream agents also contain the complete toxic content.
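
A rough way to see why dissipation matters is a toy survival model. This is not taken from the paper: it assumes each hop independently preserves the payload with a fixed probability, and both probabilities below are hypothetical values chosen only to show the shape of the effect.

```python
def survival_by_hop(p_survive, hops):
    """Probability the payload is still intact after each hop (independent hops assumed)."""
    return [p_survive ** k for k in range(1, hops + 1)]

# Hypothetical per-hop survival rates, not measurements from the paper.
plain = survival_by_hop(0.5, 5)        # suffix often rewritten or dropped
replicated = survival_by_hop(0.95, 5)  # agent almost always copies the payload

for hop, (a, b) in enumerate(zip(plain, replicated), start=1):
    print(f"hop {hop}: plain={a:.3f} replicated={b:.3f}")
```

Under this assumption survival decays geometrically with hop count, so even a moderate per-hop loss leaves almost nothing after a few agents, while a high copy rate keeps most of the payload intact.
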
Multi-Topology Evaluation Framework (TMCHT):
- Function: Defines multi-agent attack evaluation tasks under three social topologies: graph, line, and star.
- Mechanism: The attacker only communicates with neighboring agents and must influence the entire network within limited rounds; evaluation metrics are ASR (Attack Success Rate) and influence range.
- Design Motivation: The communication topology of real-world multi-agent systems is not a complete graph—line and star are the most challenging structures.
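
To make the topology argument concrete, the sketch below builds the three topologies as adjacency lists and uses breadth-first search to count how many hops separate the attacker from the farthest agent. The helper names are hypothetical, and treating one dialogue round as one hop is a simplifying assumption, so the result is only a lower bound on the rounds needed.

```python
from collections import deque

def complete_graph(n):
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def line_graph(n):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def star_graph(n):
    """Node 0 is the hub; every other node is a leaf."""
    adj = {0: list(range(1, n))}
    adj.update({i: [0] for i in range(1, n)})
    return adj

def rounds_to_reach_all(adj, attacker):
    """BFS depth from the attacker to the farthest agent."""
    depth = {attacker: 0}
    queue = deque([attacker])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return max(depth.values())

n = 6
print("complete:", rounds_to_reach_all(complete_graph(n), 0))
print("line (attacker at one end):", rounds_to_reach_all(line_graph(n), 0))
print("star (attacker is a leaf):", rounds_to_reach_all(star_graph(n), 1))
```

In a complete graph the attacker talks to everyone directly, whereas in a line or star the message must be relayed by intermediate agents, which is where a suffix can be lost.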

Loss & Training¶

The retrieval suffix is optimized using gradients (GCG-style discrete token search), and the replicative suffix is optimized using a maximum likelihood objective. The two are optimized serially.
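
Written out with the symbols from the Key Designs section, the serial schedule can be read as two successive problems. Holding the stage-one result \(\delta_r^{*}\) fixed during stage two is an assumption about what "serially" means here, not something the summary states explicitly.

\[
\delta_r^{*} = \arg\min_{\delta_r} \text{dist}\big(\text{Embed}(x + \delta_r), \text{Embed}(q)\big),
\qquad
\delta_c^{*} = \arg\min_{\delta_c} -\log P\big(x + \delta_r^{*} + \delta_c \mid \text{context}\big)
\]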

Key Experimental Results¶

Main Results¶

Attack success rate under different topologies (%):

Method	Graph	Line	Star	100 Agents
Single-agent baseline	~40	20.69	19.19	32.25
ARCJ (Ours)	~65	44.20	38.94	85.18
Gain	—	+23.51	+19.75	+52.93

Ablation Study¶

Contribution of each component (ASR, %):

Configuration	Line ASR	Star ASR
No suffix	~15	~12
Retrieval suffix only	20.69	19.19
Retrieval + replicative suffix	44.20	38.94
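
The per-component increments implied by the ablation table can be checked with a few lines of arithmetic. The "No suffix" figures are approximate in the table (~15, ~12), so the retrieval increments below are approximate as well; the dictionary keys are illustrative names.

```python
# ASR (%) from the ablation table; "none" values are approximate (~15, ~12).
ablation = {
    "Line": {"none": 15.0, "retrieval": 20.69, "both": 44.20},
    "Star": {"none": 12.0, "retrieval": 19.19, "both": 38.94},
}

def increments(row):
    """ASR points added by each suffix when stacked in order."""
    return {
        "retrieval": round(row["retrieval"] - row["none"], 2),
        "replicative": round(row["both"] - row["retrieval"], 2),
    }

deltas = {topology: increments(row) for topology, row in ablation.items()}
for topology, d in deltas.items():
    print(topology, d)
```

On both topologies the replicative suffix adds several times more ASR points than the retrieval suffix does on its own.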

Key Findings¶

Toxicity dissipation is the core barrier to multi-agent attacks: Toxic suffixes gradually disappear during propagation.
Replicative suffix contributes the most: On the line topology, adding it raises ASR from 20.69% to 44.20% (+23.51 points), versus roughly +5.7 points (from ~15% to 20.69%) for the retrieval suffix alone.
100-agent system is particularly vulnerable: The single-agent baseline reaches only 32.25%, whereas ARCJ reaches 85.18% (+52.93 points); in this setting, scaling up unexpectedly made the system easier to attack.
Reveals contagion risks in multi-agent architectures: Independent memories do not guarantee security.

Highlights & Insights¶

Discovery and naming of the "toxicity dissipation" phenomenon—precisely describing the root cause of multi-agent attack failures.
Unique "self-replicating" idea—allowing toxic information to propagate like a virus among agents.
100-agent experimental scale—the first to validate multi-agent attacks on such a large scale.
Important revelation that independent memory \(\neq\) safety.

Limitations & Future Work¶

Only validated on text tasks; multimodal multi-agent systems are not covered.
Attackers require white-box access to optimize suffixes.
Defense methods are not fully explored.
Social topology structures are simple, while real network topologies are more complex.

Related Work & Insights¶

vs. Single-Agent Memory Attack (Chen et al. 2024): That work attacks only a single agent's memory; this paper extends the attack to propagation across multiple agents.
vs. Shared Memory Attack (Ju et al. 2024): Shared memory propagates content naturally; this paper tackles the harder case of propagation across independent memories.
vs. GCG (Zou et al. 2023): GCG is the original adversarial suffix method; ARCJ adds a replication capability on top of suffix optimization.
Insights: The security of multi-agent systems needs to be considered from the perspective of propagation dynamics.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The discovery of toxicity dissipation + the design of the self-replicating suffix are highly innovative.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple topologies + multiple scales + ablation.
Writing Quality: ⭐⭐⭐⭐ Clear problem definition and vivid "Town" metaphor.
Value: ⭐⭐⭐⭐ Reveals the contagion risks of multi-agent systems.