A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
Conference: ACL 2025
arXiv: 2410.16155
Code: None
Area: LLM Alignment
Keywords: Multi-agent attack, contagious jailbreak, memory attack, toxicity dissipation, ARCJ
TL;DR
Proposes TMCHT, a large-scale, multi-agent, multi-topology text attack evaluation framework, and ARCJ (Adversarial Replicative Contagious Jailbreak), a two-part attack method. ARCJ optimizes a retrieval suffix so that toxic samples are more likely to be retrieved from memory, and a replicative suffix so that toxic information copies itself into downstream responses. Together these address the "toxicity dissipation" problem that single-agent attack methods face in multi-agent systems.
Background & Motivation
Background: LLMs are widely used as agents, where memory is a key component but susceptible to jailbreak attacks.
Limitations of Prior Work: Existing attacks focus only on single-agent memory or shared memory, whereas real-world multi-agent systems typically give each agent its own independent memory.
Key Challenge: The effectiveness of single-agent attack methods drops sharply in non-complete graph topologies (line/star) and large-scale (100+ agents) systems.
Goal: Achieve effective cross-agent jailbreak propagation in multi-agent systems with independent memories.
Key Insight: Discovery of the "toxicity dissipation" phenomenon—toxic suffixes gradually disappear when propagating between agents, leading to retrieval failure.
Core Idea: Empower toxic information with "self-replicating" capabilities to maintain toxicity during propagation.
Method
Overall Architecture
TMCHT task setting: a single attacker agent, placed in a given social topology (graph, line, or star), has a limited number of dialogue rounds to mislead the agents across the entire society. The ARCJ method has two stages: it first optimizes the retrieval suffix and then optimizes the replicative suffix.
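To make the role of topology concrete, here is a minimal Python sketch that builds the three topologies and measures how many hops separate the attacker from the farthest agent. It is an illustration only: the helper names (`build_topology`, `hops_from`, `max_hops`) are hypothetical, and treating "graph" as a complete graph and placing the star hub at agent 0 are assumptions, not details taken from the paper.

```python
from collections import deque


def build_topology(kind, n):
    """Return an adjacency list for n agents (hypothetical helper)."""
    if kind == "graph":  # assumption: "graph" means a complete graph
        return {i: [j for j in range(n) if j != i] for i in range(n)}
    if kind == "line":  # chain 0 - 1 - ... - (n-1)
        return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    if kind == "star":  # assumption: agent 0 is the hub, all others are leaves
        return {i: (list(range(1, n)) if i == 0 else [0]) for i in range(n)}
    raise ValueError(f"unknown topology: {kind}")


def hops_from(adj, attacker):
    """Breadth-first search: hop distance from the attacker to every agent."""
    dist = {attacker: 0}
    queue = deque([attacker])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def max_hops(kind, n, attacker=0):
    """Number of hops needed to reach the farthest agent from the attacker."""
    return max(hops_from(build_topology(kind, n), attacker).values())


for kind, attacker in [("graph", 0), ("line", 0), ("star", 1)]:
    print(kind, max_hops(kind, 6, attacker))
```

In a complete graph every agent is one hop away, while in a line the payload must survive a hop for every intermediate agent, which is where content lost in transit hurts the most.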
Key Designs
- Retrieval Suffix Optimization:
    - Function: Makes toxic samples more likely to be selected during memory retrieval.
    - Mechanism: Minimize the distance between the embedding of the suffixed toxic sample and the query embedding: \(\min_{\delta_r} \text{dist}(\text{Embed}(x + \delta_r), \text{Embed}(q))\), where \(x\) is the toxic sample, \(q\) is the query, \(\delta_r\) is the retrieval suffix, and \(+\) denotes concatenation.
    - Design Motivation: Toxic samples must be retrieved before they can take effect; if benign content is more similar to the query, the toxic samples are never used.
- Replicative Suffix Optimization (Contagious Capability):
    - Function: Forces the attacked LLM to replicate the toxic information in its responses.
    - Mechanism: \(\min_{\delta_c} -\log P(x + \delta_r + \delta_c \mid \text{context})\), where \(\delta_c\) is the replicative suffix. Minimizing this loss maximizes the probability that the LLM reproduces the toxic text, including both suffixes, in its response.
    - Design Motivation: Targets the core of "toxicity dissipation": in standard attacks, the suffix is rewritten or lost during propagation. The replicative suffix is meant to ensure that responses from downstream agents also contain the complete toxic content.
- Multi-Topology Evaluation Framework (TMCHT):
    - Function: Defines multi-agent attack evaluation tasks under three social topologies: graph, line, and star.
    - Mechanism: The attacker communicates only with neighboring agents and must influence the entire network within a limited number of rounds; the evaluation metrics are ASR (Attack Success Rate) and influence range.
    - Design Motivation: Real-world multi-agent systems do not generally communicate over a complete graph, and line and star are the most challenging structures.
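The retrieval objective above can be sketched with a toy example. This is not the paper's implementation: the bag-of-words `embed` function stands in for a real embedding model, the exhaustive greedy search stands in for a gradient-guided token search, and all names and strings below are hypothetical placeholders.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def greedy_retrieval_suffix(x, query, vocab, suffix_len=3):
    """Greedily append the token that most reduces dist(Embed(x + suffix), Embed(q))."""
    q = embed(query)
    suffix = []
    for _ in range(suffix_len):
        best = min(
            vocab,
            key=lambda tok: cosine_distance(embed(" ".join([x, *suffix, tok])), q),
        )
        suffix.append(best)
    return suffix


x = "placeholder memory entry"          # benign stand-in for the sample x
q = "what is the capital of france"     # stand-in for the retrieval query q
suffix = greedy_retrieval_suffix(x, q, ["capital", "france", "weather", "music"])
print(suffix, cosine_distance(embed(" ".join([x, *suffix])), embed(q)))
```

The point of the sketch is only that appending tokens chosen against the query embedding pulls the stored entry closer to the query, which raises its chance of being retrieved.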
Loss & Training
The retrieval suffix is optimized with a gradient-guided discrete token search (GCG-style), and the replicative suffix is optimized with a maximum-likelihood objective. The two suffixes are optimized serially, retrieval suffix first.
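As a rough illustration of the maximum-likelihood objective, the sketch below scores candidate replicative suffixes by the negative log-likelihood of the target text and keeps the lowest-loss one. The per-token probabilities are made-up numbers, and `sequence_nll` and `pick_best_suffix` are hypothetical helpers; in a real setup the probabilities would come from the attacked LLM.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence from per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def pick_best_suffix(candidates):
    """Return the candidate suffix whose target text has the lowest NLL.

    `candidates` maps a suffix label to the (assumed) probabilities the model
    assigns to each token of the text it is supposed to replicate.
    """
    return min(candidates, key=lambda name: sequence_nll(candidates[name]))


# Illustrative numbers only, not measurements.
candidates = {
    "suffix_a": [0.5, 0.5],
    "suffix_b": [0.9, 0.8],
}
print(pick_best_suffix(candidates))
```

Because the loss is a sum of per-token log terms, a single low-probability token in the replicated text can dominate the loss, which is consistent with the goal of reproducing the complete content rather than most of it.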
Key Experimental Results
Main Results
Attack success rate under different topologies (%):
| Method | Graph | Line | Star | 100 Agents |
|---|---|---|---|---|
| Single-agent baseline | ~40 | 20.69 | 19.19 | 32.25 |
| ARCJ (Ours) | ~65 | 44.20 | 38.94 | 85.18 |
| Gain | — | +23.51 | +18.95 | +52.93 |
Ablation Study
Contributions of each component:
| Configuration | Line ASR | Star ASR |
|---|---|---|
| No suffix | ~15 | ~12 |
| Retrieval suffix only | 20.69 | 19.19 |
| Retrieval + replicative suffix | 44.20 | 38.94 |
Key Findings
- Toxicity dissipation is the core barrier to multi-agent attacks: toxic suffixes gradually disappear during propagation.
- The replicative suffix contributes the most: on the line topology, adding it raises ASR from 20.69% to 44.20% (+23.51 points), whereas the retrieval suffix alone raises ASR only from ~15% to 20.69%.
- The 100-agent system is particularly vulnerable: the baseline reaches only 32.25%, while ARCJ reaches 85.18%, so scaling up appears to make the system easier, not harder, to attack with ARCJ.
- Reveals contagion risks in multi-agent architectures: Independent memories do not guarantee security.
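A back-of-the-envelope model can illustrate why dissipation matters so much on long paths. The sketch below is not the paper's model: it assumes, purely for illustration, that the payload survives each hop along a line independently with a fixed probability `keep_prob`, so the agent `k` hops from the attacker is reached with probability `keep_prob ** k`. The function name `expected_reach` is hypothetical.

```python
def expected_reach(n_agents, keep_prob):
    """Expected number of agents reached along a line of n_agents.

    Assumption (illustrative only): agent 0 is the attacker and each hop
    preserves the payload independently with probability keep_prob.
    """
    return sum(keep_prob ** k for k in range(1, n_agents))


# Compare a payload that often gets lost with one that almost always survives.
for keep_prob in (0.5, 0.95):
    print(keep_prob, expected_reach(10, keep_prob))
```

Under this toy assumption the expected reach is bounded by a geometric series when `keep_prob` is well below 1, no matter how long the line is, whereas a payload that reliably copies itself keeps spreading.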
Highlights & Insights
- Discovery and naming of the "toxicity dissipation" phenomenon—precisely describing the root cause of multi-agent attack failures.
- Unique "self-replicating" idea—allowing toxic information to propagate like a virus among agents.
- 100-agent experimental scale—the first to validate multi-agent attacks on such a large scale.
- Important revelation that independent memory \(\neq\) safety.
Limitations & Future Work
- Only validated on text tasks; multimodal multi-agent systems are not covered.
- Attackers require white-box access to optimize suffixes.
- Defense methods are not fully explored.
- Social topology structures are simple, while real network topologies are more complex.
Related Work & Insights
- vs. Single-Agent Memory Attack (Chen et al. 2024): Attacks only a single agent's memory; ARCJ extends the attack to multi-agent propagation.
- vs. Shared Memory Attack (Ju et al. 2024): Shared memory propagates content naturally; ARCJ addresses the propagation challenge in independent memories.
- vs. GCG (Zou et al. 2023): The original adversarial suffix method; ARCJ adds replication capability.
- Insights: The security of multi-agent systems needs to be considered from the perspective of propagation dynamics.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The discovery of toxicity dissipation + the design of the self-replicating suffix are highly innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple topologies + multiple scales + ablation.
- Writing Quality: ⭐⭐⭐⭐ Clear problem definition and vivid "Town" metaphor.
- Value: ⭐⭐⭐⭐ Reveals the contagion risks of multi-agent systems.