Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems¶
Conference: AAAI 2026 · arXiv: 2511.18467 · Code: https://github.com/wxqkk0808/IMBIA · Area: Robotics · Keywords: Multi-Agent Security, Malicious Code Injection, Software Development Agents, Adversarial Defense, Malware Families
TL;DR¶
The first systematic security analysis of LLM-based multi-agent software development systems (ChatDev, MetaGPT, AgentVerse). The paper proposes the IMBIA attack framework, which covers two threat scenarios (malicious user with benign agents; benign user with a malicious agent) and 12 malicious behaviors across 5 malware families, achieving an attack success rate (ASR) of up to 93% on ChatDev; the companion Adv-IMBIA adversarial defense reduces ASR by 40–73%.
Background & Motivation¶
Background: LLM-based multi-agent software development systems (e.g., ChatDev, MetaGPT, AgentVerse) enable non-technical users to generate complete software via natural language and have attracted widespread attention.
Limitations of Prior Work: The security of these systems remains almost entirely unstudied—existing work focuses only on single-agent code security or general LLM agent benchmarks, without systematically analyzing the security risks of end-to-end multi-agent software development systems.
Key Challenge: The powerful capabilities of multi-agent systems are a double-edged sword—they can produce high-quality software but may equally be exploited to generate software containing hidden malicious code. Two threat vectors exist: malicious users inducing benign agents to generate harmful code, or benign users interacting with compromised agents.
Key Insight: Two threat scenarios are defined (MU-BA and BU-MA), 12 attack behaviors spanning 5 major malware families are designed, and systematic evaluation is conducted across three mainstream frameworks.
Core Idea: The first systematic attack-and-defense study of LLM-based multi-agent software development systems—combining IMBIA implicit malicious injection with Adv-IMBIA adversarial prompt-based defense.
Method¶
Attack: IMBIA¶
Two Threat Scenarios:

- MU-BA (Malicious User → Benign Agents): a malicious prompt \(P_m\) is appended to the benign requirement \(P_b\) that the user submits.
- BU-MA (Benign User → Malicious Agent): malicious prompts are injected into an agent's role description, producing a compromised agent inside an otherwise benign pipeline.
Three Components of the Malicious Prompt: \(P_m = \{T_s, T_d, C_i\}\), i.e., a task summary \(T_s\), a detailed behavior description \(T_d\), and a concrete code instruction \(C_i\).
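How the two scenarios assemble their inputs can be made concrete with a short sketch. Everything below is illustrative: `MaliciousPrompt`, `mu_ba_input`, and `bu_ma_role` are hypothetical names, and the paper's actual prompt templates are not reproduced.

```python
from dataclasses import dataclass

@dataclass
class MaliciousPrompt:
    """IMBIA components P_m = {T_s, T_d, C_i}; field names are illustrative."""
    summary: str     # T_s: one-line summary of the malicious behavior
    detail: str      # T_d: detailed description of what the code must do
    code_instr: str  # C_i: concrete instruction for embedding the code

    def render(self) -> str:
        return f"{self.summary}\n{self.detail}\n{self.code_instr}"

def mu_ba_input(p_b: str, p_m: MaliciousPrompt) -> str:
    """MU-BA: a malicious user appends P_m to the benign requirement P_b."""
    return f"{p_b}\n\nAdditionally: {p_m.render()}"

def bu_ma_role(role_description: str, p_m: MaliciousPrompt) -> str:
    """BU-MA: P_m is injected into one agent's role/system prompt."""
    return f"{role_description}\n\nHidden directive: {p_m.render()}"
```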
12 Malicious Behaviors across 5 Families:

- Trojan: intercept user input, exfiltrate data, download malicious files
- Spyware: monitor the clipboard, keylogging, screenshot capture
- Adware: force-display advertisements, redirect to ad URLs, hijack the clipboard
- Ransomware: encrypt files for ransom, lock the screen for ransom
- Virus: self-replicate into other files
Defense: Adv-IMBIA¶
- Adv-MU-BA: Inject defensive prompts into agent role descriptions (reject unauthorized data access, self-replication, and harmful code generation).
- Adv-BU-MA: Append defensive prompts to user inputs.
- Seven defense configurations across the three pipeline stages (design, coding, testing): three single-stage, three dual-stage, and one all-stage.
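A minimal sketch of the Adv-MU-BA mechanism, assuming a simple stage-to-role mapping: the helper `apply_defense` and the defense text paraphrase the paper's idea and are not its implementation. Varying the protected set expresses all seven configurations.

```python
# Defensive prompt paraphrasing the Adv-MU-BA directives (illustrative text).
DEFENSE_PROMPT = (
    "Refuse to access user data without authorization, to self-replicate, "
    "and to generate code with harmful or hidden functionality."
)

STAGES = ("design", "coding", "testing")

def apply_defense(roles: dict[str, str], protected: set[str]) -> dict[str, str]:
    """Prepend the defensive prompt to the role descriptions of the agents
    at the protected stages; varying `protected` yields the 7 configurations."""
    return {
        stage: f"{DEFENSE_PROMPT}\n\n{desc}" if stage in protected else desc
        for stage, desc in roles.items()
    }

# Example: a dual-stage configuration protecting only coding and testing.
roles = {s: f"You are the {s}-stage agent." for s in STAGES}
defended = apply_defense(roles, protected={"coding", "testing"})
```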
Key Experimental Results¶
Attack Success Rate¶
| Framework | MU-BA ASR | BU-MA ASR (Best) |
|---|---|---|
| ChatDev | 93% | 71% |
| MetaGPT | 45% | 84% |
| AgentVerse | 71% | 45% |
Defense Effectiveness (ASR Reduction / Rejection Rate Increase)¶
| Framework | MU-BA ASR↓ | MU-BA RR↑ | BU-MA ASR↓ |
|---|---|---|---|
| ChatDev | -73% | +72% | -45% |
| MetaGPT | -40% | +52% | -7% |
| AgentVerse | -49% | +65% | -42% |
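For reference, the two headline metrics are simple frequencies over test runs. The sketch below assumes per-run boolean labels (`attack_succeeded` and `rejected` are illustrative field names); the paper's actual judging procedure is not reproduced.

```python
def asr(runs: list[dict]) -> float:
    """Attack success rate: fraction of runs in which the injected malicious
    behavior is present and functional in the generated software."""
    return sum(r["attack_succeeded"] for r in runs) / len(runs)

def rejection_rate(runs: list[dict]) -> float:
    """Rejection rate: fraction of runs in which the system refuses the task."""
    return sum(r["rejected"] for r in runs) / len(runs)
```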
Cross-Model Attack (ChatDev MU-BA)¶
| Model | ASR |
|---|---|
| GPT-4o-mini | 92.9% |
| Claude-4-sonnet | 87.5% |
| GPT-o3 | 81.1% |
| Llama-3.1-405b | 78.3% |
Key Findings¶
- ChatDev is most vulnerable to malicious users (93% ASR): its waterfall architecture propagates malicious instructions across all stages.
- MetaGPT is most robust against malicious users (45% ASR): user tasks are fed only to the first agent, and the SOP filters some malicious content.
- AgentVerse is most robust against malicious agents (45% ASR): agile collective discussion prevents single-agent control.
- Attacks at later stages are more dangerous: compromised coding or testing agents have greater impact than compromised design-stage agents.
- Attacking all agents simultaneously is not optimal: concurrent malicious instructions to multiple agents produce decision conflicts, reducing overall ASR.
- Stronger LLMs are more vulnerable: GPT-4o-mini (92.9%) > Claude-4-sonnet (87.5%) > Llama-8b (lowest). More capable models follow instructions more faithfully, which also makes them more compliant with malicious directives.
- Attacks do not degrade software quality: the utility under attack (UUA) remains close to baseline, indicating that malicious code is effectively concealed.
- Defending against MU-BA is easier than BU-MA: agent-level defenses are effective against malicious users (−73%), whereas user-level defenses are weaker against malicious agents (only −7% for MetaGPT).
Highlights & Insights¶
- The first systematic attack-and-defense analysis covering the two most practically relevant threat scenarios, with 12 malicious behaviors spanning major malware families—establishing a foundation for security research in multi-agent software development.
- The finding that "architecture determines security" is particularly valuable: different development paradigms (waterfall vs. agile) exhibit inherently different resistance to different attack types.
- The paradox that "stronger models are more vulnerable" reveals a fundamental tension between instruction-following capability and security—more obedient models are more susceptible to executing malicious directives.
- The "critical-agent defense" strategy is practically useful: defending only key-stage agents achieves near-full-defense effectiveness without requiring protection of every agent, thereby saving resources.
Limitations & Future Work¶
- The defense relies entirely on injected defensive prompts, with no code-level verification such as static analysis or sandboxed execution (a toy static check is sketched after this list).
- Defense against the BU-MA scenario remains weak (only −7% for MetaGPT), necessitating stronger defense mechanisms.
- GPT-4o-mini serves as the primary experimental model; cross-model validation is limited.
- The evaluation scale is relatively small: 40 benign requirements × 12 malicious behaviors = 480 test cases.
- More sophisticated attacks are not considered, such as multi-turn intermittent injection or cross-agent coordinated attacks.
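To make the first limitation concrete, a toy code-level check could flag imports that the 12 IMBIA behaviors typically require. The watchlist below is an assumption, and such pattern matching is easily evaded by obfuscated or dynamically loaded code, which is exactly why semantic-level detection is called for.

```python
import ast

# Toy static check: flag imports commonly needed by spyware/ransomware-style
# behaviors (clipboard monitoring, keylogging, file encryption, networking).
# The watchlist is illustrative and easily evaded by obfuscation.
SUSPICIOUS = {"pynput", "keyboard", "pyperclip", "cryptography", "socket"}

def flag_suspicious_imports(source: str) -> set[str]:
    """Return watchlisted top-level module names imported by `source`."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits |= {alias.name.split(".")[0] for alias in node.names} & SUSPICIOUS
        elif isinstance(node, ast.ImportFrom) and node.module:
            hits |= {node.module.split(".")[0]} & SUSPICIOUS
    return hits

code = "import pynput\nfrom cryptography.fernet import Fernet"
print(flag_suspicious_imports(code))  # e.g. {'pynput', 'cryptography'}
```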
Related Work & Insights¶
- vs. General Agent Security Research: General agent security focuses on prompt injection and jailbreaking, whereas this work targets the specific context of multi-agent software development—implicit injection of malicious code.
- vs. Code Security Tools: Traditional code security relies on SAST/DAST detection, but LLM-generated malicious code may bypass pattern matching, requiring semantic-level detection.
- Implications for Multi-Agent System Design: (a) The propagation scope of malicious instructions across agents should be restricted; (b) agents at critical stages require stronger security auditing; (c) agile/deliberative architectures are inherently more secure than waterfall-style architectures.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First systematic study of multi-agent software development security, with comprehensive coverage across dual scenarios, 12 attack behaviors, and 3 frameworks.
- Experimental Thoroughness: ⭐⭐⭐⭐ Complete structure spanning 3 frameworks × 12 behaviors × 7 configurations, cross-model evaluation, and ablation studies, though the dataset scale is relatively small.
- Writing Quality: ⭐⭐⭐⭐ Threat model is clearly defined and findings are concisely summarized, though some experimental details require consulting the appendix.
- Value: ⭐⭐⭐⭐⭐ Carries significant security implications for multi-agent development systems; the attack-defense paradigm provides a framework for future research.