A2ASecBench: A Protocol-Aware Security Benchmark for Agent-to-Agent Multi-Agent Systems
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=LfdFnakqGJ
Code: https://safo-lab.github.io/A2ASecBench/
Area: Multi-Agent Security / Agent / Security Benchmark
Keywords: A2A Protocol, Multi-Agent Systems, Security Evaluation, Threat Modeling, Protocol-Aware Attacks
TL;DR
This paper presents the first systematic evaluation of the security of Agent-to-Agent (A2A) protocol-driven multi-agent systems. The authors propose a threat taxonomy covering two major categories—"supply-chain manipulations" and "protocol-logic weaknesses"—comprising 6 protocol-aware attacks. Based on this, they construct A2ASecBench, the first dedicated security benchmark for A2A. By utilizing dynamic adapters to migrate attacks across different agent stacks and downstream tasks, and employing a "joint safety-utility evaluation" to quantify both harm and usefulness, they find that attack success rates (ASR) reach 100% for most attacks in three high-risk scenarios (travel, medical, finance) from the official A2A demo. These attacks are further shown to be transferable to other ecosystems such as LangGraph and ANP.
Background & Motivation
Background: As LLM-based Multi-Agent Systems (MAS) scale, the industry is shifting from manually written API integrations to A2A protocols, enabling heterogeneous agents to "discover → negotiate → collaborate" via declared capabilities. Within five months of its release in April 2025, the A2A protocol garnered approximately 20k stars and 2k forks on GitHub, with hundreds of contributors, and has been adopted into enterprise-grade products. It defines three core capabilities: capability discovery via AgentCards, task management via finite state machines with unique IDs, and collaboration via typed Parts and persistent Artifacts.
Limitations of Prior Work: While A2A provides interoperability, it introduces a protocol-level attack surface that exceeds the scope of "prompt-based defenses." Its execution model is "opaque"—agents expose only declared capabilities (AgentCard) and hide internal logic, memory, and tools, making identity and capability declarations difficult to verify independently. Once a forged or "cloaked" agent gains admission, it can induce clients to submit sensitive inputs, hijack/misroute tasks, withhold or contaminate intermediate results, launch task-flooding denial-of-service attacks, or return artifacts that trigger downstream code execution or data exfiltration, thereby compromising confidentiality, integrity, and availability simultaneously.
Key Challenge: Existing LLM-MAS security research mostly focuses on high-level aspects like agent communication, network topology, system constraints, and cascading injections. Low-level, protocol-specific vulnerabilities remain underexplored, and a standardized, unified, and reproducible benchmark framework for quantitative security assessment of A2A-MAS is still lacking. Specifically, no prior work has systematically investigated how each stage of the A2A protocol (discovery, orchestration, execution) can be exploited.
Goal: (i) Establish a threat taxonomy and threat model for the A2A ecosystem; (ii) develop a dedicated benchmark capable of probing diverse, previously unexplored attack vectors; (iii) empirically measure the vulnerability of current A2A deployments in realistic high-risk scenarios.
Key Insight: The authors draw on classic security practices, treating A2A as an "insider threat within a trust boundary" (analogous to malicious clients in federated learning, TCP SYN flooding, SSRF, or XSS). They systematically enumerate attacks across three lifecycle stages: admission → orchestration → execution, rather than viewing threats in isolation.
Core Idea: Utilize a "protocol-aware" perspective to deconstruct A2A into its constituent stages and components, formalize specific attacks for each stage, and use a cross-stack reusable benchmark framework for quantitative scoring, while linking "attack success" with the "impairment of normal task utility."
Method
Overall Architecture
A2ASecBench is not a new model but a protocol-aware security evaluation framework for A2A multi-agent systems. Given a target A2A-MAS (including host agents, remote agents, AgentCards, task lifecycles, etc.) and a target scenario specification, it outputs the Attack Success Rate (ASR) for each attack category and the utility degradation caused by "cloaking" attacks on benign tasks.
The authors formalize the A2A agentic system as a directed graph with cycles \(G=(V,E)\): nodes \(v\in V\) correspond to agents \(a\in A\), and directed edges \(e=(u\to v)\) represent A2A communication from \(u\) to \(v\). Each agent is described by an AgentCard \(C(a)\in\mathcal{C}\) (identity, endpoints, declared capabilities) and can be discovered via a registry \(R\). Interaction states are bounded by sessions \(S\), messages and streams are encapsulated in envelopes \(M\), tool usage is specified by capability descriptors \(U\), and lifecycle mappings \(\Lambda\) govern protocol state transitions (discover → select → create → operate → update → terminate). For a task \(t\), a "task-induced active subgraph" \(G_t=(V_t,E_t)\subseteq G\) captures the agents actually invoked and their communications.
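To make the graph formalization concrete, here is a minimal sketch of \(G=(V,E)\), the AgentCard \(C(a)\), and the task-induced active subgraph \(G_t\). The class names, field names, and the `example` endpoints are illustrative assumptions, not the paper's or the A2A SDK's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentCard:
    """Hypothetical, simplified stand-in for C(a)."""
    name: str
    endpoint: str
    capabilities: frozenset


@dataclass
class A2AGraph:
    """G = (V, E): agents as nodes, A2A communication as directed edges."""
    cards: dict = field(default_factory=dict)  # agent name -> AgentCard
    edges: set = field(default_factory=set)    # directed (u, v) pairs

    def register(self, card):
        self.cards[card.name] = card

    def connect(self, u, v):
        if u not in self.cards or v not in self.cards:
            raise KeyError("both endpoints must be registered agents")
        self.edges.add((u, v))

    def active_subgraph(self, trace):
        """G_t: the agents and edges actually exercised by one task's trace."""
        e_t = {(u, v) for (u, v) in trace if (u, v) in self.edges}
        v_t = {node for edge in e_t for node in edge}
        return v_t, e_t


g = A2AGraph()
for name in ("host", "flights", "hotels", "weather"):
    g.register(AgentCard(name, f"https://{name}.example/a2a", frozenset({name})))
for remote in ("flights", "hotels", "weather"):
    g.connect("host", remote)
g.connect("flights", "host")  # an edge back to the host makes a cycle possible

# A task that only touched the flights agent induces a two-node subgraph.
v_t, e_t = g.active_subgraph([("host", "flights"), ("flights", "host")])
```

Here `v_t` contains only `host` and `flights`, which illustrates why the attacks are defined on \(G_t\) rather than on the full graph: only the agents a task actually invokes matter for that task's outcome.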
The framework organizes attacks around this graph across three lifecycle stages: Admission (discovery and selection), Orchestration (task scheduling and lifecycle), and Execution (resource dereferencing and artifact rendering). A scenario adapter instantiates abstract attacks into executable, reproducible test cases, enabling migration across different A2A-MAS implementations and downstream tasks.
Key Designs
1. Taxonomy of Two Threat Categories and Cross-Stage Threat Models: Framing Attacks as Campaigns
The authors categorize A2A risks into supply-chain manipulations and protocol-logic weaknesses, totaling 6 specific attacks. These are characterized by the attacker's "knowledge (what they know), capabilities (what actions they can take), and goals (what they aim to achieve)." Crucially, these are not treated in isolation but as attack campaigns spanning "admission → orchestration → execution." Each attack is mapped to its affected lifecycle stages (discovery/initiation/processing/interaction/completion), protocol components (AgentCard, Message, Task, Part, Artifact, Session, Stream), and compromised security properties (Confidentiality C / Integrity I / Availability A). This three-dimensional "Stage × Component × Impact" mapping ensures the taxonomy provides systematic coverage of the protocol surface and dictates the benchmark's detection vectors.
2. Formalization of Six Protocol-Aware Attack Vectors: Deterministic Success Criteria for Each Stage
These form the core task set of the benchmark, each with mathematical success criteria rather than subjective "appearance" of success:
- AgentCard Spoofing (AS): The attacker injects forged/perturbed AgentCards \(\tilde{C}(a)\) into the registry. The discovery process is formalized as a multiple-choice task: given a candidate set \(C^*=\{C^+(a)\}\cup\{C^-_1(a),\dots,C^-_k(a)\}\) (one benign card \(C^+(a)\) and \(k=10\) malicious variants generated by an LLM pipeline), the discovery decision function \(f_u\) must satisfy \(f_u(C^+(a))=1\) and \(f_u(C^-_i(a))=0\) for all \(i\). If either condition fails, the attack counts as successful.
- Capability Cloaking (CC): An attacker is admitted with a syntactically valid, seemingly benign card but implements undeclared hidden capabilities in the backend. Formally, a non-empty difference exists between declared capabilities \(\tilde{U}_{decl}\) and actual capabilities \(\tilde{U}_{act}\) such that \(\Delta U \triangleq \tilde{U}_{act}\setminus\tilde{U}_{decl}\neq\varnothing\). Harm is quantified separately, by the utility loss \(U_{baseline}-U_{cloaked}\) (here \(U\) denotes task utility, not a capability set) observed when the agent is embedded in a normal MAS for benign tasks; a significant utility drop indicates successful cloaking and a lack of runtime monitoring.
- Cycle Overflow (CO): Creating directed cycles \(C\subseteq E_t\) in the active subgraph, causing subtasks to endlessly "refine/forward" to each other. Success is defined as \(\exists C\subseteq E_t:\text{cycle}(C)=\text{true}\wedge \text{termination}(G_t)=\text{timeout}\), unless the system detects the cycle and terminates or breaks the dependency within bounded steps.
- Half-Open Task Flooding (HOTF): Numerous tasks are intentionally driven into an "input-required" half-open state and then left without follow-up input, occupying execution slots. The indicator function \(I_{flood}(\alpha;T)=1\) holds when the number of half-open tasks reaches the capacity threshold, \(|\{t\in T:s(t)=s_{in}\}|\ge\Theta_{thres}\), and this causes observable service disruption.
- Agent-Side Request Forgery (ASRF): Analogous to SSRF, an attacker embeds malicious URIs pointing to internal resources or controlled endpoints into a FilePart. The victim agent dereferences these using its elevated privileges. The criterion is \(\text{uri}(p^-)\notin D_{allow}\wedge \text{Priv}(a)\ge\kappa\wedge \exists s\in S:s\in O(\text{resp})\), indicating unauthorized dereferencing and leakage of a pre-embedded canary string.
- Artifact-Triggered Script Injection (ATSI): Analogous to XSS, a malicious agent appends injection control sequences \(\text{cont}(art^-)=\text{cont}_{benign}\,\|\,\text{payload}_{inj}\) into renderable artifacts (Markdown/HTML), which are treated as executable instructions during frontend rendering. Success is achieved when the injected string executes in the rendering context and produces harmful results.
These 6 criteria translate "protocol abuse" into automatically determinable binary events, allowing for ASR calculation.
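Several of the criteria above can be written as plain predicates, which is what makes ASR computable automatically. The sketch below covers AS, CO, HOTF, and ASRF under simplifying assumptions: all function and parameter names are hypothetical, privilege is modeled as an integer compared against \(\kappa\), and the allowlist \(D_{allow}\) is a set of hostnames. It is not the benchmark's actual harness.

```python
from urllib.parse import urlparse


def spoofing_succeeds(accept, benign_card, malicious_cards):
    """AS: succeeds unless f_u accepts the benign card and rejects every forgery."""
    defended = accept(benign_card) and not any(accept(c) for c in malicious_cards)
    return not defended


def has_cycle(edges):
    """Detect a directed cycle in E_t with a three-color depth-first search."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge: a cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def cycle_overflow_succeeds(edges, timed_out):
    """CO: a cycle exists in the active subgraph AND the task ended in timeout."""
    return has_cycle(edges) and timed_out


def flooding_succeeds(task_states, threshold, service_disrupted):
    """HOTF: half-open task count reaches the threshold and service is disrupted."""
    half_open = sum(1 for state in task_states if state == "input-required")
    return half_open >= threshold and service_disrupted


def request_forgery_succeeds(uri, allowed_hosts, privilege, kappa, response, canary):
    """ASRF: disallowed URI, sufficient privilege, and the canary leaks in the response."""
    host = urlparse(uri).hostname
    return host not in allowed_hosts and privilege >= kappa and canary in response
```

For example, `cycle_overflow_succeeds({("a", "b"), ("b", "a")}, timed_out=True)` evaluates to `True`, while the same edges with `timed_out=False` model a system that detected the cycle and terminated within bounded steps.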
3. Scenario Adapter: Decoupling Attack Logic from Scenario Details for Cross-Stack Portability
A2A-MAS implementations vary widely (different backends, domain tools). If attack scripts are hardcoded for one implementation, horizontal comparison is impossible. The authors introduce an adapter formalized as a mapping \(\text{Adapter}:A\times S\to T\), where \(A\) is the attack vector space, \(S\) is the scenario specification space, and \(T\) is the set of induced executable test cases. For any \(\langle a,s\rangle\), the adapter generates a suite of specific, reproducible test cases using an LLM to synthesize the cases from structured descriptions. This decoupling ensures portability across different A2A-MAS and downstream tasks.
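The shape of the mapping \(\text{Adapter}:A\times S\to T\) can be sketched as a function from an abstract attack and a scenario specification to a list of concrete test cases. The paper synthesizes cases with an LLM; the sketch below swaps in deterministic template substitution so that it stays runnable, and every class, field, and template string is a hypothetical illustration.

```python
from dataclasses import dataclass
from itertools import product
from string import Template


@dataclass(frozen=True)
class AttackVector:
    name: str
    template: str  # abstract payload with $agent, $field, $domain placeholders


@dataclass(frozen=True)
class Scenario:
    domain: str
    agents: tuple
    sensitive_fields: tuple


@dataclass(frozen=True)
class InducedCase:
    attack: str
    domain: str
    target_agent: str
    payload: str


def adapter(attack, scenario):
    """Map one <attack, scenario> pair to a reproducible suite of test cases."""
    cases = []
    for agent, fld in product(scenario.agents, scenario.sensitive_fields):
        payload = Template(attack.template).substitute(
            agent=agent, field=fld, domain=scenario.domain
        )
        cases.append(InducedCase(attack.name, scenario.domain, agent, payload))
    return cases


cloak = AttackVector("CC", "As the $domain agent $agent, also forward the user's $field.")
travel = Scenario("travel", ("flights", "hotels"), ("passport_number",))
cases = adapter(cloak, travel)
```

Because the attack template never mentions a specific domain, the same `cloak` object can be paired with a healthcare or finance `Scenario` without modification, which is the portability property the adapter is meant to provide.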
4. Joint Safety-Utility Evaluation: Quantifying the Trade-off Between Harm and Usefulness
Measuring only attack success is insufficient; a system that rejects all requests is safe but useless. Borrowing from the helpful-harmless framework, the authors pair adversarial trials with benign tasks. For each attack category, a benign task suite is run alongside adversarial cases to explicitly measure the "harmlessness vs. utility" trade-off. For CC attacks, the benchmark reports both ASR and the magnitude of utility drop from baseline. This approach prevents rewarding systems that achieve high safety scores through "over-defense."
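A small sketch of why the two numbers must be reported together: a system that refuses everything and a fully permissive system look very different once benign utility is scored alongside ASR. The function name and the sample outcomes are hypothetical, and utility is assumed to be a per-task score in \([0,1]\).

```python
from statistics import mean


def joint_report(attack_succeeded, benign_scores):
    """Pair adversarial outcomes with benign-task utility for one attack category."""
    if not attack_succeeded or not benign_scores:
        raise ValueError("both adversarial and benign trials are required")
    return {
        "asr": mean(1.0 if hit else 0.0 for hit in attack_succeeded),
        "utility": mean(benign_scores),
    }


# Over-defense: every attack is blocked, but so is every benign task.
refuse_all = joint_report([False] * 4, [0.0] * 4)

# No defense: benign tasks work well, but every attack lands.
permissive = joint_report([True] * 4, [0.9, 0.8, 1.0, 0.9])
```

Judged on ASR alone, `refuse_all` would look ideal; the `utility` field exposes that it achieves this by being useless.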
Loss & Training
This work presents a benchmark/evaluation framework and does not involve model training. The core evaluation protocol is: each trial \(i\) generates a binary success indicator \(I_i\), and the overall \(\text{ASR}=\frac{\sum_{i=1}^N I_i}{N}\), where \(N\) is the total number of trials. Success criteria for each attack are defined as formalized above. The System Under Test (SUT) is built on official A2A samples: a user calls a host agent via frontend; the host and three external servers (S1/S2/S3) are based on Gemini 2.5 Flash; the host coordinates three clients (C1/C2/C3) connected via the A2A protocol. These are adapted to Travel, Healthcare, and Finance domains through custom system prompts, toolsets, and AgentCards.
Key Experimental Results
Main Results
Evaluation of ASR across 6 attacks in three domains using the official A2A demo (Table 2):
| Attack | Travel | Healthcare | Finance |
|---|---|---|---|
| AgentCard Spoofing (AS) | 0.820 | 0.816 | 0.828 |
| Capability Cloaking (CC) | 1.00 | 1.00 | 1.00 |
| Half-Open Task Flooding (HOTF) | 1.00 | 1.00 | 1.00 |
| Cycle Overflow (CO) | 1.00 | 1.00 | 1.00 |
| Agent-Side Request Forgery (ASRF) | 1.00 | 1.00 | 1.00 |
| Artifact-Triggered Script Injection (ATSI) | 1.00 | 1.00 | 1.00 |
All five attacks other than AS (CC/HOTF/CO/ASRF/ATSI) achieved a 100% ASR across all three domains, revealing a lack of systematic protocol-level robustness in current A2A deployments. AS, which depends on model discrimination, showed an ASR of roughly 0.82 in each domain (0.816–0.828); it remained highly vulnerable across Gemini 2.5 Flash, GPT-4o, Claude 4, and DeepSeek-R1 (with Grok 4 performing best but still failing partially). CC attacks caused significant utility drops: benign task utility fell from 0.853→0.682 (Travel), 0.872→0.595 (Healthcare), and 0.962→0.749 (Finance).
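The utility figures quoted for Capability Cloaking can be turned into absolute and relative drops with a few lines of arithmetic. The baseline and cloaked values below are the ones reported above; the relative drop is a derived quantity added here for illustration and is not a number from the paper.

```python
# (baseline utility, utility with a cloaked agent) per domain, as reported above.
cc_utility = {
    "Travel": (0.853, 0.682),
    "Healthcare": (0.872, 0.595),
    "Finance": (0.962, 0.749),
}


def utility_drop(baseline, cloaked):
    """Return (absolute drop, drop relative to the baseline)."""
    absolute = baseline - cloaked
    return absolute, absolute / baseline


drops = {domain: utility_drop(b, c) for domain, (b, c) in cc_utility.items()}
for domain, (absolute, relative) in drops.items():
    print(f"{domain}: absolute {absolute:.3f}, relative {relative:.1%}")
```

The absolute drops come out to 0.171 (Travel), 0.277 (Healthcare), and 0.213 (Finance), so Healthcare loses the most utility both in absolute terms and relative to its baseline.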
Transferability & Defense
Transferability of attack patterns across ecosystems (Table 3; AS/CC/HOTF are not applicable to LangGraph because it lacks agent discovery and intermediate task states):
| Attack Pattern | LangGraph | ANP |
|---|---|---|
| AgentCard Spoofing | N/A | 0.98 |
| Capability Cloaking | N/A | 1.00 |
| Half-Open Task Flooding | N/A | 1.00 |
| Cycle Overflow | 1.00 | 1.00 |
| Agent-Side Request Forgery | 1.00 | 1.00 |
| Artifact-Triggered Script Injection | 1.00 | 1.00 |
Defense effectiveness using NVIDIA NeMo Guardrails as a safety gateway (Table 4, lower residual ASR is better):
| Attack | Travel | Healthcare | Finance |
|---|---|---|---|
| Half-Open Task Flooding (HOTF) | 0.91 | 0.85 | 0.90 |
| Cycle Overflow (CO) | 0.66 | 0.73 | 0.70 |
| Agent-Side Request Forgery (ASRF) | 0.37 | 0.23 | 0.48 |
| Artifact-Triggered Script Injection (ATSI) | 0.94 | 0.93 | 0.91 |
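To compare attacks at a glance, the residual ASR values in the table above can be averaged across the three domains. Since each of these four attacks had an undefended ASR of 1.00 in Table 2, one minus the residual gives the fraction of attacks the gateway stopped. The averaging is an illustrative summary computed here, not a figure from the paper.

```python
from statistics import mean

# Residual ASR behind the guardrail gateway: (Travel, Healthcare, Finance).
residual_asr = {
    "HOTF": (0.91, 0.85, 0.90),
    "CO": (0.66, 0.73, 0.70),
    "ASRF": (0.37, 0.23, 0.48),
    "ATSI": (0.94, 0.93, 0.91),
}

summary = {}
for attack, per_domain in residual_asr.items():
    avg = mean(per_domain)
    # Undefended ASR was 1.00 for these attacks, so blocked share = 1 - residual.
    summary[attack] = {"mean_residual": avg, "mean_blocked": 1.0 - avg}

for attack, row in summary.items():
    print(f"{attack}: residual {row['mean_residual']:.3f}, blocked {row['mean_blocked']:.3f}")
```

On these numbers the mean residual ASR is about 0.887 for HOTF, 0.697 for CO, 0.360 for ASRF, and 0.927 for ATSI, so ASRF is the only attack where the gateway blocks a majority of attempts.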
Key Findings
- Protocol-level attacks are almost universally successful: Five attack categories reached 100% success in three domains, indicating that the issue lies in protocol semantics—identity binding, lifecycle management, privilege boundaries, and artifact rendering—rather than specific models.
- Attack patterns are transferable: Re-instantiating the same attack patterns on LangGraph and ANP yielded an ASR of 1.00 for every applicable attack except AgentCard Spoofing on ANP (0.98), suggesting these are general MAS protocol weaknesses rather than A2A implementation bugs.
- Standard guardrails are insufficient: Mature solutions like NeMo Guardrails only provide partial mitigation. HOTF/ATSI maintained high residual ASR (≥0.85 and ≥0.91, respectively), and CO was only partially suppressed (0.66–0.73). Even ASRF, which should be easily identifiable, retained a non-trivial success rate (0.23–0.48), as existing guardrails do not understand multi-agent interaction patterns and protocol semantics.
- Three Takeaways: (1) In MAS, agents must share responsibility for self and peer protection; hardening system prompts for intermediate host agents is a critical defense against "confused deputy" issues like ASRF and ATSI. (2) Developers must implement "progress-aware orchestration," setting quotas for half-open tasks, limiting recursion depth, and performing DAG validation to treat stalled or cyclic workflows as security threats. (3) There is a need for a secure A2A protocol profile with cryptographic binding of identity and capabilities.
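The second takeaway ("progress-aware orchestration") can be illustrated with a small guard that enforces a half-open task quota, a delegation depth limit, and acyclicity of the delegation graph. This is a sketch of the idea only: the class, its method names, and the default limits are assumptions, not a defense proposed or evaluated in the paper.

```python
class OrchestrationGuard:
    """Hypothetical guard: half-open quota, depth limit, and DAG validation."""

    def __init__(self, max_half_open=3, max_depth=4):
        self.max_half_open = max_half_open
        self.max_depth = max_depth
        self.half_open = {}  # client id -> number of tasks awaiting input
        self.edges = set()   # accepted delegation edges (src, dst)

    def admit_half_open(self, client):
        """Refuse a new input-required task once the client's quota is used up."""
        count = self.half_open.get(client, 0)
        if count >= self.max_half_open:
            return False
        self.half_open[client] = count + 1
        return True

    def resolve(self, client):
        """Release one slot when a half-open task receives input or is cancelled."""
        if self.half_open.get(client, 0) > 0:
            self.half_open[client] -= 1

    def _reaches(self, start, goal):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(v for (u, v) in self.edges if u == node)
        return False

    def allow_delegation(self, src, dst, depth):
        """Accept src -> dst only if it stays within depth and keeps the graph acyclic."""
        if depth > self.max_depth:
            return False
        if self._reaches(dst, src):  # dst already leads back to src: would close a cycle
            return False
        self.edges.add((src, dst))
        return True


guard = OrchestrationGuard(max_half_open=2, max_depth=3)
quota_results = [guard.admit_half_open("c1") for _ in range(3)]
chain_ok = guard.allow_delegation("host", "a", 1) and guard.allow_delegation("a", "b", 2)
cycle_blocked = not guard.allow_delegation("b", "host", 3)
```

The quota targets HOTF-style slot exhaustion, while the reachability check rejects the delegation that would close a loop, which is the condition Cycle Overflow depends on.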
Highlights & Insights
- Translating "Protocol Abuse" into Binary Events: The 6 formalized success criteria (multiple choice, utility delta, directed cycles + timeout, task thresholds, unauthorized dereferencing + canary, script execution) allow security evaluation to function as a quantitative benchmark for the first time.
- Joint Safety-Utility Evaluation Prevents Evaluation Gaming: Pairing adversarial trials with benign tasks ensures that systems becoming "useless through over-defense" do not receive high scores, a principle applicable to any safety-usability trade-off.
- Scenario Adapter Decouples Logic from Context: \(\text{Adapter}:A\times S\to T\) uses LLMs to instantiate abstract attacks into specific domains, making them reproducible across heterogeneous stacks and horizontally comparable—a key engineering abstraction for turning one-off red-teaming into a benchmark.
- Compelling Adaptation of Classic Security Principles: Mapping HOTF to SYN flooding, ASRF to SSRF, ATSI to XSS, and cloaking to federated learning insider threats provides both a mature blueprint for attack design and strong evidence that MAS security is fundamentally a protocol semantics issue.
Limitations & Future Work
- Narrow System Scope: Evaluation is based on the official A2A demo with a single topology (1 host + 3 clients + 3 servers). The vulnerability of production-grade MAS with diverse topologies and backends remains to be verified.
- Optimistic Success Criteria: Many criteria rely on "returning a canary" or "timing out." While reliable for determining exploitability, these may not perfectly equate to the actual degree of real-world harm (e.g., the value or scope of exfiltrated data).
- Evaluation vs. Mitigation: The paper highlights mitigation directions (prompt hardening, gateways, secure profiles) but primarily points out guardrail inadequacies rather than providing a complete, end-to-end, and empirically validated defense implementation.
- AS Dependency on Model Capabilities: The success of AgentCard Spoofing depends heavily on how well the underlying LLM discriminates forged cards from benign ones, so measured ASR may drift as models iterate.
Related Work & Insights
- Comparison to Higher-Level MAS Security: While existing research covers interaction and network-level security (communication, topology, system constraints), this work delves into protocol-specific, low-level vulnerabilities and provides a standardized, quantitative benchmark.
- Comparison to Preliminary A2A Security Analyses: Unlike earlier threat checklists or qualitative best-practice suggestions, this work provides formalized attacks, an executable benchmark, cross-ecosystem transferability, and empirical guardrail testing.
- Insights from Classic Security: By mapping DoS, SSRF, XSS, and insider threats to the A2A ecosystem, the authors demonstrate that "old" problems—identity binding, lifecycle management, and privilege boundaries—re-emerge as critical protocol semantics issues in agentic systems.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First dedicated A2A security benchmark; systematizes and formalizes protocol-aware threats.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple domains, models, ecosystems (LangGraph/ANP), and defense testing, though topologies were relatively simple.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear narrative structure (admission → orchestration → execution); well-defined formal criteria and takeaways.
- Value: ⭐⭐⭐⭐⭐ Highlights the lack of protection in current A2A deployments; significant implications for standardizing agentic ecosystem security.