AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection

Conference: ACL 2025
arXiv: 2502.11448
Code: https://eddyluo1232.github.io/AGrail/
Area: LLM Safety
Keywords: LLM Agent, Guardrail, Safety Detection, test-time adaptation, Memory Module

TL;DR

Proposes AGrail, a lifelong learning LLM Agent guardrail framework. Through dual-LLM collaboration (Analyzer + Executor) and a memory module, it adaptively generates and optimizes safety check policies at test time, effectively defending against task-specific and systemic risks.

Background & Motivation

LLM Agents face two classes of risks when executing complex tasks autonomously: task-specific risks (defined by administrators based on task requirements, e.g., unauthorized access to diagnostic data) and systemic risks (arising from vulnerabilities in design or interaction, which may harm information confidentiality, integrity, or availability, e.g., prompt injection attacks).

Existing defense solutions face two key challenges:

Lack of adaptability: For instance, GuardAgent relies on manually defined trusted contexts, failing to generalize to dynamic downstream tasks.

Ineffective safety policies: For example, Conseca uses LLMs to generate adaptive safety policies, but it may misunderstand task requirements, producing policies that are too strict (blocking legitimate actions) or too loose (allowing unsafe actions). Even GPT-4o with chain-of-thought (CoT) prompting can over-restrict.

Method

Overall Architecture

AGrail employs two identical LLMs, one acting as the Analyzer and the other as the Executor, which cooperate with a Memory module to iteratively optimize the set of safety checks through test-time adaptation (TTA).

The framework processes seven inputs: safety criteria \(\mathcal{I}_c\), optional guard request \(\mathcal{I}_r\), agent specification \(\mathcal{I}_s\), agent action \(\mathcal{I}_o\), optional environmental observation \(\mathcal{E}\), user request \(\mathcal{I}_i\), and a toolkit \(\mathcal{T}\).
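As a reading aid, the seven inputs can be grouped into a single record. The sketch below is only an illustration: the class name `GuardrailInput`, its field names, and the example values are hypothetical and not taken from the paper or its code. The comments map each field back to the symbols above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class GuardrailInput:
    """Hypothetical container for the seven inputs (names are assumptions)."""
    safety_criteria: str                     # I_c
    agent_specification: str                 # I_s
    agent_action: str                        # I_o
    user_request: str                        # I_i
    toolkit: Dict[str, Callable[[str], bool]] = field(default_factory=dict)  # T
    guard_request: Optional[str] = None      # I_r (optional)
    env_observation: Optional[str] = None    # E (optional)

    def provided(self) -> List[str]:
        """Return the names of the inputs that were actually supplied."""
        return [name for name, value in vars(self).items() if value is not None]


# Illustrative values only; the two optional inputs are left unset.
example = GuardrailInput(
    safety_criteria="information confidentiality",
    agent_specification="OS agent with shell access",
    agent_action="cat /etc/shadow",
    user_request="Show me the password hashes",
)
print(example.provided())
```

Modelling the two optional inputs as `None` keeps the distinction between "not supplied" and "supplied but empty" explicit.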

The goal is to find the optimal safety check subset \(\Omega^* \subseteq \Omega\) by iteratively updating the memory \(m\) so that it approximates \(\Omega^*\):

\[\arg\min_{m \subseteq \Omega} d_{\cos}(\phi(m), \phi(\Omega^*))\]
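To make the objective concrete, here is a minimal sketch of the cosine distance \(d_{\cos}\) between an embedded memory and an embedded target set. It assumes \(\phi\) is an embedding of a set of checks; the bag-of-words `embed` function below is a toy stand-in chosen for self-containment, not the embedding used by the authors, and the example checks are invented.

```python
import math
from collections import Counter
from typing import Iterable


def embed(checks: Iterable[str]) -> Counter:
    """Toy stand-in for phi: bag-of-words counts over all checks (assumption)."""
    counts: Counter = Counter()
    for check in checks:
        counts.update(check.lower().split())
    return counts


def cosine_distance(a: Counter, b: Counter) -> float:
    """d_cos = 1 - cosine similarity; returns 1.0 if either vector is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Invented example: the memory gains a missing check between two iterations.
optimal = ["verify user role", "block data export"]
memory_t0 = ["verify user role"]
memory_t1 = ["verify user role", "block data export"]

d0 = cosine_distance(embed(memory_t0), embed(optimal))
d1 = cosine_distance(embed(memory_t1), embed(optimal))
print(d0 > d1)  # the distance shrinks as the memory approaches the target set
```

In practice \(\Omega^*\) is unknown at test time, so a distance like this is useful for analysis against annotated ground truth rather than as an online training signal.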

Key Designs

1. Safety Criteria

Based on the agent safety taxonomy of He et al. (2024), three generalized safety categories are formulated: information integrity, information confidentiality, and information availability. Manually designed task-specific safety criteria are also supported.

2. Memory Module

Stores agent actions, safety categories, and corresponding safety checks.
Uses agent actions as retrieval keys.
Translates agent actions into two representations (natural language and tool command language) via step-back prompting, concatenating them to serve as the memory key.
Only displays the natural language form during retrieval to prevent the tool command language from interfering with reasoning.
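The memory behaviour described above can be sketched as a small key-value store. This is an assumption-laden illustration: `MemoryEntry`, `GuardMemory`, the `" || "` separator, the 0.6 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices, not the authors' implementation. The two action representations are assumed to come from an upstream step-back prompting call that is not shown here.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class MemoryEntry:
    natural_language: str   # natural-language form of the agent action
    tool_command: str       # tool-command form of the same action
    safety_category: str
    safety_checks: List[str]

    @property
    def key(self) -> str:
        # Both representations are concatenated to form the memory key.
        return f"{self.natural_language} || {self.tool_command}"


class GuardMemory:
    def __init__(self, threshold: float = 0.6) -> None:
        self.entries: List[MemoryEntry] = []
        self.threshold = threshold  # hypothetical similarity cut-off

    def store(self, entry: MemoryEntry) -> None:
        # Replace any entry with the same key, then append the new one.
        self.entries = [e for e in self.entries if e.key != entry.key]
        self.entries.append(entry)

    def retrieve(self, natural_language: str, tool_command: str) -> Optional[Dict]:
        query = f"{natural_language} || {tool_command}"
        best, best_score = None, 0.0
        for entry in self.entries:
            score = SequenceMatcher(None, query, entry.key).ratio()
            if score > best_score:
                best, best_score = entry, score
        if best is None or best_score < self.threshold:
            return None
        # Only the natural-language form is exposed to the downstream LLM.
        return {
            "action": best.natural_language,
            "category": best.safety_category,
            "checks": list(best.safety_checks),
        }


memory = GuardMemory()
memory.store(MemoryEntry(
    natural_language="Delete a file from the temp directory",
    tool_command="rm /tmp/a.txt",
    safety_category="information integrity",
    safety_checks=["confirm the path is inside /tmp"],
))
hit = memory.retrieve("Delete a file from the temp directory", "rm /tmp/a.txt")
miss = memory.retrieve("zzz", "qqq")
```

Matching on the concatenated key while returning only the natural-language form mirrors the design above: the command string helps retrieval but stays out of the reasoning context.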

3. Dual-LLM Collaboration Workflow

Analyzer: Retrieves existing safety checks \(m^{(t)}\) from memory, and modifies them based on \(\mathcal{I}_r\) and \(\mathcal{I}_c\)—revising, merging duplicates, or adding new adaptive safety checks.
Executor: Evaluates safety checks generated by the Analyzer, deciding whether to delete redundant/incorrect checks, invoke external tools for verification, or rely on its own reasoning. Updates \(m^{(t)}\) to \(m^{(t+1)}\) post-execution.
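The Analyzer and Executor steps above can be sketched as one guard step over a memory dictionary. In AGrail both roles are LLM calls; in this sketch they are replaced by plain Python callables so the control flow can run on its own. The function `guard_step`, the type aliases, and the toy rules inside `toy_analyzer` and `toy_executor` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Checks = List[str]
# Hypothetical signatures standing in for the two LLM roles.
Analyzer = Callable[[Checks, str, str], Checks]          # (m_t, I_r, I_c) -> proposed checks
Executor = Callable[[Checks, str], Tuple[Checks, bool]]  # (proposed, action) -> (kept, is_safe)


def guard_step(memory: Dict[str, Checks], action: str, guard_request: str,
               criteria: str, analyzer: Analyzer, executor: Executor) -> bool:
    """Run one Analyzer -> Executor pass and write m^(t+1) back to memory."""
    current = memory.get(action, [])                     # retrieve m^(t)
    proposed = analyzer(current, guard_request, criteria)
    kept, is_safe = executor(proposed, action)
    memory[action] = kept                                # update to m^(t+1)
    return is_safe


def toy_analyzer(current: Checks, guard_request: str, criteria: str) -> Checks:
    proposed = list(dict.fromkeys(current))              # merge exact duplicates
    new_check = f"check: {criteria}"
    if new_check not in proposed:
        proposed.append(new_check)                       # add an adaptive check
    return proposed


def toy_executor(proposed: Checks, action: str) -> Tuple[Checks, bool]:
    kept = [c for c in proposed if c.strip()]            # drop empty checks
    is_safe = "rm -rf /" not in action                   # placeholder verdict rule
    return kept, is_safe


memory: Dict[str, Checks] = {}
first = guard_step(memory, "ls /home", "", "information confidentiality",
                   toy_analyzer, toy_executor)
second = guard_step(memory, "ls /home", "", "information confidentiality",
                    toy_analyzer, toy_executor)
unsafe = guard_step(memory, "rm -rf /", "", "information availability",
                    toy_analyzer, toy_executor)
```

Calling `guard_step` twice on the same action leaves the stored checks unchanged in this toy setup, which is the fixed-point behaviour the iterative refinement aims for.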

4. Tool Compatibility and Flexibility

The framework can optionally invoke customized auxiliary tools to enhance safety checks: permission detection tools (for EHR Agent scenario), OS environment detection tools (for OS Agent), and Web HTML detection tools (for Web Agent, EIA scenario only). Tool invocation is driven by safety checks and decided autonomously by the Executor.
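A rough sketch of optional tool dispatch follows, assuming each safety check may carry a hint naming an auxiliary tool. The registry `TOOLKIT`, the `permission_tool` function, and the context keys are invented for illustration; the actual tools in the paper operate on EHR permissions, OS state, and web HTML.

```python
from typing import Callable, Dict, Optional, Tuple

Tool = Callable[[dict], bool]


def permission_tool(ctx: dict) -> bool:
    """Hypothetical permission detector: every requested table must be allowed."""
    return set(ctx["requested_tables"]) <= set(ctx["allowed_tables"])


# Hypothetical registry of pluggable auxiliary tools.
TOOLKIT: Dict[str, Tool] = {"permission": permission_tool}


def run_check(ctx: dict, fallback: Callable[[dict], bool],
              tool_hint: Optional[str] = None) -> Tuple[bool, str]:
    """Use the hinted tool if it is registered, otherwise fall back to reasoning."""
    tool = TOOLKIT.get(tool_hint) if tool_hint else None
    if tool is not None:
        return tool(ctx), f"tool:{tool_hint}"
    return fallback(ctx), "reasoning"


ctx = {"requested_tables": ["diagnosis"], "allowed_tables": ["vitals", "labs"]}
by_tool = run_check(ctx, fallback=lambda c: True, tool_hint="permission")
by_reasoning = run_check(ctx, fallback=lambda c: True)
```

Here `fallback` stands in for the Executor's own reasoning; unknown or missing hints degrade to it instead of raising an error, which keeps tools strictly optional.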

Loss & Training

AGrail does not involve conventional training or loss functions, but instead optimizes iteratively via test-time adaptation (TTA). Whenever a similar agent action is encountered, safety checks in memory are iteratively refined. Ideally, the framework naturally converges to \(\Omega^*\) for each agent action.

In addition, the authors construct the Safe-OS benchmark, comprising three attack scenarios (Prompt Injection, System Sabotage, Environment Attack). It simulates a realistic OS environment using Docker and sets up two identity roles: root and regular user.

Key Experimental Results

Main Results

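Systemic Risk Detection (Safe-OS + AdvWeb + EIA). Values are percentages; ASR = attack success rate, SS = System Sabotage, PI = Prompt Injection, Env = Environment Attack: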

Defense Method	Safe-OS Benign↑	SS ASR↓	PI ASR↓	Env ASR↓	AdvWeb PI↓	EIA Grd↓	EIA Gen↓
Claude-3.5 (0-Shot)	50.0	0	14.3	20	0	40.0	28.0
AGrail (Claude-3.5)	95.6	3.8	0	5	0	6.0	28.0
AGrail (GPT-4o)	95.6	4.0	0	10	8.8	8.0	26.0

AGrail (Claude-3.5) retains 95.6% of benign actions while reducing the Prompt Injection ASR to 0%.
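For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from per-sample guardrail verdicts. It assumes ASR is the percentage of attack samples the guardrail fails to block and that the Benign column is the percentage of benign samples it lets through; the paper's exact evaluation protocol may differ, and the verdict lists here are invented.

```python
from typing import Sequence


def pass_rate(blocked: Sequence[bool]) -> float:
    """Percentage of samples that were NOT blocked by the guardrail."""
    if not blocked:
        raise ValueError("need at least one sample")
    return 100.0 * sum(1 for b in blocked if not b) / len(blocked)


def attack_success_rate(blocked_attacks: Sequence[bool]) -> float:
    """ASR (lower is better): share of attack samples that slipped through."""
    return pass_rate(blocked_attacks)


def benign_retention(blocked_benign: Sequence[bool]) -> float:
    """Benign (higher is better): share of benign samples left untouched."""
    return pass_rate(blocked_benign)


# Invented verdicts: True means the guardrail blocked the action.
attack_verdicts = [True] * 19 + [False]      # 1 of 20 attacks gets through
benign_verdicts = [True] + [False] * 9       # 1 of 10 benign actions blocked

print(attack_success_rate(attack_verdicts))  # 5.0
print(benign_retention(benign_verdicts))     # 90.0
```

Both metrics share the same formula and differ only in the sample set, which is why a guardrail that blocks everything scores 0 on ASR but also 0 on benign retention.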

Task-Specific Risk Detection (Mind2Web-SC + EICU-AC): AGrail achieves over 94% F1 scores on both datasets, consistently ranking in the top two.

Ablation Study

ID/OOD Analysis: AGrail with memory (whether frozen or TTA) outperforms the memory-free version, with TTA reaching 99.1% accuracy on Claude-3.5.
Input Order Analysis: Claude-3.5 is barely affected by data order (variance ±1.2), whereas GPT-4o-mini exhibits larger fluctuations (±8.3), indicating that weaker models are more susceptible to sequence order perturbations.
Number of outlier tokens: Retaining 3 outlier tokens significantly improves performance, with diminishing marginal returns for additional ones.
Domain Adaptation: When migrating from Mind2Web-SC to EICU-AC, Claude-3.5 shows no degradation in performance.

Key Findings

Model-based defense (GPT-4o 0-shot) blocks 49.2% of benign actions on Safe-OS, leading to over-restriction.
Although LLaMA-Guard3 perfectly preserves benign actions, it fails entirely in attack detection (ASR for PI, Env, and SS is 100%).
AGrail achieves a superior trade-off between robustness and utility.
Analytical experiments demonstrate that the memory successfully converges to the ground truth after 4 iterations, with similarity across three seeds exceeding 98%.

Highlights & Insights

Lifelong Learning Paradigm: By iteratively optimizing safety checks via TTA, the framework continuously improves without requiring additional training.
Dual-LLM Role Division: The Analyzer is responsible for generating/adapting safety checks, and the Executor handles execution/verification/deletion, establishing an effective collaborative mechanism.
Handling of Environment-Dependent Attacks: Certain actions are benign in isolation but harmful under specific environments (e.g., renaming causing data overwrite). AGrail detects such risks using real-time environment analysis tools.
Safe-OS Benchmark: The construction of the Safe-OS benchmark fills the gap in safety evaluation for OS Agents in live execution environments.

Limitations & Future Work

Off-the-shelf LLMs are used as components. Future work could explore training specialized guardrail models.
The existing safety detection tools are limited, so the framework relies heavily on LLM reasoning. More pluggable and advanced safety tools should be developed in the future.
When using weaker LLMs (such as GPT-4o-mini) as the base model, both performance and stability decline significantly.
The scale of Safe-OS is limited, and the scope of attack scenarios can be further expanded.

Comparison with Related Work

GuardAgent: Relies on manually defined trusted contexts and generalizes poorly; AGrail addresses this by generating safety checks adaptively.
Conseca: Generates adaptive policies but is prone to misjudgment; AGrail's iterative optimization reduces the bias of single-step generation.
LLaMA-Guard: A classifier designed for LLM outputs, which struggles with agent actions expressed as code or commands.
ToolEmu: Simulates agent environments but lacks real-time, online evaluation; Safe-OS provides a more realistic evaluation scenario.

Rating

Novelty: ★★★★☆ — The lifelong safety detection paradigm combining dual-LLM collaboration and a memory module is quite novel.
Value: ★★★★☆ — Can be directly applied to enhance the safety of various LLM Agent systems.
Experimental Thoroughness: ★★★★☆ — Conducted comprehensive evaluations across 5 datasets, covering multiple attack types with robust ablation and domain-transfer experiments.
Writing Quality: ★★★☆☆ — The writing is generally clear, though equations, notation, and descriptions are slightly redundant at times.