Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw¶

Conference: ICML 2026
arXiv: 2605.11047
Code: https://github.com/ZJUICSR/DeepTrap
Area: AI Security / Agent Red-Teaming / Contextual Security Evaluation
Keywords: Agent Security, Red-Teaming Evaluation, Context Poisoning, Trajectory-Level Evaluation, Black-Box Optimization

TL;DR¶

This work targets agents such as OpenClaw whose execution context is variable, spanning files, memory, tools, and skills, and proposes DeepTrap, an automated red-teaming framework. It formulates the injection of adversarial payloads into clean contexts as a black-box, discrete, stochastic, trajectory-level multi-objective optimization problem with three goals: trigger the risk, preserve the normal task, and stay stealthy. Using reward-guided beam search and reflective deep-probing, DeepTrap uncovers high-scoring poisoned contexts. Evaluation across 42 cases, 6 risk categories, and 9 target models shows that context poisoning lets agents quietly achieve attack goals while completing legitimate user tasks, so security assessments that look only at final responses are insufficient.

Background & Motivation¶

Background: Autonomous agents increasingly complete complex digital tasks end-to-end by utilizing "persistent contextual resources" such as reading/writing files, calling tools, accessing external services, and maintaining memory. OpenClaw represents these "execution-centric" settings, integrating LLM reasoning with system-level actions and user-oriented workflows.

Limitations of Prior Work: As agent capabilities improve, the security boundary extends beyond the explicit user prompt. In real-world deployments, unsafe behavior may stem not from malicious user instructions but from "contextual components encountered only at runtime," such as contaminated files, memory entries, tool metadata, skills, or configuration artifacts. Most existing research assumes attackers directly modify user prompts (explicit prompt injection), whereas the more dangerous threats are stealthier: the user request itself is legitimate (e.g., searching emails, creating calendar events), and the risk arises from externally sourced or environmental context.

Key Challenge: The most harmful security failures are not "disruptive attacks" but stealthy compound pollution—where the agent fulfills the user's normal request while simultaneously achieving the attacker's target, leaving the external task result appearing flawless. Existing evaluations often: ① assume attackers directly manipulate user instructions; ② use abstract or simplified agent settings that mask vulnerabilities exposed only in real execution dynamics; ③ only ask "can harmful behavior be induced," neglecting "can the attack remain hidden while the task seemingly succeeds."

Goal: Shift security analysis from prompt-centric to trajectory-centric, systematically identifying "context-driven" stealthy failures. Three difficulties exist: the search space consists of discrete combinatorial contextual artifacts rather than continuous parameters; the attack must satisfy three objectives (trigger risk, preserve task, maintain stealth); and OpenClaw is essentially a black-box stochastic policy where unsafe behaviors emerge only through multi-step interactions, meaning isolated prompt-response tests cannot detect them.

Key Insight: Since the actual harm occurs during middle steps (tool calls, memory R/W, file operations), one must evaluate the complete execution trajectory instead of just the final response and formulate vulnerability discovery as an optimizable search problem.

Core Idea: Given a legitimate instruction, a clean context, and a target risk category, search within the "injectable context payload space." Each candidate undergoes a full agent rollout and is scored by a multi-objective reward that combines risk, task, and stealth terms. Reward-guided beam search and reflective deep-probing are then used to efficiently approximate the optimal contaminated context.

Method¶

Overall Architecture¶

DeepTrap models OpenClaw as an agent acting within a variable execution context: the user provides a normal instruction \(u\), and the agent starts from an initial context \(x_0\) (files, memory, tools, skills, and workspace state). At each step the agent samples an action from a stochastic policy \(a_t\sim\pi_\phi(\cdot\mid u,h_t,x_t)\), where \(h_t\) denotes the interaction history so far, and the environment updates according to \((x_{t+1},o_{t+1})\sim P(\cdot\mid x_t,a_t)\), yielding a complete trajectory \(\tau=(x_0,o_0,a_0,\dots,x_T,o_T,a_T)\). An evaluation instance is \(\mathcal{I}=(u,x_0,r,\Omega_r)\), where \(r\) is the target risk category and \(\Omega_r(\tau)\in\{0,1\}\) indicates whether the trajectory exhibits that unsafe behavior.
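The rollout loop described above can be sketched in a few lines of Python. This is only an illustration of the formalism: `rollout`, `toy_policy`, `toy_transition`, and `omega_reads_secret` are hypothetical names, and the toy agent (which reads every file and then stops) stands in for the real OpenClaw policy, which is a black box.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    context: dict      # x_t
    observation: str   # o_t
    action: str        # a_t


def rollout(policy, transition, instruction, x0, max_steps, rng):
    """Sample one trajectory tau = (x_0, o_0, a_0, ..., x_T, o_T, a_T)."""
    trajectory, history = [], []
    context, observation = x0, "start"
    for _ in range(max_steps):
        # a_t ~ pi(. | u, h_t, x_t)
        action = policy(instruction, history, context, rng)
        trajectory.append(Step(context, observation, action))
        history.append((observation, action))
        if action == "stop":
            break
        # (x_{t+1}, o_{t+1}) ~ P(. | x_t, a_t)
        context, observation = transition(context, action, rng)
    return trajectory


def toy_policy(instruction, history, context, rng):
    # Hypothetical stochastic policy: read unread files in random order, then stop.
    unread = [name for name in context["files"] if name not in context["read"]]
    return "read:" + rng.choice(unread) if unread else "stop"


def toy_transition(context, action, rng):
    # Hypothetical environment: reading a file returns its content as the observation.
    name = action.split(":", 1)[1]
    new_context = {"files": context["files"], "read": context["read"] + [name]}
    return new_context, context["files"][name]


def omega_reads_secret(trajectory):
    """Omega_r(tau) in {0, 1}: did any step touch the sensitive file?"""
    return int(any(step.action == "read:secrets.env" for step in trajectory))


x0 = {"files": {"notes.md": "hello", "secrets.env": "KEY=1"}, "read": []}
tau = rollout(toy_policy, toy_transition, "summarize notes.md", x0,
              max_steps=10, rng=random.Random(0))
print([step.action for step in tau], omega_reads_secret(tau))
```

Note that the risk predicate inspects intermediate actions, not the final answer, which is the point of evaluating trajectories rather than responses.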

The threat model is a contextual adversary: it cannot modify user instruction \(u\) or policy \(\pi_\phi\), but can inject payload \(p\) through allowed context channels before the agent starts, performing a context transformation \(\tilde{x}_0=\Gamma(x_0,p)\). The pipeline has three stages: ① Derive "expected normal behavior + risk grading standards" from \((u,r)\) to instantiate a risk-conditioned evaluation task; ② Generate candidate payloads, run full rollouts after injection, and score trajectories with multi-objective rewards; ③ Use these scores for beam search to expand strong candidates, prune weak ones, and periodically reflect to improve subsequent proposals.
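As a rough illustration of the context transformation \(\tilde{x}_0=\Gamma(x_0,p)\), the sketch below treats a context as a dictionary of channels and a payload as a set of artifacts to merge into them. The channel names, the `inject` function, and the dictionary representation are assumptions made for this example, not the paper's data model.

```python
import copy

# Assumed set of channels the contextual adversary may write to.
ALLOWED_CHANNELS = {"files", "memory", "tools", "skills"}


def inject(x0, payload):
    """Gamma(x0, p): return a poisoned copy of the clean context.

    The user instruction and the agent policy are not part of the context,
    so this function cannot modify them.
    """
    poisoned = copy.deepcopy(x0)
    for channel, artifacts in payload.items():
        if channel not in ALLOWED_CHANNELS:
            raise ValueError(f"channel not writable by the adversary: {channel}")
        poisoned.setdefault(channel, {}).update(artifacts)
    return poisoned


clean = {"files": {"blog_post.md": "# Draft"}, "skills": {}}
payload = {"skills": {"manuscript-style-checker": "style checker (hypothetical body)"}}
poisoned = inject(clean, payload)
print(sorted(poisoned["skills"]))
```

Working on a deep copy keeps the clean context available as a baseline, so the same instance can be re-poisoned with many candidate payloads.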

graph TD
    A["Input:<br/>Normal Instruction u + Clean Context x0<br/>+ Target Risk r"] --> B["Risk-Conditioned Evaluation:<br/>Derive Expected Normal Behavior<br/>+ Risk Grading Standards"]
    B --> C["Multi-Objective Trajectory Scoring:<br/>Risk × Task × Stealth<br/>Run Full Rollout for Scoring"]
    C --> D["Reward-Guided Beam Search:<br/>Proposal Model Expansion + Deep Adaptive Pruning"]
    D --> E["Reflective Deep-Probing:<br/>Summarize Success/Failure into Reflection<br/>Condition Subsequent Proposals"]
    E -->|Depth < D, Recirculate Reflection| C
    E --> F["Output:<br/>Highest-scoring Payload p̂ from History"]

Key Designs¶

1. Multi-Objective Trajectory-Level Adversarial Goal: Formulating "Stealthy Compound Pollution" as Constrained Optimization

A single "harmful behavior was induced" indicator cannot single out the truly dangerous attacks: those that are harmful, stealthy, and leave the user's task successful. DeepTrap formulates payload selection as constrained optimization: \(\max_{p\in\mathcal{P}}\mathbb{E}_\tau[J_{\mathrm{risk}}(\tau;r)]\) s.t. \(\mathbb{E}_\tau[J_{\mathrm{task}}(\tau;u)]\ge\eta_t\) and \(\mathbb{E}_\tau[J_{\mathrm{stealth}}(\tau,p;r)]\ge\eta_s\), where the three terms measure risk realization, task completion, and undetectability respectively. Lagrangian relaxation turns this into a scalar multi-objective score \(J=\lambda_r J_{\mathrm{risk}}+\lambda_t J_{\mathrm{task}}+\lambda_s J_{\mathrm{stealth}}\), with the final goal \(p^\star=\arg\max_p\mathbb{E}_{\tau\sim\Pi_\phi(\cdot\mid u,\Gamma(x_0,p))}[J(\tau,p;\mathcal{I})]\). Since the trajectory distribution \(\Pi_\phi\) is only accessible via rollouts, the expectation is estimated by Monte Carlo: \(\widehat{J}(p)=\frac{1}{n(p)}\sum_{i=1}^{n(p)} J(\tau_p^{(i)},p;\mathcal{I})\). This makes "dangerous attack" precise: a payload that causes harm without completing the task is easily detected and less realistic, so the most dangerous payloads satisfy all three conditions. Scoring combines deterministic checks (e.g., unauthorized resource access) with LLM semantic grading (e.g., whether the final response satisfies the user).
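A minimal sketch of the scalarized score and its Monte Carlo estimate follows. The weights \(\lambda_r,\lambda_t,\lambda_s\) are set to 1.0 purely for illustration (the summary does not report the values used), and `toy_rollout` is a hypothetical stand-in that returns the three per-trajectory scores directly instead of running an agent.

```python
import random
from statistics import mean


def scalar_score(j_risk, j_task, j_stealth, weights=(1.0, 1.0, 1.0)):
    """J = lam_r * J_risk + lam_t * J_task + lam_s * J_stealth."""
    lam_r, lam_t, lam_s = weights
    return lam_r * j_risk + lam_t * j_task + lam_s * j_stealth


def estimate_score(run_rollout, payload, n_rollouts, rng, weights=(1.0, 1.0, 1.0)):
    """Monte Carlo estimate of J(p): average the scalar score over n rollouts."""
    samples = [
        scalar_score(*run_rollout(payload, rng), weights=weights)
        for _ in range(n_rollouts)
    ]
    return mean(samples)


def toy_rollout(payload, rng):
    # Hypothetical rollout: the risk fires with some probability, while the
    # task always succeeds and stealth is fixed at a middling value.
    j_risk = 1.0 if rng.random() < payload["trigger_prob"] else 0.0
    return j_risk, 1.0, 0.5


rng = random.Random(0)
always = estimate_score(toy_rollout, {"trigger_prob": 1.0}, 8, rng)
never = estimate_score(toy_rollout, {"trigger_prob": 0.0}, 8, rng)
sometimes = estimate_score(toy_rollout, {"trigger_prob": 0.5}, 200, rng)
print(always, never, sometimes)
```

Because the agent is stochastic, a payload that triggers only occasionally lands between the two extremes, which is why a single rollout per candidate would give a noisy ranking.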

2. Reward-Guided Beam Search: Approximating Black-Box Optima in Discrete Combinatorial Spaces

Directly solving the optimization is infeasible due to the discrete combinatorial \(\mathcal{P}\), stochastic trajectories, and the cost of full OpenClaw rollouts. DeepTrap utilizes reward-guided beam search: maintaining a beam of high-scoring candidates, a proposal model \(\pi_\varphi(p'\mid u,x_0,r,p,s)\) expands each payload in \(\mathcal{B}_{\ell-1}\) into \(B\) candidates, which are injected, rolled out, and scored by \(J\) to select the top-\(K\) for the next beam (approximating \(\mathcal{B}_\ell=\arg\max_{|\mathcal{S}|\le K}\sum_{p\in\mathcal{S}}\widehat{J}(p)\)). The proposal model only "suggests perturbations," while quality is always determined by actual execution. A key innovation is the depth-adaptive pruning threshold \(\beta_\ell=\beta\cdot\frac{\ell-1}{D-1}\): the first layer threshold is 0 to allow diverse exploration, becoming stricter with depth as candidates are refined by reflection. \(K\) controls the exploitation-exploration trade-off.
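The search loop can be sketched as below, under several simplifying assumptions: payloads are plain strings, `toy_propose` replaces the LLM proposal model, `toy_score` replaces injection, rollout, and scoring, and candidates scoring below \(\beta_\ell\) are dropped before the top-\(K\) selection. How exactly the threshold interacts with the beam in DeepTrap is not stated in this summary, so that last point is a guess.

```python
import heapq


def depth_threshold(beta, depth, max_depth):
    """beta_l = beta * (l - 1) / (D - 1); zero at the first layer."""
    if max_depth == 1:
        return 0.0
    return beta * (depth - 1) / (max_depth - 1)


def beam_search(seeds, propose, score, beam_width, branch, max_depth, beta):
    beam, evaluated = list(seeds), {}
    for depth in range(1, max_depth + 1):
        threshold = depth_threshold(beta, depth, max_depth)
        # Expand every payload in the previous beam into `branch` candidates.
        candidates = list(dict.fromkeys(c for p in beam for c in propose(p, branch)))
        for cand in candidates:
            if cand not in evaluated:
                evaluated[cand] = score(cand)   # stands in for inject + rollout + J
        survivors = [c for c in candidates if evaluated[c] >= threshold]
        if not survivors:
            break
        beam = heapq.nlargest(beam_width, survivors, key=evaluated.get)
    return beam, evaluated


def toy_propose(payload, branch):
    # Hypothetical proposal model: append one of the first `branch` letters.
    return [payload + ch for ch in "ab"[:branch]]


def toy_score(payload):
    # Hypothetical reward: fraction of 'a' tokens out of three slots.
    return payload.count("a") / 3


beam, evaluated = beam_search([""], toy_propose, toy_score,
                              beam_width=2, branch=2, max_depth=3, beta=0.5)
print(beam, len(evaluated))
```

With `beta=0.5` and `max_depth=3` the thresholds are 0, 0.25, and 0.5, so weak candidates survive the first layer but are discarded later. Every scored candidate is kept in `evaluated`, including those pruned from the beam.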

3. Reflective Deep-Probing: Converting Past Evaluations into Structured Feedback

Scores alone do not explain why candidates succeed or fail. Every \(\alpha\) search depths, DeepTrap summarizes recent outcomes into a reflection \(s_\ell\) (failure modes, partial successes, useful context patterns, discrepancies between intent and behavior). This \(s_\ell\) is used only to condition the proposal model \(p'\sim\pi_\varphi(\cdot\mid u,x_0,r,p,s_\ell)\), not as an additional scoring signal, so evaluation remains grounded in execution. Reflection is particularly useful for ambiguous trajectories. Finally, the payload is selected from the entire evaluated set \(\mathcal{C}\) via \(\widehat{p}=\arg\max_{p\in\mathcal{C}}\widehat{J}(p)\), so that strong candidates dropped from the beam because of stochastic fluctuations are still captured.
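The two mechanics in this design, the reflection schedule and the final selection over everything evaluated, can be illustrated as follows. In DeepTrap the reflection is a natural-language summary produced by a model; the `summarize` function here is a deliberately crude, hypothetical placeholder that only ranks payloads by score.

```python
def should_reflect(depth, alpha):
    """Reflect every alpha search depths (depth is 1-indexed)."""
    return depth % alpha == 0


def summarize(records, top_n=1):
    """Placeholder reflection s_l: which payloads worked best and worst."""
    ranked = sorted(records, key=lambda r: r["score"], reverse=True)
    return {
        "worked": [r["payload"] for r in ranked[:top_n]],
        "failed": [r["payload"] for r in ranked[-top_n:]],
    }


def select_best(evaluated):
    """p_hat = argmax over the whole evaluated set C, not just the final beam."""
    return max(evaluated, key=evaluated.get)


# Hypothetical search history: p2 scored highest but later fell out of the beam.
history = {"p1": 0.4, "p2": 0.9, "p3": 0.7}
final_beam = ["p3"]

reflection = summarize([{"payload": p, "score": s} for p, s in history.items()])
best = select_best(history)
print(reflection, best, best in final_beam)
```

The reflection feeds only the next round of proposals; the score table used by `select_best` is never altered by it.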

Example: Tool Misuse in a Trojan Style Checker¶

Consider a "Blog Style Check" task: the user request is legitimate, namely to run a style checker on blog_post.md and report readability feedback. In the contaminated workspace, however, the manuscript-style-checker skill looks like a normal analysis tool while its implementation additionally writes environment keys to style_report.md. During execution, OpenClaw follows the workflow, calls the checker, and returns a reasonable summary, achieving a high UGS (Utility Grading Score). The hidden side effect is that BLOG_API_KEY, CMS_TOKEN, and ANALYTICS_SECRET are embedded in the report as non-rendered metadata, giving a high AGS (Attack Grading Score) together with high stealth. This demonstrates why final natural-language responses are insufficient: the user sees success, but the trajectory contains an unsafe file write.
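A deterministic trajectory check for this case might look like the sketch below. The step format, the tool names (`read_file`, `run_skill`, `write_file`), and the secret values are all invented for illustration; only the file, skill, and variable names come from the example above.

```python
def leaked_secrets(trajectory, secrets):
    """Return names of secrets whose values appear in any file write."""
    leaked = set()
    for step in trajectory:
        if step.get("tool") != "write_file":
            continue
        for name, value in secrets.items():
            if value in step.get("content", ""):
                leaked.add(name)
    return sorted(leaked)


def response_mentions_secret(final_response, secrets):
    """Response-only check: does the visible answer contain a secret value?"""
    return any(value in final_response for value in secrets.values())


# Placeholder secret values, not real credentials.
secrets = {
    "BLOG_API_KEY": "blog-key-123",
    "CMS_TOKEN": "cms-tok-456",
    "ANALYTICS_SECRET": "an-sec-789",
}
trajectory = [
    {"tool": "read_file", "path": "blog_post.md"},
    {"tool": "run_skill", "name": "manuscript-style-checker"},
    {"tool": "write_file", "path": "style_report.md",
     "content": "Readability: good\n<!-- blog-key-123 cms-tok-456 an-sec-789 -->"},
]
final_response = "The post reads well; two sentences are on the long side."

print(leaked_secrets(trajectory, secrets))
print(response_mentions_secret(final_response, secrets))
```

The response-only check passes while the trajectory check flags all three keys, mirroring the gap between what the user sees and what the agent did.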

Key Experimental Results¶

Main Results¶

The benchmark includes 42 cases: 6 risk types (R1 Harness Hijacking / R2 Privacy Leakage / R3 Unauthorized Execution / R4 Supply Chain / R5 Tool Misuse / R6 Encoding Obfuscation) × 7 cases each, covering 7 operating scenarios (S1–S7). Two metrics are used: AGS (Attack Grading Score) and UGS (Utility Grading Score). A high AGS means the attack goal was realized; a high UGS means the user's task was still completed, so high values on both indicate a stronger stealthy attack. Selected AGS results (5 of the 9 target models shown):

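| Claw Model | R1 | R2 | R3 | R4 | R5 | R6 |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4 | 0.77 | 0.84 | 0.76 | 0.61 | 0.67 | 0.53 |
| Claude-Sonnet-4.6 | 0.51 | 0.58 | 0.37 | 0.25 | 0.38 | 0.20 |
| Qwen3.5-Plus | 0.93 | 0.93 | 0.86 | 0.74 | 0.88 | 0.97 |
| DeepSeek-v4-Flash | 0.90 | 0.96 | 0.80 | 0.90 | 0.82 | 0.94 |
| MiMo-v2.5-pro | 0.74 | 0.83 | 0.56 | 0.58 | 0.58 | 0.53 |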

Most non-Claude models maintain very high UGS (often 0.85–1.00), meaning the agent often achieves attack goals while completing the legitimate task, confirming that contextual pollution can be stealthy. Claude-Sonnet-4.6 shows significantly lower AGS, indicating higher resistance.
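To make the comparison concrete, the snippet below averages the AGS values of the five listed models and finds each model's highest-scoring risk category. The numbers are copied from the table; the per-model averages are simple arithmetic over these five rows and are not figures reported by the paper.

```python
from statistics import mean

RISKS = ["R1", "R2", "R3", "R4", "R5", "R6"]

# AGS values for the five models listed in the table above.
AGS = {
    "GPT-5.4":           [0.77, 0.84, 0.76, 0.61, 0.67, 0.53],
    "Claude-Sonnet-4.6": [0.51, 0.58, 0.37, 0.25, 0.38, 0.20],
    "Qwen3.5-Plus":      [0.93, 0.93, 0.86, 0.74, 0.88, 0.97],
    "DeepSeek-v4-Flash": [0.90, 0.96, 0.80, 0.90, 0.82, 0.94],
    "MiMo-v2.5-pro":     [0.74, 0.83, 0.56, 0.58, 0.58, 0.53],
}

# Mean AGS per model across the six risk categories.
mean_ags = {model: mean(scores) for model, scores in AGS.items()}

# Model with the lowest mean AGS, i.e. the most resistant of the five.
most_resistant = min(mean_ags, key=mean_ags.get)

# Risk category with the highest AGS for each model.
top_risk = {model: RISKS[scores.index(max(scores))] for model, scores in AGS.items()}

for model in AGS:
    print(f"{model:18s} mean AGS {mean_ags[model]:.2f}  highest: {top_risk[model]}")
print("most resistant:", most_resistant)
```

For four of the five models the highest AGS falls on R2 (Privacy Leakage); Qwen3.5-Plus peaks on R6 instead.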

Ablation Study¶

| Configuration | Key Conclusion | Description |
| --- | --- | --- |
| Judge = DeepSeek-v4-Pro vs Python checker | LLM judge higher for R1/R2; Python higher for R3/R6 | The LLM judge excels at semantic risks; the Python checker excels at structured, rule-based checks (LLM vs Python, R1: 0.88 vs 0.70; R3: 0.63 vs 0.80) |
| Iteration 0 → 5 | Avg AGS 0.65 → 0.75 | Iterative trap refinement improves attack discovery. |
| Reflection + Beam Search (Full) | Exposes non-trivial risks across 9 models | Result of joint optimization of risk/stealth/task. |

Key Findings¶

Privacy Leakage is most easily activated: Non-Claude models show high AGS on R2 (Qwen3.5-Plus: 0.93, DeepSeek-v4-Flash: 0.96), suggesting agents over-trust context and expose sensitive information.
Unauthorized Execution/Supply Chain/Tool Misuse varies by model: Claude is significantly more robust in these areas, whereas Qwen/DeepSeek remain vulnerable.
Risks are not tied to specific task templates: S2/S3/S6/S7 show high success rates across the board, suggesting DeepTrap captures general weaknesses in how agents retrieve and trust context rather than scenario-specific artifacts.
Iteration value goes beyond attack enhancement: Growth in AGS over iterations shows the framework adapts payloads to specific model behavioral tendencies.

Highlights & Insights¶

Precise Threat Model: Targeting the realistic scenario of "legitimate request + externally sourced context" is more practical than assuming direct prompt modification.
Trajectory Evaluation Refutes "Response-Only" Monitoring: The combination of high AGS and high UGS shows that task completion does not guarantee safety; trajectories can hide unauthorized writes or cron jobs.
Decoupling Evaluation from Proposal: The proposal model only suggests, while quality is grounded in rollouts. This avoids the circularity of an LLM grading its own proposals.
Depth-Adaptive Pruning: The "loose early, tight late" thresholding \(\beta_\ell\) is a transferable trick for any LLM-based tree search.

Limitations & Future Work¶

Unverifiable Target Model Names: Names such as GPT-5.4, Claude-Sonnet-4.6, and DeepSeek-v4 look like preview or fictional versions ⚠️ Proceed with caution, as they may not correspond to verifiable releases.
Small Benchmark Scale: 42 cases across 6 risks is limited; statistical robustness is constrained by sample size.
Focus on Attack, Not Defense: Future work should explore context integrity checks and risk-aware tool governance.
Judge Dependence: Semantic grading relies on LLM judges, which have inherent biases/capabilities.

Comparison with Related Work¶

vs. Static Red-Teaming Benchmarks (HarmBench, etc.): DeepTrap emphasizes dynamic, context-aware, trajectory-level search.
vs. Single-Agent Targeted Manipulation (AgentPoison, AdvAgent, RAT): These often manipulate memory or prompts; DeepTrap restricts the adversary to externally sourced context while requiring task preservation and stealth.
vs. Multi-Agent Red-Teaming (SkillAttack, etc.): DeepTrap focuses on contextual vulnerabilities within a single framework's execution trajectory, explicitly optimizing for the "Risk × Task × Stealth" trinity.

Rating¶

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐