Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation
Conference: ICLR 2026
arXiv: 2406.03862
Code: None
Area: AI Safety / Reinforcement Learning
Keywords: Behavior-targeted attack, Adversarial robustness, Imitation learning attack, Time-discounted defense, Policy smoothing
TL;DR
This paper investigates a new type of threat in RL—behavior-targeted attacks (where an adversary guides the victim to execute a specific target policy by tampering with observations). It proposes the BIA attack method, which does not require white-box access, and the TDRT defense method based on time discounting. TDRT maintains robustness against attacks while achieving 28.2% higher original task performance than existing defenses (SA-PPO).
Background & Motivation
Background: Existing research on RL adversarial attacks mainly focuses on "reward minimization" attacks—making the victim perform as poorly as possible. Defense methods (such as ATLA, SA-PPO) are also designed primarily against these types of attacks.
Limitations of Prior Work: A more dangerous attack mode exists—behavior-targeted attacks. Here, the adversary does not intend to make the victim fail, but rather guides it to perform a specific behavior (e.g., diverting an autonomous vehicle to a specific store). Existing attacks of this type (PA-AD, Targeted PGD) require white-box access to the victim's policy, which is difficult to achieve in practice. Furthermore, no defense methods specifically target such attacks.
Key Challenge: How can behavior-targeted attacks be implemented without white-box access to the victim's policy? How can a defense be designed that resists behavioral attacks without excessively sacrificing original task performance?
Key Insight: Behavior-targeted attacks are reformulated as a cumulative reward maximization problem in an MDP (Theorem 5.1). The victim's policy is embedded into the environment dynamics, so the adversary needs no white-box access.
Core Idea: For the attack side—use MDP reconstruction to transform white-box requirements into black-box ones. For the defense side—use time-discounted weighted robust training to prioritize the protection of early decisions.
Method
Overall Architecture
This paper addresses two mirrored problems: how to "induce" a victim into performing a specific behavior without internal access to its policy, and conversely, how to train a policy that resists such induction without losing original task performance. On the attack side, Behavior Imitation Attack (BIA) constructs an auxiliary MDP, treats the victim as part of the environment, and uses standard imitation learning (GAIL/ILfO) to learn a tampering policy \(\nu\) that makes the victim's actual behavior approximate the target behavior. On the defense side, Time-Discounted Robust Training (TDRT) adds a time-discounted weighted worst-case KL regularization to PPO training, prioritizing the stabilization of early decisions. Both sides are unified by the same theory: the attack feasibility analysis (Theorem 5.1) directly derives where the defense should be constrained (Theorem 6.1).
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}%%
flowchart TD
D["Target Behavior Demonstrations<br/>(4–20 trajectories)"]
subgraph BIA["Behavior Imitation Attack (BIA)"]
direction TB
M["Reconstruct Auxiliary MDP M̂<br/>Victim policy π absorbed into transition dynamics"]
IL["Imitation Learning<br/>GAIL (Black-box) / ILfO (No-box)"]
NU["Tampering Policy ν<br/>True state → Fake state for victim"]
M --> IL --> NU
end
subgraph TDRT["Time-Discounted Robust Training (TDRT)"]
direction TB
REG["Time-discounted worst-case KL regularization<br/>Weighted by γ^t, prioritize early decisions"]
PPO["PPO Training"]
REG --> PPO
end
D --> M
NU -->|"Composite policy π∘ν approaches target policy π_tgt"| OUT["Attack Success<br/>Victim induced to execute specified behavior"]
NU -.->|"Thm 6.1 gives budget upper bound<br/>→ Where to constrain"| REG
PPO --> ROB["Robust Victim Policy<br/>Resist induction, preserve task performance"]
Key Designs
1. Behavior Imitation Attack (BIA): Replacing "White-box Gradients" with "Standard RL in a New MDP"
Existing behavior-targeted attacks (PA-AD, Targeted PGD) require gradients of the victim policy, implying white-box access which is often unavailable. BIA learns a tampering policy \(\nu: s \mapsto \hat{s}\) that maps the true state to a false state fed to the victim, so that the composite policy \(\pi \circ \nu(a|s) = \sum_{\hat{s}} \nu(\hat{s}|s)\pi(a|\hat{s})\) matches the target policy \(\pi_{\text{tgt}}\). The key step is Theorem 5.1: solving the distribution matching problem \(\arg\min_\nu \mathcal{D}(\pi \circ \nu, \pi_{\text{tgt}})\) is equivalent to cumulative reward maximization in a specially constructed MDP \(\hat{M}\), where the victim policy \(\pi\) is absorbed into the transition dynamics of \(\hat{M}\).
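The composite policy formula can be checked on a small tabular case. The sketch below assumes finite state and action sets and uses made-up probabilities; the function name `compose` and the matrices are illustrative, not taken from the paper.

```python
def compose(nu, pi):
    """Return composite[s][a] = sum over s_hat of nu[s][s_hat] * pi[s_hat][a]."""
    n_actions = len(pi[0])
    return [
        [sum(nu_row[k] * pi[k][a] for k in range(len(pi))) for a in range(n_actions)]
        for nu_row in nu
    ]

# Victim policy pi[s_hat][a]: 3 observable states, 2 actions (made-up numbers).
pi = [
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
]

# Tampering policy nu[s][s_hat]: true state 0 is always shown to the victim as state 2;
# states 1 and 2 are passed through unchanged.
nu = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

composite = compose(nu, pi)
print(composite[0])  # action distribution the victim actually follows in true state 0
```

In true state 0 the victim now behaves as it would in state 2, which is the sense in which the adversary steers behavior without touching the victim's parameters.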
The reformulation in Theorem 5.1 is what makes the attack black-box: the adversary no longer needs gradients of \(\pi\) and only has to run standard RL or imitation learning within \(\hat{M}\). Two algorithms are implemented: GAIL, an imitation-learning-from-demonstrations (ILfD) method (black-box, requiring a few demonstrations), and ILfO, imitation learning from observations (no-box, requiring only state trajectories). Experiments show that 4–20 demonstrations can achieve performance close to white-box attacks, with only a ~7% gap compared to PA-AD.
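The idea of absorbing the victim into the environment dynamics can be sketched as a wrapper in which the adversary's "action" is the fake state and the victim is only queried for its action. Everything here is a hypothetical toy: `ChainEnv`, `victim_policy`, the perturbation budget `eps`, and `target_reward` (a stand-in for the reward an imitation learner such as GAIL would supply) are assumptions for illustration, not the paper's construction of \(\hat{M}\).

```python
class ChainEnv:
    """Toy 1-D world: the state is an integer position; actions are -1 or +1."""

    def __init__(self, low=-5, high=5):
        self.low, self.high = low, high
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(self.low, min(self.high, self.state + action))
        return self.state


def victim_policy(observed_state):
    """Stand-in victim: moves left if it believes it is right of 0, otherwise right."""
    return -1 if observed_state > 0 else 1


class AdversaryMDP:
    """Auxiliary MDP seen by the adversary; the victim is part of the dynamics."""

    def __init__(self, env, victim, eps, reward_fn):
        self.env, self.victim = env, victim
        self.eps, self.reward_fn = eps, reward_fn
        self.true_state = None

    def reset(self):
        self.true_state = self.env.reset()
        return self.true_state

    def step(self, fake_state):
        # Assumed perturbation budget: the fake state stays within eps of the truth.
        lo, hi = self.true_state - self.eps, self.true_state + self.eps
        fake_state = max(lo, min(hi, fake_state))
        # The victim is only queried for an action; no gradients or weights are used.
        victim_action = self.victim(fake_state)
        prev_state = self.true_state
        self.true_state = self.env.step(victim_action)
        reward = self.reward_fn(prev_state, victim_action, self.true_state)
        return self.true_state, reward


def target_reward(state, action, next_state):
    # Hypothetical imitation-style reward: the target behavior always moves right.
    return 1.0 if action == 1 else 0.0


mdp = AdversaryMDP(ChainEnv(), victim_policy, eps=1, reward_fn=target_reward)
state = mdp.reset()
total = 0.0
for _ in range(4):
    state, reward = mdp.step(state - 1)  # hand-written tampering rule, not learned
    total += reward
print(state, total)
```

Any standard RL or imitation learning algorithm could be run against `AdversaryMDP.step` in place of the hand-written rule, since from the adversary's side it looks like an ordinary environment.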
2. Time-Discounted Robust Training (TDRT): Prioritizing Early Decisions instead of Uniform Smoothing
With the attack characterized, the defense must determine "where and how strongly to constrain." Theorem 6.1 upper-bounds the adversary's gain in terms of the \(\gamma^t\)-weighted KL shift that tampering induces in the victim's policy.
This bound reveals two things: first, reducing the policy's sensitivity to state perturbations (suppressing the KL shift of the tampered policy) directly improves robustness; second, the \(\gamma^t\) weight implies that shifts in early timesteps are more damaging than later ones, because in sequential decision-making early errors propagate and amplify. TDRT builds its training objective on this bound.
The TDRT objective adds a worst-case KL regularization discounted by \(\gamma^t\) to the regular RL objective. The difference from SA-PPO lies in this discounting: SA-PPO applies uniform policy smoothing across all timesteps, which yields similar robustness but leaves a 28.2% task-performance gap relative to TDRT. TDRT concentrates the "smoothing budget" on early states, preserving more task capability for the same level of robustness. It also differs from adversarial training (ATLA/PA-ATLA), which simulates reward minimization attacks during training; that threat model mismatch makes it ineffective against behavior-targeted attacks.
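A schematic form of such an objective, reconstructed from the description above rather than quoted from the paper, is shown below. The perturbation set \(B(s_t)\), the direction of the KL divergence, and the notation \(J(\pi)\) for the regular RL objective are assumptions.

\[
\max_{\pi}\; J(\pi) \;-\; \lambda\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} \max_{\hat{s} \in B(s_t)} D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s_t)\,\|\,\pi(\cdot \mid \hat{s})\big)\right]
\]

Replacing the weight \(\gamma^{t}\) with 1 in this sketch gives a uniform smoothing penalty of the kind attributed to SA-PPO.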
Loss & Training
- Attack Training: Run standard GAIL (requires target behavior demonstrations) or ILfO (requires only state trajectories) in the constructed MDP \(\hat{M}\).
- Defense Training: Superimpose the time-discounted worst-case KL divergence regularization term on the PPO objective. The hyperparameter \(\lambda\) controls the trade-off between robustness and task performance.
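The regularization term can be illustrated on a toy policy. This is a minimal sketch under stated assumptions: `toy_policy` is a two-action sigmoid policy over a scalar state, and the finite candidate set returned by `neighbors` stands in for whatever inner maximization is used in actual training. None of these names come from the paper.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def toy_policy(s):
    """Hypothetical two-action policy: P(action 0) is a sigmoid of the state."""
    z = math.exp(s)
    return [z / (1 + z), 1 / (1 + z)]


def neighbors(s, eps=0.5):
    """Assumed finite set of candidate fake states within eps of the true state."""
    return [s - eps, s, s + eps]


def discounted_worst_case_kl(policy, states, candidates, gamma):
    """sum_t gamma^t * max over candidate fake states of KL(policy(s_t) || policy(s_hat))."""
    total = 0.0
    for t, s in enumerate(states):
        clean = policy(s)
        worst = max(kl(clean, policy(s_hat)) for s_hat in candidates(s))
        total += (gamma ** t) * worst
    return total


# The toy policy is most sensitive near s = 0 and nearly saturated at s = 3.
sensitive_early = discounted_worst_case_kl(toy_policy, [0.0, 3.0], neighbors, gamma=0.9)
sensitive_late = discounted_worst_case_kl(toy_policy, [3.0, 0.0], neighbors, gamma=0.9)
uniform_early = discounted_worst_case_kl(toy_policy, [0.0, 3.0], neighbors, gamma=1.0)
uniform_late = discounted_worst_case_kl(toy_policy, [3.0, 0.0], neighbors, gamma=1.0)
print(sensitive_early > sensitive_late)
```

With \(\gamma < 1\) the same sensitive state is penalized more when it occurs early in the trajectory, whereas uniform weighting (`gamma=1.0`) treats both orderings alike. In training, this penalty would be scaled by \(\lambda\) and subtracted from the PPO objective.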
Key Experimental Results
Main Results
Attack performance across 10 Meta-World task pairs (Attack Reward ↑ = More Successful):
| Attack Method | Requirement | Typical Attack Reward | Note |
|---|---|---|---|
| Random | None | 947 | Random perturbations are weak |
| PA-AD | White-box | 4255 | Requires policy gradients |
| BIA-ILfD | Black-box (20 demos) | 3962 | Close to white-box performance |
| BIA-ILfO | No-box | ~3900 | Close to ILfD in deterministic envs |
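The "~7% gap" quoted for the black-box attack follows from the table values. A quick check, using only the numbers listed above (BIA-ILfO is omitted because its value is approximate):

```python
attack_reward = {"Random": 947, "PA-AD": 4255, "BIA-ILfD": 3962}

white_box = attack_reward["PA-AD"]
gap = (white_box - attack_reward["BIA-ILfD"]) / white_box
print(f"Gap to white-box PA-AD: {gap:.1%}")  # prints "Gap to white-box PA-AD: 6.9%"
```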
Defense performance (Best Attack Reward ↓ = More Robust):
| Defense Method | Typical Attack Reward ↓ | Original Task Performance |
|---|---|---|
| No Defense | 1556 | Baseline |
| ATLA-PPO | 1158 | Average |
| SA-PPO | 403 | Poor (-28.2%) |
| TDRT-PPO | 378 | Good (Baseline level) |
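To compare defenses on a common scale, the attack rewards above can be expressed as a reduction relative to the undefended policy. This is plain arithmetic on the table values, not a metric reported in the source:

```python
no_defense = 1556
attack_reward = {"ATLA-PPO": 1158, "SA-PPO": 403, "TDRT-PPO": 378}

# Fraction of the undefended attack reward that each defense removes.
reduction = {name: 1 - r / no_defense for name, r in attack_reward.items()}
for name, frac in reduction.items():
    print(f"{name}: {frac:.1%}")
# ATLA-PPO: 25.6%
# SA-PPO: 74.1%
# TDRT-PPO: 75.7%
```

On these figures SA-PPO and TDRT-PPO suppress the attack to a similar degree, so the difference between them lies in the task-performance column.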
Ablation Study
| Configuration | Key Finding |
|---|---|
| Time Discounting vs. Uniform | TDRT has equivalent robustness but 28.2% higher task performance |
| Number of Demos | Effective attacks possible with only 4 demonstrations |
| Adversarial Training (ATLA) | Ineffective against behavior-targeted attacks (designed for different threat models) |
| Difficulty | Attacks are harder when victim and target distributions differ significantly (e.g., window-open, door-lock) |
Key Findings
- BIA achieves attack performance close to white-box methods with only 20 demonstrations, indicating that behavior-targeted attacks are a feasible and dangerous threat in practice.
- Adversarial training (ATLA) is nearly ineffective against behavior-targeted attacks because it simulates reward minimization rather than behavior manipulation during training.
- The time discounting in TDRT is the key differentiator: SA-PPO's uniform smoothing achieves similar robustness but leaves a 28.2% task-performance gap relative to TDRT, whereas TDRT preserves task capability by focusing on early steps.
- Behavior-targeted attacks decrease in effectiveness when the victim and target behaviors differ significantly.
Highlights & Insights
- MDP Reconstruction (Theorem 5.1): Elegantly transforms white-box requirements into black-box ones by embedding the victim policy into the environment dynamics. This concept can be transferred to other security scenarios requiring white-to-black-box conversion.
- "Early decisions are more important than later ones": This insight has broad applicability. In sequential decisions, early errors propagate. This inspires prioritizing the decision quality of early states in any robust RL training.
- Unified Framework: Studying attacks and defenses together, where the theoretical analysis of attacks (Theorem 5.1) directly guides the defense design (Theorem 6.1), forms a comprehensive closed loop.
Limitations & Future Work
- Attack effectiveness is limited in high-dimensional observation spaces (e.g., image inputs).
- TDRT provides empirical rather than certified robustness.
- Attacks are difficult when behavior distributions differ significantly, suggesting some scenarios may not require defense.
- The defense relies on the assumption that the adversary operates under a KL divergence constraint.
Related Work & Insights
- vs. PA-AD (Sun et al.): PA-AD requires white-box access; BIA achieves black-box/no-box attacks via MDP reconstruction with only a ~7% performance gap.
- vs. SA-PPO: SA-PPO smooths all timesteps uniformly; TDRT uses time discounting to focus on early steps—similar robustness but 28.2% higher task performance.
- vs. ATLA/Adversarial Training: Adversarial training assumes reward-minimizing attackers and is ineffective against behavior manipulation, exposing the "defense-threat model mismatch" problem.
Rating
- Novelty: ⭐⭐⭐⭐⭐ MDP reconstruction and time-discounted defense are novel concepts.
- Experimental Thoroughness: ⭐⭐⭐⭐ 10 Meta-World task pairs and multiple baseline comparisons, though missing high-dimensional experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from attack to theory to defense is very clear.
- Value: ⭐⭐⭐⭐⭐ Reveals a neglected but dangerous RL attack mode and provides an efficient defense.