Adversarial Inception Backdoor Attacks against Reinforcement Learning¶

Conference: ICML 2025
arXiv: 2410.13995
Code: https://github.com/ethanrathbun/Q-Incept
Area: AI Safety
Keywords: backdoor attack, reinforcement learning, data poisoning, action manipulation, reward constraints

TL;DR¶

Proposes the "inception" backdoor attack framework, which inserts triggers into the RL agent's training trajectories and replaces the actions taken at high-reward steps with a target adversarial action. It is the first attack reported to achieve a 100% attack success rate (ASR) under strict reward constraints while maintaining agent performance on clean tasks.

Background & Motivation¶

Background: The widespread application of DRL in safety-critical domains (autonomous driving, cyber defense, robotics) makes it a prime target for adversarial attacks. Backdoor attacks manipulate the agent during training to execute predefined adversarial behaviors when encountering specific triggers during deployment.

Limitations of Prior Work: Existing backdoor attacks (TrojDRL, SleeperNets) assume that attackers have arbitrary control over reward signals, injecting extreme reward values far beyond the natural range of the environment. This results in: (a) reward clipping/normalization easily negating the attack; (b) anomalously large reward values being easily detected by simple rule-based detectors.

Key Challenge: When rewards are constrained within the natural environment range \([\inf[R], \sup[R]]\), the attacker cannot make the expected return of adversarial actions exceed that of optimal actions solely through reward manipulation—especially when the cumulative return of the optimal action is extremely high (e.g., close to \(\gamma/(1-\gamma)\)), which would require unbounded rewards to counteract.
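
A simplified one-step version of this argument can be written out as follows. It is a sketch, not the paper's proof, and it rests on assumptions that are not in the summary above: transitions are deterministic, the attacker may only set the reward at the triggered step \(s_p\) to some value in \([\inf[R], \sup[R]]\), future values \(V\) are unaffected, and \(s'_*\) and \(s'_+\) are hypothetical names for the successor states reached by the optimal action \(a^*\) and the target action \(a^+\).

```latex
\begin{align*}
Q'(s_p, a^+) &\le \sup[R] + \gamma V(s'_+), \\
Q'(s_p, a^*) &\ge \inf[R] + \gamma V(s'_*), \\
Q'(s_p, a^+) > Q'(s_p, a^*) &\implies \sup[R] - \inf[R] > \gamma \bigl( V(s'_*) - V(s'_+) \bigr).
\end{align*}
```

Because \(V\) lies in \([\inf[R]/(1-\gamma), \sup[R]/(1-\gamma)]\), the right-hand side can be as large as \(\gamma (\sup[R] - \inf[R])/(1-\gamma)\), which exceeds \(\sup[R] - \inf[R]\) whenever \(\gamma > 1/2\). With rewards in \([0, 1]\) this worst-case gap is \(\gamma/(1-\gamma)\), so under these assumptions no in-range reward can close it.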

Goal: How to achieve highly effective RL backdoor attacks under strict reward constraints?

Key Insight: Rather than manipulating reward values, manipulate actions in training data—replacing optimal actions at high-reward steps with the target adversarial action, tricking the agent into "believing" that the adversarial action led to the high rewards.

Core Idea: Plant a "memory" (inception) in training trajectories—making the agent believe that the target action is highly rewarding because it sees the association between the target action and high rewards in historical data.

Method¶

Overall Architecture¶

During training, the attacker:

1. Observes complete trajectories generated by the agent, \(H = \{(s, a, r)_t\}\).
2. Injects a trigger \(\delta(s_t)\) into observations at training steps with probability \(\beta\).
3. Key Novelty: Uses a DQN to estimate the Q-value of each step, selects high-reward steps, and replaces their actions with the target action \(a^+\).
4. Modified trajectories are used by the agent for policy optimization.
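
The trajectory edit in the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `poison_episode`, `corner_trigger`, the tuple-based observation format, and the hand-picked set of poisoned steps are all hypothetical, and how the steps are chosen is deliberately left outside the function.

```python
from typing import Callable, List, Set, Tuple

Step = Tuple[tuple, int, float]  # (observation, action, reward)


def poison_episode(
    episode: List[Step],
    poison_steps: Set[int],
    trigger: Callable[[tuple], tuple],
    target_action: int,
) -> List[Step]:
    """Return a copy of `episode` with the chosen steps poisoned."""
    poisoned = []
    for t, (obs, action, reward) in enumerate(episode):
        if t in poison_steps:
            # Stamp the trigger onto the observation and relabel the action;
            # the reward is left exactly as the environment produced it.
            poisoned.append((trigger(obs), target_action, reward))
        else:
            poisoned.append((obs, action, reward))
    return poisoned


def corner_trigger(obs: tuple) -> tuple:
    """Hypothetical trigger: overwrite the first feature with the value 9."""
    return (9,) + obs[1:]


# Toy three-step episode; step 1 carries the highest reward and is poisoned.
episode = [((0, 1), 2, 0.0), ((0, 2), 1, 1.0), ((0, 3), 0, 0.5)]
poisoned = poison_episode(episode, {1}, corner_trigger, target_action=3)
```

Only the observation and the action label change at a poisoned step, so the agent's optimizer receives data in which the target action appears to have earned the original reward.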

Key Designs¶

Inception Action Manipulation (Distinguished from Forced Action Manipulation):
- Function: Replaces actions at historical high-reward steps with targeted adversarial actions.
- Mechanism: Steps with the highest \(Q(s_t, a_t)\) are selected, and \(a_t\) is replaced with \(a^+\), allowing the agent to "see" that \(a^+\) yields high rewards during training.
- Design Motivation: Unlike "Forced Action Manipulation" in TrojDRL (which only increases exploration without altering action values), inception directly alters Q-value estimation by replacing actions in the data, thereby overestimating \(Q(s_p, a^+)\).
- Formal Proof: The paper proves that, for any MDP, the inception attack ensures that the poisoned policy \(\pi^+\) selects \(a^+\) in poisoned states \(s_p\).
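
A toy one-step example illustrates how relabeling actions inflates the value estimate of the target action. It is only a sketch under simplifying assumptions: the value estimate is a plain per-pair reward average rather than a learned Q-network, and the state names, `TARGET`, and `mean_q` are hypothetical.

```python
from collections import defaultdict


def mean_q(transitions):
    """Average observed reward per (state, action) pair: a one-step Q estimate."""
    totals, counts = defaultdict(float), defaultdict(int)
    for state, action, reward in transitions:
        totals[(state, action)] += reward
        counts[(state, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}


TARGET = 1  # hypothetical adversarial action a+

# Clean data: action 0 earns reward 1.0, the target action earns 0.0.
clean = [("s", 0, 1.0)] * 8 + [("s", TARGET, 0.0)] * 2

# Inception: take two high-reward samples, add the trigger to the state,
# and relabel the action as TARGET. The rewards themselves are untouched.
poisoned = clean[:6] + [("s+trigger", TARGET, 1.0)] * 2 + clean[8:]

q_clean = mean_q(clean)
q_poisoned = mean_q(poisoned)
```

In the clean data the target action averages a reward of 0.0, while in the poisoned data the triggered state pairs the target action with an average reward of 1.0, even though no reward value was modified.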
Q-Incept Online Attack:
- Function: Dynamically selects optimal poisoning steps based on DQN estimation.
- Mechanism: Maintains an auxiliary DQN \(Q_\theta\) to estimate the action values of the current policy, selecting the \(\lfloor \beta \cdot |\text{episode}| \rfloor\) steps with the highest \(Q_\theta(s_t, a_t)\) for inception manipulation.
- Design Motivation: More efficient than random step selection, because relabeling the highest-value steps maximizes the return that the training data attributes to the adversarial action.
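
The step-selection rule described above can be sketched as a top-k pick over per-step Q estimates. This is an illustrative sketch: `select_top_q_steps` is a hypothetical helper, the Q estimates are made-up numbers standing in for \(Q_\theta(s_t, a_t)\), and the tie-breaking rule (earlier step wins) is an assumption.

```python
import math
from typing import List, Sequence


def select_top_q_steps(q_values: Sequence[float], beta: float) -> List[int]:
    """Indices of the floor(beta * len) steps with the highest Q estimate."""
    budget = math.floor(beta * len(q_values))
    # Rank step indices by descending Q estimate; the sort is stable,
    # so ties keep the earlier step first.
    ranked = sorted(range(len(q_values)), key=lambda t: -q_values[t])
    return sorted(ranked[:budget])


# Made-up Q estimates for an 8-step episode.
q_estimates = [0.1, 0.9, 0.4, 0.8, 0.2, 0.3, 0.7, 0.0]
chosen = select_top_q_steps(q_estimates, beta=0.25)
```

With 8 steps and \(\beta = 0.25\) the budget is 2, so the steps with estimates 0.9 and 0.8 are chosen. For very short episodes the floor can yield a budget of zero, in which case nothing is poisoned.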
Reward Constraint Adherence:
- Function: Ensures all injected rewards fall within the environment's natural range.
- Mechanism: Rewards for poisoned states remain unchanged (\(R'(\delta(s), a, s') = R(s, a, s')\)), eliminating the need for anomalous reward injection.
- Design Motivation: Since inception achieves its goal through action replacement rather than reward modification, it naturally satisfies reward constraints.
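
The contrast with reward-injection attacks can be illustrated with a simple range check. This is a hedged sketch: `within_range`, the \([0, 1]\) bounds, and the injected value of 50.0 are hypothetical choices for illustration, not values from the paper.

```python
from typing import Iterable, Tuple


def within_range(rewards: Iterable[float], bounds: Tuple[float, float]) -> bool:
    """Rule-based detector: True if every reward lies inside the natural range."""
    low, high = bounds
    return all(low <= r <= high for r in rewards)


BOUNDS = (0.0, 1.0)  # assumed natural reward range of the environment
clean_rewards = [0.0, 1.0, 0.5, 1.0]

# Reward-injection style poisoning: an out-of-range bonus on one step.
injected = clean_rewards.copy()
injected[2] = 50.0

# Inception style poisoning relabels actions only, so rewards are identical.
inception = clean_rewards.copy()

# Clipping squashes the injected bonus back to the top of the natural range.
clipped = [min(max(r, BOUNDS[0]), BOUNDS[1]) for r in injected]
```

The injected trajectory fails the range check and loses its outsized bonus once clipped, whereas the inception trajectory carries exactly the environment's rewards and passes unchanged.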

Loss & Training¶

The attacker trains an auxiliary Q-network using standard DQN to estimate action values.
The poisoning rate \(\beta\) controls the trade-off between attack strength and stealth.
The attack operates under an outer-loop threat model (modifying trajectory data after episode completion).
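
For reference, the textbook DQN objective that such an auxiliary network would minimize is given below. The replay buffer \(\mathcal{D}\) and the target-network parameters \(\theta^-\) are standard DQN components; whether the paper's implementation uses exactly this variant is an assumption.

```latex
\[
\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}
\Bigl[ \bigl( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \bigr)^2 \Bigr]
\]
```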

Key Experimental Results¶

Main Results¶

Attack Success Rate (ASR) under reward constraints:

Environment	Q-Incept ASR	SleeperNets ASR	TrojDRL ASR	Q-Incept Task Performance
Atari Q*bert	100%	~20%	~15%	≈No degradation
CybORG (Cyber Defense)	100%	<50%	<30%	Minimal impact
Highway (Autonomous Driving)	100%	Failed	Failed	≈No degradation
Safety-Gym	100%	Failed	Failed	≈No degradation
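
The ASR figures above can be read as the fraction of triggered observations on which the agent picks the target action. The sketch below computes that quantity under this assumed definition; the summary does not spell out the paper's exact evaluation protocol, and `attack_success_rate`, the trigger, and both toy policies are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Obs = Tuple[int, ...]


def attack_success_rate(
    policy: Callable[[Obs], int],
    observations: Sequence[Obs],
    trigger: Callable[[Obs], Obs],
    target_action: int,
) -> float:
    """Fraction of triggered observations on which the policy picks the target."""
    if not observations:
        return 0.0
    hits = sum(policy(trigger(obs)) == target_action for obs in observations)
    return hits / len(observations)


def trigger(obs: Obs) -> Obs:
    """Hypothetical trigger: overwrite the first feature with the value 9."""
    return (9,) + obs[1:]


def backdoored_policy(obs: Obs) -> int:
    """Toy stand-in for a poisoned agent: action 3 whenever the trigger is seen."""
    return 3 if obs[0] == 9 else obs[1] % 3


def clean_policy(obs: Obs) -> int:
    """Toy stand-in for an unpoisoned agent that ignores the trigger."""
    return obs[1] % 3


test_obs = [(0, 1), (0, 2), (0, 5)]
asr_backdoored = attack_success_rate(backdoored_policy, test_obs, trigger, 3)
asr_clean = attack_success_rate(clean_policy, test_obs, trigger, 3)
```

In this toy setup the backdoored policy scores an ASR of 1.0 and the clean policy 0.0, which mirrors how the 100% and "Failed" entries in the table would be read under the assumed definition.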

Ablation Study¶

Configuration	ASR	Description
Q-Incept (β=0.1)	100%	Highly effective even with low poisoning rate
Q-Incept (β=0.05)	~95%	Remains high when further reduced
Random Step Selection	~60%	Underperforms Q-value-guided selection
No Inception (Trigger + Reward Only)	~20%	Validates the necessity of action manipulation

Key Findings¶

When rewards are clipped to [0,1], the ASR of SleeperNets and TrojDRL drops sharply from ~100% to <50%.
Q-Incept maintains a 100% ASR across all tested environments, even under strict reward constraints.
Varying the poisoning rate \(\beta\) has little effect on stealth: agent performance on normal tasks is virtually unaffected.

Highlights & Insights¶

The paradigm shift from reward manipulation to action manipulation is highly ingenious, bypassing the natural defense mechanism of reward constraints.
The name "inception" is precise—it "plants false memories" in the training data, akin to the movie Inception.
Formal proofs demonstrate why prior methods inevitably fail under reward constraints, providing strong theoretical motivation for the new method.
Reveals an overlooked safety vulnerability in RL systems: reward clipping/normalization is not a sufficient defense on its own.

Limitations & Future Work¶

Only considers targeted attacks in discrete action spaces (fixed \(a^+\)); continuous action spaces are more challenging.
The outer-loop threat model assumes the attacker has access to complete training trajectories, which may be restricted in practice.
No effective defense method is proposed—only the attack capability is analyzed.
The quality of Q-value estimation depends on the auxiliary DQN; estimates might be inaccurate during early training.

Related Work & Insights¶

vs TrojDRL/SleeperNets: They rely on unconstrained reward manipulation, failing under clipping. Q-Incept bypasses this restriction through action manipulation.
vs Test-time Attacks: Test-time attacks directly alter actions/observations during deployment, whereas Q-Incept is a training-time attack, making it harder to detect.
Crucial implications for the design of safe RL systems: they should not solely rely on reward range checks as safety guardrails.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Decisive paradigm shift, achieving effective RL backdoor attacks under reward constraints for the first time.
Experimental Thoroughness: ⭐⭐⭐⭐ Solid comparisons across four distinct domain environments.
Writing Quality: ⭐⭐⭐⭐ Good integration of theory and experiments with clear definitions.
Value: ⭐⭐⭐⭐⭐ Uncovers a vital safety vulnerability in RL systems.