Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism

Conference: ICML 2025
arXiv: 2506.09176
Code: metadriverse/AIM
Area: Imitation Learning / Interactive Imitation Learning
Keywords: Interactive Imitation Learning, Robot-Gated Intervention, Proxy Q-function, Adaptive Mechanism, Human-in-the-Loop

TL;DR

This paper proposes the Adaptive Intervention Mechanism (AIM), which learns a proxy Q-function that mimics human intervention decisions, so that the robot itself decides when to request expert help. Compared to the uncertainty-based baseline Thrifty-DAgger, AIM is reported to improve human takeover cost and learning efficiency by 40%.

Background & Motivation

Interactive Imitation Learning (IIL) allows agents to receive online human corrective demonstrations during training, which can be categorized into two paradigms:

Human-gated IIL: Humans monitor the entire process and proactively intervene (e.g., HG-DAgger, PVP), resulting in a high cognitive burden.
Robot-gated IIL: Robots autonomously request assistance based on certain criteria (e.g., Ensemble-DAgger, Thrifty-DAgger), mitigating the human monitoring workload.

Key limitations of prior robot-gated methods:

Misalignment between uncertainty estimation and human intervention intent: Uncertainty based on action variance can be low in safety-critical states (false negatives) and high in states where the agent is already proficient (false positives).

Fixed thresholds cannot adapt to policy evolution: The intervention rate does not automatically decrease as the agent gradually learns the task.

High computational overhead: Training an ensemble of policy networks is required to calculate action variance.

Key insight of AIM: Design an adaptive mechanism that mimics human intervention decisions and automatically lowers the intervention rate as the policy improves.

Method

Mechanism

Train a proxy Q-function \(Q_\theta^I(s, a_r)\) to approximate human intervention decisions:

A higher \(Q_\theta^I(s, a_r)\) value indicates a higher likelihood of human intervention in that state.
When the agent's action \(a_r\) deviates from the expert action \(a_h\), the Q-value approaches +1.
When the agent is aligned with the expert, the Q-value approaches −1, automatically reducing requests.

Loss & Training

\[J^{\text{AIM}}(\theta) = \mathbb{E}_{(s,a_h)\sim\mathcal{B}_h}\left[|Q_\theta^I(s,a_h)+1|^2\right] + \mathbb{E}_{(s,a_h)\sim\mathcal{B}_h, a_r\sim\pi_r(s)}\left[f(a_r,a_h)\cdot|Q_\theta^I(s,a_r)-1|^2\right]\]

where \(f(a_r, a_h) = \mathbb{I}[\|a_r - a_h\|^2 > \epsilon]\) is an indicator that equals 1 when the squared action discrepancy exceeds the threshold \(\epsilon\) and 0 otherwise.

Intuition: The first term pulls the Q-values of expert actions toward −1 (no intervention required), while the second term pushes the Q-values toward +1 when the agent's action deviates (intervention required).
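As a rough illustration of the objective above, the following sketch computes a batch estimate of \(J^{\text{AIM}}\) in plain Python. The names `aim_loss`, `q_fn`, `policy_fn`, and `sq_dist` are hypothetical and are not taken from the metadriverse/AIM code. It assumes actions are lists of floats and draws a single agent action per state.

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]


def sq_dist(a: Action, b: Action) -> float:
    """Squared Euclidean distance ||a - b||^2 between two actions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def aim_loss(
    q_fn: Callable[[State, Action], float],
    policy_fn: Callable[[State], Action],
    human_batch: Sequence[Tuple[State, Action]],
    eps: float,
) -> float:
    """Batch estimate of J^AIM over (state, expert action) pairs."""
    expert_term = 0.0
    agent_term = 0.0
    for s, a_h in human_batch:
        # First term: Q-values of expert actions are regressed toward -1.
        expert_term += (q_fn(s, a_h) + 1.0) ** 2
        # Second term: Q-values of agent actions are regressed toward +1,
        # but only when f(a_r, a_h) = 1, i.e. the discrepancy exceeds eps.
        a_r = policy_fn(s)
        if sq_dist(a_r, a_h) > eps:
            agent_term += (q_fn(s, a_r) - 1.0) ** 2
    n = len(human_batch)
    return expert_term / n + agent_term / n
```

With a Q-function that already outputs −1 for expert actions and +1 for deviating agent actions, this estimate is zero. An agent that matches the expert within \(\epsilon\) contributes nothing to the second term.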

Temporal Difference (TD) Loss

To generalize Q-values to states explored by the agent but not covered by the expert, a TD loss is introduced:

\[J^{\text{TD}}(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{B}_h\cup\mathcal{B}_r}\left[|Q_\theta^I(s,a) - \gamma\max_{a'}Q_{\hat{\theta}}^I(s',a')|^2\right]\]

Total loss: \(J(\theta) = J^{\text{AIM}}(\theta) + J^{\text{TD}}(\theta)\)
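The TD term can be sketched in the same spirit. The version below is a minimal illustration under stated assumptions: the action set is finite, so the max over \(a'\) can be enumerated (as in a discrete setting such as MiniGrid), `target_q_fn` stands in for the Q-function with parameters \(\hat{\theta}\), and `gamma=0.99` is a placeholder rather than a value from the paper. The names `td_loss`, `q_fn`, and `target_q_fn` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Transition = Tuple[int, int, int]  # (state id, action id, next state id)


def td_loss(
    q_fn: Callable[[int, int], float],
    target_q_fn: Callable[[int, int], float],
    batch: Sequence[Transition],
    actions: Sequence[int],
    gamma: float = 0.99,
) -> float:
    """Batch estimate of J^TD; the target has no reward term, as in the formula."""
    total = 0.0
    for s, a, s_next in batch:
        # Bootstrapped target: discounted best target-Q value at the next state.
        target = gamma * max(target_q_fn(s_next, a2) for a2 in actions)
        total += (q_fn(s, a) - target) ** 2
    return total / len(batch)
```

Because the batch mixes human and agent transitions, this term carries intervention values backward into states that the expert never labeled. Continuous action spaces would need a different way to approximate the max, which this sketch does not cover.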

Intervention Trigger and Termination

Switch-to-human: Request expert assistance when \(Q_\theta^I(s, a_r) > \beta\), where the threshold \(\beta\) is the \((1-\delta)\)-quantile of the Q-value distribution (\(\delta=0.05\)).
Continue-with-human: Once the expert has taken over, the expert keeps control while \(\|a_r - a_h\|^2 > \epsilon\); control returns to the agent as soon as \(\|a_r - a_h\|^2 \leq \epsilon\).
\(\epsilon\) is set to the mean discrepancy between the current policy's actions and the expert's actions, and is updated adaptively during training.
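The two thresholds described above can be sketched as follows. This is an illustration only: the nearest-rank quantile convention, the use of the squared distance for \(\epsilon\), and all function names (`switch_threshold`, `mean_discrepancy`, `request_help`, `expert_keeps_control`) are assumptions, not details confirmed by the source.

```python
import math
from typing import Sequence, Tuple

Action = Sequence[float]


def sq_dist(a: Action, b: Action) -> float:
    """Squared Euclidean distance ||a - b||^2 between two actions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def switch_threshold(q_values: Sequence[float], delta: float = 0.05) -> float:
    """beta: nearest-rank (1 - delta)-quantile of observed proxy Q-values."""
    ordered = sorted(q_values)
    # round() guards against floating-point fuzz before taking the ceiling.
    rank = math.ceil(round((1.0 - delta) * len(ordered), 9))
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


def mean_discrepancy(pairs: Sequence[Tuple[Action, Action]]) -> float:
    """eps: mean squared gap between agent and expert actions."""
    return sum(sq_dist(a_r, a_h) for a_r, a_h in pairs) / len(pairs)


def request_help(q_value: float, beta: float) -> bool:
    """Switch-to-human rule: ask for the expert when Q^I(s, a_r) > beta."""
    return q_value > beta


def expert_keeps_control(a_r: Action, a_h: Action, eps: float) -> bool:
    """Continue-with-human rule: the expert stays while the gap exceeds eps."""
    return sq_dist(a_r, a_h) > eps
```

With \(\delta = 0.05\), roughly the top 5% of Q-values in the reference distribution exceed \(\beta\). Recomputing \(\epsilon\) from recent data tightens the hand-back condition as the agent's actions move closer to the expert's.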

Algorithm Flow

The first \(n=2\) trajectories are fully monitored by humans (human-gated warm-up).
Initialize \(Q_\theta^I\) and thresholds \(\beta\), \(\epsilon\) using the collected data.
Robot-gated phase: The agent explores autonomously and only requests help when \(Q_\theta^I > \beta\).
Continuously update the policy \(\pi_r\), the Q-function \(Q_\theta^I\), and thresholds.
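The control flow of these steps can be sketched as a small gating loop. This is a simplified illustration, not the authors' implementation: it walks a fixed list of states instead of stepping a simulator, omits the policy and Q-function updates, and assumes that the agent resumes control in the same step in which its action falls within \(\epsilon\) of the expert's. The name `gated_rollout` and its parameters are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]


def sq_dist(a: Action, b: Action) -> float:
    """Squared Euclidean distance ||a - b||^2 between two actions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gated_rollout(
    states: Sequence[State],
    policy_fn: Callable[[State], Action],
    expert_fn: Callable[[State], Action],
    q_fn: Callable[[State, Action], float],
    beta: float,
    eps: float,
    warmup: bool = False,
) -> List[Tuple[str, Action]]:
    """Return (controller, executed action) for every visited state."""
    human_in_control = False
    log: List[Tuple[str, Action]] = []
    for s in states:
        a_r = policy_fn(s)
        if warmup:
            # Warm-up trajectories: the expert acts at every step.
            log.append(("human", expert_fn(s)))
            continue
        # Switch-to-human: the proxy Q-value exceeds the threshold beta.
        if not human_in_control and q_fn(s, a_r) > beta:
            human_in_control = True
        if human_in_control:
            a_h = expert_fn(s)
            if sq_dist(a_r, a_h) <= eps:
                # Agent agrees with the expert again: hand control back.
                human_in_control = False
            else:
                log.append(("human", a_h))
                continue
        log.append(("robot", a_r))
    return log
```

In a full training loop, the first two trajectories would run with `warmup=True`, and the logged expert and agent transitions would feed the buffers used to update \(\pi_r\), \(Q_\theta^I\), \(\beta\), and \(\epsilon\).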

Key Experimental Results

MetaDrive Autonomous Driving (Continuous Action Space, 2000-Step Expert Budget)

Method	Robot-Gated	Expert Data (Intervention Rate)	Total Data	Success Rate	Return	Route Completion Rate
BC	—	2K	2K	0.33±0.04	243.0±46.7	0.62±0.08
HG-DAgger	✗	0.9K (0.45)	2K	0.61±0.07	310.8±16.7	0.78±0.07
PVP	✗	0.4K (0.19)	2K	0.62±0.06	270.4±28.6	0.77±0.04
Ensemble-DAgger	✓	2K (0.55)	3.6K	0.60±0.09	267.4±9.9	0.54±0.10
Thrifty-DAgger	✓	2K (0.21)	9.5K	0.58±0.03	250.0±23.9	0.73±0.03
AIM (Ours)	✓	1.9K (0.24)	7.7K	0.82±0.06	328.4±20.4	0.91±0.03
Neural Expert	—	—	—	0.84±0.05	336.5±17.1	0.93±0.01

Key Findings:

The success rate of AIM (0.82) is close to that of the Neural Expert (0.84), outperforming all baselines.
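Compared to Thrifty-DAgger, the success rate is 41% higher in relative terms (0.58→0.82), and the route completion rate is 25% higher (0.73→0.91).
Among robot-gated methods, AIM achieves the best performance with slightly less expert data (1.9K vs. 2K); the human-gated baselines HG-DAgger and PVP use fewer expert steps (0.9K and 0.4K) but reach lower success rates.
AIM also outperforms all baselines on MiniGrid tasks with discrete action spaces.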
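The percentage gains quoted against Thrifty-DAgger are relative improvements, which can be checked from the table means. The helper name `relative_gain` is hypothetical, and the check ignores the reported standard deviations.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Mean values copied from the MetaDrive table (Thrifty-DAgger vs. AIM).
success = relative_gain(0.58, 0.82)
route = relative_gain(0.73, 0.91)

print(f"success rate: +{success:.1%}")      # +41.4%
print(f"route completion: +{route:.1%}")    # +24.7%
```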

Highlights & Insights

Adaptive intervention rate: The Q-function naturally decreases intervention requests as the policy improves, without requiring manual adjustment of decay schedules.
Alignment with human intent: Directly learns a proxy model of human intervention decisions, instead of relying on heuristic uncertainty estimation.
Precise localization of safety-critical states: AIM only requests assistance near traffic cones and roadblocks, whereas Thrifty-DAgger frequently requests assistance even on straight roads.
Look-ahead capability of TD propagation: Temporal difference propagates Q-values to unseen states, enabling the anticipation of future errors.
Minimal warm-up: Requires human monitoring for only the first 2 trajectories to initialize the robot-gated mode.

Limitations & Future Work

Neural expert instead of real humans: The experiments use a neural expert as a stand-in for a human. Although this is standard practice, the gap between a neural expert and real human interaction has not been fully characterized.
Limited task complexity: Evaluations are conducted only in two relatively simple environments, MetaDrive and MiniGrid.
High-dimensional visual observations are not covered: The current setup uses a 259-dimensional sensor vector, and the performance under image input scenarios remains unknown.
Q-function generalization: It remains unclear whether the proxy Q-function can still reliably predict intervention needs when the environment distribution shifts significantly.
Cold start: The warm-up phase still requires full human monitoring, which may be too costly in settings where human time is extremely expensive.

Rating

Novelty: ⭐⭐⭐⭐ — Modeling human intervention decisions via a proxy Q-function is an elegant and original idea.
Experimental Thoroughness: ⭐⭐⭐ — Covers both continuous and discrete scenarios, but lacks environmental diversity.
Writing Quality: ⭐⭐⭐⭐ — Clear motivation, intuitive illustrations, and rigorous mathematical derivations.
Value: ⭐⭐⭐⭐ — Practical significance for reducing human-in-the-loop costs, with a notable 40% efficiency improvement.