On Discovering Algorithms for Adversarial Imitation Learning

Conference: ICLR 2026
arXiv: 2510.00922
Code: None
Area: Imitation Learning / Meta-Learning
Keywords: Adversarial Imitation Learning, Reward Assignment Functions, LLM-guided Evolution, Meta-learning, Training Stability

TL;DR

Proposes DAIL, the first meta-learning algorithm for Adversarial Imitation Learning (AIL). It decomposes AIL into two stages, density ratio estimation and reward assignment (RA), and uses LLM-guided evolutionary search to automatically discover an RA function \(r_{\text{disc}}\). The discovered function generalizes to unseen environments and policy optimizers and outperforms the human-designed baselines in most settings.

Background & Motivation

Background: Adversarial Imitation Learning (AIL) is the most effective imitation learning paradigm under limited expert demonstrations. Inspired by GANs, AIL formalizes the learning process as an adversarial game between a discriminator (distinguishing between expert and policy trajectories) and a policy (generating trajectories close to those of the expert). From a divergence minimization perspective, AIL naturally decomposes into two stages: (1) Density Ratio Estimation (DR)—the discriminator estimates the occupancy ratio of state-action pairs between the expert and the policy \(\frac{\rho_E}{\rho_\pi}\); (2) Reward Assignment (RA)—mapping the density ratio to a scalar reward signal for policy optimization.

Limitations of Prior Work:

  • AIL training is unstable and shares the difficulties of GAN training: the quality of the gradient signal directly affects the efficacy of policy improvement.
  • Extensive research has focused on improving stage (1), discriminator training (e.g., C-GAIL, Diffusion-Reward), while the RA function (stage 2) has been largely neglected.
  • Existing RA functions (GAIL's softplus, AIRL's log-ratio, FAIRL's exponential decay) are all manually derived from \(f\)-divergence theory; they rely on human intuition and may be suboptimal.

Key Challenge: A fundamental conflict exists between the theoretical elegance and practical training stability of human-designed RA functions—GAIL over-rewards low-quality state-action pairs, FAIRL's exponential decay leads to instability, and AIRL's negative rewards induce early termination.

Goal: To move beyond the manual design paradigm by using LLM-guided evolutionary search to discover RA functions directly from measured performance, achieving meta-learning for AIL algorithms.

Method

Overall Architecture

DAIL treats "designing an AIL algorithm" as a searchable optimization problem. The outer loop uses LLM-guided evolutionary search to optimize the reward assignment (RA) function \(r_f\), while the inner loop executes a standard AIL cycle (policy rollout \(\to\) discriminator training for density ratio estimation \(\to\) reward assignment with \(r_f \to\) policy improvement) for a given \(r_f\). The quality of \(r_f\) is measured by the Wasserstein distance between the trained policy and the expert. The overall process can be formulated as a bilevel optimization: \(\min_f \mathcal{W}(\rho_E, \rho_{\pi^*}; f)\), subject to \(\pi^* = \arg\max_\pi r_f(\rho_E \| \rho_\pi)\).
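As a rough illustration of the fitness signal used by the outer loop, the sketch below computes an empirical Wasserstein-1 distance between two equal-sized samples of a scalar feature. The summary does not say which features or which estimator are used, so the function name `wasserstein_1d` and the restriction to one-dimensional, equal-sized samples are assumptions made only for this example.

```python
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Empirical Wasserstein-1 distance between two equal-sized 1-D samples.

    For equal-sized samples on the real line, the optimal transport plan
    matches sorted values, so the distance is the mean absolute difference
    between order statistics. (Assumed setting, not taken from the paper.)
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical scalar features of expert and policy state-action pairs.
expert_feature = [0.0, 1.0, 2.0, 3.0]
policy_feature = [0.5, 1.5, 2.5, 3.5]
fitness = wasserstein_1d(expert_feature, policy_feature)  # lower is better
```

A lower value means the policy's sample is closer to the expert's, which is the direction the outer search minimizes.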

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    INIT["Initial Population<br/>Four RA functions:<br/>GAIL / AIRL / FAIRL / GAIL-heuristic"]
    subgraph AIL["AIL Two-stage Decomposition (Design 1)"]
        direction TB
        ROLL["Policy Rollout<br/>Collect Trajectories"] --> DR["Discriminator Density Ratio Estimation<br/>ℓ = log(ρ_E / ρ_π)"]
        DR --> RA["RA Function r_f(ℓ)<br/>Density Ratio → Scalar Reward"]
        RA --> POL["Policy Improvement (PPO)"]
    end
    INIT -->|"Candidate r_f"| AIL
    POL -->|"Train to Convergence"| FIT["Fitness<br/>Wasserstein distance between<br/>policy and expert"]
    FIT --> EVO["LLM-guided Evolutionary Search<br/>Sample parents → GPT crossover/mutation<br/>→ Keep Top-K"]
    EVO -->|"Next-gen candidate r_f"| AIL
    EVO -->|"Search Convergence"| OUT["Discovered RA Function r_disc<br/>Bounded / Right-shifted / S-curve<br/>saturated at low quality"]

Key Designs

1. AIL Two-stage Decomposition: Isolating the Neglected Reward Assignment

The first step of DAIL is to decouple the reward signal generation in AIL into two independent stages—Density Ratio Estimation (DR) and Reward Assignment (RA). This exposes the RA function, which was previously conflated with the discriminator and overlooked, as an object for individual optimization. The discriminator first estimates the log density ratio \(\ell = \log \frac{\rho_E(s,a)}{\rho_\pi(s,a)}\), and the RA function \(r_f(\ell)\) maps this ratio into a scalar reward for the policy. Different manual RA functions are essentially derivations of various \(f\)-divergences, but their response curves to the density ratio vary significantly, directly determining the informativeness of the gradient and training stability.

| Divergence Type | Algorithm | RA Function \(r_f(\ell)\) | Characteristics |
| --- | --- | --- | --- |
| Forward KL | FAIRL | \(-\ell \cdot e^{\ell}\) | Exponential unbounded decay, unstable training |
| Backward KL | AIRL | \(\ell\) | Linear, high negative rewards induce early termination |
| Jensen-Shannon | GAIL | \(\text{softplus}(\ell)\) | Over-rewards low-quality samples |
| Unnamed \(f\)-div | GAIL-heuristic | \(-\text{softplus}(-\ell)\) | Primarily negative rewards, only encourages matching |
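The four baseline RA functions in the table can be written as plain functions of the log density ratio \(\ell\). The sketch below is a minimal illustration, not the authors' code. It assumes the standard GAN result that a discriminator probability \(d\) relates to the log ratio via \(\ell = \log d - \log(1-d)\); the helper names (`log_ratio`, `r_gail`, and so on) are made up for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_ratio(d: float) -> float:
    """Log density ratio implied by a discriminator probability d in (0, 1).

    Assumption: d approximates rho_E / (rho_E + rho_pi).
    """
    return math.log(d) - math.log1p(-d)


# Baseline RA functions r_f(ell), following the table above.
def r_fairl(ell: float) -> float:
    return -ell * math.exp(ell)


def r_airl(ell: float) -> float:
    return ell


def r_gail(ell: float) -> float:
    return softplus(ell)


def r_gail_heuristic(ell: float) -> float:
    return -softplus(-ell)


if __name__ == "__main__":
    for ell in (-2.0, 0.0, 2.0):
        print(
            f"ell={ell:+.1f}  FAIRL={r_fairl(ell):+.3f}  AIRL={r_airl(ell):+.3f}  "
            f"GAIL={r_gail(ell):+.3f}  GAIL-heuristic={r_gail_heuristic(ell):+.3f}"
        )
```

Under that assumption, `r_gail(log_ratio(d))` equals \(-\log(1-d)\) and `r_gail_heuristic(log_ratio(d))` equals \(\log d\), which are the familiar discriminator-based forms of the two GAIL rewards. This makes the sign pattern in the table easy to see: GAIL is always positive, GAIL-heuristic always negative.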

2. LLM-guided Evolutionary Search: Bypassing Infeasible Bilevel Backpropagation

Bilevel optimization theoretically requires backpropagation through the entire AIL training loop, which is computationally infeasible. Consequently, DAIL employs black-box evolutionary search to discover the RA function. It initializes the population with four RA functions: GAIL, AIRL, FAIRL, and GAIL-heuristic. Each candidate \(r_f\) is used to train a policy to convergence, using the Wasserstein distance to the expert as the fitness metric. Mutation and crossover are performed by an LLM: a pair of parents \(\{r_{f_1}, r_{f_2}\}\) and their fitness values are fed into GPT-4.1-mini, which is prompted to fuse their strengths into an offspring \(r_{f_3}\). Each generation evaluates \(M \times N\) candidates and retains the Top-\(K\). Using Python code to represent RA functions ensures both interpretability and expressivity.
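The outer loop described above can be sketched as a generic elitist evolutionary search. In this sketch the LLM call and the AIL training run are both replaced by caller-supplied functions (`propose` and `fitness`), so nothing here reflects the actual prompts or training code. Reading \(M \times N\) as "parent pairs per generation times offspring per pair" is an assumption, since the summary does not define \(M\) and \(N\); all names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

C = TypeVar("C")  # a candidate; in the text this would be Python source for an RA function
Scored = Tuple[float, C]  # (fitness, candidate); lower fitness is better


def evolve(
    population: Sequence[C],
    fitness: Callable[[C], float],          # stands in for "train with AIL, measure distance to expert"
    propose: Callable[[Scored, Scored], C],  # stands in for the LLM crossover/mutation call
    generations: int,
    pairs_per_gen: int,       # assumed meaning of M
    offspring_per_pair: int,  # assumed meaning of N
    top_k: int,
    rng: random.Random,
) -> List[Scored]:
    if len(population) < 2 or top_k < 2:
        raise ValueError("need at least two candidates to sample a parent pair")
    scored = sorted(((fitness(c), c) for c in population), key=lambda t: t[0])
    for _ in range(generations):
        children: List[Scored] = []
        for _ in range(pairs_per_gen):
            parent_a, parent_b = rng.sample(scored, 2)
            for _ in range(offspring_per_pair):
                child = propose(parent_a, parent_b)
                children.append((fitness(child), child))
        # Elitist selection: parents compete with offspring, keep the best top_k.
        scored = sorted(scored + children, key=lambda t: t[0])[:top_k]
    return scored


# Toy stand-ins so the loop can run: candidates are floats, the target is 3.0.
def toy_fitness(x: float) -> float:
    return abs(x - 3.0)


def make_toy_propose(rng: random.Random) -> Callable[[Scored, Scored], float]:
    def propose(a: Scored, b: Scored) -> float:
        # Blend the two parents and perturb, mimicking crossover plus mutation.
        return 0.5 * (a[1] + b[1]) + rng.gauss(0.0, 0.5)
    return propose
```

Because parents are kept in the selection pool, the best fitness in the population can never get worse from one generation to the next. Whether DAIL's Top-\(K\) selection also includes parents is not stated in the summary.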

3. Discovered RA Function \(r_{\text{disc}}\): A Bounded, Right-shifted, Low-quality Saturated S-curve

The evolutionary search converges to the RA function \(r_{\text{disc}}(x) = 0.5 \cdot \text{sigmoid}(x) \cdot [\tanh(x) + 1]\), where \(x\) is the log density ratio \(\ell\). Its properties address the flaws of the baselines. It restricts rewards to the bounded interval \([0,1]\), and bounded rewards are known to stabilize deep RL. It follows an S-curve that is steeper than a standard sigmoid and shifted to the right, providing informative gradients in the critical \(x \in [-1,0]\) range. Crucially, it saturates to nearly zero for \(x \lesssim -1.8\) (representing near-random behavior), effectively filtering noisy rewards from low-quality samples, whereas GAIL's softplus still assigns a clearly positive reward (about 0.13) at \(x=-2\).
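The formula for \(r_{\text{disc}}\) is short enough to check numerically. The sketch below implements it exactly as written and prints it next to a plain sigmoid and GAIL's softplus. The helper names are chosen for this example only.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); GAIL's RA function."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def r_disc(x: float) -> float:
    """Discovered RA function: 0.5 * sigmoid(x) * (tanh(x) + 1)."""
    return 0.5 * sigmoid(x) * (math.tanh(x) + 1.0)


if __name__ == "__main__":
    for x in (-2.0, -1.8, -1.0, 0.0, 1.0, 2.0):
        print(
            f"x={x:+.1f}  r_disc={r_disc(x):.4f}  "
            f"sigmoid={sigmoid(x):.4f}  gail_softplus={softplus(x):.4f}"
        )
```

Since \(0.5\,[\tanh(x) + 1] = \text{sigmoid}(2x)\), the function can also be read as the product \(\text{sigmoid}(x) \cdot \text{sigmoid}(2x)\). This explains the shape: each factor lies in \([0,1]\), so the product is bounded, and multiplying two S-curves pushes the rise to the right while driving the left tail toward zero faster than either factor alone.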

Key Experimental Results

Main Results: Cross-Environment Generalization

Evaluation uses the Brax (MuJoCo) and MinAtar benchmarks, on environments other than the one used for the search. All methods share hyperparameters and differ only in the RA function:

| Method | Brax Mean ↑ | Brax Median ↑ | MinAtar Mean ↑ | MinAtar Median ↑ |
| --- | --- | --- | --- | --- |
| GAIL | ~0.65 | ~0.70 | ~0.55 | ~0.60 |
| AIRL | ~0.70 | ~0.75 | ~0.30 | ~0.25 |
| FAIRL | ~0.40 | ~0.35 | ~0.35 | ~0.30 |
| GAIL-heuristic | ~0.55 | ~0.60 | ~0.25 | ~0.20 |
| Ours (DAIL) | ~0.75 | ~0.72 | ~0.85 | ~0.90 |

Key Findings:

  • DAIL significantly outperforms all baselines on MinAtar.
  • On Brax, DAIL achieves the best results across most metrics: its Mean (~0.75) is statistically superior to all baselines, while its Median (~0.72) is slightly below AIRL's (~0.75).
  • AIRL and GAIL-heuristic perform poorly on MinAtar because their negative rewards encourage early termination.

Ablation Study

| Dimension | Setup | Result |
| --- | --- | --- |
| Search Generalization | Search on SpaceInvaders \(\to\) generalize to others | 20% lower \(\mathcal{W}\) distance, 12.5% higher return |
| Optimizer Generalization | Search with PPO \(\to\) evaluate on A2C | DAIL significantly outperforms GAIL on A2C |
| Regularization Robustness | 5 strategies (none/w-decay/entropy/spectral/grad-pen) | DAIL outperforms GAIL in 3/5 settings |
| Policy Entropy | DAIL policy converges to lower entropy | Matches entropy levels of ground-truth reward PPO |
| Component Ablation | \(r_{\text{disc}}\) vs sigmoid vs \(0.5[\tanh+1]\) | \(r_{\text{disc}}\) > \(0.5[\tanh+1]\) > sigmoid |

Policy entropy analysis reveals that DAIL's stability stems from \(r_{\text{disc}}\) saturating to zero for \(x \lesssim -1.8\), filtering out noisy rewards from random actions. GAIL's high positive rewards for low-quality behavior result in a noisier reward signal.

Main Results: Performance Under Different Regularizations

| Algorithm | Environment | None | W-Decay | Entropy | Spectral | Grad-Pen |
| --- | --- | --- | --- | --- | --- | --- |
| DAIL | Asterix | 0.88±0.03 | 1.33±0.03 | 0.12±0.01 | 0.92±0.03 | 0.66±0.03 |
| DAIL | Breakout | 0.81±0.07 | 0.74±0.08 | 0.91±0.02 | 0.77±0.07 | 1.01±0.00 |
| DAIL | SpaceInv | 0.71±0.07 | 0.81±0.01 | 0.80±0.01 | 0.70±0.09 | 0.90±0.00 |
| DAIL | Overall | 0.80±0.03 | 0.96±0.03 | 0.61±0.01 | 0.80±0.04 | 0.85±0.01 |
| GAIL | Asterix | 1.18±0.03 | 1.44±0.03 | 0.48±0.03 | 0.22±0.03 | 0.52±0.04 |
| GAIL | Breakout | 0.76±0.07 | 0.52±0.10 | 0.89±0.01 | 0.33±0.10 | 0.85±0.07 |
| GAIL | SpaceInv | 0.61±0.09 | 0.34±0.09 | 0.81±0.00 | 0.42±0.08 | 0.81±0.03 |
| GAIL | Overall | 0.85±0.04 | 0.76±0.04 | 0.73±0.01 | 0.32±0.05 | 0.73±0.03 |
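The "3/5 settings" figure reported in the ablation summary can be recomputed from the Overall rows above. The snippet below does that arithmetic on the mean values only, ignoring the ± terms. It assumes higher scores are better, which the summary does not state explicitly but which matches its "outperforms" wording.

```python
# Overall means copied from the regularization table (± terms omitted).
OVERALL = {
    "DAIL": {"None": 0.80, "W-Decay": 0.96, "Entropy": 0.61, "Spectral": 0.80, "Grad-Pen": 0.85},
    "GAIL": {"None": 0.85, "W-Decay": 0.76, "Entropy": 0.73, "Spectral": 0.32, "Grad-Pen": 0.73},
}

# Assumption: a higher score is better.
dail_wins = [
    setting
    for setting in OVERALL["DAIL"]
    if OVERALL["DAIL"][setting] > OVERALL["GAIL"][setting]
]

print(f"DAIL wins {len(dail_wins)}/{len(OVERALL['DAIL'])} settings: {dail_wins}")
```

On these numbers DAIL leads under weight decay, spectral normalization, and gradient penalty, while GAIL leads with no regularization and with entropy regularization.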

Highlights & Insights

Highlights

  • Precise Problem Identification: Systematically reveals the critical impact of RA functions on AIL training stability, addressing a gap in the field.
  • Methodological Innovation: Introduces meta-learning for RA function discovery in AIL, shifting the paradigm from manual design to data-driven discovery.
  • In-depth Analysis: Clearly explains the advantages of DAIL through density ratio distributions, policy entropy, and component ablations.
  • Strong Generalization: Demonstrates consistent performance across environments, optimizers, and regularization strategies.

Limitations & Future Work

  • The discovered \(r_{\text{disc}}\) does not correspond to a valid \(f\)-divergence and therefore lacks theoretical convergence guarantees.
  • The RA function remains static during training and does not adapt to training states (e.g., remaining steps, observed density ratio distributions).
  • Search was conducted on a single environment (SpaceInvaders); searching on a larger set of environments might yield more robust functions.
  • Performance remains to be validated on more complex benchmarks (Atari-57/Procgen).

Rating

⭐⭐⭐⭐: Precise problem identification, novel methodology, and deep analysis, though it lacks theoretical guarantees and the search scale is limited.