Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving¶

Conference: ECCV 2024
arXiv: 2409.18343
Code: None
Area: Autonomous Driving
Keywords: reinforcement learning fine-tuning, agent behavior modeling, autonomous driving simulation, distribution shift, Waymo

TL;DR¶

Improves traffic agent behavior models trained with supervised learning by fine-tuning them with closed-loop reinforcement learning. This addresses the distribution shift inherent in open-loop training and achieves state-of-the-art performance on the Waymo Open Sim Agents Challenge (WOSAC).

Background & Motivation¶

Traffic agent behavior modeling is one of the core problems in autonomous driving research, with key applications including: (1) constructing realistic and reliable simulation environments for off-board evaluation; (2) predicting the trajectories of traffic participants for onboard planning. These scenarios demand high realism and diversity in agent behaviors.

Mainstream approaches currently use supervised learning (imitation learning/behavior cloning) to learn behavioral policies from expert data. However, supervised learning suffers from a fundamental issue: distribution shift. During training, the model learns a mapping from expert states to expert actions. At test time, small errors in the model's own predictions accumulate and push it into states never encountered during training, causing severe performance degradation.

This issue is particularly pronounced during long-horizon simulations: a tiny trajectory deviation continuously accumulates, eventually resulting in unrealistic behaviors (e.g., collisions, driving off-road). Existing solutions, including data augmentation and DAgger, show limited effectiveness.
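The accumulation effect can be made concrete with a toy calculation. The sketch below is not from the paper: it assumes a hypothetical 1-D agent whose policy adds a constant small heading bias at every step. It compares the one-step error seen in open-loop evaluation (every prediction starts from the logged expert state) with the lateral error of a closed-loop rollout (every prediction starts from the agent's own previous state). The function names, the bias value, and the horizon are illustrative assumptions.

```python
def open_loop_errors(steps, bias):
    """One-step error when every prediction starts from the logged expert state."""
    return [abs(bias) for _ in range(steps)]


def closed_loop_errors(steps, bias):
    """Lateral error when the agent is rolled out from its own previous state.

    Toy assumption: each step adds `bias` to the heading offset, and the
    lateral offset integrates the heading offset, so errors compound.
    """
    heading_offset = 0.0
    lateral_offset = 0.0
    errors = []
    for _ in range(steps):
        heading_offset += bias            # the same small mistake at every step
        lateral_offset += heading_offset  # position integrates the heading error
        errors.append(abs(lateral_offset))
    return errors


steps, bias = 80, 0.01  # arbitrary horizon and bias chosen for illustration
open_err = open_loop_errors(steps, bias)
closed_err = closed_loop_errors(steps, bias)
print(f"open-loop error at last step:   {open_err[-1]:.3f}")
print(f"closed-loop error at last step: {closed_err[-1]:.3f}")
```

In this toy model the open-loop error stays at `bias`, while the closed-loop error after `t` steps equals `bias * t * (t + 1) / 2`. An open-loop metric can therefore look small even when the closed-loop rollout drifts far from the lane.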

The core idea of this work is to fine-tune the behavior model using closed-loop reinforcement learning (RL) after supervised pre-training. The advantage of RL lies in its inherent optimization within a closed-loop environment, forcing the model to interact with states generated by its own past decisions, thereby directly mitigating distribution shift.

Method¶

Overall Architecture¶

The method adopts a two-stage training strategy: (1) Stage 1 employs supervised learning on offline data to pre-train the behavior model and learn basic driving behaviors; (2) Stage 2 performs closed-loop RL fine-tuning on the model in a simulation environment to optimize specific behavioral metrics (e.g., collision rate, off-road rate).

Key Designs¶

Closed-Loop RL Fine-tuning Framework:
- Function: Resolves the distribution shift problem of the supervised pre-trained model
- Mechanism: Agents interact with the simulated environment, and the policy is updated based on the outcomes. The core is designing reward functions that encourage realistic behaviors and penalize unsafe ones (e.g., collisions, traffic violations)
- Design Motivation: Supervised learning is limited to open-loop training (ignoring error accumulation), whereas RL is optimized in a closed-loop setting, making it naturally suited to address distribution shift
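A minimal sketch of what "closed-loop" means in practice, assuming a hypothetical 1-D lane-keeping environment rather than the paper's simulator: the state is the lateral offset from the lane centre, the action is a steering correction, and the next state is produced by the agent's own action. The names `closed_loop_rollout`, `returns_to_go`, and `lane_half_width` are illustrative and do not come from the source.

```python
def closed_loop_rollout(policy, initial_offset, steps, lane_half_width=1.5):
    """Roll a policy forward from its own states and record per-step rewards.

    Returns a list of (state, action, reward) tuples. The reward is -1.0
    whenever the agent ends a step outside the lane, and 0.0 otherwise.
    """
    offset = initial_offset
    trajectory = []
    for _ in range(steps):
        action = policy(offset)
        next_offset = offset + action  # next state depends on the agent's own action
        reward = -1.0 if abs(next_offset) > lane_half_width else 0.0
        trajectory.append((offset, action, reward))
        offset = next_offset
    return trajectory


def returns_to_go(rewards, discount=1.0):
    """Discounted sum of future rewards at every step, used to weight policy updates."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + discount * running
        returns.append(running)
    returns.reverse()
    return returns


# A policy that always steers right drifts out of the lane and is penalised.
drifting = closed_loop_rollout(lambda offset: 1.0, initial_offset=0.0, steps=3)
print(returns_to_go([reward for _, _, reward in drifting]))
```

Because the penalty at a later step is propagated back through `returns_to_go`, early actions that eventually lead off-road also receive a negative learning signal. Open-loop training on logged states cannot provide this signal.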
Multi-Objective Reward Function:
- Function: Balances simulation realism and safety
- Mechanism: The reward function accounts for multiple metrics: similarity to the ground-truth trajectory (ensuring realism), collision penalties (ensuring safety), road-following rewards (ensuring compliance), and interaction reasonability rewards (ensuring social intelligence). These are combined via weighted summation for multi-objective optimization
- Design Motivation: Optimizing a single objective might lead to policies that are "safe but unrealistic" (safety only) or "realistic but unsafe" (realism only). The multi-objective reward enables a better trade-off
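The weighted summation can be sketched as follows. This is an illustrative assumption, not the paper's reward: the weight values are made up, the term names are hypothetical, and the interaction term is omitted because these notes do not define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the notes give no numeric values.
    imitation: float = 1.0
    collision: float = 5.0
    offroad: float = 5.0


def step_reward(dist_to_logged_m, collided, offroad, weights=RewardWeights()):
    """Weighted sum of per-step penalties (higher reward is better).

    dist_to_logged_m: distance in metres between the simulated position and
                      the logged (ground-truth) position at this step.
    collided:         True if the agent overlaps another object at this step.
    offroad:          True if the agent is outside the drivable area.
    """
    penalty = (
        weights.imitation * dist_to_logged_m
        + weights.collision * float(collided)
        + weights.offroad * float(offroad)
    )
    return -penalty


print(step_reward(dist_to_logged_m=2.0, collided=False, offroad=False))
print(step_reward(dist_to_logged_m=0.5, collided=True, offroad=False))
```

Setting `imitation` to zero would leave only the safety terms, the "safe but unrealistic" extreme. Setting `collision` and `offroad` to zero leaves only the realism term.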
Policy Evaluation Benchmark:
- Function: Directly evaluates the ability of simulation agents to distinguish the quality of autonomous driving planners
- Mechanism: Instantiates a series of planners with varying quality tiers and evaluates them using the simulation agent model. A high-quality simulation agent should correctly differentiate good planners from poor ones—i.e., better planners should achieve better scores under simulation
- Design Motivation: Existing benchmarks only evaluate the realism of the agent's behavior itself while neglecting the ultimate purpose of simulation—evaluating and improving autonomous driving systems
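One way to quantify whether a simulation agent "correctly differentiates" planners is a rank correlation between the known planner quality tiers and the scores the planners obtain in simulation. The notes do not say which statistic the paper uses, so the Kendall tau below is an assumption for illustration, and the planner scores are made-up numbers.

```python
from itertools import combinations


def kendall_tau(reference_quality, simulated_scores):
    """Kendall rank correlation (tau-a) between two equal-length score lists.

    +1.0 means the simulation ranks every pair of planners in the same order
    as the reference quality; -1.0 means every pair is ranked in reverse.
    """
    if len(reference_quality) != len(simulated_scores):
        raise ValueError("both lists must have one entry per planner")
    n = len(reference_quality)
    if n < 2:
        raise ValueError("need at least two planners to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (reference_quality[i] - reference_quality[j]) * (
            simulated_scores[i] - simulated_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


# Hypothetical example: four planners with known quality tiers 1 (worst) to 4 (best).
tiers = [1, 2, 3, 4]
scores_under_agent_a = [0.10, 0.20, 0.30, 0.40]  # ranks all planners correctly
scores_under_agent_b = [0.10, 0.30, 0.20, 0.40]  # swaps the two middle planners
print(kendall_tau(tiers, scores_under_agent_a))
print(kendall_tau(tiers, scores_under_agent_b))
```

Under this measure, a simulation agent whose scores yield a tau closer to 1.0 is more useful for comparing planners, regardless of how it scores on realism metrics alone.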

Loss & Training¶

Stage 1: Pre-training on the Waymo Open Motion Dataset with a standard behavior cloning loss (MSE/NLL).
Stage 2: Fine-tuning via reinforcement learning using the PPO algorithm. The reward function incorporates collision penalties, off-road penalties, and rewards based on proximity to the ground-truth trajectory.
Training Tricks: Using a smaller learning rate during RL fine-tuning to prevent catastrophic forgetting of pre-trained knowledge, alongside a KL penalty to constrain the policy update step size.
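The role of the KL penalty can be illustrated with a simplified per-sample objective over a discrete action distribution. This is a sketch under stated assumptions, not the paper's loss: it uses a plain advantage-weighted log-likelihood term instead of PPO's clipped surrogate, and the reference distribution could be either the previous policy iterate or the pre-trained policy (the notes do not specify which). The names `kl_divergence`, `kl_penalized_objective`, and `beta` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q is positive wherever p is."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)


def kl_penalized_objective(new_probs, ref_probs, action, advantage, beta):
    """Objective to maximise for one (state, action) sample.

    new_probs: action probabilities of the policy being fine-tuned.
    ref_probs: action probabilities of the reference policy (assumption:
               previous iterate or pre-trained policy).
    advantage: how much better the sampled action was than expected.
    beta:      strength of the KL penalty; larger values keep the fine-tuned
               policy closer to the reference.
    """
    policy_term = advantage * math.log(new_probs[action])
    return policy_term - beta * kl_divergence(new_probs, ref_probs)


ref = [0.25, 0.75]
candidate = [0.50, 0.50]
print(kl_divergence(candidate, ref))
print(kl_penalized_objective(candidate, ref, action=0, advantage=2.0, beta=0.1))
```

Raising `beta` makes a given move away from `ref` more costly, which is the sense in which the penalty limits how far fine-tuning can drift from pre-trained behavior.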

Key Experimental Results¶

Main Results¶

Dataset	Metric	Ours	Baseline	Gain
Waymo WOSAC	Realism Meta-metric	SOTA	Supervised Baseline	Significant Gain
Waymo WOSAC	Collision Rate ↓	Drastic Reduction	Supervised Baseline	-30% to -50%
Waymo WOSAC	Off-road Rate ↓	Significant Reduction	Supervised Baseline	-20% to -40%
Policy Eval	Planner Ranking	Correct	Incorrect for some methods	More accurate

Ablation Study¶

Configuration	Key Metrics	Description
Supervised Learning Only	Baseline	Distribution shift exists
RL Only (From Scratch)	Poor	Lack of prior knowledge, unstable training
SL Pre-training + RL Fine-tuning	Optimal	Two-stage complementarity
Different Reward Weights	Performance Sensitive	Requires careful tuning of reward weights

Key Findings¶

RL fine-tuning significantly improves safety metrics such as collision and off-road rates while preserving behavioral realism.
RL training from scratch performs far worse than SL pre-training followed by RL fine-tuning, demonstrating the critical importance of pre-training.
The proposed Policy Evaluation Benchmark offers a novel perspective for assessing simulation quality.
The proposed method achieves state-of-the-art performance on the Waymo Open Sim Agents Challenge (WOSAC).

Highlights & Insights¶

Brings the "pre-training + RL fine-tuning" paradigm (e.g., RLHF in NLP) into autonomous driving agent modeling.
Proposes a novel evaluation metric through the Policy Evaluation Benchmark, which focuses on the fundamental objective of simulation.
The method is simple yet effective, and the two-stage training strategy is easy to implement.
Directly targets the distribution shift problem by training the policy on states generated by its own rollouts.

Limitations & Future Work¶

RL fine-tuning requires extensive environment interactions, leading to high computational costs.
The design of the reward function requires domain expertise, and different scenarios may require different reward weightings.
Validation is restricted to the Waymo dataset, and generalizability to other cities and driving environments remains unverified.
Future work can explore offline RL methods to reduce the necessity for online interactions.
Multi-agent cooperative RL fine-tuning is a promising research direction.

Related Work & Insights¶

WOSAC: The Waymo Open Sim Agents Challenge provides a standardized evaluation platform for agent modeling.
TrafficSim / SimNet: Early traffic simulation works that utilize supervised learning to train agents.
RLHF: The paradigm of RL fine-tuning in the NLP domain; this paper applies a similar concept to autonomous driving.
Inspiration: "Pre-training + RL alignment" could serve as a general paradigm for behavior modeling, applicable to other domains such as robotics.

Rating¶

Novelty: ⭐⭐⭐ Although the concept of RL fine-tuning is not novel, its application in autonomous driving simulation is significant.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensively evaluated on the standard Waymo benchmark, accompanied by a newly proposed evaluation dimension.
Writing Quality: ⭐⭐⭐⭐ Clearly defined problem statements and concise methodological explanations.
Value: ⭐⭐⭐⭐ With 30 citations, it holds practical value for the autonomous driving simulation community.