SMART-R1: Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning

Conference: ICLR 2026 arXiv: 2509.23993 Code: N/A Area: Autonomous Driving / Reinforcement Learning Keywords: multi-agent traffic simulation, R1-style, reinforcement fine-tuning, next-token prediction, policy optimization

TL;DR

SMART-R1 is the first work to introduce R1-style reinforcement fine-tuning (RFT) into multi-agent traffic simulation. It proposes the Metric-oriented Policy Optimization (MPO) algorithm and an iterative "SFT-RFT-SFT" training strategy, achieving first place on the WOSAC 2025 leaderboard with a Realism Meta score of 0.7858.

Background & Motivation

Background: The dominant paradigm in multi-agent traffic simulation is autoregressive modeling based on Next-Token Prediction (NTP) (e.g., SMART), which generates joint agent behaviors by discretizing trajectories into motion tokens. Training typically follows a two-stage pipeline: behavior cloning (BC) pretraining followed by closed-loop SFT (CAT-K rollout).

Limitations of Prior Work: (a) the training objectives of BC and SFT (cross-entropy loss) are not directly aligned with the final evaluation metrics (collision rate, off-road rate, and the other Realism Meta components), which are scalar, sparse, and non-differentiable; (b) covariate shift in autoregressive generation causes errors to accumulate in closed-loop simulation; (c) off-the-shelf RL methods such as GRPO and PPO perform poorly here, since they depend on within-group comparative sampling or actor-critic value estimation.

Key Challenge: There exists a gap between the training objective of NTP models (imitating the data distribution) and the evaluation objective (safety and realism metrics), while these evaluation metrics cannot be directly used as differentiable loss functions.

Goal: How can non-differentiable evaluation metrics be incorporated into the training of NTP-based traffic simulation models?

Key Insight: Drawing inspiration from DeepSeek-R1's multi-stage training strategy, the paper designs an iterative "SFT→RFT→SFT" pipeline with a simplified policy optimization algorithm that directly aligns model training with evaluation metrics.

Core Idea: Leverage known reward expectations to simplify advantage estimation, and apply SFT-RFT-SFT iteration to prevent catastrophic forgetting.

Method

Overall Architecture

Driving scenarios → tokenization (trajectory → motion tokens; map → map tokens) → Transformer with self-attention and cross-attention → next-token logit prediction. Training consists of four stages: (1) BC pretraining for 64 epochs; (2) closed-loop SFT for 16 epochs (CAT-K rollout); (3) RFT (MPO for metric alignment); (4) a second SFT stage for 16 epochs to recover the data distribution.
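As a rough sketch, the four-stage schedule described above can be written down as a configuration list. The stage names are illustrative, not the paper's code; the epoch counts come from the text, except the RFT length, which the summary does not state (the placeholder below is an assumption):

```python
# Hypothetical four-stage schedule mirroring the paper's pipeline.
# Stage names are illustrative; SFT/BC epoch counts come from the text.
SCHEDULE = [
    ("bc_pretrain",     64, "cross-entropy on motion tokens"),
    ("closed_loop_sft", 16, "CAT-K rollout + cross-entropy"),
    ("rft_mpo",          8, "MPO on Realism Meta reward"),  # RFT length not stated; 8 is a placeholder
    ("closed_loop_sft", 16, "recover logged data distribution"),
]

# BC (64) plus the restructured 16 + 16 SFT match the baseline's 64 + 32 budget.
total_sft_epochs = sum(e for name, e, _ in SCHEDULE if name != "rft_mpo")
```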

Key Designs

  1. Metric-oriented Policy Optimization (MPO):

    • Function: Directly uses Realism Meta evaluation metrics as reward signals to optimize the NTP model policy.
    • Mechanism: For each scenario, all agent trajectories are generated via full autoregressive rollout, and the official evaluation protocol is applied to compute the Realism Meta score as reward \(r\). The advantage function is simplified to \(\mathcal{A} = r - \alpha\), where \(\alpha = 0.77\) is an empirical threshold approximating the baseline model's average reward. Rollouts exceeding the threshold receive positive reinforcement; those below are penalized. Loss function: \(\mathcal{L}_{\text{MPO}} = -(\frac{\pi_\theta}{\bar{\pi}_\theta}\mathcal{A} - \beta D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}])\).
    • Design Motivation: GRPO relies on multiple within-group samples to estimate relative advantages, introducing sampling bias; PPO's value model is difficult to optimize; DPO requires preference pairs. In contrast, the expected reward in traffic simulation is relatively predictable (~0.77) and can be used directly as a baseline, eliminating the need for repeated sampling or a value network.
    • Difference from GRPO: GRPO normalizes using the within-group mean reward, whereas MPO uses a fixed threshold \(\alpha\), yielding a simpler and more stable formulation.
  2. R1-Style "SFT-RFT-SFT" Iterative Training:

    • Function: Performs one round of SFT before and after RFT to prevent catastrophic forgetting.
    • Mechanism: The first SFT round (16 epochs) reduces covariate shift; RFT aligns the model with evaluation metrics; the second SFT round (16 epochs) restores adherence to the logged data distribution. The three stages are functionally complementary.
    • Design Motivation: SFT followed by RFT alone tends to cause forgetting of the data distribution learned during SFT; two consecutive SFT rounds without RFT underperform the SFT-RFT interleaving scheme. The effectiveness of alternating SFT-RFT has been validated by DeepSeek-R1.
  3. KL Regularization:

    • Function: Incorporates a per-token KL divergence penalty during RFT to prevent the policy from deviating excessively from the reference model.
    • Mechanism: An unbiased KL estimator is used: \(D_{\text{KL}} = \frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1\), with coefficient \(\beta = 0.04\) balancing metric optimization and distribution preservation.
    • Design Motivation: A \(\beta\) that is too small causes excessive policy drift (losing the BC/SFT prior), while a \(\beta\) that is too large suppresses the reward signal.
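Putting MPO and the KL term together, a minimal single-rollout sketch might look as follows (NumPy; the function name, tensor shapes, and reduction are assumptions, not the paper's implementation):

```python
import numpy as np

def mpo_loss(logp_new, logp_old, logp_ref, reward, alpha=0.77, beta=0.04):
    """Sketch of the MPO objective for one scenario rollout.

    logp_new / logp_old / logp_ref: per-token log-probs under the current,
    detached sampling, and reference policies; reward: scalar Realism Meta.
    """
    advantage = reward - alpha                        # A = r - alpha, fixed baseline
    ratio = np.exp(logp_new - logp_old)               # pi_theta / pi_bar_theta
    log_ref_ratio = logp_ref - logp_new
    kl = np.exp(log_ref_ratio) - log_ref_ratio - 1.0  # unbiased k3 KL estimator
    return float(-(ratio * advantage - beta * kl).mean())
```

When the three policies coincide, the ratio is 1 and the KL term vanishes, so the loss reduces to \(-(r - \alpha)\): rollouts scoring above the 0.77 threshold drive the loss negative (reinforced), while those below it are penalized.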

Loss & Training

  • BC/SFT stages: Standard cross-entropy loss for token distribution alignment
  • RFT stage: MPO loss = advantage-weighted policy gradient + KL regularization
  • Total epochs match the baseline (64+32); the 32-epoch SFT is restructured as 16 + RFT + 16

Key Experimental Results

Main Results

WOSAC 2025 Leaderboard (test set):

| Method | Realism Meta↑ | Kinematics↑ | Interaction↑ | Map↑ | minADE↓ | Params |
| --- | --- | --- | --- | --- | --- | --- |
| SMART-base | 0.7725 | 0.472 | 0.804 | 0.912 | 1.393 | 7M |
| SMART-SFT (CAT-K) | 0.7846 | 0.493 | 0.811 | 0.918 | 1.307 | 7M |
| TrajTok | 0.7852 | 0.489 | 0.812 | 0.921 | 1.318 | 10M |
| SMART-R1 | 0.7858 | 0.494 | 0.811 | 0.920 | 1.289 | 7M |

Ablation Study

| Training Strategy | Realism Meta↑ | Note |
| --- | --- | --- |
| BC only | 0.7725 | Baseline |
| SFT | 0.7812 | Improvement from closed-loop SFT |
| SFT → RFT | 0.7848 | Further improvement with RFT |
| SFT → SFT (no RFT) | 0.7809 | Consecutive SFT underperforms SFT+RFT |
| SFT → RFT → SFT | 0.7859 | R1-style achieves best performance |

Policy optimization method comparison (after SFT):

| Method | Realism Meta↑ |
| --- | --- |
| SFT baseline | 0.7812 |
| + PPO | Decrease |
| + DPO | Decrease |
| + GRPO | Decrease |
| + MPO | 0.7848 |

Key Findings

  • RFT yields the most notable improvements on safety-critical metrics (collision rate, off-road rate, traffic light violation rate)—precisely those that BC/SFT cannot directly optimize
  • PPO/DPO/GRPO all fail on the traffic simulation task; only MPO is effective—indicating that task-specific characteristics (predictable reward expectation) render general-purpose RL algorithms unsuitable
  • \(\alpha = 0.77\) is the optimal threshold; higher values yield insufficient positive rewards, while lower values set the bar too low
  • \(\beta = 0.04\) achieves the best balance for KL regularization

Highlights & Insights

  • The approach of "using task prior knowledge to simplify RL" is practically valuable: when the reward distribution is relatively concentrated (unlike the high-variance setting in LLMs), a fixed threshold is more stable than GRPO's within-group comparison. This insight can transfer to other RL settings where rewards are predictable.
  • The SFT-RFT-SFT alternating strategy validates that the paradigm of "first align with data distribution → then optimize metrics → then restore distribution" is effective beyond the LLM domain, providing a practical template for RLHF in autonomous driving.
  • The method is extremely lightweight—it achieves first place on WOSAC using SMART-tiny (7M parameters) without model ensembling or post-processing.
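The concentrated-reward argument above can be illustrated numerically. In the sketch below (synthetic data; the 0.78 mean and 0.01 spread are illustrative assumptions, not measured values), the GRPO-style advantage is re-estimated from each small group's own statistics, while the MPO-style advantage uses the fixed threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
group = rng.normal(loc=0.78, scale=0.01, size=8)  # one group of rollout rewards

# GRPO-style advantage: normalize by the group's own mean/std -- noisy for
# small groups, and the baseline shifts from batch to batch.
grpo_adv = (group - group.mean()) / (group.std() + 1e-8)

# MPO-style advantage: subtract the fixed, known baseline alpha = 0.77,
# giving a consistent signal across batches.
mpo_adv = group - 0.77
```

Because the group mean is subtracted out, the GRPO-style advantages always average to zero within a group even when every rollout beats the baseline; the fixed-threshold advantage preserves that absolute information.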

Limitations & Future Work

  • The threshold \(\alpha\) in MPO requires manual tuning and depends on the baseline model's average performance, limiting generalizability
  • Validation is limited to SMART-tiny (7M); effectiveness on larger models remains unknown
  • Whether the Realism Meta metric genuinely reflects driving realism remains debatable (cf. the SPACeR paper)
  • The approach could be extended to larger models and to multiple rounds of RFT iteration

Comparison with Related Work

  • vs. SMART/CAT-K baseline: R1-style RFT improves Realism Meta from 0.7846 to 0.7858 without additional parameters
  • vs. RLFTSim (Ahmadi et al., 2025): Both apply RL fine-tuning but with different strategies; SMART-R1 achieves better performance (0.7858 vs. 0.7844), likely because MPO is better suited to this task
  • vs. DeepSeek-R1: SMART-R1 adopts the R1 training paradigm while simplifying advantage estimation, demonstrating that LLM training ideas can transfer across domains to autonomous driving simulation

Rating

  • Novelty: ⭐⭐⭐⭐ First application of R1-style training in traffic simulation; MPO is a concise design tailored to task characteristics
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ First place on WOSAC leaderboard; detailed ablations including optimization method comparisons and hyperparameter sensitivity
  • Writing Quality: ⭐⭐⭐⭐ Clear framework, appropriate analogy to LLM training paradigms, thorough experimental analysis
  • Value: ⭐⭐⭐⭐ Provides a practical template for RL post-training of traffic simulation models