SPACeR: Self-Play Anchoring with Centralized Reference Models¶
Conference: ICLR 2026 arXiv: 2510.18060 Code: N/A Area: Autonomous Driving / Reinforcement Learning Keywords: self-play reinforcement learning, traffic simulation, tokenized models, KL divergence alignment, human driving distribution
TL;DR¶
SPACeR proposes a "human-like self-play" framework that uses a pretrained tokenized autoregressive motion model as a centralized reference policy. By incorporating log-likelihood rewards and KL divergence constraints, it guides a decentralized self-play RL policy to align with the human driving distribution. SPACeR outperforms pure self-play methods on WOSAC while achieving 10× faster inference and 50× fewer parameters than imitation learning approaches.
Background & Motivation¶
Background: Autonomous driving simulation requires realistic and reactive traffic agent policies. Two dominant paradigms each have distinct trade-offs — imitation learning (e.g., SMART, CAT-K) captures realistic human behavior but incurs high inference cost and poor closed-loop reactivity; self-play RL is naturally suited for multi-agent interaction and is inference-efficient, but tends to deviate from human driving norms.
Limitations of Prior Work: (a) Imitation learning models (Transformer-based) are slow to infer and parameter-heavy, making them unsuitable for large-scale closed-loop simulation; (b) self-play RL relies on hand-crafted reward shaping, and policies may learn unnatural behaviors (e.g., aggressive acceleration toward waypoints); (c) existing methods that combine RL with imitation learning mostly follow a "pretrain-then-finetune" paradigm rather than letting RL take the lead.
Key Challenge: How can the speed and scalability of self-play RL be preserved while ensuring behavioral realism aligned with the human driving distribution?
Goal: To build a lightweight, fast, and scalable multi-agent simulation policy that maintains behavioral realism close to the human driving distribution.
Key Insight: An RL-first philosophy — self-play serves as the foundation, while the imitation learning model acts solely as a reward provider (reference policy) rather than a finetuning target.
Core Idea: A pretrained tokenized model supplies human realism signals to anchor self-play RL, while actual execution is performed by a 65K-parameter MLP.
Method¶
Overall Architecture¶
Input: WOMD scenes (road graph, initial states of all agents). The decentralized policy \(\pi_\theta\) (MLP) makes decisions based on local observations only. The centralized reference model \(\pi_{\text{ref}}\) (pretrained tokenized model) provides distributional signals based on the global scene. Training uses PPO with likelihood rewards and KL constraints; only the lightweight MLP is used at inference time.
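A minimal PyTorch sketch of what such a decentralized policy could look like: a small MLP emitting a categorical distribution over the shared 200-token action vocabulary. The observation and hidden sizes are illustrative assumptions chosen to land near the ~65K-parameter budget; the paper's exact architecture is not reproduced here.

```python
import torch
import torch.nn as nn

K = 200  # shared discrete action vocabulary (K-disk clustering, per the paper)

class DecentralizedPolicy(nn.Module):
    """Sketch of the lightweight per-agent policy (sizes are assumptions)."""

    def __init__(self, obs_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, K),  # logits over the shared action vocabulary
        )

    def forward(self, local_obs: torch.Tensor) -> torch.distributions.Categorical:
        # Decisions depend on the local observation only; the centralized
        # reference model enters solely through the training signal.
        return torch.distributions.Categorical(logits=self.net(local_obs))
```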
Key Designs¶
- Centralized Reference Model as Reward Provider:
    - Function: A pretrained tokenized model (e.g., SMART/CAT-K) provides action distributions for each agent at each timestep as a human realism signal.
    - Mechanism: Reward function = task reward + \(\alpha \cdot \log \pi_{\text{ref}}(a_t|s_t)\) (likelihood reward); training objective = PPO loss \(- \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\) (distribution alignment); a sketch of both signals follows this list. The reference model is centralized (observing the global scene) while the policy is decentralized (observing only local context), forming a privileged-information architecture analogous to teacher-student learning.
    - Design Motivation: Rather than using ground-truth trajectories for supervision, the framework uses the model's probability distribution as a signal — enabling guidance in novel states generated by self-play that are absent from the logs. This also resolves the credit assignment problem in multi-agent settings: the reference model provides independent distributional signals for each agent's action at each step.
- Aligned Discrete Action Space:
    - Function: Aligns the RL policy's action space with that of the tokenized reference model (K-disk clustering with \(K=200\)).
    - Mechanism: Both share the same discrete action vocabulary, enabling closed-form computation of the KL divergence: \(D_{\text{KL}} = \sum_{a} \pi_\theta(a|o) \log \frac{\pi_\theta(a|o)}{\pi_{\text{ref}}(a|s)}\), without requiring online tokenization (see the sketch after this list).
    - Design Motivation: Without action space alignment, direct computation of the likelihood and KL divergence would be infeasible, rendering the core mechanism of the framework inoperative.
- Goal Dropout:
    - Function: Randomly removes goal conditioning during training to reduce reliance on explicit goals (a sketch follows this list).
    - Mechanism: Prior self-play methods reward agents only upon reaching a goal, which incentivizes aggressive acceleration. With reference model anchoring, the explicit goal reward can be entirely removed, which in turn improves behavioral realism.
    - Design Motivation: Human driving does not consist of rushing toward explicit waypoints; realistic behavior is characterized by smooth, flowing motion.
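Because both heads share the action vocabulary, the two anchoring signals from the first two designs reduce to a few lines. The sketch below assumes per-agent logits from both models over batched agents; `alpha` is a hypothetical weight, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def humanlike_signals(policy_logits: torch.Tensor,
                      ref_logits: torch.Tensor,
                      actions: torch.Tensor,
                      alpha: float = 0.1):
    """Per-agent, per-step realism signals (shapes and alpha are assumptions).

    policy_logits: [N, K] logits of the decentralized policy (local obs).
    ref_logits:    [N, K] logits of the centralized reference model (global scene).
    actions:       [N]    action tokens actually executed during self-play.
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Likelihood reward: alpha * log pi_ref(a_t | s_t) for the executed token.
    likelihood_reward = alpha * ref_logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Closed-form KL(pi_theta || pi_ref) over the shared K=200 vocabulary:
    # sum_a pi_theta(a|o) * [log pi_theta(a|o) - log pi_ref(a|s)].
    pi_logp = F.log_softmax(policy_logits, dim=-1)
    kl = (pi_logp.exp() * (pi_logp - ref_logp)).sum(dim=-1)
    return likelihood_reward, kl
```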
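Goal dropout admits an equally small sketch. Zero-masking the goal features and the dropout probability `p` are assumptions here; the paper specifies the idea, not this exact implementation.

```python
import torch

def goal_dropout(obs: torch.Tensor, goal_dims: slice, p: float = 0.5) -> torch.Tensor:
    """Zero the goal features for a random subset of agents (training only)."""
    masked = obs.clone()
    drop = torch.rand(obs.shape[0], device=obs.device) < p  # which agents to mask
    masked[drop, goal_dims] = 0.0  # goal_dims marks where goal conditioning lives
    return masked

# Usage (goal feature positions are hypothetical):
# obs = goal_dropout(obs, goal_dims=slice(10, 14))
```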
Loss & Training¶
\(\mathcal{L}(\theta) = \mathcal{L}_{\text{PPO}}(\theta; A[r]) - \beta\, D_{\text{KL}}(\pi_\theta(\cdot|o_t) \,\|\, \pi_{\text{ref}}(\cdot|s_t))\), where the reward is: \(r = w_{\text{goal}} \cdot \mathbb{I}[\text{Goal}] - w_{\text{collision}} \cdot \mathbb{I}[\text{Collision}] - w_{\text{offroad}} \cdot \mathbb{I}[\text{Offroad}] + w_{\text{humanlike}} \cdot \log \pi_{\text{ref}}(a_t|s_t)\)
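Putting the pieces together, a sketch of the reward assembly and training objective, written as a loss to minimize (so the objective's \(-\beta D_{\text{KL}}\) term appears with a plus sign). The weights `w`, `beta`, and the PPO clip range are illustrative, not the paper's values.

```python
import torch

def step_reward(reached_goal, collided, offroad, ref_logp_a, w):
    # r = w_goal*I[Goal] - w_collision*I[Collision] - w_offroad*I[Offroad]
    #     + w_humanlike * log pi_ref(a_t|s_t)
    # With reference-model anchoring, w["goal"] can be set to 0 (cf. Goal Dropout).
    return (w["goal"] * reached_goal
            - w["collision"] * collided
            - w["offroad"] * offroad
            + w["humanlike"] * ref_logp_a)

def spacer_loss(logp_new, logp_old, advantages, kl_to_ref,
                beta: float = 0.01, clip_eps: float = 0.2) -> torch.Tensor:
    # Standard PPO clipped surrogate plus the beta-weighted KL-to-reference term.
    ratio = (logp_new - logp_old).exp()
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages)
    return -surrogate.mean() + beta * kl_to_ref.mean()
```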
Key Experimental Results¶
Main Results¶
WOSAC validation set (vehicles):
| Method | Composite Realism↑ | Kinematics↑ | Interaction↑ | Collision↓ | Throughput (scenes/s)↑ |
|---|---|---|---|---|---|
| PPO (pure self-play) | 0.710 | 0.327 | 0.751 | 0.038 | 211.8 |
| HR-PPO | 0.716 | 0.341 | 0.756 | 0.044 | 211.8 |
| SPACeR | 0.741 | 0.411 | 0.779 | 0.036 | 211.8 |
| SMART (imitation learning) | 0.720 | 0.450 | 0.725 | 0.170 | 22.5 |
| CAT-K (imitation learning) | 0.766 | 0.490 | 0.792 | 0.060 | 22.5 |
Ablation Study¶
| Configuration | Composite Realism | Notes |
|---|---|---|
| PPO only | 0.710 | No human signal |
| + Likelihood reward only | ~0.72 | Marginal improvement; signal unstable under multimodal distributions |
| + KL alignment only | ~0.74 | Larger improvement; aligns distribution while preserving entropy |
| + Likelihood + KL (SPACeR) | 0.741 | Best overall |
| − Goal reward + anchoring | ~0.74 | Removing goal reward further improves realism |
Key Findings¶
- KL alignment contributes more than likelihood reward — likelihood reward reduces policy diversity (entropy decreases), whereas KL alignment improves realism while maintaining entropy.
- Reference model quality has limited impact: even with a weak 0.3M-parameter reference model (realism score 0.636), SPACeR still achieves 0.732, indicating the reference model serves as a "soft prior" rather than a "hard target."
- In closed-loop planner evaluation, SPACeR agents react more sensitively to the planner's behavior than CAT-K agents — their PDM scores correlate less with the GT logs, suggesting they more effectively penalize unsafe planners.
- A ~65K-parameter MLP achieves realism close to that of a 3.2M-parameter tokenized model, with 10× higher throughput.
Highlights & Insights¶
- The choice of an RL-first vs. finetune paradigm is insightful: most prior work follows a "large model first, then RL finetuning" approach, whereas SPACeR inverts this — RL drives training while the large model only provides reward signals. This yields a 50× smaller inference model suitable for large-scale simulation.
- Aligned action spaces enabling tractable KL computation is the critical technical enabler of the entire framework: with continuous action spaces, computing and optimizing the KL divergence would be substantially more difficult. This design choice directly determines the method's feasibility.
- The critical analysis of WOSAC metrics is valuable: the paper points out that WOSAC rewards reproducing logged trajectories rather than safe behavior (e.g., taking a parking lot route vs. going straight may both be reasonable, but WOSAC only rewards the logged choice), offering useful insights for improving evaluation in this domain.
Limitations & Future Work¶
- Composite realism remains below the strongest imitation learning method CAT-K (0.741 vs. 0.766), with a notable gap in kinematic metrics.
- Training requires 24–48 hours on a single GPU; multi-GPU distributed training is not supported.
- VRU (pedestrian/cyclist) simulation metrics underperform vehicle metrics; VRU-specific reward functions and evaluation protocols are needed.
- The policy does not utilize temporal history, which may limit performance in scenarios requiring long-term memory.
Related Work & Insights¶
- vs. HR-PPO (Cornelisse & Vinitsky, 2024): HR-PPO applies KL alignment only to a decentralized BC model, with limited effect. SPACeR uses a centralized tokenized model to provide stronger signals, improving realism from 0.716 to 0.741.
- vs. SMART/CAT-K: SPACeR achieves lower collision and off-road rates (0.036 vs. 0.17/0.06), confirming that self-play naturally promotes collision avoidance. Composite realism is slightly lower but inference is 10× faster.
- vs. GIGAFlow (Cusumano-Towner et al., 2025): GIGAFlow demonstrates the feasibility of large-scale self-play; SPACeR builds upon this by incorporating human realism anchoring.
Rating¶
- Novelty: ⭐⭐⭐⭐ The RL-first paradigm with a large model serving solely as a reward provider is novel, though the core techniques (KL alignment, PPO) are combinations of established methods.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers the WOSAC standard benchmark, closed-loop planner evaluation, reference model quality ablation, VRU evaluation, and efficiency comparisons.
- Writing Quality: ⭐⭐⭐⭐ Framework is clearly presented, experimental analysis is thorough, and the critical discussion of WOSAC metrics is insightful.
- Value: ⭐⭐⭐⭐⭐ Provides a practical large-scale traffic simulation solution — 10× speed with near-human realism — bridging the gap between efficiency and behavioral fidelity.