SPACeR: Self-Play Anchoring with Centralized Reference Models

Conference: ICLR 2026 arXiv: 2510.18060 Code: N/A Area: Autonomous Driving / Reinforcement Learning Keywords: self-play reinforcement learning, traffic simulation, tokenized models, KL divergence alignment, human driving distribution

TL;DR

SPACeR proposes a "human-like self-play" framework that uses a pretrained tokenized autoregressive motion model as a centralized reference policy. By incorporating log-likelihood rewards and KL divergence constraints, it guides a decentralized self-play RL policy to align with the human driving distribution. SPACeR outperforms pure self-play methods on WOSAC while achieving 10× faster inference and 50× fewer parameters than imitation learning approaches.

Background & Motivation

Background: Autonomous driving simulation requires realistic and reactive traffic agent policies. Two dominant paradigms each have distinct trade-offs — imitation learning (e.g., SMART, CAT-K) captures realistic human behavior but incurs high inference cost and poor closed-loop reactivity; self-play RL is naturally suited for multi-agent interaction and is inference-efficient, but tends to deviate from human driving norms.

Limitations of Prior Work: (a) Imitation learning models (Transformer-based) are slow to infer and parameter-heavy, making them unsuitable for large-scale closed-loop simulation; (b) self-play RL relies on hand-crafted reward shaping, and policies may learn unnatural behaviors (e.g., aggressive acceleration toward waypoints); (c) existing methods that combine RL with imitation learning mostly follow a "pretrain-then-finetune" paradigm rather than letting RL take the lead.

Key Challenge: How can the speed and scalability of self-play RL be preserved while ensuring behavioral realism aligned with the human driving distribution?

Goal: To build a lightweight, fast, and scalable multi-agent simulation policy that maintains behavioral realism close to the human driving distribution.

Key Insight: An RL-first philosophy — self-play serves as the foundation, while the imitation learning model acts solely as a reward provider (reference policy) rather than a finetuning target.

Core Idea: A pretrained tokenized model supplies human realism signals to anchor self-play RL, while actual execution is performed by a 65K-parameter MLP.

Method

Overall Architecture

Input: WOMD scenes (road graph, initial states of all agents). The decentralized policy \(\pi_\theta\) (MLP) makes decisions based on local observations only. The centralized reference model \(\pi_{\text{ref}}\) (pretrained tokenized model) provides distributional signals based on the global scene. Training uses PPO with likelihood rewards and KL constraints; only the lightweight MLP is used at inference time.
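To make the scale concrete, here is a minimal numpy sketch of a decentralized MLP policy of roughly the size the paper reports (~65K parameters, K = 200 discrete action tokens). The observation width and hidden sizes are assumptions chosen only to land near that parameter count; the paper's actual feature encoding is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the paper reports a ~65K-parameter MLP and a
# K = 200 token action vocabulary; OBS_DIM and HIDDEN are assumptions.
OBS_DIM, HIDDEN, K = 64, 128, 200

def init_mlp():
    """Initialize a two-hidden-layer MLP policy as (weight, bias) pairs."""
    sizes = [(OBS_DIM, HIDDEN), (HIDDEN, HIDDEN), (HIDDEN, K)]
    return [(rng.normal(0.0, 0.1, s), np.zeros(s[1])) for s in sizes]

def policy_logits(params, obs):
    """Map a local observation to logits over the shared action vocabulary."""
    x = obs
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:  # tanh on hidden layers only
            x = np.tanh(x)
    return x

params = init_mlp()
n_params = sum(W.size + b.size for W, b in params)  # ≈ 50K with these sizes
logits = policy_logits(params, rng.normal(size=OBS_DIM))
```

At inference time only this small network runs; the centralized reference model is needed during training alone, which is where the 10× throughput advantage comes from.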

Key Designs

  1. Centralized Reference Model as Reward Provider:

    • Function: A pretrained tokenized model (e.g., SMART/CAT-K) provides action distributions for each agent at each timestep as a human realism signal.
    • Mechanism: Reward function = task reward + \(\alpha \cdot \log \pi_{\text{ref}}(a_t|s_t)\) (likelihood reward); training objective = PPO loss \(- \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\) (distribution alignment). The reference model is centralized (observing the global scene) while the policy is decentralized (observing only local context), forming a privileged-information architecture analogous to teacher-student learning.
    • Design Motivation: Rather than using ground-truth trajectories for supervision, the framework uses the model's probability distribution as a signal — enabling guidance in novel states generated by self-play that are absent from the logs. This also resolves the credit assignment problem in multi-agent settings: the reference model provides independent distributional signals for each agent's action at each step.
  2. Aligned Discrete Action Space:

    • Function: Aligns the RL policy's action space with that of the tokenized reference model (K-disk clustering with \(K=200\)).
    • Mechanism: Both share the same discrete action vocabulary, enabling closed-form computation of the KL divergence: \(D_{\text{KL}} = \sum_{a} \pi_\theta(a|o) \log \frac{\pi_\theta(a|o)}{\pi_{\text{ref}}(a|s)}\), without requiring online tokenization.
    • Design Motivation: Without action space alignment, direct computation of the likelihood and KL divergence would be infeasible, rendering the core mechanism of the framework inoperative.
  3. Goal Dropout:

    • Function: Randomly removes goal conditioning during training to reduce reliance on explicit goals.
    • Mechanism: Prior self-play methods reward agents only upon reaching a goal, which incentivizes aggressive acceleration. With reference model anchoring, the explicit goal reward can be entirely removed, which in turn improves behavioral realism.
    • Design Motivation: Human driving does not consist of rushing toward explicit waypoints; realistic behavior is characterized by smooth, flowing motion.
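Because the policy and the reference model share the same K = 200 token vocabulary (design 2 above), the KL term can be computed exactly as a sum over tokens. A minimal sketch, with random distributions standing in for the two models' outputs:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_discrete(p, q, eps=1e-12):
    """Closed-form KL(p || q) over a shared discrete action vocabulary."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

rng = np.random.default_rng(0)
K = 200  # shared K-disk action vocabulary size from the paper
pi_theta = softmax(rng.normal(size=K))  # decentralized policy (local obs)
pi_ref = softmax(rng.normal(size=K))    # centralized reference (global state)

kl = kl_discrete(pi_theta, pi_ref)          # penalized in the PPO objective
self_kl = kl_discrete(pi_theta, pi_theta)   # zero when distributions match
```

Without the shared vocabulary, evaluating \(\pi_{\text{ref}}\) on the policy's actions would require online tokenization of continuous trajectories, which is exactly what the alignment avoids.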

Loss & Training

\(\mathcal{L}(\theta) = \mathcal{L}_{\text{PPO}}(\theta; A[r]) - \beta\, D_{\text{KL}}(\pi_\theta(\cdot|o_t) \,\|\, \pi_{\text{ref}}(\cdot|s_t))\), where the per-step reward is \(r = w_{\text{goal}} \cdot \mathbb{I}[\text{Goal}] - w_{\text{collision}} \cdot \mathbb{I}[\text{Collision}] - w_{\text{offroad}} \cdot \mathbb{I}[\text{Offroad}] + w_{\text{humanlike}} \cdot \log \pi_{\text{ref}}(a_t|s_t)\).
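The reward above composes sparse task indicators with the dense likelihood reward from the reference model. A sketch with illustrative weights (the paper's actual weight values are not reproduced here); note that with goal dropout, \(w_{\text{goal}}\) can be set to zero entirely:

```python
import math

def spacer_reward(reached_goal, collided, offroad, log_pi_ref,
                  w_goal=1.0, w_collision=1.0, w_offroad=1.0,
                  w_humanlike=0.1):
    """Per-step reward: sparse task terms plus the likelihood reward
    log pi_ref(a_t|s_t). Weight values are placeholders, not the paper's."""
    return (w_goal * float(reached_goal)
            - w_collision * float(collided)
            - w_offroad * float(offroad)
            + w_humanlike * log_pi_ref)

# A safe, human-like step: no sparse events fire, and the reference
# model assigns the chosen token probability 0.5 (log 0.5 ≈ -0.693).
r = spacer_reward(False, False, False, math.log(0.5))

# With anchoring, the goal term can be dropped (goal dropout):
r_no_goal = spacer_reward(True, False, False, math.log(0.5), w_goal=0.0)
```

The dense \(\log \pi_{\text{ref}}\) term is what lets the sparse goal reward be removed without the policy losing direction.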

Key Experimental Results

Main Results

WOSAC validation set (vehicles):

| Method | Composite Realism↑ | Kinematics↑ | Interaction↑ | Collision↓ | Throughput (scenes/s)↑ |
|---|---|---|---|---|---|
| PPO (pure self-play) | 0.710 | 0.327 | 0.751 | 0.038 | 211.8 |
| HR-PPO | 0.716 | 0.341 | 0.756 | 0.044 | 211.8 |
| SPACeR | 0.741 | 0.411 | 0.779 | 0.036 | 211.8 |
| SMART (imitation learning) | 0.720 | 0.450 | 0.725 | 0.170 | 22.5 |
| CAT-K (imitation learning) | 0.766 | 0.490 | 0.792 | 0.060 | 22.5 |

Ablation Study

| Configuration | Composite Realism | Notes |
|---|---|---|
| PPO only | 0.710 | No human signal |
| + Likelihood reward only | ~0.72 | Marginal improvement; signal unstable under multimodal distributions |
| + KL alignment only | ~0.74 | Larger improvement; aligns distribution while preserving entropy |
| + Likelihood + KL (SPACeR) | 0.741 | Best overall |
| − Goal reward + anchoring | ~0.74 | Removing goal reward further improves realism |

Key Findings

  • KL alignment contributes more than likelihood reward — likelihood reward reduces policy diversity (entropy decreases), whereas KL alignment improves realism while maintaining entropy.
  • Reference model quality has limited impact: even with a weak 0.3M-parameter reference model (realism score 0.636), SPACeR still achieves 0.732, indicating the reference model serves as a "soft prior" rather than a "hard target."
  • In closed-loop planner evaluation, SPACeR agents react more to the planner under test than CAT-K agents do — their PDM scores correlate less with GT logs, suggesting they more effectively penalize unsafe planners.
  • A ~65K-parameter MLP achieves realism close to that of a 3.2M-parameter tokenized model, with 10× higher throughput.

Highlights & Insights

  • The choice of an RL-first vs. finetune paradigm is insightful: most prior work follows a "large model first, then RL finetuning" approach, whereas SPACeR inverts this — RL drives training while the large model only provides reward signals. This yields a 50× smaller inference model suitable for large-scale simulation.
  • The aligned action space that makes KL computation tractable is the critical technical enabler of the entire framework: with continuous action spaces, computing and optimizing the KL divergence would be substantially more difficult. This design choice directly determines the method's feasibility.
  • The critical analysis of WOSAC metrics is valuable: the paper points out that WOSAC rewards reproducing logged trajectories rather than safe behavior (e.g., taking a parking lot route vs. going straight may both be reasonable, but WOSAC only rewards the logged choice), offering useful insights for improving evaluation in this domain.

Limitations & Future Work

  • Composite realism remains below the strongest imitation learning method CAT-K (0.741 vs. 0.766), with a notable gap in kinematic metrics.
  • Training requires 24–48 hours on a single GPU; multi-GPU distributed training is not supported.
  • VRU (pedestrian/cyclist) simulation metrics underperform vehicle metrics; VRU-specific reward functions and evaluation protocols are needed.
  • The policy does not utilize temporal history, which may limit performance in scenarios requiring long-term memory.
Comparison with Related Work

  • vs. HR-PPO (Cornelisse & Vinitsky, 2024): HR-PPO applies KL alignment only to a decentralized BC model, with limited effect. SPACeR uses a centralized tokenized model to provide stronger signals, improving realism from 0.716 to 0.741.
  • vs. SMART/CAT-K: SPACeR achieves lower collision and off-road rates (0.036 vs. 0.17/0.06), confirming that self-play naturally promotes collision avoidance. Composite realism is slightly lower but inference is 10× faster.
  • vs. GIGAFlow (Cusumano-Towner et al., 2025): GIGAFlow demonstrates the feasibility of large-scale self-play; SPACeR builds upon this by incorporating human realism anchoring.

Rating

  • Novelty: ⭐⭐⭐⭐ The RL-first paradigm with a large model serving solely as a reward provider is novel, though the core techniques (KL alignment, PPO) are combinations of established methods.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers the WOSAC standard benchmark, closed-loop planner evaluation, reference model quality ablation, VRU evaluation, and efficiency comparisons.
  • Writing Quality: ⭐⭐⭐⭐ Framework is clearly presented, experimental analysis is thorough, and the critical discussion of WOSAC metrics is insightful.
  • Value: ⭐⭐⭐⭐⭐ Provides a practical large-scale traffic simulation solution — 10× speed with near-human realism — bridging the gap between efficiency and behavioral fidelity.