
SPARC: OOD Generalization for Controlling 100 Unseen Vehicles with a Single Policy

Conference: AAAI 2026
arXiv: 2511.09737
Code: https://github.com/bramgrooten/sparc
Area: Reinforcement Learning / Policy Generalization / Autonomous Driving
Keywords: OOD Generalization, Contextual Reinforcement Learning, Single-Phase Adaptation, Gran Turismo, SPARC

TL;DR

This paper proposes SPARC (Single-Phase Adaptation for Robust Control), which unifies the two-phase context encoding and history-based adaptation of RMA into a single-phase training procedure. Using a single policy in the high-fidelity Gran Turismo 7 racing simulator, SPARC achieves state-of-the-art OOD generalization performance across 100+ unseen vehicles.

Background & Motivation

Deep reinforcement learning has achieved notable success in robotics, nuclear fusion control, and racing simulators, yet generalizing to unseen environments (OOD contexts) remains a central challenge. Environmental conditions such as friction coefficients, wind speed, and vehicle dynamics may change unpredictably at deployment, leading to catastrophic failures.

Rapid Motor Adaptation (RMA) is a representative method in this line of work, employing a two-phase training procedure:

  1. Train an expert policy with a context encoder \(\psi(c) = z\) using privileged context information \(c\) via reinforcement learning (PPO/QR-SAC).
  2. Freeze the expert policy and train a history-based adaptation module \(\phi(h) = \hat{z}\) by regressing the encoder output \(z\) via MSE.
  3. At deployment, use only the adapter policy \(\pi_{ad}\) (no privileged information \(c\) required).
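
A minimal sketch of this two-phase schedule (schematic pseudocode; `collect`, `rl_update`, `freeze`, `supervised_update`, and `mse` are hypothetical helpers, not the authors' code) makes the contrast with SPARC's single phase below concrete:

```python
# Schematic RMA training schedule (Python pseudocode with hypothetical
# helpers; the real pipeline is more involved).

def rma_training(env, expert, adapter, phase1_steps, phase2_steps):
    # Phase 1: RL on the expert, which observes privileged context c.
    for _ in range(phase1_steps):
        batch = collect(env, expert)            # expert acts on (o, c)
        rl_update(expert, batch)                # e.g. a PPO / QR-SAC step
    freeze(expert)                              # checkpoint selection happens here

    # Phase 2: supervised regression onto the now-fixed encoder output.
    for _ in range(phase2_steps):
        batch = collect(env, adapter)           # adapter acts on (o, history h)
        z = expert.encoder(batch.context)       # fixed target z = psi(c)
        z_hat = adapter.encoder(batch.history)  # \hat{z} = phi(h)
        supervised_update(adapter, mse(z, z_hat))
```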

However, the two-phase approach has notable drawbacks: complex implementation, the need for careful checkpoint selection at the end of phase one, no straightforward support for continual learning, and multi-dimensional evaluation required for intermediate model selection.

The core insight of SPARC is to merge the two phases into a single simultaneous training stage: the context encoding \(z\) becomes a non-stationary (moving) target, but the adapter proves capable of tracking this learning dynamic.

Method

Overall Architecture

Two policy networks are trained simultaneously:

  • Expert policy \(\pi_{ex}\): receives observation \(o\) plus privileged context \(c\); contains the context encoder \(\psi(c) = z\); trained via QR-SAC (32 quantiles).
  • Adapter policy \(\pi_{ad}\): receives observation \(o\) plus a history \(h\) of \(H = 50\) observation-action pairs; contains the history adapter \(\phi(h) = \hat{z}\); \(\phi\) is trained via MSE supervision to regress the output \(z\) of \(\psi\).

Both networks share the same decision-network backbone. \(\pi_{ad}\) periodically copies weights from \(\pi_{ex}\) (excluding the \(\phi\) module). At test time, only \(\pi_{ad}\) is deployed (no privileged information \(c\) required). Evaluation uses the BIAI ratio (RL agent lap time / built-in AI lap time; lower is better).
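
A compact PyTorch sketch of this dual-policy layout. Hidden sizes other than the 2048-dim FC layers, and the deterministic `Tanh` head, are assumptions (the actual QR-SAC actor is stochastic); this is a structural sketch, not the authors' implementation:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Shared skeleton for pi_ex and pi_ad: an observation embedding, a
    latent z (from psi(c) or phi(h)), and the common decision head."""
    def __init__(self, obs_dim, latent_dim, encoder, act_dim=2):
        super().__init__()
        self.obs_embed = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.encoder = encoder                    # psi for the expert, phi for the adapter
        self.decision = nn.Sequential(
            nn.Linear(256 + latent_dim, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
            nn.Linear(2048, act_dim), nn.Tanh())  # throttle/brake + steering

    def forward(self, obs, ctx_or_hist):
        z = self.encoder(ctx_or_hist)
        return self.decision(torch.cat([self.obs_embed(obs), z], dim=-1))

def sync_adapter(adapter: Policy, expert: Policy) -> None:
    """Periodically copy expert weights into the adapter, excluding phi
    (the adapter keeps its own `encoder` parameters)."""
    shared = {k: v for k, v in expert.state_dict().items()
              if not k.startswith("encoder.")}
    adapter.load_state_dict(shared, strict=False)
```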

Key Designs

  1. Single-phase training: Unlike RMA, SPARC simultaneously updates \(\pi_{ex}\) (RL objective) and \(\phi\) (MSE objective \(L_\phi = \mathbb{E}[(z - \hat{z})^2]\)) within the same training loop; see the training-step sketch after this list. Although \(z = \psi(c)\) is a non-stationary target for \(\phi\), experiments show that the adapter can track this moving target. Key advantages: no phase-one checkpoint selection, support for indefinite continual training, and natural compatibility with distributed asynchronous training systems.
  2. Adapter-collected experience: \(\pi_{ad}\) (rather than \(\pi_{ex}\)) acts in the environment to collect experience. This brings \(\pi_{ad}\)'s learning closer to an on-policy setting: during training it already faces the state distribution induced by its own imperfect inference, so it can correct those inaccuracies before deployment. Ablation studies confirm this choice is superior across most OOD settings.
  3. Network architecture: The history adapter \(\phi\) uses a 1D CNN (kernel = 8, 5, 5; stride = 4, 1, 1) followed by FC layers to process \(H\) steps of (observation, action) pairs. The output dimensions of \(\psi\) and \(\phi\) are identical; their outputs are concatenated with the observation embedding \(\ell\) and fed into the decision layers (2048-dim FC × 2 → 2-dim control output: throttle/brake + steering angle). The critic network shares the expert policy architecture and has access to context \(c\).
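
The training-step sketch below illustrates designs 1 and 2, continuing the hypothetical helpers from earlier (`collect` and `rl_update` stand in for the experience pipeline and the QR-SAC update): one loop, adapter-collected data, and MSE regression onto a moving target.

```python
def sparc_step(env, expert, adapter, phi_optim):
    batch = collect(env, adapter)             # design 2: pi_ad gathers experience
    rl_update(expert, batch)                  # QR-SAC update for pi_ex (sees c)

    # Design 1: regress phi onto the *current* encoder output, a moving target.
    with torch.no_grad():
        z = expert.encoder(batch.context)     # z = psi(c), changes every step
    z_hat = adapter.encoder(batch.history)    # \hat{z} = phi(h)
    loss_phi = ((z - z_hat) ** 2).mean()      # L_phi = E[(z - \hat{z})^2]
    phi_optim.zero_grad()
    loss_phi.backward()
    phi_optim.step()
```

And a possible shape-correct instantiation of the history adapter \(\phi\) matching the quoted kernels and strides (channel widths and the single FC layer are assumptions):

```python
class HistoryAdapter(nn.Module):
    """phi: 1D CNN over H=50 stacked (observation, action) pairs,
    kernels (8, 5, 5) and strides (4, 1, 1) as quoted above."""
    def __init__(self, in_dim, latent_dim, H=50, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, channels, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, stride=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, stride=1), nn.ReLU(),
            nn.Flatten())
        with torch.no_grad():                 # infer the flattened size once
            flat = self.conv(torch.zeros(1, in_dim, H)).shape[-1]
        self.fc = nn.Linear(flat, latent_dim)

    def forward(self, hist):                  # hist: (batch, H, in_dim)
        return self.fc(self.conv(hist.transpose(1, 2)))
```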

Loss & Training

  • RL loss (\(\pi_{ex}\)): QR-SAC (Quantile Regression Soft Actor-Critic), 32 quantiles.
  • Adapter loss: \(L_\phi = \mathbb{E}[(z - \hat{z})^2]\), MSE regression of the expert's context encoding.
  • Training configuration: 9M steps on Gran Turismo (12M steps on Nürburgring), 3M steps on MuJoCo. Distributed asynchronous training with up to 20 PlayStations collecting experience in parallel and an A100 GPU handling learner updates. A single GT7 training run takes approximately 6 days.
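
The post does not spell out QR-SAC's critic loss; for reference, a standard pairwise quantile Huber loss from the QR family (which QR-SAC's distributional critic builds on) looks like this, with N = 32 quantiles:

```python
import torch
import torch.nn.functional as F

def quantile_huber_loss(pred_q: torch.Tensor,
                        target_q: torch.Tensor,
                        kappa: float = 1.0) -> torch.Tensor:
    """pred_q, target_q: (batch, N) return quantiles; here N = 32.
    target_q is typically r + gamma * next-state quantiles, detached."""
    N = pred_q.shape[1]
    taus = (torch.arange(N, dtype=pred_q.dtype) + 0.5) / N   # quantile midpoints
    # Pairwise TD errors: dim 1 indexes predicted quantiles, dim 2 targets.
    u = target_q.unsqueeze(1) - pred_q.unsqueeze(2)          # (batch, N, N)
    huber = F.huber_loss(pred_q.unsqueeze(2).expand_as(u),
                         target_q.unsqueeze(1).expand_as(u),
                         reduction="none", delta=kappa)
    weight = (taus.view(1, N, 1) - (u.detach() < 0).float()).abs()
    return (weight * huber).mean(dim=2).sum(dim=1).mean()
```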

Key Experimental Results

Main Results: Gran Turismo 7 Track Performance

| Track | Metric | SPARC | RMA | History Input | Only Obs | Oracle |
|---|---|---|---|---|---|---|
| Grand Valley OOD | BIAI ratio ↓ | 1.049 | 1.056 | 1.083 | 1.064 | 1.135 |
| Grand Valley OOD | Success rate ↑ | 98.1% | 97.1% | 92.6% | 95.2% | 90.9% |
| Nürburgring OOD | BIAI ratio ↓ | 1.120 | 1.300 | 1.120 | 1.175 | 1.118 |
| Nürburgring OOD | Success rate ↑ | 89.0% | 78.0% | 86.7% | 81.9% | 89.6% |
| Catalunya OOD | BIAI ratio ↓ | 0.963 | 0.967 | 0.955 | 0.956 | 1.135 |
| Catalunya OOD | Success rate ↑ | 100% | 100% | 99.3% | 100% | 85.3% |

Ablation Study

| Configuration | GV OOD ratio ↓ | Nür OOD ratio ↓ | Note |
|---|---|---|---|
| SPARC (\(\pi_{ad}\) collects) | 1.049 | 1.120 | Default |
| SPARC (\(\pi_{ex}\) collects) | 1.069 | 1.099 | Better on some tracks but less consistent overall |
| History \(H=50\) | 1.049 | | Optimal history length |
| History \(H=10\) | 1.067 | | Insufficient information to infer context |
| History \(H=100\) | 1.055 | | Overly long history dilutes relevant signals |
| MuJoCo HalfCheetah | | | SPARC 10018 vs. RMA 9034 return (+10.9%) |
| Power & Mass experiment | 0.991 | | Surpasses Oracle (0.996) |

Key Findings

  • SPARC surpasses Oracle in the Power & Mass experiment (0.991 vs. 0.996), suggesting that history-based inference is more robust than explicit context encoding.
  • Oracle (with privileged information) performs worse under OOD conditions (Grand Valley: 1.135), indicating overfitting to the training vehicle distribution.
  • After a physics engine update, SPARC demonstrates minimal performance degradation in zero-shot transfer, confirming that it learns transferable context representations.
  • Having the adapter policy collect experience aligns the training and deployment state distributions, reducing train-test mismatch.
  • The large-scale experimental design spanning 3 tracks × 100+ OOD vehicles is exceptionally rare in the context-adaptive RL literature.

Highlights & Insights

  • Elegant simplicity: eliminates two-phase training, intermediate checkpoint selection, and separate training pipelines—a single loop handles everything.
  • Surpassing Oracle with privileged information demonstrates the value of history-based reasoning—in-context adaptation can be more robust than explicit context encoding.
  • Robustness after the GT7 physics engine update (zero-shot transfer) indicates that the model learns an abstract representation of vehicle dynamics.
  • The insight of using \(\pi_{ad}\) to collect experience—closer to on-policy learning—benefits the quality of the ultimately deployed policy.
  • Naturally compatible with asynchronous distributed computing and continual learning, making it deployment-friendly.

Limitations & Future Work

  • Validation is limited to simulators; sim-to-real transfer on physical robots has not been tested.
  • The Gran Turismo codebase is based on a proprietary platform and is not open-source; only the MuJoCo component is reproducible.
  • The history length \(H = 50\) is a global hyperparameter and is not adaptively adjusted based on environment complexity.
  • Training computational cost is high (approximately 6 days per GT7 run, requiring 20 PlayStations).
  • No direct comparison with meta-RL methods (e.g., MAML).

Comparisons & Outlook

  • vs. RMA: single-phase vs. two-phase training; SPARC achieves superior OOD performance and naturally supports continual learning.
  • vs. Domain Randomization (Only Obs): explicitly inferring and conditioning on context yields systematic improvements over a context-blind randomized policy.
  • vs. Oracle: SPARC's ability to surpass Oracle suggests that context inference can be more robust than directly accessing context information.
  • The single-phase adaptation paradigm is generalizable to robot locomotion, UAV control, industrial robotics, and other domains requiring context-adaptive RL.

Rating

⭐⭐⭐⭐⭐ (5/5) The method is elegant and effective, demonstrating impressive OOD generalization in the challenging Gran Turismo 7 environment. The experiments are exceptionally comprehensive—3 tracks × 500 vehicles + MuJoCo + physics engine transfer + detailed ablations.