SPARC: OOD Generalization for Controlling 100 Unseen Vehicles with a Single Policy¶
- Conference: AAAI 2026
- arXiv: 2511.09737
- Code: https://github.com/bramgrooten/sparc
- Area: Reinforcement Learning / Policy Generalization / Autonomous Driving
- Keywords: OOD Generalization, Contextual Reinforcement Learning, Single-Phase Adaptation, Gran Turismo, SPARC
TL;DR¶
This paper proposes SPARC (Single-Phase Adaptation for Robust Control), which unifies the two-phase context encoding and history-based adaptation of RMA into a single-phase training procedure. Using a single policy in the high-fidelity Gran Turismo 7 racing simulator, SPARC achieves state-of-the-art OOD generalization performance across 100+ unseen vehicles.
Background & Motivation¶
Deep reinforcement learning has achieved notable success in robotics, nuclear fusion control, and racing simulators, yet generalizing to unseen environments (OOD contexts) remains a central challenge. Environmental conditions such as friction coefficients, wind speed, and vehicle dynamics may change unpredictably at deployment, leading to catastrophic failures.
Rapid Motor Adaptation (RMA) is a representative method in this line of work, employing a two-phase training procedure:

1. Train an expert policy with a context encoder \(\psi(c) = z\) that consumes privileged context information \(c\), via reinforcement learning (PPO/QR-SAC).
2. Freeze the expert policy and train a history-based adaptation module \(\phi(h) = \hat{z}\) by regressing it onto the encoder output \(z\) with an MSE loss.

At deployment, only the adapter policy \(\pi_{ad}\) is used, so no privileged information \(c\) is required.
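To make the two-phase schedule concrete, here is a minimal PyTorch sketch; the module sizes, the random-tensor stand-ins for logged rollouts, and the `rl_update` stub are illustrative assumptions, not RMA's actual code:

```python
import torch
import torch.nn as nn

CTX_DIM, OBS_DIM, ACT_DIM, Z_DIM, H, BATCH = 12, 64, 2, 32, 50, 256

# Context encoder psi(c) = z and history adapter phi(h) = z_hat.
psi = nn.Sequential(nn.Linear(CTX_DIM, 64), nn.ReLU(), nn.Linear(64, Z_DIM))
phi = nn.Sequential(nn.Flatten(), nn.Linear(H * (OBS_DIM + ACT_DIM), 128),
                    nn.ReLU(), nn.Linear(128, Z_DIM))

# Phase 1: train the expert (decision layers + psi) by RL with privileged c
# (PPO / QR-SAC); stubbed out here.
# for step in range(PHASE1_STEPS): rl_update(expert_policy)

# Phase 2: freeze the expert, then regress phi(h) onto the now-fixed z = psi(c).
for p in psi.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(phi.parameters(), lr=3e-4)
for step in range(1000):
    ctx = torch.randn(BATCH, CTX_DIM)                   # stand-in for logged rollouts
    hist = torch.randn(BATCH, H, OBS_DIM + ACT_DIM)     # last H (o, a) pairs
    loss = nn.functional.mse_loss(phi(hist), psi(ctx))  # stationary target, unlike SPARC
    opt.zero_grad(); loss.backward(); opt.step()

# Deployment: only the adapter policy (using phi) runs; c is never needed.
```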
However, the two-phase approach has notable drawbacks: complex implementation, the need for careful checkpoint selection at the end of phase one, no straightforward support for continual learning, and multi-dimensional evaluation required for intermediate model selection.
The core insight of SPARC is to merge the two phases into a single simultaneous training stage: the context encoding \(z\) then becomes a non-stationary (moving) target for the adapter, but the adapter turns out to be capable of handling this learning dynamic.
Method¶
Overall Architecture¶
Two policy networks are trained simultaneously:

- Expert policy \(\pi_{ex}\): receives observation \(o\) plus privileged context \(c\); contains a context encoder \(\psi(c) = z\); trained via QR-SAC (32 quantiles).
- Adapter policy \(\pi_{ad}\): receives observation \(o\) plus history \(h\) of \(H = 50\) observation-action pairs; contains a history adapter \(\phi(h) = \hat{z}\), where \(\phi\) is trained via MSE supervision to regress the output \(z\) of \(\psi\).
Both networks share the same downstream decision-network architecture. \(\pi_{ad}\) periodically copies weights from \(\pi_{ex}\) (excluding the \(\phi\) module). At test time, only \(\pi_{ad}\) is deployed, so no privileged information \(c\) is required. Evaluation uses the BIAI ratio (RL agent lap time divided by built-in AI lap time; lower is better).
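Below is a minimal PyTorch sketch of this dual-policy layout; the layer sizes, `sync_weights` helper, and class names are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

OBS_DIM, CTX_DIM, ACT_DIM, Z_DIM, H = 64, 12, 2, 32, 50

class ExpertPolicy(nn.Module):
    """pi_ex: sees observation o plus privileged context c."""
    def __init__(self):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(CTX_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, Z_DIM))             # psi(c) = z
        self.obs_embed = nn.Linear(OBS_DIM, 128)
        self.decision = nn.Sequential(nn.Linear(128 + Z_DIM, 256), nn.ReLU(),
                                      nn.Linear(256, ACT_DIM), nn.Tanh())

    def forward(self, o, c):
        z = self.psi(c)
        ell = torch.relu(self.obs_embed(o))                        # obs embedding l
        return self.decision(torch.cat([ell, z], dim=-1)), z

class AdapterPolicy(nn.Module):
    """pi_ad: sees observation o plus a history h of (o, a) pairs; no c."""
    def __init__(self):
        super().__init__()
        self.phi = nn.Sequential(nn.Flatten(),                     # phi(h) = z_hat
                                 nn.Linear(H * (OBS_DIM + ACT_DIM), 128),
                                 nn.ReLU(), nn.Linear(128, Z_DIM))
        self.obs_embed = nn.Linear(OBS_DIM, 128)
        self.decision = nn.Sequential(nn.Linear(128 + Z_DIM, 256), nn.ReLU(),
                                      nn.Linear(256, ACT_DIM), nn.Tanh())

    def forward(self, o, h):
        z_hat = self.phi(h)
        ell = torch.relu(self.obs_embed(o))
        return self.decision(torch.cat([ell, z_hat], dim=-1)), z_hat

def sync_weights(adapter: AdapterPolicy, expert: ExpertPolicy) -> None:
    """Periodic copy of shared weights from pi_ex to pi_ad, excluding phi."""
    adapter.obs_embed.load_state_dict(expert.obs_embed.state_dict())
    adapter.decision.load_state_dict(expert.decision.state_dict())
```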
Key Designs¶
- Single-phase training: Unlike RMA, SPARC simultaneously updates \(\pi_{ex}\) (RL objective) and \(\phi\) (MSE objective \(L_\phi = \mathbb{E}[(z - \hat{z})^2]\)) within the same training loop; a minimal loop sketch follows the Loss & Training list below. Although \(z = \psi(c)\) is a non-stationary target for \(\phi\), experiments show that the adapter can track this moving target. Key advantages include: no need to select an optimal phase-one checkpoint, support for indefinite continual training, and natural compatibility with distributed asynchronous training systems.
- Adapter-collected experience: \(\pi_{ad}\) (rather than \(\pi_{ex}\)) acts in the environment to collect experience. This brings \(\pi_{ad}\)'s learning closer to an on-policy setting: during training it already faces the state distribution induced by its own imperfect context inference, so it can correct those inaccuracies before deployment and the training and deployment distributions stay consistent. Ablation studies confirm this choice is superior across most OOD settings.
- Network architecture: The history adapter \(\phi\) uses a 1D CNN (kernels 8, 5, 5; strides 4, 1, 1) followed by FC layers to process \(H\) steps of (observation, action) pairs; see the sketch after this list. The outputs of \(\psi\) and \(\phi\) have identical dimensions; each policy concatenates its encoding with the observation embedding \(\ell\) and feeds the result into the decision layers (two 2048-dim FC layers → 2-dim control output: throttle/brake and steering angle). The critic network shares the expert policy architecture and has access to context \(c\).
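A sketch of \(\phi\) using the reported kernel/stride settings, plus the 2048-dim decision head; the channel count, FC width, and 128-dim observation embedding are assumptions:

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, Z_DIM, H = 64, 2, 32, 50
IN_CH = OBS_DIM + ACT_DIM            # one (observation, action) pair per time step

class HistoryAdapter(nn.Module):
    """phi(h) = z_hat: 1D CNN over time, kernels (8, 5, 5), strides (4, 1, 1)."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(IN_CH, ch, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=5, stride=1), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=5, stride=1), nn.ReLU(),
        )
        with torch.no_grad():                        # infer flattened size for H = 50
            n = self.conv(torch.zeros(1, IN_CH, H)).numel()
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(n, Z_DIM))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, H, OBS_DIM + ACT_DIM) -> (batch, channels, H) for Conv1d
        return self.fc(self.conv(h.transpose(1, 2)))

# Decision head: [obs embedding l ; z or z_hat] -> two 2048-dim FC layers ->
# 2 controls (combined throttle/brake, steering angle).
decision = nn.Sequential(
    nn.Linear(128 + Z_DIM, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 2), nn.Tanh(),
)

z_hat = HistoryAdapter()(torch.randn(4, H, IN_CH))   # shape: (4, Z_DIM)
```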
Loss & Training¶
- RL loss (\(\pi_{ex}\)): QR-SAC (Quantile Regression Soft Actor-Critic), 32 quantiles.
- Adapter loss: \(L_\phi = \mathbb{E}[(z - \hat{z})^2]\), MSE regression of the expert's context encoding.
- Training configuration: 9M steps on Gran Turismo tracks (12M on Nürburgring) and 3M steps on MuJoCo. Distributed asynchronous training with up to 20 PlayStations collecting experience in parallel and an A100 GPU for computation. A single GT7 training run takes approximately 6 days.
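Putting the pieces together, a minimal single-phase loop. Random tensors stand in for the simulator and the QR-SAC update is stubbed, so this is a sketch of the training logic under the assumptions above, not the distributed system itself:

```python
import torch
import torch.nn as nn

OBS_DIM, CTX_DIM, ACT_DIM, Z_DIM, H, BATCH = 64, 12, 2, 32, 50, 256

psi = nn.Sequential(nn.Linear(CTX_DIM, 64), nn.ReLU(), nn.Linear(64, Z_DIM))
phi = nn.Sequential(nn.Flatten(), nn.Linear(H * (OBS_DIM + ACT_DIM), 128),
                    nn.ReLU(), nn.Linear(128, Z_DIM))
opt_phi = torch.optim.Adam(phi.parameters(), lr=3e-4)

for step in range(1000):
    # 1) pi_ad (not pi_ex) acts in the environment; privileged c is logged
    #    alongside each transition for training use only.
    obs = torch.randn(BATCH, OBS_DIM)
    ctx = torch.randn(BATCH, CTX_DIM)
    hist = torch.randn(BATCH, H, OBS_DIM + ACT_DIM)

    # 2) RL update of pi_ex (QR-SAC with 32 quantiles in the paper; stubbed).
    # qr_sac_update(expert, critic, batch)

    # 3) Supervised update of phi toward the *current* encoding z = psi(c):
    #    a non-stationary target that keeps moving as psi trains.
    z = psi(ctx).detach()
    loss = nn.functional.mse_loss(phi(hist), z)       # L_phi = E[(z - z_hat)^2]
    opt_phi.zero_grad(); loss.backward(); opt_phi.step()

    # 4) Every K steps, copy pi_ex's shared weights into pi_ad (excluding phi),
    #    e.g. via sync_weights(adapter, expert) from the architecture sketch.
```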
Key Experimental Results¶
Main Results: Gran Turismo 7 Track Performance¶
| Track | Metric | SPARC | RMA | History Input | Only Obs | Oracle |
|---|---|---|---|---|---|---|
| Grand Valley | OOD BIAI ratio↓ | 1.049 | 1.056 | 1.083 | 1.064 | 1.135 |
| Grand Valley | OOD Success Rate↑ | 98.1% | 97.1% | 92.6% | 95.2% | 90.9% |
| Nürburgring | OOD BIAI ratio↓ | 1.120 | 1.300 | 1.120 | 1.175 | 1.118 |
| Nürburgring | OOD Success Rate↑ | 89.0% | 78.0% | 86.7% | 81.9% | 89.6% |
| Catalunya | OOD BIAI ratio↓ | 0.963 | 0.967 | 0.955 | 0.956 | 1.135 |
| Catalunya | OOD Success Rate↑ | 100% | 100% | 99.3% | 100% | 85.3% |
Ablation Study¶
| Configuration | GV OOD ratio↓ | Nür OOD ratio↓ | Note |
|---|---|---|---|
| SPARC (\(\pi_{ad}\) collects) | 1.049 | 1.120 | Default |
| SPARC (\(\pi_{ex}\) collects) | 1.069 | 1.099 | Better on some tracks but less consistent overall |
| History \(H=50\) | 1.049 | — | Optimal history length |
| History \(H=10\) | 1.067 | — | Insufficient information to infer context |
| History \(H=100\) | 1.055 | — | Overly long history dilutes relevant signals |

Additional results:

- MuJoCo HalfCheetah: SPARC scores 10018 vs. RMA's 9034 (+10.9%).
- Power & Mass experiment: SPARC's BIAI ratio of 0.991 surpasses even the Oracle's 0.996.
Key Findings¶
- SPARC surpasses Oracle in the Power & Mass experiment (0.991 vs. 0.996), suggesting that history-based inference is more robust than explicit context encoding.
- Oracle (with privileged information) performs worse under OOD conditions (Grand Valley: 1.135), indicating overfitting to the training vehicle distribution.
- After a physics engine update, SPARC demonstrates minimal performance degradation in zero-shot transfer, confirming that it learns transferable context representations.
- Having the adapter policy collect experience aligns the training and deployment state distributions more closely.
- The large-scale experimental design spanning 3 tracks × 100+ OOD vehicles is exceptionally rare in the context-adaptive RL literature.
Highlights & Insights¶
- Elegant simplicity: eliminates two-phase training, intermediate checkpoint selection, and separate training pipelines—a single loop handles everything.
- Surpassing Oracle with privileged information demonstrates the value of history-based reasoning—in-context adaptation can be more robust than explicit context encoding.
- Robustness after the GT7 physics engine update (zero-shot transfer) indicates that the model learns an abstract representation of vehicle dynamics.
- The insight of using \(\pi_{ad}\) to collect experience—closer to on-policy learning—benefits the quality of the ultimately deployed policy.
- Naturally compatible with asynchronous distributed computing and continual learning, making it deployment-friendly.
Limitations & Future Work¶
- Validation is limited to simulators; sim-to-real transfer on physical robots has not been tested.
- The Gran Turismo codebase is based on a proprietary platform and is not open-source; only the MuJoCo component is reproducible.
- The history length \(H = 50\) is a global hyperparameter and is not adaptively adjusted based on environment complexity.
- Training computational cost is high (approximately 6 days per GT7 run, requiring 20 PlayStations).
- No direct comparison with meta-RL methods (e.g., MAML).
Related Work & Insights¶
- vs. RMA: single-phase vs. two-phase; SPARC achieves superior OOD performance and naturally supports continual learning.
- vs. Domain Randomization (Only Obs): explicit use of contextual priors yields systematic improvements.
- vs. Oracle: SPARC's ability to surpass Oracle suggests that context inference can be more robust than directly accessing context information.
- The single-phase adaptation paradigm is generalizable to robot locomotion, UAV control, industrial robotics, and other domains requiring context-adaptive RL.
Rating¶
⭐⭐⭐⭐⭐ (5/5) The method is elegant and effective, demonstrating impressive OOD generalization in the challenging Gran Turismo 7 environment. The experiments are exceptionally comprehensive: 3 tracks × 100+ OOD vehicles, plus MuJoCo, physics-engine transfer, and detailed ablations.