Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving¶
Conference: NeurIPS 2025 · arXiv: 2511.21584 · Area: Autonomous Driving · Keywords: end-to-end autonomous driving, closed-loop evaluation, counterfactual data, diffusion policy, Q-value guidance
TL;DR¶
This paper proposes MPA, a framework that generates counterfactual trajectory data via 3DGS simulation, trains a diffusion policy adapter and a multi-principle Q-value model, and uses them at inference time to guide a pretrained E2E driving model toward improved safety and generalization in closed-loop scenarios.
Background & Motivation¶
Background: End-to-end (E2E) autonomous driving models perform well in open-loop evaluation but suffer significant performance degradation in closed-loop deployment, exhibiting cascading errors and insufficient generalization.
Limitations of Prior Work: Open-loop training relies on imitation learning (minimizing a behavior-cloning loss), which is fundamentally misaligned with the closed-loop objective of maximizing cumulative reward. Existing remedies either lack closed-loop evaluation or incur high computational cost (e.g., online RL).
Key Challenge: Two fundamental mismatches exist — (1) observation mismatch: distribution shift between sensor inputs at training time and closed-loop observations at deployment; (2) objective mismatch: offline imitation learning lacks meaningful reward feedback, limiting long-horizon reasoning.
Goal: Adapt a pretrained open-loop E2E driving model into a safe and reliable closed-loop driving agent.
Key Insight: Leverage a 3DGS simulation engine to generate counterfactual data that bridges the distribution gap, while jointly training a policy adapter and a value model.
Core Idea: A unified framework combining counterfactual data, a diffusion residual policy, and inference-time Q-value scaling.
Method¶
Overall Architecture¶
MPA consists of three core components: (1) world-model-based counterfactual data generation — using a 3DGS simulator to produce diverse driving trajectory data; (2) a diffusion policy adapter — learning residual trajectory corrections over the pretrained model's output; (3) Q-value-guided inference-time sampling — selecting optimal trajectory candidates based on a multi-principle value model.
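To make the interaction of the three components concrete, here is a minimal Python sketch of the inference-time flow. The names `base_model`, `adapter.sample`, `q_models`, and `q_weights` are hypothetical interfaces standing in for the frozen E2E planner, the diffusion adapter, and the four weighted Q-heads described under Key Designs below; this is a sketch of the described procedure, not the authors' released API.

```python
import numpy as np

def mpa_inference(obs, ego_state, base_model, adapter, q_models, q_weights, n_candidates=20):
    """Sketch of MPA at inference: base trajectory + residual candidates + Q-value selection."""
    a_base = base_model(obs, ego_state)                      # frozen pretrained E2E planner output
    candidates, scores = [], []
    for _ in range(n_candidates):
        delta_a = adapter.sample(obs, ego_state, a_base)     # residual trajectory from the diffusion adapter
        a_adapt = a_base + delta_a                           # adapted trajectory candidate
        # Multi-principle value: weighted sum over route / distance / collision / speed Q-heads
        q_total = sum(w * q(obs, ego_state, a_adapt) for w, q in zip(q_weights, q_models))
        candidates.append(a_adapt)
        scores.append(q_total)
    return candidates[int(np.argmax(scores))]                # highest-Q candidate is executed
```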
Key Designs¶
- Counterfactual Data Generation: A 3DGS simulator (HUGSIM) is used to render photorealistic driving scenes. Diverse behavioral trajectories are generated by randomly augmenting the output of the pretrained E2E policy \(\hat{\pi}_{\text{ref}}\) via rotation (\([-10°, 10°]\)), warping, and Gaussian noise (see the sketch after this list). Beam search retains the highest-reward candidate trajectories; trajectories exceeding a distance threshold or falling below a minimum reward are discarded. The generated data consists of (state, action, observation, reward) tuples.
- Diffusion Policy Adapter: Predicts a residual trajectory \(\Delta a = a^* - a^{\text{base}}\), where \(a^{\text{base}}\) is the output of the frozen pretrained model. A 1D U-Net serves as the denoising network, conditioned on the scene encoding \(z = \phi_{\text{enc}}(o, \boldsymbol{s}_{\text{ego}})\), the ego history, and the base predicted trajectory, supporting multimodal outputs. Training loss: \(\mathcal{L}_{\text{diff}} = \mathbb{E}_{\Delta a^{(0)}, k, \epsilon} \min_i \|f_\theta(\Delta a^{(k)}, k, z, \boldsymbol{s}_{\text{ego}}, a^{\text{base}})[i] - \Delta a^{(0)}\|_2^2\). At inference, DDIM sampling recovers the residual, yielding the adapted trajectory \(a^{\text{adapt}} = a^{\text{base}} + \Delta a^{(0)}\).
- Multi-Principle Q-Value Model: Four independent Q-functions are trained to evaluate long-term returns:
    - \(Q_{\text{route}}\): route following
    - \(Q_{\text{dist}}\): lane distance
    - \(Q_{\text{collision}}\): collision avoidance
    - \(Q_{\text{speed}}\): speed compliance
  The total Q-value is a weighted sum: \(Q = \sum_{i} w_i \times Q_i\). At inference, multiple residual actions are sampled from the policy adapter, and the one with the highest Q-value is selected: \(\Delta\hat{a}^* = \arg\max_{\Delta a} Q(o_t, \boldsymbol{s}_{\text{ego}}, a^{\text{base}} + \Delta a; T)\).
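The sketch below illustrates the trajectory augmentation and reward-based filtering used for counterfactual data generation (referenced in the first item of the list above). Only the \([-10°, 10°]\) rotation range comes from the paper; the warping scheme, noise scale, distance threshold, and minimum reward are placeholder assumptions.

```python
import numpy as np

def augment_trajectory(traj, max_rot_deg=10.0, warp_scale=0.2, noise_std=0.1, rng=None):
    """Perturb a base trajectory (N x 2 waypoints) by rotation, warping, and Gaussian noise."""
    rng = rng or np.random.default_rng()
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))   # rotation within [-10, 10] degrees
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    warped = traj * (1.0 + warp_scale * rng.uniform(-1, 1))      # simple global warp (assumed form)
    return warped @ rot.T + rng.normal(0.0, noise_std, size=traj.shape)

def filter_candidates(candidates, rewards, base_traj, max_dist=2.0, min_reward=0.0):
    """Discard candidates that stray too far from the base trajectory or score below a
    minimum reward, then keep the highest-reward survivors (beam-search style)."""
    kept = []
    for traj, r in zip(candidates, rewards):
        dist = np.linalg.norm(traj - base_traj, axis=-1).max()
        if dist <= max_dist and r >= min_reward:
            kept.append((traj, r))
    return sorted(kept, key=lambda x: x[1], reverse=True)
```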
Loss & Training¶
- The policy adapter is trained with a diffusion loss (predicting the denoised residual action).
- Q-value models are supervised with multi-step cumulative rewards from counterfactual data.
- At inference, 20 candidate actions are sampled, and the Q-value model selects the optimal one.
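A minimal PyTorch sketch of the two training objectives follows. The min-over-modes reconstruction term mirrors the loss given under Key Designs; the forward noising step, the multimodal output shape, and the discount factor are assumptions for illustration.

```python
import torch

def diffusion_residual_loss(f_theta, delta_a0, k, z, s_ego, a_base, noise_schedule):
    """Min-over-modes diffusion loss for the residual adapter.
    delta_a0: clean residual (B, H, 2); k: diffusion step indices (B,);
    f_theta is assumed to predict the clean residual for each of M modes: (B, M, H, 2)."""
    alpha_bar = noise_schedule[k].view(-1, 1, 1)                       # cumulative noise level at step k
    eps = torch.randn_like(delta_a0)
    delta_ak = alpha_bar.sqrt() * delta_a0 + (1 - alpha_bar).sqrt() * eps   # generic DDPM-style noising
    pred = f_theta(delta_ak, k, z, s_ego, a_base)                      # multimodal residual predictions
    err = ((pred - delta_a0.unsqueeze(1)) ** 2).flatten(2).sum(-1)     # squared L2 error per mode
    return err.min(dim=1).values.mean()                                # min over modes, mean over batch

def q_target(rewards, gamma=0.99):
    """Multi-step discounted return used as the supervision target for each Q-head
    (the discount factor here is an assumption)."""
    ret, g = 0.0, 1.0
    for r in rewards:
        ret += g * r
        g *= gamma
    return ret
```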
Key Experimental Results¶
Main Results¶
In-domain closed-loop evaluation (RC: route completion; NC: no-collision; DAC: drivable-area compliance; TTC: time-to-collision; HDScore: overall HUGSIM driving score; higher is better for all):
| Model | RC↑ | NC↑ | DAC↑ | TTC↑ | HDScore↑ |
|---|---|---|---|---|---|
| UniAD | 39.4 | 56.9 | 75.1 | 52.1 | 19.4 |
| VAD | 50.1 | 68.4 | 87.2 | 66.1 | 31.9 |
| LTF | 65.2 | 71.3 | 92.1 | 67.6 | 46.7 |
| Diffusion | 71.8 | 67.4 | 88.1 | 64.5 | 45.1 |
| MPA (UniAD) | 93.6 | 76.4 | 92.8 | 72.8 | 66.4 |
| MPA (VAD) | 94.9 | 75.4 | 93.6 | 72.5 | 67.0 |
Safety-critical scenario evaluation:
| Model | RC↑ | NC↑ | HDScore↑ |
|---|---|---|---|
| UniAD | 11.4 | 76.2 | 4.5 |
| VAD | 25.4 | 77.0 | 16.0 |
| LTF | 35.1 | 80.9 | 24.2 |
| MPA (UniAD) | 95.1 | 76.8 | 70.4 |
| MPA (VAD) | 96.6 | 79.8 | 74.7 |
MPA improves HDScore from 16.0 to 74.7 (over the VAD baseline) in safety-critical scenarios, and route completion rate from 25.4% to 96.6%.
Ablation Study¶
| ID | \(Q_{\text{route}}\) | \(Q_{\text{dist}}\) | \(Q_{\text{collision}}\) | \(Q_{\text{speed}}\) | Adapter | HDScore (Safety) |
|---|---|---|---|---|---|---|
| 1 | ✗ | ✓ | ✓ | ✓ | ✗ | 3.6 |
| 2 | ✓ | ✗ | ✓ | ✓ | ✗ | 39.5 |
| 3 | ✓ | ✓ | ✗ | ✓ | ✗ | 39.2 |
| 4 | ✓ | ✓ | ✓ | ✗ | ✗ | 50.1 |
| 5 | ✓ | ✓ | ✓ | ✓ | ✗ | 55.3 |
| 6 | ✓ | ✓ | ✓ | ✓ | ✓ | 70.4 |
Key Findings¶
- Route guidance is central: Removing \(Q_{\text{route}}\) causes performance to collapse to near zero (HDScore 3.6), demonstrating that route information is fundamental to driving behavior.
- Adapter substantially improves safety: Adding the diffusion adapter raises HDScore from 55.3 to 70.4 in safety-critical scenarios (~+15 points), with route completion improving by approximately 20%.
- Longer counterfactual rollouts are beneficial: More counterfactual rollout steps provide richer supervision signals for the Q-value model, though excessively long rollouts may deviate from the reference data.
- Modal capacity affects performance: A larger number of adapter modes yields consistent performance gains in safety-critical scenarios.
- Strong generalization: MPA achieves HDScore comparable to in-domain evaluation on unseen scenes, validating the framework's generalizability.
Highlights & Insights¶
- Systematic diagnosis of closed-loop degradation: The paper clearly decomposes the problem into observation mismatch and objective mismatch, and designs targeted solutions for each.
- Inference-time scaling strategy: This work is among the first to introduce LLM-style inference-time scaling to E2E driving — multi-candidate sampling combined with value model selection — yielding substantial improvements.
- Framework generality: MPA can be seamlessly applied to different pretrained E2E models (UniAD, VAD, LTF), consistently delivering improvements.
- Dramatic gains in safety-critical scenarios: Improvements in adversarial safety scenarios are particularly striking (HDScore 16→74.7), demonstrating high practical value.
Limitations & Future Work¶
- The approach assumes reliable 3DGS rendering under limited trajectory perturbations; large deviations may cause rendering artifacts.
- Value modeling and policy optimization are currently decoupled; joint optimization is a promising future direction.
- Validation is currently limited to the nuScenes dataset; extension to more diverse driving datasets is anticipated.
- The framework has not yet been applied to multimodal foundation models (e.g., VLMs), and handling more severe distribution shifts remains to be explored.
- Counterfactual data generation depends on high-quality 3DGS reconstruction, imposing requirements on scene reconstruction quality.
Related Work & Insights¶
- E2E Autonomous Driving: Unified perception-prediction-planning frameworks such as UniAD, VAD, and LTF excel in open-loop settings but suffer severe closed-loop degradation.
- Counterfactual Data Generation: Prior work focused primarily on behavioral scenario generation without incorporating visual information; MPA is the first to systematically generate counterfactual data within an E2E simulator.
- Inference-Time Reward Guidance: The inference-time scaling paradigm from the LLM domain (e.g., reward-model-guided sampling) is applied effectively to E2E driving for the first time.
- Inspiration: The paradigm of counterfactual data combined with inference-time Q-value guidance may generalize to other closed-loop control problems involving sim-to-real transfer.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of counterfactual data, diffusion adapter, and Q-value guidance is novel; inference-time scaling in driving is pioneered here.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three evaluation settings (in-domain / unseen / safety-critical), comprehensive ablations, and multiple baseline comparisons.
- Writing Quality: ⭐⭐⭐⭐ Problem analysis is thorough, framework presentation is clear, and mathematical formulations are rigorous.
- Value: ⭐⭐⭐⭐ Significant practical value for closed-loop E2E driving, with large gains in safety-critical scenarios.