Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving¶
Conference: NeurIPS 2025 · arXiv: 2511.21584 · Area: Autonomous Driving · Keywords: end-to-end autonomous driving, closed-loop evaluation, counterfactual data, diffusion policy, Q-value guidance
TL;DR¶
This paper proposes MPA, a framework that generates counterfactual trajectory data via 3DGS simulation, trains a diffusion policy adapter and a multi-principle Q-value model, and uses them at inference time to guide a pretrained E2E driving model toward improved safety and generalization in closed-loop scenarios.
Background & Motivation¶
Background: End-to-end (E2E) autonomous driving models perform well in open-loop evaluation but suffer significant performance degradation in closed-loop deployment, exhibiting cascading errors and insufficient generalization.
Limitations of Prior Work: Open-loop training relies on imitation learning (minimizing a behavior-cloning loss), which is fundamentally misaligned with the closed-loop objective of maximizing cumulative reward. Existing remedies either lack closed-loop evaluation or incur high computational cost (e.g., online RL).
Key Challenge: Two fundamental mismatches exist — (1) observation mismatch: distribution shift between sensor inputs at training time and closed-loop observations at deployment; (2) objective mismatch: offline imitation learning lacks meaningful reward feedback, limiting long-horizon reasoning.
Goal: Adapt a pretrained open-loop E2E driving model into a safe and reliable closed-loop driving agent.
Key Insight: Leverage a 3DGS simulation engine to generate counterfactual data that bridges the distribution gap, while jointly training a policy adapter and a value model.
Core Idea: A unified framework combining counterfactual data, a diffusion residual policy, and inference-time Q-value scaling.
Method¶
Overall Architecture¶
MPA consists of three core components: (1) world-model-based counterfactual data generation — using a 3DGS simulator to produce diverse driving trajectory data; (2) a diffusion policy adapter — learning residual trajectory corrections over the pretrained model's output; (3) Q-value-guided inference-time sampling — selecting optimal trajectory candidates based on a multi-principle value model.
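To make the interaction of the three components concrete, here is a minimal Python sketch of the inference-time flow. The names `base_model`, `adapter.sample`, `q_models`, and `q_weights` are hypothetical interfaces standing in for the frozen E2E planner, the diffusion adapter, and the four weighted Q-heads described under Key Designs below; this is a sketch of the described procedure, not the authors' released API.

```python
import numpy as np

def mpa_inference(obs, ego_state, base_model, adapter, q_models, q_weights, n_candidates=20):
    """Sketch of MPA at inference: base trajectory + residual candidates + Q-value selection."""
    a_base = base_model(obs, ego_state)                      # frozen pretrained E2E planner output
    candidates, scores = [], []
    for _ in range(n_candidates):
        delta_a = adapter.sample(obs, ego_state, a_base)     # residual trajectory from the diffusion adapter
        a_adapt = a_base + delta_a                           # adapted trajectory candidate
        # Multi-principle value: weighted sum over route / distance / collision / speed Q-heads
        q_total = sum(w * q(obs, ego_state, a_adapt) for w, q in zip(q_weights, q_models))
        candidates.append(a_adapt)
        scores.append(q_total)
    return candidates[int(np.argmax(scores))]                # highest-Q candidate is executed
```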
Key Designs¶
- Counterfactual Data Generation: A 3DGS simulator (HUGSIM) is used to render photorealistic driving scenes. Diverse behavioral trajectories are generated by randomly augmenting the output of the pretrained E2E policy \(\hat{\pi}_{\text{ref}}\) via rotation (\([-10°, 10°]\)), warping, and Gaussian noise (see the sketch after this list). Beam search retains the highest-reward candidate trajectories; trajectories exceeding a distance threshold or falling below a minimum reward are discarded. The generated data consists of (state, action, observation, reward) tuples.
- Diffusion Policy Adapter: Predicts a residual trajectory \(\Delta a = a^* - a^{\text{base}}\), where \(a^{\text{base}}\) is the output of the frozen pretrained model. A 1D U-Net serves as the denoising network, conditioned on the scene encoding \(z = \phi_{\text{enc}}(o, \boldsymbol{s}_{\text{ego}})\), the ego history, and the base predicted trajectory, supporting multimodal outputs. Training loss: \(\mathcal{L}_{\text{diff}} = \mathbb{E}_{\Delta a^{(0)}, k, \epsilon} \min_i \|f_\theta(\Delta a^{(k)}, k, z, \boldsymbol{s}_{\text{ego}}, a^{\text{base}})[i] - \Delta a^{(0)}\|_2^2\). At inference, DDIM sampling recovers the residual, yielding the adapted trajectory \(a^{\text{adapt}} = a^{\text{base}} + \Delta a^{(0)}\).
- Multi-Principle Q-Value Model: Four independent Q-functions are trained to evaluate long-term returns:
    - \(Q_{\text{route}}\): route following
    - \(Q_{\text{dist}}\): lane distance
    - \(Q_{\text{collision}}\): collision avoidance
    - \(Q_{\text{speed}}\): speed compliance
  The total Q-value is a weighted sum: \(Q = \sum_{i} w_i \times Q_i\). At inference, multiple residual actions are sampled from the policy adapter, and the one with the highest Q-value is selected: \(\Delta\hat{a}^* = \arg\max_{\Delta a} Q(o_t, \boldsymbol{s}_{\text{ego}}, a^{\text{base}} + \Delta a; T)\).
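The sketch below illustrates the trajectory augmentation and reward-based filtering used for counterfactual data generation (referenced in the first item of the list above). Only the \([-10°, 10°]\) rotation range comes from the paper; the warping scheme, noise scale, distance threshold, and minimum reward are placeholder assumptions.

```python
import numpy as np

def augment_trajectory(traj, max_rot_deg=10.0, warp_scale=0.2, noise_std=0.1, rng=None):
    """Perturb a base trajectory (N x 2 waypoints) by rotation, warping, and Gaussian noise."""
    rng = rng or np.random.default_rng()
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))   # rotation within [-10, 10] degrees
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    warped = traj * (1.0 + warp_scale * rng.uniform(-1, 1))      # simple global warp (assumed form)
    return warped @ rot.T + rng.normal(0.0, noise_std, size=traj.shape)

def filter_candidates(candidates, rewards, base_traj, max_dist=2.0, min_reward=0.0):
    """Discard candidates that stray too far from the base trajectory or score below a
    minimum reward, then keep the highest-reward survivors (beam-search style)."""
    kept = []
    for traj, r in zip(candidates, rewards):
        dist = np.linalg.norm(traj - base_traj, axis=-1).max()
        if dist <= max_dist and r >= min_reward:
            kept.append((traj, r))
    return sorted(kept, key=lambda x: x[1], reverse=True)
```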
Loss & Training¶
- The policy adapter is trained with a diffusion loss (predicting the denoised residual action).
- Q-value models are supervised with multi-step cumulative rewards from counterfactual data.
- At inference, 20 candidate actions are sampled, and the Q-value model selects the optimal one.
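A minimal PyTorch sketch of the two training objectives follows. The min-over-modes reconstruction term mirrors the loss given under Key Designs; the forward noising step, the multimodal output shape, and the discount factor are assumptions for illustration.

```python
import torch

def diffusion_residual_loss(f_theta, delta_a0, k, z, s_ego, a_base, noise_schedule):
    """Min-over-modes diffusion loss for the residual adapter.
    delta_a0: clean residual (B, H, 2); k: diffusion step indices (B,);
    f_theta is assumed to predict the clean residual for each of M modes: (B, M, H, 2)."""
    alpha_bar = noise_schedule[k].view(-1, 1, 1)                       # cumulative noise level at step k
    eps = torch.randn_like(delta_a0)
    delta_ak = alpha_bar.sqrt() * delta_a0 + (1 - alpha_bar).sqrt() * eps   # generic DDPM-style noising
    pred = f_theta(delta_ak, k, z, s_ego, a_base)                      # multimodal residual predictions
    err = ((pred - delta_a0.unsqueeze(1)) ** 2).flatten(2).sum(-1)     # squared L2 error per mode
    return err.min(dim=1).values.mean()                                # min over modes, mean over batch

def q_target(rewards, gamma=0.99):
    """Multi-step discounted return used as the supervision target for each Q-head
    (the discount factor here is an assumption)."""
    ret, g = 0.0, 1.0
    for r in rewards:
        ret += g * r
        g *= gamma
    return ret
```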
Key Experimental Results¶
Main Results¶
In-domain closed-loop evaluation (RC: route completion; NC: no-collision; DAC: drivable-area compliance; TTC: time-to-collision; HDScore: overall HUGSIM driving score; higher is better for all):
| Model | RC↑ | NC↑ | DAC↑ | TTC↑ | HDScore↑ |
|---|---|---|---|---|---|
| UniAD | 39.4 | 56.9 | 75.1 | 52.1 | 19.4 |
| VAD | 50.1 | 68.4 | 87.2 | 66.1 | 31.9 |
| LTF | 65.2 | 71.3 | 92.1 | 67.6 | 46.7 |
| Diffusion | 71.8 | 67.4 | 88.1 | 64.5 | 45.1 |
| MPA (UniAD) | 93.6 | 76.4 | 92.8 | 72.8 | 66.4 |
| MPA (VAD) | 94.9 | 75.4 | 93.6 | 72.5 | 67.0 |
Safety-critical scenario evaluation:
| Model | RC↑ | NC↑ | HDScore↑ |
|---|---|---|---|
| UniAD | 11.4 | 76.2 | 4.5 |
| VAD | 25.4 | 77.0 | 16.0 |
| LTF | 35.1 | 80.9 | 24.2 |
| MPA (UniAD) | 95.1 | 76.8 | 70.4 |
| MPA (VAD) | 96.6 | 79.8 | 74.7 |
MPA improves HDScore from 16.0 to 74.7 (over the VAD baseline) in safety-critical scenarios, and route completion rate from 25.4% to 96.6%.
Ablation Study¶
| ID | \(Q_{\text{route}}\) | \(Q_{\text{dist}}\) | \(Q_{\text{collision}}\) | \(Q_{\text{speed}}\) | Adapter | HDScore (Safety) |
|---|---|---|---|---|---|---|
| 1 | ✗ | ✓ | ✓ | ✓ | ✗ | 3.6 |
| 2 | ✓ | ✗ | ✓ | ✓ | ✗ | 39.5 |
| 3 | ✓ | ✓ | ✗ | ✓ | ✗ | 39.2 |
| 4 | ✓ | ✓ | ✓ | ✗ | ✗ | 50.1 |
| 5 | ✓ | ✓ | ✓ | ✓ | ✗ | 55.3 |
| 6 | ✓ | ✓ | ✓ | ✓ | ✓ | 70.4 |
Key Findings¶
- Route guidance is central: Removing \(Q_{\text{route}}\) causes performance to collapse to near zero (HDScore 3.6), demonstrating that route information is fundamental to driving behavior.
- Adapter substantially improves safety: Adding the diffusion adapter raises HDScore from 55.3 to 70.4 in safety-critical scenarios (~+15 points), with route completion improving by approximately 20%.
- Longer counterfactual rollouts are beneficial: More counterfactual rollout steps provide richer supervision signals for the Q-value model, though excessively long rollouts may deviate from the reference data.
- Modal capacity affects performance: A larger number of adapter modes yields consistent performance gains in safety-critical scenarios.
- Strong generalization: MPA achieves HDScore comparable to in-domain evaluation on unseen scenes, validating the framework's generalizability.
Highlights & Insights¶
- Systematic diagnosis of closed-loop degradation: The paper clearly decomposes the problem into observation mismatch and objective mismatch, and designs targeted solutions for each.
- Inference-time scaling strategy: This work is among the first to introduce LLM-style inference-time scaling to E2E driving — multi-candidate sampling combined with value model selection — yielding substantial improvements.
- Framework generality: MPA can be seamlessly applied to different pretrained E2E models (UniAD, VAD, LTF), consistently delivering improvements.
- Dramatic gains in safety-critical scenarios: Improvements in adversarial safety scenarios are particularly striking (HDScore 16→74.7), demonstrating high practical value.
Limitations & Future Work¶
- The approach assumes reliable 3DGS rendering under limited trajectory perturbations; large deviations may cause rendering artifacts.
- Value modeling and policy optimization are currently decoupled; joint optimization is a promising future direction.
- Validation is currently limited to the nuScenes dataset; extension to more diverse driving datasets is anticipated.
- The framework has not yet been applied to multimodal foundation models (e.g., VLMs), and handling more severe distribution shifts remains to be explored.
- Counterfactual data generation depends on high-quality 3DGS reconstruction, imposing requirements on scene reconstruction quality.
Related Work & Insights¶
- E2E Autonomous Driving: Unified perception-prediction-planning frameworks such as UniAD, VAD, and LTF excel in open-loop settings but suffer severe closed-loop degradation.
- Counterfactual Data Generation: Prior work focused primarily on behavioral scenario generation without incorporating visual information; MPA is the first to systematically generate counterfactual data within an E2E simulator.
- Inference-Time Reward Guidance: The inference-time scaling paradigm from the LLM domain (e.g., reward-model-guided sampling) is applied effectively to E2E driving for the first time.
- Inspiration: The paradigm of counterfactual data combined with inference-time Q-value guidance may generalize to other closed-loop control problems involving sim-to-real transfer.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of counterfactual data, diffusion adapter, and Q-value guidance is novel; inference-time scaling in driving is pioneered here.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three evaluation settings (in-domain / unseen / safety-critical), comprehensive ablations, and multiple baseline comparisons.
- Writing Quality: ⭐⭐⭐⭐ Problem analysis is thorough, framework presentation is clear, and mathematical formulations are rigorous.
- Value: ⭐⭐⭐⭐ Significant practical value for closed-loop E2E driving, with large gains in safety-critical scenarios.