CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Conference: CVPR2026
arXiv: 2603.18561
Code: To be released
Area: Autonomous Driving
Keywords: Causal Inference, Backdoor Adjustment, De-confounding, End-to-End Autonomous Driving, Sparse Vectorized Representation, VAD

TL;DR

CausalVAD parameterizes Pearl's backdoor adjustment as a plug-and-play scheme, the Sparse Causal Intervention Scheme (SCIS). SCIS applies causal intervention at multiple levels of VAD's perception–prediction–planning pipeline to remove spurious correlations, with the goal of safer, more robust end-to-end autonomous driving.

Background & Motivation

End-to-end models learn correlations, not causality: Current planning-oriented end-to-end driving models (UniAD, VAD, etc.) fundamentally fit \(P(Y|S)\) via standard supervised learning, capturing statistical correlations rather than genuine causal relationships, making them susceptible to dataset biases and shortcut learning.

Causal confounding introduces safety hazards: Models may exploit the ego vehicle's historical states (velocity, acceleration) as shortcuts for predicting future decisions (spurious autocorrelation), yielding strong open-loop metrics while catastrophically failing in closed-loop deployment once the trajectory deviates from expert demonstrations.

VLM-based approaches suffer from hallucination and pseudo-faithfulness: Natural language explanations generated by large vision-language models may be entirely decoupled from actual decision-making (pseudo-faithfulness), introducing new risks in safety-critical domains.

Severe class imbalance in nuScenes: Approximately 75% of scenarios are straight-driving, causing models to learn the spurious association that "going straight is the default behavior," with substantial performance degradation on minority scenarios such as turns.

Confounding is a systemic cascading problem: Structural causal model (SCM) analysis reveals three distinct confounding sources within VAD: co-occurrence bias in perception, shared BEV factors in prediction, and input correlations in planning—each requiring targeted intervention at a different information node.

Limitations of existing de-confounding methods: Heuristic approaches (state dropout, data augmentation) lack theoretical guarantees; causal discovery and counterfactual methods are mostly applied to offline analysis or simplified settings and cannot be efficiently integrated into online training of large-scale end-to-end models.

Method

Overall Architecture

CausalVAD introduces a Sparse Causal Intervention Scheme (SCIS) on top of the VAD architecture. The core idea is as follows:

The modular VAD pipeline is first formalized via an SCM to identify three types of backdoor paths.
Backdoor adjustment \(P(Y|\text{do}(S=s)) = \sum_z P(Y|S=s, Z=z) P(Z=z)\) is then applied to sever spurious paths.
Learnable prototype dictionaries approximate the latent confounders \(Z\), parameterizing the do-operator within the neural network.
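
The backdoor formula above can be checked on a small discrete example. The sketch below is only an illustration with hypothetical binary variables and made-up probabilities (none of the numbers come from the paper). It contrasts the interventional quantity \(P(Y|\text{do}(S=s))\), which weights each confounder stratum by \(P(Z=z)\), with the observational \(P(Y|S=s)\), which implicitly weights by \(P(Z=z|S=s)\).

```python
def backdoor_adjust(p_z, p_y_given_sz, s):
    """P(Y=1 | do(S=s)) = sum_z P(Y=1 | S=s, Z=z) * P(Z=z)."""
    return sum(p_y_given_sz[(s, z)] * pz for z, pz in p_z.items())


def observational(p_z, p_s_given_z, p_y_given_sz, s):
    """P(Y=1 | S=s): strata are weighted by P(Z=z | S=s) rather than P(Z=z)."""
    joint = {z: p_s_given_z[z][s] * pz for z, pz in p_z.items()}  # P(S=s, Z=z)
    norm = sum(joint.values())                                    # P(S=s)
    return sum(p_y_given_sz[(s, z)] * joint[z] / norm for z in p_z)


# Hypothetical toy numbers; all variables are binary.
p_z = {0: 0.75, 1: 0.25}                                   # P(Z=z)
p_s_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(S=s | Z=z)
p_y_given_sz = {(0, 0): 0.1, (0, 1): 0.5,
                (1, 0): 0.3, (1, 1): 0.9}                  # P(Y=1 | S=s, Z=z)

p_do = backdoor_adjust(p_z, p_y_given_sz, s=1)
p_obs = observational(p_z, p_s_given_z, p_y_given_sz, s=1)
print(f"P(Y=1 | do(S=1)) = {p_do:.3f}")
print(f"P(Y=1 | S=1)     = {p_obs:.3f}")
```

With these toy numbers the observational estimate (about 0.736) is well above the interventional one (0.450), because observing S=1 makes the Z=1 stratum more likely. That gap is the spurious component the adjustment is meant to remove.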

The prototype dictionaries are built by an offline two-step process, executed only once:

Feature extraction: A pretrained VAD is used for a single forward pass over the entire training set to collect sparse embeddings from Object, Map, and Agent queries.
Prototype clustering: K-means++ is applied separately to each of the three embedding types; cluster centroids serve as prototypes, forming dictionaries \(\{\mathcal{Z}\} = \{\{\mathcal{Z}_o\}, \{\mathcal{Z}_m\}, \{\mathcal{Z}_a\}\}\) with sizes \((k_o, k_m, k_a) = (10, 3, 6)\).
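
A minimal sketch of the prototype-clustering step, under stated assumptions: it uses a plain standard-library k-means with k-means++ seeding, random 2-D toy vectors in place of real query embeddings, and the hypothetical helper names `kmeans` and `build_dictionaries`. Only the dictionary sizes (10, 3, 6) are taken from the notes above. The paper's actual feature dimension and clustering implementation are not given here.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """k-means++ seeding: pick new centers with probability proportional to D^2."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def kmeans(points, k, iters=20, seed=0):
    """Lloyd iterations; returns the k centroids used as prototypes."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def build_dictionaries(embeddings, sizes):
    """One prototype dictionary per query type (object / map / agent)."""
    return {name: kmeans(embeddings[name], k) for name, k in sizes.items()}


# Toy stand-in for embeddings collected from a pretrained model (assumption).
_rng = random.Random(1)
embeddings = {name: [[_rng.gauss(0, 1), _rng.gauss(0, 1)] for _ in range(200)]
              for name in ("object", "map", "agent")}
sizes = {"object": 10, "map": 3, "agent": 6}   # (k_o, k_m, k_a) from the notes
dictionaries = build_dictionaries(embeddings, sizes)
print({name: len(protos) for name, protos in dictionaries.items()})
```

Because the dictionaries are computed once and then consulted during training, the cost of this step is amortized. The toy example runs in well under a second.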

Key Intervention Modules

Perception De-confounding Module (PDM):

Targets co-occurrence bias along the classification paths \(\mathcal{O} \to \mathcal{Y}_o\) and \(\mathcal{M} \to \mathcal{Y}_m\).
Dual-branch structure: direct classification scores vs. bias scores derived from the confounder dictionary, producing de-confounded logits.
Applied symmetrically to both object classification and map element classification.
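
The dual-branch idea can be illustrated with a small sketch. The notes above do not say how the two branches are fused, so everything below is an assumption: linear scoring heads, a uniform prior over prototypes, and fusion by summing the direct logits with the prior-averaged bias logits. This moves the expectation over \(Z\) inside the softmax, a common approximation when backdoor adjustment is parameterized in neural networks. The function `exact_backdoor` computes the un-approximated sum over prototypes for comparison. All names and weights are hypothetical.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    total = sum(e)
    return [x / total for x in e]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def pdm_logits(query, prototypes, prior, W_direct, W_bias):
    """Direct branch plus prior-weighted bias branch (fusion by sum is an assumption)."""
    direct = matvec(W_direct, query)
    per_proto = [matvec(W_bias, z) for z in prototypes]
    bias = [sum(p * b[c] for p, b in zip(prior, per_proto))
            for c in range(len(direct))]
    return [d + b for d, b in zip(direct, bias)]


def exact_backdoor(query, prototypes, prior, W_direct, W_bias):
    """sum_z softmax(f(q, z)) * P(z), i.e. the adjustment without the approximation."""
    direct = matvec(W_direct, query)
    out = [0.0] * len(direct)
    for p, z in zip(prior, prototypes):
        probs = softmax([d + b for d, b in zip(direct, matvec(W_bias, z))])
        out = [o + p * q for o, q in zip(out, probs)]
    return out


# Hypothetical toy values: 2-D features, 3 classes, 3 prototypes.
query = [1.0, -0.5]
prototypes = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]]
prior = [1 / 3] * 3                                   # assumed uniform P(Z=z)
W_direct = [[0.8, -0.1], [0.1, 0.6], [-0.4, 0.3]]
W_bias = [[0.5, 0.2], [-0.2, 0.4], [0.1, -0.3]]

approx = softmax(pdm_logits(query, prototypes, prior, W_direct, W_bias))
exact = exact_backdoor(query, prototypes, prior, W_direct, W_bias)
print([round(p, 3) for p in approx])
print([round(p, 3) for p in exact])
```

Both outputs are valid class distributions. Comparing them on real features would show how much the inside-the-softmax approximation costs in a given setting.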

Interaction De-confounding Module (IDM):

A unified architecture instantiated multiple times to handle confounding at different pipeline stages.
Cross-attention is used to estimate the spurious component predictable from context within each query; a gating unit scales this component before subtracting it from the original query.
Prediction stage: \(\mathcal{O}' = \text{IDM}(\mathcal{O}, \{\mathcal{Z}_m\})\), \(\mathcal{M}' = \text{IDM}(\mathcal{M}, \{\mathcal{Z}_o\})\), severing spurious correlations induced by the shared BEV factor.
Planning stage: \(\mathcal{A}' = \text{IDM}(\mathcal{A}, \{\mathcal{Z}_m\})\), \(\mathcal{M}'' = \text{IDM}(\mathcal{M}, \{\mathcal{Z}_a\})\), decoupling highly correlated inputs.
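
The IDM description (attend to a context dictionary, gate the result, subtract it from the query) can be sketched as follows. This is a simplified reading, not the paper's implementation: it uses single-head dot-product attention without learned projections, a scalar sigmoid gate, and hypothetical gate parameters `gate_w` and `gate_b`.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    total = sum(e)
    return [x / total for x in e]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def idm(query, dictionary, gate_w, gate_b):
    """Return query minus a gated, attention-pooled context component."""
    d = len(query)
    # Cross-attention: the query attends over the prototypes of another modality.
    attn = softmax([dot(query, z) / math.sqrt(d) for z in dictionary])
    spurious = [sum(a * z[i] for a, z in zip(attn, dictionary)) for i in range(d)]
    # Scalar gate computed from the concatenation [query; spurious] (assumed form).
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_w, query + spurious) + gate_b)))
    return [q - gate * s for q, s in zip(query, spurious)]


# Toy values mirroring O' = IDM(O, {Z_m}): an object query and a 3-prototype map dictionary.
object_query = [0.9, -0.2, 0.4]
map_dictionary = [[0.5, 0.1, 0.0], [-0.3, 0.6, 0.2], [0.1, -0.4, 0.7]]
gate_w = [0.2, -0.1, 0.3, 0.1, 0.0, -0.2]   # length 2 * d
refined = idm(object_query, map_dictionary, gate_w, gate_b=0.0)
unchanged = idm(object_query, map_dictionary, gate_w, gate_b=-50.0)  # gate near 0
print([round(x, 3) for x in refined])
```

With a strongly negative gate bias the module reduces to the identity, which is one reason a gated subtraction is a comparatively safe way to add an intervention to an existing pipeline.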

Loss & Training

PDM and IDM are inserted and the model is trained end-to-end from scratch (not fine-tuned), ensuring causal de-confounding is learned from the outset.
The loss function is identical to the original VAD; no additional loss terms are required.
Optimization: AdamW with initial learning rate \(2 \times 10^{-4}\), weight decay 0.01, and a cosine annealing schedule, for 60 epochs on 8× RTX 3090 GPUs.
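
For reference, the learning-rate curve implied by these settings can be written in closed form. The sketch below uses the standard cosine annealing formula with the stated initial rate and epoch count. The minimum learning rate of 0, per-epoch stepping, and the absence of warmup are assumptions, since the notes do not specify them.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=60, base_lr=2e-4, min_lr=0.0):
    """lr(t) = min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * t / T))."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))


schedule = [cosine_annealing_lr(e) for e in range(61)]
for e in (0, 15, 30, 45, 60):
    print(f"epoch {e:2d}: lr = {schedule[e]:.2e}")
```

Under these assumptions the rate starts at 2e-4, passes 1e-4 at the midpoint (epoch 30), and decays to zero at epoch 60.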

Key Experimental Results

Main Results

nuScenes Open-Loop Planning (Table 1):

Method	L2 Avg (m) ↓	CR Avg (%) ↓	FPS
UniAD	0.73	0.61	1.8
VAD-tiny	0.74	0.44	5.6
VAD	0.62	0.38	3.1
BridgeAD	0.58	0.08	3.9
SparseDrive	0.61	0.10	6.1
CausalVAD	0.54	0.11	5.4

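Compared to the VAD-tiny baseline, L2 error decreases by 27% (0.74 → 0.54 m) and collision rate by 75% (0.44% → 0.11%), while throughput drops only slightly (5.6 → 5.4 FPS).
CausalVAD achieves the lowest average L2 error among all compared methods; its collision rate (0.11%) is close to but slightly above those of BridgeAD (0.08%) and SparseDrive (0.10%).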
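
The relative improvements quoted above follow directly from the Table 1 rows for VAD-tiny and CausalVAD. A short check, using only numbers from the table (the helper name `relative_reduction` is just for illustration):

```python
def relative_reduction(baseline, improved):
    """Percentage decrease of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline


l2_gain = relative_reduction(0.74, 0.54)    # L2 Avg (m), VAD-tiny -> CausalVAD
cr_gain = relative_reduction(0.44, 0.11)    # CR Avg (%)
fps_cost = relative_reduction(5.6, 5.4)     # FPS

print(f"L2 reduction:  {l2_gain:.1f}%")     # 27.0%
print(f"CR reduction:  {cr_gain:.1f}%")     # 75.0%
print(f"FPS reduction: {fps_cost:.1f}%")    # 3.6%
```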

NAVSIM & Bench2Drive (Table 4):

Method	NAVSIM PDMS ↑	B2D DS ↑	B2D SR (%) ↑
VAD-tiny	80.5	42.73	14.18
UniAD	83.4	45.81	16.36
CausalVAD	87.2	49.83	19.42

Causal Robustness Analysis

Robustness to scenario distribution shift (Table 2): VAD-tiny degrades severely in turning scenarios, with L2 rising from 0.75 m (straight driving) to 1.07 m (turning). CausalVAD reaches 0.69 m in turning scenarios, which is lower than VAD-tiny's error even in straight-driving conditions.

Ego-state shortcut dependency (Table 3): When the ego vehicle's velocity input is zeroed out, VAD-tiny's L2 surges from 0.74 to 6.94 m and its collision rate from 0.44% to 4.02%. CausalVAD's L2 rises from 0.54 to 4.80 m and its collision rate from 0.11% to 1.20%, so it remains more accurate and safer in absolute terms under this perturbation, although both models degrade sharply.
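
To make the velocity-perturbation comparison concrete, the snippet below computes the absolute increase and the multiplicative factor for each model and metric, using only the Table 3 figures quoted above. The dictionary layout and the helper name `degradation` are illustrative choices.

```python
# (before, after) pairs when ego velocity is zeroed out, as quoted from Table 3.
table3 = {
    "VAD-tiny":  {"L2 (m)": (0.74, 6.94), "CR (%)": (0.44, 4.02)},
    "CausalVAD": {"L2 (m)": (0.54, 4.80), "CR (%)": (0.11, 1.20)},
}


def degradation(before, after):
    """Return (absolute increase, multiplicative factor)."""
    return after - before, after / before


report = {model: {metric: degradation(*pair) for metric, pair in metrics.items()}
          for model, metrics in table3.items()}

for model, metrics in report.items():
    for metric, (delta, factor) in metrics.items():
        print(f"{model:10s} {metric}: +{delta:.2f} (x{factor:.1f})")
```

On these numbers the absolute increases are smaller for CausalVAD on both metrics, while the multiplicative factors are of similar size for the two models. The robustness claim is therefore best read in absolute terms.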

Ablation Study

Module contributions (Table 5):

Config	PDM	IDM	L2 Avg (m) ↓	CR Avg (%) ↓
Baseline	×	×	0.74	0.44
+PDM	✓	×	0.63	0.26
+IDM	×	✓	0.57	0.19
Full	✓	✓	0.54	0.11

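PDM's gain is concentrated in collision rate (0.44% → 0.26%, versus L2 0.74 → 0.63 m); IDM gives the larger L2 improvement (0.74 → 0.57 m) and also lowers collision rate (to 0.19%). The two are complementary: the full model reaches 0.54 m and 0.11%.
Dictionary size \((10, 3, 6)\) performs best among the configurations tested: smaller dictionaries fail to capture diverse contexts, while larger ones introduce redundancy.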
The choice of clustering algorithm (K-means / K-medoids / K-means++) has negligible impact on performance, indicating methodological robustness.

Key Findings

t-SNE visualization shows that CausalVAD separates different navigation intents (straight / left turn / right turn) into distinct clusters.
In qualitative analysis, VAD-tiny over-attends to the ego vehicle's historical trajectory when faced with a cut-in scenario, resulting in a collision; CausalVAD correctly focuses on the intruding vehicle and safely decelerates.
A VLM-based model (Senna) produces safe actions but hallucinated explanations (attributing deceleration to a non-existent height restriction), which by contrast highlights the faithfulness of CausalVAD's internal reasoning.

Highlights & Insights

Theoretically grounded: Pearl's backdoor adjustment theory is rigorously formalized and introduced into end-to-end driving, rather than relying on heuristics.
Plug-and-play: PDM and IDM are lightweight (FPS drops only from 5.6 to 5.4) and are designed as drop-in modules for other architectures, although they have so far been validated only on VAD.
Comprehensive multi-dimensional robustness validation: Causal intervention effectiveness is systematically demonstrated from three perspectives—scenario distribution shift, ego-state perturbation, and cross-dataset generalization.
Reveals the intrinsic synergy between sparse vectorized representations and causal intervention: VAD's sparse queries are naturally suited as objects of causal intervention.

Limitations & Future Work

Validation is limited to VAD's sequential architecture; extension to parallel or iteratively interactive architectures (e.g., SparseDrive's parallel decoding) has not yet been explored.
The confounder dictionary is constructed via offline clustering and cannot capture novel driving contexts outside the training set.
Closed-loop performance (Bench2Drive) remains substantially below methods specifically optimized for that setting (e.g., DriveMoE DS = 74.22).
Prototype counts \((k_o, k_m, k_a)\) require grid search; an adaptive selection mechanism is lacking.

Related Work

End-to-end driving architectures: UniAD (rasterized BEV), VAD / SparseDrive (sparse vectorized), BridgeAD. The proposed method is orthogonal to architectural exploration.
Causal confounding mitigation: State dropout [6] and data augmentation [21] are heuristic; counterfactual reasoning [30] and causal discovery [26] are mostly limited to offline analysis. This work fills the gap of online backdoor adjustment.
VLM-based driving models: Senna, OmniDrive, and ORION suffer from hallucination and pseudo-faithfulness; this work addresses causal internal consistency from first principles.

Rating

Novelty: ⭐⭐⭐⭐ — First to systematically parameterize backdoor adjustment as plug-and-play modules for end-to-end autonomous driving.
Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets + multi-dimensional robustness analysis + comprehensive ablations.
Writing Quality: ⭐⭐⭐⭐ — The causal analysis chain is logically coherent with well-crafted figures.
Value: ⭐⭐⭐⭐ — Provides a practically deployable paradigm for causal inference in autonomous driving; the plug-in design offers strong utility.